Friday, April 26, 2013

Thoughts on Architecture: Fault Tolerance, Reliability, etc

Let's turn our thoughts to infrastructure (or Architecture).  Your thousand dollar home monitoring system considers "architecture", doesn't it?  You know, security, availability, reliability, those boring things, right?

Home Alone uses SSL to connect from the home base to the "cloud" server.  We don't want people peeking at your data (or capturing plain text passwords). I don't know if the other monitors do secure connections, but *everything* (in the internet-of-things) should.
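
For the curious, here is roughly what that looks like from the home base side. This is just a sketch in Python; the hostname and port are placeholders, not the real endpoint, and the real base station code may be structured very differently.

    import socket
    import ssl

    # Verify the server's certificate before sending anything.
    context = ssl.create_default_context()

    # "monitor.example.com" and 8443 stand in for the actual cloud server.
    with socket.create_connection(("monitor.example.com", 8443)) as raw:
        with context.wrap_socket(raw, server_hostname="monitor.example.com") as tls:
            tls.sendall(b'{"sensor": "front-door", "state": "open"}\n')

The point is simply that nothing leaves the house in plain text.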

The home base software is designed with failure in mind. There will be internet outages. They happen. There will be moments (or minutes, or hours) when you could have a service interruption. The home base software logs sensor data to "persistent storage" (flash or hard disk) until a good connection is made to the server. If that connection goes away, the data is retained in persistent storage until the connection comes back.

I mentioned "fault tolerance" before. While there is no real recovery from total hardware failure (if the base station hardware dies, it dies -- there is no redundancy there unless you want to supply a backup base station), I do everything I can to make sure that data isn't lost.  Above, I mentioned logging sensor data to "persistent storage" when there isn't internet connectivity. Well, the system actually logs data regardless of connectivity. As soon as sensor data is collected (and filtered), it is written to non-volatile storage. The data is deleted only when the server has confirmed successful receipt.
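
A rough sketch of that write-first, delete-on-confirmation loop (Python, with made-up names like LOG_DIR and send_to_server -- illustrative only, not the actual implementation):

    import json
    import os
    import time
    import uuid

    LOG_DIR = "/var/lib/homealone/outbox"   # placeholder path

    def persist(reading):
        """Write a sensor reading to non-volatile storage before any upload attempt."""
        os.makedirs(LOG_DIR, exist_ok=True)
        path = os.path.join(LOG_DIR, "%.6f-%s.json" % (time.time(), uuid.uuid4().hex))
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(reading, f)
            f.flush()
            os.fsync(f.fileno())        # make sure it survives a sudden power loss
        os.rename(tmp, path)            # atomic: the file appears only when complete
        return path

    def flush_outbox(send_to_server):
        """Try to upload everything on disk; delete a file only after the server confirms."""
        for name in sorted(os.listdir(LOG_DIR)):
            if not name.endswith(".json"):
                continue                # skip half-written temp files
            path = os.path.join(LOG_DIR, name)
            with open(path) as f:
                reading = json.load(f)
            if send_to_server(reading): # must return True only on confirmed receipt
                os.remove(path)
            else:
                break                   # connection is down; keep the rest for later

Every reading hits disk first; the upload loop is free to fail without losing anything.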

Bad connection? No data is lost.  Home base loses power suddenly? No logged data is lost (which is probably most, if not all, of the sensor data collected up to the moment of power loss).

On the internet side, the data is also persisted. It is never thrown away. It is archived. Pick a day; it can be "played back".
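
As a sketch of what "played back" could mean, assuming events were archived one JSON object per line with a timestamp (an assumption about the storage format, not a description of the actual server):

    import json

    def play_back(archive_path, day):
        """Yield archived events for a given day (e.g. "2013-04-26"), in order."""
        with open(archive_path) as f:
            for line in f:
                event = json.loads(line)
                if event["timestamp"].startswith(day):
                    yield event

    # Example: replay a day's events to the console.
    # for event in play_back("/var/lib/homealone/archive.jsonl", "2013-04-26"):
    #     print(event["timestamp"], event["sensor"], event["state"])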

There are plans for server "fault tolerance" too. I plan to have mirrors (US east and US west). Data will be replicated between the servers. A server outage (Amazon East Coast, I am looking at you) won't result in total failure.
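
From the base station's point of view, mirroring might look like nothing more than a failover list -- try the primary, fall back to the mirror. A sketch with placeholder hostnames:

    import socket
    import ssl

    SERVERS = ["east.monitor.example.com", "west.monitor.example.com"]  # placeholders

    def send_with_failover(payload, port=8443):
        context = ssl.create_default_context()
        for host in SERVERS:
            try:
                with socket.create_connection((host, port), timeout=10) as raw:
                    with context.wrap_socket(raw, server_hostname=host) as tls:
                        tls.sendall(payload)
                        return True
            except OSError:
                continue            # that mirror is down; try the next one
        return False                # both mirrors unreachable; data stays in the outbox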

Architecture matters.

2 comments:

  1. But by what standard or requirement do you measure success? Is the goal to go from good to better or is it to achieve a measurable standard?

    Comcast and Vonage have avoided setting any firm standards by which we, as customers, can measure success. It has been my experience that this works well until it does not. If you ask these companies to investigate a performance issue, it ends at "it is working now..."

    How will you measure success? How do you create success on the squishy Comcast like foundation?

    ReplyDelete
  2. My success is measured independently of the carrier (Comcast, FIOS, etc). That is, *my* notion of success. The user's notion of success is whether or not they receive timely and accurate information (messages in vs. messages out). I log messages on both sides (i.e., base station and cloud). I can measure against what goes in and what comes out.

    My test bed conduit is my FIOS connection. It is pretty rock steady (I've had very few noticeable outages in the past few years). The baseline requirement for the internet connection is 24-hour auto-connectivity. If the connection flakes out every once in a while, that is fine, so long as the connection is automatically re-established in a timely manner. The message payload is tiny (less than 50 bytes per event). The base station doesn't send megabytes of data.

    But what dictates an acceptable level of flakiness? I don't have a standard measure, but there is a general market rule of thumb: if I'm streaming a movie and my provider keeps dropping the connection, I am a very unhappy customer. While many people are locked into a single-provider market now, this will change.

    For the time being, however, if you have 3G connectivity (or even lowly EDGE), that is sufficient for Home Alone to work.

    ReplyDelete