When Ma Bell Dressed to the Five 9s

Recently I received an email from a faculty member asking about the reliability of various technology systems on campus. This is a common concern and, from a faculty perspective, quite a reasonable one. Unless you’ve had to stand in front of a group of students while trying to identify why your digital image won’t load in Blackboard or your YouTube video won’t play, you can’t know the sense of rising panic that comes from a dozen thoughts rushing through your head.

“What if I can’t get this to work?”
“How will I make my point?”
“I’ve only 50 minutes of class and I’ve already lost 10 of them!”
“I’m losing students’ attention.”
“Am I missing something obvious; do I look like an idiot?”
“IT sucks!”

And even if someone steps up to help, the distraction is, at best, an interruption to the flow of the classroom, where focus and attention are essential conditions of intellectual discovery.

Or maybe your students reported that an out-of-class assignment or resource wasn’t available. The resulting delay and its impact on students’ learning prompt further exasperation as you ascertain whether this was a widespread issue or an isolated case. Either way, a redo is required. Again, lost time.

One’s degree of tolerance for uncertainty can help determine which technologies to use in class, and how (a topic that deserves its own post). Given the ubiquitous nature of technology, however, it is worth reflecting on our expectations for technology’s reliability and availability. The same day I received the email asking about the reliability of our campus systems, within hours I saw a post in my Google Reader regarding an outage of Amazon’s servers. The impact was broad, affecting a number of prominent cloud-based services.

If one thing is certain, it is that Amazon has plowed far more financial, technical, and human resources into its technology service infrastructure than we in Higher Education do. Similarly, Google – no small player either – has had the occasional service outage, including:

  • February 27/28, 2011: Gmail outage
  • September 24, 2009: Gmail outage
  • September 1, 2009: Gmail outage
  • May 14, 2009: Google network outage
  • March 9, 2009: Gmail outage
  • August 7, 2008: Gmail and Google Apps outage

Traditionally, the availability of technology services is measured by the standard of Five Nines. Five Nines means that a given technology system is available 99.999% of the time. For example, over the course of a year, a technology-based service that meets this standard would be unavailable for a mere 5.26 minutes or less (excluding planned downtime for maintenance). Interpreting the impact of Five Nines is not as straightforward as it sounds. The specific time that a system becomes unavailable may be more critical than the percentage of time it IS available. If the 5.26 minutes of unavailability occurs during your class period when you’ve planned to Skype in Warren Buffett to discuss the Role of Philanthropy in the Public Sphere (thanks to a connection facilitated by a friend of a friend’s friend’s banker’s wife’s high school teacher), those lost 5.26 minutes aren’t a trifle.
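The arithmetic behind the standard is worth seeing. As a minimal sketch (in Python, purely for illustration), here is how an availability percentage translates into allowable downtime per year:

    # Convert an availability percentage into maximum downtime per year.
    MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

    def downtime_minutes(availability_pct):
        """Minutes per year a service may be down at the given availability."""
        return MINUTES_PER_YEAR * (1 - availability_pct / 100)

    for pct in (99.0, 99.9, 99.99, 99.999):
        print(f"{pct}% available -> {downtime_minutes(pct):.2f} minutes down per year")

The Five Nines row comes out to the 5.26 minutes cited above; drop to a merely respectable 99.9% and the allowance balloons to more than eight hours.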

The Five Nines standard is an artifact of the reliability of the Plain Old Telephone Service (POTS). As so often happens with technology, we tend to overlay our experience with (and therefore our expectations of) old technologies onto new technologies. The result can be that we make naïve assumptions, a point raised by Randall Stross, a business professor at San Jose State University, who asks,

“Can we realistically expect that such availability will ever come to Internet services?”

He correctly points out that “Perceived reliability is determined by the least reliable service in the chain.” And the chain runs long. Among the reasons you may perceive that a campus system, say email or blueView or Blackboard, is unavailable are:

  1. The server crashed, i.e., a physical machine or component failed
  2. The server crashed, i.e., a system file was corrupted or a process hung
  3. A network router or switch failed
  4. Demand exceeded present capacity, i.e., we didn’t have sufficient computing power to meet the volume of people using the system (like trying to cram 50 people into a phone booth)
  5. A link broke, i.e., a developer or programmer (ours or the vendor’s) changed a piece of code
  6. A user’s authentication level changed, i.e., the business rules about who gets access to what don’t match reality
  7. A firewall, port, or network rule was changed
  8. The browser or client is outdated, i.e., a conflict between old and new code
  9. A virus, malware, or bot is interfering, i.e., a conflict between bad and good code
  10. A user error occurred, i.e., fat fingers, bad information, lack of training, etc.

And these are just the first 10 possible reasons that come to mind.

There are two things that we, as a community, can do to make system reliability more predictable. First, to effectively diagnose what caused the loss of a system’s availability, we need data (the unavailable system may be the problem to you; to us it is a symptom of a bigger problem). Data helps us identify patterns: for example, is the time of day significant? Or the number of users? If you experience a loss of service, reporting it to the Support Center is enormously useful. Reporting via email is most effective (unless it is the email system that is out!) because it ensures a time and date stamp and allows us to follow up for more information, for example, if the problem is experienced by only a limited number of users.

The second thing that can be done is for DTS to do a better job of communicating current service quality. When Amazon’s servers went down, their Health Dashboard reported the status.

AWS Service Health Dashboard

DTS is working to implement a similar dashboard that will indicate the current availability of key campus services. It will also allow us to compile historical patterns, based on objective data, to report availability. This is critical because, for the end user, availability is ultimately a matter of perception. Identifying cases where a system is perceived as unavailable but in fact is available allows us to build stronger user support and communication services, because we can target training needs, process improvements, or necessary capacity upgrades.
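As a rough sketch of the kind of reporting such a dashboard enables (the service names and outage figures below are hypothetical, not actual campus data):

    # Hypothetical outage log: minutes of downtime per service over a
    # 30-day window. Illustrative only; the real dashboard will draw on
    # objectively logged incidents.
    outage_minutes = {"Email": [12, 3], "Blackboard": [45], "blueView": []}

    MINUTES_PER_WINDOW = 30 * 24 * 60  # one 30-day reporting window

    for service, outages in outage_minutes.items():
        downtime = sum(outages)
        availability = 100 * (1 - downtime / MINUTES_PER_WINDOW)
        print(f"{service}: {availability:.3f}% available ({downtime} min down)")

Pairing numbers like these with reported perceptions of unavailability is what lets us separate actual outages from support and training gaps.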

Look for more information this summer on our DTS Dashboard and, until then, report outages to supportcenter@drake.edu.
