We recently blogged about statistics we have collected on the different reasons a website can go down. Those statistics mostly covered software and network issues, with facts about the most common problems.
With this in mind, we were interested to come across a survey conducted by Aperture Research Institute in April 2007 that asked the question “Which causes of outage have you experienced?”
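According to the survey, human error was the most commonly reported cause: 57.3% of the participants had experienced a data center outage caused by it. The second most common cause was improper failover, reported by 43.7%. Participants could report more than one cause, which is why the figures add up to more than 100%.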
Above: Diagram from the Aperture survey.
Another interesting fact from the survey is that 21% of the participants did not know how many servers their racks could handle power-wise. The survey's conclusion includes this statement:
In many cases the widespread use of high-density equipment is putting a strain on data center infrastructure, massively reducing any margin for error…
Unfortunately, the survey does not go into more detail about the types of errors involved, and of course human errors can be hard to categorize.
What you can do to minimize the risk of human errors
To minimize the risk of the “human factor” affecting operations, it is important to have up-to-date documentation on everything connected to your data center and manuals on how different critical operations should be performed.
Manuals and documentation together with scheduled tests should help you avoid many of the problems and outages described in this survey.
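One concrete item such documentation could record is each rack's power budget, since 21% of the survey participants did not know how many servers their racks could handle. The sketch below is only an illustration of the arithmetic, not something taken from the survey. The function name, the example voltage, amperage and per-server wattage are all hypothetical, and the 80% derating factor is an assumption based on the common practice of not loading a circuit continuously above 80% of its rating. Use measured figures for your own equipment.

```python
import math


def max_servers_per_rack(circuit_volts, circuit_amps, server_watts, derating=0.8):
    """Estimate how many servers fit within a rack's usable power budget.

    All inputs are illustrative assumptions:
      circuit_volts, circuit_amps: rating of the circuit feeding the rack
      server_watts: measured peak draw of one server
      derating: fraction of the circuit rating treated as usable (assumed 0.8)
    """
    if server_watts <= 0:
        raise ValueError("server_watts must be positive")
    # Usable power is the circuit rating reduced by the derating factor.
    usable_watts = circuit_volts * circuit_amps * derating
    # Round down: a partially powered server does not count.
    return math.floor(usable_watts / server_watts)


# Hypothetical example: one 208 V / 30 A circuit, servers drawing 350 W each.
print(max_servers_per_rack(208, 30, 350))
```

Writing the result into the rack documentation, and rechecking it whenever hardware is added, keeps the answer from living only in one person's head.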
About the survey
More than 100 data center professionals from a broad selection of industries participated in the survey, which also covered the topics of data center power usage and the use of blade servers.