Improving Operational Reliability via the Data Center Incident Reporting Network

November 17, 2017

Electronic Powerhouse - Vertiv National Distributor

We recently released a report ranking the World’s Most Critical Industries, where we found some interesting comparisons to the data center industry. One stark contrast did not make it into the report: unlike the seven critical industries we identified, the data center industry lacks an incident reporting framework. This gap could be due in large part to the data center industry being less mature than critical industries such as utilities or transportation.

However, there is some good news to report – the United Kingdom Data Centre Interest Group (UKDCIG), a not-for-profit organization focused on data center technologies, best practices and policies, recently announced the formation of the Data Center Incident Reporting Network (DCIRN). 

What is the Data Center Incident Reporting Network, why would I care, and what is in it for me?

Perhaps the best way to address these questions is to step back and think of the last experience you had traveling on a commercial airline. Despite any delays, crowded airplane seating or baggage issues, you arrived at your destination safely. Your safe arrival can be attributed to over 100 years of shared learnings from incidents, accidents and errors.

The aviation and maritime industries have a well-established practice of anonymously sharing incident reports. The time has come for the data center industry to adopt a similar practice. With the launch of the DCIRN, we have the beginnings of a framework for the public reporting of data center incidents, and an Americas base may follow soon via the Infrastructure Masons.

What is an incident, and why should the data center industry worry about reporting them?

Let’s start with the latter – why? Simply put, a data center or network outage may be more than an inconvenience; it can also have a damaging financial impact. As advancements such as Edge, IoT, Smart Cities and Smart Cars proliferate throughout the economy, a data center or network incident will have the potential to endanger lives, much as an aviation accident does. We may not be able to avoid these risks indefinitely, but we can certainly push that unfortunate date much further into the future.

Put simply, an incident is anything that happens within your data center or network infrastructure that you would not want to happen again. The incident itself may not have caused an outage, but it could still have had an undesirable impact on operations. Unfortunately, there are hundreds, if not thousands, of unreported incidents per year in the data center industry. Many of these incidents are essentially the same, but with no network for sharing information, the lessons learned are not passed along.

There is a learning curve* when people, mechanical and electrical systems, machines and software converge in the operation of a complex system such as a data center. The faster we move down that curve toward its lowest portion, the more likely we are to run the facility with few incidents. When the proper learning, training, infrastructure design and operation are in place, we should be well equipped to survive one incident or several.

*The Learning Curve

For more on the nature of complex man-machine systems, read Managing Risk: The Human Element by Duffey and Saull. Its main implication for data centers is that even Tier IV data centers can and will experience an incident. Proper facility design and an understanding of the human element can prevent a single incident, or errors in planning, design, operation and training, from turning into a disaster.
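
To make the learning-curve idea concrete, here is a minimal sketch in Python. It assumes a simple exponential form in which the incident rate falls with accumulated experience toward a floor that never reaches zero. The function name, the parameters and every number below are illustrative assumptions, not figures taken from the book or from any data center.

```python
import math


def incident_rate(experience, initial_rate, floor_rate, k):
    """Assumed exponential learning curve.

    The rate starts at initial_rate when experience is zero and decays
    toward floor_rate as experience accumulates. k controls how quickly
    lessons are absorbed. All inputs are hypothetical.
    """
    return floor_rate + (initial_rate - floor_rate) * math.exp(-k * experience)


# Illustrative values only: 12 incidents per year at the start,
# a floor of 1 per year, and an arbitrary learning constant of 0.5.
for years in (0, 1, 2, 5, 10):
    rate = incident_rate(years, initial_rate=12.0, floor_rate=1.0, k=0.5)
    print(f"after {years:>2} years of experience: {rate:.2f} incidents/year")
```

Under these assumptions, two points from the text show up directly. The floor stays above zero, which matches the observation that even a well-designed facility will see incidents. And anything that adds to effective experience, such as learning from incidents reported by other operators, moves a facility down the curve sooner than its own history alone would.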

Steps to Ensure Your Data Center Survives a Failure Scenario

  • Stay current on industry trade publications, online journals, forums, and blogs.
  • Attend industry events.
  • Participate in industry associations such as AFCOM, 7x24 Exchange, UKDCIG, The Green Grid and Infrastructure Masons.
  • Engage your existing infrastructure hardware and service providers for additional training, technical or service advisory bulletins and assistance in framing incident response training based on their broader experience.

However, the simplest – and currently free – step you can take is to join the DCIRN. The first quarterly incident bulletin you receive may contain that one useful tip that saves you from an easily preventable outage.