Data Center Reliability and Availability

Data center reliability is a measure of the time between failures in your data center. Every piece of software and hardware has different failure points, so it can be extremely difficult to predict when a failure will occur.

Phases of Data Center Components

Most components have three different phases that they go through:

  1. Burn-in
  2. Normal Operation Time
  3. Failure

Most failures occur during the burn-in period. Once the component reaches normal operation, fewer failures occur. As with anything, failure becomes more likely again as the component ages.
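
The three phases can be pictured as a failure rate that starts high, drops, and then climbs again with age. The Python sketch below models that with a simple piecewise function. The function name, the phase boundaries and the rate values are all hypothetical, chosen only to show the shape, and are not data for any real component.

```python
def failure_rate(age_hours, burn_in_end=500, wear_out_start=40000):
    """Return an illustrative failures-per-hour rate for a component of a given age.

    All thresholds and rates here are assumed values for illustration only.
    """
    if age_hours < 0:
        raise ValueError("age must not be negative")
    if age_hours < burn_in_end:
        return 0.002      # burn-in: failures are most common
    if age_hours < wear_out_start:
        return 0.0001     # normal operation: failures are rare
    return 0.001          # aging: failures become more likely again


# Print the assumed rate for one age in each phase.
for age in (100, 10000, 50000):
    print(age, failure_rate(age))
```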

Electrical Storm Failure

Failures also depend on where your data center is located. For example, if it is in an area with frequent electrical storms, failures may occur more often. Any extreme environmental condition puts stress on your components, making them more vulnerable to failure.

Reliability Failure Rate

While reliability measures the failure rates of software and hardware components, availability measures the percentage of time a component is operating correctly.

Availability is calculated with the following formula:

Availability = mean time to failure (MTTF) / (mean time to recovery (MTTR) + MTTF)

Knowing these rates gives you a good idea of when a particular component can be expected to fail. Items such as connectors and boards have a longer life expectancy, while disks have a much shorter one. To raise availability you then have two choices: you can improve the MTTF or reduce the MTTR. It is also vitally important to have a good recovery method in place for when a failure does occur.
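
As a rough illustration of the formula, the following Python sketch computes availability from MTTF and MTTR and compares the two choices of improving the MTTF or reducing the MTTR. The function name and the hour figures are made up for the example and are not measurements from a real data center.

```python
def availability(mttf_hours, mttr_hours):
    """Availability = MTTF / (MTTR + MTTF), as a fraction between 0 and 1."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR must not be negative")
    return mttf_hours / (mttr_hours + mttf_hours)


# Hypothetical figures for one component (illustrative only).
baseline = availability(mttf_hours=1000, mttr_hours=10)
longer_mttf = availability(mttf_hours=2000, mttr_hours=10)   # improve the MTTF
shorter_mttr = availability(mttf_hours=1000, mttr_hours=5)   # reduce the MTTR

for label, value in [
    ("baseline", baseline),
    ("double MTTF", longer_mttf),
    ("halve MTTR", shorter_mttr),
]:
    print(f"{label}: {value:.4%}")
```

With these assumed numbers, doubling the MTTF and halving the MTTR give the same availability, because only the ratio of MTTR to MTTF matters in the formula. Which of the two is cheaper to achieve depends on the component.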

Unexpected Power Outage

Of course, unexpected power outages have to be taken into account when estimating data center reliability and availability. In North America, certain areas see more outages, along with blackout periods during the summer months.

Major Delays in Power

Once the availability and reliability data has been produced, you can take precautions to prevent a loss of power from causing major delays. This can be a backup system that takes over if power is suddenly lost. Serviceability is the time it takes to get your system back up and running after a failure has occurred. This includes identifying and locating the failed component and getting it back online.



