Structured testing of data centres

Most data centre testing is focused on individual data halls rather than the entire facility. As a result, components such as power, cooling, generators and back-up batteries for UPS can be missed or not properly tested. Dave Wolfenden of Mafi Mushkila and Karl Sullivan of Optimum Power Services have joined forces to test the entire data centre, not just part of it.

A need for structured testing, not random guesswork

Any form of testing needs a structured approach. It doesn’t matter whether you are testing software, hardware, a data centre or doing the MOT on a car. Without a structured approach it is easy to miss things that could later turn out to be a major challenge.
Historically, when testing data centres, most of the attention has been on the data halls. In many respects that made, and still makes, perfect sense. This is where customer equipment is housed and where the majority of changes are made. The problem is that very few testing companies have chosen to cover the entire data centre, from generators to switches and internal lighting.
Cloud computing and data sovereignty have intensified the building and refurbishment of data centres. Given the demand, Mafi Mushkila and Optimum Power Services have announced that they are to collaborate on testing and have developed a four-stage approach.
A four-stage approach to comprehensive facility testing
In order to properly test a facility it is necessary to start from the outside with the capital plant such as chillers, coolers, generators, transformers, UPS and electrical switchgear. From there the attention moves to the data halls. The reason for this is that the main power circuits into the premises rarely change. The same is true of the generators and often of the batteries for UPS emergency backup.

What do change regularly are the internal components, such as the hardware in the racks and even the racks themselves. The re-emergence of the mainframe as a two rack unit and its increasing uptake, alongside the rapidly growing converged appliance market, means that devices are now drawing more power than ever before, creating new heat spot problems. This means that internal testing is no longer a one-off. Instead, it needs to be a continual process.

Level 1: Commissioning and testing the cooling capital plant
The type of cooling chosen will determine the cooling equipment that is to be used. It is key to test the equipment beyond the normal temperature range in which it will operate. For example, if you expect to be cooling a heat load of 50°C, you will need to heat the coolant, be that water, glycol or an alternative, to at least that temperature using, for example, industrial boilers.
With the extra heat that is generated by new generations of hardware and greater density of deployment, it is important to consider testing substantially beyond the initial expected heat load. Once the coolant has been heated to the required temperature, measure how long it takes the cooling and chiller systems to cool it. Depending on the type of cooling, external temperatures can affect the time taken for the coolant to drop to the required temperature.
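
The cool-down measurement can be sketched in a few lines of Python. This is only an illustration, assuming the coolant temperature is logged as (elapsed seconds, temperature in °C) pairs. The function name, the sample log and the 20°C target are hypothetical and are not part of the test programme described here.

```python
from typing import Optional, Sequence, Tuple


def cooldown_seconds(
    samples: Sequence[Tuple[float, float]], target_c: float
) -> Optional[float]:
    """Return the seconds from the first sample until the coolant first
    reaches target_c, or None if the target is never reached.

    samples: (elapsed_seconds, coolant_temp_c) pairs in time order.
    """
    if not samples:
        return None
    start = samples[0][0]
    for elapsed, temp_c in samples:
        if temp_c <= target_c:
            return elapsed - start
    return None


# Hypothetical log: coolant heated to 50°C, then the chillers are started.
log = [(0, 50.0), (60, 44.5), (120, 38.0), (180, 31.0), (240, 24.5), (300, 19.5)]
print(cooldown_seconds(log, target_c=20.0))
```

Running the same measurement at different outside temperatures, and recording the result each time, gives a simple way to see how much the external conditions affect the cool-down time.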
One area that is often neglected is inline temperature testing. Temperature is often measured at the level of the equipment, rack, enclosure and even data hall. However, installing temperature monitors inside the air system itself will show any imbalance in the way air is being moved through the system, allowing engineers to remediate quickly.
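
As a rough illustration of how inline readings might be screened for imbalance, the sketch below compares each sensor with the median of all sensors on the same air path. The sensor names, the readings and the 2°C tolerance are assumptions made for the example, not recommended values.

```python
from statistics import median
from typing import Dict


def imbalanced_sensors(
    readings_c: Dict[str, float], tolerance_c: float = 2.0
) -> Dict[str, float]:
    """Return the sensors whose reading differs from the median of all
    readings by more than tolerance_c, with the size of the difference."""
    mid = median(readings_c.values())
    return {
        name: round(temp_c - mid, 2)
        for name, temp_c in readings_c.items()
        if abs(temp_c - mid) > tolerance_c
    }


# Hypothetical readings from monitors installed inside the supply air system.
supply_air = {"duct_a": 18.1, "duct_b": 18.4, "duct_c": 22.9, "duct_d": 17.9}
print(imbalanced_sensors(supply_air))
```

With these sample readings only duct_c is reported, which would point engineers at that branch of the air system first.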

Level 2: Testing the power systems
Generator testing can often be very difficult. You need to create representative load banks for the equipment the generators will have to power. These need to be spread around the facility to provide a reasonable representation of how the loading is expected to look once the data centre opens. The goal is both to test the generators and to spot any power distribution issues that might occur. With data halls now designed for different power loads, good testing processes will mean testing beyond the expected power per square metre to ensure that there is enough headroom for new generations of converged hardware.
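
The arithmetic behind testing beyond the expected power per square metre can be sketched as follows. All figures are hypothetical: the hall areas, the design densities, the 20% headroom and the 100kW load bank size are assumptions chosen only to show the calculation.

```python
import math
from typing import Dict, Tuple


def load_bank_plan(
    halls: Dict[str, Tuple[float, float]],
    headroom: float = 0.2,
    bank_kw: float = 100.0,
) -> Dict[str, Tuple[float, int]]:
    """For each hall return (test load in kW, number of load banks needed).

    halls maps a hall name to (floor area in m2, design load in kW per m2).
    headroom is the fraction by which the test exceeds the design load.
    """
    plan = {}
    for name, (area_m2, design_kw_per_m2) in halls.items():
        # Round to 0.1 kW so floating point noise cannot add a spare bank.
        test_kw = round(area_m2 * design_kw_per_m2 * (1 + headroom), 1)
        plan[name] = (test_kw, math.ceil(test_kw / bank_kw))
    return plan


# Hypothetical halls with different design densities.
halls = {"hall_1": (500, 1.5), "hall_2": (300, 2.0)}
for hall, (test_kw, banks) in load_bank_plan(halls).items():
    print(hall, test_kw, banks)
```

Planning the load per hall, rather than as a single facility total, keeps the load banks spread around the site in the way the final equipment is expected to be.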

UPS testing also requires sufficient load to fully test the ability of the batteries to provide power to the data halls and equipment for the required length of time. A common mistake is not spreading the load throughout the facility to ensure that any power loss is properly factored into the UPS.

An area that must not be overlooked is the impact of the power testing on the switchgear. Running the testing at maximum load for an extended time will show if there is a risk of damage to the switchgear that has been installed. It will also reveal any fire risk, so temperatures across critical components must be monitored constantly.
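
A minimal sketch of that monitoring is shown below, assuming temperatures are sampled once a minute during the full-load run. The component names and temperature limits are hypothetical. Real limits would come from the switchgear manufacturer.

```python
from typing import Dict, Iterable, Tuple


def first_breaches(
    samples: Iterable[Tuple[int, str, float]], limits_c: Dict[str, float]
) -> Dict[str, int]:
    """Return, for each component that exceeded its limit, the minute at
    which the first breach was recorded.

    samples: (minute, component, temp_c) readings in time order.
    """
    breaches: Dict[str, int] = {}
    for minute, component, temp_c in samples:
        if component not in breaches and temp_c > limits_c[component]:
            breaches[component] = minute
    return breaches


# Hypothetical limits and readings taken during a sustained full-load run.
limits = {"busbar_1": 90.0, "breaker_3": 70.0}
run = [
    (0, "busbar_1", 45.0), (0, "breaker_3", 40.0),
    (60, "busbar_1", 78.0), (60, "breaker_3", 66.0),
    (120, "busbar_1", 84.0), (120, "breaker_3", 73.5),
]
print(first_breaches(run, limits))
```

Recording when a limit is first exceeded, and not just whether it was, shows how long the switchgear can carry maximum load before it becomes a concern.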
Level 3: The IT systems in the data hall
This is where the most mistakes are made. You don’t want equipment that creates spot heat or which cannot be run for extended periods without a risk of failure or even fire. The heat generated must match the expected IT load as much as possible and be deployed at different heights in the data centre.

The correct approach is to use industrial heating systems placed throughout the facility that run at different loads. Typically these range from 15kW to 22kW. However, changes in data centre loads mean that 50kW and 100kW units are now being used to simulate denser equipment such as converged systems and racks of hyper-converged servers.
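
To show how a target heat load for one zone might be made up from the heater sizes mentioned above, here is a small brute-force sketch. The function name and the limit of ten units per size are assumptions for the example, and the search simply picks the mix that meets the target with the least overshoot.

```python
from itertools import product
from typing import Dict, Optional, Sequence, Tuple


def heater_mix(
    target_kw: int, sizes_kw: Sequence[int] = (15, 22), max_units: int = 10
) -> Optional[Tuple[Dict[int, int], int]]:
    """Return ({heater size: count}, total kW) for the mix that reaches
    target_kw with the smallest overshoot, preferring fewer units on a tie.
    Returns None if the target cannot be reached within max_units per size."""
    best = None
    for counts in product(range(max_units + 1), repeat=len(sizes_kw)):
        total = sum(n * kw for n, kw in zip(counts, sizes_kw))
        if total < target_kw:
            continue
        key = (total, sum(counts))
        if best is None or key < best[0]:
            best = (key, counts)
    if best is None:
        return None
    return dict(zip(sizes_kw, best[1])), best[0][0]


print(heater_mix(100))  # one 15kW and four 22kW heaters, 103kW in total
print(heater_mix(50))   # two 15kW and one 22kW heater, 52kW in total
```

Passing sizes_kw=(50, 100) models the larger units in the same way.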

Additional testing around both electrical and cooling systems now takes place. It will identify whether the internal airflow, in and out, is balanced. It also ensures the electrical distribution systems and backup electrical systems are working correctly.
Level 4: Granular level testing of the data halls 
This part of the test programme focuses on the deployment scenarios for hardware. The facility owner needs to provide as much detail as possible on what is likely to be deployed and where. Knowing the type of system and how heavily it will be used allows the testing to simulate load as closely as possible to real life.

If aisle containment is to be used, build temporary aisles and test the cooling to ensure the input and output air systems work correctly. If the racks and other infrastructure are available, these should be installed and used for all the power and cooling tests.

Use rack-mounted server emulators to provide the load for different types of IT equipment such as blade servers, rack servers, converged systems and storage systems. They allow a more accurate test of where heat is generated and how it is removed from the data hall. It is essential that at this point the testing of power and cooling systems includes the entire facility, from chillers and generators to floor tiles and power strips.

Always compare the test results against the computational fluid dynamics (CFD) model built to predict how the air will move in the data centre. Minor differences are common and often caused by changes in how infrastructure is placed. Significant differences highlight errors in the planning of the facility or the testing.
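
The comparison can be illustrated with a short sketch that sorts each measurement point into a match, a minor difference or a significant difference. The probe names, the temperatures and the 1°C and 3°C thresholds are assumptions for the example. The thresholds that matter for a real facility would be agreed as part of the test plan.

```python
from typing import Dict, Tuple


def compare_to_cfd(
    predicted_c: Dict[str, float],
    measured_c: Dict[str, float],
    minor_c: float = 1.0,
    significant_c: float = 3.0,
) -> Dict[str, Tuple[float, str]]:
    """Return {point: (measured minus predicted, classification)} for every
    point in the model's predictions."""
    report = {}
    for point, predicted in predicted_c.items():
        delta = measured_c[point] - predicted
        if abs(delta) >= significant_c:
            level = "significant"
        elif abs(delta) >= minor_c:
            level = "minor"
        else:
            level = "match"
        report[point] = (round(delta, 1), level)
    return report


# Hypothetical model predictions and test measurements in °C.
predicted = {"rack_a1_inlet": 22.0, "rack_a1_outlet": 35.0, "rack_b4_inlet": 22.0}
measured = {"rack_a1_inlet": 22.4, "rack_a1_outlet": 36.5, "rack_b4_inlet": 27.0}
for point, (delta, level) in compare_to_cfd(predicted, measured).items():
    print(point, delta, level)
```

In this sample the inlet at rack_b4 is 5°C warmer than the model predicted, which is the kind of gap that calls for a check of both the plan and the test setup.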

This provides a final chance to correct things before handing the facility over to the customer.

This article is just a brief look at a four-level approach to testing a data centre.

The key takeaway is that you must test the entire facility in order to understand where problems may be waiting to strike. As data centres move towards being heavily automated hyper-scale facilities, the need to get the power and cooling tests right first time and to identify the weaknesses or potential bottlenecks becomes essential.