MC // Thermal Management

What’s Hiding in High-Density Deployments?

Three things to look for to avoid costly consequences


The modern data center is a complex mix of moving parts, systems, and people. Among the various components and systems, every detail must be meticulously planned, tested, and maintained to handle the needs of today’s business and tomorrow’s technologies.

At the heart of it all — amid the servers, storage devices, and network connectivity infrastructure — is the cooling technology. Data center developers and operators have invested significant time and resources into both air- and liquid-cooled technologies to maximize energy efficiency and keep the various cabinets, cages, and large-scale deployments running smoothly.

Yet these systems and designs are being driven to their limits. The average rack density increased by a full kilowatt in just the past year, due largely to the rise of AI and other compute-heavy workloads. Higher rack density, coupled with an efficient data center layout, can drive down overall build costs by reducing the total space required for deployments. And, while it’s customary for designers to run a computational fluid dynamics (CFD) analysis to prove that heat management is adequate, those models may not tell the entire story about modern facilities, where power densities of 400 W per square foot and 50 kW per rack are not unheard of.

Three Hidden Heat Risks

In the majority of critical facilities, UPSs with battery backup prevent power fluctuations from dropping workloads or damaging hardware, maintaining power until an emergency generator can start. This approach protects IT equipment, but it generally leaves mechanical systems unprotected because of cost and budget constraints.

Generators kick on for myriad reasons — testing, preventive maintenance, or repairs. These instances are precisely why concurrently maintainable facilities provide redundant units.

Regardless of the situation, backup generators can take up to 10 seconds to come online and restore power. While that may not seem like a long interval in traditional data center deployments, it can have unforeseen and wide-reaching consequences in high-density environments.

With that in mind, here are three concerns CFD analysis won’t show.

1. Thermal Runaway

Heat generated from high-density data center deployments can be intense.

Critical cooling systems need enough capacity to fully recover ideal room temperatures following an outage. Much like a sprinter who gets a poor jump off the starting line and never catches the gold medal winner, an undersized cooling system that falls behind may never recover — a problem that is only exacerbated in high-density deployments. Data centers with a myopic focus on protecting just the IT load with a UPS do so at the expense of understanding the limitations of their mechanical systems. Relying solely on CFD analysis for design insights leaves facilities exposed to a greater risk of business interruption caused by excessive heat buildup — a condition known as “thermal runaway.”

That’s because the IT load protected by a UPS does not stop producing heat while waiting for power to be restored to the facility. Without any way to remove that heat from the space, it will continue to build up and overwhelm the mechanical systems past the point of no return. And it can happen very quickly.
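To put rough numbers on “very quickly,” consider the back-of-the-envelope sketch below. It is illustrative only — the room size, IT load, and the simplification that all heat goes into the air are assumptions, and ignoring the thermal mass of hardware and structure overstates the rise — but it shows why even a 10- to 20-second cooling gap matters at high density.

```python
# Back-of-the-envelope estimate of how fast air temperature climbs when
# cooling drops out but the UPS keeps the IT load running. The room size,
# load, and "all heat goes into the air" simplification are assumptions;
# ignoring the thermal mass of hardware and structure overstates the rise,
# but the order of magnitude is the point.

IT_LOAD_KW = 1_000.0        # assumed IT load still running on UPS
AIR_VOLUME_M3 = 1_000.0     # assumed air volume in the space
AIR_DENSITY_KG_M3 = 1.2     # air near room temperature
AIR_SPECIFIC_HEAT = 1.006   # kJ/(kg*K)

air_mass_kg = AIR_VOLUME_M3 * AIR_DENSITY_KG_M3
rise_per_second_k = IT_LOAD_KW / (air_mass_kg * AIR_SPECIFIC_HEAT)

for gap_s in (10, 20, 60):  # open-transition and generator-start gaps
    print(f"{gap_s:>3} s without cooling -> ~{rise_per_second_k * gap_s:.1f} K rise")
```

Even halving those figures to account for the thermal mass of the hardware and structure leaves little margin before racks drift out of their safe inlet range.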

Designers need to have a more holistic understanding of the dynamics of cooling systems than what CFD and other surface-level analyses can provide.

It’s imperative for data center operators and tenants to thoroughly examine high-density deployments from every perspective, asking targeted questions about whether the mechanical systems are equipped to support those environments. Accounting for design variations, like thermal storage and UPS power for mechanical equipment, may mitigate potential issues before they become full-blown problems.

2. Wasted Energy

Sometimes, data centers can have heat issues not related to thermal runaway. Instead, they’ll have problems with overcooling from a “kitchen sink” approach to managing hot spots.

That is, rather than targeted mitigation through advanced delivery methods for specific high-density racks, they’ll simply flood the room with more cold air, serving every rack, whether it’s needed or not. This is like treating the symptoms without diagnosing the problem.

Overcooling has been a recognized issue for years, wasting millions of kilowatt-hours and untold dollars. While cooling systems have evolved and become more efficient, poor airflow management, coarse controls resolution, and leaky containment structures all lead to unrealized energy savings and additional energy use.

When such a situation exists, close-coupled cooling will not only help support high-density racks but also improve the power usage effectiveness (PUE) of the entire room. Rear-door heat exchangers (RDHx), in-row cooling units (IRCs), and direct liquid cooling (DLC) are fairly simple strategies for efficiently and effectively addressing concentrated heat loads.
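As a rough illustration of the PUE effect — every load figure below is an assumption, not a measurement from any particular facility — compare a room that is flooded with extra cold air against one where the few dense racks get close-coupled cooling:

```python
# Rough PUE comparison using assumed, illustrative figures: flooding the
# whole room with extra cold air versus targeting the dense racks with
# close-coupled cooling. PUE = total facility power / IT power.

IT_LOAD_KW = 1_000.0

# "Kitchen sink" approach: extra CRAH fan and chiller energy spent
# overcooling every rack for the sake of a few hot ones (assumed figure).
overcooled_mechanical_kw = 450.0

# Targeted approach: rear-door heat exchangers or in-row units on the dense
# racks let the room-level plant run near its design point (assumed figure).
targeted_mechanical_kw = 300.0

other_overhead_kw = 100.0   # assumed distribution losses, lighting, etc.

pue_overcooled = (IT_LOAD_KW + overcooled_mechanical_kw + other_overhead_kw) / IT_LOAD_KW
pue_targeted = (IT_LOAD_KW + targeted_mechanical_kw + other_overhead_kw) / IT_LOAD_KW

print(f"Overcooled room PUE: {pue_overcooled:.2f}")  # ~1.55
print(f"Close-coupled PUE:   {pue_targeted:.2f}")    # ~1.40
```

The absolute numbers matter less than the direction: removing heat at the rack lets the room-level plant run closer to its design point instead of chasing a handful of hot spots.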

3. Losing Control

While CFD analysis can show how the room responds on the whole to an increase in equipment power density from, say, 5 to 15 kW, it won’t tell you how quickly your mechanical systems can recover after an interruption.

Every disconnection coordinated with local utilities — the majority of which are facilitated by an open transition — can result in brief periods when cooling systems go offline. That gap, be it only 10 to 20 seconds, is routinely overlooked by CFD analyses but can lead to a host of problems.

In practice, mechanical systems take longer to recover than other systems. Moving parts, like fans, pumps, and compressors, need to spin down to a halt, and equipment controllers — which themselves fail with some regularity — need to perform an internal “wellness check” before they can restart. Meanwhile, the generator engine needs to come up to speed before it can produce electricity.

As a result, it's important to commission the entire mechanical system, including the control scheme and sequences, to make sure everything comes back online immediately after power is restored. Because controllers themselves can lose their connections, it’s equally important to plan for equipment, like chillers and fans, to be capable of running independently of the building automation system (BAS). Doing so avoids unexpected or unsupervised failures that can lead to costly service-level agreement (SLA) and service-level objective (SLO) violations, interruptions to mission-critical applications, and physical damage to the hardware itself.

See the Light or Feel the Heat

Data center technologies continue to evolve quickly, and, with each passing generation, they become increasingly complex to manage and maintain. Common CFD analyses can show how a room performs in its “steady-state” condition — how it will respond over the course of several minutes to hours. But the stakes in high-density environments are substantially higher, and the timescales are significantly shorter. Owners and operators should request more computationally intensive transient analyses to better understand and plan for the peak temperature a cabinet may reach and to identify potential departures from ASHRAE allowable conditions.
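The kind of question a transient analysis answers can be previewed with an even simpler lumped model that extends the earlier back-of-the-envelope estimate. Everything below — load, air volume, cooling capacity, restart ramp, and the allowable inlet ceiling — is an assumed figure for illustration, not a design value, and counting only the air’s thermal mass exaggerates the swing; a proper transient CFD study is what belongs in the design package.

```python
# A minimal lumped (single-node) sketch of the transient question a
# steady-state CFD run doesn't answer: how hot does the air get when
# cooling drops out for a short interval and then ramps back up?
# Every parameter here is an assumption for illustration only, and the
# model counts only the air's thermal mass, which exaggerates the swing.

IT_LOAD_KW = 1_000.0                                # assumed IT load riding through on UPS
AIR_THERMAL_MASS_KJ_PER_K = 1_000.0 * 1.2 * 1.006   # ~1,000 m^3 of air
COOLING_CAPACITY_KW = 1_200.0                       # assumed capacity once fully running
COOLING_OUTAGE_S = 10.0                             # open transition / generator start gap
COOLING_RESTART_RAMP_S = 30.0                       # assumed fan/pump/controller recovery ramp
ALLOWABLE_INLET_C = 32.0                            # assumed allowable ceiling for the space
START_TEMP_C = 24.0

dt = 1.0                                            # simulation step, seconds
temp_c = START_TEMP_C
peak_c = temp_c
for step in range(180):                             # simulate three minutes
    t = step * dt
    if t < COOLING_OUTAGE_S:
        cooling_kw = 0.0                            # no heat rejection during the gap
    else:
        ramp = min(1.0, (t - COOLING_OUTAGE_S) / COOLING_RESTART_RAMP_S)
        cooling_kw = COOLING_CAPACITY_KW * ramp
    temp_c += (IT_LOAD_KW - cooling_kw) * dt / AIR_THERMAL_MASS_KJ_PER_K
    peak_c = max(peak_c, temp_c)

verdict = "exceeds" if peak_c > ALLOWABLE_INLET_C else "stays within"
print(f"Peak air temperature ~{peak_c:.1f} C, which {verdict} "
      f"the assumed {ALLOWABLE_INLET_C:.0f} C allowable ceiling")
```

Even this crude model makes the point: the peak arrives well after the power gap itself ends, during the window when fans, pumps, and controllers are still ramping back up.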

In response, data center developers, operators, and their supply chain partners must continue to push the envelope with new remedies and strategies for dealing with the challenges of modern facilities. Those strategies must include shrinking outage windows to just seconds. High-density environments require well-devised plans for heat mitigation during often-overlooked short interruptions to avoid getting burned — literally and figuratively speaking.

Lead Image: sdecoret/iStock via Getty Images.

Brian Medina

Brian Medina is director of strategy and development for Stack Infrastructure.


February 2021
