Data centers are evolving to stay competitive, maximize performance and safeguard infrastructure. Yet in the race to innovate, investments in these facilities are frequently made in isolation.
Too often, priority is given to a technology’s performance capabilities without considering its relationship with the larger data center technology ecosystem or potential safety, staffing or energy usage implications. These piecemeal, disconnected investments create blind spots or vulnerabilities for data centers, resulting in higher costs, operational inefficiencies and added risk that undercuts performance.
To get the most from new technology investments while minimizing risk, data center operators should weigh the capabilities of new technologies alongside their operational, financial, commercial, security and environmental, health and safety (EHS) implications.
Seeing the big picture
Data centers today face a set of interconnected challenges:
• Deploying high-performance computing (HPC) to power AI infrastructure, enable large language model (LLM) training and inference at scale, and maximize graphics processing unit (GPU) utilization for the most demanding workloads.
• Using cooling systems that keep up with growing power demands.
• Establishing operational visibility across diverse facility systems to spot and address issues.
A holistic approach that leverages expertise and considers the impact of new technologies from multiple perspectives helps data center teams make more strategic investments in the areas critical to their operations today, and it also sets them up for future improvements as operations evolve. Areas where a holistic approach can make a meaningful impact include:
Aligning HPC to specific data center workloads: HPC platforms are delivered with generalized configurations that serve as standard defaults for a wide variety of customers. Unfortunately, AI workloads are anything but generic.
To better align HPC platforms with the unique needs of their workloads, some data centers use optimization services. These services precisely match hardware capabilities with actual AI production patterns, such as training, inference at scale or hybrid recommender systems. This fine-tuning turns HPC platforms into true AI factories optimized for throughput, efficiency and scalable production workloads.
Some of the benefits that HPC optimization delivers include:
• A two- to ten-times performance improvement on AI training jobs when systems are properly aligned, depending on baseline configuration and workload type.
• Extended system lifespan through thermal envelope control and smarter workload distribution.
• Faster time-to-insight, especially for LLMs and deep natural language processing (NLP) use cases.
• Improved return on investment per watt, per rack and per GPU.
• The creation of scalable multi-tenant environments with isolated GPU slices that don’t degrade service level objectives (SLOs).
• Visibility and predictability through full-stack observability and regression-aware tuning.
Going “under the hood” of HPC platforms may sound risky, but a service provider that is experienced in HPC optimization and fluent in areas like AI platform telemetry, GPU-level tuning and orchestration stack integration will keep the process entirely within vendor-supported configurations and tooling, so platforms stay within their warranties.
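For a concrete sense of what that telemetry-driven, vendor-supported approach can look like, here is a minimal read-only sketch using NVIDIA's NVML Python bindings (the nvidia-ml-py package) to sample per-GPU utilization, memory and power. The 60% utilization threshold is an illustrative assumption, not a vendor recommendation.

```python
# Read-only GPU telemetry sketch using NVIDIA's NVML Python bindings
# (the nvidia-ml-py package, imported as pynvml). Querying utilization,
# memory and power through NVML stays within vendor-supported tooling.
# The 60% utilization threshold is an illustrative assumption.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)

        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):  # older bindings return bytes
            name = name.decode()

        util = pynvml.nvmlDeviceGetUtilizationRates(handle)       # % of time busy
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)               # bytes used / total
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # milliwatts -> watts

        print(f"GPU {i} ({name}): {util.gpu}% util, "
              f"{mem.used / mem.total:.0%} memory, {power_w:.0f} W")

        # Flag devices sitting well below the utilization a well-aligned
        # training or inference workload would be expected to sustain.
        if util.gpu < 60:
            print(f"  -> GPU {i} may be underutilized; review workload placement")
finally:
    pynvml.nvmlShutdown()
```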
Choosing the appropriate cooling system: Liquid cooling is required in today’s high-density data centers, which generate too much heat for traditional air-cooling systems.
Multiple liquid-cooling options are available. To select the right one, data center teams must consider factors like each technology’s performance ranges, deployment demands and maintenance requirements.
For example, rear-door heat exchangers provide an efficient and complete liquid cooling solution. When supplied with the water temperature and flow specified by the manufacturer and paired with appropriate containment, they can remove essentially 100% of the heat generated by IT equipment, and the air they discharge from the cabinet is at the room’s ambient temperature. However, this technology is typically limited to 85 kW to 90 kW per rack.
Direct-to-chip cooling can support up to 100 kW per rack and, under optimal conditions, as high as 120 kW. However, it isn’t a full-cabinet cooling solution: it cools only the chips, so another solution is needed for the rest of the cabinet. That gap becomes more important as rack densities continue to climb and new chipsets potentially reach densities as high as 250 kW per rack.
Immersion cooling is another option, but it remains limited in use today because it slows maintenance. Equipment must be lifted from the cooling fluid and then dried before work can be done.
Data center teams should anticipate potential issues with their liquid cooling system up front. That includes verifying whether the building has the chilled water capacity, supply temperature range and delta T to support the proposed system. Cooling design directly impacts allowable rack thermal design power (TDP), fan performance and overall power budgets, which in turn can influence network fabric architecture.
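As a rough illustration of that verification, the sketch below applies the basic heat-removal relationship, Q = flow × specific heat × delta T, to check whether a chilled water loop can absorb a proposed rack load. The flow rate, delta T and rack load shown are illustrative assumptions, not design values.

```python
# Back-of-envelope check that a chilled water loop can absorb a proposed
# liquid-cooled rack load: Q (kW) = flow (kg/s) x cp (kJ/kg-K) x delta T (K).
# The flow rate, delta T and rack load below are illustrative assumptions.

CP_WATER_KJ_PER_KG_K = 4.186  # specific heat of water

def cooling_capacity_kw(flow_l_per_s: float, delta_t_k: float) -> float:
    """Heat a water loop can remove at a given flow rate and temperature rise."""
    mass_flow_kg_per_s = flow_l_per_s * 1.0  # roughly 1 kg per litre of water
    return mass_flow_kg_per_s * CP_WATER_KJ_PER_KG_K * delta_t_k

# Illustrative inputs: 6 L/s through the loop, a 4 K rise across the coil,
# checked against a proposed 90 kW rack (the upper end cited for rear-door units).
capacity = cooling_capacity_kw(flow_l_per_s=6.0, delta_t_k=4.0)  # ~100 kW
proposed_rack_kw = 90.0

print(f"Loop capacity: {capacity:.0f} kW vs rack load: {proposed_rack_kw:.0f} kW")
print("OK" if capacity >= proposed_rack_kw else "Insufficient chilled water capacity")
```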
Optimizing operational visibility: Almost as vital as deciding how different hardware will come together in a data center is deciding how all the data from that hardware will be integrated and used.
Rather than monitoring every data stream separately, teams can use a modern data center infrastructure management (DCIM) platform with IT telemetry ingestion to get a single, bird’s-eye view of all data center operations. A DCIM platform aggregates data from every information technology (IT) and operational technology (OT) system and device in a data center and helps operators make sense of it, all in a single, integrated experience.
A DCIM platform can deliver useful insights into areas like a data center’s energy usage or proactive maintenance needs, helping to optimize facility operations. And with command-and-control capabilities, the platform can ease operators’ jobs by allowing them to manage multiple data center functions in one place.
Because the purpose of a DCIM platform is to connect data across disparate technologies, data center teams should choose a platform that is open and vendor agnostic. It should also have built-in integration with facility systems and API connectors for other vendor software.
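To illustrate what open, vendor-agnostic integration looks like in practice, the sketch below shows the connector pattern such a platform depends on: each connector pulls telemetry from one IT or OT system and normalizes it into a common record before aggregation. The sources, field names and readings are hypothetical, not any specific vendor’s API.

```python
# Sketch of the vendor-agnostic connector pattern behind a DCIM platform:
# each connector pulls telemetry from one system and normalizes it into a
# common record, so operators see one view instead of separate data streams.
# The sources, field names and readings here are hypothetical.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Reading:
    source: str   # which IT or OT system produced the reading
    asset: str    # rack, CRAH unit, PDU, etc.
    metric: str   # e.g. "power_kw", "supply_temp_c"
    value: float

def power_meter_connector() -> list[Reading]:
    # In practice this would call the power monitoring vendor's API.
    return [Reading("power-meters", "rack-A01", "power_kw", 78.4)]

def cooling_connector() -> list[Reading]:
    # In practice this would poll the BMS or cooling controller.
    return [Reading("bms", "crah-03", "supply_temp_c", 19.5)]

CONNECTORS: list[Callable[[], list[Reading]]] = [power_meter_connector, cooling_connector]

def collect() -> list[Reading]:
    """Aggregate every connector's output into one normalized stream."""
    readings: list[Reading] = []
    for connector in CONNECTORS:
        readings.extend(connector())
    return readings

for r in collect():
    print(f"[{r.source}] {r.asset} {r.metric} = {r.value}")
```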
Empowering the data center ecosystem
Individual technologies don’t create competitive data centers. That only happens through smart, coordinated integration of multiple technologies. By thinking holistically about each new investment, data centers can unleash the performance capabilities of new technologies while reducing risk, maximizing return on investment and better positioning their operations for future changes and new demands.