Self-built, on-premises data centres have long been the preference for many High-Performance Computing (HPC) applications. However, the tide is turning. The demand for more compute power, combined with relatively short refresh cycles, is now proving difficult for many to manage in terms of financial returns, logistics (parallel builds) and the upgrading of plant and utilities to site (power).
Additionally, increasing dependence on the availability of platforms is driving a move towards critical services environments, making it operationally intensive for in-house teams to maintain and upgrade these platforms. The facilities required often cost as much as the computers themselves.
However, finding alternatives can be challenging. Colocation providers capable of providing suitable environments, especially for powering and cooling these highly dense and complex platforms, are few and far between in the UK and many parts of Europe. Furthermore, the majority of colocation providers have little experience of HPC, and their business models do not support the custom builds required. The cooling alone demands bespoke build and engineering skills.
The public cloud, the private cloud, or a combination of both offers further possibilities. The public cloud is growing in popularity as a delivery model for certain HPC applications, such as those in manufacturing and life sciences, and may be fine for standard workloads: highly parallelised codes or ensembles where there is a high tolerance for individual job failure and execution locality is not a prerequisite. Nevertheless, administrators will still need in-depth knowledge of the cloud provider’s architecture, on a case-by-case basis, to ensure it is the right fit for the application concerned.
Additionally, cloud may present issues with data protection, control, privacy and security for HPC use cases. There could also be compute performance, I/O and communications limitations. HPC is considerably more complex than general-purpose computing: it needs different CPU and GPU server capabilities, highly engineered interconnects between all the various systems and resources, and storage latencies held in the low milliseconds, microseconds or even nanoseconds. All this requires highly specialised workload orchestration.
New HPC considerations
Power
The ultimate limitation for most on-premises or commercial data centres will be the availability of sufficient power. Highly concentrated power to rack in ever smaller footprints is critical, as dense HPC equipment needs high power densities, far more than the average colocation facility in Europe typically offers. The average colocation power per rack is around 5 kW and rarely exceeds 20 kW, compared to HPC platforms, which typically draw around 30 kW and upwards. However, Vantage is seeing densities rise to 40 or 50 kW, with some installations in excess of 100 kW.
While it is unusual to have a data centre which is overprovisioned on power versus space, that’s exactly what's needed for HPC. Typical data centres will quickly exhaust their power and be left at low space occupancy.
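The arithmetic behind this can be sketched in a few lines of Python. This is an illustrative calculation only: the hall size (200 rack positions) and its 1,000 kW power budget are hypothetical figures chosen to match a facility provisioned for the 5 kW average, and the function names are invented for the example.

```python
def racks_supported(hall_power_kw: float, rack_density_kw: float) -> int:
    """Whole racks that a fixed IT power budget can feed at a given density."""
    return int(hall_power_kw // rack_density_kw)


def space_occupancy(hall_power_kw: float, rack_positions: int,
                    rack_density_kw: float) -> float:
    """Fraction of rack positions that can actually be powered."""
    powered = min(rack_positions, racks_supported(hall_power_kw, rack_density_kw))
    return powered / rack_positions


# Hypothetical hall: 200 positions provisioned for a 5 kW average (1,000 kW in total).
HALL_POWER_KW = 1000.0
RACK_POSITIONS = 200

# Densities taken from the ranges quoted in the text: 5 kW average, 30 kW+ for HPC.
for density_kw in (5, 30, 50, 100):
    racks = min(RACK_POSITIONS, racks_supported(HALL_POWER_KW, density_kw))
    share = space_occupancy(HALL_POWER_KW, RACK_POSITIONS, density_kw)
    print(f"{density_kw:>3} kW/rack: {racks:>3} racks powered, "
          f"{share:.1%} of rack positions used")
```

Under these assumptions, the same hall that is full at 5 kW per rack can power only 33 of its 200 positions at 30 kW per rack, and 10 at 100 kW per rack, which is the low space occupancy described above.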
It is essential to check whether the colocation facility can provide that extra power now, rather than just promising it for the future, and whether it charges a premium price for routing more power to your system. Furthermore, check that the multi-cabled power aggregation systems required include sufficient power redundancy.
Critical Services
While many HPC users were previously happy to tolerate outages on their early-generation platforms, organisations are becoming increasingly reliant on HPC for mainstream activity, which implies a more urgent need for critical services hosting to accommodate them. This is not necessarily provided by typical colocation facilities looking to move up from general-purpose applications and services to supporting true HPC environments.
There will always be some form of immediate failover power supply, typically an uninterruptible power supply (UPS), in place, which is then replaced by auxiliary power from diesel generators. However, such immediate power provision is expensive, particularly when there is a continuous high draw, as is required by HPC. UPS and auxiliary power systems must be capable of supporting all workloads running in the facility at the same time, along with overhead and enough redundancy to deal with any failure within the emergency power supply system itself.
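As a rough illustration of that sizing rule, the Python sketch below counts the UPS modules needed to carry the full IT load plus overhead, with spare modules for redundancy. Everything specific here is an assumption made for the example: the 30% overhead fraction, the 500 kW module rating, the single spare module (a common "N+1" arrangement) and the function name. Real designs depend on the facility and the vendor's equipment.

```python
import math


def ups_modules_required(it_load_kw: float, overhead_fraction: float,
                         module_kw: float, redundant_modules: int = 1) -> int:
    """Modules needed to carry the total load, plus spares for redundancy.

    it_load_kw        -- all workloads running at the same time
    overhead_fraction -- assumed extra load (cooling, losses) as a fraction of IT load
    module_kw         -- assumed rating of one UPS module
    redundant_modules -- spares that cover a failure inside the UPS system itself
    """
    total_load_kw = it_load_kw * (1 + overhead_fraction)
    base_modules = math.ceil(total_load_kw / module_kw)
    return base_modules + redundant_modules


# Hypothetical HPC hall: 20 racks at 50 kW each, running concurrently.
it_load_kw = 20 * 50
modules = ups_modules_required(it_load_kw, overhead_fraction=0.3, module_kw=500)
print(f"IT load {it_load_kw} kW -> {modules} UPS modules of 500 kW (N+1)")
```

With these assumed figures, 1,000 kW of IT load becomes 1,300 kW with overhead, which needs three 500 kW modules to carry and a fourth for redundancy. The same calculation applies to sizing the auxiliary generators.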
Cooling
Increasingly, in-house solutions will face constraints as densities continue to rise. HPC requires highly targeted cooling, and simple computer room air conditioning (CRAC) or free air cooling systems (such as swamp or adiabatic coolers) typically do not have the capabilities required. Furthermore, hot and cold aisle cooling systems are increasingly inadequate for addressing the heat created by larger HPC environments, which will require specialised and often custom-built cooling systems and procedures.
In reality, many data centres are 'productised' to a single plant architecture or are simply not laid out to support successive bespoke builds. This makes implementing HPC a challenge when each compute platform has different and specialised cooling requirements. It also places increased emphasis on having on-site engineering personnel with knowledge of designing and building bespoke cooling systems, such as direct liquid cooling, which removes heat highly efficiently and avoids on-board hot spots. Such systems reduce the problems of high temperatures without excessive air circulation, which is both expensive and noisy.
Sustainability
As the HPC market grows, so do the implications of running such energy-intensive and complex infrastructure. To achieve sustainability, data centre industry leaders such as Vantage, along with HPC vendors, are prioritising ways to reduce CO2 impact and even decarbonise HPC.
Fibre Connectivity/Latency
The majority of commercial data centres have far higher levels of diverse fibre connectivity than on-premises campuses. Basic public connectivity solutions will generally not be sufficient for HPC systems.
Ensuring connectivity through multiple diverse connections from the facility is crucial, along with specialised connections to public clouds, especially in the case of hybrid cloud solutions. These specialised connections bypass the public internet to enable more consistent and secure interactions between the HPC platform and other workloads the organisation may be operating.
Location
The physical location of the data centre will impact directly on rack space costs and power availability. In the case of colocation there are often considerable differences in rack space rents between regional facilities and those based in or around large metro areas such as London. Perhaps of more concern to HPC users, most data centres in and around London are severely power limited.
Vantage CWL1 in Wales does not have such challenges. Our colocation facility offers abundant sustainable power, scalability, low-latency connectivity, on-site engineering skills and the flexibility to fulfil almost any High-Performance Computing requirement, including custom data hall environments and bespoke cooling. https://vantage-dc-cardiff.co.uk/hosting/#high-performance-computing