Datadog has introduced GPU Monitoring, now available to customers globally. The product is designed to help organisations manage the rising costs associated with AI workloads.
“GPU instances account for 14 percent of compute costs—which is a huge issue as companies are struggling to build AI-first technology in scalable and smart ways. While these companies can see their costs climbing, they can’t chargeback GPU spend across business units, see workload context or identify clear next steps for improvement. As a result, it is very challenging to budget and plan in thoughtful ways,” said Yanbing Li, Chief Product Officer at Datadog.
The launch comes as companies seek more effective ways to manage GPU spending linked to AI workloads. Many organisations face difficulties allocating GPU costs across business units, and limited workload context can make budgeting and planning more complex.
GPU Monitoring aims to provide a unified view across AI infrastructure, linking GPU fleet health, cost, and performance to the teams using those resources. This is intended to speed up troubleshooting of underperforming workloads and improve cost visibility.
As AI deployments scale, managing compute resources increasingly becomes a matter of broader organisational planning, particularly when capacity is misallocated or when training and inference workloads run into cost or performance constraints. Many organisations currently operate with fragmented visibility into GPU usage; GPU Monitoring is intended to consolidate that view.
Existing GPU monitoring tools typically provide basic hardware health metrics but may not surface cross-team resource contention, explain why workloads failed, or identify underused devices. This can slow investigations and encourage precautionary overprovisioning, driving up resource usage.
By connecting GPU fleet telemetry with workload data, GPU Monitoring provides a shared view for platform engineering and machine learning teams, enabling them to:
- Scale AI without overspending: Usage insights help guide capacity planning, support decisions on new GPU purchases versus reallocation, and improve cost predictability.
- Accelerate AI delivery: Linking performance issues to specific GPUs and processes helps identify bottlenecks more quickly.
- Avoid costly disruptions: Early detection of unhealthy GPUs can help reduce the risk of broader system failures.
- Maximise ROI on GPU spend: Visibility into utilisation enables teams to identify underused or overprovisioned resources and adjust allocation accordingly.
Overall, GPU Monitoring is positioned as a tool to improve visibility and resource management for AI workloads across organisations.