As the scale and complexity of AI infrastructure grow, data center operators need continuous visibility into factors including performance, temperature and power usage. These insights let operators actively monitor and adjust data center configurations across large-scale, distributed systems, validating that those systems are running at their highest efficiency and reliability.
NVIDIA is developing a software solution for visualizing and monitoring fleets of NVIDIA GPUs, giving cloud partners and enterprises an insights dashboard that can help them boost GPU uptime across computing infrastructures.
The offering is an opt-in, customer-installed service that monitors GPU usage, configuration and errors. It will include an open-source client software agent, part of NVIDIA’s ongoing support of open, transparent software that helps customers get the most from their GPU-powered systems.
With the service, data center operators will be able to:
- Track spikes in power usage to stay within energy budgets while maximizing performance per watt.
- Monitor utilization, memory bandwidth and interconnect health across the fleet.
- Detect hotspots and airflow issues early to avoid thermal throttling and premature component aging.
- Confirm consistent software configurations and settings to ensure reproducible results and reliable operation.
- Spot errors and anomalies to identify failing parts early.
These capabilities can help enterprises and cloud providers visualize their GPU fleet, address system bottlenecks and optimize productivity for a higher return on investment.
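
To make the checks in the list above concrete, here is a minimal sketch of how an operator's own tooling might flag power spikes, hotspots and configuration drift from already-collected samples. It is not NVIDIA's agent or its API: `GpuSample`, `find_issues`, the field names and the thresholds are all hypothetical, and the sketch only reads data, it never changes GPU state.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuSample:
    """One read-only telemetry sample (hypothetical schema, not NVIDIA's)."""
    node: str
    gpu_index: int
    power_w: float
    temp_c: float
    driver_version: str


def find_issues(samples, power_budget_w, temp_limit_c):
    """Return human-readable findings for the given samples.

    Thresholds are supplied by the caller; the values used in the
    example below are illustrative only.
    """
    findings = []
    for s in samples:
        gpu = f"{s.node}/gpu{s.gpu_index}"
        if s.power_w > power_budget_w:
            findings.append(f"{gpu}: power {s.power_w:.0f} W is over budget")
        if s.temp_c >= temp_limit_c:
            findings.append(f"{gpu}: temperature {s.temp_c:.0f} C is at or over limit")

    # Configuration drift: treat the most common driver version as the
    # expected one and report every node that reports something else.
    versions = Counter(s.driver_version for s in samples)
    if len(versions) > 1:
        expected = versions.most_common(1)[0][0]
        drifted = sorted(
            {(s.node, s.driver_version) for s in samples
             if s.driver_version != expected}
        )
        for node, version in drifted:
            findings.append(f"{node}: driver {version} differs from {expected}")
    return findings


if __name__ == "__main__":
    demo = [
        GpuSample("node-a", 0, 650.0, 70.0, "v1"),
        GpuSample("node-a", 1, 720.0, 70.0, "v1"),
        GpuSample("node-b", 0, 600.0, 90.0, "v2"),
    ]
    for line in find_issues(demo, power_budget_w=700.0, temp_limit_c=85.0):
        print(line)
```

Treating the most common driver version as the expected one is a simplification; a real deployment would more likely compare against an explicitly approved configuration.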
This optional service provides real-time monitoring: each GPU system communicates and shares GPU metrics with the external cloud service. NVIDIA GPUs do not have hardware tracking technology, kill switches or backdoors.
Open-Source Agent Offers Insights for Data Center Owners
The service will feature a client software agent that the customer can install to stream node-level GPU telemetry data to a portal hosted on NVIDIA NGC. Customers will be able to visualize their GPU fleet utilization in a dashboard, either globally or by compute zone, meaning a group of nodes enrolled in the same physical or cloud location.
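
As an illustration of the global and per-zone views described above, the following sketch averages GPU utilization across node-level reports. It assumes a hypothetical `NodeTelemetry` record and helper functions; the actual data model of the agent and the portal is not described here, and the zone names are invented.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Tuple


@dataclass(frozen=True)
class NodeTelemetry:
    """Node-level report (hypothetical schema): one utilization value per GPU."""
    node: str
    zone: str
    gpu_utilization_pct: Tuple[float, ...]


def utilization_by_zone(reports):
    """Average GPU utilization per compute zone, weighted by GPU count."""
    per_zone = defaultdict(list)
    for report in reports:
        per_zone[report.zone].extend(report.gpu_utilization_pct)
    return {zone: mean(values) for zone, values in per_zone.items() if values}


def global_utilization(reports):
    """Average GPU utilization across the whole fleet, or None if empty."""
    values = [v for report in reports for v in report.gpu_utilization_pct]
    return mean(values) if values else None


if __name__ == "__main__":
    fleet = [
        NodeTelemetry("n1", "us-west", (80.0, 100.0)),
        NodeTelemetry("n2", "us-west", (60.0,)),
        NodeTelemetry("n3", "eu-central", (50.0, 30.0)),
    ]
    print(utilization_by_zone(fleet))
    print(global_utilization(fleet))
```

Averaging over individual GPUs rather than over nodes keeps a node with many GPUs from being underweighted; a per-node average would be an equally reasonable choice if nodes are the unit of interest.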

The client tooling agent is also slated to be open sourced, providing transparency and auditability. It will offer a working example of how customers can incorporate NVIDIA tools into their own solutions for monitoring GPU infrastructure, whether for critical compute clusters or entire fleets.
The software provides insight into a company’s GPU inventory but cannot modify GPU configurations or underlying operations. It provides read-only telemetry data that is customer managed and customizable.
The service will also enable customers to generate reports that detail GPU fleet information.
As AI applications grow in number and complexity, modern AI infrastructure management is evolving to keep pace. Making sure that AI data centers are running at peak health is essential as AI revolutionizes every industry and application. This software service is here to help.
Register for NVIDIA GTC, taking place March 16-19 in San Jose, California, to learn more.
See notice regarding software product information.