From the course: NVIDIA Certified Associate AI Infrastructure and Operations (NCA-AIIO) Cert Prep

Data Center GPU Manager (DCGM)

The next utility you can use for monitoring your GPUs is DCGM, which stands for Data Center GPU Manager. It is not available by default, so you may have to install it, and I'll include a link on the DCGM installation process. Let's talk a little more about it. Its primary purpose is to provide enterprise-scale GPU health monitoring and diagnostics. It works with multi-node GPU clusters, allows continuous monitoring, and has alerting features as well. What does it monitor? It monitors GPU health metrics, utilization patterns, power and thermal data, memory bandwidth, PCIe throughput, and error rates. I've also included a link to an article that gives you a basic idea of how to get started with Data Center GPU Manager. First, you need to install DCGM. That is where you provide the download location. Once installed, it runs as a service. So you would then install the package using this particular command. So that…
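To make the monitoring part concrete, here is a minimal Python sketch that samples a few of the metrics mentioned above by calling the `dcgmi dmon` command-line tool. It is not from the course. It assumes DCGM is already installed and its host engine service is running, that `dcgmi dmon` accepts `-e` (field IDs), `-c` (sample count), and `-d` (delay in milliseconds), and that the numeric field IDs in the table match your DCGM version. The helper names `build_dmon_command` and `sample_metrics` are made up for illustration. Check `dcgmi dmon --help` on your system before relying on any of this.

```python
import shutil
import subprocess

# Assumed DCGM field IDs (verify against the field list for your DCGM version).
FIELD_IDS = {
    "gpu_temp_c": 150,     # GPU temperature
    "power_usage_w": 155,  # power draw
    "gpu_util_pct": 203,   # GPU utilization
    "fb_used_mib": 252,    # framebuffer memory in use
}


def build_dmon_command(metrics, samples=1, delay_ms=1000):
    """Build a `dcgmi dmon` argument list for the requested metric names."""
    ids = ",".join(str(FIELD_IDS[name]) for name in metrics)
    return ["dcgmi", "dmon", "-e", ids, "-c", str(samples), "-d", str(delay_ms)]


def sample_metrics(metrics):
    """Run dcgmi once and return its raw text output, or None if unavailable."""
    if shutil.which("dcgmi") is None:
        return None  # DCGM is not installed or not on PATH
    try:
        result = subprocess.run(
            build_dmon_command(metrics),
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    # A non-zero exit code usually means the host engine is not reachable.
    return result.stdout if result.returncode == 0 else None


if __name__ == "__main__":
    output = sample_metrics(["gpu_temp_c", "power_usage_w", "gpu_util_pct"])
    print(output if output is not None else "dcgmi unavailable or host engine not running")
```

The sketch returns the raw text rather than parsing it, because the column layout of `dcgmi dmon` can differ between versions. For continuous monitoring you would raise `samples` or run the call on a schedule.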
