NVIDIA Corp.                                         Robert Alexander
http://www.nvidia.com/                                   NVIDIA Corp.
                                                          Peter Phaal
                                                          InMon Corp.
                                                          August 2012

                      sFlow NVML GPU Structures

Copyright Notice

   Copyright (C) NVIDIA Corp. (2012). All Rights Reserved.

Abstract

   This memo describes sFlow version 5 structures used to report
   NVIDIA GPU related data.

Table of Contents

   1. Overview
   2. sFlow Datagram Extension
   3. References
   4. Authors' Addresses

1. Overview

   This document describes additional structures that allow an sFlow
   agent to export information from NVIDIA GPUs via the NVIDIA
   Management Library (NVML) [1].

   sFlow version 5 is an extensible protocol that allows the addition
   of new data without impacting existing collectors. This document
   does not change the sFlow version 5 protocol [2]; it simply
   defines additional, optional data structures through which NVIDIA
   GPUs can report monitoring metrics.

2. sFlow Datagram Extension

   Graphics Processing Units (GPUs) are a type of computer hardware
   commonly used to render graphics or to accelerate High Performance
   Computing (HPC) jobs. Defining standard sFlow structures
   simplifies management of GPU-enabled clusters by providing metrics
   that describe GPU performance, status and health. The sFlow Host
   Structures [3] specification defines performance metrics for
   hosts. The nvidia_gpu structure extends the set of host metrics to
   include GPU performance.
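   The structure below is carried as standard sFlow opaque counter
   data (enterprise = 5703, format = 1) and encoded using XDR:
   big-endian 32-bit values for the unsigned ints and 64-bit values
   for the two unsigned hypers. As a non-normative illustration, the
   following Python sketch encodes the record body and prepends the
   sFlow counter record header; the sample field values are made up,
   and the helper name is an assumption, not part of this
   specification.

   ```python
   import struct

   def encode_nvidia_gpu(device_count, processes, gpu_time, mem_time,
                         mem_total, mem_free, ecc_errors, energy,
                         temperature, fan_speed):
       """Encode the nvidia_gpu counter record body as XDR (big-endian)."""
       return struct.pack(
           ">IIII"   # device_count, processes, gpu_time, mem_time
           "QQ"      # mem_total, mem_free (unsigned hyper, 64-bit)
           "IIII",   # ecc_errors, energy, temperature, fan_speed
           device_count, processes, gpu_time, mem_time,
           mem_total, mem_free, ecc_errors, energy,
           temperature, fan_speed)

   # Per sFlow version 5, a counter record starts with a data_format
   # word combining the enterprise number (upper 20 bits) and the
   # format (lower 12 bits), followed by the opaque body length.
   body = encode_nvidia_gpu(2, 4, 1200, 800,
                            12 * 2**30, 3 * 2**30, 0, 150000, 71, 40)
   record = struct.pack(">II", (5703 << 12) | 1, len(body)) + body

   print(len(body))  # 8 * 4 + 2 * 8 = 48 bytes
   ```

   The fixed 48-byte body makes the record cheap to decode; a
   collector that does not recognize enterprise 5703 simply skips it
   using the length field.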
   /* NVIDIA GPU statistics */
   /* opaque = counter_data; enterprise = 5703, format = 1 */
   struct nvidia_gpu {
      unsigned int device_count; /* see nvmlDeviceGetCount */
      unsigned int processes;    /* see
                                    nvmlDeviceGetComputeRunningProcesses */
      unsigned int gpu_time;     /* total milliseconds during which one
                                    or more kernels was executing on the
                                    GPU, summed across all devices */
      unsigned int mem_time;     /* total milliseconds during which
                                    global device memory was being read
                                    or written, summed across all
                                    devices */
      unsigned hyper mem_total;  /* sum of framebuffer memory across
                                    devices;
                                    see nvmlDeviceGetMemoryInfo */
      unsigned hyper mem_free;   /* sum of free framebuffer memory
                                    across devices;
                                    see nvmlDeviceGetMemoryInfo */
      unsigned int ecc_errors;   /* sum of volatile ECC errors across
                                    devices;
                                    see nvmlDeviceGetTotalEccErrors */
      unsigned int energy;       /* sum of millijoules across devices;
                                    see nvmlDeviceGetPowerUsage */
      unsigned int temperature;  /* maximum temperature in degrees
                                    Celsius across devices;
                                    see nvmlDeviceGetTemperature */
      unsigned int fan_speed;    /* maximum fan speed in percent across
                                    devices;
                                    see nvmlDeviceGetFanSpeed */
   }

3. References

   [1] "NVIDIA Management Library",
       http://developer.nvidia.com/cuda/nvidia-management-library-nvml

   [2] Phaal, P. and Lavine, M., "sFlow Version 5",
       http://www.sflow.org/sflow_version_5.txt, July 2006

   [3] Phaal, P. and Jordan, R., "sFlow Host Structures",
       http://www.sflow.org/sflow_host.txt, July 2010

4. Authors' Addresses

   Robert Alexander
   NVIDIA Corp.
   2701 San Tomas Expressway
   Santa Clara, CA 95050

   EMail: ralexander@nvidia.com

   Peter Phaal
   InMon Corp.
   580 California Street, 5th Floor
   San Francisco, CA 94104

   Phone: (415) 283-3263
   EMail: peter.phaal@inmon.com