Using GPUs

There are two things to note when using GPUs on this system:

  • The partition (and QOS) must be either zen3_0512_a100x2 or zen2_0256_a40x2.
  • The --gres=gpu:2 flag, which books the two cards available on a single node of either of these partitions.

See the example job script below. Python jobs in particular need the settings above. (Note that a Python job will not run in parallel or use the GPU unless this is explicitly coded in the script.)

#!/bin/bash
#
#  usage: sbatch ./gpu_test.script
#
#SBATCH -J A40
#SBATCH -N 1                           #use -N only if you use both GPUs on the node, otherwise leave this line out
#SBATCH --partition zen2_0256_a40x2    #for A100 cards, use zen3_0512_a100x2 for both --partition and --qos
#SBATCH --qos zen2_0256_a40x2
#SBATCH --gres=gpu:2                   #or --gres=gpu:1 if you only want to use half a node

module purge
module load cuda/9.1.85

# launch your GPU application here, e.g. ./your_gpu_program
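Inside the job, Slurm exposes the booked cards through the CUDA_VISIBLE_DEVICES environment variable, so a quick way to confirm the --gres request took effect is to read it. A minimal sketch (the function name is our own, not part of any cluster tooling):

```python
import os

def visible_gpus():
    """Return the GPU indices Slurm exposed to this job (empty list if none).

    Slurm sets CUDA_VISIBLE_DEVICES according to the --gres request,
    so a job submitted with --gres=gpu:2 typically sees "0,1".
    """
    value = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [int(idx) for idx in value.split(",") if idx.strip()]

if __name__ == "__main__":
    gpus = visible_gpus()
    print(f"Job sees {len(gpus)} GPU(s): {gpus}")
```

Running this as the job payload with --gres=gpu:1 versus --gres=gpu:2 should report one and two devices respectively.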

The command nvidia-smi shows the current state of the GPU cards on the node. Example output for an A100 node with no jobs running is shown below. Under load, the GPU Memory Usage and GPU-Util columns increase accordingly, which makes nvidia-smi an effective tool to judge whether an application or code is actually utilizing the GPU.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          Off | 00000000:01:00.0 Off |                  Off |
| N/A   42C    P0              35W / 250W |      4MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          Off | 00000000:81:00.0 Off |                  Off |
| N/A   40C    P0              38W / 250W |      4MiB / 40960MiB |      4%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
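For scripted monitoring, the same information is available in machine-readable form via nvidia-smi's query mode, e.g. `nvidia-smi --query-gpu=index,memory.used,utilization.gpu --format=csv,noheader,nounits`. A small sketch parsing that output (the helper name and the "busy" heuristic are our own; the sample line reproduces the idle state shown above):

```python
def parse_gpu_query(csv_text):
    """Parse nvidia-smi --query-gpu=index,memory.used,utilization.gpu
    --format=csv,noheader,nounits output into a list of dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, mem_used, util = (field.strip() for field in line.split(","))
        gpus.append({"index": int(index),
                     "memory_used_mib": int(mem_used),
                     "utilization_pct": int(util)})
    return gpus

# Idle A100 node, matching the nvidia-smi table above:
sample = "0, 4, 0\n1, 4, 4"
for gpu in parse_gpu_query(sample):
    busy = gpu["utilization_pct"] > 0 or gpu["memory_used_mib"] > 100
    print(f"GPU {gpu['index']}: {'in use' if busy else 'idle'}")
```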

The following sections list the properties of the two graphics card models currently available.

A100 graphics card specifications

deviceQuery is a utility provided by the CUDA samples on GitHub that reports hardware properties. See device-query for the steps to run it. The result is as follows:

./deviceQuery Starting... 

CUDA Device Query (Runtime API) version (CUDART static linking) 

Detected 2 CUDA Capable device(s) 

Device 0: "NVIDIA A100-PCIE-40GB" 
CUDA Driver Version / Runtime Version          12.3 / 12.3 
CUDA Capability Major/Minor version number:    8.0 
Total amount of global memory:                 40338 MBytes (42297786368 bytes) 
(108) Multiprocessors, (064) CUDA Cores/MP:    6912 CUDA Cores 
GPU Max Clock rate:                            1410 MHz (1.41 GHz) 
Memory Clock rate:                             1215 MHz
Memory Bus Width:                              5120-bit 
L2 Cache Size:                                 41943040 bytes 
Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384) 
Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers 
Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers 
Total amount of constant memory:               65536 bytes 
Total amount of shared memory per block:       49152 bytes 
Total shared memory per multiprocessor:        167936 bytes 
Total number of registers available per block: 65536 
Warp size:                                     32 
Maximum number of threads per multiprocessor:  2048 
Maximum number of threads per block:           1024 
Max dimension size of a thread block (x,y,z): (1024, 1024, 64) 
Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535) 
Maximum memory pitch:                          2147483647 bytes 
Texture alignment:                             512 bytes 
Concurrent copy and kernel execution:          Yes with 3 copy engine(s) 
Run time limit on kernels:                     No 
Integrated GPU sharing Host Memory:            No 
Support host page-locked memory mapping:       Yes 
Alignment requirement for Surfaces:            Yes 
Device has ECC support:                        Disabled 
Device supports Unified Addressing (UVA):      Yes 
Device supports Managed Memory:                Yes 
Device supports Compute Preemption:            Yes 
Supports Cooperative Kernel Launch:            Yes 
Supports MultiDevice Co-op Kernel Launch:      Yes 
Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0 
Compute Mode: 
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) > 

Device 1: "NVIDIA A100-PCIE-40GB" 
CUDA Driver Version / Runtime Version          12.3 / 12.3 
CUDA Capability Major/Minor version number:    8.0 
Total amount of global memory:                 40338 MBytes (42297786368 bytes) 
(108) Multiprocessors, (064) CUDA Cores/MP:    6912 CUDA Cores 
GPU Max Clock rate:                            1410 MHz (1.41 GHz) 
Memory Clock rate:                             1215 MHz
Memory Bus Width:                              5120-bit 
L2 Cache Size:                                 41943040 bytes 
Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384) 
Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers 
Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers 
Total amount of constant memory:               65536 bytes 
Total amount of shared memory per block:       49152 bytes 
Total shared memory per multiprocessor:        167936 bytes 
Total number of registers available per block: 65536 
Warp size:                                     32 
Maximum number of threads per multiprocessor:  2048 
Maximum number of threads per block:           1024 
Max dimension size of a thread block (x,y,z): (1024, 1024, 64) 
Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535) 
Maximum memory pitch:                          2147483647 bytes 
Texture alignment:                             512 bytes 
Concurrent copy and kernel execution:          Yes with 3 copy engine(s) 
Run time limit on kernels:                     No 
Integrated GPU sharing Host Memory:            No 
Support host page-locked memory mapping:       Yes 
Alignment requirement for Surfaces:            Yes 
Device has ECC support:                        Disabled 
Device supports Unified Addressing (UVA):      Yes 
Device supports Managed Memory:                Yes 
Device supports Compute Preemption:            Yes 
Supports Cooperative Kernel Launch:            Yes 
Supports MultiDevice Co-op Kernel Launch:      Yes 
Device PCI Domain ID / Bus ID / location ID:   0 / 129 / 0 
Compute Mode: 
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) > 
> Peer access from NVIDIA A100-PCIE-40GB (GPU0) -> NVIDIA A100-PCIE-40GB (GPU1) : Yes 
> Peer access from NVIDIA A100-PCIE-40GB (GPU1) -> NVIDIA A100-PCIE-40GB (GPU0) : Yes 

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.3, CUDA Runtime Version = 12.3, NumDevs = 2 
Result = PASS 

A40 graphics card specifications

./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "NVIDIA A40"
CUDA Driver Version / Runtime Version          12.3 / 12.3
CUDA Capability Major/Minor version number:    8.6
Total amount of global memory:                 45413 MBytes (47619112960 bytes)
(084) Multiprocessors, (128) CUDA Cores/MP:    10752 CUDA Cores
GPU Max Clock rate:                            1740 MHz (1.74 GHz)
Memory Clock rate:                             7251 MHz
Memory Bus Width:                              384-bit
L2 Cache Size:                                 6291456 bytes
Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
Total amount of constant memory:               65536 bytes
Total amount of shared memory per block:       49152 bytes
Total shared memory per multiprocessor:        102400 bytes
Total number of registers available per block: 65536
Warp size:                                     32
Maximum number of threads per multiprocessor:  1536
Maximum number of threads per block:           1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch:                          2147483647 bytes
Texture alignment:                             512 bytes
Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
Run time limit on kernels:                     No
Integrated GPU sharing Host Memory:            No
Support host page-locked memory mapping:       Yes
Alignment requirement for Surfaces:            Yes
Device has ECC support:                        Enabled
Device supports Unified Addressing (UVA):      Yes
Device supports Managed Memory:                Yes
Device supports Compute Preemption:            Yes
Supports Cooperative Kernel Launch:            Yes
Supports MultiDevice Co-op Kernel Launch:      Yes
Device PCI Domain ID / Bus ID / location ID:   0 / 65 / 0
Compute Mode:
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "NVIDIA A40"
CUDA Driver Version / Runtime Version          12.3 / 12.3
CUDA Capability Major/Minor version number:    8.6
Total amount of global memory:                 45413 MBytes (47619112960 bytes)
(084) Multiprocessors, (128) CUDA Cores/MP:    10752 CUDA Cores
GPU Max Clock rate:                            1740 MHz (1.74 GHz)
Memory Clock rate:                             7251 MHz
Memory Bus Width:                              384-bit
L2 Cache Size:                                 6291456 bytes
Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
Total amount of constant memory:               65536 bytes
Total amount of shared memory per block:       49152 bytes
Total shared memory per multiprocessor:        102400 bytes
Total number of registers available per block: 65536
Warp size:                                     32
Maximum number of threads per multiprocessor:  1536
Maximum number of threads per block:           1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch:                          2147483647 bytes
Texture alignment:                             512 bytes
Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
Run time limit on kernels:                     No
Integrated GPU sharing Host Memory:            No
Support host page-locked memory mapping:       Yes
Alignment requirement for Surfaces:            Yes
Device has ECC support:                        Enabled
Device supports Unified Addressing (UVA):      Yes
Device supports Managed Memory:                Yes
Device supports Compute Preemption:            Yes
Supports Cooperative Kernel Launch:            Yes
Supports MultiDevice Co-op Kernel Launch:      Yes
Device PCI Domain ID / Bus ID / location ID:   0 / 161 / 0
Compute Mode:
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from NVIDIA A40 (GPU0) -> NVIDIA A40 (GPU1) : Yes
> Peer access from NVIDIA A40 (GPU1) -> NVIDIA A40 (GPU0) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.3, CUDA Runtime Version = 12.3, NumDevs = 2
Result = PASS
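Two of the headline figures above can be cross-checked from the raw deviceQuery fields: the total CUDA core count is just multiprocessors times cores per multiprocessor, and peak memory bandwidth follows from bus width and memory clock (assuming, as deviceQuery's reported clock suggests, a double-data-rate factor of 2). A small sketch with the numbers from both cards:

```python
def cuda_cores(multiprocessors, cores_per_mp):
    """Total CUDA cores, as deviceQuery computes it."""
    return multiprocessors * cores_per_mp

def peak_bandwidth_gb_s(bus_width_bits, mem_clock_mhz):
    """Peak memory bandwidth in GB/s: bus width in bytes times the
    double-data-rate transfer rate (2 x the reported memory clock)."""
    return bus_width_bits / 8 * mem_clock_mhz * 2 * 1e6 / 1e9

# Figures from the deviceQuery output above:
assert cuda_cores(108, 64) == 6912     # A100
assert cuda_cores(84, 128) == 10752    # A40
print(f"A100 peak bandwidth: {peak_bandwidth_gb_s(5120, 1215):.0f} GB/s")
print(f"A40  peak bandwidth: {peak_bandwidth_gb_s(384, 7251):.0f} GB/s")
```

The computed values (about 1555 GB/s for the A100 PCIe and 696 GB/s for the A40) agree with NVIDIA's published specifications for these cards.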

Info

The device query program, which determines the hardware details more accurately, can be built and run as follows. (The steps below assume the user is on a node of the cuda-zen partition.)
1. git clone https://github.com/NVIDIA/cuda-samples.git
2. cd cuda-samples/Samples/1_Utilities/deviceQuery
3. module load cuda/<latest_version>
4. export CUDA_PATH=$CUDA_HOME
5. make
6. ./deviceQuery