Using GPUs

VSC-5 is equipped with two types of GPU nodes:

  • 2x NVIDIA A40 with 48 GB RAM in a Zen2 node with 16 cores and 256 GB RAM
  • 2x NVIDIA A100 with 40 GB RAM in a Zen3 node with 128 cores and 512 GB RAM

Note

There are no GPUs on VSC-4.

To use GPUs, the correct partition, QoS, and GPU settings are needed in the job submission.
Partition and QoS must be identical and are:
zen2_0256_a40x2 for the A40 and zen3_0512_a100x2 for the A100.
In both queues, the required number of GPUs per node needs to be specified with the --gres parameter:
to get one GPU, use --gres=gpu:1; to get both, use --gres=gpu:2.

Warning

You can only get more than one node if you request both GPUs per node (--gres=gpu:2).

See an example below.

#!/bin/bash
#
#  usage: sbatch ./gpu_test.script
#
#SBATCH -J A100
#SBATCH -N 1                           # use -N only if you use both GPUs on the node, otherwise leave this line out
#SBATCH --partition zen3_0512_a100x2   # for A40 cards, use --partition zen2_0256_a40x2
#SBATCH --qos zen3_0512_a100x2
#SBATCH --gres=gpu:2                   # or --gres=gpu:1 if you only want to use half a node

module purge
module load cuda/9.1.85

./my_gpu_program                       # replace with your GPU application

The command nvidia-smi shows the state of the GPUs on a node. An example is shown below for the A100 card, with no jobs running. When the GPUs are in use, the GPU Memory Usage and GPU-Util metrics increase accordingly. This makes nvidia-smi an effective tool to judge whether an application is actually utilizing the GPU.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          Off | 00000000:01:00.0 Off |                  Off |
| N/A   42C    P0              35W / 250W |      4MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          Off | 00000000:81:00.0 Off |                  Off |
| N/A   40C    P0              38W / 250W |      4MiB / 40960MiB |      4%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
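For scripted monitoring, nvidia-smi can also emit machine-readable CSV via its --query-gpu and --format options. The Python sketch below parses that CSV into per-GPU utilization figures; the sample string stands in for a live call (on a GPU node you would capture the command's output, e.g. with subprocess), and the 10% "busy" threshold is an arbitrary illustration:

```python
# Parse the CSV output of:
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader,nounits
# The sample output is hard-coded here; on a GPU node you would capture it
# from the command itself instead.
sample = """0, 0, 4
1, 4, 4"""

def parse_gpu_stats(csv_text):
    """Return a list of (gpu_index, util_percent, mem_used_mib) tuples."""
    stats = []
    for line in csv_text.strip().splitlines():
        idx, util, mem = (int(field.strip()) for field in line.split(","))
        stats.append((idx, util, mem))
    return stats

for idx, util, mem in parse_gpu_stats(sample):
    busy = "busy" if util > 10 else "idle"   # arbitrary threshold for illustration
    print(f"GPU {idx}: {util}% utilization, {mem} MiB used -> {busy}")
```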

The following lists the properties of the currently available graphics cards:

A100 graphics card specifications

deviceQuery is a utility provided by the CUDA samples on GitHub that reports the hardware properties of the installed GPUs.

./deviceQuery Starting... 

CUDA Device Query (Runtime API) version (CUDART static linking) 

Detected 2 CUDA Capable device(s) 

Device 0: "NVIDIA A100-PCIE-40GB" 
CUDA Driver Version / Runtime Version          12.3 / 12.3 
CUDA Capability Major/Minor version number:    8.0 
Total amount of global memory:                 40338 MBytes (42297786368 bytes) 
(108) Multiprocessors, (064) CUDA Cores/MP:    6912 CUDA Cores 
GPU Max Clock rate:                            1410 MHz (1.41 GHz) 
Memory Clock rate:                             1215 Mhz 
Memory Bus Width:                              5120-bit 
L2 Cache Size:                                 41943040 bytes 
Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384) 
Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers 
Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers 
Total amount of constant memory:               65536 bytes 
Total amount of shared memory per block:       49152 bytes 
Total shared memory per multiprocessor:        167936 bytes 
Total number of registers available per block: 65536 
Warp size:                                     32 
Maximum number of threads per multiprocessor:  2048 
Maximum number of threads per block:           1024 
Max dimension size of a thread block (x,y,z): (1024, 1024, 64) 
Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535) 
Maximum memory pitch:                          2147483647 bytes 
Texture alignment:                             512 bytes 
Concurrent copy and kernel execution:          Yes with 3 copy engine(s) 
Run time limit on kernels:                     No 
Integrated GPU sharing Host Memory:            No 
Support host page-locked memory mapping:       Yes 
Alignment requirement for Surfaces:            Yes 
Device has ECC support:                        Disabled 
Device supports Unified Addressing (UVA):      Yes 
Device supports Managed Memory:                Yes 
Device supports Compute Preemption:            Yes 
Supports Cooperative Kernel Launch:            Yes 
Supports MultiDevice Co-op Kernel Launch:      Yes 
Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0 
Compute Mode: 
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) > 

Device 1: "NVIDIA A100-PCIE-40GB" 
CUDA Driver Version / Runtime Version          12.3 / 12.3 
CUDA Capability Major/Minor version number:    8.0 
Total amount of global memory:                 40338 MBytes (42297786368 bytes) 
(108) Multiprocessors, (064) CUDA Cores/MP:    6912 CUDA Cores 
GPU Max Clock rate:                            1410 MHz (1.41 GHz) 
Memory Clock rate:                             1215 Mhz 
Memory Bus Width:                              5120-bit 
L2 Cache Size:                                 41943040 bytes 
Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384) 
Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers 
Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers 
Total amount of constant memory:               65536 bytes 
Total amount of shared memory per block:       49152 bytes 
Total shared memory per multiprocessor:        167936 bytes 
Total number of registers available per block: 65536 
Warp size:                                     32 
Maximum number of threads per multiprocessor:  2048 
Maximum number of threads per block:           1024 
Max dimension size of a thread block (x,y,z): (1024, 1024, 64) 
Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535) 
Maximum memory pitch:                          2147483647 bytes 
Texture alignment:                             512 bytes 
Concurrent copy and kernel execution:          Yes with 3 copy engine(s) 
Run time limit on kernels:                     No 
Integrated GPU sharing Host Memory:            No 
Support host page-locked memory mapping:       Yes 
Alignment requirement for Surfaces:            Yes 
Device has ECC support:                        Disabled 
Device supports Unified Addressing (UVA):      Yes 
Device supports Managed Memory:                Yes 
Device supports Compute Preemption:            Yes 
Supports Cooperative Kernel Launch:            Yes 
Supports MultiDevice Co-op Kernel Launch:      Yes 
Device PCI Domain ID / Bus ID / location ID:   0 / 129 / 0 
Compute Mode: 
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) > 
> Peer access from NVIDIA A100-PCIE-40GB (GPU0) -> NVIDIA A100-PCIE-40GB (GPU1) : Yes 
> Peer access from NVIDIA A100-PCIE-40GB (GPU1) -> NVIDIA A100-PCIE-40GB (GPU0) : Yes 

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.3, CUDA Runtime Version = 12.3, NumDevs = 2 
Result = PASS 
A40 graphics card specifications

deviceQuery is a utility provided by the CUDA samples on GitHub that reports the hardware properties of the installed GPUs.

./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "NVIDIA A40"
CUDA Driver Version / Runtime Version          12.3 / 12.3
CUDA Capability Major/Minor version number:    8.6
Total amount of global memory:                 45413 MBytes (47619112960 bytes)
(084) Multiprocessors, (128) CUDA Cores/MP:    10752 CUDA Cores
GPU Max Clock rate:                            1740 MHz (1.74 GHz)
Memory Clock rate:                             7251 Mhz
Memory Bus Width:                              384-bit
L2 Cache Size:                                 6291456 bytes
Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
Total amount of constant memory:               65536 bytes
Total amount of shared memory per block:       49152 bytes
Total shared memory per multiprocessor:        102400 bytes
Total number of registers available per block: 65536
Warp size:                                     32
Maximum number of threads per multiprocessor:  1536
Maximum number of threads per block:           1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch:                          2147483647 bytes
Texture alignment:                             512 bytes
Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
Run time limit on kernels:                     No
Integrated GPU sharing Host Memory:            No
Support host page-locked memory mapping:       Yes
Alignment requirement for Surfaces:            Yes
Device has ECC support:                        Enabled
Device supports Unified Addressing (UVA):      Yes
Device supports Managed Memory:                Yes
Device supports Compute Preemption:            Yes
Supports Cooperative Kernel Launch:            Yes
Supports MultiDevice Co-op Kernel Launch:      Yes
Device PCI Domain ID / Bus ID / location ID:   0 / 65 / 0
Compute Mode:
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "NVIDIA A40"
CUDA Driver Version / Runtime Version          12.3 / 12.3
CUDA Capability Major/Minor version number:    8.6
Total amount of global memory:                 45413 MBytes (47619112960 bytes)
(084) Multiprocessors, (128) CUDA Cores/MP:    10752 CUDA Cores
GPU Max Clock rate:                            1740 MHz (1.74 GHz)
Memory Clock rate:                             7251 Mhz
Memory Bus Width:                              384-bit
L2 Cache Size:                                 6291456 bytes
Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
Total amount of constant memory:               65536 bytes
Total amount of shared memory per block:       49152 bytes
Total shared memory per multiprocessor:        102400 bytes
Total number of registers available per block: 65536
Warp size:                                     32
Maximum number of threads per multiprocessor:  1536
Maximum number of threads per block:           1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch:                          2147483647 bytes
Texture alignment:                             512 bytes
Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
Run time limit on kernels:                     No
Integrated GPU sharing Host Memory:            No
Support host page-locked memory mapping:       Yes
Alignment requirement for Surfaces:            Yes
Device has ECC support:                        Enabled
Device supports Unified Addressing (UVA):      Yes
Device supports Managed Memory:                Yes
Device supports Compute Preemption:            Yes
Supports Cooperative Kernel Launch:            Yes
Supports MultiDevice Co-op Kernel Launch:      Yes
Device PCI Domain ID / Bus ID / location ID:   0 / 161 / 0
Compute Mode:
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from NVIDIA A40 (GPU0) -> NVIDIA A40 (GPU1) : Yes
> Peer access from NVIDIA A40 (GPU1) -> NVIDIA A40 (GPU0) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.3, CUDA Runtime Version = 12.3, NumDevs = 2
Result = PASS
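The headline numbers in the two listings above are internally consistent: the CUDA core count is multiprocessors × cores per multiprocessor, and the theoretical memory bandwidth follows from the reported memory clock (two transfers per clock, since both HBM2 and GDDR6 are double data rate) times the bus width in bytes. A quick arithmetic check:

```python
def cuda_cores(sms, cores_per_sm):
    # total CUDA cores = multiprocessors x cores per multiprocessor
    return sms * cores_per_sm

def mem_bandwidth_gb_s(clock_mhz, bus_width_bits):
    # double data rate: two transfers per clock; bus width in bits -> bytes
    return clock_mhz * 1e6 * 2 * (bus_width_bits / 8) / 1e9

# A100: 108 SMs x 64 cores, 1215 MHz memory clock, 5120-bit bus
print(cuda_cores(108, 64))                    # 6912, as reported by deviceQuery
print(round(mem_bandwidth_gb_s(1215, 5120)))  # 1555 GB/s (HBM2)

# A40: 84 SMs x 128 cores, 7251 MHz memory clock, 384-bit bus
print(cuda_cores(84, 128))                    # 10752, as reported by deviceQuery
print(round(mem_bandwidth_gb_s(7251, 384)))   # 696 GB/s (GDDR6)
```

Both bandwidth figures match the published peak specifications for these cards.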

Info

The deviceQuery program can be built and run as follows to determine the hardware details:
(The steps below assume the user is on a node on the cuda-zen partition.)

1. git clone https://github.com/NVIDIA/cuda-samples.git
2. cd cuda-samples/Samples/1_Utilities/deviceQuery
3. module load cuda/<latest_version>
4. export CUDA_PATH=$CUDA_HOME
5. make
6. ./deviceQuery