Using GPUs¶
VSC-5 is eqipped with 2 types of GPUs nodes:
- 2x Nvidia A40 with 48GB RAM in a Zen2 node with 16 cores and 256GB RAM
- 2x Nvidia A100 with 40GB RAM in a Zen3 node with 128 cores and 512GB RAM
Note
There are no GPUs on VSC-4
To use GPUs, the correct partition, QoS and GPU settings are needed in he job submission.
Partition and QoS need to be the same and are:
zen2_0256_a4-x2
for the A40 and zen3_0512_a100x2
for the A100
In both queues, the required amount of GPUs per node needs to be specified with the --gres parameter:
To get one GPU, use --gres=gpu:1
and to beth both, use --gres=gpu:2
Warning
You can only get more than one node if you request both GPUs
See an example below.
#!/bin/bash
#
# usage: sbatch ./gpu_test.scrpt
#
#SBATCH -J A100
#SBATCH -N 1 #use -N only if you use both GPUs on the nodes, otherwise leave this line out
#SBATCH --partition zen3_0512_a100x2 #For A40 cards, use --partition zen2_0256_a40x2
#SBATCH --qos zen3_0512_a100x2
#SBATCH --gres=gpu:2 #or --gres=gpu:1 if you only want to use half a node
module purge
module load cuda/9.1.85
The command nvidia-smi
is used to look at the state of the GPU cards on the system. An example is shown below for the A100 card, with no jobs running. When under use, the GPU Memory Usage as well as GPU-Util metrics will increase accordingly. This is also an effective tool to judge if the application/code is utilizing the GPU or not.
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-PCIE-40GB Off | 00000000:01:00.0 Off | Off |
| N/A 42C P0 35W / 250W | 4MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-PCIE-40GB Off | 00000000:81:00.0 Off | Off |
| N/A 40C P0 38W / 250W | 4MiB / 40960MiB | 4% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
The following describes the properties of the currently available graphics cards:
A100 Graphic card specifications
The device query is a script given by CUDA samples on github to acquire hardware properties.
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 2 CUDA Capable device(s)
Device 0: "NVIDIA A100-PCIE-40GB"
CUDA Driver Version / Runtime Version 12.3 / 12.3
CUDA Capability Major/Minor version number: 8.0
Total amount of global memory: 40338 MBytes (42297786368 bytes)
(108) Multiprocessors, (064) CUDA Cores/MP: 6912 CUDA Cores
GPU Max Clock rate: 1410 MHz (1.41 GHz)
Memory Clock rate: 1215 Mhz
Memory Bus Width: 5120-bit
L2 Cache Size: 41943040 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 167936 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "NVIDIA A100-PCIE-40GB"
CUDA Driver Version / Runtime Version 12.3 / 12.3
CUDA Capability Major/Minor version number: 8.0
Total amount of global memory: 40338 MBytes (42297786368 bytes)
(108) Multiprocessors, (064) CUDA Cores/MP: 6912 CUDA Cores
GPU Max Clock rate: 1410 MHz (1.41 GHz)
Memory Clock rate: 1215 Mhz
Memory Bus Width: 5120-bit
L2 Cache Size: 41943040 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 167936 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 129 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from NVIDIA A100-PCIE-40GB (GPU0) -> NVIDIA A100-PCIE-40GB (GPU1) : Yes
> Peer access from NVIDIA A100-PCIE-40GB (GPU1) -> NVIDIA A100-PCIE-40GB (GPU0) : Yes
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.3, CUDA Runtime Version = 12.3, NumDevs = 2
Result = PASS
A40 Graphic card specifications
The device query is a script given by CUDA samples on github to acquire hardware properties.
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 2 CUDA Capable device(s)
Device 0: "NVIDIA A40"
CUDA Driver Version / Runtime Version 12.3 / 12.3
CUDA Capability Major/Minor version number: 8.6
Total amount of global memory: 45413 MBytes (47619112960 bytes)
(084) Multiprocessors, (128) CUDA Cores/MP: 10752 CUDA Cores
GPU Max Clock rate: 1740 MHz (1.74 GHz)
Memory Clock rate: 7251 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 6291456 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 102400 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 65 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "NVIDIA A40"
CUDA Driver Version / Runtime Version 12.3 / 12.3
CUDA Capability Major/Minor version number: 8.6
Total amount of global memory: 45413 MBytes (47619112960 bytes)
(084) Multiprocessors, (128) CUDA Cores/MP: 10752 CUDA Cores
GPU Max Clock rate: 1740 MHz (1.74 GHz)
Memory Clock rate: 7251 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 6291456 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 102400 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 161 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from NVIDIA A40 (GPU0) -> NVIDIA A40 (GPU1) : Yes
> Peer access from NVIDIA A40 (GPU1) -> NVIDIA A40 (GPU0) : Yes
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.3, CUDA Runtime Version = 12.3, NumDevs = 2
Result = PASS
Info
The device query program to determine the hardware details more accurately can be run as follows:
(The steps below assume the user is on a node on the cuda-zen partition.)
1. git clone https://github.com/NVIDIA/cuda-samples.git
2. cd Samples/1_Utilities/deviceQuery
3. module load cuda/<latest_version>
4. export CUDA_PATH=$CUDA_HOME
5. make
6. ./deviceQuery