Using GPUs
There are two things to note when using GPUs on this system:
- The partition (and qos) should be either zen3_0512_a100x2 or zen2_0256_a40x2.
- The --gres=gpu:2 flag books both cards available on a single node of these two partitions.
See the example below. Python jobs in particular need both settings. (Note that Python jobs are not inherently parallel and do not use the GPU unless this is explicitly coded in the script.)
#!/bin/bash
#
# usage: sbatch ./gpu_test.scrpt
#
#SBATCH -J A40
#SBATCH -N 1 #use -N only if you use both GPUs on the node, otherwise leave this line out
#SBATCH --partition zen2_0256_a40x2 #For A100 cards, use --partition zen3_0512_a100x2 (and the matching --qos)
#SBATCH --qos zen2_0256_a40x2
#SBATCH --gres=gpu:2 #or --gres=gpu:1 if you only want to use half a node
module purge
module load cuda/9.1.85 #adjust to a version listed by module avail cuda; A100 and A40 cards need CUDA 11.1 or newer
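To confirm from inside the job that the requested cards were actually granted, a few lines like the following could be appended to the job script. This is only a minimal sketch: report_gpus is a hypothetical helper name, and it assumes that Slurm exports CUDA_VISIBLE_DEVICES for jobs submitted with --gres=gpu, which is the usual behaviour but depends on the site configuration.

```shell
#!/bin/bash
# Sketch: report which GPUs this job can see.
# report_gpus is a hypothetical helper, not part of Slurm or CUDA.
report_gpus() {
    # Assumption: Slurm sets CUDA_VISIBLE_DEVICES for --gres=gpu jobs.
    # An empty or unset variable is treated as "no GPU allocated".
    local visible="${CUDA_VISIBLE_DEVICES:-}"
    if [ -z "$visible" ]; then
        echo "no GPUs visible (CUDA_VISIBLE_DEVICES is empty or unset)"
        return 1
    fi
    # Count the comma-separated entries, e.g. "0,1" gives 2.
    local count
    count=$(echo "$visible" | tr ',' '\n' | grep -c .)
    echo "visible GPUs: $visible (count: $count)"
}

# Do not abort the job if nothing is visible; just report it.
report_gpus || true
```

With --gres=gpu:2 the count would be expected to be 2, and with --gres=gpu:1 it would be 1.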
The command nvidia-smi shows the state of the GPU cards on the system. The example output below is for the A100 cards with no jobs running. When the cards are in use, the Memory-Usage and GPU-Util values increase accordingly, which makes nvidia-smi an effective tool for judging whether an application is actually using the GPU.
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          Off | 00000000:01:00.0 Off |                  Off |
| N/A   42C    P0              35W / 250W |      4MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          Off | 00000000:81:00.0 Off |                  Off |
| N/A   40C    P0              38W / 250W |      4MiB / 40960MiB |      4%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
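For scripted checks, the same Memory-Usage and GPU-Util values can be read in CSV form through nvidia-smi's --query-gpu option. The sketch below is illustrative only: flag_idle_gpus is a hypothetical helper, and the thresholds (less than 10 % utilization and less than 100 MiB of memory counts as idle) are assumptions chosen for the example, not values recommended by this system.

```shell
#!/bin/bash
# Sketch: classify GPUs as idle or busy from nvidia-smi's CSV output.
# flag_idle_gpus is a hypothetical helper. It reads lines of the form
#   index, utilization [%], memory used [MiB]
# on stdin, as produced by the nvidia-smi call at the bottom.
flag_idle_gpus() {
    # Thresholds are illustrative assumptions, not site policy.
    awk -F', *' '{
        state = ($2 + 0 < 10 && $3 + 0 < 100) ? "idle" : "busy"
        printf "GPU %s: util=%s%% mem=%sMiB -> %s\n", $1, $2, $3, state
    }'
}

# Query the cards only where nvidia-smi is available (i.e. on a GPU node).
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,utilization.gpu,memory.used \
               --format=csv,noheader,nounits | flag_idle_gpus
fi
```

Running this in a second shell on the node while a job is active gives a quick indication of whether the code is really using the cards.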
The following sections list the properties of the two graphics card types currently available.
A100 graphics card specifications
deviceQuery is a program from the CUDA samples on GitHub that reports hardware properties. See the Info box at the end of this section for the steps to build and run it. The output is as follows:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 2 CUDA Capable device(s)
Device 0: "NVIDIA A100-PCIE-40GB"
CUDA Driver Version / Runtime Version 12.3 / 12.3
CUDA Capability Major/Minor version number: 8.0
Total amount of global memory: 40338 MBytes (42297786368 bytes)
(108) Multiprocessors, (064) CUDA Cores/MP: 6912 CUDA Cores
GPU Max Clock rate: 1410 MHz (1.41 GHz)
Memory Clock rate: 1215 Mhz
Memory Bus Width: 5120-bit
L2 Cache Size: 41943040 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 167936 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "NVIDIA A100-PCIE-40GB"
CUDA Driver Version / Runtime Version 12.3 / 12.3
CUDA Capability Major/Minor version number: 8.0
Total amount of global memory: 40338 MBytes (42297786368 bytes)
(108) Multiprocessors, (064) CUDA Cores/MP: 6912 CUDA Cores
GPU Max Clock rate: 1410 MHz (1.41 GHz)
Memory Clock rate: 1215 Mhz
Memory Bus Width: 5120-bit
L2 Cache Size: 41943040 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 167936 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 129 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from NVIDIA A100-PCIE-40GB (GPU0) -> NVIDIA A100-PCIE-40GB (GPU1) : Yes
> Peer access from NVIDIA A100-PCIE-40GB (GPU1) -> NVIDIA A100-PCIE-40GB (GPU0) : Yes
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.3, CUDA Runtime Version = 12.3, NumDevs = 2
Result = PASS
A40 graphics card specifications
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 2 CUDA Capable device(s)
Device 0: "NVIDIA A40"
CUDA Driver Version / Runtime Version 12.3 / 12.3
CUDA Capability Major/Minor version number: 8.6
Total amount of global memory: 45413 MBytes (47619112960 bytes)
(084) Multiprocessors, (128) CUDA Cores/MP: 10752 CUDA Cores
GPU Max Clock rate: 1740 MHz (1.74 GHz)
Memory Clock rate: 7251 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 6291456 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 102400 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 65 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "NVIDIA A40"
CUDA Driver Version / Runtime Version 12.3 / 12.3
CUDA Capability Major/Minor version number: 8.6
Total amount of global memory: 45413 MBytes (47619112960 bytes)
(084) Multiprocessors, (128) CUDA Cores/MP: 10752 CUDA Cores
GPU Max Clock rate: 1740 MHz (1.74 GHz)
Memory Clock rate: 7251 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 6291456 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 102400 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 161 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from NVIDIA A40 (GPU0) -> NVIDIA A40 (GPU1) : Yes
> Peer access from NVIDIA A40 (GPU1) -> NVIDIA A40 (GPU0) : Yes
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.3, CUDA Runtime Version = 12.3, NumDevs = 2
Result = PASS
Info
The device query program, which reports the hardware details more precisely, can be built and run as follows. (The steps below assume the user is on a node in the cuda-zen partition.)
1. git clone https://github.com/NVIDIA/cuda-samples.git
2. cd cuda-samples/Samples/1_Utilities/deviceQuery
3. module load cuda/<latest_version>
4. export CUDA_PATH=$CUDA_HOME
5. make
6. ./deviceQuery
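The full deviceQuery report is long, so once the binary is built it can be convenient to keep only a few key lines. The following is a minimal sketch under that assumption: summarize_device_query is a hypothetical helper, and the selected fields are just an example choice.

```shell
#!/bin/bash
# Sketch: keep only a few key lines of the deviceQuery report.
# summarize_device_query is a hypothetical helper that filters stdin.
summarize_device_query() {
    grep -E '^ *(Device [0-9]+:|CUDA Capability|Total amount of global memory|Device has ECC support)'
}

# Run only if the binary has already been built in the current directory.
if [ -x ./deviceQuery ]; then
    ./deviceQuery | summarize_device_query
fi
```

For the cards described above, this would reduce each device to its name, compute capability, global memory size and ECC state.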