Node Configuration¶
Each compute node is configured with 8 NUMA nodes, and each NUMA node has 24 physical cores. Since hyperthreading is enabled, there are 48 logical cores per NUMA node, for a total of 384 virtual cores per server.
$ lscpu | grep NUMA
NUMA node(s): 8
NUMA node0 CPU(s): 0-23,192-215
NUMA node1 CPU(s): 24-47,216-239
NUMA node2 CPU(s): 48-71,240-263
NUMA node3 CPU(s): 72-95,264-287
NUMA node4 CPU(s): 96-119,288-311
NUMA node5 CPU(s): 120-143,312-335
NUMA node6 CPU(s): 144-167,336-359
NUMA node7 CPU(s): 168-191,360-383
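The two CPU ranges listed for each NUMA node are hyperthread siblings: logical CPU n and logical CPU n+192 typically reside on the same physical core. This can be checked through sysfs; the output shown below for CPU 0 is what the layout above suggests, not verified output:
$ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
0,192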
GPU Nodes¶
Each GPU node has four NVIDIA H100 GPUs, each attached to a particular NUMA node. The exact attachment can be inspected with this command:
$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV6 NV6 NV6 SYS SYS SYS PXB SYS SYS 72-95,264-287 3 N/A
GPU1 NV6 X NV6 NV6 SYS SYS PXB SYS SYS SYS 48-71,240-263 2 N/A
GPU2 NV6 NV6 X NV6 SYS SYS SYS SYS SYS PXB 144-167,336-359 6 N/A
GPU3 NV6 NV6 NV6 X SYS SYS SYS SYS PXB SYS 96-119,288-311 4 N/A
NIC0 SYS SYS SYS SYS X PIX SYS SYS SYS SYS
NIC1 SYS SYS SYS SYS PIX X SYS SYS SYS SYS
NIC2 SYS PXB SYS SYS SYS SYS X SYS SYS SYS
NIC3 PXB SYS SYS SYS SYS SYS SYS X SYS SYS
NIC4 SYS SYS SYS PXB SYS SYS SYS SYS X SYS
NIC5 SYS SYS PXB SYS SYS SYS SYS SYS SYS X
From the above matrix one can see that the GPUs are attached to NUMA nodes 2, 3, 4, and 6. For optimal performance, processes should be pinned to these NUMA domains.
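As an illustration, a process driving GPU0 could be bound to the cores and memory of NUMA node 3 with numactl (assuming numactl is installed on the node; ./app is a placeholder for your application):
$ numactl --cpunodebind=3 --membind=3 ./app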
WEKA Filesystem¶
To achieve the best possible performance with the WEKA filesystem, a certain number of cores must be reserved for its exclusive use.
GPU Nodes¶
Four virtual cores on each NUMA node (32 virtual cores in total) are reserved for WEKA. Therefore, 352 virtual (176 physical) cores are available for SLURM jobs.
CPU Nodes¶
Four virtual cores on NUMA node 7 are reserved for WEKA. Hence, 380 virtual (190 physical) cores are available for SLURM jobs.
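For example, a job that wants all schedulable cores of a CPU node could request one task per physical core with two CPUs each (the script name job.sh is illustrative):
$ sbatch --nodes=1 --exclusive --ntasks-per-node=190 --cpus-per-task=2 job.sh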
Overview Table¶
| node type | physical/virtual cores (total) | physical/virtual cores per NUMA node |
|---|---|---|
| CPU | 190/380 | NUMA 0-6: 24/48; NUMA 7: 22/44 |
| GPU | 176/352 | NUMA 0-7: 22/44 |
Tip
The SLURM configuration of a specific node is best seen with this scontrol command:
# scontrol show node n3014-001
NodeName=n3014-001 CoresPerSocket=24
CPUAlloc=0 CPUEfctv=380 CPUTot=384 CPULoad=0.00
AvailableFeatures=nogpu,hca_mlx5_0
ActiveFeatures=nogpu,hca_mlx5_0
Gres=np_zen4_0768:380
NodeAddr=n3014-001 NodeHostName=n3014-001
RealMemory=770000 AllocMem=0 FreeMem=N/A Sockets=8 Boards=1
CoreSpecCount=2 CPUSpecList=380-383 MemSpecLimit=7000
State=DOWN+NOT_RESPONDING ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
BootTime=None SlurmdStartTime=None
LastBusyTime=2025-10-21T11:46:18 ResumeAfterTime=None
CfgTRES=cpu=380,mem=770000M,billing=380,gres/np_zen4_0768=100
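In this example output, CoreSpecCount=2 and CPUSpecList=380-383 correspond to the two physical (four virtual) cores on NUMA node 7 that are reserved for WEKA, which is why CPUEfctv=380 while CPUTot=384.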