How to use GPU nodes
About GPU nodes
Nibbler has 7 compute nodes in total: 5 regular nodes and 2 GPU nodes with 8 x GPU (A40) devices. The GPU nodes are placed in a Slurm partition called gpu_a40.
GPU jobs can be submitted to Slurm with either the sbatch or srun command, using one of the following arguments: --gres=gpu:a40:#, where # is the number of GPUs to be reserved and a40 is the type of GPU card requested. For example, a job that uses 1 node with 2 GPUs would use --gres=gpu:a40:2. Alternatively, you can provide --gpus-per-node=a40:#, where # is again the number of GPUs requested per node.
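For example, the following two requests reserve 2 A40 GPUs on a single node in the same way (the --qos and --time values are only illustrative):
$ # batch job using --gres
$ sbatch --gres=gpu:a40:2 my.sbatch
$ # interactive job using --gpus-per-node
$ srun --qos=interactive-short --gpus-per-node=a40:2 --time=00:30:00 --pty bash -i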
Note that users can only request a number of entire GPUs and hence NOT partial GPU resources. For example, you can request 1, 2 or more GPUs, but you cannot request 1 GPU with a specific amount of GPU cores or GPU memory.
Selecting such individual resources is only possible on newer GPUs that support the MIG (Multi-Instance GPU) feature, such as the A30 and A100, but not on our A40s.
Examples 1 and 2: Submitting a batch and an interactive job
Example 1 shows how to submit the gpu_test.sbatch file, requesting 2 GPU devices and executing a command that prints information about the allocated GPU devices:
$ cat gpu_test.sbatch
#!/bin/bash
#SBATCH --gres=gpu:a40:2
#SBATCH --job-name=gpu_test
#SBATCH --output=gpu_test.out
#SBATCH --error=gpu_test.err
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=1
#SBATCH --mem=1gb
#SBATCH --nodes=1
#SBATCH --export=NONE
nvidia-smi
$ sbatch gpu_test.sbatch
Alternatively, the GPU request can be passed as an argument on the command line:
$ sbatch --gres=gpu:a40:2 my.sbatch
Example 2 runs the same GPU example in an interactive session using srun.
$ mkdir -p /groups/umcg-GROUP/tmpXX/projects/${USER}/gpu_test
$ cd /groups/umcg-GROUP/tmpXX/projects/${USER}/gpu_test
$ srun --qos=interactive-short --gres=gpu:a40:2 --time=01:00:00 --pty bash -i
$ echo ${SLURM_GPUS_ON_NODE}
2
Replace the GROUP and tmpXX placeholders with the correct values for your group and tmp filesystem. The returned value is the number of GPUs available to the job. Use the nvidia-smi command to see more information about the GPU devices that are available inside the job.
$ nvidia-smi
Mon Jul 24 12:45:15 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A40 On | 00000000:00:08.0 Off | 0 |
| 0% 30C P8 21W / 300W | 4MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A40 On | 00000000:00:09.0 Off | 0 |
| 0% 29C P8 21W / 300W | 4MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Note that, like any other interactive job, this one is also limited to one interactive job per user.
As you can tell from the example above, once the job has started, an environment variable called SLURM_GPUS_ON_NODE is created. It contains the number of GPUs available to the currently running job; in the example above it is set to 2. Furthermore, you can access only the two GPUs that are assigned to the job, which means you cannot use any other GPUs on the node. This is enforced by Slurm's control groups and prevents users from consuming resources that they have not requested.
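A job script can use this variable to, for instance, start one process per allocated GPU. A minimal sketch, assuming a hypothetical script my_gpu_tool.py and that the devices visible inside the job are numbered 0 .. SLURM_GPUS_ON_NODE-1 (as in the nvidia-smi output above):
#!/bin/bash
#SBATCH --gres=gpu:a40:2
#SBATCH --nodes=1
#SBATCH --time=00:30:00
#SBATCH --export=NONE

# Number of GPUs Slurm allocated on this node (2 in this sketch).
echo "GPUs allocated on this node: ${SLURM_GPUS_ON_NODE}"

# Start one worker per allocated GPU; my_gpu_tool.py is a hypothetical script.
for ((i = 0; i < SLURM_GPUS_ON_NODE; i++)); do
    CUDA_VISIBLE_DEVICES=${i} python my_gpu_tool.py --shard "${i}" &
done
wait  # wait for all workers to finish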
To show the currently running jobs and how many GPUs they are using:
$ squeue -o "%.10i %.20j %.10u %.2t %.10M %.5D %.15R %.15b %.15P"
JOBID NAME USER ST TIME NODES NODELIST(REASON) TRES_PER_NODE PARTITION
1234 bash umcg-user1 R 8:27 1 nb-vcompute05 gres:gpu:a40:2 gpu_a40
1235 somejob umcg-user2 R 2:09:14 1 nb-vcompute05 N/A gpu_a40
1236 somejob umcg-user3 R 2:19:53 1 nb-vcompute04 N/A gpu_a40
...
The GPUs in use by each job are listed in the TRES_PER_NODE column (the second-to-last column).
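If you check this often, the format string can be wrapped in a shell alias; a small sketch (the alias name sq-gpu is just an example):
$ alias sq-gpu='squeue -o "%.10i %.20j %.10u %.2t %.10M %.5D %.15R %.15b %.15P"'
$ sq-gpu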
Example 3: Build and run a CUDA source sample
This example shows how to build and run a CUDA code sample (samples version 12.2, compiled with CUDA/12.2.0) in an interactive job.
First, make sure that the driver and CUDA versions are the same as or higher than the samples version, that is:
[ driver version ] and [ CUDA version ] >= [ samples version ]
To check the driver and CUDA versions, run nvidia-smi on the compute node and check the version numbers - they are printed in the top-middle and top-right corners of the output.
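The same check can be done non-interactively; a small sketch using nvidia-smi query options (the --query-gpu field below reports only the driver version, while the CUDA version appears in the header of the default output):
$ # print the name and driver version of each GPU visible to the job
$ nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
$ # the CUDA version reported by the driver is shown in the first lines of the default output
$ nvidia-smi | head -n 4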
[nibbler ~]$ # replace the YYY with appropriate values
[nibbler ~]$ mkdir -p /groups/umcg-YYY/tmpYY/users/umcg-YYY/cuda_samples
[nibbler ~]$ cd /groups/umcg-YYY/tmpYY/users/umcg-YYY/cuda_samples
[nibbler cuda_samples]$ wget https://github.com/NVIDIA/cuda-samples/archive/refs/tags/v12.2.tar.gz -O - | tar -xz
[nibbler cuda_samples]$ cd cuda-samples-12.2/Samples/6_Performance/UnifiedMemoryPerf
[nibbler UnifiedMemoryPerf]$ # increase the matrix size, so that the calculation takes long enough to capture with nvidia-smi
[nibbler UnifiedMemoryPerf]$ sed -i 's/maxSampleSizeInMb = 64/maxSampleSizeInMb = 1024/' matrixMultiplyPerf.cu
[nibbler UnifiedMemoryPerf]$ srun --qos=interactive-short --gpus-per-node=a40:2 --mem=20G --time=01:00:00 --pty bash -i
[nb-node-b02 UnifiedMemoryPerf]$ ml CUDA/12.2.0 # load CUDA compiler and libraries
[nb-node-b02 UnifiedMemoryPerf]$ make # compile the current example
[nb-node-b02 UnifiedMemoryPerf]$ # run the test on the second device (note: the first device is '0', the second is '1', etc.)
[nb-node-b02 UnifiedMemoryPerf]$ ./UnifiedMemoryPerf -device=1 > gpu_test.log &
[nb-node-b02 UnifiedMemoryPerf]$ nvidia-smi
Mon Jul 24 12:53:08 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A40 On | 00000000:00:08.0 Off | 0 |
| 0% 34C P0 77W / 300W | 7MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A40 On | 00000000:00:09.0 Off | 0 |
| 0% 41C P0 96W / 300W | 376MiB / 46068MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 1 N/A N/A 15455 C ./UnifiedMemoryPerf 260MiB |
+---------------------------------------------------------------------------------------+
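While a longer run is in progress, you can also monitor the utilisation continuously instead of taking a single snapshot; a small sketch using standard tools (stop watching with Ctrl-C):
[nb-node-b02 UnifiedMemoryPerf]$ # refresh the nvidia-smi output every 2 seconds
[nb-node-b02 UnifiedMemoryPerf]$ watch -n 2 nvidia-smi
[nb-node-b02 UnifiedMemoryPerf]$ # when the background run has finished, inspect its log
[nb-node-b02 UnifiedMemoryPerf]$ wait; tail gpu_test.log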
Example 4: TensorFlow inside Apptainer
This example shows how to run a TensorFlow Python job inside Apptainer, using 1 node with 2 GPU devices and the CUDA module. It executes a small test job inside the GPU TensorFlow container image available on the cluster (tensorflow-2.13.0-gpu.sif in /apps/data/containers/).
To run this example:
- Create a working directory on the tmp filesystem and navigate into it:
[nibbler ~]$ # create a working directory and change into it
[nibbler ~]$ mkdir /groups/umcg-YYY/tmpYY/users/umcg-YYY/gpu_apptainer_test
[nibbler ~]$ cd /groups/umcg-YYY/tmpYY/users/umcg-YYY/gpu_apptainer_test
- Create two files:
- apptainer_tensorflow.slurm - a job description file for the Slurm queuing system
- training.py - a simple TensorFlow training example script of about 30 lines
Where the apptainer_tensorflow.slurm file contains:
#!/bin/bash
#SBATCH --gres=gpu:a40:2
#SBATCH --job-name=apptainer_tf
#SBATCH --output=apptainer_tf.out
#SBATCH --error=apptainer_tf.err
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=1
#SBATCH --mem=20G
#SBATCH --nodes=1
#SBATCH --export=NONE
## Environment
# Load the latest CUDA environment module
ml CUDA
## Running
# Run the TensorFlow .sif image that is available in /apps/data/containers/ and execute the training.py script
apptainer run -B $(pwd) --nv /apps/data/containers/tensorflow-2.13.0-gpu.sif python training.py
and the training.py file contains:
## From https://www.tensorflow.org/tutorials/quickstart/beginner
import tensorflow as tf
print("TensorFlow version:", tf.__version__)
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
model = tf.keras.models.Sequential([
tf.keras.layers.Flatten(input_shape=(28, 28)),
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Dense(10)
])
predictions = model(x_train[:1]).numpy()
predictions
tf.nn.softmax(predictions).numpy()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
loss_fn(y_train[:1], predictions).numpy()
model.compile(optimizer='adam',
loss=loss_fn,
metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test, verbose=2)
probability_model = tf.keras.Sequential([
model,
tf.keras.layers.Softmax()
])
probability_model(x_test[:5])
This basic training Python script example:
- loads the MNIST database of handwritten digits,
- builds a neural network machine learning model that classifies images,
- trains this neural network and
- evaluates the accuracy of the model.
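Submit the job with sbatch once both files are in place. As an optional check, you can verify from inside the container that TensorFlow sees the allocated GPUs; a minimal sketch, assuming the check is run on the GPU node within an interactive job:
[nibbler gpu_apptainer_test]$ sbatch apptainer_tensorflow.slurm
[nb-node-b02 gpu_apptainer_test]$ # optional: list the GPUs TensorFlow can see inside the container
[nb-node-b02 gpu_apptainer_test]$ apptainer exec --nv /apps/data/containers/tensorflow-2.13.0-gpu.sif \
    python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"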