How to use GPU nodes

About GPU nodes

Nibbler has 7 compute nodes in total: 5 regular nodes and 2 GPU nodes with 8 A40 GPU devices.

GPU jobs can be submitted to Slurm with either the sbatch or the srun command, using one of the following arguments:

  • --gres=gpu:a40:#, where # is the number of GPUs to be reserved and a40 is the type of GPU card requested.
    For example, a job that needs 1 node with 2 GPUs would use --gres=gpu:a40:2.
  • --gpus-per-node=a40:#, where # is again the number of GPUs requested per node.

Note that users can only request entire GPUs and hence NOT partial GPU resources. For example, you can request 1, 2 or more GPU(s), but you cannot request 1 GPU with a specific amount of GPU cores or GPU memory. (Selecting individual resources is possible on newer GPUs that support the MIG feature, but this feature is not available for the GPUs on Nibbler.)
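
As a quick illustration, both syntaxes below request the same GPU resources from the command line; my.sbatch stands in for your own job script:

    # Both commands request 2 whole A40 GPUs on a single node.
    sbatch --gres=gpu:a40:2 my.sbatch
    sbatch --gpus-per-node=a40:2 my.sbatch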

Examples 1 and 2: Submitting a batch and an interactive job

Example 1 shows how to submit the gpu_test.sbatch file, requesting 2 GPU devices and executing a command that prints information about the allocated GPU devices:

    $ cat gpu_test.sbatch
    #!/bin/bash
    #SBATCH --gres=gpu:a40:2
    #SBATCH --job-name=gpu_test
    #SBATCH --output=gpu_test.out
    #SBATCH --error=gpu_test.err
    #SBATCH --time=01:00:00
    #SBATCH --cpus-per-task=1
    #SBATCH --mem=1gb
    #SBATCH --nodes=1
    #SBATCH --export=NONE

    nvidia-smi
    $ sbatch gpu_test.sbatch

Alternatively, the GPU request can be passed as an argument on the command line:

    $ sbatch --gres=gpu:a40:2 my.sbatch
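
After submission you can check the job status and, once the job has finished, inspect the output file defined in the job script above (a brief sketch):

    squeue -u ${USER}      # check whether the job is still pending or running
    cat gpu_test.out       # nvidia-smi output for the allocated GPUs, written by the batch job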

Example 2 runs the same GPU example in an interactive session using srun.

    $ mkdir -p /groups/umcg-GROUP/tmpXX/projects/${USER}/gpu_test
    $ cd /groups/umcg-GROUP/tmpXX/projects/${USER}/gpu_test
    $ srun --qos=interactive-short --gres=gpu:a40:2 --time=01:00:00 --pty bash -i
    $ echo ${SLURM_GPUS_ON_NODE}
    2  

Replace the GROUP and tmpXX placeholders with the correct values for your group and tmp filesystem. The returned value is the number of GPUs available to the job. Use the nvidia-smi command to see more information about the GPU devices that are available inside the job.

    $ nvidia-smi 
    Mon Jul 24 12:45:15 2023
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  NVIDIA A40                     On  | 00000000:00:08.0 Off |                    0 |
    |  0%   30C    P8              21W / 300W |      4MiB / 46068MiB |      0%      Default |
    |                                         |                      |                  N/A |
    +-----------------------------------------+----------------------+----------------------+
    |   1  NVIDIA A40                     On  | 00000000:00:09.0 Off |                    0 |
    |  0%   29C    P8              21W / 300W |      4MiB / 46068MiB |      0%      Default |
    |                                         |                      |                  N/A |
    +-----------------------------------------+----------------------+----------------------+

    +---------------------------------------------------------------------------------------+
    | Processes:                                                                            |
    |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
    |        ID   ID                                                             Usage      |
    |=======================================================================================|
    |  No running processes found                                                           |
    +---------------------------------------------------------------------------------------+

Note that, like any other interactive job, this one also counts towards the limit of one interactive job per user.

As the example above shows, once the job has started, the environment variable SLURM_GPUS_ON_NODE is created. It contains the number of GPUs available to the currently running job; in the example above it is set to 2. Furthermore, you can access only the two GPUs that are assigned to the job, which means you won't be able to use any other GPUs on the node. This is enforced by Slurm's control groups (cgroups) and prevents users from consuming resources that they have not requested.
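
A minimal sketch of how a job step could use this variable before starting the real work (assuming a bash job step; Slurm typically also sets CUDA_VISIBLE_DEVICES to the allocated devices):

    # Abort early if Slurm did not allocate any GPUs to this job.
    if [ "${SLURM_GPUS_ON_NODE:-0}" -lt 1 ]; then
        echo "ERROR: no GPUs allocated to this job" >&2
        exit 1
    fi
    echo "Job ${SLURM_JOB_ID} sees ${SLURM_GPUS_ON_NODE} GPU(s); CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"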

To list the current jobs and how many GPUs they are using, you can use the squeue command like this:

    squeue -o "%.10i %.20j %.10u %.2t %.10M %.5D %.15R %.15b"

which will list the type and number of GPUs used in the last column like this:

        JOBID    NAME     USER        ST      TIME  NODES  NODELIST(REASON)  TRES_PER_NODE
         1234    bash     umcg-user1  R       8:27      1  nb-vcompute05     gres:gpu:a40:2
         1235    somejob  umcg-user2  R    2:09:14      1  nb-vcompute05     N/A
         1236    somejob  umcg-user3  R    2:19:53      1  nb-vcompute04     N/A
         ...
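
To restrict the listing to your own jobs, the same format string can be combined with the -u flag (an optional variant, not shown in the output above):

    squeue -u ${USER} -o "%.10i %.20j %.10u %.2t %.10M %.5D %.15R %.15b"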

Example 3: Build and run CUDA source sample

This example shows how to build and run a CUDA code sample (version 12.2, compiled with the CUDA/12.2.0 module) in an interactive job.

First, make sure that the driver and CUDA versions are the same as or higher than the samples version, that is:

    [ driver version ] and [ CUDA version ] >= [ samples version ]

To check the driver and CUDA version, run nvidia-smi on the compute node:

    # Replace the YYY with appropriate values.
    mkdir -p /groups/umcg-YYY/tmpYY/users/umcg-YYY/cuda_samples
    cd /groups/umcg-YYY/tmpYY/users/umcg-YYY/cuda_samples
    wget https://github.com/NVIDIA/cuda-samples/archive/refs/tags/v12.2.tar.gz -O - | tar -xz
    cd cuda-samples-12.2/Samples/6_Performance/UnifiedMemoryPerf
    # Increase the matrix size, so that the calculation takes long enough to capture with nvidia-smi.
    sed -i 's/maxSampleSizeInMb = 64/maxSampleSizeInMb = 1024/' matrixMultiplyPerf.cu
    srun --qos=interactive-short --gpus-per-node=a40:2 --mem=20G --time=01:00:00 --pty bash -i
    ml CUDA/12.2.0          # load the CUDA compiler and libraries
    make                    # compile the current example
    # Run the test on the second device (note: the first device is '0', the second is '1', etc.)
    ./UnifiedMemoryPerf -device=1 > gpu_test.log &
    nvidia-smi

Check the version numbers; they are printed at the top middle and top right of the output, which should look like this:

    Mon Jul 24 12:53:08 2023
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  NVIDIA A40                     On  | 00000000:00:08.0 Off |                    0 |
    |  0%   34C    P0              77W / 300W |      7MiB / 46068MiB |      0%      Default |
    |                                         |                      |                  N/A |
    +-----------------------------------------+----------------------+----------------------+
    |   1  NVIDIA A40                     On  | 00000000:00:09.0 Off |                    0 |
    |  0%   41C    P0              96W / 300W |    376MiB / 46068MiB |    100%      Default |
    |                                         |                      |                  N/A |
    +-----------------------------------------+----------------------+----------------------+

    +---------------------------------------------------------------------------------------+
    | Processes:                                                                            |
    |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
    |        ID   ID                                                             Usage      |
    |=======================================================================================|
    |    1   N/A  N/A     15455      C   ./UnifiedMemoryPerf                         260MiB |
    +---------------------------------------------------------------------------------------+
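
As an optional follow-up (a sketch, not part of the original sample), you can wait for the background run to finish, inspect its log, and query the driver version directly instead of reading the full table:

    wait                        # block until the background ./UnifiedMemoryPerf run has finished
    less gpu_test.log           # timing results written by the sample
    # Print only the driver version of each visible GPU.
    nvidia-smi --query-gpu=driver_version --format=csv,noheader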

Example 4: Tensorflow inside Apptainer

This example shows how to run a TensorFlow Python job inside an Apptainer container, using 1 node with 2 GPU devices and the CUDA module. It runs the GPU-enabled TensorFlow container image (stored in /apps/data/containers/) and executes a small training script inside it.

To run this example:

  1. Create the working directory on the tmp filesystem and navigate into it:

         mkdir -p /groups/umcg-YYY/tmpYY/users/umcg-YYY/gpu_apptainer_test
         cd /groups/umcg-YYY/tmpYY/users/umcg-YYY/gpu_apptainer_test

  2. Create two files:
     1. apptainer_tensorflow.slurm - a job description file for the Slurm queuing system.
     2. training.py - a simple TensorFlow training example script of about 30 lines.

where the apptainer_tensorflow.slurm file contains:

    #!/bin/bash
    #SBATCH --gres=gpu:a40:2
    #SBATCH --job-name=apptainer_tf
    #SBATCH --output=apptainer_tf.out
    #SBATCH --error=apptainer_tf.err
    #SBATCH --time=01:00:00
    #SBATCH --cpus-per-task=1
    #SBATCH --mem=20G
    #SBATCH --nodes=1
    #SBATCH --export=NONE

    ## Environment
    # Load the latest CUDA environment module.
    ml CUDA

    ### Running
    # Run the TensorFlow .sif image that is stored in /apps/data/containers/ and execute the training.py script.
    apptainer run -B $(pwd) --nv /apps/data/containers/tensorflow-2.13.0-gpu.sif python training.py

and the training.py file contains:

    ## From https://www.tensorflow.org/tutorials/quickstart/beginner
    import tensorflow as tf
    print("TensorFlow version:", tf.__version__)

    # Load the MNIST dataset of handwritten digits and scale pixel values to [0, 1].
    mnist = tf.keras.datasets.mnist
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0

    # Build a simple fully-connected classifier.
    model = tf.keras.models.Sequential([
      tf.keras.layers.Flatten(input_shape=(28, 28)),
      tf.keras.layers.Dense(128, activation='relu'),
      tf.keras.layers.Dropout(0.2),
      tf.keras.layers.Dense(10)
    ])

    # Inspect the (untrained) model's raw predictions for the first training image.
    predictions = model(x_train[:1]).numpy()
    print(predictions)
    print(tf.nn.softmax(predictions).numpy())

    # Define the loss, then compile, train and evaluate the model.
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    print(loss_fn(y_train[:1], predictions).numpy())
    model.compile(optimizer='adam',
                  loss=loss_fn,
                  metrics=['accuracy'])
    model.fit(x_train, y_train, epochs=5)
    model.evaluate(x_test, y_test, verbose=2)

    # Wrap the trained model so that it returns probabilities instead of logits.
    probability_model = tf.keras.Sequential([
      model,
      tf.keras.layers.Softmax()
    ])
    print(probability_model(x_test[:5]))

This basic training Python script example:

  1. loads the MNIST database of handwritten digits,
  2. builds a neural network machine learning model that classifies images,
  3. trains this neural network and
  4. evaluates the accuracy of the model.
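
Once both files are in place, submit the job and follow its output (a brief sketch; file names as defined in the job script above):

    sbatch apptainer_tensorflow.slurm   # submit the batch job
    squeue -u ${USER}                   # check that the job is queued or running
    tail -f apptainer_tf.out            # follow the training output once the job has started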

Additional documentation

  1. TensorFlow 2 quickstart for beginners
  2. NVIDIA Tesla documentation
  3. CUDA Code Samples
  4. Apptainer GPU Support documentation