Run Jobs on the GPU Partition
GPU Partition Overview
Yen Slurm has three GPU nodes:
- yen-gpu1 has 64 threads, 256 GB of RAM, and 4 NVIDIA A30 GPUs
- yen-gpu2 and yen-gpu3 each have 64 threads, 256 GB of RAM, and 4 NVIDIA A40 GPUs
The A30 GPUs have 24 GB of GPU RAM, while the A40 GPUs have 48 GB.
GPU Partition
To use these GPU nodes, submit jobs to the gpu partition of the Yen Slurm scheduler. Run sinfo -p gpu to get more information about this partition:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
gpu up 1-00:00:00 1 mix yen-gpu3
gpu up 1-00:00:00 2 idle yen-gpu[1-2]
Job Time and GPU Limit
There is a limit of 1 day of runtime and 4 GPUs per user.
Usage Example with Python
This guide details how to run a short Python deep learning training example using either PyTorch or Keras. CUDA, PyTorch, and TensorFlow/Keras are already installed, so you do not need to install them yourself.
Loading Modules
To use PyTorch, load the pre-installed module:

ml pytorch

To use Keras, load the TensorFlow module and activate its conda environment:

ml tensorflow
source activate
Once the chosen module is loaded, you will be in a Python 3.10 environment with the relevant packages already installed.
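As a quick sanity check (a sketch; the exact versions depend on which module is loaded), you can confirm the interpreter version and that the framework imports:

```python
# Quick environment sanity check after loading a module.
# The torch import only succeeds once the pytorch module is loaded.
import sys

print("Python", sys.version.split()[0])
try:
    import torch
    print("torch", torch.__version__)
except ImportError:
    print("torch not importable -- load the pytorch module first")
```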
Create Python Script
This example uses the MNIST dataset for image classification, and consists of a simple fully connected neural network with one hidden layer.
Save the following code to a new file called mnist.py:
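The same task can be written with Keras; a minimal sketch (layer sizes, batch size, and epoch count are illustrative):

```python
# mnist.py -- illustrative Keras sketch: a simple dense network on MNIST.
# TensorFlow automatically uses a GPU when one is visible.
from tensorflow import keras

def build_model():
    return keras.Sequential([
        keras.layers.Input(shape=(28, 28)),
        keras.layers.Flatten(),
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dense(10, activation="softmax"),
    ])

def main():
    (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

    model = build_model()
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, batch_size=128, epochs=15, validation_split=0.1)

    loss, acc = model.evaluate(x_test, y_test, verbose=0)
    print("Test loss:", loss)
    print("Test accuracy:", acc)

if __name__ == "__main__":
    main()
```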
Submit Script to Yen Slurm
You can now write a Slurm submission script to run mnist.py on a GPU node on yen-slurm. An example PyTorch submission script, train-gpu.slurm, looks like this:
#!/bin/bash
# Example slurm script to train pytorch DL model on Yen GPU
#SBATCH -J train-gpu
#SBATCH -p gpu
#SBATCH -c 20
#SBATCH -N 1
#SBATCH -t 1- # limit of 1 day runtime
#SBATCH -G 1 # limit of 4 GPUs per user
#SBATCH -o train-gpu-%j.out
#SBATCH --mail-type=ALL
#SBATCH --mail-user=your_email@stanford.edu
# load pytorch module
ml pytorch
# run training script on GPU
python mnist.py
The Keras version of the script loads the tensorflow module and activates its conda environment instead:

#!/bin/bash
# Example slurm script to train keras DL model on Yen GPU
#SBATCH -J train-gpu
#SBATCH -p gpu
#SBATCH -c 20
#SBATCH -N 1
#SBATCH -t 1- # limit of 1 day runtime
#SBATCH -G 1 # limit of 4 GPUs per user
#SBATCH -o train-gpu-%j.out
#SBATCH --mail-type=ALL
#SBATCH --mail-user=your_email@stanford.edu
# For safety, deactivate any conda env that may be active from the interactive yens and purge all loaded modules
source deactivate
module purge
# load tensorflow module
ml tensorflow
# activate the tensorflow conda env
source activate
# run training on GPU
python mnist.py
Each script asks for one GPU (-G 1) on the gpu partition (-p gpu), 20 CPU cores on a single node (-c 20, -N 1), and a one-day runtime limit (-t 1-).
You can submit the job to the gpu partition with:
sbatch train-gpu.slurm
Monitor your job:
squeue -u $USER
You should see something like:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
190526 gpu train-gp user R 0:25 1 yen-gpu1
Once the job is running, you can connect to the node the job is running on:
ssh yen-gpu1
Upon connecting to the GPU node, you can monitor GPU utilization:
nvidia-smi
You should see that one of the four GPUs is being utilized (under the GPU-Util column) and that the process running on the GPU is python:
Tue May 9 15:58:57 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A30 On | 00000000:17:00.0 Off | 0 |
| N/A 34C P0 33W / 165W| 1073MiB / 24576MiB | 3% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A30 On | 00000000:65:00.0 Off | 0 |
| N/A 31C P0 31W / 165W| 0MiB / 24576MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A30 On | 00000000:CA:00.0 Off | 0 |
| N/A 32C P0 29W / 165W| 0MiB / 24576MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A30 On | 00000000:E3:00.0 Off | 0 |
| N/A 33C P0 28W / 165W| 0MiB / 24576MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 3927692 C python 1070MiB |
+---------------------------------------------------------------------------------------+
Once the job is done, look at the output file:
cat train-gpu*.out
For the PyTorch job, the output should look similar to:
[1] loss: 0.553
[2] loss: 0.265
[3] loss: 0.210
[4] loss: 0.175
[5] loss: 0.149
[6] loss: 0.129
[7] loss: 0.114
[8] loss: 0.101
[9] loss: 0.091
[10] loss: 0.083
Accuracy on the test set: 97 %
For the Keras job, the output instead shows per-epoch progress:

...
Epoch 14/15
422/422 [==============================] - 2s 5ms/step - loss: 0.0361 - accuracy: 0.9882 - val_loss: 0.0297 - val_accuracy: 0.9920
Epoch 15/15
422/422 [==============================] - 2s 5ms/step - loss: 0.0336 - accuracy: 0.9891 - val_loss: 0.0274 - val_accuracy: 0.9922
Test loss: 0.02380027435719967
Test accuracy: 0.9922000169754028
Make Module into a Jupyter Kernel
We can also add the PyTorch or Keras environment to the interactive Yens' JupyterHub. Although each module falls back to CPU on JupyterHub, since no GPU is available there, notebooks are still useful for visualization and other pre-/post-training tasks. For model training and inference, the GPU nodes on yen-slurm are a better fit.
You can start by listing all of your available JupyterHub kernels:
jupyter kernelspec list
To add an environment, load its module and register a JupyterHub kernel. For PyTorch:

ml pytorch
python -m ipykernel install --user --name pytorch212 --display-name 'PyTorch 2.1.2'

For Keras/TensorFlow:

module load tensorflow
source activate
python -m ipykernel install --user --name tensorflow211 --display-name 'Tensorflow 2.11'
When you launch a new Python notebook, you should see the newly created PyTorch 2.1.2 or Tensorflow 2.11 kernels and can start one up.
Once you start up the notebook, make sure you can import torch (for PyTorch) or tensorflow (for Keras). Note that CUDA is not available here since the interactive Yens do not have GPUs.
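For example, with the PyTorch kernel, a quick check in a notebook cell (a sketch) looks like:

```python
# Confirm the framework imports and check whether CUDA is visible.
import torch

print(torch.__version__)
print(torch.cuda.is_available())  # False on the interactive yens, which have no GPUs
```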

