Train machine learning models on GPU

If you are running machine learning or deep learning algorithms, you might benefit from running them on a GPU. Currently, the Yens have one GPU node with 4 GPUs, but other GPU resources are available to the Stanford research computing community as well.

Sherlock HPC

Sherlock has more than 700 GPUs that we can take advantage of when training machine learning / deep learning models. See this page for getting started using Sherlock.

Log in to the Sherlock HPC system:

ssh <$USER>@sherlock.stanford.edu

Enter your SUNet ID and authenticate with Duo to log in.

Install miniconda

To install miniconda in /oak/stanford/projects/<your-lab>/<$USER>/miniconda3, use the shell script below. Save it to your home directory as, for example, install_miniconda.sh. When you run the script, the installer asks for a path where miniconda should be installed. Paste your Oak path (for example, /oak/stanford/projects/<your-lab>/<$USER>/miniconda3) instead of accepting the default location in your home directory. Conda environments for machine learning, data science, and deep learning packages are large, and a few of them can be enough to fill your home directory, so use Oak or another path where you have plenty of space:

#!/bin/sh

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

chmod +x Miniconda3-latest-Linux-x86_64.sh

./Miniconda3-latest-Linux-x86_64.sh

conda config --set auto_activate_base false

See this page to learn more about Oak and to purchase space for your lab.
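To see how much space a miniconda installation or a single environment actually occupies, a short standard-library script can add up the file sizes under a directory. This is only a sketch: the helper name `dir_size_bytes` is hypothetical, and because conda hard-links files between its package cache and environments, files reachable through several links are counted once per link, so the total may overestimate real disk usage.

```python
import os
import tempfile


def dir_size_bytes(root):
    """Sum the sizes of all regular files under `root` (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


if __name__ == "__main__":
    # Self-contained demo on a throwaway directory; call dir_size_bytes()
    # on your own miniconda path to measure a real installation.
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "example.bin"), "wb") as fh:
            fh.write(b"\0" * 2048)
        print(dir_size_bytes(tmp), "bytes")
```

Dividing the result by 1024**3 gives the size in GiB, which is easier to compare against a storage quota.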

To run the script from your home directory:

sh install_miniconda.sh

After miniconda is installed, add the path to the conda bin directory to your bash profile (~/.bash_profile) so that the conda and python executables are found:

# add path to miniconda
export PATH=$PATH:/oak/stanford/projects/<your-lab>/<$USER>/miniconda3/bin

Source your bash profile to apply the changes:

source ~/.bash_profile

Create conda env

To install deep learning Python packages that run on the GPU, CUDA must be loaded. Load the cuda module:

module load cuda/11.0.3

Make sure conda is found and is the one that you want:

which conda

This should return:

/oak/stanford/projects/<your-lab>/<$USER>/miniconda3/bin/conda
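The same check can be scripted. The sketch below uses `shutil.which`, which performs the same PATH lookup as `which conda`, and then tests whether the result lies under the install directory you expect. The helper name `is_under` and the example prefix are hypothetical; substitute your own miniconda3 path.

```python
import shutil
from pathlib import Path


def is_under(executable, prefix):
    """Return True if `executable` lives somewhere below the directory `prefix`."""
    exe = Path(executable).resolve()
    root = Path(prefix).resolve()
    return root in exe.parents


if __name__ == "__main__":
    # Hypothetical prefix: replace with your own miniconda3 install directory.
    expected_prefix = Path.home() / "miniconda3"
    found = shutil.which("conda")  # None if conda is not on PATH
    if found is None:
        print("conda is not on PATH")
    else:
        print(found, "-> under expected prefix:", is_under(found, expected_prefix))
```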

Create a new conda env and install the packages with conda. Note that pip and conda do not work well together in this case, so we will use only conda to manage the packages.

conda create -n tf-gpu python=3.8
source activate tf-gpu
conda install tensorflow-gpu keras pandas scikit-learn

Note: pip does not properly manage the GPU-based keras and tensorflow packages, so we use conda install for the GPU-based packages that we want.

Then we can activate this conda env in the slurm submission script.
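Before submitting a long training job, it can be useful to confirm that the environment's TensorFlow can see a GPU at all. The following is a minimal sketch, assuming a TensorFlow 2.x install (where `tf.config.list_physical_devices` is available); the function name `visible_gpus` is hypothetical. It needs to run on a node that has a GPU, for example inside a GPU job, since a node without one will report an empty list even when the install is fine.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or None if TensorFlow is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow is not importable in this environment")
    elif gpus:
        print("TensorFlow sees:", gpus)
    else:
        print("TensorFlow imported, but no GPU is visible")
```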

Keras example

We will run this Python script on a Sherlock GPU to train a simple MNIST convnet. Save the following to mnist.py:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Model / data parameters
num_classes = 10
input_shape = (28, 28, 1)

# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Scale images to the [0, 1] range
x_train = x_train.astype("float32") / 255
x_test = x_test.astype("float32") / 255
# Make sure images have shape (28, 28, 1)
x_train = np.expand_dims(x_train, -1)
x_test = np.expand_dims(x_test, -1)
print("x_train shape:", x_train.shape)
print(x_train.shape[0], "train samples")
print(x_test.shape[0], "test samples")


# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

# build the model
model = keras.Sequential(
    [
        keras.Input(shape=input_shape),
        layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ]
)

print(model.summary())

# train the model
batch_size = 128
epochs = 15

model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.1)

# evaluate the trained model
score = model.evaluate(x_test, y_test, verbose=0)
print("Test loss:", score[0])
print("Test accuracy:", score[1])
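As a sanity check on the model defined above, the layer output shapes and parameter counts can be worked out by hand. The sketch below assumes the Keras defaults the script relies on: stride-1 convolutions with no padding, and max pooling whose stride equals the pool size. The helper names are hypothetical and exist only for this illustration.

```python
def conv_out(size, kernel):
    """Output width of a stride-1 convolution with no padding ('valid')."""
    return size - kernel + 1


def pool_out(size, pool):
    """Output width of non-overlapping max pooling; leftover pixels are dropped."""
    return size // pool


def conv_params(kernel, in_channels, filters):
    """Weights (kernel * kernel * in_channels per filter) plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


def dense_params(inputs, units):
    """One weight per input-unit pair plus one bias per unit."""
    return inputs * units + units


side = conv_out(28, 3)      # 26
side = pool_out(side, 2)    # 13
side = conv_out(side, 3)    # 11
side = pool_out(side, 2)    # 5
flat = side * side * 64     # 1600 features after Flatten

total = conv_params(3, 1, 32) + conv_params(3, 32, 64) + dense_params(flat, 10)

# validation_split=0.1 holds out 6000 of the 60000 training images.
train_samples = 60000 - 60000 // 10
steps_per_epoch = -(-train_samples // 128)  # ceiling division by batch_size

print(flat, total, steps_per_epoch)  # 1600 34826 422
```

These numbers are what `model.summary()` and the per-epoch progress lines should report when the script runs.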

Run slurm script

The submission script keras-gpu.slurm looks like:

#!/bin/bash

# Example slurm script to run keras DL models on Sherlock GPU

#SBATCH -J train-gpu
#SBATCH -p gpu
#SBATCH -c 20
#SBATCH -N 1
#SBATCH -t 1-
#SBATCH -G 1
#SBATCH -o train-gpu-%j.out
#SBATCH --mail-type=ALL
#SBATCH --mail-user=your_email@stanford.edu

source $HOME/.bash_profile
module load cuda/11.0.3
source activate tf-gpu

python mnist.py

This script requests one GPU on the gpu partition and 20 CPU cores on one node for 1 day.
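To make the short flags easier to read, the sketch below collects the #SBATCH lines of a submission script into a dictionary keyed by the long option names that sbatch documents for those flags. The function name `parse_sbatch` is hypothetical, and the parser only handles the two directive shapes used in this script (`-x value` and `--name=value`).

```python
# Short sbatch flags used in the script, mapped to their long option names.
LONG_NAMES = {
    "-J": "job-name",
    "-p": "partition",
    "-c": "cpus-per-task",
    "-N": "nodes",
    "-t": "time",
    "-G": "gpus",
    "-o": "output",
}


def parse_sbatch(script_text):
    """Collect '#SBATCH -x value' and '#SBATCH --name=value' lines into a dict."""
    options = {}
    for line in script_text.splitlines():
        if not line.startswith("#SBATCH"):
            continue
        rest = line[len("#SBATCH"):].strip()
        if rest.startswith("--"):
            name, _, value = rest[2:].partition("=")
        else:
            flag, _, value = rest.partition(" ")
            name = LONG_NAMES.get(flag, flag)
        options[name] = value.strip()
    return options


example = """#!/bin/bash
#SBATCH -J train-gpu
#SBATCH -p gpu
#SBATCH -c 20
#SBATCH -N 1
#SBATCH -t 1-
#SBATCH -G 1
#SBATCH -o train-gpu-%j.out
#SBATCH --mail-type=ALL
"""

for name, value in parse_sbatch(example).items():
    print(f"{name:>14}: {value}")
```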

Submit the job to the gpu partition with:

sbatch keras-gpu.slurm

Monitor your job:

squeue -u $USER

You should see something like:

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          20372833       gpu train-gp nrapstin  R       0:05      1 sh03-12n07
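If you want to pick the node name out of this listing programmatically, the default column layout can be split on whitespace. This is a sketch under the assumption that squeue prints the default header shown above and that the job, state, and node fields contain no spaces; the function name `running_nodes` is hypothetical.

```python
def running_nodes(squeue_text):
    """Map job ID -> node list for every job whose state (ST column) is R."""
    lines = [line.split() for line in squeue_text.splitlines() if line.strip()]
    if not lines:
        return {}
    header, rows = lines[0], lines[1:]
    id_col = header.index("JOBID")
    state_col = header.index("ST")
    node_col = header.index("NODELIST(REASON)")
    return {row[id_col]: row[node_col] for row in rows if row[state_col] == "R"}


sample = """\
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          20372833       gpu train-gp nrapstin  R       0:05      1 sh03-12n07
"""
print(running_nodes(sample))  # {'20372833': 'sh03-12n07'}
```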

Once the job is running, connect to the node listed under NODELIST (sh03-12n07 in this example):

ssh sh03-12n07

Once you are connected to the GPU node, load the cuda module there and monitor GPU utilization while the job is running:

module load cuda/11.0.3
watch nvidia-smi

You should see that the GPU is being utilized (under the GPU-Util column):

Every 2.0s: nvidia-smi
Tue Nov 15 13:43:04 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:C4:00.0 Off |                  N/A |
|  0%   44C    P2   114W / 260W |  10775MiB / 11264MiB |     40%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      4876      C   python                          10772MiB |
+-----------------------------------------------------------------------------+
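For logging utilization rather than watching it interactively, nvidia-smi can also emit machine-readable CSV through its `--query-gpu` and `--format` options. The sketch below runs one such query and parses it; the function names are hypothetical, and the parser assumes every queried field is a plain number, which may not hold on GPUs that report a field as unsupported.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse one 'util, used, total' line per GPU into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        util, used, total = (int(field) for field in line.split(","))
        gpus.append({"util_percent": util, "used_mib": used, "total_mib": total})
    return gpus


def query_gpus():
    """Run nvidia-smi once; return None if it is missing, fails, or reports non-numeric fields."""
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        return parse_gpu_csv(result.stdout)
    except (FileNotFoundError, subprocess.CalledProcessError, ValueError):
        return None


if __name__ == "__main__":
    print(query_gpus())
```

For the GPU shown above, the query would produce a line like `40, 10775, 11264`, which `parse_gpu_csv` turns into one dictionary.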

Once the job is done, look at the output file:

cat train-gpu*.out

The output should look similar to:

/etc/profile.d/z99_srcc.sh: line 183: SHERLOCK: readonly variable
/etc/profile.d/z99_srcc.sh: line 319: PI_SCRATCH: readonly variable
2021-03-15 11:52:52.231386: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 0s 0us/step
2021-03-15 12:08:05.866384: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-03-15 12:08:05.907107: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-03-15 12:08:06.000808: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:c4:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.62GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2021-03-15 12:08:06.000917: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-03-15 12:08:08.793249: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2021-03-15 12:08:08.793435: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2021-03-15 12:08:12.168920: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-03-15 12:08:12.517624: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-03-15 12:08:13.562000: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-03-15 12:08:13.966972: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2021-03-15 12:08:16.409172: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2021-03-15 12:08:16.412824: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-03-15 12:08:16.430127: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-03-15 12:08:16.433601: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:c4:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.62GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2021-03-15 12:08:16.433698: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-03-15 12:08:16.433743: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2021-03-15 12:08:16.433772: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2021-03-15 12:08:16.433799: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-03-15 12:08:16.433826: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-03-15 12:08:16.433875: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-03-15 12:08:16.433904: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2021-03-15 12:08:16.433931: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2021-03-15 12:08:16.437218: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-03-15 12:08:16.437288: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-03-15 12:08:22.623609: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-03-15 12:08:22.623693: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0
2021-03-15 12:08:22.623711: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N
2021-03-15 12:08:22.626838: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10073 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:c4:00.0, compute capability: 7.5)
2021-03-15 12:08:22.668291: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-03-15 12:08:23.407609: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-03-15 12:08:23.531488: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2500025000 Hz
2021-03-15 12:08:24.419584: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2021-03-15 12:08:25.272119: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d (Conv2D)              (None, 26, 26, 32)        320
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 32)        0
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 11, 11, 64)        18496
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 64)          0
_________________________________________________________________
flatten (Flatten)            (None, 1600)              0
_________________________________________________________________
dropout (Dropout)            (None, 1600)              0
_________________________________________________________________
dense (Dense)                (None, 10)                16010
=================================================================
Total params: 34,826
Trainable params: 34,826
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/15
422/422 [==============================] - 6s 5ms/step - loss: 0.7651 - accuracy: 0.7647 - val_loss: 0.0860 - val_accuracy: 0.9765
Epoch 2/15
422/422 [==============================] - 1s 2ms/step - loss: 0.1198 - accuracy: 0.9634 - val_loss: 0.0576 - val_accuracy: 0.9842
Epoch 3/15
422/422 [==============================] - 1s 2ms/step - loss: 0.0892 - accuracy: 0.9728 - val_loss: 0.0464 - val_accuracy: 0.9875
Epoch 4/15
422/422 [==============================] - 1s 2ms/step - loss: 0.0716 - accuracy: 0.9773 - val_loss: 0.0420 - val_accuracy: 0.9878
Epoch 5/15
422/422 [==============================] - 1s 3ms/step - loss: 0.0609 - accuracy: 0.9815 - val_loss: 0.0380 - val_accuracy: 0.9903
Epoch 6/15
422/422 [==============================] - 1s 3ms/step - loss: 0.0585 - accuracy: 0.9819 - val_loss: 0.0390 - val_accuracy: 0.9888
Epoch 7/15
422/422 [==============================] - 1s 2ms/step - loss: 0.0474 - accuracy: 0.9853 - val_loss: 0.0394 - val_accuracy: 0.9885
Epoch 8/15
422/422 [==============================] - 1s 3ms/step - loss: 0.0493 - accuracy: 0.9850 - val_loss: 0.0334 - val_accuracy: 0.9907
Epoch 9/15
422/422 [==============================] - 1s 3ms/step - loss: 0.0417 - accuracy: 0.9870 - val_loss: 0.0322 - val_accuracy: 0.9912
Epoch 10/15
422/422 [==============================] - 1s 3ms/step - loss: 0.0364 - accuracy: 0.9883 - val_loss: 0.0318 - val_accuracy: 0.9918
Epoch 11/15
422/422 [==============================] - 1s 2ms/step - loss: 0.0396 - accuracy: 0.9883 - val_loss: 0.0297 - val_accuracy: 0.9913
Epoch 12/15
422/422 [==============================] - 1s 2ms/step - loss: 0.0372 - accuracy: 0.9881 - val_loss: 0.0284 - val_accuracy: 0.9910
Epoch 13/15
422/422 [==============================] - 1s 3ms/step - loss: 0.0344 - accuracy: 0.9891 - val_loss: 0.0305 - val_accuracy: 0.9913
Epoch 14/15
422/422 [==============================] - 3s 7ms/step - loss: 0.0324 - accuracy: 0.9896 - val_loss: 0.0284 - val_accuracy: 0.9910
Epoch 15/15
422/422 [==============================] - 1s 3ms/step - loss: 0.0299 - accuracy: 0.9899 - val_loss: 0.0277 - val_accuracy: 0.9923
Test loss: 0.02463490329682827
Test accuracy: 0.9919000267982483
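To pull the per-epoch validation accuracy out of an output file like this one, a regular expression over the log text is enough. This is a sketch that assumes the Keras progress-line format shown above; the function name `val_accuracies` is hypothetical, and the sample text is copied from the first two epochs of the output.

```python
import re

# Matches the end of the per-epoch summary lines, e.g. "... - val_accuracy: 0.9765".
VAL_ACC = re.compile(r"val_accuracy: (\d+\.\d+)")


def val_accuracies(log_text):
    """Return the validation accuracy reported at the end of each epoch, in order."""
    return [float(match) for match in VAL_ACC.findall(log_text)]


sample = """\
Epoch 1/15
422/422 [==============================] - 6s 5ms/step - loss: 0.7651 - accuracy: 0.7647 - val_loss: 0.0860 - val_accuracy: 0.9765
Epoch 2/15
422/422 [==============================] - 1s 2ms/step - loss: 0.1198 - accuracy: 0.9634 - val_loss: 0.0576 - val_accuracy: 0.9842
"""

accs = val_accuracies(sample)
print(accs)                                      # [0.9765, 0.9842]
print("best epoch:", accs.index(max(accs)) + 1)  # best epoch: 2
```

To use it on a real run, read the contents of the train-gpu output file into a string and pass that to `val_accuracies`.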

If you see errors or warnings about GPU and CUDA libraries not being found, it is likely that keras and tensorflow did not install correctly. Make sure you load the cuda module, then try removing and recreating the conda environment.

Google Colab

Google Colab lets you run Jupyter notebooks in the cloud on a CPU, or with GPU or TPU accelerators. We will use the free Colab tier, but if you need more time or specific hardware, the paid Colab Pro tier may be a good option.

Navigate to the Colab website and check out an example Jupyter notebook that uses a GPU for machine learning training.

An advantage of using Colab is that common machine learning packages like keras, tensorflow and xgboost are already installed with GPU support, so you do not need the Python environment setup that is required on on-premise systems like Sherlock.

Colab connects to your Google Drive, so you will need to upload your data and Jupyter notebooks to Google Drive to open them in the Colab environment. Once your notebook is in your Google Drive, right-click on it, select the Open with option, then select Google Colaboratory. If you do not see Google Colaboratory in the drop-down, you can connect to it (do this once and Colab should be available in the future).

Switch to using a GPU

After the notebook is open in Colab, it runs on the CPU by default. To run code on a GPU or TPU, we first need to select an accelerator.

Go to Edit -> Notebook settings. Then select GPU from the drop-down menu.

That’s it! The notebook can now run on GPU. You can run the code there for up to 12 hours. If the notebook is inactive, it will be disconnected from Colab. But you can simply reconnect to it if that happens.

Put data on Google Drive

Google Colab is connected to Google Drive, so all of your data and Jupyter notebooks need to be uploaded to Google Drive.

For example, suppose I create a new directory called projects in my Google Drive and copy my data into a subdirectory of it called data. I can then reference the data inside the Colab notebooks like so:

path_to_data = 'drive/MyDrive/projects/data'
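For a relative path like this to resolve, Google Drive has to be mounted inside the Colab runtime first. The sketch below uses the `google.colab.drive.mount` helper and assumes the usual Colab layout: a working directory of /content and a mount point of /content/drive. The function names `mount_drive` and `data_dir` are hypothetical, and outside Colab the script simply reports that nothing was mounted.

```python
from pathlib import Path


def mount_drive(mount_point="/content/drive"):
    """Mount Google Drive when running in Colab; return False anywhere else."""
    try:
        from google.colab import drive  # only importable inside a Colab runtime
    except ImportError:
        return False
    drive.mount(mount_point)  # asks for authorization on first use
    return True


def data_dir(working_dir="/content"):
    """Resolve the relative data path against the notebook's working directory."""
    return Path(working_dir) / "drive" / "MyDrive" / "projects" / "data"


if __name__ == "__main__":
    if mount_drive():
        path_to_data = data_dir()
        print(path_to_data, "exists:", path_to_data.exists())
    else:
        print("Not running in Colab; nothing mounted")
```

Checking that the directory exists right after mounting catches path typos before any long-running cell depends on the data.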

Run the notebook

Now we are ready to run the notebook cell by cell. Make sure the paths where you are saving the outputs and reading in data are set correctly.