11. Submit Jobs to the Scheduler

Serial Run via Slurm

Once the process runs successfully on the interactive Yen command line, we need to create a slurm submission script that will run our python script in batch mode.

The slurm submission script has two major components:

  • Metadata about your job and the resources it needs, such as time, cores, and RAM
  • The commands necessary to run your process

To run our python script, we can use the following slurm submission script, ocr-serial.slurm. Use vi to edit this file to include your email address:

#!/bin/bash

# Example of running python script in a batch mode

#SBATCH -J ocr
#SBATCH -p normal
#SBATCH -c 1                           # CPU cores (up to 256)
#SBATCH -t 30:00
#SBATCH -o ocr-serial-%j.out
#SBATCH --mem=10G
#SBATCH --mail-type=ALL
#SBATCH --mail-user=your_email@stanford.edu

# For safety, we deactivate any conda env that might be activated on interactive yens before submission and purge all loaded modules
source deactivate
module purge

# Load software
module load anaconda3

# Activate the environment
source activate /zfs/gsb/intermediate-yens/conda/ocr

# Run python script
python ocr-serial.py

The #SBATCH options are how a user supplies arguments to the slurm scheduler, such as the number of CPU cores and the script run time.

  • -J is the name of the job as it will appear in the queue of jobs
  • -p is the partition (with normal being the default) - choose the normal or dev partition
  • -c is the number of CPU cores (note that cores are hyperthreaded, so requesting 1 or 2 “cores” in the slurm script will result in the allocation of 1 or 2 hyperthreads and you will see that 2 CPUs were allocated; requesting 3 or 4 “cores” will allocate 4 CPUs)
  • -t is the amount of time that your job is requesting (up to 2 days on normal, up to 2 hours on dev, and up to 7 days on long)
  • --mem is the total amount of memory for your job (each CPU core has 4 GB of RAM)
  • -o specifies the name of the output file. Since the job runs in batch mode, nothing is printed to the screen while it runs; instead, all standard output, such as print statements in your script, is appended to this text file. We can optionally add %j to the name of the output file so that each submitted job generates a new output file. If we do not include %j, the output file is overwritten each time you submit the same slurm script.
  • --mail-type specifies when you want to get slurm-generated emails (such as when the job starts, finishes, or fails). ALL will send an email every time the status of the job changes.
  • --mail-user is where you add your email address if you want email notifications for this job.
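For quick test runs, the same options can target the dev partition with a shorter time limit. A minimal sketch of an alternative header (the job name and output file name here are hypothetical, not from the example above):

```shell
#!/bin/bash

#SBATCH -J ocr-dev
#SBATCH -p dev                  # dev partition: up to 2 hours
#SBATCH -c 1
#SBATCH -t 10:00                # 10 minutes is plenty for a short test
#SBATCH -o ocr-dev-%j.out
#SBATCH --mem=10G
```

The rest of the script (module load, environment activation, running the python script) would stay the same.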

After the slurm options, we tell the scheduler what software and scripts we want to run. Just as on the interactive yens, we first load the appropriate module (here, the anaconda module) and activate the environment where we installed all the python packages, and only then run the python script.

The last line in the script is what actually runs the code.

Let’s submit the slurm script to the Yen-Slurm server by running the sbatch command:

$ sbatch ocr-serial.slurm

You should see output similar to:

Submitted batch job 40293

After you submit the job, let’s look at the queue of jobs:

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             40293    normal      ocr nrapstin  R       0:02      1 yen10
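On a busy cluster the full queue can be long. squeue supports filtering, for example showing only your own jobs:

```shell
# Show only jobs belonging to the current user
# ($USER expands to your SUNet ID on the Yens)
squeue -u $USER
```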

You can monitor the output file while the job is running to stream the print statements to the screen as they are written:

$ tail -f ocr-serial-*.out

After the job is finished, look at the output:

$ cat ocr-serial-*.out

You should see:

100%|██████████| 10/10 [03:59<00:00, 23.97s/it]
image number to process:  1000
processed 100 images out of 1000
processed 200 images out of 1000
processed 300 images out of 1000
processed 400 images out of 1000
processed 500 images out of 1000
processed 600 images out of 1000
processed 700 images out of 1000
processed 800 images out of 1000
processed 900 images out of 1000
processed 1000 images out of 1000
running with 1 core(s) took: 239.79051423072815 seconds

Check that the resulting csv table and the number of rows are as expected.
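A quick way to check the row count is wc -l. The CSV file name below is hypothetical; substitute whatever file your script writes. For illustration, this sketch creates a small stand-in file so the arithmetic is easy to follow:

```shell
# Stand-in CSV with a header line plus 3 data rows
# (on the Yens, point csv= at the file your script actually wrote)
csv=ocr-results.csv
printf 'image,text\n1,a\n2,b\n3,c\n' > "$csv"

# Data rows = total lines minus the header line
rows=$(( $(wc -l < "$csv") - 1 ))
echo "$rows data rows"   # prints "3 data rows" for the stand-in file
```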

Look at the slurm emails that you received from your job (one when the job starts and another when it finishes). Slurm emails are useful because they detail the exit status of the job (Completed with Exit Code 0 is normal; any other code means the job failed), time spent in the queue, cores requested, and CPU and memory efficiency.
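If you do not want to wait for the email, many Slurm installations also provide the seff utility (its availability on the Yens is an assumption), which reports the same efficiency numbers for a finished job:

```shell
# Report CPU and memory efficiency for a completed job
# (replace 40293 with your own job ID from sbatch/squeue)
seff 40293
```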

Cancelling jobs

If you made a mistake and want to cancel a job, you can do so with the scancel command. Running scancel JOBID will cancel your job; you can find the unique numeric JOBID of your job with squeue.

You can also cancel all of your running and pending jobs with scancel -u $USER where $USER is your SUNet ID.