Intermediate Yens
11. Submit Jobs to the Scheduler
Serial Run via Slurm
Once the process runs successfully on the interactive Yen command line, we need to create a Slurm submission script that will run our Python script.
The Slurm submission script has two major components:
- Metadata about your job and the resources it needs, such as time, cores, and RAM
- The commands necessary to run your process
To run our Python script, we can use the following Slurm submission script, `ocr-serial.slurm`. Use `vi` to edit this file to include your email address:
#!/bin/bash
# Example of running python script in a batch mode
#SBATCH -J ocr
#SBATCH -p normal
#SBATCH -c 1 # CPU cores (up to 256)
#SBATCH -t 30:00
#SBATCH -o ocr-serial-%j.out
#SBATCH --mem=10G
#SBATCH --mail-type=ALL
#SBATCH --mail-user=your_email@stanford.edu
# For safety, deactivate any conda env that might be active on the interactive Yens and purge all loaded modules before submission
source deactivate
module purge
# Load software
module load anaconda3
# Activate the environment
source activate /zfs/gsb/intermediate-yens/conda/ocr
# Run python script
python ocr-serial.py
The `#SBATCH` options are how a user supplies arguments to the Slurm scheduler, such as the number of CPU cores and the script's run time:
- `-J` is the name of the job that will appear in the queue of jobs
- `-p` is the partition (`normal` being the default) - choose the `normal` or `dev` partition
- `-c` is the number of CPU cores (note that cores are hyperthreaded, so requesting 1 or 2 "cores" in the Slurm script results in an allocation of 1 or 2 hyperthreads and you will see that 2 CPUs were allocated; requesting 3 or 4 "cores" will allocate 4 CPUs)
- `-t` is the amount of time that your job is requesting (up to 2 days on `normal`, up to 2 hours on `dev`, and up to 7 days on `long`)
- `--mem` is the amount of total memory for your job (each CPU core has 4 G of RAM)
- `-o` specifies the name of the output file. Since the job runs in batch mode, nothing is printed to the screen while the job is running; instead, all of the standard output, such as print statements in your script, is appended to this text file. We can optionally add `%j` to the name of the output file so that each submitted job generates a new output file. If we do not include `%j`, the output file is overwritten each time you submit the same Slurm script.
- `--mail-type` specifies when you want to get Slurm-generated emails (such as when the job starts, finishes, or crashes). `ALL` will send an email every time the status of the job changes.
- `--mail-user` is where you add your email address if you want email notifications for this job.
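To make the `%j` behavior concrete, here is a small sketch of the substitution Slurm performs on the `-o` pattern when it creates the output file (40293 is a made-up job ID used for illustration):

```shell
# Simulate Slurm's expansion of %j in the -o output-file pattern.
pattern="ocr-serial-%j.out"
jobid=40293   # hypothetical job ID assigned by the scheduler
outfile=$(echo "$pattern" | sed "s/%j/$jobid/")
echo "$outfile"   # ocr-serial-40293.out
```

This is why each submission with `%j` lands in its own file: the job ID is unique per job.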
After the Slurm options, we tell the scheduler what software and scripts we want to run. As when running software on the interactive Yens, we first load the appropriate module (here, the Anaconda module) and activate the environment where we installed all the Python packages before running the Python script. The last line in the script is what actually runs the code.
Let's submit the Slurm script to the Yen-Slurm server with the `sbatch` command:
$ sbatch ocr-serial.slurm
You should see a similar output:
Submitted batch job 40293
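If you want to capture the job ID for use in a script, you can parse it out of this message (a sketch; the literal string below stands in for the real `sbatch` output you would capture on the Yens):

```shell
# On the Yens you would run:  output=$(sbatch ocr-serial.slurm)
# Here we use a stand-in string with the same "Submitted batch job NNN" format.
output="Submitted batch job 40293"
jobid=$(echo "$output" | awk '{print $4}')   # job ID is the 4th field
echo "$jobid"   # 40293
```

Alternatively, `sbatch --parsable ocr-serial.slurm` prints just the job ID with no surrounding text.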
After you submit the job, let's look at the queue of jobs:
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
40293 normal ocr nrapstin R 0:02 1 yen10
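The `ST` column shows the job's state; `R` above means the job is running. A small sketch decoding the state codes you are most likely to see (these are standard Slurm job state codes):

```shell
# Decode common squeue ST (state) codes.
state_name() {
  case "$1" in
    PD) echo "Pending (waiting in the queue for resources)" ;;
    R)  echo "Running" ;;
    CG) echo "Completing" ;;
    *)  echo "Other: $1" ;;
  esac
}
state_name R    # Running
state_name PD   # Pending (waiting in the queue for resources)
```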
You can monitor the .out file while the job is running to see print statements appear on the screen as the job writes them:
$ tail -f ocr-serial-*.out
After the job is finished, look at the output:
$ cat ocr-serial-*.out
You should see:
100%|██████████| 10/10 [03:59<00:00, 23.97s/it]
image number to process: 1000
processed 100 images out of 1000
processed 200 images out of 1000
processed 300 images out of 1000
processed 400 images out of 1000
processed 500 images out of 1000
processed 600 images out of 1000
processed 700 images out of 1000
processed 800 images out of 1000
processed 900 images out of 1000
processed 1000 images out of 1000
running with 1 core(s) took: 239.79051423072815 seconds
Check that the resulting csv table and the number of rows are as expected.
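One quick way to check the row count is with `wc -l`, remembering to subtract the header row. The file name and contents below are toy stand-ins; use whatever file your `ocr-serial.py` actually writes:

```shell
# Create a toy CSV standing in for the real OCR output.
printf 'image,text\nimg1.png,hello\nimg2.png,world\n' > toy-results.csv
# Count data rows: total lines minus the header line.
rows=$(( $(wc -l < toy-results.csv) - 1 ))
echo "data rows: $rows"   # data rows: 2
```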
Look at the Slurm emails that you received from your job (one when the job starts and another when it finishes). Slurm emails are useful because they detail the exit status of the job (Completed with exit code 0 is normal; any other code means the job failed), time spent in the queue, cores requested, and CPU and memory efficiency.
Cancelling jobs
If you made a mistake and want to cancel a job, you can do so with the `scancel` command. Running `scancel JOBID` cancels your job; you can find the unique numeric `JOBID` of your job with `squeue`. You can also cancel all of your running and pending jobs with `scancel -u $USER`, where `$USER` is your SUNet ID.