11. Submit Jobs to the Scheduler
Serial Run via Slurm
Once the script runs correctly on the interactive Yen command line, we need to create a slurm submission script that will run our python script in batch mode.
The slurm submission script has two major components:
- Metadata around your job, and the resources the job needs such as time, cores and RAM
- The commands necessary to run your process
To run our python script, we can use the following slurm submission script, ocr-serial.slurm. Use vi to edit this file to include your email address:
    #!/bin/bash

    # Example of running python script in a batch mode

    #SBATCH -J ocr
    #SBATCH -p normal
    #SBATCH -c 1                              # CPU cores (up to 256)
    #SBATCH -t 30:00
    #SBATCH -o ocr-serial-%j.out
    #SBATCH --mem=10G
    #SBATCH --mail-type=ALL
    #SBATCH --mail-user=email@example.com

    # For safety, we deactivate any conda env that might be activated on interactive yens before submission and purge all loaded modules
    source deactivate
    module purge

    # Load software
    module load anaconda3

    # Activate the environment
    source activate /zfs/gsb/intermediate-yens/conda/ocr

    # Run python script
    python ocr-serial.py
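If you would rather not open vi, the address can also be set non-interactively. The sketch below works on a throwaway file with a minimal header, and assumes the address is supplied through the --mail-user option described in this section. The set_mail_user helper, the placeholder header and the sample address are made up for illustration, and the in-place flag assumes GNU sed as found on Linux.

```shell
#!/bin/bash
# Hypothetical helper: rewrite the --mail-user line of a slurm script in place.
set_mail_user() {
    local script="$1" address="$2"
    # Replace whatever follows --mail-user= with the new address (GNU sed -i).
    sed -i "s|^#SBATCH --mail-user=.*|#SBATCH --mail-user=${address}|" "$script"
}

# Demonstrate on a temporary file rather than the real ocr-serial.slurm.
demo=$(mktemp)
printf '%s\n' '#!/bin/bash' '#SBATCH -J ocr' '#SBATCH --mail-user=placeholder' > "$demo"
set_mail_user "$demo" "jane@example.com"

# Show the edited line.
grep -- '--mail-user' "$demo"
```

On the cluster the same helper would be pointed at ocr-serial.slurm with your own address.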
The #SBATCH lines are how a user passes options to the slurm scheduler, such as the number of CPU cores and the script run time.
- -J is the name of the job that will appear in the queue of jobs
- -p is the partition (with normal being the default) - choose
- -c is the number of CPU cores (note that cores are hyperthreaded, so requesting 1 or 2 “cores” in the slurm script will result in allocation of 1 or 2 hyperthreads and you will see that 2 CPUs were allocated; requesting 3 or 4 “cores” will allocate 4 CPUs)
- -t is the amount of time that your job is requesting (up to 2 days on normal, up to 2 hours on dev and up to 7 days on
- --mem is the total amount of memory for your job (each CPU core has 4 G of RAM)
- -o specifies the name of the output file. Since the job will run in batch mode, nothing will be printed to the screen as the job is running. Instead, all of the standard output, such as print statements in your script, will be written to a text file. We can optionally add %j to the name of the output text file so that a new output file is generated for each submitted job. If we do not include %j, the output file will be overwritten each time you submit the same slurm script.
- --mail-type specifies when you want to get slurm generated emails (such as when the job starts, finishes and crashes). ALL will send an email every time the status of the job changes.
- --mail-user is where you add your email address if you want email notifications for this job.
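The hyperthreading rule for -c can be sketched as rounding the request up to the next even number. This generalizes from the two cases stated above (1 or 2 gives 2 CPUs, 3 or 4 gives 4 CPUs) and is an assumption for larger requests. The allocated_cpus function is a hypothetical helper, not a slurm command.

```shell
#!/bin/bash
# Hypothetical helper: CPUs reported as allocated for a given -c request,
# assuming requests are rounded up to a whole pair of hyperthreads.
allocated_cpus() {
    local requested="$1"
    # Integer division drops the remainder, so odd values round up to even.
    echo $(( (requested + 1) / 2 * 2 ))
}

for c in 1 2 3 4; do
    echo "-c $c -> $(allocated_cpus "$c") CPUs allocated"
done
```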
After the slurm options, we tell the scheduler what software and scripts we want to run. Similarly to running software on the interactive yens, we first need to load the appropriate module such as the anaconda module and activate the environment where we installed all the python packages before running the python script.
The last line in the script is what actually runs the code.
Let’s submit the slurm script to the Yen-Slurm server by running:
$ sbatch ocr-serial.slurm
You should see a similar output:
Submitted batch job 40293
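For later steps it can help to keep the job ID in a shell variable. Below is a minimal sketch that pulls it out of the confirmation line. The message is hard-coded so the snippet runs anywhere; on the cluster it would come from the sbatch command itself, as noted in the comment.

```shell
#!/bin/bash
# Stand-in for: msg=$(sbatch ocr-serial.slurm)
msg="Submitted batch job 40293"

# The job ID is the last space-separated word of the message.
jobid="${msg##* }"

# Name of the file that "-o ocr-serial-%j.out" produces for this job,
# since slurm replaces %j with the job ID.
outfile="ocr-serial-${jobid}.out"

echo "job ID: ${jobid}"
echo "output file: ${outfile}"
```

sbatch also has a --parsable option that prints only the job ID, which avoids the string handling if you prefer.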
After you submit the job, let’s look at the queue of jobs:
$ squeue
JOBID  PARTITION  NAME  USER      ST  TIME  NODES  NODELIST(REASON)
40293  normal     ocr   nrapstin  R   0:02  1      yen10
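The ST column holds the job state (R means running). As a rough sketch, the state of a single job can be picked out of squeue-style text with awk. The sample text below is copied from the listing in this section so the snippet runs without a scheduler, and job_state is a hypothetical helper name.

```shell
#!/bin/bash
# Sample text copied from the squeue listing; on the cluster, pipe squeue itself.
sample='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
40293 normal ocr nrapstin R 0:02 1 yen10'

# Hypothetical helper: read squeue-style text on stdin and print the
# ST (state) column for the job ID given as the first argument.
job_state() {
    awk -v id="$1" 'NR > 1 && $1 == id { print $5 }'
}

state=$(printf '%s\n' "$sample" | job_state 40293)
echo "job 40293 state: ${state}"
```

On the cluster, the equivalent call would be squeue piped into job_state with your own job ID.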
You can follow the output file while the job is running to see the print statements on the screen as they are written:
$ tail -f ocr-serial-*.out
After the job is finished, look at the output:
$ cat ocr-serial-*.out
You should see:
100%|██████████| 10/10 [03:59<00:00, 23.97s/it]
image number to process: 1000
processed 100 images out of 1000
processed 200 images out of 1000
processed 300 images out of 1000
processed 400 images out of 1000
processed 500 images out of 1000
processed 600 images out of 1000
processed 700 images out of 1000
processed 800 images out of 1000
processed 900 images out of 1000
processed 1000 images out of 1000
running with 1 core(s) took: 239.79051423072815 seconds
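The reported timing gives a rough basis for choosing -t on future runs. The sketch below extracts the elapsed seconds from the last line of the output shown above and works out the average time per image. The line is hard-coded here; on the cluster it could be read from the .out file instead.

```shell
#!/bin/bash
# Last line of the job output, and the image count reported in it.
line="running with 1 core(s) took: 239.79051423072815 seconds"
images=1000

# The elapsed time is the field just before the word "seconds".
elapsed=$(echo "$line" | awk '{ print $(NF-1) }')

# Average time per image, rounded to three decimals.
per_image=$(awk -v t="$elapsed" -v n="$images" 'BEGIN { printf "%.3f", t / n }')
echo "about ${per_image} s per image"
```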
Check that the resulting csv table and the number of rows are as expected.
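One way to check the row count from the shell is to count the lines after the header. The name of the csv file written by ocr-serial.py is not given here, so the sketch below builds a small placeholder file with made-up columns. The csv_rows helper is hypothetical and assumes that the file has a single header line and that no field contains an embedded newline.

```shell
#!/bin/bash
# Hypothetical helper: number of data rows in a csv, excluding the header line.
csv_rows() {
    tail -n +2 "$1" | wc -l | tr -d ' '
}

# Placeholder csv standing in for the real result table.
demo_csv=$(mktemp)
printf '%s\n' 'image,text' 'a.png,hello' 'b.png,world' > "$demo_csv"

rows=$(csv_rows "$demo_csv")
echo "data rows: ${rows}"
```

For the run above you would expect one row per processed image. If the recognized text can span several lines within a field, a csv-aware tool such as python's csv module is a safer way to count.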
Look at the slurm emails that you received from your job (one when the job starts and another when the job finishes). Slurm emails are useful because they report the exit status of the job (Completed with exit code 0 is normal; any other code means the job failed), the time spent in the queue, the cores requested, and the CPU and memory efficiency.
If you made a mistake and want to cancel the job, you can do so with the scancel command. The scancel JOBID command will cancel your job. You can find the unique numeric JOBID of your job in the first column of the squeue output. You can also cancel all of your running and pending jobs with scancel -u $USER, where $USER is your SUNet ID.