Intermediate Yens
4. Scheduler
What is a scheduler?
Unlike the interactive yens (yen1, yen2, yen3, yen4 or yen5), you do not log in directly to the yen-slurm cluster. Instead, you access it through the Slurm workload manager, also known as a job scheduler or batch scheduler. Researchers submit jobs to the scheduler, asking for a certain amount of resources (CPU cores, memory, and time). Slurm then manages the queue of jobs based on what resources are available. In general, jobs that request fewer resources start sooner than jobs that request more.
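As a rough illustration of what asking for resources can look like, here is a minimal sketch of a submission script. The job name and the resource values are hypothetical placeholders, not site recommendations; the `#SBATCH` options shown are standard Slurm options.

```shell
#!/bin/bash
# Hypothetical submission script; the name and values below are assumptions.
# Lines starting with #SBATCH are read by Slurm and ignored by bash.

# Name shown in the NAME column of squeue
#SBATCH --job-name=my_job
# Partition to submit to
#SBATCH --partition=normal
# CPU cores requested
#SBATCH --cpus-per-task=1
# Memory requested
#SBATCH --mem=4G
# Time limit (hh:mm:ss)
#SBATCH --time=01:00:00

# SLURM_JOB_ID is set by Slurm inside a job; use a label when run elsewhere.
msg="Job ${SLURM_JOB_ID:-not-in-slurm} running on $(hostname)"
echo "$msg"
```

Assuming the script is saved as my_job.slurm (a hypothetical file name), it would be submitted with `sbatch my_job.slurm`.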
Why use a scheduler?
A job scheduler has many advantages over the directly shared environment of the interactive yens:
- Run jobs with a guaranteed amount of resources (CPU cores, memory, time)
- Set up multiple jobs to run automatically
- Run jobs that exceed the community guidelines on the interactive nodes
- Gold standard for using high performance computing resources around the world
Queue of jobs
See all of the jobs in the yen-slurm queue with:
$ squeue
You will see a list of currently running jobs and a queue of pending jobs. Your job will run based on this queue.
JOBID  PARTITION  NAME   USER   ST  TIME     NODES  NODELIST(REASON)
1043   normal     a_job  user1  PD  0:00     1      (Resources)
1042   normal     job_2  user2  R   1:29:53  1      yen10
1041   normal     bash   user3  R   3:17:08  1      yen11
1044   dev        bash   user3  R   1:00:08  1      yen12
- JOBID lists a unique numeric job ID for this job.
- PARTITION lists the partition the job is submitted to (normal, dev or long).
- NAME lists the job name that the user specified in the submission script (if no name is supplied, the name of the submission batch script is used). Job names do not have to be unique.
- USER indicates the yen user who submitted the job.
- ST lists the job state. R means the job is running and PD means the job is pending in the queue.
- TIME lists the time the job has been running. Pending jobs will have time 0:00 until they start running.
- NODES lists how many different machines or nodes the job is running on (1 means the job is running on one of yen10, yen11, yen12, yen13 or yen14, 2 means the job is running on two nodes, and so on).
- NODELIST(REASON) lists the hostname of the node that the job is running on (yen10, yen11, yen12, yen13 or yen14). For pending jobs, you will see the reason why the job has not started yet. Common reasons are (Resources), when the job is waiting for resources such as CPU cores or memory to become available before it can start, and (Priority), when the resources are available but the job is lower in priority than other jobs in the queue.
Filtering this command for your user will display only your running and queued jobs:
$ squeue -u $USER
where $USER is your SUNet ID.
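To get a quick tally of how many jobs are running or pending, the squeue output can be summarized with a small helper. This is only a sketch: `count_job_states` is a hypothetical function name, and it assumes the default column layout shown earlier (state in the fifth column) and job names without spaces.

```shell
# Count jobs per state from squeue's default output, read on stdin.
# Assumes the state (ST) is the 5th column and job names contain no spaces.
count_job_states() {
    awk 'NR > 1 { count[$5]++ } END { for (s in count) print s, count[s] }' | sort
}

# On the cluster (not run here): squeue -u "$USER" | count_job_states
```

Applied to the four-job sample listing shown earlier, this would report one PD job and three R jobs.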
How do I check how busy the machines are?
You can pass format options to the sinfo command as follows:
$ sinfo --format="%m | %C"
MEMORY | CPUS(A/I/O/T)
1031612+ | 86/394/0/480
where MEMORY outputs the (minimum) size of memory per node in megabytes (the minimum node memory is 1T) and CPUS(A/I/O/T) prints the number of CPU cores that are allocated / idle / other / total. For example, if you see 86/394/0/480, that means 86 CPU cores are allocated and 394 are idle (free) out of 480 cores total (other should always be 0 unless a node is down for maintenance).
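The allocated / idle / other / total string can be turned into a single "how free is the cluster" number. The following is a minimal sketch; `cpu_idle_percent` is a hypothetical helper name, and it simply divides the idle count by the total.

```shell
# Print the percentage of idle CPU cores from an A/I/O/T string.
cpu_idle_percent() {
    printf '%s\n' "$1" | awk -F/ '$4 > 0 { printf "%.1f\n", 100 * $2 / $4 }'
}

cpu_idle_percent "86/394/0/480"   # prints 82.1
# On the cluster (not run here): cpu_idle_percent "$(sinfo -h -o '%C' | head -n 1)"
```

Here 394 idle cores out of 480 works out to about 82.1% idle.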
We can also add formatting options to the squeue command:
$ squeue -o "%.18i %.9P %.8j %.8u %.8T %.10M %.10l %.4C %.7m %.15R"
JOBID     PARTITION  NAME       USER   STATE    TIME     TIME_LIMIT  CPUS  MIN_MEM  NODELIST(REASON)
157022    normal     job1       user1  PENDING  0:00     2-00:00:00  30    1T       (Resources)
155217    normal     job2       user2  RUNNING  8:59:39  1-00:00:00  4     80G      yen10
157027    long       job3       user3  RUNNING  7:30:34  7-00:00:00  4     70G      yen11
157026_1  normal     job_array  user3  RUNNING  4:11     4:00:00     4     70G      yen12
Here the format string selects which columns to display and adds columns for the time limit, CPU cores and minimum memory that each job requested.
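The same format codes can feed a quick estimate of how many CPU cores the pending jobs are waiting for. This is a sketch under assumptions: `pending_cpu_demand` is a hypothetical helper, and it expects one "STATE CPUS" pair per line, as produced by the `%T` and `%C` codes used above together with squeue's `-h` (no header) option.

```shell
# Sum the CPU cores requested by pending jobs.
# Expects "STATE CPUS" pairs on stdin, one job per line.
pending_cpu_demand() {
    awk '$1 == "PENDING" { total += $2 } END { print total + 0 }'
}

# On the cluster (not run here): squeue -h -o "%T %C" | pending_cpu_demand
```

Comparing this number with the idle core count from sinfo gives a rough sense of how contended the cluster is.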
When will my job start?
You can ask the scheduler using squeue --start and look at the START_TIME column.
$ squeue --start
JOBID  PARTITION  NAME      USER     ST  START_TIME           NODES  SCHEDNODES  NODELIST(REASON)
112    normal     yahtzeem  astorer  PD  2020-03-05T14:17:40  1      yen10       (Resources)
113    normal     yahtzeem  astorer  PD  2020-03-05T14:27:00  1      yen10       (Priority)
114    normal     yahtzeem  astorer  PD  2020-03-05T14:37:00  1      yen10       (Priority)
115    normal     yahtzeem  astorer  PD  2020-03-05T14:47:00  1      yen10       (Priority)
116    normal     yahtzeem  astorer  PD  2020-03-05T14:57:00  1      yen10       (Priority)
117    normal     yahtzeem  astorer  PD  2020-03-05T15:07:00  1      yen10       (Priority)
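When the queue is long, it can help to pull out the start time for a single job. A minimal sketch follows; `job_start_time` is a hypothetical helper that assumes START_TIME is the sixth column, as in the listing above, and that job names contain no spaces.

```shell
# Print the START_TIME for one job ID from `squeue --start` output on stdin.
# Assumes START_TIME is the 6th column and job names contain no spaces.
job_start_time() {
    awk -v id="$1" 'NR > 1 && $1 == id { print $6 }'
}

# On the cluster (not run here): squeue --start -u "$USER" | job_start_time 113
```

For the sample listing above, job 113 would yield 2020-03-05T14:27:00.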