Intermediate Yens
3. Yen-Slurm Architecture
Yen-Slurm comprises five computing servers (yen10, yen11, yen12, yen13 and yen14) offered by the Stanford Graduate School of Business.
It is designed to give researchers the ability to run computations that require a large amount of resources without leaving the environment and filesystem of the interactive Yens.
Log in to the yens
We will use the command line to log in to the yens and submit jobs. Open a terminal if you are on a Mac, or a Git Bash window if you are on Windows. Enter the ssh (secure shell) command shown below to log in to the yen cluster (note: replace <SUNetID> with your SUNet ID).
$ ssh <SUNetID>@yen.stanford.edu
Enter your password and authenticate with Duo.
Once you log in, you will be on one of the interactive yens and should see yen1, yen2, yen3, yen4 or yen5 in the command prompt, indicating which server you are on.
Current partitions and their limits
Yen-Slurm has 480 total CPUs and 8.5 TB of RAM.
We divide the servers into three logical partitions, each designed for a specific purpose: normal, dev and long.
Run the sinfo command to see the available partitions:
$ sinfo
You should see the following output:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal* up 2-00:00:00 2 mix yen[10-14]
dev up 2:00:00 2 mix yen[10-14]
long up 7-00:00:00 2 mix yen[10-14]
The first column, PARTITION, lists all available partitions. The default partition is marked with *, so normal is the default.
- For “production” jobs on the normal partition, users can request up to 2 days of running time.
- The dev partition is meant for development and debugging work. It has 2 cores, available for up to 2 hours. Using those 2 cores, you can test your environment, such as package imports and paths to data, and make sure the script starts running as you expect. Once you know that the script will run, you can move your job to the normal partition and run it to completion.
- For long-running jobs, users can request up to 7 days on the long partition.
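The dev-then-normal workflow described in the list above could look roughly like the submission script below. This is a minimal sketch, not an official template: the job name, the output file name and the body of the script are hypothetical placeholders, and only standard sbatch options are used.

```shell
#!/bin/bash
# Hypothetical smoke-test job for the dev partition.
# The job name, output file name and commands below are placeholders.
#SBATCH --job-name=env-test
#SBATCH --partition=dev
#SBATCH --time=00:10:00
#SBATCH --cpus-per-task=1
#SBATCH --output=env-test-%j.out

# Record which node the job landed on.
NODE="$(uname -n)"
echo "Running on ${NODE}"

# Replace this line with a quick check of your own environment,
# for example importing packages or listing an input directory.
echo "Environment check finished"
```

Assuming the script is saved as env-test.slurm (a made-up file name), it would be submitted with `sbatch env-test.slurm`. Once the test passes, the partition and time limit can be changed in the script, or overridden on the command line, for example `sbatch --partition=normal --time=1-00:00:00 env-test.slurm`.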
Cores and memory are not shared among users: once a job is allocated certain cores and RAM, they are unavailable to other users until that job finishes. So make sure you use what you request!
Partition | CPU core limit | Memory allocated | Time limit (default) |
---|---|---|---|
normal | 256 | 4 GB per CPU core | 2 days (2 hours) |
dev | 2 | 4 GB per CPU core | 2 hours (1 hour) |
long | 32 | 4 GB per CPU core | 7 days (2 hours) |
When you request a certain number of cores, memory is allocated based on that number. By default, the allocation is 4 GB per core. If you need more RAM, you can request additional memory in your submission script, but extra cores will then be allocated to your job in proportion to the RAM requested, and those cores will be unavailable to other users.
For example, if you request 1 core but 40 GB of RAM, your job will be allocated 10 cores and the requested 40 GB of RAM, meaning that these 10 cores are unavailable to other users even though your script might be using only a single core.
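The arithmetic behind that example can be sketched as a small shell function. It assumes the 4 GB per core default from the table and assumes that a memory request is rounded up to whole cores; the function name cores_blocked is made up for illustration, and the scheduler's exact accounting may differ.

```shell
#!/bin/bash
# cores_blocked CORES MEM_GB
# Print how many cores a job would tie up, assuming 4 GB of RAM per core
# and that memory requests are rounded up to whole cores (an assumption).
cores_blocked() {
  local cores=$1 mem_gb=$2 gb_per_core=4
  # Cores needed to cover the memory request, rounded up.
  local by_mem=$(( (mem_gb + gb_per_core - 1) / gb_per_core ))
  # The job holds whichever count is larger.
  if (( by_mem > cores )); then
    echo "$by_mem"
  else
    echo "$cores"
  fi
}

cores_blocked 1 40   # the example above: prints 10
cores_blocked 8 16   # memory fits within the 8 requested cores: prints 8
```

Running a quick check like this before submitting helps show whether a memory request, rather than the core request, determines how much of the cluster a job occupies.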
If a two-day time limit is not enough for your job, start it on the long partition (up to 7 days) and save intermediate results as you go.