3. Yen-Slurm Architecture

Yen-Slurm is a computing cluster of five servers (yen10, yen11, yen12, yen13 and yen14) offered by the Stanford Graduate School of Business. It is designed to let researchers run computations that require a large amount of resources without leaving the environment and filesystem of the interactive Yens.

Log in to the yens

We will use the command line to log in to the yens and submit jobs. Open a terminal if you are on a Mac, or a Git Bash window if you are on Windows. Enter the ssh (secure shell) command below to log in to the yen cluster (Note: replace <SUNetID> with your SUNet ID!).

$ ssh <SUNetID>@yen.stanford.edu

Enter your password and authenticate with Duo.

Once you log in, you will be on one of the interactive yens. The command prompt shows yen1, yen2, yen3, yen4 or yen5, indicating which server you are on.

Current cluster configuration

Current partitions and their limits

Yen-Slurm has 480 total CPUs and 8.5 TB of RAM.

We divide the cluster into three logical partitions, each designed for a specific purpose: normal, dev and long.

Run the sinfo command to see the available partitions:

$ sinfo

You should see the following output:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up 2-00:00:00      2    mix yen[10-14]
dev          up    2:00:00      2    mix yen[10-14]
long         up 7-00:00:00      2    mix yen[10-14]

The first column, PARTITION, lists all available partitions. The default partition is marked with *, so normal is the default.

  • The normal partition is for “production” jobs; users can request up to 2 days of running time.
  • The dev partition is meant for development and debugging work. It offers 2 cores for up to 2 hours. Use them to test your environment (for example, that packages import and that paths to data resolve) and to confirm that the script starts running as you expect. Once you know the script will run, move your job to the normal partition and run it to completion.
  • For long-running jobs, users can request up to 7 days on the long partition.
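The dev-partition workflow described above can be sketched as a small submission script. This is only an illustration under stated assumptions: the job name, the output file name and the DATA_DIR variable are hypothetical, and the checks in the body are placeholders for whatever your real script needs. The #SBATCH lines use standard Slurm options; when the file is run directly with bash they are treated as ordinary comments, so the script can also be tried outside Slurm.

```shell
#!/bin/bash
# Hypothetical smoke-test job for the dev partition (names are illustrative).
#SBATCH --job-name=dev-smoke-test
#SBATCH --partition=dev
#SBATCH --time=00:30:00
#SBATCH --cpus-per-task=2
#SBATCH --output=dev-smoke-test-%j.out

# Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is given;
# fall back to 1 when the script is run outside Slurm.
NCPUS=${SLURM_CPUS_PER_TASK:-1}
echo "Running on $(hostname) with $NCPUS core(s)"

# Assumption: DATA_DIR points at your input data; defaults to the current directory.
DATA_DIR=${DATA_DIR:-$PWD}
if [ -d "$DATA_DIR" ]; then
    echo "Data directory found: $DATA_DIR"
else
    echo "Data directory missing: $DATA_DIR" >&2
    exit 1
fi
```

Assuming the file is saved as dev_test.slurm (a made-up name), it would be submitted with sbatch dev_test.slurm. Once it succeeds, changing --partition to normal and raising --time is one way to move on to the full run.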

Cores and memory are not shared among users: once a job is allocated certain cores and RAM, they are unavailable to other users until that job finishes. So, make sure you use what you request!

Partition  CPU core limit  Memory allocated   Time limit (default)
normal     256             4 GB per CPU core  2 days (2 hours)
dev        2               4 GB per CPU core  2 hours (1 hour)
long       32              4 GB per CPU core  7 days (2 hours)

When you request a certain number of cores, memory is allocated based on that number. By default, you get 4 GB per core. If you need more RAM, you can request additional memory in your submission script. Extra cores will then be allocated to your job in proportion to the RAM requested, and those cores become unavailable to other users.

For example, if you request 1 core but 40 GB of RAM, your job will be allocated 10 cores (40 GB at 4 GB per core) along with the requested 40 GB of RAM. Those 10 cores are unavailable to other users even though your script might be using only a single core.
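The arithmetic in this example can be checked with a few lines of shell. The helper below is a hypothetical sketch, not part of Slurm: it assumes the 4 GB-per-core rule from the table above and assumes that a memory request which is not a multiple of 4 GB is rounded up to the next whole core.

```shell
#!/bin/bash
# Hypothetical helper: estimate how many cores a job ties up,
# assuming 4 GB of RAM is paired with each core.
MEM_PER_CORE_GB=4

# Usage: allocated_cores <requested_cores> <requested_mem_gb>
allocated_cores() {
    req_cores=$1
    req_mem_gb=$2
    # Cores needed to cover the memory request, rounded up (an assumption).
    cores_for_mem=$(( (req_mem_gb + MEM_PER_CORE_GB - 1) / MEM_PER_CORE_GB ))
    # The job holds whichever is larger: the cores asked for
    # or the cores implied by the memory request.
    if [ "$cores_for_mem" -gt "$req_cores" ]; then
        echo "$cores_for_mem"
    else
        echo "$req_cores"
    fi
}

allocated_cores 1 40   # prints 10
allocated_cores 4 8    # prints 4
```

Running an estimate like this before submitting makes it easier to see when a large memory request is effectively reserving many cores.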

If a two-day time limit is not enough for your job, make sure to start your job on the long partition (up to 7 days) and save intermediate results as you go.
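The choice between the normal and long partitions can be expressed as a simple rule on the expected run time. The function below is a hypothetical sketch based only on the time limits listed above (48 hours for normal, 168 hours for long); it ignores the dev partition and the per-partition core limits.

```shell
#!/bin/bash
# Hypothetical helper: pick a partition from the expected run time in whole hours.
# Usage: pick_partition <hours>
pick_partition() {
    hours=$1
    if [ "$hours" -le 48 ]; then
        echo "normal"          # fits within the 2-day limit
    elif [ "$hours" -le 168 ]; then
        echo "long"            # fits within the 7-day limit
    else
        echo "No partition allows a $hours-hour job" >&2
        return 1
    fi
}

pick_partition 24    # prints normal
pick_partition 72    # prints long
```

Jobs that would exceed even the 7-day limit need to be split into stages, which is one more reason to save intermediate results as you go.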