Storage Solutions
GSB researchers work with datasets that often exceed the capacity of personal machines. To support this work, Stanford provides several storage platforms optimized for performance and collaboration. This page summarizes the storage options available and when to use each one.
Check Data License and Platform
Before transferring data to any platform, make sure the data is licensed for your use and that the storage platform meets the security requirements for that data.
New Storage System: VAST
High-Performance Storage for the Yen Cluster
The Yen cluster now runs on a new 1 PB all-flash VAST Data storage system, replacing the legacy ZFS backend. This upgrade significantly improves performance, reliability, and scalability for data-intensive research.
Designed for data-intensive research, VAST provides:
- All-flash performance for fast reads/writes
- Scalability for multi-TB and multi-user workloads
- User-accessible snapshots for self-service file recovery
Not for High Risk Data
The Yen servers are not approved for high risk data. VAST is mounted only on the Yen servers; you cannot access it from Sherlock, FarmShare, or any other system.
All home directories and project spaces on the Yens now live on VAST.
Use VAST when:
- You are running analyses or workflows on the Yen cluster
- You need fast access to large datasets
- You want a shared project directory for collaboration
Multiple Mount Paths
Project directories are now accessible from both:

- `/zfs/projects/...`
- `/yen/projects/...`

These paths point to the same underlying VAST storage. You can use either one interchangeably in your workflows.
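If you want to verify that two paths really sit on the same filesystem, comparing the device IDs reported by `stat` is one quick check. A minimal sketch (the helper name `same_fs` and the GNU `stat -c` flag are assumptions, and the paths in the comment only exist on Yen nodes):

```bash
# same_fs: report whether two paths live on the same filesystem,
# by comparing the device IDs that stat reports for each path.
same_fs() {
  [ "$(stat -c %d "$1")" = "$(stat -c %d "$2")" ]
}

# On a Yen node you could check, for example:
#   same_fs /zfs/projects /yen/projects && echo "same storage"
```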
Home Directory
Every Yen user receives a personal home directory:
/home/users/<SUNetID>
Your home directory is your private working space on the Yens. It’s best used for small scripts and utilities.
Do NOT Store Large Files in Home
Home directories have a strict 80 GB limit. They are not intended for large datasets, large outputs, or collaborative projects. If your home directory is full, you won't be able to access JupyterHub.
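To find out what is taking up space in your home directory, one quick check (assuming GNU `du` and `sort`, as on the Yens) is:

```bash
# List the ten largest top-level items in your home directory, biggest first.
du -h --max-depth=1 ~ 2>/dev/null | sort -hr | head -n 10
```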
Project Directory
Requesting New Project Space
To create a new project directory, submit the Project Space Request Form. The form asks you to estimate disk usage and to list any collaborators who should be added to the shared access list. Access to the directory is controlled through Stanford workgroups.
Project directories provide shared, scalable storage for research conducted on the Yens. Each project space is created by the DARC team in collaboration with the system administrators and mounted at one of the following paths:
/zfs/projects/faculty/<your-project-dir>
/zfs/projects/students/<your-project-dir>
Project directories are ideal for:
- Large datasets
- Analysis outputs and intermediate files
- Collaborative research with multiple users
- Long-running or compute-intensive workflows
Default Quotas
- Faculty-led projects: 1 TB
- Student-led projects: 200 GB
Additional space can be requested if needed.
Help Keep Shared Storage Healthy
Please archive inactive data and delete unused intermediate files. Yen storage is a shared resource for the entire research community.
If you would like to discuss specific storage solutions for your project, please email the DARC team to discuss further.
Temporary Storage
Some workflows require fast, short-term storage for intermediate results. The Yen cluster provides two options:
- Node-local storage: `/tmp`
- Shared scratch space: `/scratch/shared`
These locations are not intended for permanent data and are not backed up.
Node-Local Storage
Each compute node provides local disk space at:

```bash { .yaml .no-copy }
/tmp
```
This storage is:
- Very fast (local to the node)
- Typically **~1 TB per node**
- Only accessible from the node where your job is running
Use `/tmp` when:
- Your job runs on a **single node**
- You need maximum I/O performance
- Files are temporary and disposable
!!! warning "Always move results back to permanent storage"
    - `/tmp` is cleared when the job is done and when the node reboots. Always copy important results back to your project directory.
    - Avoid filling `/tmp`. Other jobs on the same node depend on it.
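Putting these rules together, a single-node workflow can stage its work in a unique `/tmp` subdirectory and copy results back before cleaning up. A sketch, where `run_in_tmp` is an illustrative helper (not a Yen utility) and the example command in the comment is a placeholder:

```bash
# run_in_tmp: run a command inside a unique /tmp working directory,
# copy everything it produced to permanent storage, then clean up.
run_in_tmp() {
  local dest="$1"; shift
  local workdir status
  workdir=$(mktemp -d /tmp/tmpstage.XXXXXX)  # unique dir avoids collisions with other jobs
  ( cd "$workdir" && "$@" )                  # run the compute step inside /tmp
  status=$?
  cp -r "$workdir"/. "$dest"/                # copy results back to permanent storage
  rm -rf "$workdir"                          # never leave data behind in /tmp
  return $status
}

# On a Yen node you might call, for example:
#   run_in_tmp /zfs/projects/faculty/my-project python analysis.py
```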
#### Cluster-Wide Scratch
Shared scratch space is available at:
```bash { .yaml .no-copy }
/scratch/shared
```

This space is:

- Large (~100 TB total)
- Accessible from all Yen nodes
- Slower than `/tmp`
Use `/scratch/shared` when:

- Jobs run across multiple nodes
- Intermediate data must be shared between jobs
- You need more space than `/tmp`
🔒 Using Scratch Safely
By default, files in `/scratch/shared` may be visible to other users unless you restrict access. You are responsible for setting permissions on your scratch directory.

To safely use scratch, you should:

- Create your own directory
- Restrict access with `chmod`
- Ensure new files are private using `umask`
Step 1: Create a private scratch directory
You can create this directory either manually or directly within your job script:
```bash
mkdir -p /scratch/shared/$USER
```

The `-p` flag ensures the command won't fail if the directory already exists.
Step 2: Restrict access to your directory
```bash
chmod 700 /scratch/shared/$USER
```
Step 3: Ensure new files are private
```bash
umask 077
```
umask is session-based
The umask setting only applies to the current terminal session or job. It does not persist automatically and is not tied to a specific directory.
This means you should either:

- Run `umask 077` each time you start an interactive job, or
- Include it at the top of every Slurm job script
💻 Example: Interactive Session
```bash
cd /scratch/shared
mkdir -p $USER
chmod 700 $USER
cd $USER
umask 077
python script.py
```
💻 Example: Slurm Job
```bash
#!/bin/bash
umask 077
SCRATCH_DIR=/scratch/shared/$USER
mkdir -p $SCRATCH_DIR
chmod 700 $SCRATCH_DIR
cd $SCRATCH_DIR
python script.py
```
🧹Clean Up Your Scratch Space
Scratch is a shared, temporary resource. You are expected to remove your files when your job is complete.
To clean up your scratch directory:
```bash
rm -rf /scratch/shared/$USER/*
```
!!! warning "Always move results back to permanent storage"
    Files in `/scratch/shared` are not backed up and may be periodically cleared by administrators. Any important results should be moved to your project directory.
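One way to make cleanup automatic is to wrap the pattern above in a small helper that always removes the scratch directory when the command finishes, even on failure. A sketch (`with_scratch` is an illustrative helper, not a Yen utility):

```bash
# with_scratch: run a command in a private scratch directory that is
# always removed afterward, whether the command succeeds or fails.
with_scratch() {
  local base="$1"; shift
  local dir="$base/${USER:-$(id -un)}/job_${SLURM_JOB_ID:-manual}"
  mkdir -p "$dir" && chmod 700 "$dir"   # private directory, as in Steps 1-2
  ( umask 077 && cd "$dir" && "$@" )    # new files private; run inside scratch
  local status=$?
  rm -rf "$dir"                         # always clean up scratch when done
  return $status
}

# In a Slurm job script you might call:
#   with_scratch /scratch/shared python script.py
```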
How to Check Home and Project Space Quota
To check the storage usage of your home directory, `/home/users/<SUNet ID>/`, run:

```bash
gsbquota
```
Example output:
```bash { .yaml .no-copy }
/home/users/<SUNet ID>:
using 52% (43GiB) of soft quota 80GiB
hard quota: 300GiB
time until lockout: no block scheduled
```
How to Interpret This
- **Soft quota (80GiB)**: your target limit. Stay below this for normal operation.
- **Hard quota (300GiB)**: the absolute maximum. If you exceed this, writes will fail.
- **Time until lockout**: if you stay above the soft quota for too long, you may eventually be locked out of the system.
Stay below soft quota for home
Stay below the 80GiB soft quota to avoid interruptions to your workflows and access to tools like JupyterHub.
You can also check the size of your project space by passing its full path to the `gsbquota` command:

```bash
gsbquota /zfs/projects/students/<my-project-dir>/
```

```bash { .yaml .no-copy }
/zfs/projects/students/<my-project-dir>/: currently using 39% (78G) of 200G available
```
Email Us If You Encounter Issues
Please report any issues with gsbquota to the DARC team.
How to Recover Deleted Files
Files on the Yens are backed up in snapshots, so if you accidentally delete something, you can often still recover it.
Here is an example of something that might happen:
- You are working in the directory `/zfs/projects/faculty/hello-world/` and have two files named `results.csv` and `results_temp.csv` that you made yesterday. The latter file was clearly temporary, so you want to remove it to clean things up:

```bash
rm /zfs/projects/faculty/hello-world/results.csv
```

- Oops, you accidentally deleted the file you wanted to keep! Only `results_temp.csv` remains. Luckily, since you made the files yesterday, there are likely snapshots available.
Note that snapshots are retained according to specific intervals. The current snapshot retention policy is as follows:
- Hourly --- retain 1 day of hourly snapshots
- Daily --- retain 1 week of daily snapshots
- Weekly --- retain 2 months of weekly snapshots
- Monthly --- retain 1 year of monthly snapshots
You can find them at the top level of any project folder, so for our example:

```bash { .yaml .no-copy }
/zfs/projects/faculty/hello-world/.snapshot
```
Snapshots Are Still Being Populated
The snapshots are still being populated in the new file system. Eventually we will have a year of snapshots.
We recommend not relying on snapshots since they may not always be available. If you often need snapshots, it may mean you don’t yet have a good backup/versioning workflow in place. Think of the snapshots as "oh thank goodness, I didn't mean to delete that".
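Recovering a file then amounts to copying it out of the appropriate snapshot directory. A sketch (`restore_from_snapshot` is an illustrative helper; snapshot directory names vary, so list `.snapshot` first to see what is available):

```bash
# restore_from_snapshot: copy one file from a snapshot back into the
# live project directory, preserving its timestamps.
restore_from_snapshot() {
  local project="$1" snap="$2" file="$3"
  cp -p "$project/.snapshot/$snap/$file" "$project/$file"
}

# For the example above you would first list the available snapshots:
#   ls /zfs/projects/faculty/hello-world/.snapshot
# then copy the file out of, say, yesterday's daily snapshot:
#   restore_from_snapshot /zfs/projects/faculty/hello-world <snapshot-name> results.csv
```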
Other Storage Options
In addition to VAST, several other storage platforms support research workflows at Stanford GSB. The best option depends on your dataset size, security requirements, compute needs, and level of collaboration.
Redivis
Redivis allows users to deploy datasets in a web-based environment (GCP backend) and provides a powerful query GUI for users who don't have a strong background in SQL.
Google Drive
Available to all users at Stanford. Google Drive is approved for low, medium and high risk data. It supports up to 400,000 files and has a daily upload limit of 750 GB, making it ideal for storing audio, video, PDFs, images, and flat files. Google Drive is great for sharing with external collaborators and is also suitable for archiving research data.
Oak
Oak is a High-Performance Computing (HPC) storage system available to research groups and projects at Stanford for research data. The monthly cost is approximately $45 per 10 TB. Oak does not provide local or remote data backup by default, and should be considered as a single copy. However, backups are available for an additional fee. Oak is the preferred storage location for Sherlock, but can be mounted on the Yen cluster by request using an NFS gateway.
Cloud Platforms
Storing data in the cloud is an effective way to inexpensively archive data, serve files publicly, and integrate with cloud-native query and compute tools. With the growing number of cloud storage options and security risks, we advise caution when choosing to store your data on any cloud platform. If you are considering cloud solutions for storage, please contact DARC to discuss your needs.
Legacy Stanford Platforms
The following storage platforms are currently supported but users are discouraged from relying on them for continuing research:
- AFS: Andrew File System is a distributed, networked file system that allows users to access and share files. UIT no longer automatically provisions new faculty and staff members with AFS user volumes as the service is being sunsetted by the university.
- Box: Stanford University Box provided basic document management and collaboration through Box.com. As of February 28, 2023, university IT retired the Box service and has taken steps to migrate Box content to a new folder on Google Drive or Microsoft OneDrive.
AFS Volumes
You may have a personal AFS volume named according to your SUNet ID. For example, if your SUNet ID is `johndoe13`, the path to your AFS directory is `/afs/ir/users/j/o/johndoe13`; the two single-letter directories are the first two letters of the SUNet ID.
You may have access to other AFS volumes set up for specific projects, or other people may give you access to a specific directory in their AFS volume. To access other AFS volumes, you need to know what the path is. For example, the path might be something like /afs/ir/data/gsb/nameofyourdirectory.
How to access an AFS volume
You can transfer files to and from AFS from your desktop using OpenAFS, a free download available from Stanford. This software mounts your AFS directory so that you can access it in an Explorer (Windows) or Finder (Mac) window as you do with other files.
AFS is NOT Mounted on the Yen Servers
AFS is no longer mounted on the Yens. If you still wish to access your AFS space (afs-home), you can SSH into SRC's rice nodes. These nodes are part of the University's FarmShare system, and you can access them with `ssh <SUNetID>@rice.stanford.edu`.
WebAFS has been retired and is no longer available to use. For its alternatives, visit this page.