Storage Solutions

GSB researchers work with datasets that often exceed the capacity of personal machines. To support this work, Stanford provides several storage platforms optimized for performance and collaboration. This page summarizes the storage options available and when to use each one.

Check Data License and Platform

Before transferring data to any platform, make sure the data is licensed for your use and that the storage platform meets the security requirements for that data.

New Storage System: VAST

High-Performance Storage for the Yen Cluster

The Yen cluster now runs on a new 1 PB all-flash VAST Data storage system, replacing the legacy ZFS backend. This upgrade significantly improves performance, reliability, and scalability for data-intensive research.

Designed for data-intensive research, VAST provides:

  • All-flash performance for fast reads/writes

  • Scalability for multi-TB and multi-user workloads

  • User-accessible snapshots for self-service file recovery

Not for High Risk Data

The Yen servers are not approved for high-risk data. VAST is mounted only on the Yen servers; you cannot access it from Sherlock, FarmShare, or any other system.

All home directories and project spaces on the Yens now live on VAST.

Use VAST when:

  • You are running analyses or workflows on the Yen cluster

  • You need fast access to large datasets

  • You want a shared project directory for collaboration

Multiple Mount Paths

Project directories are now accessible from both:

  • /zfs/projects/...
  • /yen/projects/...

These paths point to the same underlying VAST storage. You can use either one interchangeably in your workflows.

Home Directory

Every Yen user receives a personal home directory:

Terminal Output for pwd Command
/home/users/<SUNetID>

Your home directory is your private working space on the Yens. It’s best used for small scripts and utilities.

Do NOT Store Large Files in Home

Home directories have a strict 80 GB limit. They are not intended for large datasets, large outputs, or collaborative projects. If your home directory is full, you won't be able to access JupyterHub.
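If you are approaching the limit, one quick way to find candidates to move into project space is to list the largest top-level items in your home directory. This is a generic sketch using GNU coreutils options (`du --max-depth`, `sort -h`), which are standard on Linux systems like the Yens:

```shell
# List the ten largest top-level items in your home directory,
# biggest first, to find what is eating your quota.
du -h --max-depth=1 "$HOME" 2>/dev/null | sort -rh | head -n 10
```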

Project Directory

Requesting New Project Space

To create a new project directory, submit the Project Space Request Form. The form lets you estimate disk usage and specify collaborators who should be added to the shared access list. Access to the directory is controlled through Stanford workgroups.

Project directories provide shared, scalable storage for research conducted on the Yens. Every faculty project space is created by the DARC team in collaboration with the system administrators and mounted at:

Terminal Output
/zfs/projects/faculty/<your-project-dir>
or, for a student-led project:
Terminal Output
/zfs/projects/students/<your-project-dir>

Project directories are ideal for:

  • Large datasets

  • Analysis outputs and intermediate files

  • Collaborative research with multiple users

  • Long-running or compute-intensive workflows

Default Quotas

  • Faculty-led projects: 1 TB
  • Student-led projects: 200 GB

Additional space can be requested if needed.

Help Keep Shared Storage Healthy

Please archive inactive data and delete unused intermediate files. Yen storage is a shared resource for the entire research community.

If you would like to discuss specific storage solutions for your project, please email the DARC team.

Temporary Storage

Some workflows require fast, short-term storage for intermediate results. The Yen cluster provides two options:

  • Node-local storage: /tmp
  • Shared scratch space: /scratch/shared

These locations are not intended for permanent data and are not backed up.

Node-Local Storage

Each compute node provides local disk space at:

```bash { .yaml .no-copy }
/tmp
```

This storage is:

- Very fast (local to the node)
- Typically **~1 TB per node**
- Only accessible from the node where your job is running

Use `/tmp` when:

- Your job runs on a **single node**
- You need maximum I/O performance
- Files are temporary and disposable

!!! warning "Always move results back to permanent storage"
      - `/tmp` is cleared when the job is done and when the node reboots. Always copy important results back to your project directory.
      - Avoid filling `/tmp`. Other jobs on the same node depend on it.
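The stage-in/compute/stage-out pattern described above can be sketched as follows. This is a minimal illustration, not a prescribed workflow: the `PROJECT` default and the `tr` step are stand-ins so the sketch runs anywhere; on the Yens you would point `PROJECT` at your real project path.

```bash
# Sketch of a single-node /tmp workflow: stage in, compute, stage out.
# PROJECT defaults to a throwaway directory so this runs anywhere;
# on the Yens use e.g. /zfs/projects/faculty/<your-project-dir>.
PROJECT=${PROJECT:-$(mktemp -d)}
JOB_TMP=$(mktemp -d /tmp/myjob.XXXXXX)       # private per-job workspace

echo "input data" > "$PROJECT/input.csv"     # stand-in for a real dataset
cp "$PROJECT/input.csv" "$JOB_TMP/"          # stage input onto fast local disk
tr 'a-z' 'A-Z' < "$JOB_TMP/input.csv" > "$JOB_TMP/output.csv"  # stand-in analysis
cp "$JOB_TMP/output.csv" "$PROJECT/"         # copy results back before exiting
rm -rf "$JOB_TMP"                            # free /tmp for other jobs
```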


Cluster-Wide Scratch

Shared scratch space is available at:

```bash { .yaml .no-copy }
/scratch/shared
```

This space is:

  • Large (~100 TB total)
  • Accessible from all Yen nodes
  • Slower than /tmp

Use /scratch/shared when:

  • Jobs run across multiple nodes
  • Intermediate data must be shared between jobs
  • You need more space than /tmp

🔒 Using Scratch Safely

By default, files in /scratch/shared may be visible to other users unless you restrict access.

You are responsible for setting permissions on your scratch directory.

To safely use scratch, you should:

  1. Create your own directory
  2. Restrict access with chmod
  3. Ensure new files are private using umask

Step 1: Create a private scratch directory

You can create this directory either manually or directly within your job script:

Terminal Input
mkdir -p /scratch/shared/$USER
Using -p ensures the command won’t fail if the directory already exists.

Step 2: Restrict access to your directory

Terminal Input
chmod 700 /scratch/shared/$USER
This ensures only you can access the directory.

Step 3: Ensure new files are private

Terminal Input
umask 077
This ensures that any new files or directories you create are private by default.
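To see the effect, create a file and a directory after setting the umask; both come out owner-only. This sketch uses a throwaway directory and the GNU form of `stat`:

```shell
# Show that umask 077 makes new files and directories private by default.
demo=$(mktemp -d)
umask 077
touch "$demo/private.txt"
mkdir "$demo/private_dir"
stat -c '%a %n' "$demo/private.txt" "$demo/private_dir"
# file mode is 600 (rw-------), directory mode is 700 (rwx------)
```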

umask is session-based

The umask setting only applies to the current terminal session or job. It does not persist automatically and is not tied to a specific directory.

This means:

  • You should run umask 077 each time you start an interactive job, or
  • Include it at the top of every Slurm job script

💻 Example: Interactive Session

Using scratch in an interactive job

```bash
cd /scratch/shared
mkdir -p $USER
chmod 700 $USER
cd $USER

umask 077
python script.py
```

💻 Example: Slurm Job

Using scratch in a Slurm job

```bash
#!/bin/bash

umask 077
SCRATCH_DIR=/scratch/shared/$USER

mkdir -p $SCRATCH_DIR
chmod 700 $SCRATCH_DIR
cd $SCRATCH_DIR

python script.py
```


🧹 Clean Up Your Scratch Space

Scratch is a shared, temporary resource. You are expected to remove your files when your job is complete.

To clean up your scratch directory:

Terminal Command
rm -rf /scratch/shared/$USER/*

Always move results back to permanent storage

  • Files in /scratch/shared are not backed up and may be periodically cleared by administrators. Any important results should be moved to your project directory.
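A minimal sweep-and-clean sketch of that advice follows. The paths default to throwaway directories so it runs anywhere; on the Yens, `SCRATCH_DIR` would be your /scratch/shared/$USER directory and `DEST` your project directory:

```shell
# Copy everything worth keeping out of scratch, then release the space.
SCRATCH_DIR=${SCRATCH_DIR:-$(mktemp -d)}          # e.g. /scratch/shared/$USER
DEST=${DEST:-$(mktemp -d)}                        # e.g. your project directory
echo "final results" > "$SCRATCH_DIR/model.out"   # stand-in output file

cp -a "$SCRATCH_DIR"/. "$DEST"/   # -a preserves modes and timestamps
rm -rf "$SCRATCH_DIR"             # then clean up scratch for other users
```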

How to Check Home and Project Space Quota

To check the storage usage of your home directory, /home/users/<SUNetID>/, run:

Terminal Command
gsbquota

Example output:

Terminal Output
/home/users/<SUNet ID>:
  using 52% (43GiB) of soft quota 80GiB
  hard quota: 300GiB
  time until lockout: no block scheduled

How to Interpret This

  • Soft quota (80GiB): your target limit. Stay below this for normal operation.
  • Hard quota (300GiB): the absolute maximum. If you exceed this, writes will fail.
  • Time until lockout: if you stay above the soft quota for too long, you may eventually be locked out of the system.

Stay below soft quota for home

Stay below the 80GiB soft quota to avoid interruptions to your workflows and loss of access to tools like JupyterHub.

You can also check the size of your project space by passing its full path to the gsbquota command:

Terminal Output
gsbquota /zfs/projects/students/<my-project-dir>/
/zfs/projects/students/<my-project-dir>/: currently using 39% (78G) of 200G available

Email Us If You Encounter Issues

Please report any issues with gsbquota to the DARC team.

How to Recover Deleted Files

Files on the Yens are backed up in snapshots, so if you accidentally delete something, you can likely still recover it.

Here is an example of something that might happen:

  • You are working in the directory /zfs/projects/faculty/hello-world/, which contains two files you made yesterday, results.csv and results_temp.csv. The latter is clearly temporary, so you decide to remove it to clean things up:
Terminal Command
rm /zfs/projects/faculty/hello-world/results.csv
  • Oops, you accidentally deleted the file you wanted to keep! Only results_temp.csv remains. Luckily, since you made the files yesterday, there are likely snapshots available.

Note that snapshots are retained according to specific intervals. The current snapshot retention policy is as follows:

  • Hourly --- retain 1 day of hourly snapshots
  • Daily --- retain 1 week of daily snapshots
  • Weekly --- retain 2 months of weekly snapshots
  • Monthly --- retain 1 year of monthly snapshots

You can find them at the top level of any project folder, so for our example: /zfs/projects/faculty/hello-world/.snapshot
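The recovery pattern looks like this. The sketch builds a fake .snapshot layout in a throwaway directory so it can run anywhere; the snapshot name is purely illustrative, so list the .snapshot directory on the Yens to see the real ones:

```shell
# Sketch of recovering a deleted file from a snapshot. A throwaway
# directory stands in for /zfs/projects/faculty/hello-world, and the
# snapshot name below is illustrative, not a real naming scheme.
proj=$(mktemp -d)
mkdir -p "$proj/.snapshot/daily_2025-01-01_0010"
echo "precious results" > "$proj/.snapshot/daily_2025-01-01_0010/results.csv"

ls "$proj/.snapshot"    # 1. see which snapshots exist
cp "$proj/.snapshot/daily_2025-01-01_0010/results.csv" "$proj/"   # 2. copy it back
```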

Snapshots Are Still Being Populated

The snapshots are still being populated in the new file system. Eventually we will have a year of snapshots.

We recommend not relying on snapshots since they may not always be available. If you often need snapshots, it may mean you don’t yet have a good backup/versioning workflow in place. Think of the snapshots as "oh thank goodness, I didn't mean to delete that".

Other Storage Options

In addition to VAST, several other storage platforms support research workflows at Stanford GSB. The best option depends on your dataset size, security requirements, compute needs, and level of collaboration.

Redivis

Redivis allows users to deploy datasets in a web-based environment (GCP backend) and provides a powerful query GUI for users who don't have a strong background in SQL.

Google Drive

Available to all users at Stanford. Google Drive is approved for low, medium and high risk data. It supports up to 400,000 files and has a daily upload limit of 750 GB, making it ideal for storing audio, video, PDFs, images, and flat files. Google Drive is great for sharing with external collaborators and is also suitable for archiving research data.

Oak

Oak is a High-Performance Computing (HPC) storage system available to research groups and projects at Stanford for research data. The monthly cost is approximately $45 per 10 TB. Oak does not provide local or remote data backup by default, and should be considered as a single copy. However, backups are available for an additional fee. Oak is the preferred storage location for Sherlock, but can be mounted on the Yen cluster by request using an NFS gateway.

Cloud Platforms

Storing data in the cloud is an effective way to inexpensively archive data, serve files publicly, and integrate with cloud-native query and compute tools. With the growing number of cloud storage options and security risks, we advise caution when choosing to store your data on any cloud platform. If you are considering cloud solutions for storage, please contact DARC to discuss your needs.

Legacy Stanford Platforms

The following storage platforms are still supported, but users are discouraged from relying on them for ongoing research:

  • AFS: Andrew File System is a distributed, networked file system that allows users to access and share files. UIT no longer automatically provisions new faculty and staff members with AFS user volumes, as the service is being sunset by the university.
  • Box: Stanford University Box provided basic document management and collaboration through Box.com. As of February 28, 2023, University IT retired the Box service and migrated Box content to new folders on Google Drive or Microsoft OneDrive.

AFS Volumes

You may have a personal AFS volume named according to your SUNet ID. For example, if your SUNet ID is johndoe13, the path to your AFS directory is /afs/ir/users/j/o/johndoe13; the two single-letter directories are the first two letters of the SUNet ID.

You may have access to other AFS volumes set up for specific projects, or other people may give you access to a specific directory in their AFS volume. To access other AFS volumes, you need to know what the path is. For example, the path might be something like /afs/ir/data/gsb/nameofyourdirectory.

How to access an AFS volume

You can transfer files to and from AFS using OpenAFS on your desktop, a free download available from Stanford. This software mounts your AFS directory so that you can access it in an Explorer (Windows) or Finder (Mac) window, as you do with other files.

AFS is NOT Mounted on the Yen Servers

AFS is no longer mounted on the Yens. If you still need to access your AFS space (afs-home), you can SSH into SRC's rice nodes. These nodes are part of the University's FarmShare system; you can access them with ssh <SUNetID>@rice.stanford.edu.

WebAFS has been retired and is no longer available to use. For its alternatives, visit this page.