Redivis
Redivis is a powerful data querying and analysis platform built specifically with researchers in mind. It is built on top of Google Cloud's BigQuery engine, which makes working with Big Data on the multi-TB scale much faster. Data manipulation and queries that may take many hours to run on computing systems like the Yen servers can take seconds on Redivis.
Why Redivis?
It has always been a major challenge to find a computing and storage solution for Big Data analysis that is intuitive for less technical researchers while still being powerful enough to support intensive data manipulation and queries. There are plenty of potential solutions, including locally-hosted database servers and cloud-hosted services (AWS Athena, AWS Redshift, AWS RDS, Google Cloud BigQuery, etc.), but they can be very costly, time-consuming to set up, cryptic to manage, unintuitive to use, or some combination of these. The Redivis platform provides the best balance of all of these factors and is our default solution for hosting large datasets for researchers at the GSB.
Furthermore, Redivis places particular emphasis on data security and access controls, creating a safe environment for collaborative research work with all types of data.
Data Security
Unlike the Yen cluster, which is only approved for Moderate risk data, Redivis is approved for High risk data. This includes highly sensitive data such as social security numbers and protected health information (PHI). This is possible because Redivis has a number of protective measures built into the platform. Its approval for High risk data makes Redivis a unique data processing platform at Stanford University and a viable hosting option for researchers negotiating with data vendors for sensitive data.
Learn More
Visit our Security page to learn more about what data and information security mean at both Stanford GSB and Stanford University.
Fine-Grained Access Control
Redivis offers fine-grained access control to most facets of its platform, including at the following levels:
- Organization -- Who is able to access, or apply for access to, datasets within an organization?
- Dataset -- Who is able to use or edit a particular dataset? How much of a dataset can an individual member see?
- Export -- Can a subset or derivative of a dataset be exported? If so, to what environment? How much data can an individual export?
- Workflow -- Who is able to see, copy, or edit the work done in a workflow?
These straightforward access controls give both administrators and researchers flexibility in managing datasets and research work.
Getting Access
Access to Organization
The GSB has its own Redivis organization (StanfordGSBLibrary) within the greater Stanford Data Farm pool of organizations. To join the GSB's Redivis organization, follow the directions on this page. After you join the organization, you can start using the datasets that are already available to you.
Once you have a Redivis account, you can also join the organization hosted by Stanford Libraries (SUL), which features an array of datasets that may be of interest to GSB researchers. Note that the datasets in this organization are not maintained by the GSB, so you should contact Research Data Services at SUL for support instead.
Access to Datasets
There are a number of datasets hosted in the StanfordGSBLibrary organization that require additional approval. You will need to apply for access to these datasets individually.
Where Do I Start?
To start, Redivis has extensive documentation about its platform, including example workflows that discuss specific data pipelines and use cases.
We recommend watching the video below for a quick overview:
Learn More
Read our blog post covering key use cases and helpful tips for Redivis based on our experience with the platform and working with other users.
When Do I Use Redivis?
The Redivis platform is best used when you want to...
- Initially explore and query datasets hosted in the StanfordGSBLibrary Redivis organization -- You do not need to start by exporting entire tables within datasets or querying data via the Redivis API.
- Subset or aggregate large datasets (multi-TB) for further processing elsewhere -- Leveraging the BigQuery-backed data transforms within Redivis for this type of data processing will be faster and more efficient than using your laptop or the Yen cluster.
- Merge small personal datasets, like Excel spreadsheets, with data hosted on Redivis -- With the ability to create your own datasets within your own account and to upload your own lists, you can perform dataset merging operations within the Redivis platform and forgo exporting data outside Redivis.
- Work with High risk data -- Although there are other options at Stanford for working with Big Data that is classified as High risk (such as Nero GCP), Redivis is fully approved and offers performant computing resources out-of-the-box that would otherwise need to be configured on your own.
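To make the subsetting, aggregating, and merging use cases above more concrete, here is a minimal sketch of the kind of SQL step you would run before exporting anything. It is not Redivis code: it uses Python's built-in sqlite3 module with an in-memory database as a local stand-in, and the table names (transactions, my_firms), column names, and values are hypothetical. Redivis transforms run on BigQuery, so the exact SQL dialect and interface will differ, but the pattern of joining a small personal list to a large table and aggregating down to a small result is the same.

```python
import sqlite3

# In-memory SQLite database as a local stand-in for a large hosted table.
# All table names, column names, and values here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (firm_id TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        ("A", 2020, 100.0),
        ("A", 2020, 50.0),
        ("A", 2021, 75.0),
        ("B", 2020, 20.0),
        ("B", 2021, 30.0),
        ("C", 2021, 500.0),
    ],
)

# A small personal list, like a spreadsheet of firms you care about.
conn.execute("CREATE TABLE my_firms (firm_id TEXT, label TEXT)")
conn.executemany(
    "INSERT INTO my_firms VALUES (?, ?)",
    [("A", "treated"), ("B", "control")],
)


def yearly_totals_for_my_firms(connection):
    """Join the small list to the large table, then aggregate by firm and year."""
    query = """
        SELECT t.firm_id, m.label, t.year, SUM(t.amount) AS total_amount
        FROM transactions AS t
        JOIN my_firms AS m ON m.firm_id = t.firm_id
        GROUP BY t.firm_id, m.label, t.year
        ORDER BY t.firm_id, t.year
    """
    return connection.execute(query).fetchall()


# Only this small aggregated result would need to leave the platform.
rows = yearly_totals_for_my_firms(conn)
for row in rows:
    print(row)
```

The inner join drops firm C, which is not in the personal list, and the GROUP BY collapses the six input rows into four. On a multi-TB table the same shape of query can reduce the data to something small enough to process comfortably elsewhere.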
Where Do I Ask for Help?
Depending on the nature of your question, there are several places that you can go for help:
- For Redivis platform-specific questions, you can join the #redivis-users Slack channel hosted for GSB researchers and ask questions there.
- For questions about access to datasets hosted in the StanfordGSBLibrary Redivis organization, email the GSB Research Data Coordination team.
- For questions about hosting your own large datasets on Redivis, email the GSB Data, Analytics, and Research Computing (DARC) team.
- For questions about the content of specific datasets hosted in the StanfordGSBLibrary Redivis organization, fill out the GSB Library Ask Us form.
Storage Costs
The DARC team is happy to perform the technical work to help GSB faculty researchers host datasets on Redivis, but there are associated storage costs on the Redivis platform that will need to be covered by the researcher.