Reproducible Research Essentials
This guide provides the foundational components needed to ensure reproducibility in your research. It focuses on:
- Documenting fixed inputs and expected outputs
- Making a README file
- Managing computational environments
- Summary with additional resources
- Advanced topics
A Research Computing and Reproducible Research presentation is also available under Research Hub Training & Workshops. It includes a one-hour video, presentation slides, and a link to the code.
Documenting Fixed Inputs and Expected Outputs
- Input Data: Clearly describe the data your code expects, including its format, structure, and examples. Highlight any preprocessing steps required.
- Expected Outputs: Define the outputs your code generates, specifying their formats and how they should be interpreted.
- Data Sources: For external datasets, provide details on how to access or obtain them.
- Test Cases: Include sample test cases with known outputs, enabling users to verify the functionality of your system after setup.
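As a rough illustration of the test-case idea above, the Python sketch below checks that an input file still matches a documented SHA-256 checksum and that the analysis reproduces a known output. The function run_analysis, the JSON input format, and the tolerance are hypothetical stand-ins for illustration, not part of any particular project.

```python
import hashlib
import json
import math
from pathlib import Path


def sha256_of_file(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_analysis(values):
    """Hypothetical stand-in for the real analysis: simple summary statistics."""
    return {"n": len(values), "mean": sum(values) / len(values)}


def check_test_case(input_path, expected_sha256, expected_output, rel_tol=1e-9):
    """Confirm the input is the documented file, then compare against the known output."""
    if sha256_of_file(input_path) != expected_sha256:
        raise ValueError(f"{input_path} does not match the documented checksum")
    values = json.loads(Path(input_path).read_text())
    result = run_analysis(values)
    # Compare floats with a tolerance, since exact equality can differ across systems.
    return result["n"] == expected_output["n"] and math.isclose(
        result["mean"], expected_output["mean"], rel_tol=rel_tol
    )
```

A project could record the checksum and expected values alongside the sample data, so that a new user can run one check after setup and see whether their installation reproduces the documented result.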
Making a README File
A README file serves as the central guide to your project, offering an overview, setup instructions, and usage examples.
- Installation Instructions: Provide step-by-step setup instructions, including commands for installing dependencies.
- Usage: Explain how to run your code, detailing command-line arguments or configuration files as needed.
- Examples: Share illustrative examples, ranging from simple cases to complex scenarios.
- License: Clearly state the license governing the use, modification, and distribution of your project.
Additionally, include information about the memory and compute requirements of your analysis. For resource-intensive projects, note the necessity of high-performance computing environments, like the Yen cluster, and provide a description of these environments if applicable.
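One way to obtain rough numbers for the README is to time a representative run and record its peak memory. The sketch below uses Python's standard time and tracemalloc modules; example_workload is a hypothetical stand-in for an analysis step. Note that tracemalloc only sees memory allocated through Python's allocator, so it may under-report usage by native libraries, and tracing slows the run somewhat.

```python
import time
import tracemalloc


def measure_resources(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds, peak_mebibytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak_bytes / 2**20


def example_workload(n):
    """Hypothetical stand-in for an analysis step (builds a list, then sums it)."""
    return sum([i * i for i in range(n)])
```

Calling measure_resources(example_workload, 1_000_000) returns the result together with the elapsed seconds and peak mebibytes, which can be copied into the README as approximate requirements.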
Managing Computational Environments
Maintaining consistent computational environments is critical for reproducibility.
Python
- Virtual Environments: Use tools like venv to create isolated environments for managing dependencies. Save these configurations using pip freeze > requirements.txt.
- Environment Files: Include environment specifications in your repository. While requirements.txt may contain superfluous libraries, it provides a starting point for replication.
- For more details, please refer to the Python Virtual Environment Page.
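To complement a requirements.txt file, it can help to record the Python version and platform next to the package list. The sketch below is one possible approach using the standard importlib.metadata and platform modules. The function names are illustrative, and the output only approximates what pip freeze produces (it does not treat editable or version-control installs specially), so pip freeze remains the reference tool.

```python
import platform
from importlib import metadata


def installed_packages():
    """Return sorted 'name==version' strings for the active environment."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with missing metadata
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


def environment_report():
    """Build a text report: interpreter and platform as comments, then pinned packages."""
    header = [
        f"# python {platform.python_version()}",
        f"# platform {platform.platform()}",
    ]
    return "\n".join(header + installed_packages()) + "\n"
```

Writing environment_report() to a file inside the activated virtual environment gives collaborators the interpreter version as well as the package pins.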
R
- RStudio and Projects: Leverage RStudio's project-based setup to isolate workspace settings and library paths.
- renv for Package Management: Use the renv package to capture project-specific package states. Share the resulting renv.lock file to enable seamless environment recreation.
Initialize renv in your project with renv::init(). Share your project by including the renv.lock file in your code repo, so others can recreate the environment by running renv::restore().
Best Practices for Environment Management
- Document the setup and management of your environment in a README file, including how to install dependencies and any necessary configuration steps.
- Include environment files like requirements.txt or renv.lock in your code repository to ensure that others can easily set up identical environments.
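To confirm that a recreated Python environment matches the shared file, a small check along the following lines may be useful. It is a minimal sketch that assumes requirements.txt contains plain name==version lines, as pip freeze typically writes. Other specifiers, extras, and option lines are ignored, and the function names are hypothetical.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Parse 'name==version' lines; skip blanks, comments, options, and other specifiers."""
    pins = {}
    for raw in requirements_text.splitlines():
        # Drop trailing comments and environment markers.
        line = raw.split("#", 1)[0].split(";", 1)[0].strip()
        if not line or line.startswith("-") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins, lookup=metadata.version):
    """Return {name: (expected, found)} for pins the environment does not satisfy."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches
```

An empty result from find_mismatches(parse_pins(text)), where text is the content of requirements.txt, suggests the environment matches the pinned versions. Any entries show the expected and installed versions side by side.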
Summary
Investing in thorough documentation and environment management pays off by making your research more accessible and reproducible. Host your work on platforms like GitHub, test replicability on various systems, and include detailed README files to guide users.
Taking these steps ensures your research can be understood, validated, and built upon for years to come, significantly enhancing its impact.
Additional Resources
- Guide to Project Management: Strategies for data organization, FAIR principles, and ethical constraints in sharing research materials.
- Software Carpentry Git Guide: Introductory and advanced topics on version control with Git.
- The Dataverse Project: Open-source repository software for research data management and sharing.
Advanced Topics
- Comprehensive Documentation: Use tools like Sphinx for Python projects to create detailed documentation websites from code docstrings.
- Continuous Integration (CI): Automate testing and code quality checks with CI tools like GitHub Actions or GitLab CI.
- Publishing and Sharing: Obtain DOIs for your research software on platforms like Zenodo.
- Data Version Control (DVC): Integrate DVC to manage data and model versions.
- Open Access Repositories: Share research outputs on platforms like arXiv or Zenodo.
- Preregistration: Use platforms like the Center for Open Science to preregister your experiments.
- Containers: Employ Docker, Podman, or Singularity to encapsulate your runtime environment, ensuring portability across systems.