Reproducible Research in R (and friends)
This cheatsheet provides essential guidelines and best practices for conducting reproducible research using R and related tools. It covers project organization, version control, data management, documentation, environment management, workflow automation, and more. For any suggestions or feedback, please feel free to email me.
Project Organization
- Use a consistent folder structure:
  - data/ - Analysis data files
  - scripts/ - Analysis scripts
  - outputs/ - Results (figures, tables)
  - docs/ - Documentation and reports
- Use RStudio Projects to facilitate project management and environment isolation
- Maintain a project log to document progress, changes, and key decisions throughout the analysis
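The folder layout above could be created from R in a few lines. This is a minimal sketch: the helper name create_project_dirs() and the project name "my-analysis" are hypothetical, and the example builds the folders in a temporary directory rather than a real project.

```r
# Hypothetical helper: create the standard project folders under `root`.
create_project_dirs <- function(root = ".") {
  dirs <- file.path(root, c("data", "scripts", "outputs", "docs"))
  for (d in dirs) {
    # showWarnings = FALSE keeps reruns quiet when a folder already exists
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(dirs)
}

# Example: build the layout inside a temporary directory
project_root <- file.path(tempdir(), "my-analysis")
create_project_dirs(project_root)
list.files(project_root)
```

In a real project the call would simply be create_project_dirs() from the project root.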
Version Control
- Use Git to track changes in scripts and documents
- Commit regularly with meaningful messages
- One repository per analysis
- Make sure your data/ folder is in the .gitignore file
- Make sure there is no sensitive information in your code
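Adding data/ to the ignore file can be done by hand or scripted. The sketch below uses base R only; the helper name ignore_path() is hypothetical, and the example writes to a throwaway file instead of a real .gitignore.

```r
# Hypothetical helper: add an entry to a .gitignore file unless it is already listed.
ignore_path <- function(entry, gitignore = ".gitignore") {
  existing <- if (file.exists(gitignore)) {
    readLines(gitignore, warn = FALSE)
  } else {
    character(0)
  }
  if (!entry %in% existing) {
    writeLines(c(existing, entry), gitignore)
  }
  invisible(gitignore)
}

# Example on a throwaway file; in a real project, call ignore_path("data/")
# from the repository root.
demo_gitignore <- tempfile(fileext = ".gitignore")
ignore_path("data/", demo_gitignore)
ignore_path("data/", demo_gitignore)  # the second call changes nothing
readLines(demo_gitignore)
```

Note that an ignore rule only affects untracked files: data that has already been committed stays in the Git history. The {usethis} package also provides use_git_ignore() for the same task.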
Data Management
- Store raw data in data/raw/ and never modify it directly
- Produce a README describing the source data
- Use scripts to clean and process data, save the cleaned data in data/processed/
- Document each step of data cleaning
- Keep data cleaning separate from analysis
- Organize your data in a tidy format: each variable is a column, each observation is a row
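A cleaning script following these rules reads from data/raw/, transforms the data in memory, and writes to data/processed/. The sketch below is illustrative only: the file names, column names, and the toy raw file are assumptions made so that the example runs on its own in a temporary directory.

```r
# Toy setup so the sketch runs on its own: a temporary project with one raw file.
root <- file.path(tempdir(), "data-demo")
dir.create(file.path(root, "data", "raw"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "data", "processed"), recursive = TRUE, showWarnings = FALSE)
raw_file <- file.path(root, "data", "raw", "measurements.csv")
writeLines(c("ID,Height cm", "1,170", "2,NA", "3,182"), raw_file)

# Cleaning step 1: read the raw file; it is only ever read, never written to.
raw <- read.csv(raw_file, check.names = FALSE)

# Cleaning step 2: use consistent snake_case column names.
clean <- raw
names(clean) <- c("id", "height_cm")

# Cleaning step 3: drop rows with a missing height.
clean <- clean[!is.na(clean$height_cm), ]

# Save the result under data/processed/ for the analysis scripts to read.
processed_file <- file.path(root, "data", "processed", "measurements_clean.csv")
write.csv(clean, processed_file, row.names = FALSE)
```

The result is tidy in the sense described above: one column per variable and one row per observation.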
Documentation
- Comment code extensively to explain steps and logic
- Create README files to explain project structure and instructions for running the analysis
- Document all functions clearly, including input parameters, output, and purpose
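One common way to document a function's purpose, inputs, and output is the roxygen2 comment style. The source does not prescribe a format, so treat this as one possible convention; the function standardize() is a hypothetical example.

```r
#' Standardize a numeric vector
#'
#' Purpose: centre a vector on its mean and scale it by its standard deviation.
#'
#' @param x A numeric vector with at least two non-missing values.
#' @param na_rm Logical; if TRUE, missing values are ignored when computing
#'   the mean and standard deviation.
#' @return A numeric vector of the same length as `x`.
standardize <- function(x, na_rm = TRUE) {
  stopifnot(is.numeric(x))
  (x - mean(x, na.rm = na_rm)) / sd(x, na.rm = na_rm)
}

standardize(c(2, 4, 6))  # -1 0 1
```

The comments work as plain documentation even without roxygen2 installed; inside a package, roxygen2 can turn them into help pages.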
Environment Management
- Use sessionInfo() or devtools::session_info() to capture the R session information
- Use {renv} to manage package versions
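As a rough illustration, the session information can be written to a file that is kept with the results. The output path below is an assumption (a temporary file here; something like outputs/session-info.txt in a project), and the {renv} calls are shown as comments because they modify the project they are run in.

```r
# Write the session information (R version, platform, attached packages)
# to a text file stored alongside the results.
session_file <- file.path(tempdir(), "session-info.txt")
writeLines(capture.output(sessionInfo()), session_file)

# Typical {renv} calls, commented out because they change the project:
# renv::init()      # set up a project-specific library and a lockfile
# renv::snapshot()  # record the package versions in use in renv.lock
# renv::restore()   # reinstall the versions recorded in renv.lock
```

Committing renv.lock to version control lets collaborators recreate the same package versions with renv::restore().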
Workflow Automation
- Organize your analysis into a series of numbered and ordered scripts to create a clear and reproducible workflow (e.g., 01-data-cleaning.R, 02-data-analysis.R, 03-visualization.R)
- Create a master script (e.g., run_all.R) that sequentially runs each numbered script
OR
- Use a Makefile or the {targets} package to automate and document the workflow
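A master script for the first option can be very short. The sketch below assumes scripts are named with a zero-padded numeric prefix (01-, 02-, ...) so that alphabetical order matches the intended run order; the helper name run_all() and the two demo scripts are hypothetical.

```r
# Hypothetical helper: run every numbered script in `script_dir` in order.
run_all <- function(script_dir = "scripts") {
  scripts <- sort(list.files(script_dir, pattern = "^[0-9]+-.*\\.R$",
                             full.names = TRUE))
  for (s in scripts) {
    message("Running ", basename(s))
    source(s)  # evaluated in the global environment, as when run by hand
  }
  invisible(basename(scripts))
}

# Demo with two tiny scripts in a temporary folder
demo_dir <- file.path(tempdir(), "scripts-demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines('step_one <- "cleaned"',
           file.path(demo_dir, "01-data-cleaning.R"))
writeLines('step_two <- paste(step_one, "then analysed")',
           file.path(demo_dir, "02-data-analysis.R"))
ran <- run_all(demo_dir)
```

Unlike a Makefile or {targets}, this approach reruns every step each time; it does not skip steps whose inputs are unchanged.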
Analysis Scripts
- Break analysis into small, reusable functions
- Use meaningful and consistent naming conventions, such as those provided by the Tidyverse Naming Conventions for variables and functions and by Data Carpentry for folders and files
- Style your code according to standardized recommendations from the Tidyverse Style Guide
Computational Reproducibility
- Set seeds to ensure reproducibility when using randomness in your analysis
- Document all warnings produced when the code runs
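Both points can be illustrated in base R. The seed value 2024 is arbitrary, and collecting warnings with withCallingHandlers() is only one possible way to record them, not a method taken from the source.

```r
# Same seed, same random numbers
set.seed(2024)
draw_a <- rnorm(5)
set.seed(2024)
draw_b <- rnorm(5)
identical(draw_a, draw_b)  # TRUE

# Collect warnings raised by a computation so they can be copied to the project log.
logged_warnings <- character(0)
result <- withCallingHandlers(
  as.numeric(c("1.5", "two", "3")),  # "two" cannot be converted and becomes NA
  warning = function(w) {
    logged_warnings <<- c(logged_warnings, conditionMessage(w))
    invokeRestart("muffleWarning")   # continue after recording the warning
  }
)
logged_warnings
```

Setting the seed once at the top of a script is usually enough; it is reset here only to show that the draws repeat.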
Reporting
- Use RMarkdown (.Rmd) or Quarto (.qmd) files to combine code, results, and narrative for creating dynamic reports
Validation
- Get your code reviewed prior to publication
Sharing Code And Data
- Use repositories like GitHub or GitLab for sharing code
- Use repositories like Zenodo or data.gouv.fr if you have data sets to share
Advanced Analysis Practices
- Use the “many models” approach to fit and compare models across many subsets of data (e.g. EWAS). Storing models as list-columns in tibbles simplifies storage, manipulation and visualization while promoting modularity and reusability.
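A small sketch of the many-models pattern, using the built-in mtcars data instead of EWAS data. It assumes the {dplyr}, {tidyr}, and {purrr} packages are installed and R 4.1 or later for the native pipe; the choice of model (mpg ~ wt, fitted per number of cylinders) is purely illustrative.

```r
library(dplyr)
library(tidyr)
library(purrr)

models <- as_tibble(mtcars) |>
  nest(data = -cyl) |>                # one row per cyl value, data as a list-column
  mutate(
    # fit one linear model per subset and keep it in a list-column
    model = map(data, function(d) lm(mpg ~ wt, data = d)),
    # pull a single number out of each stored model
    slope = map_dbl(model, function(m) coef(m)[["wt"]])
  )

models
```

Because each row holds its subset, its fitted model, and any derived summaries together, later steps such as diagnostics or plots can be added as further columns with the same map() pattern.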