Reproducible Research in R (and friends)

This cheatsheet provides essential guidelines and best practices for conducting reproducible research using R and related tools. It covers project organization, version control, data management, documentation, environment management, workflow automation, and more. For any suggestions or feedback, please feel free to email me.

Project Organization

Use a consistent folder structure:
- data/ - Analysis data files
- scripts/ - Analysis scripts
- outputs/ - Results (figures, tables)
- docs/ - Documentation and reports
Use RStudio Projects to facilitate project management and environment isolation
Maintain a project log to document progress, changes, and key decisions throughout the analysis
Reference:
- The concept of research compendium
- Using RStudio projects

Version Control

Use Git to track changes in scripts and documents
Commit regularly with meaningful messages
One repository per analysis
Make sure your data/ folder is in the .gitignore file
Make sure there is no sensitive information in your code
Reference:
- Happy Git with R
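One way to keep the data/ folder out of version control is to add it to .gitignore from R. The helper below, `add_to_gitignore()`, is a hypothetical function written for this sketch in base R; the {usethis} package offers `use_git_ignore()` for the same purpose. The demonstration writes to a temporary folder rather than a real repository.

```r
# Append an entry to .gitignore unless it is already listed.
add_to_gitignore <- function(entry, path = ".gitignore") {
  existing <- if (file.exists(path)) readLines(path, warn = FALSE) else character(0)
  if (!entry %in% existing) {
    writeLines(c(existing, entry), path)
  }
  invisible(readLines(path, warn = FALSE))
}

# Demonstration in a temporary folder so no real repository is touched.
gitignore <- file.path(tempdir(), ".gitignore")
add_to_gitignore("data/", gitignore)
add_to_gitignore("data/", gitignore)  # second call leaves the file unchanged
readLines(gitignore)
```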

Data Management

Store raw data in data/raw/ and never modify it directly
Produce a README describing the source data
Use scripts to clean and process data, save the cleaned data in data/processed/
Document each step of data cleaning
Keep data cleaning separate from analysis
Organize your data in a tidy format: each variable is a column, each observation is a row
Reference:
- Principles of tidy data
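The raw/processed split described above might look like the following sketch. The file names and the made-up measurements are assumptions for illustration only, and the paths point to a temporary folder; a real cleaning script would read from data/raw/ and write to data/processed/.

```r
# Illustrative paths inside a temporary folder.
root <- tempdir()
raw_dir <- file.path(root, "data", "raw")
processed_dir <- file.path(root, "data", "processed")
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(processed_dir, recursive = TRUE, showWarnings = FALSE)

# Stand-in for a raw file (made-up values): one row per observation, one column per variable.
raw_file <- file.path(raw_dir, "measurements.csv")
write.csv(
  data.frame(id = c(1, 2, 3), weight_kg = c(70.2, NA, 82.5)),
  raw_file,
  row.names = FALSE
)

# Cleaning step: read the raw file, drop incomplete rows, write a NEW file.
raw <- read.csv(raw_file)
clean <- raw[complete.cases(raw), ]
clean_file <- file.path(processed_dir, "measurements_clean.csv")
write.csv(clean, clean_file, row.names = FALSE)
```

The raw file is only ever read, never overwritten, so the cleaning can be rerun from scratch at any time.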

Documentation

Comment code extensively to explain steps and logic
Create README files to explain project structure and instructions for running the analysis
Document all functions clearly, including input parameters, output, and purpose
Reference:
- Example README file
- {roxygen2} for function documentation
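As an illustration of documenting inputs, output, and purpose, here is a small function with {roxygen2}-style comments. The function `celsius_to_fahrenheit()` is a made-up example; the `#'` comments are plain R comments, so the code runs even when {roxygen2} is not installed.

```r
#' Convert temperatures from Celsius to Fahrenheit
#'
#' @param celsius Numeric vector of temperatures in degrees Celsius.
#' @return Numeric vector of the same length, in degrees Fahrenheit.
#' @examples
#' celsius_to_fahrenheit(c(0, 100))
#' @export
celsius_to_fahrenheit <- function(celsius) {
  stopifnot(is.numeric(celsius))
  celsius * 9 / 5 + 32
}

celsius_to_fahrenheit(c(0, 100))  # returns 32 and 212
```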

Environment Management

Use sessionInfo() or devtools::session_info() to capture the R session information
Use {renv} to manage package versions
Reference:
- Introduction to renv
- Example on how to show the session info (scroll to bottom)
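A minimal sketch of both points: the session information is saved to a text file (the file name is an assumption, and a temporary folder is used here), and the usual {renv} calls are listed as comments because running them changes the project.

```r
# Save the session information next to the results (temporary folder used here).
session_file <- file.path(tempdir(), "session-info.txt")  # hypothetical file name
writeLines(capture.output(sessionInfo()), session_file)

# Typical {renv} calls, shown as comments because they modify the project:
# renv::init()      # set up a project-specific library and a lockfile
# renv::snapshot()  # record the package versions currently in use in renv.lock
# renv::restore()   # reinstall the versions recorded in renv.lock
```

Committing renv.lock alongside the code lets collaborators restore the same package versions.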

Workflow Automation

Organize your analysis into a series of numbered and ordered scripts to create a clear and reproducible workflow (e.g., 01-data-cleaning.R, 02-data-analysis.R, 03-visualization.R)
Create a master script (e.g., run_all.R) that sequentially runs each numbered script

Use a Makefile or the {targets} package to automate and document the workflow
Reference:
- {targets} Package
- Example Project using {targets}
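A master script along the lines described above could be sketched as follows. `run_all()` is a hypothetical helper, not part of any package; it assumes the script names start with zero-padded numbers so that alphabetical order matches execution order. The demonstration uses throwaway scripts in a temporary folder.

```r
# run_all.R: source every numbered script in order.
run_all <- function(script_dir = "scripts") {
  scripts <- sort(list.files(script_dir, pattern = "^[0-9]+.*\\.R$", full.names = TRUE))
  for (script in scripts) {
    message("Running ", basename(script))
    # Each script runs in its own environment, so scripts exchange results through files.
    source(script, local = new.env())
  }
  invisible(basename(scripts))
}

# Demonstration with two throwaway scripts in a temporary folder.
demo_dir <- file.path(tempdir(), "scripts")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("x <- 1", file.path(demo_dir, "01-data-cleaning.R"))
writeLines("y <- 2", file.path(demo_dir, "02-data-analysis.R"))
ran <- run_all(demo_dir)
```

Unlike a Makefile or {targets}, this sketch reruns every script each time; it does not skip steps whose inputs are unchanged.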

Analysis Scripts

Break analysis into small, reusable functions
Use meaningful and consistent names: follow the Tidyverse naming conventions for variables and functions, and the Data Carpentry conventions for folders and files
Style your code according to standardized recommendations from the Tidyverse Style Guide
Reference:
- Embrace functional programming
- Tidyverse Style Guide
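The sketch below illustrates small, reusable functions with snake_case names. The functions `standardize()` and `summarize_variable()` are made up for this example, and the built-in `mtcars` data set stands in for real analysis data.

```r
# Each function does one thing and has a descriptive snake_case name.
standardize <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

summarize_variable <- function(x) {
  c(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE), n_missing = sum(is.na(x)))
}

# The same functions are reused across several columns instead of copy-pasting code.
numeric_columns <- c("mpg", "wt", "hp")
scaled <- as.data.frame(lapply(mtcars[numeric_columns], standardize))
summaries <- sapply(mtcars[numeric_columns], summarize_variable)
```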

Computational reproducibility

Set seeds to ensure reproducibility when using randomness in your analysis
Document all warnings raised when the code runs, so that readers can tell expected warnings from new problems
Reference:
- Random number seed in R
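A short illustration of why the seed matters: resetting it to the same value reproduces the same random draws. The seed value 2024 is arbitrary.

```r
set.seed(2024)  # arbitrary seed value
first_draw <- rnorm(5)

set.seed(2024)  # same seed again
second_draw <- rnorm(5)

identical(first_draw, second_draw)  # TRUE: both draws are the same
```

Setting the seed once at the top of a script is usually enough, provided the script is always run from start to finish.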

Reporting

Use R Markdown (.Rmd) or Quarto (.qmd) files to combine code, results, and narrative for creating dynamic reports
Reference:
- Quarto Documentation
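As a rough sketch of a dynamic report, the code below writes a minimal R Markdown file and renders it if {rmarkdown} and pandoc are available. The file name and its contents are assumptions for illustration; the code fence inside the report is built with `strrep()` only to avoid nesting fences in this example.

```r
# Write a minimal R Markdown file to a temporary folder (paths are illustrative).
fence <- strrep("`", 3)  # three backticks, built programmatically
rmd_lines <- c(
  "---",
  "title: \"Minimal report\"",
  "output: html_document",
  "---",
  "",
  "The mean of the simulated values is computed below.",
  "",
  paste0(fence, "{r}"),
  "set.seed(1)",
  "mean(rnorm(10))",
  fence
)
rmd_file <- file.path(tempdir(), "report.Rmd")  # hypothetical file name
writeLines(rmd_lines, rmd_file)

# Render only if {rmarkdown} and pandoc are available on this machine.
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_file, quiet = TRUE)
}
```

For a Quarto file, the {quarto} package provides `quarto::quarto_render()` as the counterpart to `rmarkdown::render()`.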

Validation

Get your code reviewed prior to publication
Reference:
- Code Review Practices

Use repositories like GitHub or GitLab for sharing code
Use repositories like Zenodo or data.gouv.fr if you have data sets to share

Advanced Analysis Practices

Use the “many models” approach to fit and compare models across many subsets of data (e.g. EWAS). Storing models as list-columns in tibbles simplifies storage, manipulation and visualization while promoting modularity and reusability.
Reference:
- Tiefenbach’s Many Models Tutorial
- R for Data Science: Many Models Chapter
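A small sketch of the "many models" idea, assuming the {dplyr}, {tidyr}, and {purrr} packages are installed. The built-in `mtcars` data set and the model `mpg ~ wt` are stand-ins chosen for illustration: one linear model is fitted per number of cylinders and stored in a list-column.

```r
library(dplyr)
library(tidyr)
library(purrr)

models <- mtcars |>
  nest(data = -cyl) |>  # one row per cyl value; the rest of the data goes in a list-column
  mutate(
    fit = map(data, function(d) lm(mpg ~ wt, data = d)),      # list-column of fitted models
    slope = map_dbl(fit, function(m) coef(m)[["wt"]])         # one coefficient per model
  )

models
```

Because data, models, and summaries sit in the same row, filtering or sorting the tibble keeps them aligned.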