10 Reproducible Workflows for Data Science Projects
In Chapter 9, you practiced your data science skills by conducting an exploratory data analysis (EDA) project. Using data from the Centers for Disease Control and Prevention (CDC) PLACES project, you explored depression prevalence across Texas. Perhaps some of you even took it a step further and looked at other states on your own.
A data science project contains many moving pieces, ranging from the initial EDA you just completed to the advanced statistical modeling you will learn in upcoming chapters. An absolutely vital feature of any data science project is its reproducibility. In this chapter, we focus specifically on computational reproducibility.
A project is computationally reproducible when an independent data scientist (or you, a few months from now) can take your exact data, source code, and software environment, rerun the entire pipeline, and obtain the exact same results, figures, and tables. Essentially, it means you have tracked your steps clearly enough that your claims can be verified by others or by your future self.
Making your first project reproducible is important because it helps you build foundational habits for all your future work. If you are working in a team or as a student, it allows a teammate or an instructor to easily understand your thinking and verify your findings. Reproducibility matters far beyond your first project; it is a cornerstone of credible data science, enabling seamless collaboration and building trust in your results across both academic and professional settings.
Even though this is a dedicated chapter on reproducibility, you have been building your reproducibility muscles since Chapter 1. You started using Quarto and GitHub right away, both of which are central pillars of the reproducibility ecosystem used throughout this book. Quarto documents are inherently reproducible when written well because they weave your executable code, its direct output, and your narrative prose into a single file; anyone can re-render your document to recreate the exact analysis. GitHub adds another layer by tracking every single modification, creating a complete history of your project so you can revisit or restore any previous version of your work.
This chapter builds directly on that foundation. We will continue developing the project we started in the previous chapter, moving from the core analysis to thinking critically about how to make our entire workflow reproducible. By the end of this chapter, you should be able to:
- Prepare reproducible presentations using Revealjs
- Adopt a clean, standard folder structure optimized for data science reproducibility
- Coordinate smoothly with teammates using a collaborative GitHub workflow
10.1 Reproducible Presentations
Let’s say a colleague or your professor has asked you to present your data analysis on depression prevalence in Texas using slides.¹ In this section, we consider how to make those slides reproducible, covering numeric values (e.g., a mean), tables, and figures.
Any numeric value, table, or figure that is a result of your data analysis must be generated dynamically by code. Never copy-paste values or screenshots from your analysis output directly into your presentation or manuscript.
Let’s get started with our slides. You can start a Quarto presentation in RStudio by clicking File > New File > Quarto Presentation. You will be prompted with multiple layout formats; for this workflow, select the Revealjs option.
10.1.1 Numeric Values
In Chapter 1, we learned about inline code for reporting numeric values within text. Any data-derived numeric value that appears in your text should be generated by inline code. For instance, to report the number of counties, we could write in our presentation:
The dataset contained depression prevalence for `r nrow(texas_depression)` counties of Texas.
This would display in our presentation as:
The dataset contained depression prevalence for 254 counties of Texas.
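As a minimal sketch of how such values might be prepared, the code below computes a few summaries once and stores them in named objects. The small `texas_depression` data frame and its `prevalence` column are made up here for illustration; your real data will have different values and possibly different column names.

```r
# Toy stand-in for the real data (hypothetical values and column names)
texas_depression <- data.frame(
  county = c("Anderson", "Andrews", "Angelina"),
  prevalence = c(21.4, 19.8, 22.1)
)

# Compute each reported value once, in code, and round it for display
n_counties <- nrow(texas_depression)
mean_prevalence <- round(mean(texas_depression$prevalence), 1)

n_counties
mean_prevalence
```

Inline code in your slide text can then refer to `n_counties` and `mean_prevalence` by name, so the reported numbers update automatically whenever the data changes.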
Tailoring your presentation to your audience often means deciding whether to show your source code. You can toggle code visibility with the echo chunk option (true to reveal, false to conceal). Remember the syntax: in Quarto, chunk options start with #| (a hash sign and a vertical bar), and the option name is followed by a colon and a space. Leaving out the space after echo: is an easy syntax error to make when you are first starting out.
```{r}
#| echo: true
nrow(texas_depression)
```
10.1.2 Tables
You have already seen how to generate tables directly from a data frame in Chapter 8, and you built Table 9.1 back in Chapter 9.
To make your tables ready for presentations or manuscripts, it is essential to label them and include captions. For example, here is what the code chunk options looked like on our end when we generated Table 9.1:
```{r}
#| label: tbl-make-top10-depression-table
#| tbl-cap: A table for 10 Texas counties with the highest adult depression prevalence estimates for 2023.
```
While any code chunk can be given a unique label to identify it, chunks that generate tables must use a label starting with the tbl- prefix. This allows you to create dynamic cross-references in your text. When you type @tbl-make-top10-depression-table in text, Quarto automatically numbers the table and links the text to it, so that in online formats readers can click and jump directly to the table (e.g., you can click on this text: Table 9.1). Here, the table’s number reflects its order among the tables in Chapter 9 of our book. In your presentation, however, your first table would be shown as Table 1, your second as Table 2, and so on.
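For a fuller picture, the body of such a chunk could look like the sketch below. It uses a made-up data frame (the prevalence values and column names are invented for illustration) and `knitr::kable()`, which may differ from the table function you used in Chapter 8. In a Quarto document, the label and tbl-cap options shown earlier would sit at the top of the chunk.

```r
# Hypothetical data: values and column names are made up for illustration
texas_depression <- data.frame(
  county = c("Anderson", "Andrews", "Angelina", "Aransas"),
  prevalence = c(21.4, 19.8, 22.1, 20.3)
)

# Sort from highest to lowest prevalence and keep the top rows
top_counties <- head(
  texas_depression[order(texas_depression$prevalence, decreasing = TRUE), ],
  n = 3
)

# Print the result as a formatted table
knitr::kable(
  top_counties,
  col.names = c("County", "Depression prevalence (%)"),
  row.names = FALSE
)
```

Because the table is built from the data frame each time the document renders, it never drifts out of sync with the analysis.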
10.1.3 Figures
When building presentations, you will typically work with two types of images: data-driven visualizations (like {ggplot2} plots) and external, non-data images (like presenter photos, timeline diagrams, or logos).
You already know how to write code to generate data visualizations from Chapter 3. Below, we will show you how to include external images using a code chunk. Keep in mind that the chunk options featured here apply to both data-generated and non-data-generated images!
Imagine you want to display the “Hello Data Science” logo in your acknowledgments slide. Your code chunk would look like this:
```{r}
#| label: fig-logo
#| fig-cap: Hello Data Science Logo
#| fig-alt: "A hex logo that reads Hello Data Science with scatterplot-like points in the background colored into three clusters."
#| fig-align: center
#| out-width: 20%
knitr::include_graphics(here::here("presentation/logo.png"))
```
Notice that for images and figures, your chunk label must now begin with the fig- prefix. This tells Quarto to treat the asset as a figure, enabling automated numbering and interactive cross-referencing in your document. So when you write @fig-logo in text, Quarto would display a link to Figure 10.1 directly. To give your figure a visible title, use fig-cap to define the caption text.
As we discussed in Chapter 4, digital accessibility is vital, which is why you should always include a descriptive fig-alt (alternative text) for screen readers. One tricky point about writing alt text as a chunk option is the colon. Chunk options are written as name: value pairs, so a colon (:) inside an unquoted description can be misread as the start of another option and cause an error. If your alt text needs a colon, wrap the whole description in quotation marks, as we have done above.
Finally, you can control the layout of your image using fig-align to position it (such as center, left, right) and out-width to scale its size relative to the page or slide (e.g., 20%). To actually load the external image file into your document, use the include_graphics() function from the {knitr} package alongside here() from the {here} package to display the image.
There are other Quarto features that can make your presentations look polished but are not necessarily related to reproducibility. You can read the documentation for Revealjs presentations on the Quarto website. Two features we frequently use include multiple columns and themes.
10.2 Project Folder Content
A key part of reproducibility is organization. Think of your project folder like your room. If you always leave your belongings scattered around, the room becomes chaotic and you can never find what you are looking for. A project with a clean, logical folder structure is like a tidy room: everything is in its place, and it is simple for you (or anyone else) to find what you need.
There is no single correct way to structure a project, but there are a few core principles that make your workflow better. First, your file and folder names must be machine-readable. In other words, they should avoid special characters, spaces, or accent marks that can cause different operating systems or programs to run into errors. Following the Tidyverse Style Guide, it is best practice to use all lowercase letters and separate words with a hyphen (i.e., kebab-case) rather than a space or underscore (e.g., hello-data-science-logo.png). This approach not only makes your project highly searchable but also ensures your filenames function smoothly across different operating systems.
Second, file and folder names should be meaningful to humans (i.e., human-readable), with files in predictable locations. For instance, if you have a .csv file for your data, it would make sense to put this file into a data folder. This helps your collaborators and your future self find the file they are looking for.
Finally, because data science workflows almost always involve multiple sequential steps, it is best to prefix your filenames with numbers and/or letters to clarify the exact execution order. For instance, structured filenames like 01-import-data.qmd, 02-clean-data.qmd, and 03-fit-model.qmd instantly communicate the pipeline’s flow. This simple numbering convention ensures that anyone running your project knows exactly where the analysis begins and how it progresses.
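A naming convention like this can also be checked with code. The sketch below is one possible check, not an official rule: the hypothetical helper `is_well_named()` uses a regular expression that accepts a two-digit prefix, lowercase words joined by hyphens, and a lowercase file extension.

```r
# Check whether file names are kebab-case with a two-digit numeric prefix.
# The pattern encodes one possible convention, not an official rule.
is_well_named <- function(filenames) {
  # two digits, then one or more lowercase words each preceded by a hyphen,
  # then a dot and a lowercase extension
  pattern <- "^[0-9]{2}(-[a-z0-9]+)+\\.[a-z0-9]+$"
  grepl(pattern, filenames)
}

candidates <- c(
  "01-import-data.qmd",  # follows the convention
  "02 clean data.qmd",   # contains spaces
  "03-Fit_Model.qmd"     # uppercase letters and an underscore
)

is_well_named(candidates)
```

A pattern check like this only covers machine-readability. It cannot tell whether a name is meaningful to a human reader, so that part still requires your judgment.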
Which of the following filenames adheres best to proper file-naming conventions?
- a. 10-linear-regression.qmd
- b. 10 linear regression.qmd
- c. linear regression.qmd
- d. 10-lin-r.qmd
Check the footnote for the answer.²
Below, we give an example of the directory structure for the EDA project we completed in Chapter 9. You can access the actual project on GitHub.
depression-in-texas/
├── .gitignore
├── README.md
├── depression-in-texas.Rproj
├── data/
│   ├── README.md
│   ├── 01-raw-data/
│   │   └── PLACES__Local_Data_for_Better_Health,_County_Data,_2025_release_20260311.csv
│   ├── 02-converted-data/
│   │   └── places-data.parquet
│   └── 03-cleaned-data/
│       └── texas-depression-map-data.csv
├── data-processing/
│   ├── 01-convert-data.qmd
│   └── 02-clean-data.qmd
└── presentation/
    ├── logo.png
    └── slides.qmd
This folder structure creates a clear and logical flow for your project. At the top level (the “root”), you have your main project file (.Rproj), configuration files, and the main project README.md. The sub-folders organize the different components of your work:
- data/: This folder is the single source for all datasets. It is further divided into three folders: 01-raw-data for the original, unmodified CSV file; 02-converted-data for the Parquet file we created from the original CSV; and 03-cleaned-data for the processed data that is ready for analysis.
- data-processing/: This is your digital workbench. It contains numbered Quarto notebooks (.qmd files) that convert and clean your data in sequential order.
- presentation/: This folder contains the final output of your project. By keeping the Quarto document for your presentation separate, you create a clear distinction between the code used for processing data and the code used for communicating your findings.
Separating your project this way makes it easy to follow the path from raw data to final result, which is essential for reproducibility. We will explore the purpose of the README.md files and the .gitignore file in more detail next.
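If you would like to set up a similar skeleton with code rather than by clicking, a minimal sketch using base R could look like the following. It builds the folders inside a temporary directory so that nothing in your real working directory is touched; the choice of `tempdir()` is only for illustration, and you would swap in your own project path.

```r
# Build the skeleton inside a temporary directory (illustration only;
# replace project_root with your own project path when ready)
project_root <- file.path(tempdir(), "depression-in-texas")

folders <- c(
  "data/01-raw-data",
  "data/02-converted-data",
  "data/03-cleaned-data",
  "data-processing",
  "presentation"
)

# recursive = TRUE also creates parent folders such as data/
for (folder in folders) {
  dir.create(file.path(project_root, folder), recursive = TRUE, showWarnings = FALSE)
}

# Create empty README files as placeholders to fill in later
file.create(file.path(project_root, c("README.md", "data/README.md")))

# List what was created
list.files(project_root, recursive = TRUE, include.dirs = TRUE)
```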
10.2.1 README.md
A README file is a text file which is essentially the welcome mat of a folder. We will write our README using markdown, hence the .md file extension. It provides information about the files in a folder. Our project has two of them: one in the main project folder and the other one in the data folder. Each serves a distinct but equally important purpose.
The Main README.md
Think of the README.md file in your project’s root folder as the front door to your project. When you publish your project to a platform like GitHub, the contents of this file are automatically displayed on the project’s homepage. It is the first thing a visitor will see.
A good README makes your project welcoming and accessible to others. It should, at a minimum, answer three questions:
- What is this project about? A brief, clear description of the project’s research question and goals.
- How do I run it? Clear instructions for a new user to replicate your analysis. This includes stating any software requirements (e.g., “This project requires R and the tidyverse package”) and outlining the steps to reproduce the final output (e.g., “Open presentation/presentation.qmd and click Render”).
- Who are you? Your name and contact information, as well as acknowledgments for anyone who helped you.
The data/README.md (The Data Dictionary)
The README.md inside your data folder has a much more specific job: it documents your data. This file, often called a data dictionary or codebook, is essential for making your data understandable and usable. Without it, anyone trying to understand your analysis (including you, two months from now) is left to guess what your column names mean or how values are coded.
A complete data dictionary should include:
Source: Where did the data come from? Who collected or compiled the data? Provide a citation.
Host: Where was the data located when you downloaded it? Provide a URL and citation.
Difference between data source and host: Think of the source as the creator and the host as the venue. The source is the original author, agency, or researcher who designed the study, gathered the measurements, and compiled the data (e.g., the World Health Organization). The host is the platform, repository, or website where that data currently lives and from which you actually downloaded the file (e.g., Kaggle, GitHub, or an academic data repository like Zenodo). Always attribute both so your readers know who to credit for the science, and exactly where to go to replicate your download.
Date: When the data was accessed and downloaded.
File Descriptions: A list of the files in the data folder (e.g., raw_data/source_data.csv, processed_data/clean_data.csv) and a brief description of each.
Variable Descriptions: For each important data file (especially your processed_data), list every column name and provide:
- A plain-language description of what the variable is.
- The units of measurement (e.g., “dollars”, “degrees Celsius”).
- An explanation of any codes used (e.g., 1 = “Strongly Disagree”, 2 = “Disagree”, 99 = “Not Applicable”).
Keeping this documentation right next to the data it describes makes your project transparent and trustworthy.
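Writing variable descriptions by hand is easier when you start from a skeleton. The sketch below is one possible approach, assuming a small made-up data frame called `cleaned_data`: it lists every column with its storage type and leaves the description and units blank for you to fill in.

```r
# Hypothetical cleaned data (column names and values are made up)
cleaned_data <- data.frame(
  county = c("Anderson", "Andrews"),
  prevalence = c(21.4, 19.8)
)

# One row per column: its name and storage type, plus blank fields
# for the description and units that you fill in by hand
data_dictionary <- data.frame(
  variable = names(cleaned_data),
  type = unname(vapply(cleaned_data, function(column) class(column)[1], character(1))),
  description = "",
  units = ""
)

data_dictionary
```

The code can only recover names and types. The plain-language descriptions, units, and code explanations still have to come from you, since they are not stored in the data itself.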
10.2.2 .gitignore
If you have been tracking your project’s progress using Git, you have likely noticed a simple text file in your root directory named .gitignore. As the name suggests, this file tells Git which files and folders to ignore. Git does not track an ignored file, so the file will not be bundled into your repository or shared with others when you push your work to GitHub. Keep in mind that .gitignore only affects files Git is not already tracking; a file that was committed earlier stays in the repository until you remove it.
Why would Git ignore some files? First, it protects your sensitive data. You should never commit or push passwords or raw datasets containing private information. Adding these files to your .gitignore keeps them on your local machine. Second, it reduces repository clutter. Your computer and software automatically generate numerous temporary or system-specific files (such as .DS_Store on macOS, or .Rhistory and .RData from R sessions) that are unique to your local session. Pushing them to GitHub would create unnecessary clutter and can cause problems when you start collaborating.
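Each line of a .gitignore file is a pattern naming files or folders to skip. As a rough sketch, the code below writes the entries mentioned above, plus one hypothetical folder for sensitive files, to a file in a temporary folder. In a real project the file sits in the project root, and you can just as well edit it by hand in any text editor.

```r
# Entries mentioned above; the last one is a hypothetical folder name
ignore_entries <- c(
  ".DS_Store",     # macOS folder metadata
  ".Rhistory",     # R command history
  ".RData",        # saved R workspace
  "private-data/"  # hypothetical folder holding sensitive files
)

# Written to a temporary folder here for illustration; in a real
# project the file is named .gitignore and sits in the project root
gitignore_path <- file.path(tempdir(), ".gitignore")
writeLines(ignore_entries, gitignore_path)

readLines(gitignore_path)
```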
10.3 Tracking Software
A project is not truly reproducible if you can only reproduce it on your own machine at a single point in time. To ensure that you (and others) can run your analysis in the future, you must also document the computational environment—the specific versions of the software and packages you used to generate your results.
An R package you use today might be updated tomorrow, and that update could subtly change a function’s behavior, leading to a different set of results or even breaking your code. Documenting your software environment is the solution to this problem.
The simplest way to document your environment is with the session_info() function from the {sessioninfo} package.
When you run this command, R prints a detailed summary of your current session, including the version of R you are using, your operating system, and a list of all the packages currently loaded into your R session with their version numbers. It is good practice to run sessioninfo::session_info(to_file = TRUE) in the very last code chunk of your final document (e.g., at the end of presentation/presentation.qmd); this writes the session information to a file.
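A minimal sketch of that final chunk is shown below, assuming the {sessioninfo} package is installed. Here the file name is passed explicitly as a string, which the `to_file` argument also accepts, so it is clear where the output ends up.

```r
# Print the R version, operating system, and loaded packages
sessioninfo::session_info()

# Save the same information to a text file in the current working
# directory; the file name is given explicitly here
sessioninfo::session_info(to_file = "session-info.txt")
```

Committing the resulting text file alongside your project gives collaborators a record of the package versions you used.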
This command provides a crucial snapshot of the exact environment that produced your final results. If someone (or your future self) has trouble rerunning the code, the session_info() output is the first place to look for discrepancies in package versions. See an example in Section 1. Once you are comfortable with the introductory material, you may also consider using the {renv} package to keep a record of your computational environment.
10.4 Collaboration on GitHub
Data science is not a solo sport. It is very probable that you will get to work on a team. GitHub facilitates this collaboration by providing a central, shared repository where all team members can sync their work. However, to collaborate effectively and avoid overwriting each other’s contributions, it’s essential to follow a specific workflow. Since you are currently at an introductory level, we will show you a simple workflow, but as you get more advanced in data science, keep in mind that there are more complex ways to collaborate on GitHub.
Recall our Git and GitHub workflow from Section 1.6. We first cloned our repo to our computer. As we worked we made commits using Git, still only on our own computer. Then we pushed commits to GitHub online. After cloning, our workflow was just: “commit->push”.
To keep things simple, we will assume for now it is just you and one more collaborator working together towards completing your presentation.qmd. One way you may initially consider working together is by taking turns sequentially. Let’s consider the steps of this kind of workflow. We also summarize this workflow in Table 10.1.
- You both clone the repo from GitHub. You both have the same version of the files as GitHub.
- You make a change to presentation.qmd on your own computer and commit this change. At this point, nothing has changed on your collaborator’s computer and nothing has changed on GitHub.
- You push your changes to GitHub. Now both you and GitHub have the most up-to-date version, but your collaborator still has the old version of the file (and repo).
- To get the latest version, your collaborator needs to pull these changes. The Pull button is next to the Push button. Once they pull, their presentation.qmd file will be updated.
- Your collaborator commits a change. The change is only recorded on their computer.
- They push the change. Now the change is reflected on GitHub.
- You pull, and the change is now reflected on your computer.
| Steps | GitHub Version | Your Computer | Collaborator’s Computer |
|---|---|---|---|
| Both collaborators clone the repo | Version 1 | Version 1 | Version 1 |
| You commit a change | Version 1 | Version 2 | Version 1 |
| You push a commit | Version 2 | Version 2 | Version 1 |
| Collaborator pulls | Version 2 | Version 2 | Version 2 |
| Collaborator commits a change | Version 2 | Version 2 | Version 3 |
| Collaborator pushes a commit | Version 3 | Version 2 | Version 3 |
| You pull | Version 3 | Version 3 | Version 3 |
Pulling is the action of fetching the latest changes from the remote repository on GitHub and downloading (merging) them directly into your computer’s repository. Think of it as checking for updates so your version matches the GitHub version. Pulling is an essential step when working with others on GitHub.
Taking turns is safe, but it is rarely how real-world data science happens. More often, you and your collaborator will be working on your own computers at the exact same time. If you are a student doing a group project for a class, we know, as professors, how likely it is that everyone will be working at once the night before the due date.
Fortunately, Git is incredibly smart. If you are both editing the same file but working on different lines (for instance, you are writing the Introduction on line 10 and your collaborator is writing the Conclusion on line 150), Git can automatically merge your work together without overwriting anyone’s contributions. The key here is to plan ahead with your collaborator and discuss who will write which section.
Let’s look at the steps of this simultaneous workflow, tracked in Table 10.2.
You both start with the exact same version of the repo from GitHub.
While you are typing away on your computer and editing the introduction of presentation.qmd, your collaborator is actively editing the conclusion part of the same file on theirs. You both commit your respective changes locally. At this exact moment, your computer has your edits, your collaborator’s computer has their edits, and GitHub remains unchanged.
You push your changes first. GitHub accepts them immediately because your local history matches what was on GitHub. GitHub now moves to an updated version that contains your edits, but it does not have your collaborator’s edits yet.
Your collaborator tries to push their changes right after you. Git will reject their push. Because you just pushed your work, the version on GitHub is now different from the version your collaborator started with. Git will tell them that their local repository is “behind” the remote server.
To fix this, your collaborator must pull your changes first. When they click pull, Git looks at the file, realizes you edited the top and they edited the bottom, and automatically merges your changes into their local file. Their computer now holds a combined version containing both of your updates.
Now that your collaborator’s computer is fully caught up with GitHub’s history, they can successfully push. GitHub is updated with the combined version.
Finally, you pull from GitHub so that your computer also reflects the final combined document.
| Steps | GitHub Version | Your Computer | Collaborator’s Computer |
|---|---|---|---|
| Both start synchronized | Version 1 | Version 1 | Version 1 |
| You both edit and commit locally | Version 1 | Version 2A (Your edits) | Version 2B (Their edits) |
| You push your commit | Version 2A | Version 2A | Version 2B |
| Collaborator tries to push | Version 2A (push rejected) | Version 2A | Version 2B |
| Collaborator pulls your changes | Version 2A | Version 2A | Version 3 (Merged text) |
| Collaborator pushes | Version 3 | Version 2A | Version 3 |
| You pull their changes | Version 3 | Version 3 | Version 3 |
Git is excellent at automatically weaving changes together if you and your collaborator work on different files or even different sections of the same file. However, if you both edit the exact same line of the exact same file at the same time, Git will not guess whose work is correct. Instead, it will throw a merge conflict. A merge conflict is simply Git telling you “You both changed line 12. I am not sure if I should keep your line 12 or your collaborator’s line 12. I need a human to tell me which one to keep”.
When a merge conflict occurs during a pull, Git will modify the affected file and insert markers directly into your document to make it explicit to you where the merge conflict occurred. The file will look like this:
<<<<<<< HEAD
This is the text as it looks on YOUR computer.
=======
This is the text your COLLABORATOR pushed to GitHub.
>>>>>>> 1a2b3c4f...
To resolve the conflict, you must open the file and manually clean it up:
- Delete the markers (<<<<<<<, =======, and >>>>>>>).
- Decide whether to keep your version, your collaborator’s version, or rewrite a combination of both.
- Save the file, stage, and commit the resolved version to finish the merge.
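Before committing a resolved file, it can help to confirm that no markers were left behind. The sketch below is one simple way to do that in R; the helper name `find_conflict_markers()` and the example text are made up for illustration.

```r
# Return the line numbers that still start with a Git conflict marker
find_conflict_markers <- function(lines) {
  grep("^(<<<<<<<|=======|>>>>>>>)", lines)
}

# A made-up file excerpt with an unresolved conflict
unresolved <- c(
  "## Introduction",
  "<<<<<<< HEAD",
  "Depression prevalence varies across Texas counties.",
  "=======",
  "Depression prevalence differs between Texas counties.",
  ">>>>>>> 1a2b3c4f",
  "## Conclusion"
)

# The same excerpt after a human picked one version and removed the markers
resolved <- c(
  "## Introduction",
  "Depression prevalence varies across Texas counties.",
  "## Conclusion"
)

find_conflict_markers(unresolved)  # lines 2, 4, and 6
find_conflict_markers(resolved)    # no lines
```

On a real file you could pass in the output of `readLines()`. Treat the result as a hint rather than proof, since a line of equals signs can also appear in ordinary text.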
To avoid merge conflicts, always decide who is going to do what part of a project. This is not only a good Git workflow, but it is also an essential team management practice. By assigning specific files or distinct sections of a document to individual team members, you naturally avoid stepping on each other’s code lines.
Pull before you work, pull before you push. Make it a habit to click the Pull button the moment you sit down to start a working session to ensure you are building on top of your collaborator’s latest commits. Similarly, always pull right before you push to catch any updates they landed while you were working.
Your routine workflow should be: pull -> work and commit periodically -> pull -> push.
¹ For readers who might be interested in presenting their work in manuscripts, we recommend reading the Quarto documentation at https://quarto.org/docs/manuscripts/. Also see a full list of available journal templates on the Quarto Journals GitHub page at https://github.com/quarto-journals.
² The correct choice is a. It uses all lowercase letters, replaces spaces with hyphens to remain machine-readable, and begins with a two-digit number to maintain an explicit order in your project folder. Choices b and c contain blank spaces, which can break automated code. Choice d is machine-readable, but the abbreviation (lin-r) is too cryptic. Filenames should remain human-readable so a collaborator instantly knows what content lives inside the file without opening it.