Week 13 Reproducibility and Projects in R
13.1 The benefits of code reproducibility
Reproducibility refers to the capacity for any process you create to be fully and independently replicable either by yourself in the future or by another person. Non-reproducibility of scientific findings has been cited as a leading problem and some of the problem comes from the ad-hoc and thus non-reproducible conduct of data preparation and analysis.
Spatial epidemiology requires intensive data preparation, cleaning, and management, and often a complex sequence of analytic steps. In other words, it would be difficult for another analyst or a future version of yourself to repeat the process in exactly the same way unless there is a perfect record of what was done. For this reason, reproducibility of analysis is emphasized and required in this course.
For an analysis to be reproducible you must be sure that all data stays paired with all code, and that all (or as many as possible) steps that change or manipulate data are written in your scripts rather than done ‘by hand’ (e.g. in Excel or some other editor).
13.2 Workflows to enhance reproducibility
Because R and RStudio are often used for data preparation, analysis, and reporting, the fundamental importance of reproducibility (making analytic processes transparent, interpretable and repeatable) is built-in through many features. This Appendix introduces several strategies that are important for reproducibility broadly, and also important for the work you do in this course.
First, there is a brief introduction to projects in RStudio, and then there is a slightly more in-depth description of a specific file format, rmarkdown, and how it can be used to create Notebooks.
13.2.1 Using Projects in R
A project in R organizes your work much as you might use folders on your computer to sort and separate files into some logical scheme. In other words, it is a place where you put multiple documents or files that are related to one another.
For instance, you might choose to have a single project for each week of this class, and perhaps a separate project for each assignment. In each project directory (folder) you could store the data, the scripts or code, and any outputs (e.g. saved maps or other saved objects) that are specific to that week or assignment.
The advantage of creating a formal project in RStudio (rather than just a regular folder, for example), is that RStudio projects have certain benefits for your coding workflow.
- When you open a project, the working directory (i.e. the root directory or file path where R looks for files when you import) is automatically set to be inside the project folder. This means that if you keep your data inside the project, you will never have to worry about broken links or incorrect file paths that occur because data was moved.
- Projects remember environment settings in RStudio, so you may customize something for a specific project and that will be remembered each time you open the project.
- If you ever work with version control such as Git (for example with a repository hosted on GitHub), a project is the natural container for a repository.
You should avoid using setwd() in R! That function changes the working directory, and you may have been taught to use it for convenience. This is bad because whatever pathname you put inside setwd() will almost never work on another computer. That means your code is fragile and specific to your computer, and probably to your computer at only a specific point in time.
If you find yourself relying on setwd() or any other strategy to hard code file pathnames, please consider learning about projects. They help make code less fragile and more robust for sharing and reproducing.
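As a minimal sketch of the difference, the snippet below contrasts a hard-coded absolute path with a relative one built by file.path(). The folder and file names are hypothetical placeholders, not files from this course.

```r
# Fragile: an absolute path that only exists on one computer.
# (Hypothetical path, shown as a comment so the example runs anywhere.)
# setwd("C:/Users/me/Documents/EPI563/Week1")

# More robust: a path relative to the project folder, which RStudio sets
# as the working directory when the project is opened.
data_path <- file.path("data", "mydataset.csv")  # hypothetical file name
print(data_path)

# Import only if the file is actually there, so the sketch never errors.
if (file.exists(data_path)) {
  mydata <- read.csv(data_path)
} else {
  message("No file at ", data_path, " relative to ", getwd())
}
```

Because the path is relative, the same line keeps working if the whole project folder is copied to another computer.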
To create a new project:
1. Look in the upper-right corner of RStudio for the blue-ish R symbol that will likely say ‘Project’. Click the pull-down menu and select New Project.
2. You will see the Project Wizard open with three options:
   - If you have not yet created the folder on your computer that will be your project, choose New Directory.
   - If you already have a folder (e.g. perhaps it is named ‘Week1’), choose Existing Directory.
   - If you are forking or checking out a repository from GitHub, GitLab or another system, choose Version Control.
3. Navigate to the location where you want your new folder to be, or else the location where your existing folder already is.
4. Name the project and click Create Project.
Once the project is created, you can navigate to that folder in your file browser (e.g. Finder or File Explorer). You will notice a new file with the extension .Rproj. If you double-click this file, your project will open, including whatever files and settings you have already worked on.
Get in the habit of opening R by double-clicking on the xxx.Rproj icon in your project folder. Doing this makes sure that the working directory is set and helps you maintain relative rather than absolute file pathnames within your project folder.
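A quick way to confirm that a session really started from the project is to look at the working directory and check for the .Rproj file. This is only an illustrative check using base R functions; the variable names are arbitrary.

```r
# Where is R currently looking for files?
project_dir <- getwd()
print(project_dir)

# An RStudio project folder contains a file ending in .Rproj.
rproj_files <- list.files(project_dir, pattern = "\\.Rproj$")

if (length(rproj_files) > 0) {
  message("Working inside project: ", rproj_files[1])
} else {
  message("No .Rproj file found in ", project_dir)
}
```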
13.3 Organizing projects
Some projects or analyses are simple and perhaps only involve a single script document and use built-in data. But most projects are more complex than that, involving dataset(s), one or more files with code scripts, possibly output including datasets as well as images saved from figures, and markdown files or reports. It is good practice to have a standard strategy for organizing these.
13.3.1 Make scripts that do discrete tasks
You may be used to having one file with hundreds or even thousands of lines of code to do every part of an analysis. This isn’t inherently wrong, but it can make it difficult to find the particular snippets of code where you defined a recoded variable, or carried out descriptive analyses. For larger projects, consider creating separate scripts for discrete steps. If you do have many different R scripts in a given project, consider storing them in a sub-folder, perhaps labeled code/. You might break your work down into separate scripts like this:
- A script for data preparation. This allows you to quickly return to the process of retrieving and preparing your data to make changes.
- Scripts for descriptive analysis. You may want to revisit your descriptives in the future and having them separate makes that easier.
- Scripts (one or more) for more complex analyses including modeling, figure preparation, or simulation.
Each script should have an informative name such as project-x-data-prep.R or project-x-create-final-maps.R.
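One way to tie discrete scripts together is a short ‘run everything’ script that sources each step in order. This is only a sketch: the file name project-x-descriptives.R and the idea of a master script are assumptions for illustration, not something this course prescribes.

```r
# Scripts listed in the order they should run. The middle file name is a
# hypothetical example; the other two follow the naming pattern above.
scripts <- c(
  "project-x-data-prep.R",
  "project-x-descriptives.R",
  "project-x-create-final-maps.R"
)

# Paths are relative to the project folder, inside the code/ sub-folder.
script_paths <- file.path("code", scripts)

for (script in script_paths) {
  if (file.exists(script)) {
    message("Running ", script)
    source(script)
  } else {
    message("Skipping missing script: ", script)
  }
}
```

Keeping the order in one place makes it easier to rerun the whole analysis from raw data after any change.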
13.3.2 Always store data with code and output
If you are creating maps, the raw (and possibly post-processed, or intermediate) data that supports those maps should be stored inside your project folder. This is the only way to guarantee that you can return in a year and recreate the map exactly. If you have multiple data files, you might consider putting this content in a sub-folder, possibly labeled data/.
13.3.3 Maintain all output files (figures, cleaned datasets, etc)
Just as you want to store code and data together, you should also plan to store all output content in the main project folder or possibly in one or more sub-folders (e.g. images/ or reports/). There are several kinds of outputs that might be generated including:
- Images or figures
- Maps
- Cleaned or prepared datasets (either stored as .xlsx or .csv or possibly stored in R binary format such as .rds)
- Reports (e.g. rendered from R Markdown either as html or pdf)
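The layout described in this section can be sketched with base R alone. The example below works in a temporary folder so it does not touch real files; the project name, the file names, and the tiny dataset are all made up for illustration.

```r
# Use a temporary folder so this sketch does not touch real project files.
project_dir <- file.path(tempdir(), "project-x")

# Create the sub-folders suggested above.
for (folder in c("code", "data", "images", "reports")) {
  dir.create(file.path(project_dir, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# A tiny made-up dataset standing in for a cleaned analysis file.
cleaned <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))

# Save it as plain text (.csv) and in R's binary format (.rds).
csv_file <- file.path(project_dir, "data", "cleaned.csv")
rds_file <- file.path(project_dir, "data", "cleaned.rds")
write.csv(cleaned, csv_file, row.names = FALSE)
saveRDS(cleaned, rds_file)

# Read the .rds file back and compare it with the original object.
restored <- readRDS(rds_file)
print(identical(restored, cleaned))
```

The .csv copy can be opened by other software, while the .rds copy preserves R-specific details such as column types.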
13.4 Use the here package to maintain robust relative pathnames
There are many reasons to keep your work organized, but one is to maintain a known and constant relationship between where data and code are stored. As discussed above, the use of setwd() creates a rigid or absolute pointer to where files are stored (e.g. your data might be at C:\MyDocuments\EPI563\Week1). But if you changed computers or changed your file structure on your current computer, that absolute path would likely fail, making the code non-reproducible (your code could not find your data)!
Instead, please try to prefer relative pathnames. A relative pathname describes where something is relative to a given starting point. In the case of projects in RStudio, that starting point is always the folder containing the project. Thus, the location of a dataset stored in a sub-folder called data is: data/mydataset.xlsx; it is assumed that the folder data is a sub-folder of the parent or project folder. As long as you keep your project as a self-contained folder (e.g. copy/paste it as a folder or share it as a folder with all contents), this relative location will be robust.
The here package was developed to try to make some of this a bit easier. The package named here also has a function named here() (I know it feels a bit repetitive!). The function here() serves to describe the hierarchical nesting of folders that locates the file or location you desire (e.g. where to import a dataset from or where to save a figure to). Here are some examples of how to use here():
- Importing data: mydata <- read.csv(here('data', 'wave1', 'wave1_data.csv')). In this code, we create a new object (named mydata) that results from using the function read.csv(). The data is located within the project folder at this relative path location: data/wave1/wave1_data.csv.
- Saving output: ggsave(here('figures', 'figure1.png')). In this code, we save the ggplot() figure (by default, the most recently displayed plot) to our computer at this location within the overall project folder: figures/figure1.png.
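The import example above can be written out as a small script. This sketch assumes the here package is installed and reuses the folder and file names from the examples; the file.exists() guard is an addition so the code runs even when that dataset is not present.

```r
# Assumes the here package is installed: install.packages("here")
library(here)

# With no arguments, here() returns the project root it detected.
print(here())

# Build paths from folder names; here() adds the separators.
data_file   <- here("data", "wave1", "wave1_data.csv")
figure_file <- here("figures", "figure1.png")
print(data_file)

# Import only if the file exists, so the sketch runs in any project.
if (file.exists(data_file)) {
  mydata <- read.csv(data_file)
} else {
  message("File not found: ", data_file)
}
```

The value in figure_file could be passed to ggsave() in the same way as in the second example above.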
Caution: If you work in a Windows OS environment, be careful how you designate file pathnames. R uses notation similar to that of Unix and macOS, which separates nested folders with a forward slash, as in: H:/mkram02/gis-file. Unfortunately that is the opposite of how Windows describes pathnames (e.g. in Windows the above would use backslashes like this: H:\mkram02\gis-file). Using the here package avoids this confusion.
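A short base R sketch shows why backslashes cause trouble: inside an R string a backslash starts an escape sequence, so a Windows-style path must be typed with doubled backslashes. The path is the same illustrative one used in the caution above, and the variable names are arbitrary.

```r
# A Windows-style path has to be typed with doubled backslashes,
# because a single backslash starts an escape sequence in an R string.
windows_style <- "H:\\mkram02\\gis-file"

# The forward-slash form needs no escaping.
unix_style <- "H:/mkram02/gis-file"

# file.path() joins folder names with forward slashes.
built <- file.path("H:", "mkram02", "gis-file")
print(built)

# Converting backslashes to forward slashes gives the same path.
converted <- gsub("\\", "/", windows_style, fixed = TRUE)
print(identical(converted, unix_style))
print(identical(built, unix_style))
```

Building paths from their parts, with file.path() or here(), avoids typing separators at all.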
13.4.1 Specify a relative location outside the working directory
What if you have one folder for this entire course, and inside it a separate project directory for each week? If you are working on the project for Week2, you might wish to load a file that you saved previously in Week1. In other words, the file is not in a sub-folder, but is actually outside of the current directory. You could use the setwd() function to change the location, but that creates a fragile absolute pathname and can be dangerous. Instead you could create a more robust relative pathname by referring to the other file in relation to your current location.
Using two dots in a pathname tells R to go up a level in the directory. So if the georgia.csv file I referred to above were in your Week1 directory, but you are currently working in Week2, you could do this:
dd <- read.csv('../data/death-data/georgia.csv')
This means “go up a level, then look in the data folder, then the death-data folder, then load the georgia.csv file”. If you need to go up two (or more) levels, simply repeat: ../../data/death-data/georgia.csv
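To see how the two dots are resolved, the sketch below builds a throwaway course folder in a temporary directory, with hypothetical Week1 and Week2 sub-folders and a toy georgia.csv, then reaches the file starting from Week2. The layout and the data are made up for illustration and are simpler than the data/death-data nesting above.

```r
# Build a small, disposable folder structure: course/Week1 and course/Week2.
course_dir <- file.path(tempdir(), "course")
week1_dir  <- file.path(course_dir, "Week1")
week2_dir  <- file.path(course_dir, "Week2")
dir.create(week1_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(week2_dir, recursive = TRUE, showWarnings = FALSE)

# Toy data saved in Week1 (made-up values, not real death data).
toy <- data.frame(id = 1:3, value = c(10, 20, 30))
write.csv(toy, file.path(week1_dir, "georgia.csv"), row.names = FALSE)

# Starting from Week2: go up one level (".."), then down into Week1.
path_from_week2 <- file.path(week2_dir, "..", "Week1", "georgia.csv")
dd <- read.csv(path_from_week2)
print(dd)

# normalizePath() resolves the ".." and shows both routes reach one file.
same_file <- identical(
  normalizePath(path_from_week2),
  normalizePath(file.path(week1_dir, "georgia.csv"))
)
print(same_file)
```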