
Manage your Data Science Project in R


A simple project tutorial with R/RMarkdown, Packrat, Git, and DVC.


The pain of managing a Data Science project

Something has been bothering me for a while: reproducibility and data tracking in data science projects. I had read about some technologies but never really tried any of them out until recently, when I couldn’t stand the feeling of losing track of my analyses anymore. At some point, I decided to give DVC a try after some friends, mostly Flávio Clésio, suggested it to me. In this post, I will talk about Git, DVC, R, RMarkdown, and Packrat, everything I think you may need to manage your Data Science project, but the focus is definitely on DVC.

One of the things I had been looking for is an approach that would let me track the “version” of the dataset I used for a specific analysis when I wrote and ran an R script or a set of R scripts. Did I run some-script.R on the raw data or on the preprocessed data? Did I run some-other-script.R on the dataset preprocessed this way or that other way? What version of my dataset was used in the analysis done by the code committed in put-a-git-commit-hash-here? In a few words: how do you track code and data at the same time? (And of course, having reports in RMarkdown included is a plus.)

This tutorial will guide you through the management/execution of a basic example of a Data Science project written in R. It’s not very close to what I do in my daily life and I assume it’s not what you do either, but it’s basic, simple and still useful for understanding how to use these tools together, mostly DVC.

One second to talk about Git, Packrat and RMarkdown


I don’t want to go into much detail about Git and Packrat here (or how to install them), since this is not the focus of the post. In a few words, Git is version control software, which means that it helps you track the evolution of your source code. Whenever you make a reasonable change to your code, you take a snapshot of it (a save point, if you like) through what is called a commit. This way, you can “time travel” through your project. You can find out when a bug was accidentally introduced in the code, or work on the same code at the same time with several friends. Git will manage this for you. DVC assists Git by adding this management/tracking support for data files and by managing pipelines. There will be some Git commands here and there, and for the sake of not making this tutorial longer than it already is, I will not comment on them. You can get help here.


Packrat is a package dependency manager for R, which means that you can, for example, have one Data Science project using version 1.X of an R package and another DS project using version 1.Z of the same package, with both projects on the same machine. Besides, you can take snapshots with packrat too, making it easy to share your whole environment with colleagues or to recreate it on another machine. If you come from a Python background, this is like pip and virtualenv: isolation and package dependency management.


As its official website says, “Your data tells a story. Tell it with R Markdown. Turn your analyses into high quality documents, reports, presentations and dashboards”. Think of it as Markdown + R, a markup language on steroids.

DVC


When it comes to data files, the idea behind DVC is to create small text files (metadata) describing the real data files and give those to git to take care of, instead of the data files themselves. Git is not made for big blobs or data files, and by versioning your GB (TB?) datasets you can make your repository slow to work with (and I won’t even mention cloning...). DVC takes care of the real files, while git takes care of the metadata generated by DVC.

DVC operates like Git and uses basically the same syntax: you add (dvc add) and push (dvc push) files to a remote location (remote storage, in dvc nomenclature). If you’re new to version control, this can take a little time to get used to. If you added a dataset to DVC but forgot to push, and you then accidentally removed your dataset or lost the git repository, the remote storage (where DVC saves your real files) won’t have the dataset, just like a git remote repository wouldn’t have your source code if you had not run git push. Everything is fine if you push, of course. And in the same way that you have a “copy” of your source code both in the remote repository and in your local repository, the real data files are both in the remote storage and in your local repository (in DVC’s hidden cache folder).

Some common DVC tasks can be automated by integrating them into your git commands. If you like this idea, dvc install is what you’re looking for.
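To make the metadata idea a bit more concrete, here is a minimal shell sketch of the general technique (content-addressed storage plus a small pointer file). It is not DVC's actual implementation: demo.tsv, demo_cache and the demo.tsv.pointer format are made-up names for illustration, and the sketch assumes md5sum from GNU coreutils is available.

```shell
#!/bin/sh
# Toy illustration of "small pointer file for git, big file stored elsewhere".
# All names here (demo.tsv, demo_cache, demo.tsv.pointer) are hypothetical.
set -eu

workdir="$(mktemp -d)"
cd "$workdir"

# A stand-in for a large dataset.
printf 'a\tb\n1\t2\n' > demo.tsv

# Hash the content; the hash becomes the file's identity.
hash="$(md5sum demo.tsv | cut -d' ' -f1)"

# Store the real file in a cache, keyed by its hash:
# first two characters as a folder, the remaining ones as the file name.
prefix="$(printf '%s' "$hash" | cut -c1-2)"
rest="$(printf '%s' "$hash" | cut -c3-)"
cache_path="demo_cache/$prefix/$rest"
mkdir -p "demo_cache/$prefix"
cp demo.tsv "$cache_path"

# Write the tiny text file that a version control system would track.
printf 'md5: %s\npath: demo.tsv\n' "$hash" > demo.tsv.pointer

cat demo.tsv.pointer
```

The pointer file stays a couple of lines long no matter how big the dataset gets, which is why handing it to git (and keeping the dataset itself out of git) keeps the repository light.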

If you have pip installed, you can easily install dvc with pip install dvc. If you don’t, don’t worry. There are other ways to install it, such as snap (GNU/Linux) or brew (macOS).

Let’s start our experiment.

The first thing we should do is (1) create our Git repository, (2) initialize DVC, and (3) initialize packrat.

# Starting your repository with git, packrat and DVC
mkdir -p ~/dev/r_dvc_rmarkdown && cd ~/dev/r_dvc_rmarkdown
git init
dvc init
R --silent -e "packrat::init(\"~/dev/r_dvc_rmarkdown\")"
git add .Rprofile .gitignore packrat/
git commit -m "Initializes Git, Packrat and DVC"

In this case, I’m creating a local remote which actually is a folder synced with Dropbox. Your real data files will be there.

# Configuring remote
dvc remote add -d myremote ~/Dropbox/PhD\ data/dvc-storage
# You can see your remotes with the command: dvc remote list
# After creating a remote, dvc creates a config file. Let's commit it
git commit .dvc/config -m "Adds local/dropbox remote"

Now it’s time to show DVC where our dataset is. DVC will generate metadata for git to track, and DVC itself will start tracking the real data files. You can download the dataset file (simulation.tsv) by clicking here.

# Setting up data
mkdir data
# Replace the next line with the path where you saved simulation.tsv
dvc get-url ~/Dropbox/PhD\ data/simulation.tsv data/
# If you stop here, git will see the data file, which may make you
# feel like you should git add/commit it. *Don't do it*: you should not
# git commit data files. They're huge and git is not made for this.
# Instead, ask dvc to track it.
dvc add data/simulation.tsv
# This command will generate the data file metadata which is what
# you should track with git. Also ask git to track the .gitignore
# file created by DVC so that git will stop suggesting you to track
# the big data files.
git add data/simulation.tsv.dvc data/.gitignore
git commit -m "Adds source data to DVC"

At this point, your git repository is up to date. If you don’t have a remote repository, everything is all set with Git. If you are also hosting your code on GitHub, Bitbucket, GitLab (or any other git repository hosting solution), it’s time to push. DVC is no different: if you go to your remote storage (which in this case is local), there is nothing there. Not even the folder has been created. Well, if you have a git remote, your repo won’t be there either, right? We haven’t pushed yet. Let’s push with the dvc push command.

# Push to the DVC remote repository
dvc push

If you check now, not only is there a folder there, but the dataset is there too. Not metadata, the real big file. Not in a git remote, but in a DVC remote storage. However, it won’t look like you expect. You will see a mysteriously named folder with a mysteriously named file inside: the names come from the md5 hash of the dataset. Believe me, the content is there. To check, you can run ls -lh (in bash) on that folder, or du -sh on the remote storage, and compare the size to what you expect.
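If you would rather not take my word for it, a check along these lines can confirm that a hash-named file really holds your content. This is only a sketch: it builds its own toy_storage folder instead of touching a real DVC remote, and it assumes the layout in which the folder name is the first two characters of the md5 hash and the file name is the rest. toy_storage and original.tsv are made-up names, and md5sum from GNU coreutils is assumed.

```shell
#!/bin/sh
# Toy check that hash-named files hold the content their names promise.
# toy_storage and original.tsv are hypothetical stand-ins for a real remote.
set -eu

workdir="$(mktemp -d)"
cd "$workdir"

# Build a one-file toy storage using the <2 chars>/<remaining chars> layout.
printf 'x\ty\n-0.5\t1.2\n' > original.tsv
hash="$(md5sum original.tsv | cut -d' ' -f1)"
dir="toy_storage/$(printf '%s' "$hash" | cut -c1-2)"
mkdir -p "$dir"
cp original.tsv "$dir/$(printf '%s' "$hash" | cut -c3-)"

# Walk the storage: folder name + file name should equal the content's md5.
mismatches=0
for f in toy_storage/*/*; do
    expected="$(basename "$(dirname "$f")")$(basename "$f")"
    actual="$(md5sum "$f" | cut -d' ' -f1)"
    if [ "$expected" = "$actual" ]; then
        echo "$f: $(wc -c < "$f") bytes, hash matches"
    else
        echo "$f: hash does NOT match"
        mismatches=$((mismatches + 1))
    fi
done
echo "mismatches: $mismatches"
```

Pointing the loop at a real storage folder would only make sense if that storage uses the same layout, which is an assumption here.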


Let’s now start writing our data science code/analysis. Let’s say we want to set all negative values to -1 and all positive values to +1, that is, binarize our variables. This would be our preprocessing step.

# Create the preprocess script
# preprocess.R with the content below.
input_file <- read.csv(file = 'data/simulation.tsv',
                       sep = '\t', stringsAsFactors = FALSE)
# Keep only the first 50 columns
input_file <- input_file[, 1:50]
# Binarize the numeric columns: negative values become -1,
# positive values become +1 (zeros and NAs are left untouched)
numeric_cols <- unlist(lapply(input_file, is.numeric))
input_file[, numeric_cols] <-
  apply(input_file[, numeric_cols],
        2,
        function(x) ifelse(x < 0, -1, x))
input_file[, numeric_cols] <-
  apply(input_file[, numeric_cols],
        2,
        function(x) ifelse(x > 0, 1, x))
write.csv2(input_file,
           'data/simulation_preprocessed.csv',
           row.names = FALSE)

After having written the preprocessing script, we must set it as a stage in our pipeline. You do this with the dvc run command. It’s here that you tell DVC what this stage needs (e.g. input files) and what it will output (e.g. processed data, results). When you have a pipeline with several stages, that’s how DVC knows if it’s necessary to re-run all your stages or only the N-th stage because it verified that nothing changed in the previous stages ;-).
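The "only re-run what changed" idea can be sketched in a few lines of shell. This is a toy illustration of the general technique (comparing dependency checksums against the ones recorded at the last run), not how DVC is implemented: run_stage, stage.deps.md5, input.txt and output.txt are made-up names, and md5sum from GNU coreutils is assumed.

```shell
#!/bin/sh
# Toy version of "re-run a stage only when its dependencies changed".
# run_stage, stage.deps.md5 and the file names are hypothetical.
set -eu

workdir="$(mktemp -d)"
cd "$workdir"

printf '3\n-2\n' > input.txt
runs=0

run_stage() {
    # Checksum of the dependency as it is right now.
    current="$(md5sum input.txt | cut -d' ' -f1)"
    # Compare it with the checksum recorded after the last run, if any.
    if [ -f stage.deps.md5 ] && [ "$(cat stage.deps.md5)" = "$current" ]; then
        echo "Dependencies unchanged, skipping stage"
        return 0
    fi
    # The "stage command": here, just sort the input into an output file.
    sort -n input.txt > output.txt
    # Record the checksum so the next call can detect changes.
    printf '%s\n' "$current" > stage.deps.md5
    runs=$((runs + 1))
    echo "Stage executed"
}

run_stage                      # first run: executes
run_stage                      # nothing changed: skipped
printf '7\n' >> input.txt      # modify a dependency
run_stage                      # dependency changed: executes again
echo "Stage ran $runs times"
```

With several stages chained together, the same comparison applied stage by stage is what lets a pipeline tool skip the early stages and redo only the ones downstream of a change.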

# Create a pipeline to preprocess
dvc run -f preprocess.dvc \
-d data/simulation.tsv -d preprocess.R \
-o data/simulation_preprocessed.csv \
Rscript preprocess.R

Let’s commit this.

git add data/.gitignore preprocess.dvc preprocess.R
git commit -m "Adds preprocessing script/pipeline entry"

After having preprocessed our original dataset, we would like to generate a graph with the MIIC package. MIIC is an information-theoretic method that learns a large class of causal or non-causal graphical models from purely observational data, while including the effects of unobserved latent variables, commonly found in many datasets. It generates a DAG, so please do not confuse it with DVC or with the pipeline DAG that DVC generates.

How do I install an R package to this repository environment and not to the global environment (my whole system)? As long as you run R from within the repository (where you initialized packrat, remember?) you will be fine.

R --silent -e "install.packages(\"miic\")"
# You can run the command below to check which packages are installed
# in the repository R environment (and even see the ones that should be)
# and generate a snapshot of it so others can reproduce the same env.
R --silent -e "packrat::snapshot()"
# You will see some new packages, which are dependencies of miic.
# Now you have to add these changes so that if someone clones
# your repository, they will have the same packages you used,
# making your DS project reproducible. You don't commit yet, since it
# makes sense to commit this installation together with the R script
# file that will use this package.
git add .

With MIIC installed, let’s write the script file for the second stage of our pipeline.

# Create a miic.R file with the content below.
input_file <- read.csv(file = 'data/simulation_preprocessed.csv', sep = ';')
library(miic)
input_file <- as.data.frame(sapply(input_file, as.factor))
res <- miic(input_file, propagation = FALSE)
if (!dir.exists('output_dir')) {
  dir.create('output_dir')
}
write.csv2(res$all.edges.summary,
           'output_dir/all.edges.summary.csv',
           row.names = FALSE)

Now, we need to add miic.R to the pipeline.

# Add a pipeline stage for miic.R
dvc run -f miic.dvc \
-d data/simulation_preprocessed.csv -d miic.R \
-o output_dir/all.edges.summary.csv \
Rscript miic.R

And after that, we must commit, as suggested by DVC.

git add output_dir/.gitignore miic.dvc miic.R
git commit -m "Adds miic script/pipeline/management by packrat"

As a final step, we would like to know the edges contained in the inferred graph, that is, which nodes (variables) are associated with each other.

# Results
# Create a final.R file with the content below.
input_file <- read.csv2(file = 'output_dir/all.edges.summary.csv')
# Keep only the edges whose type is 'P'
input_file <- input_file[input_file$type == 'P', ]
write.csv2(input_file[, c('x', 'y')],
           'output_dir/retained.edges.csv',
           row.names = FALSE)

# Add a pipeline stage for final.R
dvc run -f final.dvc \
    -d output_dir/all.edges.summary.csv -d final.R \
    -o output_dir/retained.edges.csv \
    Rscript final.R
# And then git add/commit
git add final.dvc output_dir/.gitignore final.R
git commit -m "Adds final script/pipeline entry"

How does our pipeline look?

# Run the following command
dvc pipeline show --ascii final.dvc
# The output should look like this:
+-------------------------+
| data/simulation.tsv.dvc |
+-------------------------+
             *
             *
             *
    +----------------+
    | preprocess.dvc |
    +----------------+
             *
             *
             *
       +----------+
       | miic.dvc |
       +----------+
             *
             *
             *
       +-----------+
       | final.dvc |
       +-----------+

You can also see the same DAG showing, instead of stage names, the command run at each stage.

dvc pipeline show --ascii final.dvc --command
+----------------------+
| Rscript preprocess.R |
+----------------------+
            *
            *
            *
   +----------------+
   | Rscript miic.R |
   +----------------+
            *
            *
            *
   +-----------------+
   | Rscript final.R |
   +-----------------+

Or you can see the output files of each stage, along with the files they were generated from.

dvc pipeline show --ascii final.dvc --outs
+---------------------+          +--------------+
| data/simulation.tsv |          | preprocess.R |
+---------------------+          +--------------+
               ***                  ***
                  **              **
                    **          **
          +----------------------------------+        +--------+
          | data/simulation_preprocessed.csv |        | miic.R |
          +----------------------------------+       *+--------+
                              **                  ***
                                ***             **
                                   **         **
                      +----------------------------------+       +---------+
                      | output_dir/all.edges.summary.csv |       | final.R |
                      +----------------------------------+     **+---------+
                                        **                   **
                                          ***            ***
                                             **       **
                              +-------------------------------+
                              | output_dir/retained.edges.csv |
                              +-------------------------------+

Now let’s do the final part of our pipeline which is to count the number of edges contained in the final graph. MIIC starts with a complete graph, that is, all nodes (variables) are connected, through an edge, to all the other nodes. MIIC then looks for spurious associations and removes the edges that seem to be spurious. One of the interesting things that MIIC can do, given some assumptions, is to show you that A doesn’t cause B, even though A and B are correlated, among other things (you can read about it here).

Our goal in this tutorial is to count the number of edges that were retained at the end of the analysis. A silly task, but again, this is a basic tutorial on DVC.

Let’s write our last script for the last stage.

# Create a file named evaluate.R with the content below
input_file <- read.csv2(file = 'output_dir/retained.edges.csv')
# The metric is simply the number of retained edges
write.table(nrow(input_file),
            'output_dir/metric.txt',
            row.names = FALSE,
            col.names = FALSE)

Let’s add it as a stage in our pipeline, setting the output file as a metrics file for DVC. By doing this, we let DVC know what information we want to compare between different experiments. After that, we commit the changes and check the metric.

# And then the pipeline entry
dvc run -f evaluate.dvc \
    -d output_dir/retained.edges.csv -d evaluate.R \
    -m output_dir/metric.txt \
    Rscript evaluate.R
# By giving the parameter -m, instead of -o, we tell DVC
# that this file holds a metric for our pipeline.
# Git
git add evaluate.dvc output_dir/.gitignore evaluate.R
git commit -m "Adds evaluate script/pipeline entry"
# Let's check our metrics (thanks to the -m parameter).
dvc metrics show -T
# Output
working tree:
    output_dir/metric.txt: 87

This is our baseline experiment. Cool. Let’s make that clear to DVC with a Git tag.

git tag -a "baseline-experiment" -m "baseline"

Let’s start a new experiment

Let’s change our preprocessing. Before binarizing, let’s replace the missing values (NA) with the mean of the variable and then we binarize again.

# Your new preprocess.R should look like this
input_file <- read.csv(file = 'data/simulation.tsv',
                       sep = '\t',
                       stringsAsFactors = FALSE)
input_file <- input_file[, 1:50]
# Replace missing values (NA) with the mean of each variable
input_file <- lapply(input_file,
                     function(x) replace(x,
                                         is.na(x),
                                         mean(x,
                                              na.rm = TRUE)))
input_file <- as.data.frame(input_file)
# Binarize the numeric columns as before
numeric_cols <- unlist(lapply(input_file, is.numeric))
input_file[, numeric_cols] <-
  apply(input_file[, numeric_cols],
        2,
        function(x) ifelse(x < 0, -1, x))
input_file[, numeric_cols] <-
  apply(input_file[, numeric_cols],
        2,
        function(x) ifelse(x > 0, 1, x))
write.csv2(input_file,
           'data/simulation_preprocessed.csv',
           row.names = FALSE)

# Then let's reproduce our full pipeline
dvc repro evaluate.dvc
# Let's compare our metrics.
dvc metrics show -T
# Output
working tree:
    output_dir/metric.txt: 95
baseline-experiment:
    output_dir/metric.txt: 87

One thing you can do so that you don’t have to type the name of the last stage of your pipeline after dvc repro is to name the stage Dvcfile (dvc run -f Dvcfile and so on). This is the file that dvc repro will look for when it’s run without a stage name.

After everything is done, we push with DVC, just as we would with git.

dvc push
# And let's commit this new analysis.
git add .
git commit -m "Changes preprocessing for NA imputation by mean"
# Add a new tag to help DVC know this is a finished analysis
git tag -a "NA-imputation-by-mean" -m "NAimputation"

Traveling in time with data objects

If you run the command cat output_dir/retained.edges.csv, you will see what you expect, that is, the file generated by the new (latest) experiment. All the data files in the local repository were updated accordingly. What if you wanted to see what this file looked like in the previous experiment? Seeing the earlier source code of preprocess.R is easy, and straightforward if you already know git. You simply do:

git checkout HEAD~1
cat preprocess.R

But if you run cat output_dir/retained.edges.csv, you still see the file generated with the latest version of your pipeline. How do you see the file as it was at the commit you are on right now?

dvc checkout output_dir/retained.edges.csv
cat output_dir/retained.edges.csv

The dvc checkout command is easy to remember because it mirrors the git one (git checkout).
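Why are two checkouts needed? The toy sketch below illustrates the general mechanism under simplified assumptions: the version control side only restores a small pointer file, and a second step uses that pointer to bring the matching data back from a cache. It is not DVC's implementation; toy_add, toy_checkout, toy_cache and the .pointer format are made-up names, and md5sum from GNU coreutils is assumed.

```shell
#!/bin/sh
# Toy version of restoring a data file from a cache, given a pointer file.
# toy_add, toy_checkout, toy_cache and the .pointer format are hypothetical.
set -eu

workdir="$(mktemp -d)"
cd "$workdir"

toy_add() {      # store file $1 in the cache and write $1.pointer
    h="$(md5sum "$1" | cut -d' ' -f1)"
    mkdir -p toy_cache
    cp "$1" "toy_cache/$h"
    printf '%s\n' "$h" > "$1.pointer"
}

toy_checkout() { # make file $1 match whatever $1.pointer says
    cp "toy_cache/$(cat "$1.pointer")" "$1"
}

# Experiment 1: produce a result and keep a copy of its pointer
# (in the real workflow, git would version the pointer file).
printf 'x;y\nA;B\n' > edges.csv
toy_add edges.csv
cp edges.csv.pointer pointer_at_experiment1

# Experiment 2: the result changes.
printf 'x;y\nA;B\nB;C\n' > edges.csv
toy_add edges.csv

# Going back to the old commit brings back the old pointer only...
cp pointer_at_experiment1 edges.csv.pointer
# ...so the data file is still the new one until it is restored from the cache.
lines_before="$(wc -l < edges.csv)"
toy_checkout edges.csv
lines_after="$(wc -l < edges.csv)"
echo "lines before checkout: $lines_before, after: $lines_after"
```

In this sketch the file only shrinks back to its first version after toy_checkout runs, which is the same reason the data file in the tutorial stays unchanged after git checkout alone.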

Presenting your results

Now that you’re done with your analyses, it’s time to present them. You can do it through a web dashboard (Shiny?), through a slide presentation (Xaringan?), or a report in PDF or HTML (RMarkdown?). The nice thing about R is that you can do all these things without leaving R. Shiny, Xaringan, and so on are all R packages!

In this tutorial, I decided to present my analyses through RMarkdown in HTML. I usually start writing the report during the analyses, sometimes even before I start writing code. I may be writing about the dataset itself, the data, its context, among other things. However, in this tutorial, for the sake of simplicity, I left the report until everything else was done.

You can download the template here and play with it. RStudio can knit it for you and you should have an HTML report looking like the image below.

Reproducibility

What about reproducing these results somewhere else? It could be on another machine of mine, on a colleague’s machine, on a server, or anywhere else. How would we do it?

# The first thing to do is to bring a copy of the source code to
# your local machine. In git terminology, this is called cloning.
git clone https://github.com/mribeirodantas/r_dvc_git_packrat.git
cd r_dvc_git_packrat
# You can see within this folder that it only contains what git
# tracks, that is, NOT your data. Metadata about your data, but
# not data.
# Next step is to open R. This will trigger packrat to auto install.
R
# Next, you need to do a restore, that is, restore the R packages
# that were in the environment but are not in yours now.
packrat::restore()
# Now, the same way you cloned from git, you need to pull from a
# DVC remote. On my computer, since I have the local remote, I only
# have to do dvc pull. In your case, you need to add the really
# remote remote 🙂
dvc remote add newremote https://www.dropbox.com/sh/kjviy4r3la22fj6/AAB43oyVnbx-sNkERJshMxdha
# And then you pull with the new remote
dvc pull -r newremote
# I hosted my remote in Dropbox but you can do it in Amazon S3,
# among other solutions. If someone hosted it somewhere, you
# can also download it and turn it into your local remote.
# First, download it.
wget -c https://www.dropbox.com/sh/kjviy4r3la22fj6/AAB43oyVnbx-sNkERJshMxdha\?dl\=1 --output-document=dvc-storage.zip
# Next, unzip and remove the .zip
unzip dvc-storage.zip -d dvc-storage
rm dvc-storage.zip
# Place this folder (your dvc remote) somewhere else, like in your
# home directory
mv dvc-storage $HOME/
# Set it to be your remote and remove the default one I set
dvc remote remove myremote
dvc remote add mynewremote $HOME/dvc-storage
dvc pull -r mynewremote
# If you do not want to specify the remote manually on every pull,
# set this one as your default
dvc remote default mynewremote
dvc pull

Workflow overview

  • You track your source code with Git
  • You track your data files with DVC
  • You track your pipeline with DVC
  • You track your R dependencies with packrat
  • You isolate your work environment with packrat
  • You present your analysis results with Xaringan, RMarkdown, Shiny, or whatever you want 🙂

Everything we did here is hosted in a GitHub repository. I would like to thank all my friends who read this text when it was still a draft, including Ivan Shcheklein, Elle O’Brien, and Jorge Orpinel, all from the DVC team.