Manage your Data Science Project in R
A simple project tutorial with R/RMarkdown, Packrat, Git, and DVC.
The pain of managing a Data Science project
Something has been bothering me for a while: reproducibility and data tracking in data science projects. I had read about some technologies but never really tried any of them out until recently, when I couldn’t stand the feeling of losing track of my analyses anymore. At some point, I decided to give DVC a try after some friends, mostly Flávio Clésio, suggested it to me. In this post, I will talk about Git, DVC, R, RMarkdown and Packrat, everything I think you may need to manage your Data Science project, but the focus is definitely on DVC.
One of the things I had been looking for was an approach that would let me track the “version” of the dataset I used for a specific analysis when I wrote and ran an R script (or a set of R scripts). Did I run some-script.R on the raw data or on the preprocessed data? Did I run some-other-script.R on the dataset preprocessed this way or preprocessed that other way? What version of my dataset was used in the analysis done by the code committed at put-a-git-commit-hash-here? In a few words: how do you track code and data at the same time? (And, of course, including reports in RMarkdown is a plus.)
This tutorial will guide you through the management/execution of a basic example of a Data Science project written in R. It’s not very close to what I do in my daily life and I assume it’s not what you do either, but it’s basic, simple and still useful for understanding how to use these tools together, mostly DVC.
One second to talk about Git, Packrat and RMarkdown
I don’t want to get into much detail about Git and Packrat here (or how to install them); that is not the focus of this post. In a few words, Git is version control software, which means it helps you track the evolution of your source code. Whenever you make a reasonable change to your code, you take a snapshot of it (a save point, if you will) through what is called a commit. This way, you can “time travel” through your project: you can find out when a bug was accidentally introduced in the code, or work on the same code with several friends at the same time. Git will manage this for you. DVC comes to assist Git by adding this management/tracking support for data files and pipelines. There will be some Git commands here and there and, for the sake of not making this tutorial longer than it already is, I will not comment on them. You can get help here.
Packrat is a package dependency manager for R, which means you can, for example, have one Data Science project using version 1.X of an R package while another project uses version 1.Z of the same package. Besides, you can take snapshots with Packrat too, making it easy to share your whole environment with colleagues or recreate it on another machine. If you come from a Python background, this is like pip and virtualenv: isolation plus package dependency management.
As its official website says, “Your data tells a story. Tell it with R Markdown. Turn your analyses into high quality documents, reports, presentations and dashboards”. Think of it as Markdown plus R: a markup language on steroids.
When it comes to data files, the idea behind DVC is to create small text files (metadata) that describe the real data files and hand those over for Git to track. Git is not made for big blobs or data files: by versioning your GB (TB?) datasets, you can make your repository slow to work with (and I won’t even mention cloning…). DVC takes care of the real files, while Git takes care of the metadata generated by DVC.

DVC operates like Git and uses basically the same syntax: you add (dvc add) and push (dvc push) files to a remote location (a remote storage, in DVC nomenclature). If you’re new to version control, this can take a little time to get used to. If you added a dataset to DVC but forgot to push, and by accident you removed your dataset or lost the Git repository, the remote storage (where DVC saves your real files) won’t have the dataset, just like a Git remote repository wouldn’t have your source code if you never ran git push. Everything is fine if you pushed, of course. However, just as your source code lives both in the remote repository and in your local Git repository, the real data files live both in the remote storage and in your local repository (in DVC’s hidden cache folder).

Some common DVC tasks can be automated and integrated into your Git commands. If you like this idea, dvc install is what you’re looking for.
If you have pip installed, you can easily install dvc with pip install dvc. If you don’t, don’t worry. There are other ways to install it, such as snap (GNU/Linux) or brew (OSX).
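For reference, the install commands look like this (pick the one that matches your system; only one of them is needed):

```shell
# Install DVC; uncomment the line that fits your setup.
pip install dvc                 # via pip (any OS with Python)
# snap install --classic dvc    # GNU/Linux with snap
# brew install dvc              # macOS with Homebrew
```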
Let’s start our experiment.
The first thing we should do is to (1) create our Git repository, (2) start packrat and (3) DVC.
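Assuming Git, DVC and the packrat R package are already installed, these three steps can be sketched as below (the folder name and commit message are my own choices, not prescribed by any tool):

```shell
mkdir dvc-tutorial && cd dvc-tutorial
git init                          # (1) version control for the code
Rscript -e 'packrat::init()'      # (2) isolated R library for this project
dvc init                          # (3) data and pipeline tracking
git add . && git commit -m "Initialize Git, packrat and DVC"
```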
In this case, I’m creating a local remote which actually is a folder synced with Dropbox. Your real data files will be there.
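Registering such a local remote might look like this (the Dropbox-synced path is just an example; any folder works):

```shell
# Make a Dropbox-synced folder the default DVC remote storage.
dvc remote add -d dropbox ~/Dropbox/dvc-tutorial-storage
git add .dvc/config && git commit -m "Configure local DVC remote storage"
```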
Now it’s time to show DVC where our dataset is. DVC will generate metadata for Git to track, and DVC itself will start tracking the real data file. You can download the dataset file (simulation.tsv) by clicking here.
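Tracking the dataset could look like this, assuming simulation.tsv sits at the repository root:

```shell
dvc add simulation.tsv                 # DVC tracks the real file...
git add simulation.tsv.dvc .gitignore  # ...Git tracks the small metadata file
git commit -m "Track simulation.tsv with DVC"
```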
At this point, your Git repository is up to date. If you don’t have a remote repository, everything is all set with Git; if you are also hosting your code on GitHub, Bitbucket, GitLab (or any other Git hosting solution), it’s time to push. In DVC it is no different: if you go to your remote storage (which in this case is local), there is nothing there. Not even the folder has been created. Well, if you had a Git remote, your repo wouldn’t be there either, right? We haven’t pushed yet. Let’s push with the dvc push command.
If you check now, not only is there a folder there, but also the dataset. Not metadata, the real big file. Not in a Git remote, but in a DVC remote storage. However, it won’t look like you expect: it’s going to be a mysteriously named folder with a mysteriously named file inside (the names come from the md5 hash of the dataset), but believe me, the content is there. To check, you can use ls -lh (in bash) to see the size of the files in the remote storage and compare it to what you expect.
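You can reproduce that naming scheme with plain coreutils, no DVC involved: the file ends up under a folder named after the first two hex characters of its md5 hash, with the remaining thirty characters as the file name.

```shell
# Where would a file with the content "hello" land in the remote storage?
hash=$(printf 'hello' | md5sum | cut -d' ' -f1)
prefix=$(echo "$hash" | cut -c1-2)
rest=$(echo "$hash" | cut -c3-)
echo "$prefix/$rest"    # 5d/41402abc4b2a76b9719d911017c592
```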
Let’s now start writing our data science code/analysis. Let’s say we want to set all negative values to -1 and all positive values to +1, that is, binarize our variables. This would be our preprocessing step.
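The actual preprocess.R is not reproduced here, but the transformation itself is simple enough to sketch with awk (a hypothetical stand-in, on tab-separated input, with NA values passed through untouched):

```shell
# Binarize: negatives become -1, positives become 1, NA and 0 are left as-is.
printf '0.5\t-2.3\nNA\t0\n' | awk -F'\t' -v OFS='\t' '{
  for (i = 1; i <= NF; i++)
    if ($i != "NA") $i = ($i < 0 ? -1 : ($i > 0 ? 1 : $i))
  print
}'
# output:
# 1    -1
# NA   0
```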
After having written the preprocessing script, we must set it as a stage in our pipeline. You do this with the dvc run command. It’s here that you tell DVC what this stage needs (e.g. input files) and what it will output (e.g. processed data, results). When you have a pipeline with several stages, that’s how DVC knows whether it’s necessary to re-run all your stages or only the N-th stage, because it verified that nothing changed in the previous stages ;-).
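With the single-file stage syntax of the DVC releases this post targets (dvc run writes a .dvc stage file), registering the stage could look like this; the script and file names are my assumptions, adjust to yours:

```shell
dvc run -f preprocess.dvc \
        -d preprocess.R -d simulation.tsv \
        -o output_dir/preprocessed.tsv \
        Rscript preprocess.R simulation.tsv output_dir/preprocessed.tsv
```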
Let’s commit this.
After having preprocessed our original dataset, we would like to infer a graph with the MIIC package. MIIC is an information-theoretic method that learns a large class of causal or non-causal graphical models from purely observational data, while including the effects of unobserved latent variables, commonly found in many datasets. It can output a DAG, for example, so please do not confuse the graph MIIC infers with the DAG of the DVC pipeline.
How do I install an R package into this repository’s environment and not into the global environment (my whole system)? As long as you run R from within the repository (where you initialized packrat, remember?), you will be fine.
With MIIC installed, let’s write the script file for the second stage of our pipeline.
Now, we need to add miic.R to the pipeline.
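Again with dvc run, this time depending on the first stage’s output so DVC can chain the two stages into a DAG (stage file name, output name and script arguments are assumptions):

```shell
dvc run -f miic.dvc \
        -d miic.R -d output_dir/preprocessed.tsv \
        -o output_dir/retained.edges.csv \
        Rscript miic.R output_dir/preprocessed.tsv output_dir/retained.edges.csv
```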
And after that, we must commit, as suggested by DVC.
As a final step, we would like to know the edges contained in the inferred graph, that is, which nodes (variables) are associated with each other.
How does our pipeline look?
You can also see the same DAG with, instead of stage names, the commands run at each stage.
Or the output files generated at each stage, along with the files they were built from.
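In the DVC releases this post targets, these three views come from dvc pipeline show; the target is the .dvc file of your last stage (I use miic.dvc as the assumed name):

```shell
dvc pipeline show --ascii miic.dvc              # stage names
dvc pipeline show --ascii --commands miic.dvc   # commands run at each stage
dvc pipeline show --ascii --outs miic.dvc       # files produced at each stage
```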
Now let’s do the final part of our pipeline, which is to count the number of edges in the final graph. MIIC starts with a complete graph, that is, every node (variable) is connected by an edge to every other node. MIIC then looks for spurious associations and removes the edges that seem to be spurious. One of the interesting things MIIC can do, given some assumptions, is to show you that A doesn’t cause B even though A and B are correlated (you can read about it here).
Our goal in this tutorial is to count the number of edges that were retained at the end of the analysis. A silly task, but again, this is a basic tutorial on DVC.
Let’s write our last script for the last stage.
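The counting script itself is not shown in this post; as a hypothetical stand-in, the idea boils down to counting the data rows of the retained-edges table (one header line, one row per edge):

```shell
# Stand-in for the counting script: edges = lines minus the header.
printf 'node1\tnode2\nA\tB\nB\tC\n' > retained.edges.csv
echo $(($(wc -l < retained.edges.csv) - 1))   # 2 edges
```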
Let’s add it as a stage in our pipeline, setting the output file as a metrics file for DVC. By doing this, we let DVC know what information we want to compare between different experiments. After that, we commit the changes and check the metric.
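With the old dvc run syntax, marking an output as a metrics file is the -M flag, and dvc metrics show reads it back (stage, script and metric file names are my assumptions):

```shell
dvc run -f count.dvc \
        -d count.R -d output_dir/retained.edges.csv \
        -M output_dir/edge.count \
        Rscript count.R output_dir/retained.edges.csv output_dir/edge.count
git add . && git commit -m "Add edge-count stage with a DVC metric"
dvc metrics show
```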
This is our baseline experiment. Cool. Let’s make it clear to DVC.
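One plain-Git way to make the baseline easy to come back to is an annotated tag. Sketched below in a throwaway repository so it runs anywhere (the metric file content is a stand-in):

```shell
tmp=$(mktemp -d) && cd "$tmp" && git init -q
echo "edges: 42" > metrics.txt          # stand-in for the real metric file
git add metrics.txt
git -c user.name=demo -c user.email=demo@example.com \
    commit -qm "Baseline experiment: binarize only"
git -c user.name=demo -c user.email=demo@example.com \
    tag -a baseline -m "baseline metrics"
git tag        # prints: baseline
```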
Let’s start a new experiment
Let’s change our preprocessing. Before binarizing, let’s replace the missing values (NA) with the mean of the variable and then we binarize again.
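The updated preprocess.R is not shown; in shell terms (again a hypothetical awk stand-in), the change amounts to a first pass that collects per-column means ignoring NA, and a second pass that imputes and then binarizes:

```shell
printf '2\tNA\nNA\t-4\n4\t-2\n' > sim.tsv
awk -F'\t' -v OFS='\t' '
  NR == FNR {                       # pass 1: per-column sums and counts
    for (i = 1; i <= NF; i++) if ($i != "NA") { s[i] += $i; n[i]++ }
    next
  }
  {                                 # pass 2: impute NA with the mean, binarize
    for (i = 1; i <= NF; i++) {
      v = (($i == "NA") ? s[i] / n[i] : $i) + 0
      $i = (v < 0 ? -1 : (v > 0 ? 1 : v))
    }
    print
  }' sim.tsv sim.tsv
# output:
# 1   -1
# 1   -1
# 1   -1
```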
One thing you can do so that you don’t have to type the name of the last stage of your pipeline after dvc repro is to name the stage file Dvcfile (dvc run -f Dvcfile and so on). This is the file dvc repro looks for when it’s run without a stage name.
After everything is done, as we would do with git, we do with DVC. A push.
Traveling in time with data objects
If you run the command cat output_dir/retained.edges.csv, you will see what you expect, that is, the file generated by the new (last) experiment. All the data files in the local repository were updated accordingly. What if you wanted to see what this file looked like in the previous experiment? Seeing the old source code of preprocess.R is straightforward if you already know Git. You simply do:
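Something along these lines; the commit hash is whatever git log shows in your own repository:

```shell
git log --oneline                 # find the baseline commit (or use its tag)
git checkout <baseline-commit>    # source code now matches the old experiment
```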
But if you do a cat output_dir/retained.edges.csv, you still see the file generated by the latest version of your pipeline. How do you see the file as it looked at the commit you are on right now?
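That is what dvc checkout is for: it syncs the data files in your working directory with the DVC metadata of the commit you are on.

```shell
dvc checkout                          # restore the data files for this commit
cat output_dir/retained.edges.csv     # now shows the previous experiment's version
```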
The dvc checkout command is easy to remember because it mirrors its Git counterpart (git checkout).
Presenting your results
Now that you’re done with your analyses, it’s time to present them. You can do it through a web dashboard (Shiny?), a slide presentation (Xaringan?) or a report in PDF or HTML (RMarkdown?). The nice thing about R is that you can do all of this without leaving R: Shiny, Xaringan and so on are all R packages!
In this tutorial, I decided to present my analyses through RMarkdown in HTML. I usually start writing the report during the analyses, sometimes even before I start writing code; I may be writing about the dataset itself, the context and so on. In this tutorial, however, for the sake of simplicity, I left the report for when everything else was done.
You can download the template here and play with it. RStudio can knit it for you, and you should get an HTML report looking like the image below.
What about reproducing these results somewhere else? It could be on another machine of mine, on a colleague’s machine, on a server, anywhere. How would we do it?
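Putting the pieces together, bootstrapping the project on a fresh machine is short; the repository URL is a placeholder for your own:

```shell
git clone <your-repo-url> project && cd project
dvc pull                          # fetch the real data files from remote storage
Rscript -e 'packrat::restore()'   # recreate the isolated R library
dvc repro                         # re-run any stage that is missing or stale
```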
- You track your source code with Git
- You track your data files with DVC
- You track your pipeline with DVC
- You track your R dependencies with packrat
- You isolate your work environment with packrat
- You present your analyses results with Xaringan, RMarkdown, Shiny or whatever you want 🙂
Everything we did here is hosted in a GitHub repository. I would like to thank all my friends who read this text when it was still a draft, including Ivan Shcheklein, Elle O’Brien, and Jorge Orpinel from the DVC team.