A simple project tutorial with R/RMarkdown, Packrat, Git, and DVC.
The pain of managing a Data Science project
Something has been bothering me for a while: reproducibility and data tracking in data science projects. I had read about some technologies but never really tried any of them until recently, when I couldn’t stand the feeling of losing track of my analyses anymore. At some point, I decided to give DVC a try after some friends, mostly Flávio Clésio, suggested it to me. In this post, I will talk about Git, DVC, R, RMarkdown and Packrat, everything I think you may need to manage your data science project, but the focus is definitely on DVC.
Depending on your background, you may have already heard of spurious dependence in one way or another. It also goes by the name of spurious association, by the famous quote “correlation does not imply causation”, and by other versions of the same idea: you cannot say that X necessarily causes Y (or vice versa) solely because X and Y are associated, that is, because they tend to occur together. Even if one of the events always happens before the other, say X preceding Y, you still cannot say that X causes Y. There is a very famous statistical test in economics known as the Granger causality test.
The post hoc ergo propter hoc fallacy is also known as “after this, therefore because of this”. It’s pretty clear today that Granger causality is not an adequate tool to infer causal relationships. This is one of the reasons why, when X and Y are tested with the Granger causality test and an association is found, it’s said that X Granger-causes Y instead of saying that X causes Y. Maybe it’s not clear to you why the association between two variables, plus the notion that one always precedes the other, is not enough to say that one is causing the other. One explanation, in a hypothetical situation, would be a third lurking variable C, also known as a confounder, that causes both events, a phenomenon known as confounding. By ignoring the existence of C (which in some contexts happens by design and is a strong assumption called unconfoundedness), you fail to realize that the events X and Y are actually independent once this third variable C, the confounder, is taken into consideration. Since you ignored it, they seem dependent, associated. A very famous and straightforward example is the positive correlation between (a) ice cream sales and deaths by drowning or (b) ice cream sales and the homicide rate.
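To make the confounding idea concrete, here is a small simulation sketch in plain Python. Everything in it is made up for illustration: the confounder C is a hypothetical daily temperature, and both ice cream sales (X) and drownings (Y) are generated from temperature plus independent noise, with coefficients I picked arbitrarily. X never influences Y in this toy world, yet the two end up strongly correlated. Once the linear effect of temperature is removed from both, the remaining correlation should be close to zero.

```python
import random


def pearson(a, b):
    """Pearson correlation coefficient between two equally sized lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def residuals(target, predictor):
    """Residuals of the least-squares line target ~ predictor."""
    n = len(target)
    mean_t = sum(target) / n
    mean_p = sum(predictor) / n
    slope = sum(
        (p - mean_p) * (t - mean_t) for p, t in zip(predictor, target)
    ) / sum((p - mean_p) ** 2 for p in predictor)
    intercept = mean_t - slope * mean_p
    return [t - (intercept + slope * p) for p, t in zip(predictor, target)]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
n = 5000

# Hypothetical confounder C: daily temperature.
temperature = [rng.gauss(20, 8) for _ in range(n)]
# X and Y both depend on C, but not on each other (made-up coefficients).
ice_cream = [2.0 * c + rng.gauss(0, 5) for c in temperature]
drownings = [0.5 * c + rng.gauss(0, 2) for c in temperature]

# Ignoring C: X and Y look strongly associated.
raw_corr = pearson(ice_cream, drownings)

# Accounting for C: correlate what is left after removing its linear effect.
partial_corr = pearson(
    residuals(ice_cream, temperature),
    residuals(drownings, temperature),
)

print(f"correlation ignoring temperature:      {raw_corr:.3f}")
print(f"correlation controlling for temperature: {partial_corr:.3f}")
```

Regressing out the confounder like this only works here because the simulated relationships are linear and C is observed. In real data neither is guaranteed, which is exactly why unconfoundedness is such a strong assumption.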
When I first heard of scientific retreats at Institut Curie (Twitter here), I was surprised. But then I kept wondering how a scientific retreat would work out. I mean, it would inevitably turn into either a retreat or a scientific event. It would be either (1) a very pleasant experience to relax and get to know people, something like a vacation from work with work peers (which could turn into us talking about work, and then no vacation from work), or (2) a scientific event just like any other. The two things at the same time? Quite unlikely, I thought.
One way to evaluate your model is in terms of error types. Let’s consider a scenario where you live in a city where it rains every once in a while. If you guessed that it would rain this morning, but it did not, your guess was a false positive, sometimes abbreviated as FP. If you said it would not rain, but it did, then you had a false negative (FN). Rain when you do not have an umbrella may be annoying, but life is not always that bad. You could have predicted that it would rain and it did (true positive, TP), or predicted that it would not rain and it did not (true negative, TN). In this example, it’s easy to see that in some contexts one error may be worse than the other, and this will vary according to the problem. Bringing an umbrella with you on a day with no rain is not as bad as not bringing an umbrella on a rainy day, right?
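The four outcomes above can be counted with a few lines of Python. This is just a sketch: the forecasts and observations are invented, and the costs I attach to each error (1 for carrying an umbrella for nothing, 5 for getting soaked) are hypothetical numbers chosen only to show that the two errors need not weigh the same.

```python
from collections import Counter


def outcome(predicted_rain, actual_rain):
    """Classify one forecast as TP, FP, FN or TN."""
    if predicted_rain and actual_rain:
        return "TP"  # predicted rain, it rained
    if predicted_rain and not actual_rain:
        return "FP"  # predicted rain, it stayed dry
    if not predicted_rain and actual_rain:
        return "FN"  # predicted no rain, it rained
    return "TN"      # predicted no rain, it stayed dry


# Invented data: one entry per morning (True means rain).
forecasts = [True, True, False, False, True, False]
observed = [True, False, True, False, True, False]

counts = Counter(outcome(p, a) for p, a in zip(forecasts, observed))

# Hypothetical costs: a false negative (no umbrella in the rain) hurts
# more than a false positive (umbrella on a dry day).
cost_per_error = {"FP": 1, "FN": 5}
total_cost = sum(counts[kind] * cost for kind, cost in cost_per_error.items())

for kind in ("TP", "FP", "FN", "TN"):
    print(kind, counts[kind])
print("total cost of errors:", total_cost)
```

With different costs, the same counts could favor a very different forecasting strategy, which is the point of looking at error types separately instead of at a single accuracy number.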
The year was going just fine in my academic and professional life. I had obtained some nice results for my master’s and made some very interesting advances at work. In parallel, at the beginning of the year, I had applied for a PhD position, though I had little hope of being selected. Months passed due to the lengthy process, and I kept following what I had planned for the year. For a week in May, the institute funded all applicants who reached the last stage of the selection process to come to Paris for several interviews, among other activities. By this point, I was already happy, regardless of the result. The experience opened my eyes to several subjects that I today deem very important, and it also gave me an opportunity to meet some amazing people. I’m sorry if I’m missing some country, my memory is not the best, but if I recall correctly there were people from the United States, Chile, Uruguay, Portugal, Spain, Italy, Greece, Poland, Hungary, Estonia, India, Pakistan, Iran, Saudi Arabia, Thailand, China and Taiwan.