Best links of the week #63

Mobility and COVID-19 cases. Did Brazil stop?

Reading Time: 9 minutes
Illustration du nouveau coronavirus, Covid-19 – Mars 2020 / © UPI/MaxPPP

You have probably heard that Google has released a set of mobility reports recently. The site hosting these reports, the so-called COVID-19 Community Mobility Reports, begins with the following sentence: “See how your community is moving differently due to COVID19”.

What is it about?

Google offers a Location History feature in its services/systems that monitors the location, and consequently the displacement, of users. This data can be accessed and disabled at any time by users. According to Google, this feature needs to be activated voluntarily, as it is disabled by default. Based on this information, they observed how and where these individuals used to go in a period prior to the COVID-19 outbreak and how and where they are moving now, during the outbreak. There is a clear bias here. People who do not have a cell phone or tablet, or who have not activated this feature, are out of their sampling and this can impact the conclusions of the report. Still, it’s worth a look.


Manage your Data Science Project in R

Reading Time: 9 minutes

A simple project tutorial with R/RMarkdown, Packrat, Git, and DVC.

Source: Here.

The pain of managing a Data Science project

Something has been bothering me for a while: Reproducibility and data tracking in data science projects. I have read about some technologies but had never really tried any of them out until recently when I couldn’t stand this feeling of losing track of my analyses anymore. At some point, I decided to give DVC a try after some friends, mostly Flávio Clésio, suggested it to me. In this post, I will talk about Git, DVC, R, RMarkdown and Packrat, everything I think you may need to manage your Data Science project, but the focus is definitely on DVC.


Spurious Independence: is it real?

Reading Time: 14 minutes

First things first: Spurious Dependence

Depending on your background, you have already heard of spurious dependence in a way or another. It goes by the names of spurious association, spurious dependence, the famous quote “correlation does not imply causation” and also other versions based on the same idea that you can not say that X necessarily causes Y (or vice versa) solely because X and Y are associated, that is, because they tend to occur together. Even if one of the events always happens before the other, let’s say X preceding Y, still, you can not say that X causes Y. There is a statistical test very famous in economics known as Granger causality.

The Granger causality test is a statistical hypothesis test for determining whether one time series is useful in forecasting another, first proposed in 1969.[1] Ordinarily, regressions reflect “mere” correlations, but Clive Granger argued that causality in economics could be tested for by measuring the ability to predict the future values of a time series using prior values of another time series. Since the question of “true causality” is deeply philosophical, and because of the post hoc ergo propter hoc fallacy of assuming that one thing preceding another can be used as a proof of causation, econometricians assert that the Granger test finds only “predictive causality”.

Granger Causality at Wikipedia.

The post hoc ergo propter hoc fallacy is also known as “after this, therefore because of this”. It’s pretty clear today that Granger causality is not an adequate tool to infer causal relationships and this is one of the reasons that when X and Y are tested by the granger causality test, and an association is found, it’s said that X Granger-causes Y instead of saying that X causes Y. Maybe it’s not clear to you why the association between two variables and the notion that one always precedes the other is not enough to say that one is causing the other. One explanation for a hypothetical situation, for example, would be a third lurking variable C, also known as a confounder, that causes both events, a phenomenon known as confounding. By ignoring the existence of C (which in some contexts happens by design and is a strong assumption called unconfoundedness), you fail to realize that the events X and Y are actually independent when taking into consideration this third variable C, the confounder. Since you ignored it, they seem dependent, associated. A very famous and straight forward example is the positive correlation between (a) ice cream sales and death by drowning or (b) ice cream sales and homicide rate.


How can I evaluate my model? Part I.

Reading Time: 8 minutes
Source of image: here.

One way to evaluate your model is in terms of error types. Let’s consider a scenario where you live in a city where it rains every once in a while. If you guessed that it would rain this morning, but it did not, your guess was a false positive, sometimes abbreviated as FP. If you said it would not rain, but it did, then you had a false negative (FN). Raining when you do not have an umbrella may be annoying, but life is not always that bad. You could have predicted that it would rain and it did (true positive, TP) or predicted that it would not rain and it did not (true negative, TN). In this example, it’s easy to see that in some contexts one error may be worse than the other and this will vary according to the problem. Bringing an umbrella with you in a day with no rain is not as bad as not bringing an umbrella on a rainy day, right?


Best links of the week #15

Reading Time: 2 minutes

Best links of the week from 15th April to 21st April


  1. When it comes to clustering, depending on the algorithm used, one may have a hard time determining the appropriate k (number of clusters). Some algorithms do not require it, but for the ones that do, such as k-means, you should have a look at the elbow method to evaluate the appropriate k or at the silhouette of objects regarding the clusters.
  2. Dunder Data is a professional training company dedicated to teaching data science and machine learning. There is paid and free online material.
  3. Software Carpentry, teaching basic lab skills for research computing.
  4. ROpenSci, transforming science through open data and software.
  5. mlmaisleve, conceitos rápidos e leves sobre Machine Learning ?.
  6. kite, Code Faster in Python with Line-of-Code Completions.

The unintended trap in bracket subsetting in R

Reading Time: 3 minutes
The silent [and maybe mortal?] trap in bracket subsetting.

Dear reader,

It should be clear to you that, as several other programming languages, R provides different ways to tackle the same problem. One common problem in data analysis is to subset your data frame and, as Google can show you, there are several blog posts and articles trying to teach you different ways to subset your data frame in R. Let’s do a quick review here:

Before starting to subset a data frame, we must first create one. I will create a data frame of patients named var_example with two columns, one for vital status (is_alive) and one for birth year (birthyear). Birth year values are 4-digit numbers representing the year of birth. The is_alive column can have one of three values:

  • TRUE: The person is alive;
  • FALSE: The person is dead;
  • NA: We do not know if this person is either alive or dead.
> var_example <- cbind(, TRUE, FALSE),
                                          prob = c(0.1, 0.5, 0.4))),
> colnames(var_example) <- c("is_alive", "birthyear")