Manage your Data Science Project in R

Reading Time: 9 minutes

A simple project tutorial with R/RMarkdown, Packrat, Git, and DVC.

Source: Here.

The pain of managing a Data Science project

Something has been bothering me for a while: reproducibility and data tracking in data science projects. I had read about some technologies but had never really tried any of them until recently, when I could no longer stand the feeling of losing track of my analyses. At some point, I decided to give DVC a try after some friends, mostly Flávio Clésio, suggested it to me. In this post, I will talk about Git, DVC, R, RMarkdown, and Packrat, everything I think you may need to manage your data science project, but the focus is definitely on DVC.


Best links of the week #49

Best links of the week #47

Reading Time: 3 minutes

Best links of the week from 25th November to 1st December



  1. Researchers Have Successfully Tricked A.I. Into Seeing The Wrong Things at PopSci.
  2. Fooling the machine at PopSci.
  3. Why isn’t confounding a statistical concept? at Judea Pearl’s discussion with readers.
  4. The impossibility of asymmetric causation at Judea Pearl’s discussion with readers.
  5. d-SEPARATION WITHOUT TEARS at Judea Pearl’s discussion with readers. There is an interactive adaptation of it on dagitty’s website here.
  6. An Illustration of Pearl’s Simpson Machine at dagitty.
  7. Do you think you know DAG terminology? This game can help you test your skills. There is also another game here for testing your knowledge of covariate roles, and another one about the Table 2 Fallacy. All of this at dagitty.
  8. On causality and decision trees at Judea Pearl’s discussion with readers.
  9. On causality and decision trees (cont.) at Judea Pearl’s discussion with readers.
  10. Back-door criterion and epidemiology at Judea Pearl’s discussion with readers.
  11. Indirect Effects at Judea Pearl’s discussion with readers.
  12. The meaning of counterfactuals at Judea Pearl’s discussion with readers.
  13. Has causality been defined? at Judea Pearl’s discussion with readers.
  14. The tidyverse for Machine Learning presentation by Bruna Wundervald at satRday São Paulo.
  15. Centrality measures as a proxy for causal influence? at Fabian Dablander’s website.
  16. Garoto de 12 anos já trabalha como cientista de dados at Olhar Digital.
  17. CGU lança novo Painel Correição em Dados at CGU.


  1. Causality in Machine Learning 101 for Dummies like Me by Sangeet Moy Das at Towards Data Science.
  2. An introduction to Causal inference at Fabian Dablander’s Blog.
  3. Spurious correlations and random walks at Fabian Dablander’s Blog.
  4. Curve fitting and the Gaussian distribution at Fabian Dablander’s Blog.
  5. In Review: Ten Great Ideas About Chance at Fabian Dablander’s Blog.
  6. Using causal graphs to understand missingness and how to deal with it at Cookie Scientist.


  1. A network of science: 150 years of Nature papers at nature video’s YouTube channel.
  2. ViennaR Meetup March 2019 | Hadley Wickham Tidy Data at Quantargo’s YouTube channel.
  3. Causal Graphs by Julian Schüssler at MZES Methods Bites’s YouTube channel.

Positions available

  1. Lecturer/Senior Lecturer/Reader in Media & Data Science at the University of Glasgow.
  2. Ph.D. fellowship in Machine Learning for Robot Manipulation at Bosch.
  3. Fully Funded Ph.D. position in AI and Machine Learning for mental well being at Örebro University.
  4. Research Assistant in Computer Vision and Deep Learning at Edge Hill University.
  5. Tenure Track ML Teaching Professor Position at UCSD.
  6. Post-doctoral fellowship (Genomics) at Instituto Tecnológico Vale.
  7. Data Science Vice President at Big Cloud.
  8. Director of Data Science at Ideal Team Consulting.
  9. Gerente de Governança e Arquitetura de Dados at Wiz.
  10. Senior Business Intelligence Analyst at SumUp.
  11. Data Architect – Restaurant Product at iFood.
  12. Lead Data Engineer at QuintoAndar.
  13. Software Engineer at Google.
  14. Senior SQL Server/ETL Developer at Cognizant.
  15. Data Architect D2- Lunch DFN at iFood.
    The next opportunities (30+) are reserved for readers registered for the newsletter. By registering, you will also receive updates on new posts in the blog!

Best links of the week #16

Reading Time: 2 minutes

Best links of the week from 22nd April to 28th April

You can check this comic here


  1. Do more with R: drag-and-drop ggplot at InfoWorld.
  2. Apart from esquisse, the package mentioned in the link above, there is another one that allows you to drag-and-drop and plot your data: ggplotAssist.
  3. DreamRs is a French R consulting firm. On their website, they have made some Shiny apps built on real data publicly available, such as RATP traffic and a GitHub dashboard.
  4. VCs just invested $8 million into this startup that gave away its software for free because they noticed how much people loved it!
  5. Cheat sheets for several software tools and concepts related to Data Science at Asif Bhat’s GitHub.
  6. Data Science must-read articles, tutorials and useful links at Asif Bhat’s GitHub.
  7. Math required for Data Science at Asif Bhat’s GitHub.
  8. Quick overview of Statistics for Biologists (it’s useful for pretty much everybody, you don’t say no to an offer of knowledge :-).
  9. How can I show the intermediate steps of a long routine in R? at StackOverflow.
  10. ‘Friendly’ reviewers rate grant applications more highly at Nature.
  11. Calm down, everyone. Keeping dead pig cells alive is not ‘brain resuscitation’ at Los Angeles Times.
  12. Uber is publicly sharing some data!
  13. Need help on choosing the right visualization method? From data-to-viz can help you!
  14. IBM releases Diversity in Faces, a dataset with over 1 million annotated images to help fight bias at Turing Tribe.
  15. Até 2030, AI contribuirá em mais de US$ 15,7 trilhões para economia global at Computer World.
  16. A extraordinária cientista que estudou o cérebro de Einstein e revolucionou a neurociência moderna at Época Negócios.
  17. TerraBrasilis, an open-access public geographical data platform for environmental monitoring.

Best links of the week #15

Reading Time: 2 minutes

Best links of the week from 15th April to 21st April


  1. When it comes to clustering, depending on the algorithm used, one may have a hard time determining the appropriate k (number of clusters). Some algorithms do not require it, but for the ones that do, such as k-means, you should have a look at the elbow method to evaluate the appropriate k or at the silhouette of objects regarding the clusters.
  2. Dunder Data is a professional training company dedicated to teaching data science and machine learning. There is paid and free online material.
  3. Software Carpentry, teaching basic lab skills for research computing.
  4. ROpenSci, transforming science through open data and software.
  5. mlmaisleve, quick and light concepts about Machine Learning.
  6. kite, Code Faster in Python with Line-of-Code Completions.
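
As a rough illustration of the elbow method mentioned in item 1 above, here is a minimal sketch in base R. The iris data set, the range of k from 1 to 8, and the names features, k_values, wss, and elbow are my own choices for the example; they are not taken from the linked material.

```r
# Elbow method sketch with base R only (stats::kmeans).
# Assumption: iris is used purely as convenient built-in example data.
set.seed(42)                       # k-means starts from random centers
features <- scale(iris[, 1:4])     # standardize the four numeric columns

k_values <- 1:8                    # candidate numbers of clusters (assumed range)

# Total within-cluster sum of squares for each candidate k
wss <- sapply(k_values, function(k) {
  kmeans(features, centers = k, nstart = 20)$tot.withinss
})

elbow <- data.frame(k = k_values, wss = wss)
print(elbow)

# Uncomment to draw the curve and look for the "elbow", the point
# after which adding clusters stops reducing wss by much:
# plot(elbow$k, elbow$wss, type = "b",
#      xlab = "k", ylab = "Total within-cluster sum of squares")
```

The curve always goes down as k grows, so the idea is not to pick the minimum but the k where the drop flattens out. It is a visual heuristic, so it may be worth cross-checking with the silhouette approach also mentioned above.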

Best links of the week #14

Best links of the week #13

Best links of the week #8

Best links of the week #5

Reading Time: < 1 minute

Best links of the week from 4th February to 10th February.


  1. Como controlar o braço de outra pessoa com o poder da sua mente? at UOL.
  2. vidente is an R package I am currently writing to parse and analyze data from the Surveillance, Epidemiology and End Results (SEER) Program, which covers over 1/3 of the US population on cancer incidence and survival.
  3. Ciência de Dados com R is a book on Data Science using R at Instituto Brasileiro de Pesquisa e Análise de Dados.
  4. Data Science & Machine Learning Course at Ivanovitch Silva’s GitHub repository.
  5. A receita dos candidatos a deputado federal em 2018 at Nexo Jornal.
  6. AI 100: The Artificial Intelligence Startups Redefining Industries at CB Insights.
  7. The open-source and crowd sourced conference website.
  8. Ranking of IT conferences.

The unintended trap in bracket subsetting in R

Reading Time: 3 minutes
The silent [and maybe mortal?] trap in bracket subsetting.

Dear reader,

It should be clear to you that, like several other programming languages, R provides different ways to tackle the same problem. One common task in data analysis is subsetting a data frame and, as Google can show you, there are many blog posts and articles that try to teach you different ways to do it in R. Let’s do a quick review here:

Before starting to subset a data frame, we must first create one. I will create a data frame of patients named var_example with two columns, one for vital status (is_alive) and one for birth year (birthyear). Birth year values are 4-digit numbers representing the year of birth. The is_alive column can have one of three values:

  • TRUE: The person is alive;
  • FALSE: The person is dead;
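  • NA: We do not know whether this person is alive or dead.
> # Reconstructed call: the sample size (10) and the year range (1950:2000) are assumptions
> var_example <- cbind(data.frame(sample(c(NA, TRUE, FALSE), size = 10, replace = TRUE,
                                          prob = c(0.1, 0.5, 0.4))),
                       data.frame(sample(1950:2000, size = 10, replace = TRUE)))
> colnames(var_example) <- c("is_alive", "birthyear")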