Tag: causality

Causality, Data Science, R, tools, Uncategorized

Continuous Machine Learning – Part II

Reading time: 3 minutes

This is a 3-part series about Continuous Machine Learning. You can check Part I here and Part III here. This post continues the previous one, in which we started automating Data Science on GitHub with CML. Here we will use Docker to reduce the computation time of our GitHub Actions checks.

You can think of a Docker image as a snapshot of a project’s software environment that you can then set up on any other computer. When GitHub Actions is triggered, it loads your Docker image on its infrastructure and then runs your code. That’s why it’s quicker: with a Docker container that already has your dependencies installed, you don’t have to spend time setting them up again on the GitHub Actions runner every time it is triggered, which is what we did in the first part of this series.

Creating a Docker image

Image from “Build a Docker Image just like how you would configure a VM”.

Causality, Data Science, R, tools, Uncategorized

Continuous Machine Learning – Part I

Reading time: 9 minutes
Image by Taras Tymoshchuck from here.

This is a 3-part series about Continuous Machine Learning. You can check Part II here and Part III here.

What is it?

Continuous Machine Learning (CML) applies the ideas of Continuous Integration and Continuous Delivery (CI/CD), well-known practices in Software Engineering and DevOps, to Machine Learning and Data Science projects.

What is this post about?

I will cover a set of tools that can make your life as a Data Scientist much more interesting. We will use MIIC, a network inference algorithm, to infer the network of a famous dataset (alarm from bnlearn). We will then use (1) git to track our code, (2) DVC to track our dataset, outputs and pipeline, (3) GitHub as a git remote and (4) Google Drive as a DVC remote. I’ve written a tutorial on managing Data Science projects with DVC, so if you’re interested in it, open a tab here to check it later.

Causality, Data Science, PhD, R

Spurious Independence: is it real?

Reading time: 14 minutes

First things first: Spurious Dependence

Depending on your background, you have probably already heard of spurious dependence in one way or another. It goes by the names of spurious association and spurious dependence, by the famous quote “correlation does not imply causation”, and by other versions of the same idea: you cannot say that X necessarily causes Y (or vice versa) solely because X and Y are associated, that is, because they tend to occur together. Even if one of the events always happens before the other, let’s say X preceding Y, you still cannot say that X causes Y. There is a statistical test, very famous in economics, known as the Granger causality test.

The Granger causality test is a statistical hypothesis test for determining whether one time series is useful in forecasting another, first proposed in 1969.[1] Ordinarily, regressions reflect “mere” correlations, but Clive Granger argued that causality in economics could be tested for by measuring the ability to predict the future values of a time series using prior values of another time series. Since the question of “true causality” is deeply philosophical, and because of the post hoc ergo propter hoc fallacy of assuming that one thing preceding another can be used as a proof of causation, econometricians assert that the Granger test finds only “predictive causality”.

Granger Causality at Wikipedia.
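
To make the idea in the quote concrete, here is a minimal Python sketch of a Granger-style comparison. It is only an illustration, not the full test: it assumes a single lag, no intercept, and simulated data in which x really does help forecast y. The function names (`granger_f`, `simulate`) and all coefficients are made up for this example.

```python
import random


def rss_restricted(y):
    """RSS of y_t = a * y_{t-1} + e_t (least squares, no intercept)."""
    prev, cur = y[:-1], y[1:]
    a = sum(p * c for p, c in zip(prev, cur)) / sum(p * p for p in prev)
    return sum((c - a * p) ** 2 for p, c in zip(prev, cur))


def rss_unrestricted(y, x):
    """RSS of y_t = a * y_{t-1} + b * x_{t-1} + e_t, via the 2x2 normal equations."""
    y_prev, x_prev, cur = y[:-1], x[:-1], y[1:]
    syy = sum(p * p for p in y_prev)
    sxx = sum(q * q for q in x_prev)
    sxy = sum(p * q for p, q in zip(y_prev, x_prev))
    syc = sum(p * c for p, c in zip(y_prev, cur))
    sxc = sum(q * c for q, c in zip(x_prev, cur))
    det = syy * sxx - sxy * sxy
    a = (syc * sxx - sxc * sxy) / det
    b = (sxc * syy - syc * sxy) / det
    return sum((c - a * p - b * q) ** 2 for p, q, c in zip(y_prev, x_prev, cur))


def granger_f(y, x):
    """F statistic: how much does adding lagged x improve the forecast of y?"""
    n_obs = len(y) - 1
    rss_r = rss_restricted(y)
    rss_u = rss_unrestricted(y, x)
    return (rss_r - rss_u) / (rss_u / (n_obs - 2))


def simulate(n=2000, seed=1):
    """Hypothetical data: x is white noise, y depends on its own past and on past x."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [0.0]
    for t in range(1, n):
        y.append(0.5 * y[t - 1] + 0.8 * x[t - 1] + rng.gauss(0, 1))
    return x, y


x, y = simulate()
f_x_to_y = granger_f(y, x)  # does past x help predict y?
f_y_to_x = granger_f(x, y)  # does past y help predict x?
print(f"F (x -> y): {f_x_to_y:.1f}")
print(f"F (y -> x): {f_y_to_x:.1f}")
```

A large F statistic in one direction only says that past values of x improve the forecast of y, which is exactly the “predictive causality” the quote talks about. A real analysis would also choose the number of lags and compare the statistic against the F distribution.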

The post hoc ergo propter hoc fallacy is also known as “after this, therefore because of this”. It’s pretty clear today that Granger causality is not an adequate tool for inferring causal relationships. This is one of the reasons why, when X and Y are tested with the Granger causality test and an association is found, it’s said that X Granger-causes Y instead of X causes Y. Maybe it’s not clear to you why the association between two variables, together with the fact that one always precedes the other, is not enough to say that one causes the other. One explanation for such a situation would be a third, lurking variable C, also known as a confounder, that causes both events, a phenomenon known as confounding. By ignoring the existence of C (which in some contexts happens by design and amounts to a strong assumption called unconfoundedness), you fail to realize that the events X and Y are actually independent once you take this third variable C, the confounder, into consideration. Since you ignored it, they seem dependent, associated. A very famous and straightforward example is the positive correlation between (a) ice cream sales and deaths by drowning or (b) ice cream sales and the homicide rate.
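
A small simulation can make the confounding story tangible. The Python sketch below is an illustration under simple assumptions (linear effects, Gaussian noise, made-up function names and coefficients): C drives both X and Y, which never influence each other. The marginal correlation between X and Y comes out clearly positive, while the partial correlation given C stays close to zero.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(x, y, c):
    """Correlation between x and y after linearly adjusting both for c."""
    r_xy, r_xc, r_yc = pearson(x, y), pearson(x, c), pearson(y, c)
    return (r_xy - r_xc * r_yc) / math.sqrt((1 - r_xc ** 2) * (1 - r_yc ** 2))


def simulate(n=5000, seed=0):
    """Hypothetical data: C causes X and Y; X and Y do not affect each other."""
    rng = random.Random(seed)
    c = [rng.gauss(0, 1) for _ in range(n)]  # confounder, e.g. temperature
    x = [ci + rng.gauss(0, 1) for ci in c]   # e.g. ice cream sales
    y = [ci + rng.gauss(0, 1) for ci in c]   # e.g. deaths by drowning
    return x, y, c


x, y, c = simulate()
marginal = pearson(x, y)         # ignores the confounder
adjusted = partial_corr(x, y, c)  # takes the confounder into account
print(f"corr(X, Y)     = {marginal:.3f}")
print(f"corr(X, Y | C) = {adjusted:.3f}")
```

Partial correlation is used here only because the simulated relationships are linear; with other kinds of dependence, a different conditional independence measure would be needed.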


Best links of the week #47

Reading time: 3 minutes

Best links of the week from 25th November to 1st December



  1. Researchers Have Successfully Tricked A.I. Into Seeing The Wrong Things at PopSci.
  2. Fooling the machine at PopSci.
  3. Why isn’t confounding a statistical concept? at Judea Pearl’s discussion with readers.
  4. The impossibility of asymmetric causation at Judea Pearl’s discussion with readers.
  5. d-SEPARATION WITHOUT TEARS at Judea Pearl’s discussion with readers. There is an interactive adaptation from this at dagitty’s website here.
  6. An Illustration of Pearl’s Simpson Machine at dagitty.
  7. Do you think you know DAG terminology? This game can help you try your skills. There is also another game here for testing your knowledge on covariate roles and another one about Table 2 Fallacy. All this at dagitty.
  8. On causality and decision trees at Judea Pearl’s discussion with readers.
  9. On causality and decision trees (cont.) at Judea Pearl’s discussion with readers.
  10. Back-door criterion and epidemiology at Judea Pearl’s discussion with readers.
  11. Indirect Effects at Judea Pearl’s discussion with readers.
  12. The meaning of counterfactuals at Judea Pearl’s discussion with readers.
  13. Has causality been defined? at Judea Pearl’s discussion with readers.
  14. The tidyverse for Machine Learning presentation by Bruna Wundervald at satRday São Paulo.
  15. Centrality measures as a proxy for causal influence? at Fabian Dablander’s website.
  16. Garoto de 12 anos já trabalha como cientista de dados at Olhar Digital.
  17. CGU lança novo Painel Correição em Dados at CGU.


  1. Causality in Machine Learning 101 for Dummies like Me by Sangeet Moy Das at Towards Data Science.
  2. An introduction to Causal inference at Fabian Dablander’s Blog.
  3. Spurious correlations and random walks at Fabian Dablander’s Blog.
  4. Curve fitting and the Gaussian distribution at Fabian Dablander’s Blog.
  5. In Review: Ten Great Ideas About Chance at Fabian Dablander’s Blog.
  6. Using causal graphs to understand missingness and how to deal with it at Cookie Scientist.


  1. A network of science: 150 years of Nature papers at nature video’s YouTube channel.
  2. ViennaR Meetup March 2019 | Hadley Wickham Tidy Data at Quantargo’s YouTube channel.
  3. Causal Graphs by Julian Schüssler at MZES Methods Bites’s YouTube channel.

Positions available

  1. Lecturer/Senior Lecturer/Reader in Media & Data Science at the University of Glasgow.
  2. Ph.D. fellowship in Machine Learning for Robot Manipulation at Bosch.
  3. Fully Funded Ph.D. position in AI and Machine Learning for mental well being at Örebro University.
  4. Research Assistant in Computer Vision and Deep Learning at Edge Hill University.
  5. Tenure Track ML Teaching Professor Position at UCSD.
  6. Post-doctoral fellowship (Genomics) at Instituto Tecnológico Vale.
  7. Data Science Vice President at Big Cloud.
  8. Director of Data Science at Ideal Team Consulting.
  9. Gerente de Governança e Arquitetura de Dados at Wiz.
  10. Senior Business Intelligence Analyst at SumUp.
  11. Data Architect – Restaurant Product at iFood.
  12. Lead Data Engineer at QuintoAndar.
  13. Software Engineer at Google.
  14. Senior SQL Server/ETL Developer at Cognizant.
  15. Data Architect D2- Lunch DFN at iFood.
    The remaining opportunities (30+) are reserved for readers who have registered for the newsletter. Registered readers also receive updates on new blog posts!

Best links of the week #31

Reading time: 2 minutes

Best links of the week from 5th August to 11th August

Source: here.


  1. randomizr: R Package for randomized experiments.
  2. Bayes’ rule: Guide (course with several different levels).
  3. Extracting Brazilian schools census data with R at Fernando Barbalho’s gists.
  4. Download all data from DATASUS (several Brazilian health-related datasets) with R at Fernando Barbalho’s gists.

Best links of the week #29

Reading time: 3 minutes

Best links of the week from 22nd July to 28th July


  1. Listen to people all over the world pronouncing the names of countries and capitals.
  2. Write a letter to the future!
  3. A Personal Journey into Bayesian Networks by Judea Pearl.
  4. An innovative way to publish at Nature.
  5. Here’s What Fruits And Vegetables Looked Like Before We Domesticated Them at Science Alert.
  6. Regression Sensitivity Analysis: the Robustness Value and the partial R², a shiny app by Carlos Cinelli.
  7. Do you need to normalize your input data for Random Forests and Neural Networks? (More on Random Forests here) at Data Science (Stack Exchange).
  8. Cumulative Variable Importance for Random Forest (RF) Models at Rich Pauloo’s Gists.
  9. Contributing to the R ecosystem by Colin Fay at SpeakerDeck.
  10. Entrevista: Por que homeopatia é placebo – e não deve ser paga pelo SUS at Super Interessante.

Best links of the week #26

Reading time: 2 minutes

Best links of the week from 1st July to 7th July

Source: here.


  1. Open Source Guides
  2. Accuracy paradox at Wikipedia.
  3. Lucas critique, Goodhart’s law and Campbell’s law at Wikipedia.
  4. 40 Artificial Intelligence Interview Questions & Answers at Vipul Patel’s LinkedIn.
  5. State of AI Report 2019 at state.ai.
  6. Gramr add-in for RStudio at rOpenSci Labs’ GitHub repository.
  7. Chega a São Paulo a École 42, escola francesa que ensina programação sem cobrar nada at Época Negócios.
  8. Seleção de desafios para o Hackathon em Saúde 2019 at ICICT.
  9. Indicadores criminais divulgados oficialmente pela Secretaria de Segurança Pública e Defesa Social (SSPDS) do Ceará. Data!!!
  10. Drone com projetor consegue enganar IA de carro at Olhar Digital.
  11. Curso de Data Science em Português at sn3fu’s GitHub.

Best links of the week #18

Reading time: 3 minutes

Best links of the week from 6th May to 12th May

Source: xkcd.


  1. NextJournal, Seamless Data Science for Teams.
  2. An executive’s guide to AI.
  3. What should I use to serve R applications over the internet? at Brian Caffo’s YouTube channel. He talks about PlumbeR (PlumbeR book here).
  4. Will AI eat statistics? at Brian Caffo’s YouTube channel.
  5. A radical new neural network design could overcome big challenges in AI at MIT Technology Review.
  6. Urgent need for a government-led big data system, say industry experts at The Edge Markets.
  7. Top 10 Cities Across The Globe With The Highest Pay Packages For Data Scientists at Analytics India.
  8. What Nobody Tells You About Machine Learning at Forbes.
  9. How the data mining of failure could teach us the secrets of success at MIT Technology Review.
  10. How to hide from the AI surveillance state with a color printout at MIT Technology Review.
  11. Boosting (machine learning) at Wikipedia.
  12. Weak Learning, Boosting, and the AdaBoost algorithm at Jeremy Kun’s Blog.
  13. Weak vs. Strong Learning and the Adaboost Algorithm at Jenn Wortman Vaughan’s Website.
  14. What is a weak learner? at StackOverflow.
  15. AI está pronta para transformar radicalmente o desenvolvimento de software at CIO.
  16. O orçamento das universidades e institutos federais desde 2000 at NexoJornal.
  17. O governo contra as universidades, em dados e análises at NexoJornal.
  18. Existe alguma microevolução documentada nos humanos nos últimos duzentos anos? at Quora.

Best links of the week #17

Reading time: 2 minutes

Best links of the week from 29th April to 5th May

Source: Dilbert.


  1. This Will Be The Biggest Disruption In Higher Education at Forbes.
  2. Dead Facebook users could outnumber living ones within 50 years at MIT Technology Review.
  3. To Build Truly Intelligent Machines, Teach Them Cause and Effect at QuantaMagazine.
  4. The world’s largest listing of AI conferences, events and meetups, with the biggest collection of conference discount codes.
  5. 2nd International Summer School on Artificial Intelligence: From Deep Learning to Data Analytics.
  6. Microsoft launches a drag-and-drop machine learning tool at TechCrunch.
  7. Actively curated list of awesome BI tools at Jan Kyri’s GitHub.
  8. How much of human height is genetic and how much is due to nutrition at Scientific American.
  9. Announcing JupyterHub 1.0!
  10. I hate it that sometimes Jupyter notebooks don’t render properly (or take a long time to render) at GitHub. If you’ve faced similar situations, your solution is here!
  11. Cryptography That Can’t Be Hacked at QuantaMagazine.
  12. Hacker-Proof Code Confirmed at QuantaMagazine.
  13. “PUT DOWN THE DEEP LEARNING: When not to use neural networks (and what to do instead)”, a talk by Rachael Tatman. Code here.
  14. Socially-Stratified Validation for ML Fairness, another talk by Rachael Tatman.
  15. Google Books Ngram Viewer is a tool that displays a graph showing how phrases specified by you have occurred in a corpus of books through time.
  16. Why Generation Y Yuppies Are Unhappy at Wait But Why.
  17. Looking for data? You mean data? DATA? Yes, data, data and data!!
  18. Becoming a Data Scientist – Curriculum via Metromap at nirvacana.
  19. Demystifying Artificial Intelligence. What is Artificial Intelligence & explaining it from different dimensions at nirvacana.
  20. The weakening relationship between the Impact Factor and papers’ citations in the digital age at arXiv.org.
  21. Visualização GeoEspacial com R at Gabriel Sartori’s GitHub.

Best links of the week #14

Reading time: 2 minutes

Best links of the week from 8th April to 14th April

Source: Business Broadway.


  1. Many more images like the one above at Business Broadway.
  2. Websites with challenges and exercises at Gabriel Fonseca’s GitHub page.
  3. Support innovation in healthcare with Hacking Health! There are several chapters around the world, including several in Brazil and in France :-).
  4. What are some examples of “Correlation does not equal causation?” at Quora.
  5. Does no correlation imply no causality? at Cross Validated.
  6. PEARL VS RUBIN (GELMAN) at Dokyun Lee’s website.
  7. Virgilio, your new Mentor for Data Science E-Learning at Giacomo Ciarlini.
  8. A quick reference for data visualization.
  9. Dev Tube.
  10. Por que preciso de “Análise de Componentes Principais” ou PCA na mineração de dados? at Quora.
  11. Harvard lança 15 cursos gratuitos de Inteligência Artificial at Estagio Online.
  12. Os testes de Harvard selecionam seus genes at Deviante.
  13. A realidade biopsicossocial da violência at Deviante.