Category: Data Science

BestLinks, Causality, Data Science

Best links of the week #78

Reading time: < 1 minute

Best links of the week from 7th September to 13th September

Links

  1. What are the various scenarios where we get a negative R squared in a (linear) regression model? at Quora.

Blog posts

  1. Quem seguir para acompanhar Machine Learning e AI no Twitter? at Machina Economicus.
  2. Now Nubank’s Data Scientists have their own values at Nubank.
  3. Resumo de livro: “Mostly Harmless Econometrics”, cap 1 e 2 at Machina Economicus.
  4. Causal Inference Part I: Simpson’s Paradox at Data for Science’s Medium.
  5. Causal Inference Part II: Probability Theory at Data for Science’s Medium.
  6. Causal Inference Part III: Graphs at Data for Science’s Medium.
  7. Causal Inference Part IV: Structural Causal Models at Data for Science’s Medium.
  8. Causal Inference Part V: Chains, and Forks at Data for Science’s Medium.
  9. Causal Inference Part VI: Colliders at Data for Science’s Medium.
  10. Causal Inference Part VII: d-separation at Data for Science’s Medium.
  11. Computational Causal Inference at Netflix at Netflix TechBlog.

Videos

  1. Método Científico – Grandes Pensadores at A Ciência da Estatística’s YouTube channel.
BestLinks, Causality, Data Science

Best links of the week #77

Reading time: 2 minutes

Best links of the week from 31st August to 6th September

Links

  1. Guess the Correlation Game.
  2. dagitty: Graphical Analysis of Structural Causal Models (R package) at CRAN.
  3. What Is The Difference Between Nonlinear Mechanics And Chaos Theory? at Forbes.

Blog posts

  1. Probability concepts explained: Marginalisation at Towards Data Science.
  2. When the Fundamental Problem of Causal Inference Ain’t No Problem at Brady Neal’s Blog.
  3. Microsoft’s DoWhy is a Cool Framework for Causal Inference by Jesus Rodriguez at Data Series.
  4. 4 Reasons Why Data Scientists Should Version Data by Fabiana Clemente at The Startup’s Medium.
  5. ‘Looping’ and ‘Branching’ with Pipes at David Ranzolin’s Blog.
  6. Bayes’ Theorem is Actually an Intuitive Fraction by Andre Ye at Towards Data Science.
  7. Using Xkcd to Make Density Plots at Arthur Rocha’s Blog.
  8. Solving the Birthday Paradox at Arthur Rocha’s Blog.
  9. Como analisar a eficiência dos seus algoritmos at Turing Talks’ Medium.
  10. Análise de Algoritmos de Machine Learning at Turing Talks’ Medium.
  11. Modelos de Predição | Introdução à Predição at Turing Talks’ Medium.
  12. Modelos de Predição | Regressão Linear at Turing Talks’ Medium.
  13. Modelos de Predição | Decision Tree at Turing Talks’ Medium.
  14. Modelos de Predição | Random Forest at Turing Talks’ Medium.
  15. An Incredible Interactive Chart of Biblical Contradictions at Friendly Atheist.
  16. Tensorbook, a deep learning laptop at Data Science Rush.

Videos

  1. This is why you’re learning differential equations at Zach Star’s YouTube channel.
  2. Why people fall for misinformation by Joseph Isaac at TED-Ed.
  3. How statistics can be misleading by Mark Liddell at TED-Ed.
  4. QUE REINE O CAOS, MAS O QUE É O CAOS? ft. Fernando Lenarduzzi at Matemaníaca’s YouTube channel.
  5. Alguns pensadores que ajudaram a desenvolver a estatística at A Ciência da Estatística’s YouTube channel.
BestLinks, Causality, Data Science, R

Best links of the week #76

Reading time: 2 minutes

Best links of the week from 24th August to 30th August

Links

  1. Difference-in-Difference Estimation at Columbia Public Health.
  2. Cluster Analysis Using K-Means at Columbia Public Health.
  3. Discrete Choice Analysis at Columbia Public Health.
  4. Exploratory Factor Analysis at Columbia Public Health.
  5. Instrumental Variables at Columbia Public Health.
  6. Spline Regression at Columbia Public Health.
  7. Principal Components Analysis at Columbia Public Health.
  8. Propensity Score at Columbia Public Health.
  9. Inverse Probability Weighting at Columbia Public Health.
  10. Path Analysis at Columbia Public Health.
  11. Probabilistic Sensitivity Analysis of Misclassification at Columbia Public Health.
  12. Markov Chain Monte Carlo at Columbia Public Health.
  13. Raio X dos Municípios.
  14. Correlation or causation? Mathematics can finally give us an answer at NewScientist.
  15. A new kind of logic: How to upgrade the way we think at NewScientist.
  16. An Unpredictable Universe: A Deep Dive Into Chaos Theory at Space.
  17. Introduction to Causal Inference Course by Brady Neal.

Blog posts

  1. Data versus Science: Contesting the Soul of Data-Science at Causal Analysis in Theory and Practice.
  2. Race, COVID Mortality, and Simpson’s Paradox at Causal Analysis in Theory and Practice.
  3. What Statisticians Want to Know about Causal Inference and The Book of Why at Causal Analysis in Theory and Practice.

Videos

  1. The Science Behind the Butterfly Effect at Veritasium.
BestLinks, Causality, Data Science, R, tools

Best links of the week #75

Reading time: 2 minutes

Best links of the week from 18th August to 23rd August

Links

  1. Facebook cria IA que acelera em 4 vezes exames de ressonância magnética at Olhar Digital.
  2. Quasi experiment at Wikipedia.
  3. Internal validity at Wikipedia.
  4. External validity at Wikipedia.
  5. cmx.js – a library for authoring xkcd-style comixes.
  6. Create your own xkcd-style comics using HTML markup.

Blog posts

  1. GPT-3 Has No Idea What It Is Saying by Steve Shwartz at Towards Data Science.
  2. Mathematics: An essential skill for aspiring data science professional! at Mark Taylor’s Medium.
  3. tidyverse-tips at Olivier Gimenez’s Blog.

Videos

  1. Causal Inference – Part I by Susan Athey at SMILES’s YouTube channel.
  2. Causal Inference – Part II by Susan Athey at SMILES’s YouTube channel.
  3. Causality by Kun Zhang at SMILES’s YouTube channel.
  4. Learning Theory by Ruth Urner at SMILES’s YouTube channel.
  5. Technical obstacles to ML implementation in healthcare by Anna Goldenberg at SMILES’s YouTube channel.
  6. Causality and Increasing Model Reliability by Suchi Saria at SMILES’s YouTube channel.
  7. Quasi-Experimental Designs at Onderzoeksmethoden UvA’s YouTube channel.
  8. Regression Discontinuity Analysis at Onderzoeksmethoden UvA’s YouTube channel.
  9. Identification Strategies, Part 1: How Economists Establish Causality at Ashley Hodgson’s YouTube channel.
  10. Identification, Part 2: Regression Discontinuity at Ashley Hodgson’s YouTube channel.
  11. Identification, Part 3: Instrumental Variables at Ashley Hodgson’s YouTube channel.
  12. Identification, Part 4: Differences-in-differences / Natural Experiment at Ashley Hodgson’s YouTube channel.
  13. An intuitive introduction to Regression Discontinuity at Doug McKee’s YouTube channel.
  14. Regression Discontinuity: Looking at People on the Edge (Causal Inference Bootcamp) at Mod•U: Powerful Concepts in Social Science’s YouTube channel.
  15. Noncompliance in Experiments (Causal Inference Bootcamp) at Mod•U: Powerful Concepts in Social Science’s YouTube channel.
  16. Discrete Choice Analysis (Causal Inference Bootcamp) at Mod•U: Powerful Concepts in Social Science’s YouTube channel.
  17. Confounders in Discrete Choice Analysis (Causal Inference Bootcamp) at Mod•U: Powerful Concepts in Social Science’s YouTube channel.
  18. An intuitive introduction to Difference-in-Differences at Doug McKee’s YouTube channel.
  19. Types of Experimental Designs at Simple Learning Pro’s YouTube channel.
  20. The Effects of Outliers and Extrapolation on Regression (2.4) at Simple Learning Pro’s YouTube channel.
  21. Density Curves and their Properties (5.1) at Simple Learning Pro’s YouTube channel.
  22. MLOps Tutorial #4: GitHub Actions with your own GPUs at DVCorg’s YouTube channel.
Causality, Data Science, R, tools, Uncategorized

Continuous Machine Learning – Part II

Reading time: 3 minutes

This is a 3-part series about Continuous Machine Learning. You can check Part I here and Part III here. This post continues the previous one, in which we started automating Data Science in GitHub with CML. Here we will use Docker to reduce the computation time of our GitHub Actions checks.

You can think of a Docker image as a snapshot of a project’s software environment that can be set up on any other computer. When GitHub Actions is triggered, it loads your Docker image on its infrastructure and then runs your code. This is quicker because a Docker container already has your dependencies installed, so you don’t have to spend time setting them up again on your GitHub Actions runner every time it is triggered, which is what we did in the first part of this series.

Creating a Docker image

Image from “Build a Docker Image just like how you would configure a VM”.
Causality, Data Science, R, tools, Uncategorized

Continuous Machine Learning – Part I

Reading time: 9 minutes
Image by Taras Tymoshchuck from here.

This is a 3-part series about Continuous Machine Learning. You can check Part II here and Part III here.

What is it?

Continuous Machine Learning (CML) follows the same concept as Continuous Integration and Continuous Delivery (CI/CD), well-known concepts in Software Engineering / DevOps, but applied to Machine Learning and Data Science projects.

What is this post about?

I will cover a set of tools that can make your life as a Data Scientist much more interesting. We will use MIIC, a network inference algorithm, to infer the network of a famous dataset (alarm from bnlearn). We will then use (1) git to track our code, (2) DVC to track our dataset, outputs and pipeline, (3) GitHub as a git remote and (4) Google Drive as a DVC remote. I’ve written a tutorial on managing Data Science projects with DVC, so if you’re interested in it, open a tab here to check it later.

BestLinks, Data Science, R

Best links of the week #63

Reading time: 2 minutes

Best links of the week from 30th March to 5th April

Links

  1. MonitoraCovid-19 at Big Data Fiocruz.
  2. Livros Gratuitos da Springer at Marcus Nunes’ Blog.
  3. Painel de Leitos e Insumos dos estados brasileiros.
  4. See how your community is moving around differently due to COVID-19.
  5. Data extraction of Google’s COVID-19 Mobility Reports at vitorbaptista’s GitHub.
Data Science, R

Mobility and COVID-19 cases. Did Brazil stop?

Reading time: 9 minutes
Illustration of the new coronavirus, Covid-19 – March 2020 / © UPI/MaxPPP

You have probably heard that Google recently released a set of mobility reports. The site hosting these reports, the so-called COVID-19 Community Mobility Reports, begins with the following sentence: “See how your community is moving around differently due to COVID-19”.

What is it about?

Google offers a Location History feature in its services/systems that monitors the location, and consequently the movement, of users. Users can access this data and disable the feature at any time. According to Google, the feature is disabled by default and has to be activated voluntarily. Based on this information, they observed how and where these individuals used to go in a period prior to the COVID-19 outbreak and how and where they are moving now, during the outbreak. There is a clear bias here: people who do not have a cell phone or tablet, or who have not activated this feature, are left out of the sample, and this can affect the conclusions of the report. Still, it’s worth a look.

Data Science, PhD, R

Manage your Data Science Project in R

Reading time: 9 minutes

A simple project tutorial with R/RMarkdown, Packrat, Git, and DVC.

Source: Here.

The pain of managing a Data Science project

Something has been bothering me for a while: reproducibility and data tracking in data science projects. I had read about some technologies but never really tried any of them until recently, when I couldn’t stand the feeling of losing track of my analyses anymore. At some point, I decided to give DVC a try after some friends, mostly Flávio Clésio, suggested it to me. In this post, I will talk about Git, DVC, R, RMarkdown and Packrat, everything I think you may need to manage your Data Science project, but the focus is definitely on DVC.

Causality, Data Science, PhD, R

Spurious Independence: is it real?

Reading time: 14 minutes

First things first: Spurious Dependence

Depending on your background, you have already heard of spurious dependence in one way or another. It goes by the names of spurious association, spurious dependence, the famous quote “correlation does not imply causation”, and other versions of the same idea: you cannot say that X necessarily causes Y (or vice versa) solely because X and Y are associated, that is, because they tend to occur together. Even if one of the events always happens before the other, let’s say X preceding Y, you still cannot say that X causes Y. There is a very famous statistical test in economics known as Granger causality.

The Granger causality test is a statistical hypothesis test for determining whether one time series is useful in forecasting another, first proposed in 1969.[1] Ordinarily, regressions reflect “mere” correlations, but Clive Granger argued that causality in economics could be tested for by measuring the ability to predict the future values of a time series using prior values of another time series. Since the question of “true causality” is deeply philosophical, and because of the post hoc ergo propter hoc fallacy of assuming that one thing preceding another can be used as a proof of causation, econometricians assert that the Granger test finds only “predictive causality”.

Granger Causality at Wikipedia.
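
To make the quoted idea concrete, here is a minimal Python sketch of the comparison behind the test, under simplifying assumptions: a single lag, simulated data, and only the F statistic (no p-value). The function names (`ols_rss`, `granger_f`, `simulate_series`) and the coefficients are hypothetical and chosen purely for illustration; for a real analysis, an established statistics library would be a better choice.

```python
import random


def ols_rss(rows, targets):
    """Residual sum of squares of a least-squares fit (normal equations)."""
    k = len(rows[0])
    # Augmented system [X'X | X'y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for i in range(k):
            if i != col:
                factor = aug[i][col] / aug[col][col]
                aug[i] = [a - factor * b for a, b in zip(aug[i], aug[col])]
    beta = [aug[i][k] / aug[i][i] for i in range(k)]
    return sum(
        (t - sum(b * v for b, v in zip(beta, r))) ** 2
        for r, t in zip(rows, targets)
    )


def granger_f(x, y, lag=1):
    """F statistic for the null hypothesis 'past x does not help predict y'."""
    times = range(lag, len(y))
    targets = [y[t] for t in times]
    # Restricted model: intercept plus past values of y only.
    restricted = [[1.0] + [y[t - k] for k in range(1, lag + 1)] for t in times]
    # Unrestricted model: additionally uses past values of x.
    unrestricted = [
        row + [x[t - k] for k in range(1, lag + 1)]
        for row, t in zip(restricted, times)
    ]
    rss_r = ols_rss(restricted, targets)
    rss_u = ols_rss(unrestricted, targets)
    dof = len(targets) - (2 * lag + 1)
    return ((rss_r - rss_u) / lag) / (rss_u / dof)


def simulate_series(n=500, seed=0):
    """Toy data (made-up coefficients): y depends on its own past and on past x."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [0.0]
    for t in range(1, n):
        y.append(0.5 * y[t - 1] + 0.8 * x[t - 1] + rng.gauss(0, 1))
    return x, y


x, y = simulate_series()
print("F statistic, x -> y:", granger_f(x, y))
print("F statistic, y -> x:", granger_f(y, x))
```

In this simulation x drives y by construction, so a much larger statistic is expected for x → y than for y → x. With real data, a large statistic only says that past values of x improve the forecast of y; as the quote stresses, that is “predictive causality”, not proof of a causal effect.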

The post hoc ergo propter hoc fallacy is also known as “after this, therefore because of this”. It’s pretty clear today that Granger causality is not an adequate tool to infer causal relationships, and this is one of the reasons why, when X and Y are tested with the Granger causality test and an association is found, it’s said that X Granger-causes Y instead of saying that X causes Y. Maybe it’s not clear to you why the association between two variables, together with the notion that one always precedes the other, is not enough to say that one is causing the other. One explanation for a hypothetical situation, for example, would be a third lurking variable C, also known as a confounder, that causes both events, a phenomenon known as confounding. By ignoring the existence of C (which in some contexts happens by design and is a strong assumption called unconfoundedness), you fail to realize that the events X and Y are actually independent when this third variable C, the confounder, is taken into consideration. Since you ignored it, they seem dependent, associated. A very famous and straightforward example is the positive correlation between (a) ice cream sales and deaths by drowning or (b) ice cream sales and homicide rate.
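
The confounding story above can be checked with a small simulation. The sketch below is only an illustration under assumptions that are not in the text: the variables are linear and Gaussian, the coefficients are made up, and the helper names (`pearson`, `residuals`, `simulate_confounding`) are hypothetical. In this setup C causes both X and Y, while X and Y never influence each other; a partial correlation near zero then corresponds to X and Y being independent given C.

```python
import random
from statistics import fmean


def pearson(a, b):
    """Sample Pearson correlation between two equally long sequences."""
    ma, mb = fmean(a), fmean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


def residuals(values, control):
    """What is left of `values` after removing its linear dependence on `control`."""
    mv, mc = fmean(values), fmean(control)
    slope = sum((c - mc) * (v - mv) for c, v in zip(control, values)) / sum(
        (c - mc) ** 2 for c in control
    )
    return [(v - mv) - slope * (c - mc) for c, v in zip(control, values)]


def simulate_confounding(n=5000, seed=42):
    """C causes X and C causes Y; there is no arrow between X and Y."""
    rng = random.Random(seed)
    c = [rng.gauss(0, 1) for _ in range(n)]       # confounder
    x = [2.0 * ci + rng.gauss(0, 1) for ci in c]  # made-up effect of C on X
    y = [1.5 * ci + rng.gauss(0, 1) for ci in c]  # made-up effect of C on Y
    return c, x, y


c, x, y = simulate_confounding()
marginal = pearson(x, y)                             # ignores C
partial = pearson(residuals(x, c), residuals(y, c))  # adjusts for C
print("corr(X, Y) ignoring C:", round(marginal, 3))
print("corr(X, Y) given C:   ", round(partial, 3))
```

Ignoring C, X and Y look strongly associated; once the linear effect of C is removed from both, the association should essentially vanish, which is the pattern described for the ice cream examples.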