Continuous Machine Learning

Reading Time: 11 minutes
Image by Taras Tymoshchuck.

What is it?

Continuous Machine Learning (CML) applies the well-known Software Engineering / DevOps concepts of Continuous Integration and Continuous Delivery (CI/CD) to Machine Learning and Data Science projects.

What is this post about?

I will cover a set of tools that can make your life as a Data Scientist much more interesting. We will use MIIC, a network inference algorithm, to infer the network of a famous dataset (alarm from bnlearn). We will then use (1) git to track our code, (2) DVC to track our dataset, outputs and pipeline, (3) GitHub as a git remote and (4) Google Drive as a DVC remote. I’ve written a tutorial on managing Data Science projects with DVC, so if you’re interested in it, open it here in a tab to check later.

The first thing is that I don’t really like having to go to the GitHub website all the time, so I will also introduce you to gh, GitHub’s official command-line application. We will also use CML, an open-source library for implementing continuous integration & delivery (CI/CD) in machine learning projects, which will link git, DVC and GitHub Actions. The idea is that every time you do something in your repository, some actions will be triggered and executed by GitHub Actions on their computing infrastructure. One example is using branches as experiments in your ML project, such as several runs of the same algorithm with different parameters. Every time you commit a parameter change and push, a report is generated to make it easier (and prettier) for you to compare the results across parameters.

Time to start.

Let’s create our repository on GitHub and make a local copy of it. From the command line! (Instructions here to install gh).

mkdir $HOME/dev
cd $HOME/dev
gh repo create dvc-miic-cml -d 'GitHub repo to play with DVC, MIIC and CML' --public

You will be asked if you want to create a local copy of this repository. If you say no, you will have to clone the repository later, so reply Y and press enter. After that, enter the directory.

cd dvc-miic-cml

Let’s create a README file so that we can describe the purpose of the repository. In GitHub, this is usually a file named README.md written in Markdown format.

echo '# DVC-MIIC-CML' > README.md
echo 'This is a sample repository for testing DVC, MIIC and CML in GitHub.' >> README.md
echo 'The analyses will be performed using [MIIC](https://github.com/miicTeam/miic_R_package) to infer the network from the [alarm dataset](https://rdrr.io/cran/bnlearn/man/alarm.html).' >> README.md

We will also download a license file (GNU GPLv3) from the GNU website and have it named as LICENSE, as it is commonly done in GitHub.

wget -c https://www.gnu.org/licenses/gpl-3.0.txt -O LICENSE

We will then add our two new files to the index of our git repository with the git add command and commit, that is, save a snapshot of our local repository. Afterwards, we will push our modifications to GitHub to make sure anyone can see the most up-to-date version of our repository. Since this is the first time we’re pushing, we need to tell git which branch is the default one to push to. We do that with the --set-upstream parameter. In the future, when we want to push to the default branch, we can just type git push.

# The dot when provided to git add means everything in the current folder
git add .
git commit -m 'Initial commit'
git push --set-upstream origin master

You can check your repository at GitHub now. It should be updated! Mine is. Git is not supposed to track data, output files (metrics files, plots, reports) or pipelines. That’s where DVC fits in. Let’s start by tracking the alarm dataset. You can download it from MIIC’s official website by clicking here. In the command line, that’s what we would do:

wget -c https://miic.curie.fr/datasets/alarm1000samples.txt -O alarm.tsv

I don’t really like when git repositories are just a bunch of files thrown at the root folder, so let’s make it a bit more organized.

mkdir data
mv alarm.tsv data/

DVC enters the scene

If you type git status, you will see that the folder is untracked, which is a bit annoying since (a) git is not supposed to track data and (b) you do not want it to either. One of the things DVC does, after being told to track files, is to tell git to ignore them. After all, DVC will be taking care of them! Before telling DVC what to track, though, you must tell DVC you want it to work in this repository. Just as you would initialize a new git repository without the help of gh (git init), you initialize DVC. Let’s do this and then tell DVC to track our dataset. Instructions on installing DVC can be found here.

dvc init
dvc add data/alarm.tsv

Some files will be created by DVC. These metadata files must be tracked by git, so let’s just add everything new to the index and commit it. Our dataset won’t be added because DVC added it to .gitignore, a hidden file git uses for exactly this purpose: knowing what to ignore.

git add .
git commit -m 'Initiates DVC and asks it to track alarm dataset'
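If you’re curious what DVC wrote there, peek at data/.gitignore. On my setup it contains a single entry for the tracked file (the exact content may vary with your DVC version):

```
/alarm.tsv
```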

Just to make sure the git workflow is clear to you: you don’t have to push every time you commit. We won’t push now, for example (though we could). Just like GitHub is a git remote, you can also have a DVC remote. It can be Dropbox, an Amazon S3 bucket, Google Drive, or even a folder on your computer or external disk. For simplicity, let’s use Google Drive here.

I went to the Google Drive website, logged in with my account, and created a new folder named dvc-miic-cml at the root of my drive. The URL is https://drive.google.com/drive/u/1/folders/188CmpQIYqKOgvcgaLZOxz1GqlwTasv8c

What you need here is the last part after the folders/, that is, 188CmpQIYqKOgvcgaLZOxz1GqlwTasv8c. Let’s set this as our DVC remote now with the following command:

dvc remote add -d myremote gdrive://188CmpQIYqKOgvcgaLZOxz1GqlwTasv8c

The -d parameter tells DVC that this is your default remote. Otherwise, it will ask which remote to use whenever you run a command that interacts with a remote. We used git push to push to git. Can you guess what command we should use to push to our DVC remote at Google Drive? I’m sure you guessed it right!

dvc push

If you check your folder in Google Drive you will see it is no longer empty. You can’t really understand what’s there, but take my word for it: DVC knows how to interpret it 😛 . Out of habit, you type git status and realize something changed in your repository. Wait, what!? By adding a default remote, the DVC configuration file was changed. You could git add the folder and git commit it, but for didactic reasons I will do something else: I will amend it to the last commit and, by doing so, update the commit message. Amending is useful when you committed but forgot to include something, or you decided your last commit message wasn’t that good. So you change your last commit instead of creating a new one!

git add .
git commit --amend -m 'Initiates DVC, sets the default remote and asks it to track alarm dataset'
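For reference, the change git noticed lives in .dvc/config, which now records the default remote. Mine looks roughly like the snippet below (the folder ID is the one from the example above; the exact layout may differ across DVC versions):

```
['remote "myremote"']
    url = gdrive://188CmpQIYqKOgvcgaLZOxz1GqlwTasv8c
[core]
    remote = myremote
```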

Ok, let’s push now.

git push

The GitHub page for your repository should look like this. You may wonder why there are two files in your data folder, since I told you git won’t be used to track data. One of them is .gitignore, which makes sure git won’t nag you about the dataset file being untracked, when it actually is tracked [by DVC]. The .dvc file is a metadata file used by DVC, and it contains a hash computed from the content of the dataset. That’s how DVC knows whether the dataset changed: if the content changes, so does the hash.
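You can see this hash for yourself by opening data/alarm.tsv.dvc in a text editor. It is a tiny YAML file along these lines (the hash below is illustrative, not the real one):

```yaml
outs:
- md5: d3b07384d113edec49eaa6238ad5ff00
  path: alarm.tsv
```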

DVC Note

If someone is interested in this repository (maybe you are), they would initially do just like any other GitHub repository: They would clone it!

git clone https://github.com/mribeirodantas/dvc-miic-cml.git
cd dvc-miic-cml

By checking the data folder with ls data, you will realize the dataset is not there. Well, of course it is not there, right? You only cloned the git repository. Let’s use dvc pull to pull what DVC is tracking for this repository.

dvc pull

Now it’s there 🙂 . Let’s start writing our network inference script. We will use MIIC (Multivariate information-based inductive causation) for that. Create a file named infer_network.R with the content below:

library(miic)

# Read the alarm dataset into R
alarm_dataset <- read.table('data/alarm.tsv', header = TRUE)

# Infer the network with MIIC
res <- miic(input_data = alarm_dataset)

# Ratio of retained edges (type == 'P') over all candidate edges
total_edges <- nrow(res$all.edges.summary)
retained_edges <- nrow(res$all.edges.summary[res$all.edges.summary$type == 'P', ])
ratio_edges <- paste0('Ratio of retained edges: ', retained_edges/total_edges)
write.table(ratio_edges, file = 'metrics.txt', col.names = FALSE, row.names = FALSE)

This code loads the miic R package, reads the dataset into the R environment, runs miic to infer the network, and calculates the ratio of retained edges over the total number of candidate edges. The ratio is then saved to a file named metrics.txt.

GitHub Actions

Now it’s time to start playing with GitHub Actions to make CML work for us. Every time we push a new commit to the repository, the model will be rebuilt and our metrics recalculated.

In order to use GitHub Actions, we need to create a special file in a special folder. The path from within your git repository is: .github/workflows

Inside the folder, you have to create your GitHub Action file. The name is not important, but it must be a file in YAML format. Let’s create a file named cml.yaml inside the path mentioned above.

mkdir -p .github/workflows
cd .github/workflows

Then, create a file named cml.yaml and put the code below inside it. It asks for a machine running the latest version of Ubuntu, sets up an R environment, checks out the current git repository, installs MIIC, DVC and their dependencies, runs dvc pull to fetch our dataset, calls the infer_network.R script (which saves the metrics to a file at the end), and then prints the metrics.

name: dvc-cml-miic
on: [push]
jobs:
  run:
    runs-on: [ubuntu-latest]
    steps:
      - uses: r-lib/actions/setup-r@master
        with:
          version: '3.6.1'
      - uses: actions/checkout@v2
      - name: cml_run
        env:
          repo_token: ${{ secrets.GITHUB_TOKEN }}
          GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}
        run: |

          # Install miic and dependencies
          wget -c https://github.com/miicTeam/miic_R_package/archive/v1.4.2.tar.gz
          tar -xvzf v1.4.2.tar.gz
          cd miic_R_package-1.4.2
          R --silent -e "install.packages(c(\"igraph\", \"ppcor\", \"scales\", \"Rcpp\"))"
          R CMD INSTALL . --preclean
          cd ..
          # Install Python packages
          pip install --upgrade pip
          pip install wheel
          pip install PyDrive2==1.6.0 --use-feature=2020-resolver
          # Install DVC
          wget -c https://github.com/iterative/dvc/releases/download/1.4.0/dvc_1.4.0_amd64.deb
          sudo apt install ./dvc_1.4.0_amd64.deb
          # Run DVC
          dvc pull
          Rscript infer_network.R

          # Write your CML report
          echo "MODEL METRICS"
          cat metrics.txt

Instead of committing this to the master (default) branch, we will create an experiment branch. That’s how you should use DVC! We will analyze the raw version of the alarm dataset, with no pre-processing, so I will call this branch raw_alarm_dataset.

You have already used dvc pull, so your machine is authenticated with Google Drive. Create a GitHub secret named GDRIVE_CREDENTIALS_DATA with the content of the file .dvc/tmp/gdrive-user-credentials.json.

git checkout -b raw_alarm_dataset
# infer_network.R is not in this folder, therefore `git add .` wouldn't
# add it to the index of your git repository. -A adds everything.
git add -A
git commit -m 'Infers alarm network with MIIC and default parameters'
git push origin raw_alarm_dataset
gh pr create --title 'Network inference of alarm dataset'

Now, go to GitHub and check what’s happening. If everything goes according to plan, you will see something like the image below when the check is over.

Well… You got your metrics printed out in the checks log. Cool, but you probably agree with me that we should expect something more elegant, right? Hehe ^^

Let’s add some lines to our infer_network.R script to make it plot the network, and then let’s change the last part to make use of CML functionalities. The new infer_network.R should look like:

library(miic)
alarm_dataset <- read.table('data/alarm.tsv', header = TRUE)
res <- miic(input_data = alarm_dataset)
total_edges <- nrow(res$all.edges.summary)
retained_edges <- nrow(res$all.edges.summary[res$all.edges.summary$type == 'P', ])
ratio_edges <- paste0('Ratio of retained edges: ', retained_edges/total_edges)
write.table(ratio_edges, file = 'metrics.txt', col.names = FALSE, row.names = FALSE)

# Plot network
png(file='network_diagram.png')
miic.plot(res)
dev.off()

And the new cml.yaml file should look like the code below. The new thing now is that we’re also installing CML and making use of it.

name: dvc-cml-miic
on: [push]
jobs:
  run:
    runs-on: [ubuntu-latest]
    steps:
      - uses: r-lib/actions/setup-r@master
        with:
          version: '3.6.1'
      - uses: actions/checkout@v2
      - name: cml_run
        env:
          repo_token: ${{ secrets.GITHUB_TOKEN }}
          GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}
        run: |

          # Install miic and dependencies
          wget -c https://github.com/miicTeam/miic_R_package/archive/v1.4.2.tar.gz
          tar -xvzf v1.4.2.tar.gz
          cd miic_R_package-1.4.2
          R --silent -e "install.packages(c(\"igraph\", \"ppcor\", \"scales\", \"Rcpp\"))"
          R CMD INSTALL . --preclean
          cd ..
          # Install Python packages
          pip install --upgrade pip
          pip install wheel
          pip install PyDrive2==1.6.0 --use-feature=2020-resolver
          # Install DVC
          wget -c https://github.com/iterative/dvc/releases/download/1.4.0/dvc_1.4.0_amd64.deb
          sudo apt install ./dvc_1.4.0_amd64.deb
          # Run DVC
          dvc pull
          Rscript infer_network.R

          # Install CML
          npm init --yes
          npm i @dvcorg/cml@latest
          # Write your CML report
          echo "## Model Metrics" > report.md
          cat metrics.txt >> report.md
          echo "## Data visualization" >> report.md
          npx cml-publish network_diagram.png --md >> report.md
          npx cml-send-comment report.md

Let’s commit.

git add .
git commit -m 'Uses CML to improve PR feedback'
git push origin raw_alarm_dataset

Now, right after the checks are done, you should have an automatic comment with your report like in the figure below.

Let’s say that I think too many edges have been removed and maybe the network is not consistent. I will change the infer_network.R script to make MIIC look for a consistent network. The third line now looks like:

res <- miic(input_data = alarm_dataset, consistent = 'orientation')

Then commit and push:

git add .
git commit -m 'Makes network consistent'
git push origin raw_alarm_dataset

So now I think it’s right and I should approve the pull request 🙂 . I could do it clicking on the green “Merge pull request” button or I could use gh again, GitHub’s official command line application.

gh pr merge 1

It will ask you two questions. I chose to create a merge commit and to not remove the branch, be it locally or at GitHub. To go back to the master branch, you should do:

git checkout master

Using Docker containers

You probably noticed it takes a while to run the checks and, depending on how many things you want to install, it can take very long. One way out is to use a Docker container that already has your dependencies installed. The workflow we’ve built so far is ready for your own containers; after all, we were installing CML manually. If you don’t want to use a container of your own, but don’t want to download and install CML at every check either, you can use CML’s official Docker container.

Since we merged a pull request, our remote (GitHub) is different from our local repository. To update our local repository, let’s run git pull, and then create a new branch.

git pull
git checkout -b cml_container

Change your cml.yaml to the code below.

name: dvc-cml-miic
on: [push]
jobs:
  run:
    runs-on: [ubuntu-latest]
    container: docker://dvcorg/cml
    steps:
      - uses: actions/checkout@v2
        
      - uses: r-lib/actions/setup-r@master
        with:
          version: '3.6.1'

      - name: cml_run
        env:
          repo_token: ${{ secrets.GITHUB_TOKEN }}
          GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}
        run: |
          # Install miic and dependencies
          wget -c https://github.com/miicTeam/miic_R_package/archive/v1.4.2.tar.gz
          tar -xvzf v1.4.2.tar.gz
          cd miic_R_package-1.4.2
          R --silent -e "install.packages(c(\"igraph\", \"ppcor\", \"scales\", \"Rcpp\"))"
          R CMD INSTALL . --preclean
          cd ..

          # Run DVC
          dvc pull
          Rscript infer_network.R

          # Write your CML report
          echo "## Model Metrics" > report.md
          cat metrics.txt >> report.md
          echo "## Data visualization" >> report.md
          cml-publish network_diagram.png --md >> report.md
          cml-send-comment report.md

Let’s add the changed file, commit it, push and create a Pull Request (PR).

git add .
git commit -m 'Makes use of CML container'
git push origin cml_container
gh pr create --title 'Use CML container'

Everything should have run fine, as in here. You can merge the pull request and then git pull to update your local copy.

gh pr merge 2
git pull

What else?

DVC is not limited to data tracking. We could also track our pipeline, including output files such as the image that our infer_network.R script plotted. Imagine we had some preprocessing code that delivered a preprocessed dataset to infer_network.R, which then generated the network image. Instead of running all these scripts by hand (and we can easily think of much more complicated scenarios), we can use DVC to create a pipeline, and a single command (dvc repro) in our GitHub Actions file would be enough to reproduce the whole pipeline.
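As a sketch of what that could look like: DVC 1.x records pipeline stages in a dvc.yaml file. A hypothetical two-stage pipeline for this project (the preprocess.R script and its output name are made up for illustration) might be declared roughly as:

```yaml
stages:
  preprocess:
    cmd: Rscript preprocess.R        # hypothetical preprocessing script
    deps:
      - preprocess.R
      - data/alarm.tsv
    outs:
      - data/alarm_clean.tsv         # hypothetical preprocessed dataset
  infer:
    cmd: Rscript infer_network.R
    deps:
      - infer_network.R
      - data/alarm_clean.tsv
    outs:
      - network_diagram.png
    metrics:
      - metrics.txt:
          cache: false
```

With this in place, dvc repro re-runs only the stages whose dependencies changed since the last run.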

Besides, instead of installing the same things (R, DVC, CML…) every time we push to the repository, we could have a Docker container with them already installed. This could save us some time :-). In our case, for example, downloading, compiling and installing MIIC takes a few minutes that could be spared if it were already installed in a Docker container. For our simple example, the time to download and set up the container may not make it worthwhile, but as complexity and dependencies grow, the benefits become more evident.

That’s it for today folks! 😉

You would not be reading this post if it wasn’t for Elle O’Brien, who shared so much with me about CML, along with presentations and examples, and David Ortega, who helped me set up the R environment within the CML Docker container.

Best links of the week #63

Mobility and COVID-19 cases. Did Brazil stop?

Reading Time: 9 minutes
Illustration of the novel coronavirus, Covid-19 – March 2020 / © UPI/MaxPPP

You have probably heard that Google has released a set of mobility reports recently. The site hosting these reports, the so-called COVID-19 Community Mobility Reports, begins with the following sentence: “See how your community is moving differently due to COVID19”.

What is it about?

Google offers a Location History feature in its services/systems that records the location, and consequently the movements, of users. Users can access this data and disable the feature at any time. According to Google, the feature must be activated voluntarily, as it is disabled by default. Based on this information, they observed how and where these individuals used to go in a period prior to the COVID-19 outbreak and how and where they are moving now, during the outbreak. There is a clear bias here: people who do not have a cell phone or tablet, or who have not activated this feature, are outside the sample, and this can affect the report’s conclusions. Still, it’s worth a look.

Continue…

Manage your Data Science Project in R

Reading Time: 9 minutes

A simple project tutorial with R/RMarkdown, Packrat, Git, and DVC.


The pain of managing a Data Science project

Something has been bothering me for a while: Reproducibility and data tracking in data science projects. I have read about some technologies but had never really tried any of them out until recently when I couldn’t stand this feeling of losing track of my analyses anymore. At some point, I decided to give DVC a try after some friends, mostly Flávio Clésio, suggested it to me. In this post, I will talk about Git, DVC, R, RMarkdown and Packrat, everything I think you may need to manage your Data Science project, but the focus is definitely on DVC.

Continue…

Spurious Independence: is it real?

Reading Time: 14 minutes

First things first: Spurious Dependence

Depending on your background, you have already heard of spurious dependence in one way or another. It goes by the names of spurious association, spurious dependence, the famous quote “correlation does not imply causation”, and other versions of the same idea: you cannot say that X necessarily causes Y (or vice versa) solely because X and Y are associated, that is, because they tend to occur together. Even if one of the events always happens before the other, say X preceding Y, you still cannot say that X causes Y. There is a statistical test, very famous in economics, known as Granger causality.

The Granger causality test is a statistical hypothesis test for determining whether one time series is useful in forecasting another, first proposed in 1969.[1] Ordinarily, regressions reflect “mere” correlations, but Clive Granger argued that causality in economics could be tested for by measuring the ability to predict the future values of a time series using prior values of another time series. Since the question of “true causality” is deeply philosophical, and because of the post hoc ergo propter hoc fallacy of assuming that one thing preceding another can be used as a proof of causation, econometricians assert that the Granger test finds only “predictive causality”.

Granger Causality at Wikipedia.

The post hoc ergo propter hoc fallacy is also known as “after this, therefore because of this”. It’s pretty clear today that Granger causality is not an adequate tool to infer causal relationships, and this is one of the reasons why, when X and Y are tested with the Granger causality test and an association is found, it’s said that X Granger-causes Y instead of saying that X causes Y. Maybe it’s not clear to you why the association between two variables, plus the notion that one always precedes the other, is not enough to say that one causes the other. One explanation in a hypothetical situation, for example, would be a third lurking variable C, also known as a confounder, that causes both events, a phenomenon known as confounding. By ignoring the existence of C (which in some contexts happens by design and relies on a strong assumption called unconfoundedness), you fail to realize that the events X and Y are actually independent once this third variable C, the confounder, is taken into account. Since you ignored it, they seem dependent, associated. A very famous and straightforward example is the positive correlation between (a) ice cream sales and deaths by drowning or (b) ice cream sales and the homicide rate.
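To make the confounding story concrete, here is a quick simulation sketch (plain awk, not from any of the posts above): a confounder C drives both X and Y; X and Y never influence each other, yet their marginal correlation comes out strongly positive.

```shell
# C is the confounder (say, temperature); X (ice cream sales) and Y
# (drownings) each depend on C plus independent noise, never on each other.
awk 'BEGIN {
  srand(1); n = 10000
  for (i = 1; i <= n; i++) {
    c = rand()                 # confounder
    x = c + 0.2 * rand()       # X = f(C) + noise, no influence from Y
    y = c + 0.2 * rand()       # Y = f(C) + noise, no influence from X
    sx += x; sy += y; sxx += x*x; syy += y*y; sxy += x*y
  }
  # Pearson correlation of X and Y, ignoring C
  r = (n*sxy - sx*sy) / sqrt((n*sxx - sx^2) * (n*syy - sy^2))
  printf "Pearson correlation of X and Y: %.2f\n", r
}'
```

This prints a correlation well above zero even though, conditional on C, X and Y are independent by construction.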

Continue…

How can I evaluate my model? Part I.

Reading Time: 8 minutes

One way to evaluate your model is in terms of error types. Let’s consider a scenario where you live in a city where it rains every once in a while. If you guessed that it would rain this morning, but it did not, your guess was a false positive, sometimes abbreviated as FP. If you said it would not rain, but it did, then you had a false negative (FN). Raining when you do not have an umbrella may be annoying, but life is not always that bad. You could have predicted that it would rain and it did (true positive, TP) or predicted that it would not rain and it did not (true negative, TN). In this example, it’s easy to see that in some contexts one error may be worse than the other and this will vary according to the problem. Bringing an umbrella with you in a day with no rain is not as bad as not bringing an umbrella on a rainy day, right?
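The four outcomes are easy to tabulate. A tiny sketch, using hypothetical prediction/outcome pairs (1 = rain, 0 = no rain):

```shell
# Each input line is "predicted actual"; count which of TP/FP/FN/TN it is.
printf '1 1\n1 0\n0 1\n0 0\n1 1\n' | awk '
  $1 == 1 && $2 == 1 { tp++ }   # predicted rain, it rained
  $1 == 1 && $2 == 0 { fp++ }   # predicted rain, it did not
  $1 == 0 && $2 == 1 { fn++ }   # predicted no rain, it rained
  $1 == 0 && $2 == 0 { tn++ }   # predicted no rain, it did not
  END { printf "TP=%d FP=%d FN=%d TN=%d\n", tp, fp, fn, tn }'
# prints: TP=2 FP=1 FN=1 TN=1
```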

Continue…