
Continuous Machine Learning – Part II


This is a 3-part series about Continuous Machine Learning. You can check Part I here and Part III here. This post continues the previous one, in which we started automating Data Science on GitHub with CML. This time we will use Docker to cut the computation time of our GitHub Actions checks.

You can think of a Docker image as a snapshot of a project’s software environment that you can then set up on any other computer. When GitHub Actions is triggered, it loads your Docker image on GitHub’s infrastructure and runs your code inside it. That’s why it’s quicker: with your dependencies already installed in the image, you don’t spend time setting them up all over again on the GitHub Actions runner every time it is triggered, which is what we did in the first part of this series.
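To make "dependencies already installed" concrete, here is a minimal sketch of a check you could run inside any environment to see which tools would still have to be installed at run time. The helper name missing_tools is hypothetical, and the tool list (R, Rscript, dvc) is only my assumption about what this series needs on the PATH.

```shell
#!/bin/sh
# Print every tool from the argument list that is NOT found on PATH.
# (missing_tools is a hypothetical helper, not part of Docker or CML.)
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Assumed tool list for this series; adjust it to your own project.
missing=$(missing_tools R Rscript dvc)
if [ -z "$missing" ]; then
  echo "all tools present: nothing to install at run time"
else
  echo "these would have to be installed on every run:"
  echo "$missing"
fi
```

On a bare runner the list is likely to be non-empty, while inside an image that already ships those tools it should come back empty.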

Creating a Docker image

Image from “Build a Docker Image just like how you would configure a VM”.

As you can see in the image above, you write a text file called Dockerfile, and Docker builds a Docker image from the instructions it contains. Then, with the Docker engine running on some infrastructure (GitHub’s, in our case), this image is turned into a running container, like in the image below (there are three containers).

Image from “Si Docker m’était conté… 2ème partie : plongée à l’intérieur des conteneurs”.

We will base our Dockerfile on CML’s official Dockerfile and it will look like the code below.

FROM dvcorg/cml
RUN apt-key adv --keyserver --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9 && \
    add-apt-repository 'deb bionic-cran35/' && \
    apt update && \
    apt install -y r-base && \
    R --silent -e "install.packages(c(\"igraph\", \"ppcor\", \"scales\", \"Rcpp\"))" && \
    wget -c && \
    tar -xvzf v1.4.2.tar.gz && \
    R CMD INSTALL miic_R_package-1.4.2/. --preclean

As you can see, it has a lot of the code that we used to have in our cml.yaml file (in the first post of this series). That’s the magic! Because of that, our cml.yaml file is much shorter and should look like this:

name: dvc-cml-miic
on: [push]
jobs:
  run:  # job id; the name itself is an assumption, any valid id works
    runs-on: [ubuntu-latest]
    container: docker://mribeirodantas/cml-test:r
    steps:
      - uses: actions/checkout@v2

      - name: cml_run
        env:
          repo_token: ${{ secrets.GITHUB_TOKEN }}
        run: |
          R --version

          dvc pull
          Rscript infer_network.R

          # Write your CML report
          echo "## Model Metrics" >
          cat metrics.txt >>
          echo "## Data visualization" >>
          cml-publish network_diagram.png --md >>
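The report-writing step relies on plain shell redirection: > creates (or truncates) a file and >> appends to it. Here is a small sketch of that pattern that you can run locally. The function name build_report, the file name report.md and the placeholder metrics line are all hypothetical, and the cml-publish step is left out because it needs the CML tooling.

```shell
#!/bin/sh
# Assemble a Markdown report from a metrics file.
# $1: metrics file to include, $2: report file to (re)create.
build_report() {
  metrics_file=$1
  report_file=$2
  echo "## Model Metrics" > "$report_file"        # '>' creates or truncates
  cat "$metrics_file" >> "$report_file"           # '>>' appends
  echo "## Data visualization" >> "$report_file"
}

# Demo in a scratch directory, with a made-up metrics file.
demo_dir=$(mktemp -d)
printf 'placeholder metric line\n' > "$demo_dir/metrics.txt"
build_report "$demo_dir/metrics.txt" "$demo_dir/report.md"
cat "$demo_dir/report.md"
```

Because the first write uses >, running it again starts the report from scratch instead of piling new content on top of an old one.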

You can see the difference between the old and new cml.yaml in the image below. You can also view it directly on GitHub by clicking here.

You can save the Dockerfile shown here in a file named Dockerfile in your repository (which right now should look like this) and update your cml.yaml to match the code above.

To build it and publish it on Docker Hub (think of it like GitHub, but for Docker images), you gotta type the following Docker commands. You need Docker installed for that (instructions for Ubuntu here). The first command takes a while because it’s creating your image, and that’s the only time it will really download and install everything. Keep in mind that the image name you build and push has to be the same one referenced in the container: line of cml.yaml. If you understood what I explained earlier, it should be clear that having a container pays off 🙂

# Create the Docker image from the Dockerfile
sudo docker build -t mribeirodantas/cml:r -f ./Dockerfile .
# Log in to Docker Hub
sudo docker login
# Upload it to Docker Hub. GitHub will always download the
# Docker image from there.
sudo docker push mribeirodantas/cml:r
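Since GitHub pulls whatever image the workflow names, it is worth checking that name before you build and push. The sketch below reads it from the container: docker://... line. The helper workflow_image is hypothetical, and the path .github/workflows/cml.yaml is my assumption about where your workflow file lives.

```shell
#!/bin/sh
# Print the image named on the first `container: docker://...` line of a
# workflow file. (workflow_image is a hypothetical helper.)
workflow_image() {
  sed -n 's|^[[:space:]]*container:[[:space:]]*docker://||p' "$1" | head -n 1
}

# Assumed location of the workflow file; adjust it to your repository.
workflow=.github/workflows/cml.yaml
if [ -f "$workflow" ]; then
  image=$(workflow_image "$workflow")
  echo "the workflow expects this image: $image"
fi
```

You can then pass that exact name to docker build -t and docker push, so the image you publish is the one the workflow downloads.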

Testing Docker image

We are not sure if this will work, so we should create a branch and check what happens. Instead of creating pull requests from GitHub’s web interface, we will use gh, the official GitHub command-line tool, just like we did in the first part of this series.

git checkout -b docker_ghactions
git add -A
git commit -m 'Makes use of Docker to speed up GH Actions checks'
git push origin docker_ghactions
gh pr create --title 'Makes use of Docker to speed up GH Actions checks'

In the first part of this series, you probably remember that the checks took around 7 minutes to finish. In this branch, which uses a Docker container, they took less than 2 minutes. Considering that this is a very simple example project and other projects could have many more dependencies, the benefits can be even larger than the ones we see here. Great, right? Let’s merge this pull request.
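To get a feel for how this adds up, here is a back-of-the-envelope sketch. It uses the figures from this post (about 7 minutes before, and 2 minutes as an upper bound after), while the 50 runs are a made-up number just for illustration. The function name minutes_saved is hypothetical.

```shell
#!/bin/sh
# Minutes saved over a number of workflow runs.
# $1: minutes per run before, $2: minutes per run after, $3: number of runs.
minutes_saved() {
  before=$1
  after=$2
  runs=$3
  echo $(( (before - after) * runs ))
}

# 7 and 2 come from this post; 50 runs is an illustrative assumption.
minutes_saved 7 2 50
```

Since the new checks took less than 2 minutes, the number printed here is a lower bound on the time saved.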

# With no argument, gh pr merge picks the pull request of the current branch.
# That is the one we have just opened.
gh pr merge

One question you may have is why we leave dvc pull in the cml.yaml file instead of running it only once in the Docker image. After all, downloading big datasets every time should take a while, right? Well, you are right, it may take a while. However, if nothing changed, dvc pull will quickly say so and nothing will be done. And in case you changed your raw datasets, for example, it will download them.

That’s it for today folks 🙂