Data Science, R

Mobility and COVID-19 cases. Did Brazil stop?

Reading time: 9 minutes
Illustration of the novel coronavirus, COVID-19 – March 2020 / © UPI/MaxPPP

You have probably heard that Google recently released a set of mobility reports. The site hosting these reports, the COVID-19 Community Mobility Reports, opens with the following sentence: “See how your community is moving differently due to COVID-19”.

What is it about?

Google offers a Location History feature in its services that tracks users' location and, consequently, their movement. Users can access or disable this data at any time. According to Google, the feature must be activated voluntarily, as it is disabled by default. Based on this information, Google observed how and where these individuals used to move in a period prior to the COVID-19 outbreak, and how and where they are moving now, during the outbreak. There is a clear bias here: people who do not have a cell phone or tablet, or who have not activated this feature, are outside the sample, and this can affect the report's conclusions. Still, it's worth a look.

The reports are divided by country, and Brazil obviously caught my attention (you can access the report for Brazil by clicking here). Unfortunately, they do not make the data available in a format that facilitates manipulation and analysis. However, Vitor Baptista took care of this for us 🙂 He used Python to extract the data from these PDF reports (not only for Brazil but also for some other countries) and generate a CSV (comma-separated values) file that you can download by clicking here.

Ready for the analysis?

The first thing that came to my mind was to see how this variation in movement relates to state-level variables (population density, for example) and to the growth rate of the number of COVID-19 cases. With that in mind, I downloaded the most recent data from the official COVID-19 dashboard of the Ministry of Health, and data about Brazilian states from the IBGE website. I put all these files in a directory, set it as my working directory in R, loaded some libraries that I will need throughout this analysis, and then read the files.

setwd('/home/mribeirodantas/COVID mobility/')
library(readr)
library(dplyr)
library(lubridate)
library(ggplot2)
# Source: https://covid.saude.gov.br/
brazil <- read_csv2(file = 'brazil n_cases.csv')
# Source: https://github.com/vitorbaptista/google-covid19-mobility-reports
mobility <- read_delim(file = 'mobility_reports.csv', delim = ',')
# Source: https://www.ibge.gov.br/cidades-e-estados
ibge <- read_delim(file = 'IBGE.csv', ',', skip = 1)

Ministry of Health dataset

The dataset I downloaded from the Ministry of Health dashboard has 7 variables: regiao, estado, data, casosNovos, casosAcumulados, obitosNovos and obitosAcumulados. The idea is to work state by state, that is, group by state, sort by date, and calculate the growth rate from one day to the next. The code that does this follows below:

# Function to compute the daily growth rate
growth_rate <- function(x) {
  (x/lag(x)-1)*100
}
# Converting the data field into a valid date
brazil$data <- dmy(brazil$data)
# Computing the daily growth rate of the number of cases
brazil %>%
  select(estado, data, casosAcumulados) %>%
  arrange(estado, data) %>%
  group_by(estado) %>%
  mutate(growth_rate_casos = growth_rate(casosAcumulados)) -> df
# Computing the daily growth rate of the number of deaths
brazil %>%
  select(estado, data, obitosAcumulados) %>%
  arrange(estado, data) %>%
  group_by(estado) %>%
  mutate(growth_rate_obitos = growth_rate(obitosAcumulados)) -> df2
# Merging this into a single dataframe
df <- merge(df, df2, by=c('estado', 'data'))

Another thing we need to take care of is comparing the same epidemiological week. Since the number of cases tends to double at an alarming frequency (a matter of days), comparing a state one month after the beginning of its outbreak with a state that has just identified its first cases would not be adequate. The correct thing is to compare beginning with beginning, and so on. Based on this, we will ignore the rows, per state, prior to the first confirmed case.

df %>%
  group_by(estado) %>%
  slice(which(growth_rate_casos == Inf)[1]:n()) -> df

You may be curious about the Inf in the code above. When you calculate the growth rate going from 0 to any other number, part of the equation divides a positive number by zero, which R evaluates as infinity (Inf). Therefore, the first Inf we encounter marks the day the first case occurred, and we remove everything before it.
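For instance, applying growth_rate to a short cumulative series that starts at zero reproduces this behavior:

growth_rate(c(0, 3, 6))
# [1]  NA Inf 100
# NA for the first day (no previous value), Inf when going from 0 to 3,
# then 100% when going from 3 to 6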

Now that we have our dataframe with the growth rate information, we will calculate the average daily rates for each state (for new cases and new deaths, based on the accumulated values), and we are done with this dataset. After that, we delete the variables that we will no longer use. The is.finite filter just removes the remaining Inf values.

df %>%
  group_by(estado) %>%
  filter(is.finite(growth_rate_casos)) %>%
  filter(is.finite(growth_rate_obitos)) %>%
  summarise_at(c('growth_rate_casos', 'growth_rate_obitos'), mean) -> casos_obitos
rm(df2, brazil, df, growth_rate)

Google Mobility dataset

This dataset contains the following variables: country, region, updated_at, retail_and_recreation, grocery_and_pharmacy, parks, transit_stations, workplaces and residential. The data I downloaded was last updated on March 29th. If you open the report, you will see the definitions of these variables, which I reproduce below:

retail_and_recreation: Places such as restaurants, cafes, shopping centers, theme parks, museums, libraries and cinemas.
grocery_and_pharmacy: Places such as grocery stores, markets, food stores, specialty food stores, drugstores and pharmacies.
parks: Places like national parks, public beaches, marinas, dog parks, squares and public gardens.
transit_stations: Places like public transport hubs, such as metro, bus and train stations.
workplaces: Work places.
residential: Residential area.

The first thing we will do is remove the updated_at column, which does not interest us, filter for the data related to Brazil, and rename the states to match the state names in the Ministry of Health dataset, which uses the federative unit abbreviations (RN, SP, RJ, etc.). This will be important when we merge the data.

mobility %>%
filter(country == 'Brazil') %>%
select(-c('updated_at')) %>%
# Updating state names
mutate(region = case_when(region == 'State of Acre' ~ 'AC',
region == 'State of Amapá' ~'AP',
region == 'State of Espírito Santo' ~ 'ES',
region == 'State of Maranhão' ~ 'MA',
region == 'State of Mato Grosso do Sul' ~ 'MS',
region == 'State of Paraná' ~ 'PR',
region == 'State of Pará' ~ 'PA',
region == 'State of Piauí' ~ 'PI',
region == 'State of Rio Grande do Sul' ~ 'RS',
region == 'State of Rondônia' ~ 'RO',
region == 'State of Santa Catarina' ~ 'SC',
region == 'State of São Paulo' ~ 'SP',
region == 'Federal District' ~ 'DF',
region == 'State of Alagoas' ~ 'AL',
region == 'State of Amazonas' ~ 'AM',
region == 'State of Ceará' ~ 'CE',
region == 'State of Goiás' ~ 'GO',
region == 'State of Mato Grosso' ~ 'MT',
region == 'State of Minas Gerais' ~ 'MG',
region == 'State of Paraíba' ~ 'PB',
region == 'State of Pernambuco' ~ 'PE',
region == 'State of Rio Grande do Norte' ~ 'RN',
region == 'State of Rio de Janeiro' ~ 'RJ',
region == 'State of Roraima' ~ 'RR',
region == 'State of Sergipe' ~ 'SE',
region == 'State of Bahia' ~ 'BA',
region == 'State of Tocantins' ~ 'TO')) -> mobility

IBGE dataset

The IBGE dataset also needs some pre-processing. We will also rename the states and drop the rows and columns that are not related to the information we want. This dataset has the following variables: UF, Código, Gentílico, Governador [2019], Capital [2010], Área Territorial – km² [2018], População estimada – pessoas [2019], Densidade demográfica – hab/km² [2010], Matrículas no ensino fundamental – matrículas [2018], IDH [2010], Receitas realizadas… and several other columns that do not interest me for this analysis. I kept only population, population density, and state name.

ibge %>%
  select(`UF [-]`, `População estimada – pessoas [2019]`,
         `Densidade demográfica – hab/km² [2010]`) %>%
  mutate(`UF [-]` = case_when(`UF [-]` == 'Acre' ~ 'AC',
                              `UF [-]` == 'Alagoas' ~ 'AL',
                              `UF [-]` == 'Amapá' ~ 'AP',
                              `UF [-]` == 'Amazonas' ~ 'AM',
                              `UF [-]` == 'Bahia' ~ 'BA',
                              `UF [-]` == 'Ceará' ~ 'CE',
                              `UF [-]` == 'Distrito Federal' ~ 'DF',
                              `UF [-]` == 'Espírito Santo' ~ 'ES',
                              `UF [-]` == 'Goiás' ~ 'GO',
                              `UF [-]` == 'Maranhão' ~ 'MA',
                              `UF [-]` == 'Mato Grosso' ~ 'MT',
                              `UF [-]` == 'Mato Grosso do Sul' ~ 'MS',
                              `UF [-]` == 'Minas Gerais' ~ 'MG',
                              `UF [-]` == 'Pará' ~ 'PA',
                              `UF [-]` == 'Paraíba' ~ 'PB',
                              `UF [-]` == 'Paraná' ~ 'PR',
                              `UF [-]` == 'Pernambuco' ~ 'PE',
                              `UF [-]` == 'Piauí' ~ 'PI',
                              `UF [-]` == 'Rio de Janeiro' ~ 'RJ',
                              `UF [-]` == 'Rio Grande do Norte' ~ 'RN',
                              `UF [-]` == 'Rio Grande do Sul' ~ 'RS',
                              `UF [-]` == 'Rondônia' ~ 'RO',
                              `UF [-]` == 'Roraima' ~ 'RR',
                              `UF [-]` == 'Santa Catarina' ~ 'SC',
                              `UF [-]` == 'São Paulo' ~ 'SP',
                              `UF [-]` == 'Sergipe' ~ 'SE',
                              `UF [-]` == 'Tocantins' ~ 'TO')) -> ibge
colnames(ibge) <- c('estado', 'pop', 'dens_demo')
# Ignore the last row, which has no state-related information
ibge <- slice(ibge, 1:27)

I merged the three datasets into a single one with all the information I need, corrected the data type of each column, and deleted the variables that will no longer be used.

df <- merge(x = casos_obitos, y = mobility, by.x = c('estado'), by.y = c('region'))
df <- merge(x = df, y = ibge, by.x = c('estado'))
rm(casos_obitos, mobility, ibge)
# Setting the correct column types
df[, c(2,3,5:12)] <- sapply(df[, c(2,3,5:12)], as.numeric)

Data visualization

Based on these data, something I would like to analyze is the average daily growth rate of the number of cases and deaths in relation to the change in movement that individuals in each state experienced. For example, the code below produces the chart that you see right after it. The size of the points reflects the population of each state and the color its demographic density.

ggplot(df, aes(x=growth_rate_casos, y=retail_and_recreation,size=pop,color=dens_demo)) +
geom_point(shape=19) +
geom_smooth() +
scale_color_gradient(name = "Densidade\ndemográfica", low="blue", high="red",
labels = c('100 hab/km²',
'200 hab/km²',
'300 hab/km²',
'400 hab/km²',
'500 hab/km²')) +
scale_size_continuous(name = "População", labels = c('1 milhão hab',
'10 milhões hab',
'20 milhões hab',
'30 milhões hab',
'40 milhões hab',
'50 milhões hab')) +
xlab('Taxa de crescimento médio de número de casos por dia desde o primeiro caso confirmado') +
ylab('Variação de mobilidade para fins de\n lazer e varejo') +
labs(caption = "Ribeiro-Dantas, MC. April, 2020. mribeirodantas.xyz/blog") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
scale_x_continuous(labels = function(x) paste0(sprintf("%0.0f", x), '%')) +
ggtitle("Taxa média diária de casos novos e deslocamento") +
ggrepel::geom_label_repel(aes(label = df$estado), size=3)

The greatest variation occurred in this type of mobility, retail and leisure locations, which is good news from the lockdown point of view (leisure), but controversial from the economic point of view (retail), even though e-commerce has been quite active in some segments in some countries. The state with the smallest reduction was Pará, and even so it had a reduction of about 60%, which is substantial.

The code below will plot the graph right after it, this time with respect to the average daily rate of the number of new deaths.

ggplot(df, aes(x=growth_rate_obitos, y=workplaces,size=pop,color=dens_demo)) +
geom_point(shape=19) +
geom_smooth() +
scale_color_gradient(name = "Densidade\ndemográfica", low="blue", high="red",
labels = c('100 hab/km²',
'200 hab/km²',
'300 hab/km²',
'400 hab/km²',
'500 hab/km²')) +
scale_size_continuous(name = "População", labels = c('1 milhão hab',
'10 milhões hab',
'20 milhões hab',
'30 milhões hab',
'40 milhões hab',
'50 milhões hab')) +
xlab('Taxa de crescimento médio de número de óbitos por dia desde o primeiro caso confirmado') +
ylab('Variação de mobilidade para ambiente\n de trabalho') +
labs(caption = "Ribeiro-Dantas, MC. April, 2020. mribeirodantas.xyz/blog") +
ggtitle("Taxa média de óbitos novos e deslocamento") +
ggrepel::geom_label_repel(aes(label = df$estado), size=3)

I thought about not publishing the image above, for fear that someone would use it for disinformation. However, I preferred to keep it and explain here why it gives the wrong impression, and why I decided not to plot the other charts based on the average daily rate of new deaths. Espírito Santo had its first COVID-19 death on April 2nd, three more deaths on April 3rd (300% growth) and another one on April 4th (25% growth). Averaging these two growth rates gives about 162%, with a total of 5 deaths, which throws Espírito Santo to the right corner of the chart above, far from the other states. Does it deserve that? Ceará had 3 deaths on March 26 and another death on March 28 (a 33% increase), a total of 4 deaths. Its average? 16.6%.
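Just to make the arithmetic explicit, here is a quick check of the Espírito Santo example using its cumulative death counts (1, 4, 5):

es <- c(1, 4, 5)                          # cumulative deaths on April 2nd, 3rd and 4th
rates <- (es / dplyr::lag(es) - 1) * 100  # daily growth rates: NA, 300, 25
mean(rates, na.rm = TRUE)                 # 162.5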

By changing the name of the variable in the code for the first chart, you can get the other charts. They follow below.
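If you would rather not edit the variable name by hand, here is a minimal sketch (not part of the original code) that loops over the six mobility columns with ggplot2's .data pronoun, using a simplified version of the chart:

mobility_vars <- c('retail_and_recreation', 'grocery_and_pharmacy', 'parks',
                   'transit_stations', 'workplaces', 'residential')
plots <- lapply(mobility_vars, function(var) {
  ggplot(df, aes(x = growth_rate_casos, y = .data[[var]],
                 size = pop, color = dens_demo)) +
    geom_point(shape = 19) +
    geom_smooth() +
    scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
    xlab('Taxa de crescimento médio de casos por dia') +
    ylab(var) +
    ggrepel::geom_label_repel(aes(label = estado), size = 3)
})
# plots[[3]], for example, is the chart for parks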

This is the only graph where we had an increase in mobility, instead of a decrease, but it makes sense. People are staying at home or walking nearby to practice physical exercises, such as jogging.

As an experiment, I decided to use the total number of confirmed cases instead of the average daily rate of new cases. Code and plots are below.

brazil <- read_csv2(file = 'brazil n_cases.csv')
brazil %>%
  group_by(estado) %>%
  summarise(total = sum(casosNovos)) -> brazil
df <- merge(x = df, y = brazil, by = c('estado'))
library(ggplot2)
# Scatter plots
# retail_and_recreation
ggplot(df, aes(x=total, y=retail_and_recreation,size=pop,color=dens_demo)) +
geom_point(shape=19) +
geom_smooth() +
scale_color_gradient(name = "Densidade\ndemográfica", low="blue", high="red",
labels = c('100 hab/km²',
'200 hab/km²',
'300 hab/km²',
'400 hab/km²',
'500 hab/km²')) +
scale_size_continuous(name = "População", labels = c('1 milhão hab',
'10 milhões hab',
'20 milhões hab',
'30 milhões hab',
'40 milhões hab',
'50 milhões hab')) +
xlab('Número total de casos confirmados em 5 de Abril') +
ylab('Variação de mobilidade para fins de\n lazer e varejo') +
labs(caption = "Ribeiro-Dantas, MC. April, 2020. mribeirodantas.xyz/blog") +
ggtitle("Total de casos confirmados e deslocamento") +
ggrepel::geom_label_repel(aes(label = df$estado), size=3)

As you can see, there is a very large disparity on the x-axis, with São Paulo far away from the other states. For this reason, I decided to apply a logarithmic transformation (log10) to the x-axis, as you can see in the code and charts below.

ggplot(df, aes(x=total, y=retail_and_recreation,size=pop,color=dens_demo)) +
geom_point(shape=19) +
geom_smooth() +
scale_color_gradient(name = "Densidade\ndemográfica", low="blue", high="red",
labels = c('100 hab/km²',
'200 hab/km²',
'300 hab/km²',
'400 hab/km²',
'500 hab/km²')) +
scale_size_continuous(name = "População", labels = c('1 milhão hab',
'10 milhões hab',
'20 milhões hab',
'30 milhões hab',
'40 milhões hab',
'50 milhões hab')) +
xlab('Número total de casos confirmados em 5 de Abril') +
ylab('Variação de mobilidade para fins de\n lazer e varejo') +
labs(caption = "Ribeiro-Dantas, MC. April, 2020. mribeirodantas.xyz/blog") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
scale_x_continuous(trans='log10') +
ggtitle("Total de casos confirmados e deslocamento") +
ggrepel::geom_label_repel(aes(label = df$estado), size=3)

I could go on plotting the data for the 6 types of movement against the lethality rate, among other metrics, but I believe it is better to wrap up this post.

Disclaimer

There is an uncontrollable urge within all of us to be useful, especially at a time like this, during a pandemic. We feel bold, eager to jump at the first opportunity to draw conclusions from the information we are given. While the good intent is commendable, we need to act responsibly. Some people may be tempted to evaluate the quarantine intervention based on this data. That is a terrible idea. The Google data compares two moments in time: before the intervention (a reference point that some may see as an arbitrary choice) and March 29, 2020. Even if we had this information for every day throughout the period, we would still be establishing only a correlation. And, as we all know, correlation does not imply causation.

Likewise, some of my decisions may seem arbitrary to others. Take the metric I used, for example: the average daily growth rate of the number of cases or deaths. Say that state A, over the first 10 days after its first confirmed case, had its cumulative cases distributed as follows: 1, 1, 1, 1, 1, 1, 1, 1, 8, 10, for a total of 10 confirmed patients. That is, 1 patient confirmed on the first day, then suddenly seven more confirmed on the ninth day and two more on the tenth. Another state B had, over the first 10 days after its first confirmed case, the cases distributed as follows: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. Also 10 patients. Shall we calculate the rate?

> A <- c(1,1,1,1,1,1,1,1,8,10)
> B <- c(1,2,3,4,5,6,7,8,9,10)
> growth_rate(A)
[1] NA 0 0 0 0 0 0 0 700 25
> sum(growth_rate(A), na.rm=TRUE)/length(A)
[1] 72.5
>
> growth_rate(B)
[1] NA 100.00000 50.00000 33.33333 25.00000 20.00000 16.66667 14.28571 12.50000
[10] 11.11111
> sum(growth_rate(B), na.rm=TRUE)/length(B)
[1] 28.28968

As you can see, state A had an average daily rate of new cases more than twice that of state B. If this data accurately reflected reality, I would be a little more relaxed, but unfortunately, we cannot be sure. Perhaps state A is state B in a parallel universe where the laboratory that performs the tests received many samples at once and ended up releasing the results of 7 patients on the ninth day. Although I hope that the states revise their data so that cases are attributed to the day the sample was collected, there is no guarantee that this is being done, let alone done the same way by all states.

In an analysis by Open Knowledge Brasil, which evaluates the content, format, and details of what is published on state and federal government portals, the authors concluded that transparency in relation to COVID-19 is insufficient in 90% of Brazilian states. I confess that I am actually impressed with some actions by the federal government, state governments, and even the private sector, but to do responsible and professional analysis, we need more.

I am publishing this post in order to disseminate these datasets, teach a little R and data analysis, and encourage others who may have more or better data to perform a more professional analysis. As a homework assignment, for those who want one 😛 it would be interesting to track down, in the official state journals, when each state decreed its lockdown (closing retail, for example) and analyze the mobility variations based on that 😉 (a rough sketch follows below).
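The sketch below assumes you manually compile a small table of decree dates; the states and dates used here are placeholders for illustration, not actual decree dates:

# Hypothetical decree dates: placeholder values, NOT real data
decrees <- tibble::tribble(
  ~estado, ~decree_date,
  'SP',    dmy('24/03/2020'),
  'RJ',    dmy('21/03/2020'),
  'RN',    dmy('23/03/2020')
)
# Days between each state's decree and the mobility snapshot (March 29th),
# joined with the mobility variation for retail and recreation
decrees %>%
  mutate(days_since_decree = as.integer(dmy('29/03/2020') - decree_date)) %>%
  inner_join(select(df, estado, retail_and_recreation), by = 'estado')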

I will soon organize the code and make it available in a GitHub repository.

PS: Rafael Gomes has written code (and a blog post making use of it) to extract not only the value on March 29, but the whole period shown in the report's plots. I'm sharing it as a nudge for those looking for something more advanced to do 🙂