The unintended trap in bracket subsetting in R

The silent [and maybe deadly?] trap in bracket subsetting.

Dear reader,

It should be clear to you that, like several other programming languages, R provides different ways to tackle the same problem. One common task in data analysis is subsetting a data frame and, as Google can show you, there are plenty of blog posts and articles teaching different ways to do it in R. Let’s do a quick review here:

Before starting to subset a data frame, we must first create one. I will create a data frame of patients named var_example with two columns, one for vital status (is_alive) and one for birth year (birthyear). Birth year values are 4-digit numbers representing the year of birth. The is_alive column can have one of three values:

  • TRUE: The person is alive;
  • FALSE: The person is dead;
  • NA: We do not know whether this person is alive or dead.
> # Build both columns directly, naming them in the data.frame() call
> var_example <- data.frame(
    is_alive  = sample(c(NA, TRUE, FALSE), size = 100, replace = TRUE,
                       prob = c(0.1, 0.5, 0.4)),
    birthyear = sample(1980:1995, size = 100, replace = TRUE))
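
Because sample() draws at random, the counts printed in this post (48, 56, 8) will differ from run to run. Below is a minimal sketch of a reproducible setup. The seed value 42 and the name var_repro are illustrative choices, not part of the original example. The table() call at the end shows how many NA values were drawn:

```r
# Fix the random seed so the same data frame is produced on every run.
# The seed value 42 is arbitrary.
set.seed(42)

var_repro <- data.frame(
  is_alive  = sample(c(NA, TRUE, FALSE), size = 100, replace = TRUE,
                     prob = c(0.1, 0.5, 0.4)),
  birthyear = sample(1980:1995, size = 100, replace = TRUE)
)

# useNA = "ifany" makes table() report the NA count as well,
# which it hides by default.
status_counts <- table(var_repro$is_alive, useNA = "ifany")
print(status_counts)
```

Checking the NA count up front is a cheap habit that makes the surprises later in this post easier to anticipate.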

(1) Subset function

The subset function is probably the first thing an inexperienced R user would reach for. If I’m looking for the number of alive patients and I want to use the subset function, I would type:

> nrow(subset(var_example, is_alive == TRUE))
[1] 48

(2) Which function

The which function is also well known. If I’m looking for the number of alive patients and I want to use the which function, I would type:

> length(which(var_example$is_alive == TRUE))
[1] 48

As you can see, so far we’ve obtained the same number of rows, which is expected since we’re only using different methods to achieve the same result: obtaining the number of patients that are alive. Instead of looking for the ones alive, we could also look for the ones that are either dead or unknown and then drop them. The code follows below:

> nrow(var_example[-c(which(var_example$is_alive == FALSE | var_example$is_alive == NA)), ])
[1] 56

Wait! Something weird happened here. It may surprise you if you’re not used to dealing with NA values in R. The code below may make it clear why the code above returned an unexpected result:

> TRUE == NA
[1] NA
> FALSE == NA
[1] NA
> NA == NA
[1] NA

As you can see, NA does not behave like an ordinary logical value: it stands for an unknown value, so comparing anything with it, even another NA, yields NA. To check whether something is NA, you must use the is.na function.

> is.na(TRUE)
[1] FALSE
> is.na(FALSE)
[1] FALSE
> is.na(NA)
[1] TRUE

Therefore, the correct usage of the which function is:

> nrow(var_example[-c(which(var_example$is_alive == FALSE | is.na(var_example$is_alive))), ])
[1] 48
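
To see why the first attempt returned 56 rather than 48, it can help to trace the condition on a tiny vector. The sketch below uses a three-element vector named status, made up for illustration. It shows that == NA yields NA for every element, that | keeps a TRUE but cannot resolve FALSE | NA, and that which() silently drops the NA positions:

```r
# A small illustrative vector: one alive, one dead, one unknown.
status <- c(TRUE, FALSE, NA)

# Comparing with NA never yields TRUE or FALSE.
cmp_na <- status == NA            # NA NA NA

# The buggy condition: only the dead patient evaluates to TRUE,
# because TRUE | NA is TRUE while FALSE | NA and NA | NA are NA.
buggy <- status == FALSE | status == NA

# The corrected condition uses is.na() to detect the unknown entry.
fixed <- status == FALSE | is.na(status)

# which() keeps only the positions that are TRUE and ignores NA.
print(which(buggy))   # position 2 only
print(which(fixed))   # positions 2 and 3
```

In the buggy version the unknown patient is never selected for removal, which is why the NA rows survived and inflated the count.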

(3) dplyr filter

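Even though it is not part of base R, the dplyr package is very popular among data scientists. Its filter function keeps only the rows where the condition is TRUE, so rows where it evaluates to NA are dropped.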

> library(dplyr)
> nrow(filter(var_example, is_alive == TRUE))
[1] 48

(4) Plain brackets

Or we can use plain brackets, without calling any function.

> nrow(var_example[var_example$is_alive == TRUE,])
[1] 56

Wait! We should not have received a different value, should we? This time we are not even comparing against NA, as in the example a bit above. What is going on here?

Even though some people may assume subset and subsetting by brackets do the same thing, they actually do not. The subset call above is actually equivalent to:

> var_example[!is.na(var_example$is_alive) & var_example$is_alive, ]

That is, it treats the NA values as FALSE and therefore returns what we expect: only the rows where is_alive is TRUE. If there were no NA values, both approaches would return the same rows but, unfortunately, NA values are a daily thing in the life of data scientists. Plain brackets do not treat NA as FALSE: every NA in the logical index produces an extra row (filled with NA) in the result. We can verify that the NA values are being added to the count with the code below:

> sum(is.na(var_example$is_alive))
[1] 8
> sum(is.na(var_example$is_alive)) + nrow(subset(var_example, is_alive == TRUE))
[1] 56
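
It is worth looking at what those extra rows actually contain. In the sketch below, run on a three-row data frame invented for illustration (toy), the row produced by the NA index is not the original patient record but a row in which every column is NA. Wrapping the condition in which(), or using %in% TRUE, are two common ways to keep plain brackets while dropping those rows:

```r
# Toy data frame (illustrative values only).
toy <- data.frame(is_alive  = c(TRUE, NA, FALSE),
                  birthyear = c(1980, 1985, 1990))

# Plain logical index: the NA produces a row filled with NA,
# so birthyear 1985 does not appear in the result either.
with_na <- toy[toy$is_alive == TRUE, ]
print(with_na)

# which() drops NA positions before indexing.
via_which <- toy[which(toy$is_alive == TRUE), ]

# %in% never returns NA, so NA is treated as "no match".
via_in <- toy[toy$is_alive %in% TRUE, ]

print(nrow(with_na))    # 2
print(nrow(via_which))  # 1
print(nrow(via_in))     # 1
```

Which of the two alternatives to prefer is mostly a matter of taste; both make the handling of NA explicit in the code.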

Not being aware of this can be costly, since it is a silent mistake: R raises no error or warning. Sometimes you may only notice something went wrong when you’re already analyzing your final results. It may seem silly to fall for this if you have a data frame with 100 rows and 10 columns, for it is easy to know your data by heart (or by simply looking at it with View()). But what if you have 300,000 rows and 200 columns? It gets more complicated.
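
With a data set of that size you cannot spot the problem by eye, so a programmatic check may help. The helper below, filter_rows_strict, is a hypothetical function written for this illustration; it is not part of base R or dplyr. It is a minimal sketch that stops with an error when the logical condition contains NA, instead of silently returning extra rows:

```r
# Hypothetical helper: subset rows with a logical condition,
# but refuse to proceed if the condition contains NA.
filter_rows_strict <- function(df, condition) {
  stopifnot(is.logical(condition), length(condition) == nrow(df))
  n_missing <- sum(is.na(condition))
  if (n_missing > 0) {
    stop(sprintf("condition has %d NA value(s); handle them explicitly",
                 n_missing))
  }
  df[condition, , drop = FALSE]
}

# Illustrative data only.
patients <- data.frame(is_alive  = c(TRUE, NA, FALSE, TRUE),
                       birthyear = c(1980, 1985, 1990, 1995))

# The raw comparison contains an NA, so the helper raises an error,
# which is captured here as a message string.
result <- tryCatch(
  filter_rows_strict(patients, patients$is_alive == TRUE),
  error = function(e) conditionMessage(e)
)
print(result)

# Once the NA is handled explicitly, the call succeeds.
alive <- filter_rows_strict(patients, patients$is_alive %in% TRUE)
print(nrow(alive))   # 2
```

Failing loudly is a design choice: it forces you to decide what an unknown value should mean for each analysis, rather than letting the default decide for you.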

Randomly subsetting

As a gift for those who stayed until the end of this post, here is a bonus section :-). Sometimes you do not want to subset your data frame according to a specific value. Instead, you want to sample it for some reason: take N rows out of N+X. Let’s say we would like to take 50 rows from our 100 rows, without repeating rows:

> new_variable <- var_example[sample(1:nrow(var_example), 50,
   replace=FALSE),]
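
Since sample() is random, you get a different subset each time you run it. Below is a minimal sketch of a reproducible draw, assuming an arbitrary seed (123) and a stand-in data frame, df_demo, created just for this illustration:

```r
# Stand-in data frame with 100 rows (illustrative only).
df_demo <- data.frame(id = 1:100, birthyear = rep(1980:1999, times = 5))

# Fixing the seed makes the draw repeatable; 123 is arbitrary.
set.seed(123)
picked <- sample(seq_len(nrow(df_demo)), size = 50, replace = FALSE)
half <- df_demo[picked, ]

# With replace = FALSE no row index can appear twice.
print(nrow(half))               # 50
print(anyDuplicated(half$id))   # 0
```

seq_len(nrow(df)) is used instead of 1:nrow(df) because it also behaves sensibly when the data frame has zero rows.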

And that’s it, folks. I hope you enjoyed today’s lesson 🙂