One way to evaluate your model is in terms of error types. Consider a scenario where you live in a city where it rains every once in a while. If you guessed that it would rain this morning, but it did not, your guess was a false positive, sometimes abbreviated as FP. If you said it would not rain, but it did, then you had a false negative (FN). Getting rained on without an umbrella may be annoying, but life is not always that bad. You could have predicted that it would rain and it did (true positive, TP), or predicted that it would not rain and it did not (true negative, TN). In this example, it’s easy to see that in some contexts one error may be worse than the other, and this will vary according to the problem. Bringing an umbrella with you on a day with no rain is not as bad as not bringing an umbrella on a rainy day, right?
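The four outcomes above can be counted directly from two logical vectors. This is a minimal sketch in R; the vectors `predicted` and `actual` are made-up data for seven mornings (TRUE means rain), not results from any real forecast.

```r
# Hypothetical forecasts and outcomes for seven mornings (TRUE = rain).
predicted <- c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE)
actual    <- c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE)

tp <- sum(predicted & actual)    # said rain, it rained
fp <- sum(predicted & !actual)   # said rain, it stayed dry
fn <- sum(!predicted & actual)   # said dry, it rained
tn <- sum(!predicted & !actual)  # said dry, it stayed dry

# The same counts as a 2x2 table (rows: prediction, columns: outcome).
confusion <- table(predicted = predicted, actual = actual)
print(confusion)
```

Keeping the four counts separate, rather than collapsing them into a single accuracy figure, is what lets you weigh one kind of error more heavily than the other.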
Best links of the week from 20th May to 26th May
- UN (United Nations) data.
- A curated list of 200+ blogs related to Data Science at CybrHome.
- 25 Excellent Machine Learning Open Datasets.
- Group Chats Are Making the Internet Fun Again at Intelligencer.
- Do anything with dplyr.
- Starting out with R at Credibly Curious.
How does the current president of Brazil compare with his predecessors in terms of the number of decrees?
Best links of the week from 15th April to 21st April
- When it comes to clustering, depending on the algorithm used, one may have a hard time determining the appropriate k (number of clusters). Some algorithms do not require it, but for the ones that do, such as k-means, have a look at the elbow method or at the silhouette of the objects with respect to their clusters to evaluate the appropriate k.
- Dunder Data is a professional training company dedicated to teaching data science and machine learning. There is paid and free online material.
- Software Carpentry, teaching basic lab skills for research computing.
- ROpenSci, transforming science through open data and software.
- mlmaisleve, quick and light concepts about Machine Learning.
- kite, Code Faster in Python with Line-of-Code Completions.
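The elbow method mentioned in the first link of this list can be sketched in a few lines of base R. The data here are hypothetical (three simulated 2-D blobs), and the names `toy_points`, `ks`, and `wss` are only for illustration.

```r
set.seed(42)  # make the toy data and the k-means starts reproducible

# Hypothetical data: three 2-D blobs centred near 0, 5 and 10.
toy_points <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)

# Total within-cluster sum of squares for each candidate k.
ks <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(toy_points, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is the k after which the curve stops dropping sharply.
plot(ks, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")
```

Reading the elbow off the plot is a judgment call, which is why it is often checked against the silhouette as well.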
The silent [and maybe mortal?] trap in bracket subsetting.
It should be clear to you that, like several other programming languages, R provides different ways to tackle the same problem. One common problem in data analysis is to subset your data frame and, as Google can show you, there are several blog posts and articles trying to teach you different ways to subset your data frame in R. Let’s do a quick review here:
Before starting to subset a data frame, we must first create one. I will create a data frame of patients named var_example with two columns, one for vital status (is_alive) and one for birth year (birthyear). Birth year values are 4-digit numbers representing the year of birth. The is_alive column can have one of three values:
- TRUE: The person is alive;
- FALSE: The person is dead;
- NA: We do not know whether this person is alive or dead.
> var_example <- cbind(as.data.frame(sample(c(NA, TRUE, FALSE), size=100, replace=TRUE, prob = c(0.1, 0.5, 0.4))), as.data.frame(sample(c(1980:1995), size=100, replace=TRUE)))
> colnames(var_example) <- c("is_alive", "birthyear")
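As a small illustration of why the NA values matter, here is a sketch of how three common subsetting styles treat an NA in a logical condition. It uses a hypothetical four-row data frame, `toy`, instead of the random `var_example` so that the row counts are predictable. Whether this is the trap the title refers to is an assumption.

```r
# Hypothetical toy data frame with one unknown vital status.
toy <- data.frame(
  is_alive  = c(TRUE, FALSE, NA, TRUE),
  birthyear = c(1980, 1985, 1990, 1995)
)

# Bracket subsetting with a logical index returns an all-NA row
# for every NA in the condition.
by_bracket <- toy[toy$is_alive == TRUE, ]

# subset() and which() both drop rows where the condition is NA.
by_subset <- subset(toy, is_alive == TRUE)
by_which  <- toy[which(toy$is_alive == TRUE), ]

nrow(by_bracket)  # 3: two living patients plus one all-NA row
nrow(by_subset)   # 2
nrow(by_which)    # 2
```

The extra row in `by_bracket` carries NA in every column, so it inflates row counts without raising any warning.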