Best links of the week #12

Reading Time: < 1 minute

Best links of the week from 25th March to 31st March.

Links

  1. Harvard Dataverse is a repository of data currently hosting over 82 thousand datasets.
  2. The origins of the job title “data scientist” at Quartz at Work.
  3. The Data Incubator offers [paid] courses and bootcamps in Data Analysis.
  4. Teenagers are better behaved and less hedonistic nowadays at The Economist.
  5. Why You Procrastinate (It Has Nothing to Do With Self-Control) at The New York Times.
  6. RATP, Régie Autonome des Transports Parisiens (English: Autonomous Operator of Parisian Transports) is data friendly!
  7. Data Science Meetups: A list of Data Science Meetups from around the world!
  8. A list of R conferences, groups and meetings at Jumping Rivers GitHub page.

Best links of the week #9

Reading Time: < 1 minute

Best links of the week from 4th March to 10th March.

Links

  1. What are some of your favorite, but less well-known, packages for R? [1] [2] at the Statistics and Data Science subreddits.
  2. Why is it wrong to stop an A/B test before optimal sample size is reached? at Cross Validated (Stack Exchange).
  3. How do I calculate statistical power? at Effect Size FAQs.
  4. Personal website generator.
  5. From hard drive to over-heard drive: Boffins convert spinning rust into eavesdropping mic at The Register.
  6. List of Machine Learning / Deep Learning conferences in 2019 at Tryolabs.
  7. We Use Less Information to Make Decisions Than We Think at Harvard Business Review.
  8. Apple CEO Tim Cook explains why you don’t need a college degree to be successful at Business Insider.
  9. Jordan Peterson’s 10-step process for stronger writing at Big Think.
  10. R package primer at Karl Broman’s website.
  11. Researchers Can Now Cheaply Turn Atmospheric CO2 Back Into Coal at IFLScience.
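  12. A machine learning study plan with content in Portuguese (Plano de estudos em machine learning com conteúdos em português) at Italo José’s GitHub.
  13. Brazil in liberated data (O Brasil em dados libertos).
  14. Facial recognition helps arrest a criminal at the Salvador Carnival (Reconhecimento facial ajuda a prender criminoso no Carnaval de Salvador) at Canal Tech.
  15. Knowing your own genome involves surprises and disappointments (Conhecer o próprio genoma envolve surpresas e decepções) at Folha de São Paulo.
  16. What is the logic behind the lie detector? (Qual a lógica do detector de mentiras?) at Revista Questão de Ciência.
  17. Research that looks like medicine but is not (Pesquisas que parecem medicina, mas não são) at Revista Questão de Ciência.
  18. The distribution of people with doctorates across Brazil (A distribuição de pessoas com doutorado pelo Brasil) at Nexo Jornal.
  19. Programmers will make the path easier for attackers, researchers say (Programadores tornarão o caminho mais fácil para invasores dizem pesquisadores) at Mundo Hacker.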

Best links of the week #7

Best links of the week #5

Reading Time: < 1 minute

Best links of the week from 4th February to 10th February.

Links

  1. How to control another person’s arm with the power of your mind? (Como controlar o braço de outra pessoa com o poder da sua mente?) at UOL.
  2. vidente is an R package I am currently writing to parse and analyze data from the Surveillance, Epidemiology and End Results (SEER) Program, which covers over 1/3 of the US population on cancer incidence and survival.
  3. Ciência de Dados com R is a book on Data Science using R at Instituto Brasileiro de Pesquisa e Análise de Dados.
  4. Data Science & Machine Learning Course at Ivanovitch Silva’s GitHub repository.
  5. The campaign revenue of candidates for federal deputy in 2018 (A receita dos candidatos a deputado federal em 2018) at Nexo Jornal.
  6. AI 100: The Artificial Intelligence Startups Redefining Industries at CB Insights.
  7. The open-source and crowdsourced conference website.
  8. Ranking of IT conferences.

The unintended trap in bracket subsetting in R

Reading Time: 3 minutes
The silent [and maybe deadly?] trap in bracket subsetting.

Dear reader,

It should be clear to you that, like several other programming languages, R provides different ways to tackle the same problem. One common task in data analysis is subsetting a data frame and, as a quick Google search will show, there are plenty of blog posts and articles teaching different ways to do it in R. Let’s do a quick review here:

Before starting to subset a data frame, we must first create one. I will create a data frame of patients named var_example with two columns, one for vital status (is_alive) and one for birth year (birthyear). Birth year values are 4-digit numbers representing the year of birth. The is_alive column can have one of three values:

  • TRUE: The person is alive;
  • FALSE: The person is dead;
  • NA: We do not know whether this person is alive or dead.
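> # Build 100 patients: vital status (10% NA, 50% TRUE, 40% FALSE) and birth year (1980 to 1995)
> var_example <- cbind(as.data.frame(sample(c(NA, TRUE, FALSE),
                                          size = 100,
                                          replace = TRUE,
                                          prob = c(0.1, 0.5, 0.4))),
                     as.data.frame(sample(c(1980:1995),
                                          size = 100,
                                          replace = TRUE)))
> # Name the two columns
> colnames(var_example) <- c("is_alive", "birthyear")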
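
Because sample() draws at random, every run of the code above gives a different data frame. Here is a minimal sketch of a reproducible variant, assuming a fixed seed (the value 42 is arbitrary and my own choice) and using data.frame() directly instead of cbind(). It then compares three standard base R ways of keeping only the patients known to be alive. The object names alive_bracket, alive_subset and alive_which are hypothetical, chosen only for this illustration.

```r
# Reproducible variant of var_example; the seed value is an arbitrary assumption.
set.seed(42)
var_example <- data.frame(
  is_alive  = sample(c(NA, TRUE, FALSE), size = 100, replace = TRUE,
                     prob = c(0.1, 0.5, 0.4)),
  birthyear = sample(1980:1995, size = 100, replace = TRUE)
)

# Inspect the structure and count each vital status, including NA.
str(var_example)
print(table(var_example$is_alive, useNA = "ifany"))

# Bracket subsetting with a logical condition: where the condition evaluates
# to NA, R returns an all-NA row instead of dropping the row.
alive_bracket <- var_example[var_example$is_alive == TRUE, ]

# subset() and which() both discard rows where the condition is NA.
alive_subset <- subset(var_example, is_alive == TRUE)
alive_which  <- var_example[which(var_example$is_alive == TRUE), ]

# Compare the row counts of the three approaches.
print(nrow(alive_bracket))
print(nrow(alive_subset))
print(nrow(alive_which))
```

If is_alive contains any NA, the bracket version has more rows than the other two, one extra all-NA row per missing value, which is easy to miss when you only look at the first few rows of the result.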