Friday, January 2, 2015

Blogging in R

Previous posts have discussed how computers, IT, and open-source software have made incredible strides in recent years and how those advances help infectious disease epidemiology. On that theme, I recently found something very cool, which this post illustrates: the entire post was written in the RStudio IDE using the markdown and knitr packages in R.

Markdown is a plain-text formatting syntax and text-to-HTML conversion tool that turns prose into structurally valid XHTML or HTML. Details on R Markdown can be found at http://rmarkdown.rstudio.com, and a short tutorial is available at https://support.rstudio.com/hc/en-us/articles/200552086-Using-R-Markdown. knitr is an engine for dynamic report generation with R. Together, markdown and knitr can produce sophisticated documents that include bibliographic referencing and mathematical typesetting. Moreover, a document can embed chunks of R code and display the code itself, its output, or both.

As a simple example, below is a word cloud made by querying the Twitter search API for tweets containing the terms “vaccine” or “vaccinated” in a 24-hour period. The code is shown first (it could clearly be written better), followed by the output:

rm(list=ls())

library(twitteR)
library(tm)
library(wordcloud)
library(RColorBrewer)
library(RJSONIO)

#====================================================

# Load the saved OAuth credentials and register them with twitteR
path_var <- "~/Dropbox/Docs/Blogs/"
auth_var <- "my_oauth.Rdata"
load(paste0(path_var, auth_var))
registerTwitterOAuth(my_oauth)

# "-RT" filters out tweets containing "RT"; the window runs from yesterday to today
the_search_term <- "-RT vaccine OR vaccinated"
today <- as.character(Sys.Date())
yesterday <- as.character(Sys.Date() - 1)

max_tweets <- 300 
the_filename <- paste(path_var, "twitter_", today, ".csv", sep="")

the_search <- searchTwitter(searchString=the_search_term, 
                            n=max_tweets, 
                            since = yesterday, 
                            until = today)
  
tweets_df <- twListToDF(the_search)
write.csv(tweets_df, file=the_filename)
# Extract the tweet text and build a tm corpus from it
search_text <- sapply(the_search, function(x) x$getText())
search_corpus <- Corpus(VectorSource(search_text))

tdm <- TermDocumentMatrix(
  search_corpus,
  control=list(
    removePunctuation=TRUE,
    stopwords=c("amp", "vaccine", "vaccinated", 
                  stopwords("english")), 
    removeNumbers=TRUE, 
    tolower=TRUE)
) # Note we do not display the search terms in the cloud 

# Rank terms by how often they appear across all tweets
m <- as.matrix(tdm)
word_freqs <- sort(rowSums(m), decreasing=TRUE)
dm <- data.frame(word=names(word_freqs), freq=word_freqs)

# Two stacked panels: a short one for the title, a tall one for the cloud
layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
fh <- paste("Twitter search API, ",
            nrow(tweets_df), " tweets returned, ",
            today, sep="")
text(x=0.5, y=0.5, fh)
wordcloud(dm$word, dm$freq, min.freq=5, 
          random.order = FALSE, 
          random.color=FALSE, 
          max.words=Inf, colors = brewer.pal(8, "Dark2"))

The user can control whether the R code appears in the document or not.
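
As a minimal sketch of how that control works: in knitr, the `echo` chunk option decides whether a chunk's source code is printed alongside its output. The snippet below assumes the knitr package is installed and that the call sits in a setup chunk near the top of the .Rmd file; it is an illustration, not the configuration used for this post.

```r
# Assumes the knitr package is installed; usually placed in a setup chunk
library(knitr)

# Hide the source code of every chunk by default; results are still shown
opts_chunk$set(echo = FALSE)

# An individual chunk can override the default in its own header,
# e.g. a chunk header of {r, echo=TRUE} prints that chunk's code again.
```

Setting the default once and overriding it per chunk keeps the document source tidy when most chunks should behave the same way.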

This post was “knitted” into HTML with a single click, and the resulting HTML code was pasted into the blogspot composition tool. A very small amount of editing of this code was needed to make it work on the blogspot.com platform (though the R code is supposed to be depicted in a gray box, so there is at least one thing to sort out as a purist). Voilà.

Were it necessary to change the code for some reason, say to add a new API search term or update the word cloud for the same search in a week – or even to fix a bug – then the resulting HTML document can be regenerated and the update published, seamlessly. Thus, R has become a method for producing dynamic documents. Imagine the power of working collaboratively with others through an R Markdown document saved in a shared folder (e.g., in Dropbox).
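
The regeneration step can also be scripted rather than clicked. The sketch below assumes the rmarkdown package (and the pandoc binary it relies on) is installed; the file name `vaccine_wordcloud.Rmd` is hypothetical and stands in for wherever the post's source actually lives.

```r
# Hypothetical file name; replace with the path to the actual .Rmd source
rmd_file <- "vaccine_wordcloud.Rmd"

if (file.exists(rmd_file)) {
  # Re-runs every code chunk and rewrites the HTML output;
  # render() returns the path of the file it wrote
  html_file <- rmarkdown::render(rmd_file, output_format = "html_document")
  message("Wrote ", html_file)
} else {
  message("No such file: ", rmd_file)
}
```

Because the search dates in the script are computed from `Sys.Date()`, rerunning this on a later day would rebuild the word cloud from that day's tweets.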

Do I plan on preparing future posts in R? No -- it’s not a document preparation environment per se, and the other tools at my disposal are more than sufficient. However, if I were writing a blog that updated figures daily, weekly, or monthly based on changing data, or wanted to share fragments of code, say, I probably would use it.

Getting back to epidemiology, the word cloud is interesting. It shows that people are tweeting about the high levels of flu activity (see, e.g., http://www.cdc.gov/flu/) at present, as well as other topics including Ebola, cancer, and fraud. By reviewing the text of the tweets themselves (which is possible by inspecting the object “tweets_df” in R), a better sense of the diversity of conversation surrounding vaccines and vaccination can be had.
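
One way to do that review is to filter the `text` column of the data frame by a term of interest. The sketch below uses a toy stand-in for `tweets_df` with invented example strings, and `tweets_mentioning` is a hypothetical helper name, so treat it as an illustration rather than part of the analysis above.

```r
# Toy stand-in for tweets_df: the real object has one row per tweet and a
# `text` column among others. These example strings are invented.
tweets_df <- data.frame(
  text = c("Got my flu vaccine today",
           "Flu season is bad this year, get vaccinated",
           "New Ebola vaccine trial announced"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the text of tweets that mention a term,
# ignoring upper/lower case
tweets_mentioning <- function(df, term) {
  df$text[grepl(term, df$text, ignore.case = TRUE)]
}

flu_tweets <- tweets_mentioning(tweets_df, "flu")
print(flu_tweets)
```

Run against the real `tweets_df`, the same call would list every returned tweet that mentions flu, which makes it easy to read the conversation behind a prominent word in the cloud.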
