Previous posts have discussed how computers, IT, and open-source
software have made incredible strides in recent years and how those advances help infectious disease epidemiology. On
that theme, I recently found something very cool, which this post itself
illustrates: the entire post was written in the RStudio IDE using the markdown and knitr packages in R.
Markdown is a text-to-HTML conversion tool that turns plain-text
prose into structurally valid XHTML or HTML. Details can be found at http://rmarkdown.rstudio.com and a short tutorial is available at https://support.rstudio.com/hc/en-us/articles/200552086-Using-R-Markdown. knitr
is an engine for dynamic report generation with R. Together, markdown and
knitr can produce sophisticated documents that include bibliographic referencing
and mathematical typesetting. Moreover, one can generate documents with embedded chunks of R code in order to display the code itself, its output, or both.
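To make the idea of embedded chunks concrete, here is a minimal sketch. It assumes the knitr package is installed; the tiny document and its contents are hypothetical and are not taken from this post. The document is held in memory as a character vector and passed to `knit()`, which runs the chunk and weaves the code and its result into Markdown.

```r
library(knitr)

# A tiny R Markdown document held in memory (hypothetical content).
rmd <- c(
  "Some prose about the data.",
  "",
  "```{r}",
  "x <- c(2, 4, 6)",
  "mean(x)",
  "```"
)

# knit() evaluates the chunk and returns Markdown containing the prose,
# the chunk's source code, and the printed result.
md <- knit(text = rmd, quiet = TRUE)
cat(md)
```

In everyday use the same content would live in an .Rmd file and be knitted from RStudio; the in-memory version is used here only to keep the sketch self-contained.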
As a simple example, below is a word cloud made from searching the
Twitter API for tweets containing the terms “vaccine” or “vaccinated” in
a 24-hour period. The code is shown first (it could be written better,
clearly), followed by the output:
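# Start from a clean workspace and load the required packages
rm(list=ls())
library(twitteR)
library(tm)
library(wordcloud)
library(RColorBrewer)
library(RJSONIO)
# Load saved OAuth credentials and register them with twitteR
# (later twitteR releases replace registerTwitterOAuth() with setup_twitter_oauth())
path_var <- "~/Dropbox/Docs/Blogs/"
auth_var <- "my_oauth.Rdata"
load(paste(path_var, auth_var, sep=""))
registerTwitterOAuth(my_oauth)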
# Search for tweets mentioning either term, excluding those containing "RT",
# posted between yesterday and today
the_search_term <- "-RT vaccine OR vaccinated"
today <- as.character(Sys.Date())
yesterday <- as.character(as.Date(today) - 1)
max_tweets <- 300
the_filename <- paste(path_var, "twitter_", today, ".csv", sep="")
the_search <- searchTwitter(searchString=the_search_term,
                            n=max_tweets,
                            since=yesterday,
                            until=today)
# Convert the results to a data frame and save a copy to disk
tweets_df <- twListToDF(the_search)
write.csv(tweets_df, file=the_filename)
# Build a corpus from the tweet text and count how often each term occurs
search_text <- sapply(the_search, function(x) x$getText())
search_corpus <- Corpus(VectorSource(search_text))
tdm <- TermDocumentMatrix(
  search_corpus,
  control=list(
    removePunctuation=TRUE,
    # Drop the search terms themselves, "amp", and common English words
    stopwords=c("amp", "vaccine", "vaccinated",
                stopwords("english")),
    removeNumbers=TRUE,
    tolower=TRUE)
)
m <- as.matrix(tdm)
word_freqs <- sort(rowSums(m), decreasing=TRUE)
dm <- data.frame(word=names(word_freqs), freq=word_freqs)
# Two-panel layout: a short title panel above the word cloud
layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
fh <- paste("Twitter search API, ",
            nrow(tweets_df), " tweets returned, ",
            today, sep="")
text(x=0.5, y=0.5, fh)
# Plot words that occur at least five times, in decreasing order of frequency
wordcloud(dm$word, dm$freq, min.freq=5,
          random.order=FALSE,
          random.color=FALSE,
          max.words=Inf, colors=brewer.pal(8, "Dark2"))
The user can control whether the R code appears in the document or not.
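As a rough illustration of that control, the sketch below knits a hypothetical one-chunk document whose header sets the knitr chunk option `echo=FALSE`. It assumes knitr is installed, and the chunk contents are invented for the example. The result is still computed and shown, but the source code is left out of the output.

```r
library(knitr)

# Hypothetical chunk: echo=FALSE hides the code but keeps its output.
rmd_hidden <- c(
  "```{r, echo=FALSE}",
  "x <- c(2, 4, 6)",
  "mean(x)",
  "```"
)

md_hidden <- knit(text = rmd_hidden, quiet = TRUE)
cat(md_hidden)
```

To hide the code in every chunk of a document rather than one at a time, `knitr::opts_chunk$set(echo = FALSE)` can be called in an early chunk.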
This blog post was “knitted” into HTML with a single click, and the
resulting HTML code was pasted into the blogspot composition tool. A very small amount of editing of this code was needed to make it work on the blogspot.com platform (though the R code is supposed to be depicted in a gray box, so there is at least one thing left to sort out for a purist). Voilà.
Were it necessary to change the code for some reason, say to add a
new API search term or update the word cloud for the same search in a
week – or even to fix a bug – then the resulting HTML document can be
regenerated and the update published, seamlessly. Thus, R has become a
method for producing dynamic documents. Imagine the power of working
collaboratively with others through an R Markdown document saved in a
shared folder (e.g., in Dropbox).
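Regeneration does not have to go through the RStudio button. The sketch below shows one way to script it, assuming the rmarkdown package and pandoc are available. The helper `rebuild_post()` and the file name are hypothetical, and this is not necessarily how this post was built.

```r
library(rmarkdown)

# Rebuild the HTML for an R Markdown source file.
# Returns the output path, or NULL if the source file or pandoc is missing.
rebuild_post <- function(rmd_file) {
  if (!file.exists(rmd_file) || !pandoc_available()) {
    return(NULL)
  }
  # render() re-runs every chunk, so any search and plot are refreshed.
  render(rmd_file, output_format = "html_document", quiet = TRUE)
}

# Hypothetical file name; substitute the actual source file.
html_file <- rebuild_post("vaccine_wordcloud.Rmd")
```

Because every chunk is re-run on each call, a scheduled job calling such a function would republish the figure with fresh data each time.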
Do I plan on preparing future blogs in R? No -- it’s not a document preparation environment per se,
and the other tools at my disposal are more than sufficient. However,
if I were writing a blog that updated figures daily, weekly, or monthly
based on changing data, or wanted to share fragments of code, say, I
probably would use it.
Getting back to epidemiology, the word cloud is interesting. It shows that people are tweeting about the high levels of flu activity (see, e.g., http://www.cdc.gov/flu/)
at present, as well as other topics including Ebola, cancer, and fraud.
Reviewing the text of the tweets
themselves (which is possible by inspecting the object “tweets_df” in R) gives a
better sense of the diversity of conversation surrounding vaccines and
vaccination.
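A minimal sketch of that kind of inspection follows. So that it runs on its own, it uses a small stand-in data frame with invented rows; the real tweets_df comes from twListToDF(), which is assumed here to store the tweet text in a column named `text`.

```r
# Stand-in for the real tweets_df (hypothetical rows, for illustration only).
tweets_df <- data.frame(
  text = c("Flu shot clinic open today",
           "New Ebola vaccine trial begins",
           "Got vaccinated against flu"),
  stringsAsFactors = FALSE
)

# Read the first few tweets in full.
head(tweets_df$text, 10)

# Pull out the tweets that mention a topic of interest, ignoring case.
flu_tweets <- tweets_df$text[grepl("flu", tweets_df$text, ignore.case = TRUE)]
length(flu_tweets)
```

Swapping "flu" for another term gives a quick way to read the tweets behind any word that stands out in the cloud.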