Parsing textual data from URLs using R

I have developed several approaches to data gathering while working on the Local Elections in America Project (LEAP), and I could not resist sharing a function I recently wrote in R to gather elections data in character-separated value text format. The function takes either a single URL or a list of URLs, and it can easily be customized to handle different separator characters and column headings.

The following code can be adapted to handle any data in character-separated value format, not just election data. While the function is intended to collect text data located at a web address, it could be altered to collect from HTML source by adding regular expressions that remove common HTML tags.

[code language="r"]
## Created by Nicholas Davis on 2013-08-07.
##
## set up the data frame
dat <- matrix(, 0, 5)
colnames(dat) <- c("precinct", "office", "cand.name", "vote.pct", "election.date")

## the urls targeted contain the first 4 columns, no date
## function that takes a data frame, a url or list of urls, and an election date
parse.url <- function(data, url, election.date, sep = ";"){
  d <- NULL
  for(i in seq_along(url)){
    tmp <- readLines(url[[i]])
    tmp <- gsub("['’]", "", tmp) ## remove apostrophes from names (read.table treats ' as a quote)
    tmp <- gsub("#", "", tmp, fixed = TRUE) ## remove comment character from lines
    d <- rbind(d, read.table(text = tmp,
                             sep = sep, ## ";" by default; can be changed to "," or "\t"
                             col.names = colnames(data)[1:4], ## the files hold only the first 4 columns
                             fill = FALSE,
                             strip.white = TRUE,
                             stringsAsFactors = FALSE))
  }
  d <- cbind(d, election.date = election.date) ## append election date manually
  rbind(data, d)
}

## call the function and set the data frame equal to the output
## ("url1", "url2", "urlx" stand in for any number of real addresses)
dat <- parse.url(dat, c("url1", "url2", "urlx"), "11-06-2012")
[/code]
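
To try the cleaning and parsing steps without a live address, the sketch below runs them on a small local file, since readLines accepts a file path as well as a url. The sample rows, the candidate name, and the temporary file are invented for illustration; the separator here is a tab rather than a semicolon, to show that only the sep argument needs to change.

```r
## invented sample rows: the name contains an apostrophe, one precinct contains "#"
rows <- c("Precinct 1\tMayor\tPat O'Neil\t51.0",
          "Precinct #2\tMayor\tPat O'Neil\t47.5")
path <- tempfile(fileext = ".txt")
writeLines(rows, path)

tmp <- readLines(path)                   ## a file path works here as well as a url
tmp <- gsub("'", "", tmp, fixed = TRUE)  ## apostrophes would be read as quotes
tmp <- gsub("#", "", tmp, fixed = TRUE)  ## "#" would start a comment
d <- read.table(text = tmp, sep = "\t",
                col.names = c("precinct", "office", "cand.name", "vote.pct"),
                strip.white = TRUE, stringsAsFactors = FALSE)
d <- cbind(d, election.date = "11-06-2012") ## append election date manually
print(d)
unlink(path)                             ## remove the temporary file
```

Without the two gsub calls, read.table would likely stumble on these rows, because by default it treats the apostrophe as a quote character and "#" as the start of a comment.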

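As a rough illustration of the html case mentioned earlier, the sketch below strips tags from a few lines with a regular expression before handing them to read.table. The helper name strip.tags and the sample page are made up for this example, and the pattern is a simple assumption that will not cope with every page: scripts, comments, or tags broken across lines would need more care.

```r
## hypothetical helper: remove anything that looks like an html tag
strip.tags <- function(lines){
  lines <- gsub("<[^>]+>", "", lines) ## drop tags such as <p> or <br/>
  lines <- trimws(lines)              ## trim leftover whitespace
  lines[nzchar(lines)]                ## discard lines that are now empty
}

## invented sample standing in for the output of readLines(url)
page <- c("<html><body>",
          "<p>Precinct 1;Mayor;Jane Doe;55.2</p>",
          "<p>Precinct 2;Mayor;Jane Doe;48.9</p>",
          "</body></html>")

clean <- strip.tags(page)
parsed <- read.table(text = clean, sep = ";",
                     col.names = c("precinct", "office", "cand.name", "vote.pct"),
                     strip.white = TRUE, stringsAsFactors = FALSE)
print(parsed)
```

Inside parse.url, a call like this would sit right after readLines, alongside the existing gsub lines.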