I am working at UC Berkeley’s D-Lab as a Data Science Fellow. One of my responsibilities is to provide consulting to the UC Berkeley community on statistical and data science projects. A common request of late is help with web scraping for research projects.

Recently, a request came in to scrape a page and download the PDF files it linked to. Fortunately, the page was simple from an HTML perspective, and I could apply a few common patterns to pull down the files. Over the break, I read about a few productivity systems, all of which suggested writing notes to your future self. In that spirit, here’s an example script showing how I currently solve this kind of problem.

I make use of purrr for clean, functional programming, rvest for scraping, and stringr because I suck at regular expressions.

library(purrr)
library(rvest)
library(stringr)


# Output folder name (any name works; this one is arbitrary)
output_dir <- "pdf_downloads"

# Sometimes it's helpful to make a specific directory on the fly.
# This code will only create the directory if it does not
# currently exist.
if (!dir.exists(output_dir)) {
  dir.create(output_dir)
}

# Website of interest in the case I was working on.
# All of the documents are stored at links containing "viewcontent.cgi",
# which provides a way to get just the pdfs instead of every link
# on the page.
url <- "https://scholarworks.utep.edu/border_region/"

# Now we just chain away to bulk download all the pdfs
# that exist on the page.

url %>%
  read_html() %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  str_subset("viewcontent.cgi") %>%
  walk(function(link){
    # Name each file after its unique article id
    article <- str_extract(link, pattern = "article=[:digit:]+")
    out <- paste0(output_dir, "/", article, ".pdf")
    download.file(link, out, mode = "wb")
  })
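To see what the str_extract step is doing on its own, here is a quick sketch with a made-up link of the same shape as the real ones (the article number and query string are hypothetical):

```r
library(stringr)

# Hypothetical download link matching the pattern on the page
link <- "https://scholarworks.utep.edu/cgi/viewcontent.cgi?article=1024&context=border_region"

# Pull out just the "article=<number>" piece to use as a file name
str_extract(link, pattern = "article=[:digit:]+")
#> [1] "article=1024"
```

Using the article id as the file name keeps each download unique, since every document on the page carries a distinct `article=` query parameter.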