Using pdftools to Extract Vote Data

Last updated on Jun 2, 2019 4 min read R

Introduction

Suppose you are a concerned citizen who would like to know how voters in a state voted. Perhaps you are a voter in a state with rampant corruption or perhaps you are a voter in a state that does not have paper backups for voting machines. Perhaps you are just masochistic enough to be interested in pulling tables out of reasonably well formed pdfs. The following is a code example for the last one.

Fortunately (or unfortunately) for us, the state of New Jersey still provides their official election results in pdf files instead of a common data format like csv or even an Excel file. Copying each item by hand risks user error through fat fingers, sheer tedium, or great displeasure at a state in the 21st Century still outputting results to pdf. Therefore, it is much preferable to search for a code solution.

Each county in New Jersey reports Senate results separately. Here’s Atlantic County’s precinct results for the 2018 Senate Race. The table that we are going to extract looks like this.

Here’s one way to pull out the data, using some knowledge about who ran. I make use of tidyverse functions and pdftools. Each step of the function is commented.

library(tidyverse)
library(pdftools)

Once those are loaded, we can write our function.

# Remove columns that are all NAs
not_all_na <- function(x){
    !all(is.na(x))
}

senate_clean_table <- function(tbl){
    # Remove commas and split on new lines 
    tables <- NULL
    tbl <- str_replace_all(tbl,pattern = ",", "")%>%
        # We can see the table ends with the string NJDOE
        str_replace_all("Total[:print:]+", "NJDOE")%>%
        # Split on new lines 
        str_split(pattern = "\n", simplify = TRUE)
        
    # Some pdfs may be more than one page, so loop over all pages
    # Atlantic County will only run this loop once
    for(i in 1:dim(tbl)[1]){
    
        # Find the county name and save name to object 
        county_cell <- stringr::str_which(tbl[i,], "County")
        county_name <- tbl[i,county_cell]%>%
        stringr::str_squish()

        # Find the senate candidates cell 
        # Pull out candidates, turn them into a vector 
        # Keep only the last names 
        candidates_cell <- stringr::str_which(tbl[i,], "Robert")
        candidates <- tbl[i, candidates_cell] %>% 
            trimws()%>%
            str_squish()%>%
            str_replace_all(pattern = " R. ", " ")%>%
            str_replace_all(pattern = " Lynn ", " ")%>%
            str_split(pattern = "\\s")
        candidates <- unlist(candidates)
        
        # Because each candidate only has two names, after we remove the 
        # initials R recycles and keeps every other cell, which is last names 
        candidates <- candidates[c(FALSE, TRUE)]
        
        # Find the party cell. The actual table starts after this one
        party_cell <- stringr::str_which(tbl[i,], "Democratic")
       
        # Start the data frame 
        table_start <- party_cell + 1
        
        # Find the line with totals. The last line of interest is 
        # directly before it 
        table_end <- stringr::str_which(tbl[i,], "NJDOE")[1] -1
    
        # Subset to the table of interest 
        table <- tbl[i, table_start:table_end]
        
        # Create a delimiter everywhere there are 2 spaces 
        table <- str_replace_all(table, "\\s{2,}", "|")
        
        # Now we can pull out the data
        
        # Make a text connection and read that in as a dataframe
        text_con <- textConnection(table)
        
        df <- read.csv(text_con, sep = "|", header = F, stringsAsFactors = F)%>%
            dplyr::select_if(not_all_na)
        # Put in the appropriate column names and then add US senate as office 
        colnames(df)<- c("precinct", candidates)
        df <- df %>% 
            mutate(office = "US Senate")%>%
            mutate(county = county_name)%>%
            select(county, precinct, office, everything())
        tables[[i]] <- df
    }
    if(length(tables)==1){
        out <- tables[[1]]
    }else{
       out <- dplyr::bind_rows(tables) 
    }
    out
}

Supposing that we have stored all the 2018 Senate pdf urls in a vector called senate_urls, we can then make use of purrr::map and purrr::map_dfr() functions to run each through pdf_text and then our function, followed by tidyr::gather() to get our data into a long format.

# pdf_text() and map to the rescue

urls <- c("https://www.state.nj.us/state/elections/assets/pdf/election-results/2018/2018-general-election-results-us-senate-atlantic.pdf",
                 "https://www.state.nj.us/state/elections/assets/pdf/election-results/2018/2018-general-election-results-us-senate-bergen.pdf",
                 "https://www.state.nj.us/state/elections/assets/pdf/election-results/2018/2018-general-election-results-us-senate-burlington.pdf",
                 "https://www.state.nj.us/state/elections/assets/pdf/election-results/2018/2018-general-election-results-us-senate-camden.pdf",
                 "https://www.state.nj.us/state/elections/assets/pdf/election-results/2018/2018-general-election-results-us-senate-capemay.pdf",
                 "https://www.state.nj.us/state/elections/assets/pdf/election-results/2018/2018-general-election-results-us-senate-cumberland.pdf",
                 "https://www.state.nj.us/state/elections/assets/pdf/election-results/2018/2018-general-election-results-us-senate-essex.pdf",
                 "https://www.state.nj.us/state/elections/assets/pdf/election-results/2018/2018-general-election-results-us-senate-gloucester.pdf",
                 "https://www.state.nj.us/state/elections/assets/pdf/election-results/2018/2018-general-election-results-us-senate-hudson.pdf",
                 "https://www.state.nj.us/state/elections/assets/pdf/election-results/2018/2018-general-election-results-us-senate-hunterdon.pdf",
                 "https://www.state.nj.us/state/elections/assets/pdf/election-results/2018/2018-general-election-results-us-senate-mercer.pdf",
                 "https://www.state.nj.us/state/elections/assets/pdf/election-results/2018/2018-general-election-results-us-senate-middlesex.pdf",
                 "https://www.state.nj.us/state/elections/assets/pdf/election-results/2018/2018-general-election-results-us-senate-monmouth.pdf",
                 "https://www.state.nj.us/state/elections/assets/pdf/election-results/2018/2018-general-election-results-us-senate-morris.pdf",
                 "https://www.state.nj.us/state/elections/assets/pdf/election-results/2018/2018-general-election-results-us-senate-ocean.pdf",
                 "https://www.state.nj.us/state/elections/assets/pdf/election-results/2018/2018-general-election-results-us-senate-passaic.pdf",
                 "https://www.state.nj.us/state/elections/assets/pdf/election-results/2018/2018-general-election-results-us-senate-salem.pdf",
                 "https://www.state.nj.us/state/elections/assets/pdf/election-results/2018/2018-general-election-results-us-senate-somerset.pdf",
                 "https://www.state.nj.us/state/elections/assets/pdf/election-results/2018/2018-general-election-results-us-senate-sussex.pdf",
                 "https://www.state.nj.us/state/elections/assets/pdf/election-results/2018/2018-general-election-results-us-senate-union.pdf",
                 "https://www.state.nj.us/state/elections/assets/pdf/election-results/2018/2018-general-election-results-us-senate-warren.pdf")

senate_urls <- map(urls, pdf_text)

nj_senate <- senate_urls %>% 
    map_dfr(senate_clean_table)%>%
    gather(candidate, votes, -county, -precinct, -office)

We can now take at our now useful voting data.

head(nj_senate)

##            county         precinct    office candidate votes
## 1 Atlantic County     Absecon City US Senate  Menendez  1551
## 2 Atlantic County    Atlantic City US Senate  Menendez  6039
## 3 Atlantic County  Brigantine City US Senate  Menendez  1391
## 4 Atlantic County       Buena Boro US Senate  Menendez   583
## 5 Atlantic County Buena Vista Twp. US Senate  Menendez  1095
## 6 Atlantic County      Corbin City US Senate  Menendez    77

Using pdftools to Extract Vote Data

Introduction

Alex Stephenson

PhD Student