2016 Presidential Election in Maine

by Em Pierce

This report looks at Mainers in the 2016 election cycle. They made over 43,000 contributions to presidential candidates.

#setwd('C:/Users/emily/Desktop/Udacity/Maine_2016')

library(plyr)
library(ggplot2)
library(knitr)
library(gridExtra)
library(alr3)
library(GGally)
library(reshape2)
library(car)
library(reshape)
library(RColorBrewer)
library(scales)
library(lattice)
library(data.table)
library(ggthemes)
library(maps)
library(maptools)
library(mapdata)
library(tibble)
library(viridis)
library(timeline)
library(stringr)
library(dplyr)
library(ggmap)

Load the Data

maine <- read.csv('donations_data.csv', header = T, sep = ',', row.names = NULL)
colnames(maine) <- c(colnames(maine)[-1], 'x')
maine$x <- NULL 

maine <- select(maine, cand_nm, contbr_nm, contbr_zip, contb_receipt_amt,
                contb_receipt_dt, contbr_employer, contbr_occupation)

colnames(maine) <- c('candidate', 'donor', 'zip', 'contribution', 'date',
                     'employer', 'occupation')

maine$zip <- strtrim(maine$zip, 4)

party <- read.csv('candidate-list.csv', header = T, sep = ',', row.names = NULL)
party <- select(party, cand_nm, cand_pty_aff)
colnames(party) <- c('candidate', 'party')
maine <- merge(maine, party)

maine$donor <- sub(',.*', '', maine$donor)
maine$candidate <- sub(',.*', '', maine$candidate)
maine$party <- sub(' Party.*', '', maine$party)
maine$date <- as.Date(maine$date, format = '%d-%b-%y')

zipcounty <- read.csv('zip county.csv', header = T, sep = ',', row.names = NULL)
zipcounty <- select(zipcounty, 'Zip', 'County')
colnames(zipcounty) <- c('zip', 'county')
maine <- merge(maine, zipcounty)

countydata <- read.csv('countydata.csv', header = T, sep = ',', 
                       row.names = NULL)
maine <- merge(maine, countydata)

maine <- maine[c('candidate', 'party', 'donor', 'contribution', 'date', 
                 'employer', 'occupation', 'zip', 'county', 'subregion', 
                 'fips', 'district', 'area', 'est', 'medhhinc', 'percapinc',
                 'population', 'ballots', 'trump', 'clinton')]


maine <- subset(maine, contribution > 0)

zip_codes_states <- read.csv('zip_codes_states.csv', header = T, 
                             sep = ',', row.names = NULL)
zip_codes_states <- subset(zip_codes_states, state == 'ME')
zip_codes_states <- select(zip_codes_states, zip_code, latitude, longitude)
colnames(zip_codes_states) <- c('zip', 'latitude', 'longitude')

maine <- merge(maine, zip_codes_states)

str(maine)
## 'data.frame':    43373 obs. of  22 variables:
##  $ zip         : chr  "3901" "3901" "3901" "3901" ...
##  $ candidate   : chr  "Sanders" "Clinton" "Sanders" "Clinton" ...
##  $ party       : chr  "Democratic" "Democratic" "Democratic" "Democratic" ...
##  $ donor       : chr  "GARLAND" "DAY" "COPP" "DAY" ...
##  $ contribution: num  500 25 50 25 50 25 25 25 25 25 ...
##  $ date        : Date, format: "2015-09-30" "2016-09-11" ...
##  $ employer    : Factor w/ 2307 levels "","'TIL DEATH TATTOO",..: 1431 1035 2009 1357 2009 1035 1035 1035 1035 1035 ...
##  $ occupation  : Factor w/ 1805 levels "","-","100% DISABLED VETERAN",..: 1088 827 1212 1088 1212 827 827 827 827 827 ...
##  $ county      : Factor w/ 16 levels "Androscoggin",..: 16 16 16 16 16 16 16 16 16 16 ...
##  $ subregion   : Factor w/ 16 levels "androscoggin",..: 16 16 16 16 16 16 16 16 16 16 ...
##  $ fips        : int  31 31 31 31 31 31 31 31 31 31 ...
##  $ district    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ area        : int  1271 1271 1271 1271 1271 1271 1271 1271 1271 1271 ...
##  $ est         : int  1652 1652 1652 1652 1652 1652 1652 1652 1652 1652 ...
##  $ medhhinc    : int  55008 55008 55008 55008 55008 55008 55008 55008 55008 55008 ...
##  $ percapinc   : int  27137 27137 27137 27137 27137 27137 27137 27137 27137 27137 ...
##  $ population  : int  197131 197131 197131 197131 197131 197131 197131 197131 197131 197131 ...
##  $ ballots     : int  118045 118045 118045 118045 118045 118045 118045 118045 118045 118045 ...
##  $ trump       : int  50403 50403 50403 50403 50403 50403 50403 50403 50403 50403 ...
##  $ clinton     : int  55844 55844 55844 55844 55844 55844 55844 55844 55844 55844 ...
##  $ latitude    : num  43.3 43.3 43.3 43.3 43.3 ...
##  $ longitude   : num  -70.7 -70.7 -70.7 -70.7 -70.7 ...

This data set has 43,373 observations and 20 variables. It started out different. I imported the CSV from the Federal Election Committee’s web site. I saw that it had a lot of information that was crucial, and some that was not. I kept info about the contribution: the candidate it was for, contribution amount, and the date of contribution. I also kept info about the donor: the last name, zip code, employer and occupation.

I then imported a database with the zip codes and counties of all the areas in Maine. I merged those together so that I could match each contribution with the corresponding county. Then, I added the party information for each candidate. I had to do some string manipulations to clean up some of the variables. Finally, I created a CSV with statistics about each county. This included the number of people from each county that voted for Clinton, Trump, and who voted at all. It includes the population and area of the district along with other data. I made sure all the contributions were greater than $0. Finally, I brought in another database with the latitude and longitude of each zip code. This should add some extra depth for my maps.

summary(maine)
##      zip             candidate            party          
##  Length:43373       Length:43373       Length:43373      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##     donor            contribution          date           
##  Length:43373       Min.   :   0.16   Min.   :2014-09-18  
##  Class :character   1st Qu.:  15.00   1st Qu.:2016-03-06  
##  Mode  :character   Median :  27.00   Median :2016-04-29  
##                     Mean   :  77.03   Mean   :2016-05-10  
##                     3rd Qu.:  50.00   3rd Qu.:2016-08-04  
##                     Max.   :5400.00   Max.   :2016-12-20  
##                                                           
##           employer                     occupation           county     
##  N/A          : 5800   RETIRED              : 8894   Cumberland:15461  
##  NONE         : 5309   NOT EMPLOYED         : 7331   York      : 6501  
##  RETIRED      : 3882   TEACHER              :  986   Hancock   : 3398  
##  NOT EMPLOYED : 3144   INFORMATION REQUESTED:  932   Kennebec  : 2964  
##  SELF-EMPLOYED: 2974   ATTORNEY             :  759   Penobscot : 2496  
##  (Other)      :22255   (Other)              :24467   Knox      : 2314  
##  NA's         :    9   NA's                 :    4   (Other)   :10239  
##       subregion          fips          district          area     
##  cumberland:15461   Min.   : 1.00   Min.   :1.000   Min.   : 370  
##  york      : 6501   1st Qu.: 5.00   1st Qu.:1.000   1st Qu.:1142  
##  hancock   : 3398   Median :11.00   Median :1.000   Median :1217  
##  kennebec  : 2964   Mean   :13.73   Mean   :1.288   Mean   :1515  
##  penobscot : 2496   3rd Qu.:23.00   3rd Qu.:2.000   3rd Qu.:1271  
##  knox      : 2314   Max.   :31.00   Max.   :2.000   Max.   :6829  
##  (Other)   :10239                                                 
##       est          medhhinc       percapinc       population    
##  Min.   :1652   Min.   :34016   Min.   :19401   Min.   : 17535  
##  1st Qu.:1761   1st Qu.:45264   1st Qu.:24656   1st Qu.: 54418  
##  Median :1761   Median :55008   Median :27137   Median :197131  
##  Mean   :1771   Mean   :50179   Mean   :27165   Mean   :165273  
##  3rd Qu.:1809   3rd Qu.:55658   3rd Qu.:31041   3rd Qu.:281674  
##  Max.   :1860   Max.   :55658   Max.   :31041   Max.   :281674  
##                                                                 
##     ballots           trump          clinton          latitude    
##  Min.   :  9518   Min.   : 5406   Min.   :  3098   Min.   :43.09  
##  1st Qu.: 33055   1st Qu.:13705   1st Qu.: 13386   1st Qu.:43.69  
##  Median :118045   Median :50403   Median : 55844   Median :43.94  
##  Mean   :100752   Mean   :37421   Mean   : 53674   Mean   :44.05  
##  3rd Qu.:176935   3rd Qu.:57709   3rd Qu.:102981   3rd Qu.:44.36  
##  Max.   :176935   Max.   :57709   Max.   :102981   Max.   :47.30  
##                                                                   
##    longitude     
##  Min.   :-71.01  
##  1st Qu.:-70.40  
##  Median :-70.17  
##  Mean   :-69.84  
##  3rd Qu.:-69.22  
##  Max.   :-67.04  
## 

This summary gives some basic statistics for each variable. We see that the first contribution was two full years before the election. The last contribution was a month after the election. The mean district is 1.29. Maine has two congressional districts with even populations in both. Such a low mean says that the majority of contributions were from the first district.

The established date of the counties also has surprises. Maine was established as a state in 1850, thanks to the Missouri Maine compromise. I’m curious to see how many counties are older than the state. I can see that at least 75% are older.

table(maine$candidate, maine$party)
##           
##            Democratic Green Independent Libertarian Republican
##   Bush              0     0           0           0        121
##   Carson            0     0           0           0        924
##   Christie          0     0           0           0         15
##   Clinton       16776     0           0           0          0
##   Cruz              0     0           0           0       1818
##   Fiorina           0     0           0           0        181
##   Graham            0     0           0           0          2
##   Huckabee          0     0           0           0         11
##   Johnson           0     0           0          41          0
##   Kasich            0     0           0           0         65
##   Lessig            2     0           0           0          0
##   McMullin          0     0           6           0          0
##   O'Malley         10     0           0           0          0
##   Paul              0     0           0           0        176
##   Perry             0     0           0           0          1
##   Rubio             0     0           0           0        168
##   Sanders       20745     0           0           0          0
##   Stein             0    96           0           0          0
##   Trump             0     0           0           0       2197
##   Walker            0     0           0           0         13
##   Webb              5     0           0           0          0

The largest number of contributions went to Sanders and Clinton. Most recipients of contributions are Republicans. The Republican field was crowded so this makes sense.

Univariate Plots Section

timeline <- ggplot(data = maine, aes(x = as.Date(date))) +
 geom_vline(xintercept = as.Date('2016-03-05'), linetype = 1) +
 geom_vline(xintercept = as.Date('2016-7-23'), linetype = 2) +
 geom_vline(xintercept = as.Date('2016-11-08'), linetype = 3) +
 theme_minimal() + labs(x = 'Date', y = 'Number of Contributions')


timeline + geom_histogram(binwidth = 25, fill = 'red3', color = 'white') + 
 ggtitle('Contributions Count Timeline')

The first dashed line is the Maine primary caucus, held March 05, 2016. The second is National Convention, July 21. The final line is the election on November 8th. There is one peak around the caucus and a smaller peak during the convention. The contributions peak again for the election.

republican_contribution_timeline <- timeline +
 geom_histogram(aes(x = as.Date(date)), data = 
                subset(maine, party == 'Republican'), binwidth = 25, 
                fill = 'red3', color = 'white') +  
 ggtitle('Republican Contributions Timeline') 

republican_contribution_timeline

The Republican timeline is bimodal with one peak around the caucus and a second during the convention. There is quite a dip between the two. The contributions drop off again as the election nears.

democrat_contributions_timeline <- timeline +
  geom_histogram(aes(x = date), data = subset(maine, party != 'Republican'), 
                 binwidth = 20, fill = 'darkblue', color = 'white') + 
  ggtitle('Non-Republican Contributions Timeline') 

democrat_contributions_timeline

There were fourteen Republican candidates and eight non-Republicans. That is why I grouped them together for this graph. This timeline is also bimodal. It has a peak around the caucus but a plummet around the convention. There is another peak for the election in November.

ggplot(aes(x = candidate), data = subset(maine, party == 'Republican')) + 
 geom_bar(fill = 'red3') + 
 coord_flip() + 
 guides(fill = F) + 
 ggtitle('Republican Contributions by Candidate') + 
 labs(x = 'Candidate', y = 'Number of Contributions') + 
 theme_minimal() 

Trump, Cruz and Carson are the outliers here. Trump got the most contributions total of the Republicans. This makes sense since he won the national Republican primary. Cruz won the Maine primary though, which explains why he was in second place. Carson got the third most contributions with roughly half of Cruz’s. None of the other candidates got a considerable number. Perry only got one.

ggplot(aes(x = candidate), data = subset(maine, party != 'Republican')) + 
 geom_bar(fill = 'darkblue') + 
 coord_flip() + 
 guides(fill = F) + 
 ggtitle('Non-Republican Contributions by Candidate') + 
 labs(x = 'Candidate', y = 'Number of Contributions') +
 theme_minimal()

Clearly Sanders and Clinton are the outliers here. Sanders won the Maine primary and Clinton the national primary, so it makes sense for them to be ahead. Webb and Lessig appear to have no contributions because they have two and five contributions, so close to none. The other candidates are dwarfed by Sanders and Clinton.

ggplot(aes(x = party), data = maine) + 
 geom_bar(aes(fill = party)) + 
 coord_flip() + 
 guides(fill = F) + 
 ggtitle('Contributions by Party') + 
 theme_minimal() +  
 scale_fill_manual(values = c('darkblue', 'green4', 'deepskyblue', 
                              'yellow3', 'red3')) +
 labs(x = 'Party', y = 'Number of Contributions (log10)') +
 scale_y_log10(breaks = c(5, 10, 25, 50, 100, 200, 500, 
                          1000, 5000, 10000, 30000))

Democrats got more than Republicans even though they had fewer candidates. There was only one independent candidate. He got a grand total of six contributions. I made the scale for the count log 10 so we could see al five parties.

ggplot(aes(x = contribution), data = maine) + 
 geom_histogram(fill = 'darkblue', color = 'white', size = 1, binwidth = 100) + 
 ggtitle('Contribution Size') + 
 theme_minimal() + 
 labs(x = 'Contribution Amount (USD)', y = 'Number of Contributions (log10)') +
 scale_x_continuous(breaks = c(100, 500, 1000, 2000, 2700, 4000, 5400)) +
 scale_y_log10(breaks = c(1, 10, 100, 1000, 40000, 10000))

summary(maine$contribution)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.16   15.00   27.00   77.03   50.00 5400.00

Most contributions were less than $100 but ranged all the way up to the $5400. That is the most far flung outlier and also the maximum contribution amount that is legal for couples. You can see a spike at $2700, which is the maximum contribution amount for an individual.

I transformed the y-axis by log 10 because there was such a huge difference between donations under $100 and all the rest of the donations. You can see in the summary that 75% of all donations were $50 or less.

subset(maine, contribution <= 100) %>%
  summarize(count = n())
##   count
## 1 39401

Here is the count of all donations under $100, a total of 39,401. That is 91% of all donations. Every donation above $86.50 is a statistical outlier, so most of the chart is made of the outliers.

ggplot(aes(x = county), data = maine) + 
 geom_bar(fill = 'red3') + 
 ggtitle('Contribution by County') + 
 theme_minimal() + 
 coord_flip() +
 labs(x = 'County', y = 'Number of Contributions') +
 scale_y_continuous(breaks = c(0, 1000, 2500, 5000, 7500, 10000, 15000))

Cumberland has the highest number of contributions. That is not surprising considering it has the highest population. York has the second highest contribution count. I’m excited to compare these findings to the corresponding income and population data.

county <- select(maine, county, population, trump, clinton, ballots, percapinc,
                 medhhinc, est, area, district, subregion)
county <- unique(county)

str(county)
## 'data.frame':    16 obs. of  11 variables:
##  $ county    : Factor w/ 16 levels "Androscoggin",..: 16 3 12 9 1 4 6 8 14 10 ...
##  $ population: int  197131 281674 35293 57833 107702 30768 122151 34457 38786 153923 ...
##  $ trump     : int  50403 57709 9304 16210 28227 7918 31675 9727 10378 41622 ...
##  $ clinton   : int  55844 102981 10664 12172 23009 7016 29302 10241 10440 32838 ...
##  $ ballots   : int  118045 176935 22356 32405 57198 17085 68411 22237 23471 83015 ...
##  $ percapinc : int  27137 31041 26983 21254 22752 20838 24656 28003 22213 22977 ...
##  $ medhhinc  : int  55008 55658 55486 39748 44470 39831 45973 47678 41312 42658 ...
##  $ est       : int  1652 1761 1854 1805 1854 1838 1799 1760 1827 1816 ...
##  $ area      : int  1271 1217 370 2175 497 1744 951 700 853 3556 ...
##  $ district  : int  1 1 1 2 2 2 1 1 2 2 ...
##  $ subregion : Factor w/ 16 levels "androscoggin",..: 16 3 12 9 1 4 6 8 14 10 ...

I’ve created a new database here. It contains the sixteen counties of Maine and their corresponding statistics. These are in the original Maine database. However, this database has only 16 observations, one for each county. The original database had this same information repeated for each contribution. So this will be simpler to plot when I want county level data.

don_ct <- maine %>%
 count(donor)

ggplot(aes(y = n, x = donor), data = subset(don_ct, n > 99)) + 
 geom_col(fill = 'red3') + 
 coord_flip() + 
 theme_minimal() +
 labs(x = 'Surnames', y = 'Number of Contributions') +
 ggtitle('Commom Surnames of Donors')

For this I made a table of the surnames and the number of contributions each had. Here are the surnames with at least 100 contributions. I don’t see anything striking here.

occ_ct <- subset(maine) %>%
 count(occupation)

ggplot(aes(y = n, x = occupation), 
       data = subset(occ_ct, occupation != 'INFORMATION REQUESTED' & n > 249 &
                     occupation != 'INFORMATION REQUESTED PER BEST EFFORTS')) + 
 geom_col(fill = 'red3') + 
 theme_minimal() + 
 coord_flip() +
 labs(x = 'Occupations', y = 'Number of Contributions') +
 ggtitle('Occupations of Donors') 

count(subset(maine, occupation == 'RETIRED' | occupation == 'NOT EMPLOYED'))
## # A tibble: 1 x 1
##       n
##   <int>
## 1 16225

I made a count of each occupation and graphed the common ones. The unemployed and retired are significant since they made 16,255 contributions.

Do you see that RN is right next to ‘Registered Nurse?’ They also had lawyers and attorney’s separated. I need to clean that up.

maine$occupation[maine$occupation == 'RN'] <- 'REGISTERED NURSE' 
maine$occupation[maine$occupation == 'LAWYER'] <- 'ATTORNEY' 

occ_ct <- subset(maine) %>%
 count(occupation)


ggplot(aes(y = n, x = occupation), 
       data = subset(occ_ct, occupation != 'INFORMATION REQUESTED' & 
                     occupation != 'INFORMATION REQUESTED PER BEST EFFORTS' &
                     occupation != 'RETIRED' & occupation != 'NOT EMPLOYED' & 
                     n > 249)) + 
 geom_col(fill = 'red3') + 
 theme_minimal() + 
 coord_flip() +
 labs(x = 'Occupations', y = 'Number of Contributions') +
 ggtitle('Top Occupations of Donors')

The retired and unemployed are the two largest groups, by far. There were also a lot in the ‘Information Requested’ subcategory. After tidying the lawyers and nurses, I re-plotted. This time was without the retired and unemployed to zoom in on the other donors. Teachers, physicians and attorneys are the top donors by count. So many contributions from students and homemakers are kind of surprising. They are not always a wealthy demographic. It is not so surprising when you think of how large each demographic is. The same could is true for the unemployed.

emp_ct <- subset(maine) %>%
 count(employer)


emp_ct$employer <- sub('SELF EMPLOYED', 'SELF', emp_ct$employer)
emp_ct$employer <- sub('SELF - EMPLOYED', 'SELF', emp_ct$employer)
emp_ct$employer <- sub('N/A', 'NONE', emp_ct$employer)

ggplot(aes(y = n, x = employer), 
       data = subset(emp_ct, employer != 'INFORMATION REQUESTED' &
                     employer != 'INFORMATION REQUESTED PER BEST EFFORTS' &
                     employer != 'NONE' & employer != 'NOT EMPLOYED' & 
                     employer != 'SELF' & employer != 'RETIRED' & n > 59)) + 
 geom_col(fill = 'red3') + 
 theme_minimal() + 
 coord_flip() +
 labs(x = 'Employer', y = 'Number of Contributions (log10)') +
 ggtitle('Common Employers of Donors') +
  scale_y_log10(breaks = c(10, 100, 1000))

This is every employer with at least 60 donors. I did lump ‘Self’, ‘Self Employed’ and ‘Self-Employed’ into one employer for obvious reasons. Looking at this, I see a lot of schools and a lot of health care centers. That is in line with the top contribution occupations, teachers and physicians. I’m curious how the occupations look after dividing by party and candidate. That will involve a more complex graph.

Univariate Analysis

What is the structure of your dataset?

The data frame has 43,373 observations and 20 variables. The character variables are: candidate, party, donor, and zip code. The donor is the last name. The zip code is just the first four digits. The data set has one date format, which is the date column. The factors are: employer, county, and sub region. The integers are: fips, district, area, year established, median household income, per capita income, population, ballots, Trump’s total votes and Clinton’s total votes.

What is / are the main feature(s) of interest in your dataset?

The contributions to the candidates is the main feature of this data set. The contributions have several variables describing them. They have their date and how much money each contribution was.

What other features in the dataset do you think will help support your

investigation into your feature(s) of interest?

I think all the county level data will add a lot of depth to the investigation. Knowing more about each county gives me more to explore and compare to the original data.

Did you create any new variables from existing variables in the dataset?

I have not yet created new variables from existing data sets. I have some ideas for some in the more complex plots. One idea is population density, which is the population of each county, divided by the area. Voter turnout is the total ballots cast divided by the population.

I did create mini databases showing the count by employer, occupation and surname.

Of the features you investigated, were there any unusual distributions?

Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

First I trimmed the zip codes to four digits. All zip codes in Maine start with a zero. That was omitted since the data was processed as an integer and not a string. Some of the zip codes were longer than five digits. I needed the first four characters to match them to their corresponding counties.

I merged that data set with another that added the party affiliation of each candidate. Then, I cut the candidate’s name down so it showed only their last name. I figured this would make the graphs easier to read. I also reformatted the date to make timelines simpler to plot.

Next I matched the zip codes to their corresponding counties. That allowed me to merge in a larger file that had fascinating data on each county. It contained basic info about the county, such as area and population. It also contained the results of the 2016 election in that county.

Finally I took out all the refunds. I thought a lot about how to handle the refunds. In the end excluding them from the data was the simplest thing to do. After that, I made the sub database, ‘county,’ which only has the information about each county.

Bivariate Plots Section

cont_x_party <- ggplot(aes(party, contribution), data = maine) + 
 geom_boxplot(aes(color = party)) + 
 coord_flip() + 
 theme_minimal() + 
 ggtitle('Contribution Amount by Party') + 
 scale_color_manual(values = c('darkblue', 'green4', 'deepskyblue', 'yellow3',
                               'red3'), guide = F) +
 labs(x = 'Party', y = 'Contribution Amount (USD)')

cont_x_party

The contribution amount box plots are divided by party. The libertarian, independent and green parties had so few contributions. It is worth noting that they all have larger interquartile ranges than the two main parties, as well as fewer outliers. This is almost certainly due their smaller sample size.

ggplot(aes(candidate, contribution), 
       data = subset(maine, party == 'Republican')) + 
 geom_boxplot(color = 'red3') + 
 coord_flip() + 
 theme_minimal() + 
 ggtitle('Republican Contribution Amounts by Candidate') +
 labs(x = 'Candidate', y = 'Contribution Amount (USD)')

Most of these look similar, small boxes with high outliers. Walker’s median and third quartile look to be the same. Huckabee has the reverse, where his median and first quartile have merged into one thick line. Graham and Perry, with two and one donors respectively, have just a thick line and no box.

The weirdest one is definitely Bush, the outlier of the bunch. His contributions must have been high compared to the others. So his interquartile range is almost the same width as the plot. We’ll have to look into this again.

ggplot(aes(candidate, contribution), 
       data = subset(maine, party != 'Republican')) + 
 geom_boxplot(color = 'darkblue') + 
 coord_flip() + 
 theme_minimal() + 
 ggtitle('Non Republican Contribution Amounts by Candidate') +
 labs(x = 'Candidate', y = 'Contribution Amount (USD)')

Sanders and Clinton both appear to have skewed data. They have so many more data points than the other candidates. It is interesting to note that Clinton has a larger box and higher outliers. This indicates that she had higher contributions on average than Sanders. Let’s zoom in on that.

ggplot(aes(candidate, contribution), 
       data = subset(maine, candidate == 'Sanders' | candidate == 'Clinton')) + 
 geom_boxplot(color = 'darkblue') + 
 theme_minimal() + 
 ggtitle('Clinton and Sanders Contribution Amounts') + 
 coord_cartesian(y = c(0, 200)) +
 labs(x = 'Candidate', y = 'Contribution Amount (USD)')

Wow, so Clinton and Sanders appear to have the same median around $25. So that means that for both, half of all their contributions were less than or equal to $25. Three quarters of Sanders’ contributions are under $50. Three quarters of Clinton’s are under $75. As we saw from the previous chart Clinton has more and further flung outliers than Sanders.

ggplot(aes(x = date), 
       data = subset(maine, party == 'Republican' & candidate != 'Christie' &    
                     candidate != 'Graham' & candidate != 'Huckabee' & 
                     candidate != 'Christie' & candidate != 'Perry' & 
                     candidate != 'Walker')) + 
 geom_histogram(aes(fill = candidate), binwidth = 25) + 
 geom_vline(xintercept = as.numeric(as.Date('2016-03-05')), linetype = 6) + 
 geom_vline(xintercept = as.numeric(as.Date('2016-11-08')), linetype = 2) + 
 geom_vline(xintercept = as.numeric(as.Date('2016-7-21')), linetype = 4) + 
 ggtitle('Timeline of Republican Contributions by Candidate') + 
 labs(y = 'Number of Contributions', x = 'Date') + 
 theme_minimal() + 
 guides(fill = guide_legend(title = 'Candidates')) 

timeline +  geom_histogram(aes(x = date, fill = candidate), binwidth = 25, 
       data = subset(maine, party == 'Republican' & candidate != 'Christie' &    
                     candidate != 'Graham' & candidate != 'Huckabee' & 
                     candidate != 'Christie' & candidate != 'Perry' & 
                     candidate != 'Walker')) + 
 ggtitle('Timeline of Republican Contributions by Candidate') + 
 guides(fill = guide_legend(title = 'Candidates')) 

For this plot I used only the top eight Republican candidates. The others were so small they were invisible in the timeline but they cluttered the legend. This timeline is bimodal with one peak right before the Maine primary and the other right around the National Convention.

Here again we see Trump and Cruz dominating the field but in different times. Trump didn’t gain popularity until the other candidates were knocked out. So it seems like Trump was not the first choice of most Republican Mainers. At least he was not for those who donated and voted in the caucus.

dem_tl <- timeline + 
 geom_freqpoly(data = subset(maine,  candidate == 'Sanders' | 
                             candidate == 'Clinton'),
               aes(color = candidate, x = date), size = 2, binwidth = 25) + 
 scale_color_manual(values = c('darkblue', 'cornflowerblue'), 
                    name = 'Candidate') 

dem_tl + ggtitle('Timeline of Clinton and Sanders Contributions') 

You can see that Clinton was not the first choice for Mainers either. They backed Sanders before the Maine caucus, which he won. He peaked in Maine contributions shortly after. Clinton saw a jump leading up to the National Convention. She saw an even larger peak right before the election. So we got on board with Clinton when we had to.

timeline + ggtitle('Top Four Contribution Count Timeline') +
  geom_freqpoly(aes(x = date, color = candidate), size = 2, binwidth = 25,
                data = subset(maine, candidate == 'Sanders' | 
                              candidate == 'Clinton' | candidate == 'Trump' |
                              candidate == 'Cruz' )) + 
 scale_color_manual(values = c('darkblue', 'firebrick1', 'cornflowerblue', 
                               'red3'), name = 'Candidate') 

Here are the top two Democrat and top two Republican candidates. There are parallels between Cruz and Sanders and also between Clinton and Trump. It looks like Clinton consistently received more contributions than Trump.

rep_tl <- timeline + 
 geom_freqpoly(data = subset(maine, candidate == 'Trump' | candidate == 'Cruz'),
               aes(color = candidate, x = date), size = 2, binwidth = 25) +  
 scale_color_manual(values = c('firebrick1', 'red3'), name = 'Candidate') 

grid.arrange(dem_tl, rep_tl, ncol = 1,  
             top = 'Top Four Candidates Split Timeline')

Here is a more proportionate view of the two parties’ contributions. Cruz and Sanders peaked around the same time, right after they each won the Maine primary caucus. Trump had a big peak right before the National Convention and a smaller peak before the election. Clinton had a small peak before the convention and a larger peak right before the election.

topfour <- subset(maine, candidate == 'Sanders' | candidate == 'Clinton' |
      candidate == 'Trump' | candidate == 'Cruz' ) 

topfour.summary <- topfour %>%
 group_by(Candidate = candidate) %>%
 summarise(Mean = mean(contribution), 
      Median = median(contribution), 
      Sum = sum(contribution), 
      Count = n()) %>%
 ungroup() %>%
 arrange()

topfour.summary
## # A tibble: 4 x 5
##   Candidate      Mean Median       Sum Count
##       <chr>     <dbl>  <dbl>     <dbl> <int>
## 1   Clinton 101.18033     25 1697401.3 16776
## 2      Cruz  56.49195     35  102702.4  1818
## 3   Sanders  41.07251     25  852049.2 20745
## 4     Trump 122.00599     50  268047.2  2197

To create this table I subset the database so it only included the top four candidates. I grouped by candidate and summarized.

Clinton got nearly twice as much money as Sanders even though he got more contributions. Their medians were the same but their means were vastly different. The simplest explanation is that Clinton got a lot more big contributions than Sanders.

Trump got more than twice as much money as Cruz as well. Trump’s mean is more than double Cruz’s mean. This makes sense with Clinton and Trump. The big donors might wait for the final candidates before donating.

Trump has the largest median suggesting fewer small contributions. Sanders and Clinton have small medians. This suggest a lot of small contributions compared to the Republicans. However, this may be a consequence of them having almost 10 times more contributions. It may not have an ideological causation, i.e. the rich support Republicans more than they support Democrats.

Another interesting point is the total number of these four candidates contributions. Over 95% of all contributions went to these four candidates. Further, more than 85% went to Clinton and Sanders. So ~10 % went to the other two and the remaining ~5% was split among the other 16 candidates. This is the total count of contributions, not the total sum of contributions.

timeline + ggtitle('Lesser Candidates Contributions Timeline') +
  geom_histogram(data = subset(maine, party != 'Republican' & 
                               candidate != 'Sanders' & candidate != 'Clinton'),
                 aes(fill = candidate), binwidth = 15) +
  scale_fill_manual(values = c('yellow3', 'blue1', 'red2', 'navy', 'green4',
                              'deepskyblue'), name = 'Candidate')

This is a plot of all the candidates who were not Republicans and who were not Clinton or Sanders. So these are the lesser democrats and the third party candidates. Most of these candidates were not significant to most Mainers. Stein (Green) and Johnson (Libertarian) were the most popular of the lesser candidates. Stein’s plot is especially telling. You can see her contributions creep up as we approach the National Convention. They stayed strong until the election. Then she had a post election bump when she tried to do a recount on Clinton’s behalf. Still, this is a minor story in the grand scheme of this election in Maine. So I don’t think I’ll be focusing on these candidates much more.

ggplot(aes(x = date, y = contribution), 
       data = subset(maine, party == 'Republican' | party == 'Democratic')) + 
 geom_point(aes( color = party), alpha = .05) + 
 ggtitle('Size of Contribution Timeline') +
 geom_vline(xintercept = as.numeric(as.Date('2016-11-08')), linetype = 2) + 
 geom_vline(xintercept = as.numeric(as.Date('2016-7-28')), linetype = 4)  + 
 geom_vline(xintercept = as.numeric(as.Date('2016-03-06')), linetype = 6) + 
 theme_minimal() + 
 scale_color_manual(values = c('darkblue',
                               'red3'), guide = F)  +
 labs(x = 'Date', y = 'Contribution Amount in USD (log10)') +
  scale_y_log10(breaks = c(0, 1, 10, 100, 1000)) +
  facet_grid(party~.)

Another thing I was curious about was the size of the contribution and if it varied over time and by party. Both parties got large donations early on. However Republucans got fewer lare donations toward the end. Democrats got more large donations toward the end comparatively.

For Democrats you can see clear horizontal lines for many different denominations. Republicans have fewer clear lines, which may be a product of fewer donations.

Democrats have a crazy amount of donations right after the Maine primary, which corresponds to Sanders’ big spike. Republicans have a cloud of donations after the election where Democrats have a firm line where contributions end.

map <- map_data('county')
map <- subset(map, region == 'maine')

candidate_stats <- subset(maine, candidate == 'Clinton' | candidate =='Trump' 
                          | candidate =='Sanders' | candidate == 'Cruz') %>%
  group_by(county, candidate) %>%
  summarise(mean = mean(contribution), 
            median = median(contribution), 
            sum = sum(contribution), 
            count = n()) %>%
  ungroup() %>%
  arrange()
  
CL_stats <- subset(candidate_stats, candidate == 'Clinton')
CL_stats$candidate <- NULL
colnames(CL_stats) <- c('county', 'clmean', 'clmedian', 'clsum', 'clcount')

T_stats <- subset(candidate_stats, candidate == 'Trump')
T_stats$candidate <- NULL
colnames(T_stats) <- c('county', 'tmean', 'tmedian', 'tsum', 'tcount')

S_stats <- subset(candidate_stats, candidate == 'Sanders')
S_stats$candidate <- NULL
colnames(S_stats) <- c('county', 'smean', 'smedian', 'ssum', 'scount')

CZ_stats <- subset(candidate_stats, candidate == 'Cruz')
CZ_stats$candidate <- NULL
colnames(CZ_stats) <- c('county', 'czmean', 'czmedian', 'czsum', 'czcount')

county <- merge(CL_stats, county)
county <- merge(T_stats, county)
county <- merge(S_stats, county)
county <- merge(CZ_stats, county)

county$popdensity <- county$population / county$area
county$votturn <- county$ballots / county$population

map <- merge(map, county)

str(county)
## 'data.frame':    16 obs. of  29 variables:
##  $ county    : Factor w/ 16 levels "Androscoggin",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ czmean    : num  60.6 60.3 46.3 54.2 62.5 ...
##  $ czmedian  : num  35 35 25 35 25 50 100 25 42.5 50 ...
##  $ czsum     : num  7094 1145 17136 3251 6935 ...
##  $ czcount   : int  117 19 370 60 111 104 35 99 36 163 ...
##  $ smean     : num  36.6 28.6 44.9 42.7 46 ...
##  $ smedian   : num  25 21 27 27 27 25 25 27 27 15 ...
##  $ ssum      : num  25004 8426 293345 10373 80380 ...
##  $ scount    : int  683 295 6538 243 1748 1210 1237 1210 457 1286 ...
##  $ tmean     : num  103 114 149 220 103 ...
##  $ tmedian   : num  40 44.8 80 100 55.1 ...
##  $ tsum      : num  14159 9127 84335 9035 10989 ...
##  $ tcount    : int  138 80 565 41 107 162 100 100 80 205 ...
##  $ clmean    : num  100.2 39.3 132.5 118 76.3 ...
##  $ clmedian  : num  25 25 27 43.5 25 25 25 25 25 30 ...
##  $ clsum     : num  33874 4829 968166 16989 98883 ...
##  $ clcount   : int  338 123 7308 144 1296 1377 864 595 288 683 ...
##  $ population: int  107702 71870 281674 30768 54418 122151 39736 34457 57833 153923 ...
##  $ trump     : int  28227 19419 57709 7918 13705 31675 9148 9727 16210 41622 ...
##  $ clinton   : int  23009 13386 102981 7016 16117 29302 12443 10241 12172 32838 ...
##  $ ballots   : int  57198 36257 176935 17085 33055 68411 23926 22237 32405 83015 ...
##  $ percapinc : int  22752 20251 31041 20838 26876 24656 25291 28003 21254 22977 ...
##  $ medhhinc  : int  44470 36574 55658 39831 47533 45973 45264 47678 39748 42658 ...
##  $ est       : int  1854 1839 1761 1838 1790 1799 1860 1760 1805 1816 ...
##  $ area      : int  497 6829 1217 1744 2351 951 1142 700 2175 3556 ...
##  $ district  : int  2 2 1 2 2 1 1 1 2 2 ...
##  $ subregion : Factor w/ 16 levels "androscoggin",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ popdensity: num  216.7 10.5 231.4 17.6 23.1 ...
##  $ votturn   : num  0.531 0.504 0.628 0.555 0.607 ...

I created a simple map database from one of ggplot’s map packages. It has the latitude and longitude points to outline each county in Maine.

Next I added the statistics of the top four candidates, broken down by county, to the county database. Then I added the population density and voter turnout for each county. Now we can do more county level investigations. This brings more of the original database into the county level database.

Finally I merged the map database and county database together. This will help me create more dynamic map plots.

maine_map <- ggplot(aes(x = long, y = lat), data = map) +
 theme_minimal() + 
 coord_fixed(1.3) + 
 theme(axis.title.x = element_blank(), axis.title.y = element_blank(), 
       axis.text.x = element_blank(), axis.text.y = element_blank())

district_map <- maine_map +
  geom_polygon(aes(fill = as.factor(district), group = group), 
               color = 'white') + 
  scale_fill_manual(values = c('darkblue', 'red3'), guide = F) +
  ggtitle('Districts') 

geo_map <- get_map(location = c( - 69, 45), zoom = 6, maptype = 'terrain') 

terrain_map <- ggmap(geo_map) + 
 coord_cartesian(ylim = c(42.8, 47.5), xlim = c( - 71.3, - 66.7)) + 
 theme_minimal() +
 ggtitle('Geography') +
 theme(axis.title.x = element_blank(), axis.title.y = element_blank(), 
       axis.text.x = element_blank(), axis.text.y = element_blank())

grid.arrange(district_map, terrain_map, ncol = 2, top = 'Maine')

Can you believe that the blue and red districts have roughly the same population? Actually the blue one has slightly more. These are the two congressional districts of Maine. A huge part of Maine’s story is our geography, thus the second map made with the ggmap package. We have miles and miles of coast land, dotted with hard-to-reach islands. Then we have a mountain range slicing through the state. From the mountains come the rivers and lakes that help make Maine famously beautiful. People have predictably built cities along the coast and rivers. That white line cutting through the state is intestate 95. It goes all the way from Florida to beyond Bangor, Maine. Much of the rest of the state is remote. Vast swaths of land can only be accessed by poorly maintained private roads. Yet, some people still live there. People living different ways think differently. So we’ll examine how some of that affected the presidential election of 2016.

maine_map + geom_polygon(aes(x = long, y = lat, group = group), data = map, 
                         color = 'black', fill='white' ) + 
  geom_point(aes(x = longitude, y = latitude, color = party), alpha = .01,
             data = subset(maine, party == 'Republican' | 
                           party == 'Democratic')) +
  ggtitle('Location of Each Contribution') +
  facet_grid(.~party) + 
  scale_color_manual(values = c('darkblue', 'red3'), guide = F) 

When I first ran this graph it looked like both parties got the same amount of contributions in the same regions. I knew that wasn’t correct so I had to fiddle with the alpha a lot, finally setting it to .01. So it takes 100 contributions in one location to create a fully darkened dot.

By the way, those dots in the ocean are islands, in case you were wondering.

maine$popdensity
## NULL
pop_dens <- maine_map +
 geom_polygon(aes(fill = popdensity, group = group), color = 'black') + 
 scale_fill_gradient2(low = 'white', mid = '#76a7f7', high = 'darkblue', 
                      midpoint = 100, name = 'People Per Square Mile', 
                      guide = guide_colourbar(title.position = 'top')) + 
 ggtitle('Population Density') +
   theme(legend.position='bottom')

pop_dens 

This long with the previous graphs show where the population is distributed in Maine. One problem with population density is that it is skewed for smaller counties. Some large counties have some decent sized urban areas. Yet the large woodland surrounding them lowers the population density.

I measured population density as the population per square mile. The pale counties on Canadian border contain the end of the Appalachian Mountains. The darkness in the south gets gradually lighter up the interstate 95 corridor. Piscataquis is the palest with a population density of four people per square mile. It homes Baxter State Park, a good chunk of the Appalachian, and interstate 95 does not cross through it. We’ll dig a little deeper in the multivariate plots.

vote_turnout <- maine_map +
  geom_polygon(aes(fill = votturn, group = group), color = 'black') +
  scale_fill_gradient2(low = 'white', mid = '#76a7f7', high = 'darkblue', 
                      midpoint = .57, name = 'Ballots / Population',
                      guide = guide_colourbar(title.position = 'top')) + 
 ggtitle('Voter Turnout by County') +
 theme(legend.position = 'bottom')

grid.arrange(pop_dens, vote_turnout, ncol = 2)

cor(county$popdensity, county$votturn)
## [1] 0.3104624

I suspected there would be a correlation between population density and voter turnout. The correlation, 0.31, is slight but present. There is a lot more variation in population density than there is in voter turnout.

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

I explored the box plots of the contribution amounts and saw how it varied by candidate and party. I made some timelines showing the candidates contribution count. Then I created a few simple maps highlighting the first things I was curious about and wanted to see in map form.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Most of this section did not analyze the main feature of the data set, the contributions themselves. The ones that did weren’t plots. The map plots showed the two districts in Maine, the population density and voter turnout, and the location of each donor. I’m not seeing anything super interesting here, at least not yet, but am going to explore further.

What was the strongest relationship you found?

So far the only correlation is between population density voter turnout, but that’s only a 0.31. I expect I am going to find a lot more interesting correlations as we go on.

Multivariate Plots Section

mapcountsblue <- maine_map + 
  scale_fill_gradient2(low = '#e2edff', mid = '#2d7dff', 
                      high = '#00112d', midpoint = .01, 
                      name = 'Proportion', 
                      guide = guide_colourbar(title.position = 'top'),
                      breaks = c(.01, .02, .03))  +
 theme(legend.position='bottom')

mapcountsred <- maine_map + 
  scale_fill_gradient2(low = '#ffdbdb', mid = '#ff4242', 
                      high = 'red3', midpoint = .01, 
                      name = 'Proportion', 
                      guide = guide_colourbar(title.position = 'top'),
                      breaks = c(.0005, .0015, .0025))  +
 theme(legend.position='bottom')

mapscount <- mapcountsblue +
 geom_polygon(aes(fill = scount / population, group = group), color = 'black') + 
 ggtitle('Sanders') 

mapczcount <-  mapcountsred +
 geom_polygon(aes(fill = czcount / population, group = group), 
              color = 'black') + 
 ggtitle('Cruz') 

mapclcount <-  mapcountsblue +
 geom_polygon(aes(fill = clcount / population, group = group), 
              color = 'black') + 
 ggtitle('Clinton') 

maptcount <-  mapcountsred +
 geom_polygon(aes(fill = tcount / population, group = group), color = 'black') + 
 ggtitle('Trump') 

grid.arrange(mapscount, mapclcount, maptcount, mapczcount, ncol = 4, 
       top = 'Proportion of Population that Donated')

The lighter the color, the fewer donors per person in the population. The blue are the Democrat candidates and the red/pink and the Republican candidates.

For these I divided the count of contributions for each county by the population of the county. This is showing the density of each candidate’s donors. This shows that Sanders had more widespread support in most counties than Clinton. Clinton beat him in Cumberland, the highest population county.

These graphs might trick some into thinking that Cruz had more donors than Trump. We know that is not the case. It looks like that because Cruz had more supporters in the low population density areas.

I find it interesting that all the graphs are darkest along the southern coast.

total_sum <- sum(maine$contribution)
ggplot(aes(x = candidate, y = contribution / total_sum), 
       data = subset(maine, candidate == 'Sanders' | candidate == 'Clinton' |
                     candidate == 'Trump' | candidate == 'Cruz' )) + 
 geom_col(aes(fill = party)) + 
 ggtitle('Proportion of Total Dollars by Candidate') + 
 theme_minimal() + 
 scale_fill_manual(values = c('darkblue', 'red3'), guide = F) +
 labs(x = 'Candidate', y = 'Proportion')

sum((topfour$contribution) / sum(maine$contribution))
## [1] 0.873989

This is the sum of the top four candidate’s contributions divided by the sum of all contributions. So 87% of the money donated was donated to these four candidates. Half of the total was to Clinton and about a quarter to Sanders. This is in line with the other data, not surprising.

total_sum <- sum(maine$contribution)
ggplot(aes(x = candidate, y = contribution / total_sum), 
       data = subset(maine, candidate != 'Sanders' & candidate != 'Clinton' &
                     candidate != 'Trump' )) + 
 geom_col(aes(fill = party)) + 
 coord_flip() + 
 guides(fill = F) + 
 ggtitle('Proportion of Total Dollars by Less Popular Candidates') + 
 theme_minimal() + 
 scale_fill_manual(values = c('darkblue', 'green4', 'deepskyblue', 'yellow3',
                              'red3')) +
 labs(x = 'Candidate', y = 'Proportion')

I wanted to see if there was anything interesting in other candidates’ proportions. Bush got 4%, Carson and Fiorina each go 2%. This is interesting since Cruz only got 3% of total contributions by dollar. Yet Cruz got over 4% of the total count of contributions. Bush had less than 1% of the total contributions count yet got 4% of the money.

bushdata2.summary <- subset(maine, candidate == 'Bush') %>%
 group_by(County = county) %>%
 summarise(Mean = mean(contribution), 
      Median = median(contribution), 
      Sum = sum(contribution), 
      Count = n(), 
      First = min(date), 
      Last = max(date)) %>%
 ungroup() %>%
 arrange()

bushdata2.summary
## # A tibble: 5 x 7
##       County     Mean Median   Sum Count      First       Last
##       <fctr>    <dbl>  <dbl> <dbl> <int>     <date>     <date>
## 1 Cumberland 1218.615    250 79210    65 2015-06-18 2016-02-19
## 2   Kennebec  768.750    150  6150     8 2015-06-29 2016-02-05
## 3    Lincoln 1833.333   2700  5500     3 2015-07-14 2016-02-08
## 4  Penobscot  290.000    250  1450     5 2015-06-17 2016-01-12
## 5       York 1065.000    100 42600    40 2015-06-25 2016-02-16
bushdata1.summary <- subset(maine, candidate == 'Bush') %>%
 summarise(Mean = mean(contribution), 
      Median = median(contribution), 
      Sum = sum(contribution), 
      Count = n(), 
      First = min(date), 
      Last = max(date)) %>%
 arrange()

bushdata1.summary
##       Mean Median    Sum Count      First       Last
## 1 1114.959    200 134910   121 2015-06-17 2016-02-19

This explains everything so clearly about Bush’s contributions. First, I ran the numbers by county, then at the state level. Across the board, Bush got big contributions. His statewide median is $200, four times larger than Trump’s, and I previously though Trump’s median was large. The mean is over $1000. None of this is too surprising as the Bush family have ties in Maine. Most of the contributions are from York, where the Bush Compound looms, and Cumberland, the next county over.

mean_map <- maine_map + 
  scale_fill_gradient2(low = '#e2ffd8', mid = '#6cb775', high = '#123d17', 
                      midpoint = .4, name = 'Contribution (USD)', 
                      guide = guide_colourbar(title.position = 'top')) + 
 theme(legend.position='bottom')

sanders_mean <- mean_map + ggtitle('Sanders') +
 geom_polygon(aes(group = group, fill = ssum / population),color = 'black')

clinton_mean <- mean_map + ggtitle('Clinton') +
 geom_polygon(aes(group = group, fill = clsum / population),color = 'black')

cruz_mean <- mean_map + ggtitle('Cruz') +
 geom_polygon(aes(group = group, fill = czsum / population),color = 'black')

trump_mean <- mean_map + ggtitle('Trump') +
 geom_polygon(aes(group = group, fill = tsum / population),color = 'black')

grid.arrange(sanders_mean, clinton_mean, trump_mean, cruz_mean, ncol = 4, 
             top = 'Mean Contribution Per Capita by County')

Dividing the sum of each candidate’s contributions per county by the population of the county gives us the mean contribution per capita.

Cruz’s Maine is looking super pale. That’s because he didn’t get much money compared to the other three. Trump didn’t get much either. His best county was Lincoln. It looks like he might have done better in Lincoln than Clinton but it may be an optical illusion. Let’s see.

lincoln_comparison <- select(county, population, county, clsum, ssum, tsum, 
                             czsum) %>%
 group_by(county) %>%
 summarise(Clinton = clsum / population, 
      Sanders = ssum / population, 
      Trump = tsum / population, 
      Cruz = czsum / population) %>%
 ungroup() %>%
 arrange()

lincoln_comparison
## # A tibble: 16 x 5
##          county    Clinton   Sanders      Trump       Cruz
##          <fctr>      <dbl>     <dbl>      <dbl>      <dbl>
##  1 Androscoggin 0.31451552 0.2321602 0.13146079 0.06586693
##  2    Aroostook 0.06719215 0.1172383 0.12699583 0.01593154
##  3   Cumberland 3.43718480 1.0414352 0.29940598 0.06083629
##  4     Franklin 0.55217629 0.3371204 0.29365770 0.10566173
##  5      Hancock 1.81709692 1.4770842 0.20194311 0.12743945
##  6     Kennebec 0.71448862 0.3663468 0.12278311 0.04696892
##  7         Knox 2.66499295 1.6064611 0.35719700 0.07361083
##  8      Lincoln 1.02254114 1.4959628 0.47092028 0.18124039
##  9       Oxford 0.32094237 0.4113670 0.11579237 0.04235592
## 10    Penobscot 0.42738876 0.2308172 0.19587807 0.08478915
## 11  Piscataquis 0.01448646 0.1896863 0.17617451 0.05446250
## 12    Sagadahoc 1.56789788 0.9888508 0.13525997 0.12906242
## 13     Somerset 0.16554530 0.2066717 0.09148254 0.02565674
## 14        Waldo 0.69538854 1.1374439 0.09053885 0.05958335
## 15   Washington 0.43963934 0.3704711 0.11556093 0.09436633
## 16         York 0.79265580 0.5568055 0.19371677 0.12414080

So Clinton was averaging $1.02 contribution per capita and Trump was only getting $0.47 in Lincoln county. So it was an optical illusion. Trump’s Lincoln looked darker since all the surrounding counties were pale. The opposite was happening on Clinton’s map.

Sanders generally did better up the coast. Cumberland is where Clinton shone. She got an average of $3.43 per capita, which is crazy since Cumberland has the highest population. The next best mean is from Knox and that goes to Clinton too. If you want to know the worst, that award goes to Cruz who got less that $0.02 per citizen of Aroostook.

maine_map + geom_polygon(aes(group = group), color = 'black', fill='white' ) + 
  geom_point(aes(x = longitude, y = latitude, color = party, 
                 alpha = contribution),  
             data = subset(maine, party == 'Republican' | 
                           party == 'Democratic')) +
  ggtitle('Mapped Contribution Amount') +
  facet_grid(.~party) +
  scale_color_manual(values = c('darkblue', 'red3'), guide = F) +
  guides(alpha = guide_legend(title = 'Amount')) 

Here is a similar map to the one above plotting the count of each zip codes and their contributions. On this one, the alpha is set to the contribution amount so the dot is darker for darker contributions. Here is it fully dark at $5000.

cor(maine$latitude, maine$contribution)
## [1] -0.0355686

So I thought there might be a correlation between the contribution amount and the latitude but I was wrong. The correlation is -0.036.

memedhh <- maine_map +
 geom_polygon(aes(fill = percapinc, group = group), color = 'black') + 
 ggtitle('Median Household Income') + 
 scale_fill_gradient2(low = '#e2edff', mid = '#2d7dff', high = '#00112d', 
                      midpoint = 25000,  name = 'Income (USD)', 
                      guide = guide_colourbar(title.position = 'top'),
                      breaks = c(20000, 25000, 30000)) + 
 theme(legend.position = 'bottom')

mepcap <- maine_map +
 geom_polygon(aes(group = group, fill = medhhinc), color = 'black') +
 ggtitle('Per Capita Income') + 
 scale_fill_gradient2(low = '#e2edff', mid = '#2d7dff', high = '#00112d', 
                      midpoint = 45000,  name = 'Income (USD)', 
                      guide = guide_colourbar(title.position = 'top'), 
                      breaks = c(35000, 45000, 55000)) + 
 theme(legend.position='bottom')
  
grid.arrange(memedhh, mepcap, ncol = 2)

With all of these graphs, the southern coast is the darkest. Across the board we are seeing more involvement from the coast.

cpv_trump <- maine_map +
 geom_polygon(aes(group = group, fill = tsum / trump), color = 'black') + 
 ggtitle('Trump') + 
 scale_fill_gradient2(low = '#ffdbdb', mid = '#ff4242', high = 'red3', 
                      midpoint = 3,  name = 'Cost (USD)', 
                      guide = guide_colourbar(title.position = 'top')) + 
 theme(legend.position='bottom')

  

cpv_clinton <- maine_map +
 geom_polygon(aes(group = group, fill = clsum / clinton), color = 'black') + 
 ggtitle('Clinton') + 
 theme(legend.position='bottom') +
 scale_fill_gradient2(low = '#e2edff', mid = '#2d7dff', high = 'darkblue', 
                      midpoint = 2,  name = 'Cost (USD)', 
                      guide = guide_colourbar(title.position = 'top'))  

grid.arrange(cpv_trump, cpv_clinton, ncol = 2, top = 'Cost Per Vote')

This is the sum of the contributions from each county, divided by the number of votes they got from that county. So this is the cost per vote. Since donors paid so much more to Clinton, the cost in this plot is higher for her across the board. In most elections, whoever gets the most money wins. That is part of why Trump’s victory was such an upset.

county$incineq <- county$medhhinc / county$percapinc
map$incineq <- map$medhhinc / map$percapinc

maine_map +
 geom_polygon(data = map, aes(group = group, fill = incineq), color = 'black') +
 ggtitle('Income Inequality') + 
 scale_fill_gradient2(low = '#e2edff', mid = '#2d7dff', high = '#00112d', 
                      midpoint = 1.9, name = 'Median / Mean',
                      guide = guide_colourbar(title.position = 'top')) +
 theme(legend.position='bottom')

This is something else I was curious about, Maine’s income inequality. The best figures I could find were median household income and per capita income. These are different measurements, one being per person and one being per household. Still, I figured the higher the mean is compared to the median, the more inequality there is. So, I divided the per capita income by the median household income. The results were not surprising. York and Sagadahoc, the darkest counties, both have notoriously wealthy communities. They also have low income rural and urban populations.

ggpairs(data = county, columns = c(3:8, 28:30), title = 'Correlations')

I was curious if income inequality had any interesting correlations. It is correlated to median household income, but not to per capita income. It is also correlated to population density, which makes some sense. Voter turnout also correlates strongly to the income statistics.

ggplot(data = county, aes(x = county), show.legend = T) + 
 geom_col(aes(y = 1), fill = 'gold', show.legend = T) + 
 geom_col(aes(y = ballots / population), fill = 'orange4', show.legend = T) + 
 geom_col(aes(y = (trump + clinton) / population), fill = 'darkblue', 
          show.legend = T) + 
 geom_col(aes(y = trump / population), fill = 'red3', show.legend = T) + 
 coord_flip() + 
 theme_minimal() + 
 ylab('Voters') + xlab('County') + 
 ggtitle('Proportion of Voters by County')

Red is Trumps’s proportion of votes and blue is Clinton’s. Brown is other voters and yellow is for those who didn’t vote. The highest voter turnout counties seem to have a larger proportion of Clinton voters.

turnout_results <- ggplot(data = county) + 
 geom_point(aes(x = votturn, y = county, color = clinton > trump), size = 3) + 
 xlab('Voter Turnout') + 
 scale_color_manual(values = c('red3', 'darkblue'), 
                    labels = c('Trump', 'Clinton'), name = 'Winner') + 
 theme_minimal() +
 ggtitle('Turnout Correlation to Outcome') +
 labs(y = 'County', x = 'Voter Turnout')

turnout_results

So, here is a clear distinction, higher voter turnout appears better for Clinton in Maine.

total_contributions <- select(maine, county) %>%
 group_by(county) %>%
 summarize(total_contributions = n()) %>%
 ungroup() %>%
 arrange()

county <- merge(county, total_contributions)
county$perc_dons <- (county$total_contributions * 100) / county$population

percent_donors <- ggplot(data = county) + 
 geom_point(aes(x = perc_dons, y = county, color = clinton > trump), size = 3) + 
 xlab('Contributions Per 100 People') + 
 scale_color_manual(values = c('red3', 'darkblue'), 
                    labels = c('Trump', 'Clinton'), name = 'Winner') +
 theme_minimal() +
 ggtitle('Contributions Per Capita Correlation to Outcome') +
 labs(y = 'County', x = 'Percentage of Population Contributing')

percent_donors

Here again, a higher percentage of people donating is better for Clinton. These contributions were all contributions counted together. So higher rates of political activism favors Clinton in this case.

These suggest a correlation between participation in the election and the outcome of the election. Maybe when more people participate it’s better for Democrats, and when fewer people participate it’s better for Republicans.

participation <- ggpairs(data = county, columns = c(4, 5, 30, 32), 
                         title = 'Participation Correlation')

participation

Wow, so, the correlation between Clinton and participation is weak. It is .27 for the percentage of donors and .32 for voter turnout. However the correlation with Trump and these factors are basically non existent at .1 and .02.

ggplot(data = subset(maine, (party == 'Democratic' | party == 'Republican') &
                    (occupation == 'RETIRED' | occupation == 'NOT EMPLOYED')),
       aes(x = date)) + 
  geom_freqpoly(aes(color = occupation), size = 1, binwidth = 25) + 
  geom_vline(xintercept = as.numeric(as.Date('2016-03-05')), linetype = 6) + 
  geom_vline(xintercept = as.numeric(as.Date('2016-11-08')), linetype = 2) + 
  geom_vline(xintercept = as.numeric(as.Date('2016-7-28')), linetype = 4)  + 
  theme_minimal() + 
  guides(color = guide_legend(title = ''), 
         guide = guide_colourbar(title.position = 'top')) +
  ggtitle('Unemployed and Retired Contributions by Party Timeline') +
  labs(x = 'Date', y = 'Number of Contributions') +
  facet_grid(party~.) +
  scale_color_manual(values = c('mediumorchid4', 'mediumseagreen'),
                     guide = guide_colourbar(title.position = 'top')) +
  theme(legend.position='bottom')

I want to revisit some of the data broken down by occupation. I plotted the retired and Unemployed together because they gave much more than the other occupations. It looks like the unemployed contribution line follows Sanders closely. They gave little to the Republicans. I’m curious who the retired gave to in each party.

retired_donors <- subset(maine, occupation == 'RETIRED') %>%
 group_by(candidate) %>%
 summarize(total = n()) %>%
 ungroup() %>%
 arrange()

subset(retired_donors, total > 19)
## # A tibble: 10 x 2
##    candidate total
##        <chr> <int>
##  1      Bush    50
##  2    Carson   518
##  3   Clinton  5137
##  4      Cruz   857
##  5   Fiorina    76
##  6      Paul    52
##  7     Rubio    82
##  8   Sanders  1183
##  9     Stein    27
## 10     Trump   867

The retired gave a lot across the board. They gave the most to Clinton and the second most to Sanders. Trump and Cruz were the most popular Republicans.

Interestingly, the spike of the unemployed to Democrats correlates to Sander’s spike right after the Maine primary caucuses. I wonder how many of the unemployed contributions went to Sanders.

subset(maine, occupation == 'NOT EMPLOYED') %>%
 group_by(candidate) %>%
 summarize(total = n()) %>%
 ungroup() %>%
 arrange()
## # A tibble: 3 x 2
##   candidate total
##       <chr> <int>
## 1   Clinton   394
## 2   Sanders  6923
## 3     Trump    14

So over 94% of contributions from unemployed individuals went to Sanders. This makes a lot of sense considering Sanders’ ideology. However I don’t think that public knowledge of this statistic would help Sanders politically.

unemployed_sanders <- timeline +
 geom_freqpoly(data = subset(maine, candidate == 'Sanders'),
               aes(x = date), size = 1.01, binwidth = 10, color = 'blue3') + 
 geom_freqpoly(aes(x = date), size = 1.01, binwidth = 10, color = 'blue4',
               data = subset(maine, candidate == 'Sanders' & 
                             occupation == 'NOT EMPLOYED')) +
 ggtitle('Sanders and the Unemployed') 

unemployed_sanders

Oh my! Peak for peak, valley for valley, these two plots are a match. The unemployed were with Sanders every step of the way. The unemployed gave Sander 6,923 contributions. That is one third of his total 20,745 contributions. So it is not surprising that their graphs are similar. I just did not expect them to be such an exact match.

I want to zoom in on these lower numbers at the beginning and the end of Sanders’ run. Sanders total and the total just from the unemployed are almost touching until July 2015 when he hits his first spike. It seems like his momentum is on the backs of the unemployed who put their hopes in him. Then, as his campaign is ending, a year later, the unemployed drop their support a little more steeply, quicker then the general population.

maine_map +
  geom_polygon(aes(x = long, y = lat, group = group), data = map, 
               color = 'black', fill='white' ) + 
  geom_point(aes(x = longitude, y = latitude, color = party), alpha = .05, 
             color = 'darkblue', data = subset(maine, candidate == 'Sanders' |
                                               occupation == 'NOT EMPLOYED')) +
  ggtitle('Sander and The Unemployed Map') 

A map of all the contributions from the unemployed shows few surprises. The alpha level is set to 0.05, so 20 contributions are required for a fully darkened dot. The contributions are dark around the southern coast. However there are also some dots that appear to be in the middle of nowhere. There is one right on the edge of Baxter state park.

Let’s zoom in a little more on the other occupations.

ggplot(data = subset(maine, (party == 'Democratic' | party == 'Republican') & 
                    (occupation == 'PHYSICIAN' | occupation == 'TEACHER' |
                     occupation == 'ATTORNEY')), aes(x = date)) + 
  geom_freqpoly(aes(color = occupation), size = 1, binwidth = 25) + 
  geom_vline(xintercept = as.numeric(as.Date('2016-03-05')), linetype = 6) + 
  geom_vline(xintercept = as.numeric(as.Date('2016-11-08')), linetype = 2) + 
  geom_vline(xintercept = as.numeric(as.Date('2016-7-28')), linetype = 4)  + 
  theme_minimal() + 
  guides(color = guide_legend(title = ''), 
         guide = guide_colourbar(title.position = 'top')) +
  ggtitle('Physicians, Teachers, Lawyers') +
  labs(x = 'Date', y = 'Number of Contributions') +
  facet_grid(party~.) +
  scale_color_manual(guide = guide_colourbar(title.position = 'top'),
                     values = c('mediumorchid4', 'mediumseagreen', 
                                'lightsalmon2')) +
  theme(legend.position='bottom')

For the Democrats, I see a bimodal timeline so these occupations were supporting Democrats when Sanders was popular and when Clinton was popular.

I thought attorneys might support Republicans over Democrats, but no. So who is supporting Republicans?

republican_donors <- subset(maine, party == 'Republican' &
                            occupation != 'INFORMATION REQUESTED' &
                            occupation != 'INFORMATION REQUESTED PER BEST 
                            EFFORTS') %>%
 group_by(occupation) %>%
 summarize(occount = n()) %>%
 ungroup() %>%
 arrange()

ggplot(data = subset(republican_donors, occount >= 35)) + 
 geom_col(aes(x = occupation, y = occount), fill = 'red3') + 
 coord_flip() + 
 theme_minimal() + 
 ggtitle('Occupations with Over 35 Contributions to Republicans')+
 labs(x = 'Occupations', y = 'Number of Contributions (log10)') +
 scale_y_log10(breaks = c(1, 10, 100, 1000))

So these are the top occupations donating to Republican candidates. For the most part these are more business oriented than the top donors in general. Let’s do the same with the Democrats.

democratic_donors <- subset(maine, occupation != 'INFORMATION REQUESTED' &
                            occupation != 'INFORMATION REQUESTED PER BEST 
                            EFFORTS' & party == 'Democratic') %>%
 group_by(occupation) %>%
 summarize(occount = n()) %>%
 ungroup() %>%
 arrange()

ggplot(data = subset(democratic_donors, occount >= 200)) + 
 geom_col(aes(x = occupation, y = occount), fill = 'darkblue') + 
 coord_flip() + 
 theme_minimal() + 
 ggtitle('Occupations With over 200 Contributions to Democrats') +
 labs(x = 'Occupations', y = 'Number of Contributions (log10)') +
 scale_y_log10(breaks = c(1, 10, 100, 1000))

So there are several overlaps. The democratic donors appear to be in occupations that require more academic credentials. Ten out of twenty require more than a high school education. For the Republican occupations I’m counting two out of twelve, physician and medical transcription.

county_results <- maine_map +
 geom_polygon(aes(group = group, fill = clinton < trump),color = 'white') + 
 ggtitle('Results') + 
 scale_fill_manual(values = c('darkblue', 'red3'), name = '',
                   labels = c('Clinton', 'Trump'))+
 theme(legend.position='bottom')

district_results <- maine_map + 
 geom_polygon(aes(group = group,fill = as.factor(district)), color = 'white') + 
 ggtitle('Districts') + 
 scale_fill_manual(values = c('darkblue', 'red3'), name = '',
                   labels = c('Clinton', 'Trump'))+
 theme(legend.position='bottom')

grid.arrange(county_results, district_results, ncol = 2)

On the left, the red counties went to Trump and the blue to Clinton. On the right are the Maine Congressional districts. On the national scale, that means that Clinton got the electoral college votes from the first district and Trump from the second. Overall, things look fair here.

final_tally <- select(county, clinton, trump) %>%
 summarize(Trump = sum(trump), 
      Clinton = sum(clinton)) %>%
 arrange()

final_tally
##    Trump Clinton
## 1 334945  354718

Clinton won the state by 20,000 votes. I like that we don’t have a winner-takes- all state, as it gives each individual more of a voice.

sum(county$trump) / sum(county$ballots)
## [1] 0.4362155
sum(county$clinton) / sum(county$ballots)
## [1] 0.4619668

It is fair that half of our delegation went to Trump even though only 43% of voters voted for him. At least is more fair than 100% going to Clinton even though she only got 46% of the vote.

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

In this section we took a good look at the donors. We looked at what occupations each party’s donors had. I zoomed in on the unemployed because I found it fascinating. It was neat to see differences and similarities between the top donor occupations for each party.

The voter turnout and percentage of the population that donated appeared to strengthen the correlation between political participation and a Clinton win for those areas. However when I ran the numbers the correlation was kind of weak.

Were there any interesting or surprising interactions between features?

Sanders and his unemployed supporters were the most surprising thing for me. I wasn’t surprised that most of the unemployed supported Sanders. I was surprised that 1/3 of the contributions to senders were from unemployed individuals. I was also surprised to see how perfectly the two plots matched.

Final Plots and Summary

Plot One

grid.arrange(turnout_results, percent_donors, ncol = 1)

participation

Description One

The correlation between voter participation and the election outcome was a neat discovery. I thought I was on to something big with the first two plots. To me it looked like a strong correlation between both voter turnout and contributions per capita to Clinton winning.

The first thing I tried to do with these was overlay the two graphs. Since the coordinates on the x-axis were different that was a challenge. I did the math to try to multiply the percentage of contributions by the right number to get them to overlay neatly. However, the right number for the Republicans was 35 and the right number for Democrats was 12. That was my first hint that the correlation might not be what it seems.

That’s why I had to run the ggpairs, which is probably what I should have done to begin with. This showed me that there indeed was a stronger correlation for Clinton than for Trump. There just was not as strong a correlation as I was expecting.

Plot Two

unemployed_sanders

Description Two

The saga of Sanders and the unemployed tells a lot about who Sanders appealed to. It also tells about who built Sanders up in the Maine primary. In this timeline you can see that the unemployed who supported Sanders had the same peaks and valleys as the general population that supported Sanders. In the end they dropped monetary support for him before the general public did.

Plot Three

grid.arrange(republican_contribution_timeline, democrat_contributions_timeline,
             ncol = 1)

Description Three

Timelines counting the contributions of each party tell a strong story. Republican Mainers were enthusiastic for the primary candidates, mostly Cruz and Carson. They gave generously. When their first choice candidates did not win, they jumped right over to Trump. Yet Mainers stopped donating well before the election. This may be because of the predictions saying he didn’t have a chance.

Mainers fell in love with Sanders. Many of them were down on their luck, such as the unemployed. When Clinton was crowned as the nominee, Mainers were hesitant. This is why you don’t see a peak at the national convention line. As time went on they reconciled with Clinton. They backed her strongly for the November election.

These two stories are different. Yet they both are stories of high hopes that are dashed, then rise with the peaks of the timelines and are dashed again.

Reflection

Most of my struggles were in learning some of the basics of R. I had trouble learning string manipulation through regex and other data manipulation. Every time I went to create a new database I went to the documentation to see the format. The same was true with each merge and correlation. Some of it I got easily, like the structure of ggplots. A lot of stuff came through trial and error and brute force of will.

The exploration was what went well for me. This is a topic I had a lot of questions about. I also came at it with a good amount of knowledge, Maine being my home state where I have lived for most of my life. I was careful not to let that knowledge cloud by observations, but instead to color them. The Maine knowledge also made it so that I was curious about things that I didn’t expect, such as the Bush anomaly where his mean contribution was over $1000.

There were many surprises. Trump was not a first choice candidate and did not get much support until he was the national nominee. I didn’t expect that. I didn’t expect that Clinton would have twice as much money as Sanders even though he had 4,000 more contributions than her. Although now that I know that a third of his contribution are from the unemployed that does make a bit more sense.

This is a data set that has a lot to offer to the curious. A lot more could be done with income distribution in Maine and how that affects the political outcomes. It’s also interesting that the rural coast seems more liberal than the rural inland regions. Maine is truly a dynamic place with many complex individuals.