Earlier this month I was fighting a pretty bad cold. While I probably should have been doing “real” work during the sinusy waking hours at home, I decided instead to finish season two of The Walking Dead and scratch an itch regarding the 2012 presidential election. I feel like I’ve seen some sloppy journalism and “data porn” floating around in the wake of the election: simple mash-ups of data and ill-drawn conclusions about how race, poverty, religion, and historical baggage (e.g., slavery) influenced the vote.
So I threw together some data, too, to try and understand these things for myself (and you, if you care). In short, I built a statistical computer model that predicts the vote distribution between Obama and Romney at the U.S. county level, using some demographic info about each county as evidence. Here is a quick summary of what the data seems to say:
- Obama-leaning communities tend to: live in more multi-unit housing, have more businesses owned by women and minorities, be more educated, and have an economy driven by manufacturing. (Some weak associations include: population density, retail or food & service economies, and percentage speaking a language other than English at home).
- Romney-leaning communities tend to: be more White or Hispanic, have more children, and own their own homes. (Some weak associations include: a wholesale-trade economy, percent of residents who were foreign-born, and whether the county was part of a former slave state or territory).
- Apparently irrelevant features include: income level or poverty rate, household size, and presence of a voter ID law.
Juicy details below. Note that these findings are correlations, and correlation does not imply causality. Also, these are results for counties, not individuals. For example, the percent of the population under 18 years of age is a predictive feature of pro-Romney counties. Clearly, these children were not voting for Romney (they are too young!), so it must have been the adults in counties with more children. We should be similarly cautious about drawing hasty conclusions about people described by other demographic features. (There is an interesting result regarding Hispanic populations that I discuss below.) Sometimes such conclusions are reasonable, but you have to be careful, and what sets my little study apart from most of the mash-ups I have seen is that these different demographic features are all taken into account. In other words, we’re considering the role of, say, voter ID laws in the context of everything else.
Before we dive in, a caveat: I am not a political scientist. I am a computer scientist with a background in machine learning and natural language processing. I am learning more social science, but my methods and interpretations do not benefit from a ton of social theory or training (yet).
The Data Set
The core of the data is a mash-up of two sources: county-level election results taken from this Google Fusion table made available by The Guardian newspaper on November 7, 2012, and U.S. Census data downloaded from the State & County QuickFacts repository. The latter represents most of the demographic features I used. I only focused on Obama vs. Romney and ignored third-party candidates, as there were simply too few votes for a meaningful analysis. Besides, I didn’t study up on the other candidates’ positions, so I wouldn’t even be sure how to interpret the results.
For each county I added a few additional features to test two hypotheses. First, there has been some speculation that voter ID laws hurt President Obama in some way. To test this, I added binary variables with a value of one to counties in states that (1) required photo ID, (2) requested photo ID, or (3) required non-photo ID. Second, this juxtaposition circulated widely on the Interwebs, comparing an 1860 map of free vs. slave territories to the 2012 election results; the suggestion being that regions where slavery was legal back then are anti-Obama today. To test this, I added a binary variable with value of one if the county was part of an 1860 slave region, and zero otherwise.
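To make the encoding concrete, here is a minimal sketch of how those four binary variables could be built. The state names, the law categories assigned to them, and the feature names are all hypothetical placeholders, not the actual 2012 assignments.

```python
# Hypothetical lookup tables: placeholders, NOT the real 2012 state laws.
ID_LAW_BY_STATE = {
    "State A": "photo_required",
    "State B": "photo_requested",
    "State C": "non_photo_required",
}
SLAVE_REGION_1860 = {"State A", "State C"}  # hypothetical membership

def extra_features(state):
    """Return the four binary (0/1) indicator features for a county's state."""
    law = ID_LAW_BY_STATE.get(state)  # None means no voter ID law
    return {
        "id_photo_required": int(law == "photo_required"),
        "id_photo_requested": int(law == "photo_requested"),
        "id_non_photo_required": int(law == "non_photo_required"),
        "slave_region_1860": int(state in SLAVE_REGION_1860),
    }

print(extra_features("State A"))
print(extra_features("State D"))  # a state with no law and not in the 1860 set
```

Every county in a state gets the same four values, which is one reason these features can end up standing in for geography.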
In summary, the data set consists of 3,077 counties with both vote (outcome) and demographic (evidence) information, and there are 55 demographic variables used to try and predict each county’s vote distribution.
To my non-machine-learning friends: I’ve tried to write this section clearly with you in mind, but it uses a little pro-level math jargon and notation. I won’t be offended if you just skip to the figures and such below.
The model is a kind of logistic regression (LR), which is a pretty common method for observational studies in social science. That is, I make the simple assumption that each feature (“percent under age 18,” “percent in poverty,” etc.) makes an additive contribution to whether the county as a whole swings toward Obama vs. Romney. For the equation-lovers out there, the predicted proportion of votes that go to Obama is given by:
$$\hat{p} = \sigma(\mathbf{w} \cdot \mathbf{x}) = \sigma\Big(\sum_j w_j x_j\Big)$$
where $\mathbf{w}$ is a vector of model weights, $\mathbf{x}$ is a vector of demographic features for a particular county, and $\sigma$ is the logistic function. So if $x_j$ denotes the “percent in poverty,” $w_j$ is the corresponding weight for that feature: positive values are Obama-leaning, and negative values are Romney-leaning. Multiply these together, add up the products for all demographic features/weights, shove it through $\sigma$, and out comes the predicted proportion of votes that go to Obama. The trick is to pick the right weights in $\mathbf{w}$ to make accurate predictions and understand how each of these demographics may affect voting habits.
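For the code-lovers, the same computation can be sketched in a few lines of Python. The weights and feature values here are made up for illustration; they are not the fitted weights from this study.

```python
import math

def logistic(z):
    """Logistic (sigmoid) function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_obama_share(weights, features):
    """Weighted sum of a county's features, pushed through the logistic function."""
    score = sum(w * x for w, x in zip(weights, features))
    return logistic(score)

# Hypothetical weights and (standardized) feature values, for illustration only.
weights = [0.8, -0.5, 0.0]   # positive = Obama-leaning, negative = Romney-leaning
county = [1.2, 0.4, 2.0]

share = predict_obama_share(weights, county)
print(f"Predicted Obama share: {share:.3f}")
```

Note that the third feature has a weight of zero, so it contributes nothing to the prediction no matter how large its value is.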
There are a few differences between the way I estimate these weights and a standard LR analysis. First, one usually predicts a true/false variable (like “did Obama get the most votes?”), and picks weights that minimize a loss function like the “log-loss” of the actual observed true/false outcomes. With this data set, though, the outcomes are not true/false but distributions (i.e., the proportion of votes that go to Obama or to Romney). So instead, I pick weights that minimize the KL-divergence between the actual vote distribution $p_i$ and the predicted distribution $\hat{p}_i$ for each county $i$:
$$\sum_i \left[ p_i \log\frac{p_i}{\hat{p}_i} + (1 - p_i) \log\frac{1 - p_i}{1 - \hat{p}_i} \right] + \lambda \sum_j |w_j|$$
If you are familiar with regression analysis, the standard LR is just a special case of this: $p_i$ is either zero or one, and the equation above reduces to the log-loss function. The second summation in the equation is an “L1 regularization” prior on the weights $w_j$ (a.k.a. the LASSO). This does two things: (1) it penalizes large weights and makes better predictions by not overfitting, and (2) it encourages a sparser, more easily interpretable model by “throwing away” demographic features that seem to be statistically irrelevant, by driving those weights toward zero. The parameter $\lambda$ controls the regularization level, and after a little tuning I picked a value of 50.
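Here is a small sketch of that objective, mostly to show that the KL-divergence really does collapse to the log-loss when the outcome is all-or-nothing. The function names are mine, and I am assuming the regularization parameter simply multiplies the sum of absolute weights; the exact scaling in the original fit may differ.

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence between two-outcome distributions (p, 1-p) and (q, 1-q)."""
    q = min(max(q, eps), 1.0 - eps)  # keep the prediction away from exactly 0 or 1
    total = 0.0
    if p > 0.0:
        total += p * math.log(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total

def objective(actual, predicted, weights, lam):
    """Sum of per-county KL divergences plus an L1 penalty (assumed form)."""
    fit = sum(kl_bernoulli(p, q) for p, q in zip(actual, predicted))
    penalty = lam * sum(abs(w) for w in weights)
    return fit + penalty

def log_loss(y, q):
    """Standard log-loss for a true/false outcome y in {0, 1}."""
    return -(y * math.log(q) + (1 - y) * math.log(1 - q))

# With an all-or-nothing outcome, the KL term equals the usual log-loss.
print(kl_bernoulli(1.0, 0.7), log_loss(1, 0.7))
```

A perfect prediction costs nothing on the first term, so any remaining loss comes purely from the size of the weights.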
(Insert obligatory “the loss function is convex and can be minimized using an orthant-wise limited-memory quasi-Newton optimization method” statement here. Also, I guess you could do an ordinary least-squares regression — I haven’t tried that — but this approach guarantees that predictions are properly probabilistic.)
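If you want to see the “throwing away” effect without an orthant-wise quasi-Newton library on hand, a much simpler stand-in is proximal gradient descent (a.k.a. ISTA) with soft-thresholding. To be clear, this is not the optimizer used for the results in this post, just a sketch of a different method that minimizes the same kind of objective on toy data I made up.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_threshold(value, amount):
    """Shrink value toward zero by amount, snapping to exactly zero if it is small."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0

def fit_l1(features, targets, lam, step=0.1, iterations=2000):
    """Proximal gradient descent on (sum of KL divergences) + lam * (L1 norm)."""
    n_features = len(features[0])
    weights = [0.0] * n_features
    for _ in range(iterations):
        grad = [0.0] * n_features
        for x, p in zip(features, targets):
            q = logistic(sum(w * xj for w, xj in zip(weights, x)))
            for j, xj in enumerate(x):
                grad[j] += (q - p) * xj  # gradient of the KL term w.r.t. weight j
        weights = [soft_threshold(w - step * g, step * lam)
                   for w, g in zip(weights, grad)]
    return weights

# Toy data: the first feature tracks the vote share, the second is pure noise.
features = [[1.0, 0.1], [-1.0, -0.1], [1.0, -0.1], [-1.0, 0.1]]
targets = [0.7, 0.3, 0.7, 0.3]

weights = fit_l1(features, targets, lam=0.1)
print(weights)  # the noise feature's weight should come out as exactly 0.0
```

The soft-thresholding step is what produces exact zeros; plain gradient descent on a smooth penalty would only make the irrelevant weight small.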
If this is all new to you, this method basically just tries to pick model weights whose predictions are as similar as possible to actual vote distributions. The only other minor modeling note is that features exhibiting long-tail distributions were log-tempered (that is, log-transformed). This is a pretty common adjustment and leads to better predictions, too.
So How Accurate is This Model?
I tried a few ways of fitting the weights, and the KL-divergence approach described above seems to work best. Here’s a comparison of that vs. a standard true/false logistic regression, both using the L1 regularization:
The scatter plots show the actual vs. predicted proportion of votes going to Obama (all predictions pooled over ten folds using cross-validation). Each dot represents a county: blue dots are correctly predicted as majority Obama, red dots correctly predict majority Romney, and grey dots are misclassifications. The standard LR (right) has a little bit better classification accuracy, but it gets the vote distributions waaaaay off: by 25.7% on average. The KL-divergence LR (left) is only off by 8.9%, which is significantly better (p<.00001, paired 2-tailed t-test). For comparison, guessing 50/50 for every county would be off by 15.5% on average. Furthermore, you can see the straight “diagonal line” type relationship between the two axes in the left plot (as opposed to the “S” shape on the right), which is sort of the goal. So KL-divergence LR it is.
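The two yardsticks used above (how far off the predicted proportions are on average, and how often the county winner is picked correctly) are easy to compute. Here is a sketch with four made-up counties; the numbers are hypothetical and only show the mechanics, including the 50/50 baseline.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted Obama proportions."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def winner_accuracy(actual, predicted):
    """Fraction of counties where the predicted majority matches the actual one."""
    hits = sum((a > 0.5) == (p > 0.5) for a, p in zip(actual, predicted))
    return hits / len(actual)

# Hypothetical held-out counties: actual vs. predicted Obama share.
actual = [0.62, 0.35, 0.48, 0.71]
predicted = [0.58, 0.41, 0.53, 0.60]
baseline = [0.5] * len(actual)  # guess an even split everywhere

model_error = mean_absolute_error(actual, predicted)
baseline_error = mean_absolute_error(actual, baseline)
accuracy = winner_accuracy(actual, predicted)
print(model_error, baseline_error, accuracy)
```

A model can score well on one measure and badly on the other, which is exactly the trade-off between the two regressions in the scatter plots.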
To better visualize what’s going on, here are some maps:
In broad strokes, the demographic model’s predictions (top) capture a lot of the trends in the actual vote distributions (middle; similar maps elsewhere). However, the model errs toward a 50/50 split; it doesn’t predict the really polarized counties as well (this is possibly due to weight shrinkage from the L1 term). The difference map (bottom) shows the model’s errors: bluer regions are actually more pro-Obama than predicted, redder regions are more pro-Romney, and whiter regions are spot-on. (Note that a few counties are greyed out in all the maps… this is because I had no voter data for these counties.)
So the model isn’t perfect, but it’s pretty darn good, especially considering it only uses demographic features as evidence and no explicit vote history or geography information (which should improve predictions, but is tangential to my interest in the demographics). It predicts voting proportions within 8.9% on average, explains nearly half the variance among those proportions, and correctly picks the county winner 85.4% of the time. Awesome!
(Update: Larry Wasserman indirectly pointed out over email that counties are not weighted by population or voter turnout here. This might affect the model.)
Now let’s have a look at the model weights and what they tell us about voting patterns in different kinds of communities. Here’s a figure illustrating the learned weights (or download a more detailed PDF version):
Pro-Obama features are typeset in blue, pro-Romney features are in red, and features with zero weight (those “thrown away” as irrelevant) are in black. The features I added are highlighted with yellow. The colored squares on the right indicate the magnitude of the weight: darker colors have more impact. As you can see, there are only about eleven pro-Obama and six pro-Romney features of much practical significance. Let’s dig in.
Voter ID Laws and Slavery
Let’s begin by looking at the two “Internet hypotheses” I tried to test by adding new features.
First, voter ID laws have no discernible impact. It may be that at the individual level, these laws do reduce voter fraud and/or prevent disadvantaged citizens from voting, but at the county level the model simply threw these features away. I re-ran the analysis with all three variables collapsed into one (i.e., “any kind of ID law”) and got the same result. As a sanity check, I tried the model without the L1 regularization, and the ID law features did turn out to be negatively associated with Obama, but this alternative non-regularized model was also much worse at prediction for all the measures I considered.
So my model says that voter ID laws had zero effect on Obama vs. Romney outcomes at the county level. Oh yeah? That isn’t the story implied by the news coverage, or even by this figure showing violin plots of the data:
It looks like Obama took 44% of the vote in counties with no voter ID laws, compared to 34% in counties with any kind of law. But that gap probably says more about the kinds of communities that were open to voter ID laws, which might also happen to prefer Romney to Obama to begin with. As evidence of this, the model takes these other demographic features into account, and since the ID laws do not help improve the predictions, they get ignored.
(Note: I didn’t throw any latent-variable models at the data and I’m not a causality expert, so all of these are best guesses as to what’s really going on.)
(Also: How did Obama win if he took only 34-44% of the vote, on average, in counties both with and without ID laws? These statistics aren’t normalized for population. Romney won more counties than Obama, but Obama won many of the more populous counties, thus winning both the overall popular and electoral votes.)
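A tiny worked example shows how both things can be true at once. The three counties and their vote counts below are invented; the point is only the difference between averaging county percentages and pooling the actual votes.

```python
# Hypothetical counties: (name, Obama votes, Romney votes).
counties = [
    ("Big City", 600_000, 400_000),
    ("Rural A", 3_000, 7_000),
    ("Rural B", 4_000, 6_000),
]

# Average of per-county Obama shares: every county counts equally.
shares = [obama / (obama + romney) for _, obama, romney in counties]
unweighted_average = sum(shares) / len(shares)

# Pooled share: every vote counts equally.
total_obama = sum(obama for _, obama, _ in counties)
total_votes = sum(obama + romney for _, obama, romney in counties)
pooled_share = total_obama / total_votes

counties_won = sum(share > 0.5 for share in shares)
print(unweighted_average, pooled_share, counties_won)
```

Here Obama wins only one of three counties and averages well under half per county, yet takes a clear majority of all votes cast.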
Second, being part of a historically slave state or territory does have a slightly negative effect on predicted Obama votes. It is statistically significant, and may indeed reflect racial prejudices that 152 years of progress in civil rights has yet to overcome. However, the weight isn’t too practically significant (the percent of the population under age 5 and percent of owner-occupied homes have even stronger effects). I suspect that the “former slave territory” feature is actually being used by the model as a proxy for geography. Note that the model’s errors (in the “difference” map above) are geographically systematic, and the free/slave state variable partially overlaps with these errors. It might be that with better geographic information, the model would ignore this feature, but then again maybe not. Besides, it’s hard to tease things like this apart, since associations with slavery are partially encoded in present-day racial demographics as well.
Race and Ethnicity
Not surprisingly, whiter communities lean strongly toward Romney and blacker communities lean strongly toward Obama. This doesn’t mean that all (or even most) Whites voted for Romney, but people who live in mostly White communities did. Likewise for Blacks and people (regardless of color) who live in mostly Black communities. For example, my White friends who live in Pittsburgh’s Hill District tend to support Obama. Asian presence leans slightly toward Obama as well. Native Hawaiian and Pacific Islander presence has no discernible impact. Perhaps the most interesting observation has to do with the presence of a Hispanic community. The percent of residents who are of Hispanic (or Latino) origin strongly predicts more Romney votes, but the percent of Hispanic-owned businesses strongly predicts more Obama votes. For the other non-White groups, the percent of population and percent of business ownership lean in the same direction, but not with Hispanics. Why?
We can “poke the box” a little bit to try and get a sense of what’s going on. By that I mean let the model make predictions both with and without these features, and see how they differ. Here is a map that visualizes the effects:
Redder regions are predicted to be more pro-Romney if the model takes Hispanic population and Hispanic business features into account. Bluer regions are similarly more pro-Obama. Whiter regions are unaffected, either because they do not have as many Hispanics around, or the opposite effects just cancel each other out (as in Las Vegas and Phoenix, apparently). As a sanity check, the counties that seem most affected in either direction are located in Florida or near the Mexican border, so that makes sense.
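One simple way to “poke the box” like this is to predict once with all the weights, once with the weights of interest zeroed out, and take the difference. That zeroing approach is my assumption for this sketch (refitting the model without the features would be another option), and the weights and feature names below are hypothetical.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, features, ignore=()):
    """Predicted Obama share, skipping any feature whose index is in `ignore`."""
    score = sum(w * x for j, (w, x) in enumerate(zip(weights, features))
                if j not in ignore)
    return logistic(score)

def feature_effect(weights, features, indices):
    """Change in predicted share caused by including the given features."""
    return predict(weights, features) - predict(weights, features, ignore=indices)

# Hypothetical feature order: multi-unit housing, pct Hispanic, pct Hispanic-owned firms.
weights = [0.5, -0.9, 0.7]
hispanic_features = (1, 2)

mostly_population = [0.2, 1.0, 0.3]   # many Hispanic residents, few businesses
mostly_business = [0.2, 0.5, 1.5]     # fewer residents, many businesses
balanced = [0.2, 0.7, 0.9]            # the two effects roughly cancel

for county in (mostly_population, mostly_business, balanced):
    print(round(feature_effect(weights, county, hispanic_features), 3))
```

A negative effect would show up as a redder county on the map, a positive effect as a bluer one, and a near-zero effect as white.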
I surmise that what’s going on here has something to do with a balance of power. Consider the scene: counties with more Hispanics tend to vote for Romney, unless there are more Hispanic-owned businesses. Well, some of the Hispanics in these places are immigrants (both legal and illegal), and thus probably cannot vote. Perhaps these aliens are migrant workers, which in turn might stir up resentment among native non-Hispanic citizens who vote for Republicans in support of tougher immigration laws. West Texas might be that kind of place, for example (not that I would really know, that just fits with my own preconceptions). On the flipside, Hispanic citizens with voting rights might be more likely to own businesses, create jobs, rise to positions of leadership, and also vote Democrat. Florida, Southern California, and New Mexico might be places with a Hispanic population more like this.
The National Election Pool (NEP) (as summarized by The New York Times) says that Obama won among all non-White ethnic groups (including Hispanics) at the individual voter level. So the model of county-level voting behavior here suggests that where more non-voting Hispanics are present, the voting community might actually swing the other way. This is just my speculation, but I suspect something like that might really be happening. Another lesson in correlation vs. causality: we shouldn’t simply conclude from a single statistical association which way Hispanics are actually voting.
(Update: See Steve Sailer’s interpretation in the comments below.)
The strongest predictor of Obama votes is the percent of housing units that are in multi-dwelling structures. My first guess is that these buildings are (1) apartments and (2) projects, which are more common in urban areas that tend to be more left-leaning, and also indicate renters rather than home-owners. This is in contrast to the percent of homes that are owner-occupied (more common in rural and suburban areas), which is a predictor of Romney votes.
Another possibility is that the Obama campaign was just better mobilized in getting out to canvass these multi-dwelling units, which had an impact on voter turnout. In 2008, my wife and I lived in a four-unit urban complex in Madison, Wis., and we had a lot of Obama folks come by. In Pittsburgh for this election, we had one Obama campaigner darken the door of our rented row house. We never had a single McCain or Romney supporter either time.
Economy and Wealth
The value of goods from manufacturing firms is another top-five predictor of Obama votes. The other economic sectors are tossups, with most leaning Obama except wholesale. However, the data set I used lacks any information on universities, military presence, farms, and other interesting economic drivers. I suspect these would have greatly improved the predictive power of the model, and would have been interesting to inspect as well. (All that info exists in the full Census report, but that data is messy and this isn’t a real research project…)
Also, I felt that during this election, Romney supporters were (unfairly) stereotyped as rich and out-of-touch while Obama supporters were seen as poor and government-dependent (given comments like “the 47 percent”). So I was particularly interested in features that directly describe a community’s distribution of wealth (percent in poverty, median household income, per capita income, etc.). They were all thrown away by the model. The general wealth of a community (or lack thereof) appears to have no substantial impact. There might be interesting interactions (e.g., communities that have both high education and high income, rather than treating these variables independently), but I poked around a little bit and couldn’t find any interesting or sensible ones (and they all seemed to hurt the model’s predictive power, too).
(Update: Andrew Gelman pointed out over email that affluence really does appear related at the individual level. I’m in the process of reading his popular book about the 2008 election, Red State Blue State.)
Education, Gender, and Age
Communities with more education definitely voted for Obama: the association with college degrees is even stronger than high school degrees. This doesn’t necessarily mean that “smarter people voted Obama,” though. Counties with a concentration of educated people are probably urban centers or university towns, which tend to be more left-leaning. Note that the aforementioned NEP pool found that individuals with college degrees (though not post-graduate degrees) leaned slightly toward Romney, and that Obama won among individuals who didn’t finish high school by a 29% margin.
The proportion of females did not appear to have a large impact either way (barely Romney-leaning), which is no surprise since most counties have a 50/50 gender split. However, the NEP pool also shows that Obama won by an 11% margin among women voters, and my model significantly associates more women-owned businesses with Obama.
I’m not sure what to conclude about age, though, in part because the data set only has a few age variables in it. As I mentioned at the beginning, the percent of the population under age 18 is a strong predictor of Romney votes (so is the percent under age 5). Since these young citizens are too young to vote, I assumed it must be their parents who were voting for Romney, but the NEP pool claims that Obama won parents of children under age 18 by a small margin. That confuses the story a bit. Another puzzler is the percent of the population over age 65, which is mildly associated with Obama in the model, whereas the NEP pool suggests that Romney won that group by 9%. There might be another “who is actually showing up to vote” story here (as with the presence of children and Hispanics), but I don’t know what it might be…
This was a fun little study, although it ended up being more involved than I intended… I needed the Thanksgiving weekend to finally write it up! There are some other observations one could make with the data, but these are the ones I found most interesting. I feel like the model dispels a couple of sloppy theories regarding voter behavior that I’ve heard (like voter ID laws and former slave states), and also uncovers some interesting trends (regarding ethnicity and wealth in particular). If you have any other thoughts or insights, please leave them in the comments below.