Factors Affecting an American County’s Politics: A Statistical Approach
I have always enjoyed looking at election results by county (a geographic subdivision) and the factors that affect them, like education, race and population density. While I could do simple calculations comparing one variable to another, I always wanted to combine several variables and see how they work together, which is much more interesting than looking at one variable at a time.
Getting to the data:
I combined data from an online source for election results with the US Census Bureau’s data on demographics. I then loaded all the necessary variables into R (a statistical programming language) and ran a multivariable linear regression. The regression produces a set of coefficients (constants), one per variable plus an intercept, which I could then use to estimate how a county would vote based on its demographics.
The variables I used for the regression were the shares of a county’s population that are non-Hispanic White, African-American, Asian and Hispanic, along with the common (base 10) logarithm of the population density and the education level (the share of residents aged 25 or above who have a Bachelor’s degree or higher). I had studied these variables earlier and found that each of them, on its own, had an effect on a county’s politics.
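To make the setup concrete, here is a minimal sketch of what such a regression could look like in R. It is not the actual script behind this analysis: the data frame `counties`, its column names, the randomly generated toy numbers and the choice of the Democratic margin as the outcome are all assumptions made for illustration.

```r
# Toy data only: every number below is randomly generated, not real county data.
set.seed(1)
n <- 200

# Random race/ethnicity shares that sum to 1 per county
# (the fifth column stands for all other groups and is left out of the model).
raw <- matrix(runif(n * 5), ncol = 5)
shares <- raw / rowSums(raw)

counties <- data.frame(
  white     = shares[, 1],         # share non-Hispanic White
  black     = shares[, 2],         # share African-American
  asian     = shares[, 3],         # share Asian
  hispanic  = shares[, 4],         # share Hispanic
  density   = 10^runif(n, 0, 4),   # population density (hypothetical units)
  bachelors = runif(n, 0.1, 0.6)   # share aged 25+ with a Bachelor's degree or higher
)

# Made-up outcome: Democratic margin (Democratic share minus Republican share).
counties$dem_margin <- with(
  counties,
  -0.5 + 0.6 * black + 0.4 * asian + 0.3 * hispanic +
    0.1 * log10(density) + 0.8 * bachelors + rnorm(n, sd = 0.05)
)

# Multivariable linear regression on the six inputs described above.
fit <- lm(
  dem_margin ~ white + black + asian + hispanic + log10(density) + bachelors,
  data = counties
)
print(coef(fit))  # the intercept plus one coefficient per variable

# Estimate the margin for a hypothetical county from its demographics alone.
new_county <- data.frame(
  white = 0.60, black = 0.10, asian = 0.05, hispanic = 0.20,
  density = 500, bachelors = 0.35
)
estimate <- predict(fit, newdata = new_county)
print(estimate)
```

Writing `log10(density)` inside the formula lets `predict()` apply the same transformation to new counties, so the raw density can be passed in directly.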
Visual depiction:
Here is roughly what you get when you estimate how the counties vote using only their demographics and the regression constants. The bluer the dot, the larger the margin by which Biden (the Democrat) is expected to have won the county. This is not too far from the actual results.
This map shows the difference between the actual results and what the regression estimates. The bluer a county is, the more the Democrats beat the regression estimate, while the redder it is, the more the Republicans beat it. You can see some patterns here, with certain areas being consistently more Democratic or more Republican than the regression model expects.
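The quantity on this second map is what statisticians call a residual: the actual result minus the estimate. A minimal sketch of that calculation in R follows, using a smaller made-up dataset with only two inputs. The names `counties`, `dem_margin`, `expected` and `residual` are hypothetical, and none of the numbers are real.

```r
# Toy data only: randomly generated, not real county results.
set.seed(2)
n <- 100
counties <- data.frame(
  bachelors = runif(n, 0.1, 0.6),   # share with a Bachelor's degree or higher
  density   = 10^runif(n, 0, 4)     # population density (hypothetical units)
)
counties$dem_margin <- with(
  counties,
  -0.6 + 1.2 * bachelors + 0.1 * log10(density) + rnorm(n, sd = 0.1)
)

fit <- lm(dem_margin ~ bachelors + log10(density), data = counties)

# Expected margin from the model, and how far the actual result is from it.
counties$expected <- fitted(fit)
counties$residual <- counties$dem_margin - counties$expected

# Positive residual: Democrats did better than the model expected (bluer).
# Negative residual: Republicans did better than expected (redder).
print(head(counties[order(-counties$residual), ]))
```

Sorting by the residual, as in the last line, is a quick way to list the counties where the model misses by the most in the Democratic direction.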
The linear regression gives a “line of best fit” based on the values you already have. That fit can then be used to estimate an output value from multiple inputs, even if you do not have the actual output for those inputs.
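Written out, a fit of this kind takes roughly the following form. The symbols are my own notation rather than anything R prints, and I am assuming the output is the Democratic margin; the six inputs are the ones listed earlier.

```latex
\hat{y} = \beta_0
        + \beta_1 \, x_{\text{White}}
        + \beta_2 \, x_{\text{Black}}
        + \beta_3 \, x_{\text{Asian}}
        + \beta_4 \, x_{\text{Hispanic}}
        + \beta_5 \, \log_{10}(\text{density})
        + \beta_6 \, x_{\text{Bachelor's}}
```

Here $\hat{y}$ is the estimated result for a county, $\beta_0$ is the intercept and $\beta_1, \dots, \beta_6$ are the regression constants. They are chosen to minimise the sum of squared differences between $\hat{y}$ and the actual result across all counties.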
This regression works rather well, with the expected values having a strong relationship with the actual values. It is much more reliable than using a few variables on their own. However, its accuracy has limits because it leans too heavily on a county’s education level in its analysis. Education is a large factor in a county’s results, but the model often makes errors in more-educated Republican counties and less-educated Democratic ones. As a result, it underestimates Republicans in many suburbs around big cities and underestimates Democrats in many working-class areas, including much of the rural Midwest, New England and working-class parts of cities with large minority populations. Still, it is a useful tool for estimating how a place will vote, given only its demographics.