Thursday, January 10, 2008

Statistical exploration of New Hampshire Democratic primary results

Many people were surprised by Hillary Clinton's unexpectedly high score in New Hampshire. Some noticed a strong correlation between precincts using Diebold equipment and Hillary's score. Others noticed notable discrepancies between the polls and the results. Yet others have pointed to exit polls that do not match the official results.

Some people offered the explanation that smaller precincts tend not to use Diebold machines and also tend to favor Obama, for whatever sociological reasons. As someone put the election data in computer-readable format on the web, and as I am slightly versed in statistical analysis using the R package, I decided to run some tests.

First, the data. The data has been posted as a Google spreadsheet on the web by Reddit user brfox.

It is missing some data in Dummer Hand, Franconia, Greenfield, Groton Hand, Harts Location, Manchester, Temple Hand, Waterville, Wentworth's Location and Windsor Hand. It has a total of 286139 Democratic voters.

Hillary Clinton wins with 39.1%. Obama gets 36.3%. Of all the Democratic votes, 57837 were hand-counted and 207251 (72%) were Diebold-counted.

So far, so good.

In hand-counted precincts, which make up 20.2% of the votes, Obama gets 38.6% and Clinton gets 34.9%. In Diebold-counted ones, Clinton gets 39.6% and Obama gets 36.3%. This is the basis for the initial claims of vote rigging.
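
As a quick sanity check, the vote shares quoted above can be recomputed from the raw totals. This is only a sketch: the variable names are mine, and treating the remainder as "method unknown" is an assumption based on the precincts listed as having missing data.

```r
# Totals quoted in the post
total_votes   <- 286139  # all Democratic votes in the spreadsheet
hand_votes    <- 57837   # hand-counted
diebold_votes <- 207251  # Diebold-counted

# Assumption: whatever is left belongs to precincts with no recorded method
unknown_votes <- total_votes - hand_votes - diebold_votes

shares <- c(hand = hand_votes,
            diebold = diebold_votes,
            unknown = unknown_votes) / total_votes

# Shares of the total Democratic vote, in percent
print(round(100 * shares, 1))
```

The hand-counted and Diebold-counted votes do not add up to the total, so a few percent of the votes are not attributed to either method in this data set.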

These claims are countered by the observation that the precincts where votes are hand-counted are small, non-urban precincts. Urbanity is, of course, a well-known factor affecting political choices.

Indeed, the mean precinct size, counted by number of Democratic votes, for hand-counted precincts is 431 (0 to 2602, median 323), against 2159 (269 to 17160, median 1320) for Diebold-counted ones.

In fact, there is a very significant correlation (p < 0.002) between Clinton's score and the precinct size, an even stronger correlation between Clinton's score and the voting method, and a stronger one still between precinct size and voting method.

We cannot say much more without going to multivariate statistics. Fortunately, thanks to GNU R, mere mortals can benefit from multivariate statistical modeling.

So, we have three variables: voting method, Clinton's score and the precinct size. We ask R to compute the best linear model that links them, and then run an analysis of variance test on the model to see how well it fits the data.
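
The session below uses three objects, cliv, hand and d$dem_size, whose construction is not shown. Here is a minimal sketch of how they might be prepared, run on simulated data rather than the real spreadsheet. The column names method and clinton, and all the simulated numbers, are hypothetical; only the names cliv, hand and dem_size come from the session itself.

```r
# Simulated stand-in for the spreadsheet (NOT the real New Hampshire data)
set.seed(42)
n <- 226  # 223 residual degrees of freedom + 3 coefficients, as in the output

method     <- sample(c("Hand", "Diebold"), n, replace = TRUE)
dem_size   <- pmax(round(rlnorm(n, meanlog = 6.5, sdlog = 1)), 1)
true_share <- 0.39 - 0.04 * (method == "Hand")   # made-up effect size
clinton    <- rbinom(n, size = dem_size, prob = true_share)
d <- data.frame(method, dem_size, clinton)

# Derived variables, named as in the post
cliv <- d$clinton / d$dem_size          # Clinton's share of the Democratic vote
hand <- as.numeric(d$method == "Hand")  # 1 = hand-counted, 0 = Diebold

# Linear model and its analysis of variance
l <- lm(cliv ~ d$dem_size + hand)
print(coef(l))
print(anova(l))
```

Coding hand as 0/1 is what lets its coefficient be read directly as the shift in Clinton's share between the two counting methods.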


> l <- lm(cliv ~ d$dem_size + hand)
> l

Call:
lm(formula = cliv ~ d$dem_size + hand)

Coefficients:
(Intercept)   d$dem_size         hand
  3.859e-01    2.598e-06   -4.644e-02


These cryptic lines mean that Hillary's score, as a fraction of the vote, is estimated as 0.3859 (38.59%), plus the precinct's Democratic vote count divided by 384911.5 (which is 1/2.598e-6), minus 0.0464 (4.64 percentage points) whenever the votes are counted by hand.

So it is estimated that voting method accounts for 4.64 percentage points of Hillary's score.
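
To make the formula concrete, here is a small worked example that plugs the fitted coefficients into the mean precinct sizes quoted earlier (431 for hand-counted, 2159 for Diebold-counted). The function name predict_clinton is mine; the numbers are copied from the output above.

```r
# Coefficients copied from the lm() output
b0     <- 3.859e-01   # intercept
b_size <- 2.598e-06   # per Democratic vote in the precinct
b_hand <- -4.644e-02  # shift for hand-counted precincts

# Predicted Clinton share (as a fraction) for a precinct
predict_clinton <- function(dem_size, hand) {
  b0 + b_size * dem_size + b_hand * hand
}

pred_hand    <- predict_clinton(431, 1)   # average hand-counted precinct
pred_diebold <- predict_clinton(2159, 0)  # average Diebold-counted precinct

print(round(100 * c(hand = pred_hand, diebold = pred_diebold), 2))
print(round(1 / b_size, 1))  # the divisor mentioned in the text
```

For these two average precincts the size term changes the prediction by less than half a percentage point, while the voting-method term changes it by 4.64 points.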

How much variability does this linear formula remove from the data? The standard deviation (on a precinct by precinct basis) of Hillary's score is about 7.8 percentage points.


> summary(l)

Call:
lm(formula = cliv ~ d$dem_size + hand)

Residuals:
      Min        1Q    Median        3Q       Max
-0.339495 -0.042209 -0.001204  0.046007  0.327182

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.859e-01  1.009e-02  38.247  < 2e-16 ***
d$dem_size   2.598e-06  3.115e-06   0.834    0.405
hand        -4.644e-02  1.126e-02  -4.123 5.28e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.07371 on 223 degrees of freedom
Multiple R-Squared: 0.1081, Adjusted R-squared: 0.1001
F-statistic: 13.51 on 2 and 223 DF, p-value: 2.894e-06


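Look at the t values! Voting method (t = -4.123, p = 5.28e-05) explains Clinton's score a lot better than precinct size (t = 0.834, p = 0.405). Note, however, that the residual standard error is still about 7.4 percentage points, against a raw standard deviation of about 7.8, and the R-squared is only 0.108: the model removes a small part of the variability.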

And if you care about ANOVA:

> anova(l)
Analysis of Variance Table

Response: cliv
            Df  Sum Sq Mean Sq F value   Pr(>F)
d$dem_size   1 0.05446 0.05446  10.024 0.001761 **
hand         1 0.09235 0.09235  16.998 5.28e-05 ***
Residuals  223 1.21158 0.00543
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
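
The ANOVA table and the summary above tell the same story, and the link can be checked by hand. The sketch below recomputes R-squared, the residual standard error, the overall F statistic and the raw standard deviation from the sums of squares in the table. The variable names are mine; the inputs are copied from the output.

```r
# Sums of squares copied from the anova() table
ss_size  <- 0.05446
ss_hand  <- 0.09235
ss_resid <- 1.21158
df_resid <- 223

ss_model <- ss_size + ss_hand
ss_total <- ss_model + ss_resid

r_squared <- ss_model / ss_total                     # share of variance explained
resid_se  <- sqrt(ss_resid / df_resid)               # residual standard error
f_stat    <- (ss_model / 2) / (ss_resid / df_resid)  # 2 predictors
sd_total  <- sqrt(ss_total / (df_resid + 2))         # n - 1 = 225

print(round(c(r_squared = r_squared, resid_se = resid_se,
              f_stat = f_stat, sd_total = sd_total), 4))
```

One thing to keep in mind when reading the table: anova() on an lm fit reports sequential sums of squares, so the d$dem_size row is tested before hand enters the model. That is why precinct size looks significant here (p = 0.0018) but not in the summary (p = 0.405), where each coefficient is tested with the other variable already in the model.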


Now let's think a little bit. There could very well be a politically meaningful parameter correlated with voting method besides precinct size. As Diebold has connections with Republicans, it could be that Republicans favor Diebold. Could it be that the Republican to Democrat size ratio explains the voting method?

I'll spare you the R screen dump: with a p-value of 0.69 for the correlation coefficient, the Republican-to-Democrat size ratio doesn't seem to explain anything.

Hence the voting method explains Hillary Clinton's score across precincts better than precinct size does, or I fail Statistics 101.

All of this is, of course, armchair politology. I would like the opinion of an independent expert sociologist. However, I would like to finish this post with the following quote:

Bilderberg guests from previous years include Senator Clinton and a former governor of Virginia, Mark Warner, both of whom who are considering running for president in 2008. President Clinton also hunkered down with the club one year. One of the most famous rumors associated with the Bilderberg is that it "anointed" Mr. Clinton in the spring of 1992.

http://www.nysun.com/article/34231?page_no=1


10 comments:

David Diez said...

Gotta love R when it comes to statistics. It's refreshing to see you do the exact (basic) analysis I had in mind -- when blocking for district/precinct size, voting method was very significant.

To put this in perspective for non-stats people or at least assist with the understanding, the p-value is the chance (probability) that we would get the result or a more extreme result if the corresponding variable actually doesn't matter. That is, if the variable didn't actually matter, it is the chance of getting this extreme of a result (and the p-value is so extreme for the counting method that it is like flipping a coin 14 times and getting heads every single time... very unlikely).
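
The coin-flip comparison can be checked directly. This small sketch computes the chance of 14 heads in a row and, going the other way, how many consecutive heads correspond to the p-value reported for the counting method (5.28e-05). The variable names are illustrative.

```r
# Probability of 14 heads in 14 fair coin flips
p_heads_14 <- 0.5^14

# p-value reported for the 'hand' coefficient in the post
p_hand <- 5.28e-05

# Number of consecutive heads with the same probability as p_hand
equivalent_flips <- -log2(p_hand)

print(p_heads_14)
print(equivalent_flips)
```

The two probabilities are of the same order of magnitude, so 14 flips is a fair rule of thumb.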

Miguel said...

This is great work, thanks!

For good measure I'm going to try to reproduce your calculations, but the argument is solid.

Would you like to cross-post your entry on the community blog EuroTrib.com?

It's good enough to be on the recommended diary list on Daily Kos, IMHO.

Unknown said...

Look at the financing behind Diebold and ES&S. There is a heavy Jewish influence. Clinton has scored big with the Israeli lobby because of her "we will take no options off the table" stance on Iran. One of the big Israeli news organizations ranked her second behind Giuliani in terms of her willingness to serve Israeli interests.

This alone does not prove fraud.

Unknown said...

It appears that the Google spreadsheet you reference has outdated/questionable hand vs. Diebold assignments. It doesn't match the NH government listing at http://www.sos.nh.gov/voting%20machines2006.htm. There are many discrepancies (e.g. Claremont is Diebold). I wouldn't tend to trust the data source used for the spreadsheet; Manchester being listed as "unknown" is dubious.

I'd like to see a rerun of your analysis with the corrected data.

Also, out of curiosity, I'd like to see a run with Clinton and Obama's Diebold votes swapped.

Unknown said...

Here are the final results of the NH primary in an easy-to-read chart.

http://www.boston.com/news/politics/2008/nh/nh_primary_dem_results_by_town/

semmelweis said...

Thanks for all the comments, I'll re-run the scripts with the new data shortly.

Unknown said...

Thursday 1/10: Bruce O'Dell writes:

Theron Horton and I have confirmed that based on the official results on the New Hampshire Secretary of State web site, there is a remarkable relationship between Obama and Clinton votes, when you look at votes tabulated by op-scan versus votes tabulated by hand:

Clinton   Optical scan   91,717   52.95%
Obama     Optical scan   81,495   47.05%

Clinton   Hand-counted   20,889   47.05%
Obama     Hand-counted   23,509   52.95%

The percentages appear to be swapped. This seems highly unusual.

Recall that the specific model of Diebold op-scan [1.94w] and central tabulator in use in New Hampshire are proven by demonstration [Hursti Hack] to be vulnerable to insider manipulation.

Theron Horton and I are proceeding with the intra-county and demographic analysis.

More to come.

Bruce O'Dell
Co-Coordinator for Data Analysis
Election Defense Alliance
Bodell[at]ElectionDefenseAlliance[dot]org


What is going on here?

You already demonstrated extreme statistical anomalies. Now add this one. These numbers check out to .0001% variance.
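
The percentages quoted above can be reproduced from the four vote counts. The sketch below rebuilds the table and measures how close the "swapped" shares actually are. Note that these are shares of the combined Clinton and Obama vote only, not of all Democratic votes, which is why they differ from the figures in the post. The object names are illustrative.

```r
# Vote counts as quoted in the comment
votes <- matrix(c(91717, 81495,
                  20889, 23509),
                nrow = 2, byrow = TRUE,
                dimnames = list(method    = c("Optical scan", "Hand-counted"),
                                candidate = c("Clinton", "Obama")))

# Two-candidate shares within each counting method (rows sum to 1)
shares <- prop.table(votes, margin = 1)
print(round(100 * shares, 2))

# Gap between Clinton's optical-scan share and Obama's hand-counted share,
# in percentage points
gap <- abs(shares["Optical scan", "Clinton"] - shares["Hand-counted", "Obama"])
print(100 * gap)
```

Because each row has only two entries, a match between one pair of shares forces the other pair to match as well, so there is a single coincidence to explain, not two.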

What kind of spreadsheet code could yield such extreme events? Is it possible that this extreme statistical event is the result of a program that is calculating the computer counted votes as a function of the hand counted votes?

I'm assuming that the hand counted votes cannot be altered by the central tabulator but the computer counted votes can be altered.

Perhaps, the central tabulator is not only receiving info from subordinate terminals but it is also sending info to the subordinate terminals, adjusting the numbers according to a preset formula.

Let me know what you think. Could you write a program that would simulate these events?

Unknown said...

I'm no stats expert, and this could be a spurious question, but I think you have a problem in the model. In a linear multivariate regression, the explanatory variables should be uncorrelated; otherwise, some of the effect of one could be attributed to the other. It's called multicollinearity.

If size and method are so highly correlated, you have a problem: you cannot use a simple regression like this to judge the separate effect of each.

Like I said, I'm no stats expert, but if I'm just seeing things, I'd really appreciate it if you'd explain why it's not an issue.
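
One common way to gauge how much of a problem this is uses the variance inflation factor (VIF). Correlated predictors do not bias the least-squares estimates, but they inflate their standard errors, and the VIF says by how much. The sketch below computes it on simulated data, not the real spreadsheet; the precinct sizes and the 50/50 split between methods are made-up assumptions, and only the variable names hand and dem_size come from the post.

```r
# Simulated predictors: hand-counted precincts are made smaller on purpose
set.seed(7)
n <- 226
hand     <- rbinom(n, 1, 0.5)
dem_size <- round(rlnorm(n, meanlog = ifelse(hand == 1, 5.8, 7.2), sdlog = 0.7))

# With exactly two predictors, both have the same VIF: 1 / (1 - r^2)
r   <- cor(dem_size, hand)
vif <- 1 / (1 - r^2)

# Same quantity via the auxiliary regression of one predictor on the other
aux_r2  <- summary(lm(dem_size ~ hand))$r.squared
vif_aux <- 1 / (1 - aux_r2)

print(c(correlation = r, vif = vif, vif_aux = vif_aux))
```

A VIF near 1 means the two effects can be separated cleanly; the larger it gets, the wider the standard errors on both coefficients. Running the same two lines on the real hand and d$dem_size columns would show how serious the concern is for this model.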

Flipperbw said...

I have an idea. I haven't seen this suggested anywhere.

What I want to do is find the results from the 2004, 2000, 1996 etc. New Hampshire primaries, and take a look at the differences between the Diebold and Hand-count votes for those elections. If we could show that it is very rare to have such a large discrepancy between machine vs hand, I think we'd have a pretty strong case.

But I can't find those documents. Anyone know where we can find them? This would be a smoking gun if we could prove it.

Anyone want to help me take this on?

Flipperbw said...

Whoops, it appears those machines were ushered in just last year, and so there are no previous records for NH...

Can anyone find the results from other states?