Sunday, January 27, 2008

Obama's South Carolina victory not simply a consequence of high black voter turnout

Brian has done some no-nonsense calculations showing that even with a black voter turnout as low as 18%, Obama would have won South Carolina.

Tuesday, January 15, 2008

Complete map of Diebold precincts

Here you go. Red is Diebold.

Diebold effect sticks around, need a proper statistician

I have incorporated Brian's work in my latest analysis. I tried different linear and generalized linear models combining Clinton's score, the total number of votes, the usual demographic data, employment rates, education and latitude and longitude.

I took the latest available data sets, got the list of towns using electronic voting from the official site, and completed and corrected town names by hand.

The significant factors I found are, in order of decreasing significance:
  • Percent of people holding bachelor's degrees,
  • Voting method.
It is interesting to note that with the new, improved data, the percent of people holding a bachelor's degree becomes extremely significant (about p = 3e-9 vs. about p = 0.001 for voting method).

I'd like someone who knows his statistics well to check the data and tell us if the voting method is indeed significant. The fitted models are linear and, for all I know, voting method could be acting as a non-linear proxy for population size, or there could be some other funny explanation...

Anyway, the new data is available here; feel free to check, improve and re-publish it. Note that you need the maptools R package. You can install it by typing install.packages(c("maptools", "maps"), dependencies=TRUE) in R.

Monday, January 14, 2008

Geographical explanation might not be as watertight after all...

Reddit user brfox has started his blog on election statistics and did some analysis taking the geographic distribution of states into account. It seems that there is still a Diebold effect.

Meanwhile a column at argues geography is the explanation.

We need to double- and triple-check all this data carefully. If, after controlling for various parameters, we still have what looks like a statistical anomaly, we must get some expert opinion.

Request voting precinct geographical data

Hello all,

After brfox's message, I have been scavenging the net for a database that would allow me to map New Hampshire voting precincts.

I found some data at the USGS; however, voting precincts do not match ZIP codes exactly. Also, I need latitude/longitude data.

So I parsed the latlong data from Wikipedia, which lists 223 towns. Here it is, feel free to use it or to complete it.

We can do a paired study on adjacent voting precincts with different methods, or do a regression analysis with latitude/longitude as extra data.
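Both ideas can be sketched in a few lines of R. This is only an illustration on simulated placeholder data, not the real town table: the data frame `towns` and its columns `lat`, `lon`, `hand` and `clinton` are assumptions made up for the example. Pairing uses plain Euclidean distance in degrees, which is rough, and a machine-counted town may be matched to several hand-counted ones, so the paired test is optimistic about independence.

```r
set.seed(42)

# Hypothetical town table; every value here is a simulated placeholder.
towns <- data.frame(
  lat     = runif(40, 42.7, 45.3),
  lon     = runif(40, -72.5, -70.7),
  hand    = rep(c(TRUE, FALSE), each = 20),
  clinton = runif(40, 0.25, 0.50)
)

hand_towns    <- towns[towns$hand, ]
machine_towns <- towns[!towns$hand, ]

# For each hand-counted town, find the index of the nearest machine-counted town.
nearest <- sapply(seq_len(nrow(hand_towns)), function(i) {
  d2 <- (machine_towns$lat - hand_towns$lat[i])^2 +
        (machine_towns$lon - hand_towns$lon[i])^2
  which.min(d2)
})

# Idea 1: paired comparison of Clinton's share between neighbouring towns.
paired <- t.test(machine_towns$clinton[nearest], hand_towns$clinton, paired = TRUE)
print(paired)

# Idea 2: regression with latitude and longitude as extra covariates.
geo_fit <- lm(clinton ~ hand + lat + lon, data = towns)
print(anova(geo_fit))
```

With real data, the same code would only need `towns` replaced by the actual per-town table.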

Strong geographical clustering of Diebold precincts

Someone from Reddit has just forwarded me a map showing that precincts using Diebold machines are geographically aggregated in the south-east of the state. Finally, this could be the explanation!

We still need to do a study of geographically and demographically similar pairs with distinct voting methods.

Sunday, January 13, 2008

Archive of R files and data

You can download a .tar.gz file containing the R scripts and the CSV data here.

College education is a significant factor, Diebold effect remains

On Reddit, ohno linked to a study suggesting that Clinton's result might be due to a shift in the opinions of college-educated women along with their under-representation in polls.

Accordingly, I included the percent holding bachelor's degrees (which I had left out through inattention) in the regression analysis. It appears that college education is a very significant factor in explaining Clinton's result: F value 14.3 at p = 0.0002276.

However the Diebold effect remains at F = 16.6 and p = 7.859e-05.

Controlling for employment, age, housing or income data doesn't remove the Diebold effect

While I was spending a quiet Sunday I didn't notice the large amount of activity here. Thanks for all your comments and your data! I thought this analysis was lost in the blogosphere.

Meanwhile others have done their own analyses and it seems that the effect doesn't disappear. They have also published their own data in computer-readable format.

The mainstream media has also picked up the buzz.

I quote:

Lenski said it's all of a piece: Education, income and age -- factors that influence voters' candidate choices, also play into where they choose to live.

So I loaded one of these data sets which includes the following data:

  • Primary results

  • Vote method

  • Sociodemographic data: age distribution, housing units, unemployment rate, median household income, number of single family homes.

I have then attempted to explain Clinton's result by various combinations of these variables, using multivariate linear regression in GNU R.

In all these attempts, voting method remains the most important variable (besides, of course, results of other candidates) explaining Clinton's score, with an F-value of 15 and p < 0.0002. Next we have the employed/unemployed interaction (with an F-value of 4 and p < 0.04).

Here is the R command:

> model <- lm(
nh$Clinton ~
nh$Obama +
nh$Biden +
nh$Dodd +
nh$Edwards +
nh$Gravel +
nh$Kucinich +
nh$D1H0 +
nh$Votes +
nh$Totalpopulation +
nh$Percapitaincome * nh$Totalemployed * nh$Totalunemployed +
nh$Singlefamilyhomes +
nh$Multifamilyunits +
nh$Medianage +
nh$Percenthighschoolgraduates +
nh$Age5andunder * nh$Age5to19 * nh$Age35to54 * nh$Age55to64 + nh$Age65andup +
nh$Employeesinlargestbusiness +
nh$Municipalwater * nh$Municipalsewer * nh$Totalhousingunits)

And its results (F value and p-value for each term):

nh$Obama 205.6910 < 2.2e-16 ***
nh$Biden 0.3304 0.5663645
nh$Dodd 11.2074 0.0010484 **
nh$Edwards 0.8830 0.3490086
nh$Gravel 3.9644 0.0484338 *
nh$Kucinich 21.2929 8.867e-06 ***
nh$D1H0 15.5007 0.0001299 ***
nh$Votes 3.0066 0.0851407 .
nh$Totalpopulation 0.2982 0.5859160
nh$Percapitaincome 0.1636 0.6865105
nh$Totalemployed 0.0337 0.8546765
nh$Totalunemployed 0.8149 0.3682376
nh$Singlefamilyhomes 0.1271 0.7219829
nh$Multifamilyunits 0.7231 0.3965950
nh$Medianage 1.1115 0.2935781
nh$Percenthighschoolgraduates 7.536e-07 0.9993086
nh$Age5andunder 3.3983 0.0673950 .
nh$Age5to19 1.3064 0.2550066
nh$Age35to54 0.1011 0.7509720
nh$Age55to64 0.1961 0.6585892
nh$Age65andup 0.6695 0.4146369
nh$Employeesinlargestbusiness 0.5505 0.4593790
nh$Municipalwater 0.1546 0.6947648
nh$Municipalsewer 2.6180 0.1079231
nh$Totalhousingunits 3.4622 0.0648983 .
nh$Percapitaincome:nh$Totalemployed 0.1181 0.7316529
nh$Percapitaincome:nh$Totalunemployed 0.0032 0.9548340
nh$Totalemployed:nh$Totalunemployed 4.3509 0.0388181 *
nh$Age5andunder:nh$Age5to19 0.0871 0.7682941
nh$Age5andunder:nh$Age35to54 0.3209 0.5719690
nh$Age5to19:nh$Age35to54 0.1458 0.7031766
nh$Age5andunder:nh$Age55to64 1.1076 0.2944332
nh$Age5to19:nh$Age55to64 0.0133 0.9082441
nh$Age35to54:nh$Age55to64 0.1195 0.7300958
nh$Municipalwater:nh$Municipalsewer 0.9774 0.3245594
nh$Municipalwater:nh$Totalhousingunits 0.5824 0.4466709
nh$Municipalsewer:nh$Totalhousingunits 0.0487 0.8256904
nh$Percapitaincome:nh$Totalemployed:nh$Totalunemployed 2.7098 0.1019924
nh$Age5andunder:nh$Age5to19:nh$Age35to54 0.1625 0.6874457
nh$Age5andunder:nh$Age5to19:nh$Age55to64 0.7882 0.3761842
nh$Age5andunder:nh$Age35to54:nh$Age55to64 1.6764 0.1975536
nh$Age5to19:nh$Age35to54:nh$Age55to64 0.0148 0.9033289
nh$Age5andunder:nh$Age5to19:nh$Age35to54:nh$Age55to64 0.4226 0.5167068
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Now let me repeat that I am not a statistician or a sociologist. I have a Ph.D in theoretical computer science and a slight interest in statistics.

The mantra is "correlation is not causation" and, as claimed in the Associated Press article, there could very well be an unaccounted sociological factor that is correlated with the presence of Diebold machines.

However, explanations such as "there is no case for concern since precincts with Diebold machines have, strangely enough, always favored such and such a class of candidates" are not very satisfying, since it is the very reliability of these Diebold machines that is in question.

Hence, in light of the general surprise of the press at Clinton's victory in New Hampshire, the large discrepancies between some polls and the results, and Diebold's history, further scrutiny seems warranted.

Saturday, January 12, 2008

Friday, January 11, 2008

Request for New Hampshire sociodemographic data

To check for sociologically significant variables that might be correlated to voting method, we need sociodemographic data on New Hampshire by voting precinct. Unfortunately it seems that the US Census Bureau won't give these for free. Does anyone have such data?

Thursday, January 10, 2008


In precincts where Diebold machines are used, Hillary gets a 7 point advantage over Obama.

According to a best-fit linear model of the effect of precinct size and voting method, voting method accounts for about 4.6 points with high statistical confidence.

That's 4.6 points not explained by precinct size, which is included in the model.

Including variables such as number of Republican voters or other candidates does not significantly alter this result.

Hence the debunking that "Diebold use correlated with large cities correlated with Hillary supporters, thus no conspiracy" is insufficient and further analysis is warranted.

Especially given the wild poll discrepancies.

Statistical exploration of New Hampshire Democratic primary results

A number of people have been surprised by the unexpectedly high score of Hillary Clinton in New Hampshire. Some people noticed that there was a strong correlation between precincts using Diebold equipment and Hillary's score. Others noticed notable discrepancies between polls and the results. Yet others have pointed to some exit polls that do not match the official results.

Some people offered the explanation that smaller precincts tend not to use Diebold machines and also tend to favor Obama, for whatever sociological reasons. As someone put the election data in computer-readable format on the web, and as I am slightly versed in statistical analysis using R, I decided to run some tests.

First, the data. The data has been posted as a Google spreadsheet on the web by Reddit user brfox.

It is missing some data in Dummer Hand, Franconia, Greenfield, Groton Hand, Harts Location, Manchester, Temple Hand, Waterville, Wentworth's Location and Windsor Hand. It has a total of 286139 Democratic voters.

Hillary Clinton wins with 39.1%. Obama gets 36.3%. Of all the Democratic votes, 57837 were hand-counted, 207251 (72%) votes were Diebold-counted.

So far, so good.

In hand-counted precincts, which make up 20.2% of the votes, Obama gets 38.6% and Clinton gets 34.9%. In Diebold-counted ones, Clinton makes 39.6% and Obama gets 36.3%. This is the basis for the initial claims of vote rigging.
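As a quick check of the arithmetic, here is a small R sketch using only the percentages and vote counts quoted above. The variable names are mine; nothing here comes from the data file itself.

```r
# Vote shares quoted in the post, in percent.
hand_pct    <- c(Clinton = 34.9, Obama = 38.6)
diebold_pct <- c(Clinton = 39.6, Obama = 36.3)

# Clinton's margin over Obama under each counting method.
margin_hand    <- hand_pct[["Clinton"]] - hand_pct[["Obama"]]       # negative: Obama leads
margin_diebold <- diebold_pct[["Clinton"]] - diebold_pct[["Obama"]] # positive: Clinton leads

# Swing in Clinton's margin between the two groups of precincts.
swing <- margin_diebold - margin_hand

# Share of all Democratic votes that were hand-counted, in percent.
hand_share <- 100 * 57837 / 286139

print(round(c(margin_hand = margin_hand,
              margin_diebold = margin_diebold,
              swing = swing,
              hand_share = hand_share), 1))
```

The swing comes out at about 7 points, and the hand-counted share at about 20.2%, matching the figures in the text.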

These claims are countered by the observation that hand-counted precincts are small, non-urban precincts. Urbanity is, of course, a well-known factor affecting political choices.

Indeed, the mean precinct size, counted by number of Democratic votes, for hand-counted precincts is 431 (0 to 2602, median 323), against 2159 (269 to 17160, median 1320) for Diebold-counted ones.

Actually there is a very significant correlation (p < 0.002) between Clinton's score and precinct size, an even stronger one between Clinton's score and voting method, and a stronger one still between precinct size and voting method.

We cannot say much more without going to multivariate statistics. Fortunately, thanks to GNU R, mere mortals can benefit from multi-variate statistical modeling.

So, we have three variables: voting method, Clinton's score and the precinct size. We ask R to compute the best linear model that links them, and then run an analysis of variance test on the model to see how well it fits the data.

> l <- lm(cliv ~ d$dem_size + hand)
> l

lm(formula = cliv ~ d$dem_size + hand)

(Intercept) d$dem_size hand
3.859e-01 2.598e-06 -4.644e-02

These cryptic lines mean that Hillary's score can be estimated as 38.59% plus the number of Democratic votes divided by 384911.5 (which is 1/2.598e-6, the result being a fraction of 1), minus 4.64 percentage points whenever the votes are counted by hand.

So it is estimated that voting method accounts for 4.64 percentage points of Hillary's score.
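To make the formula concrete, here is a minimal sketch that plugs the printed coefficients into a small function. The helper `predict_clinton` is an illustrative name of mine, not part of the original scripts, and the precinct size of 1320 is simply the median Diebold precinct size quoted earlier.

```r
# Coefficients copied from the fitted model printed above.
coefs <- c(intercept = 3.859e-01, dem_size = 2.598e-06, hand = -4.644e-02)

# Predicted Clinton share (as a fraction of 1) for one precinct.
# dem_size: number of Democratic votes; hand: TRUE if hand-counted.
predict_clinton <- function(dem_size, hand) {
  coefs[["intercept"]] +
    coefs[["dem_size"]] * dem_size +
    coefs[["hand"]] * as.numeric(hand)
}

# Two precincts of the same size, differing only in counting method.
machine <- predict_clinton(1320, hand = FALSE)
by_hand <- predict_clinton(1320, hand = TRUE)

# Results in percentage points.
print(100 * c(machine = machine, by_hand = by_hand, gap = machine - by_hand))
```

Whatever the precinct size, the gap between the two predictions is the `hand` coefficient, 4.644 percentage points, while the size term moves the prediction by well under one point for precincts of typical size.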

How much variability does this linear formula remove from the data? The standard deviation (on a precinct by precinct basis) of Hillary's score is about 7.8 percentage points.

> summary(l)

lm(formula = cliv ~ d$dem_size + hand)

Min 1Q Median 3Q Max
-0.339495 -0.042209 -0.001204 0.046007 0.327182

Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.859e-01 1.009e-02 38.247 < 2e-16 ***
d$dem_size 2.598e-06 3.115e-06 0.834 0.405
hand -4.644e-02 1.126e-02 -4.123 5.28e-05 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.07371 on 223 degrees of freedom
Multiple R-Squared: 0.1081, Adjusted R-squared: 0.1001
F-statistic: 13.51 on 2 and 223 DF, p-value: 2.894e-06

Look at the t values! As you can see, voting method explains Clinton's score a lot better than precinct size does.

And if you care about ANOVA:

> anova(l)
Analysis of Variance Table

Response: cliv
Df Sum Sq Mean Sq F value Pr(>F)
d$dem_size 1 0.05446 0.05446 10.024 0.001761 **
hand 1 0.09235 0.09235 16.998 5.28e-05 ***
Residuals 223 1.21158 0.00543
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
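A note on reading the two tables together: precinct size looks significant in the ANOVA table (p = 0.0018) but not in the t tests (p = 0.405). As far as I understand it, this is because anova() on an lm fit reports sequential tests, where each term is tested after only the terms listed before it, while the t tests in summary() adjust each term for all the others. The sketch below illustrates this on simulated data, not on the real results: the variables reuse the names from the model above, but their values are made up, with 226 rows to match the 223 residual degrees of freedom.

```r
set.seed(1)
n <- 226

# Simulated stand-ins: hand-counted precincts tend to be smaller.
hand     <- rbinom(n, 1, 0.5)
dem_size <- ifelse(hand == 1, rexp(n, 1 / 400), rexp(n, 1 / 2000))

# Simulated truth: only the counting method matters, size has no effect.
cliv <- 0.39 - 0.05 * hand + rnorm(n, sd = 0.07)

# Same model, terms entered in two different orders.
fit_size_first <- lm(cliv ~ dem_size + hand)
fit_hand_first <- lm(cliv ~ hand + dem_size)

# Sequential F tests: the row for the first term ignores the second term.
a1 <- anova(fit_size_first)
a2 <- anova(fit_hand_first)
print(a1)
print(a2)

# Marginal t tests: each term adjusted for the other, independent of order.
print(coef(summary(fit_size_first)))
```

For the last term entered, the sequential F value equals the square of its t value, which is also what the real output shows for hand: (-4.123)^2 is about 17.0. Comparing the two orderings is a cheap way to see how much of a term's apparent effect is shared with the other variable.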

Now let's think a little bit. There could very well be a politically meaningful parameter correlated with voting method besides precinct size. As Diebold has connections with Republicans, it could be that Republicans favor Diebold. Could it be that the Republican to Democrat size ratio explains the voting method?

I'll spare you the R screen dump: the p-value of the correlation coefficient being 0.69, the R to D size ratio doesn't seem to explain anything.

Hence voting method explains Hillary Clinton's score across precincts better than precinct size, or I fail Statistics 101.

All of this is of course armchair politology. I would like the opinion of an independent expert sociologist. However I would like to finish this post with the following quote:

Bilderberg guests from previous years include Senator Clinton and a former governor of Virginia, Mark Warner, both of whom who are considering running for president in 2008. President Clinton also hunkered down with the club one year. One of the most famous rumors associated with the Bilderberg is that it "anointed" Mr. Clinton in the spring of 1992.