Sunday, January 13, 2008

Archive of R files and data

You can download a .tar.gz file containing the R scripts and the CSV data here.

3 comments:

Dealie said...

So I stumbled across this tonight and decided to take a whack at it. I haven't added anything to the data, though I'm guessing gender and race would both be invaluable additions. For all the models I ran, I used Clinton's vote share as the dependent variable (DV) and a dummy indicating hand-counting as the independent variable (IV) of interest, with other variables as controls. If the IV stays significant despite the controls, that strengthens our ability to reject the null hypothesis that there was no cheating.
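For concreteness, the basic setup looks something like this in R; the file name and every column other than Clinton and hand are placeholders of mine, not the actual names in the CSV:

## Hypothetical file and control columns -- only Clinton and hand are
## names I actually used above.
dat <- read.csv("nh.csv")

## Clinton's vote share regressed on the hand-count dummy plus controls
fit <- lm(Clinton ~ hand + total_votes + pct_hs + pct_college, data = dat)
summary(fit)    # is the coefficient on hand still significant with controls in?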

First off, when the DV is a proportion (i.e. it lives on the unit interval), plain OLS isn't really appropriate, so I ran a series of probits and tried to minimize the Akaike Information Criterion (http://en.wikipedia.org/wiki/Akaike_information_criterion) to get the best fit. I found that stripping the specification down to a handful of variables worked best, which is unsurprising given how few observations we really have. However, when I calculated McFadden's adjusted pseudo-R squared (which penalizes for using degrees of freedom), the result was actually negative, which is discouraging as far as the explanatory value of the model goes.
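In case it helps anyone replicate this, the probit/AIC step went roughly like the sketch below (continuing from the read.csv() above; again, everything except Clinton and hand is a placeholder name):

## Probit fits on the Clinton share. With a proportion response, glm() wants
## case weights (here a hypothetical total-votes column); it may warn if
## share * weight isn't a whole number.
m1 <- glm(Clinton ~ hand + pct_hs + pct_college,
          family = binomial(link = "probit"), weights = total_votes, data = dat)
m2 <- glm(Clinton ~ hand + pct_college,
          family = binomial(link = "probit"), weights = total_votes, data = dat)
AIC(m1, m2)    # keep whichever specification has the lower AIC

## Adjusted McFadden pseudo-R^2: 1 - (logLik - K) / logLik(null).
## The penalty for K parameters is what lets it go negative on a poor fit.
m0 <- update(m2, . ~ 1)
K  <- length(coef(m2))
1 - (as.numeric(logLik(m2)) - K) / as.numeric(logLik(m0))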

I also converted the proportions back into vote counts and ran robust linear regressions, which help control for heteroskedasticity and outliers. The model had trouble converging (it fits by iteratively reweighted least squares, and again we have a pretty small sample), so I ultimately trimmed it down to the same "ideal" specification from the probit and found, quite discouragingly, that hand-counting was no longer significant. The total-votes variable just screamed and absorbed almost all the action, with the educational variables still achieving a bit of significance (in the same directions as before: high school pro-Clinton, college negative).
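The robust step, for reference, was along these lines (clinton_votes and total_votes stand in for whatever the count columns are actually called):

library(MASS)

## rlm() fits by iteratively reweighted least squares; raise maxit if it
## complains about convergence on a small sample.
rfit <- rlm(clinton_votes ~ hand + total_votes + pct_hs + pct_college,
            data = dat, maxit = 100)
summary(rfit)    # reports robust SEs and t-values (no p-values are printed)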

All my code is up at: http://www.proles.net/nh/

The code.r file is my commented analysis, and the .csv file is just a mirror of the same data from here. You can just step through my analysis, using summary() on the models as you go to see the results for yourself.

Sorry to rain on the parade a bit; it's entirely possible that there are plenty of errors in the above, and I encourage people to pick through it. I did this very quickly and sloppily and normally wouldn't share something at this stage were it not so time-sensitive. I do think the original OLS models run on this site look nice (decent R-squares, even adjusted), but they are a little too kitchen-sinky for the number of observations we have, and a lot of the variation may be due to heteroskedasticity (e.g. different variances in Diebold versus hand-count districts, or something along those lines) or other issues that robust methods are meant to control for.
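If anyone wants to check the heteroskedasticity worry directly rather than take my word for it, something like this would do it (assumes the lmtest and sandwich packages, plus the hypothetical columns from the earlier sketch):

library(lmtest)
library(sandwich)

## Hypothetical OLS fit, as in the earlier sketch (dat as read in above)
fit <- lm(Clinton ~ hand + total_votes + pct_college, data = dat)

bptest(fit)                                      # Breusch-Pagan test for heteroskedasticity
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))  # same OLS, heteroskedasticity-robust SEs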

Anyway, hope this helps...

Dealie said...

Just a brief followup on heteroskedasticity - I ran these quick commands:

> var(Clinton,na.rm=T)
[1] 0.006180386
> var(Clinton[hand==1],na.rm=T)
[1] 0.006538795
> var(Clinton[hand==0],na.rm=T)
[1] 0.003943916
>

Basically, this shows that there is more variance in the Clinton vote in districts with hand-counting than in Diebold districts. The same is true of Obama:

> var(Obama,na.rm=T)
[1] 0.006988419
> var(Obama[hand==1],na.rm=T)
[1] 0.008372794
> var(Obama[hand==0],na.rm=T)
[1] 0.004406466
>

And I'm not going to run it for every candidate now, but I imagine this is generally the case. It is likely a reflection of the fact that hand-count districts are smaller, so a few voters swinging one way or the other produces large swings in the proportions the candidates receive. Anyway, this is heteroskedasticity, and it is what the robust linear model controls for.
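A quick formal version of the same comparison, using the same attached data as the var() calls above, would be an F test on the ratio of variances (it does assume the shares are roughly normal):

## F test: is the variance of the share really larger in hand-count districts?
var.test(Clinton[hand == 1], Clinton[hand == 0])
var.test(Obama[hand == 1], Obama[hand == 0])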

I'll leave it at that for now and let someone more enlightened fix my errors and complete my work, as I need to sleep and get my own work done...

semmelweis said...

You are obviously, unlike me, well-versed in statistics. However, before looking at advanced statistical explanations, we should check whether there is an obvious hidden variable that could explain the discrepancy. It seems that Diebold machines are geographically clustered; however, I couldn't find a list of NH towns that has geographic data (say, latitude/longitude coordinates for each town) and that matches the election data.
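If such a list did turn up, folding it in might look something like this (the file names and column names below are purely hypothetical, since I haven't found the data):

## Hypothetical election CSV and town-coordinate file (columns town, lat, lon)
dat    <- read.csv("nh.csv")
coords <- read.csv("nh_town_coords.csv")
merged <- merge(dat, coords, by = "town")   # assumes a matching "town" column in both files

## Crude check: does the hand-count effect survive simple geographic controls?
summary(lm(Clinton ~ hand + lat + lon, data = merged))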