Was this dataset analysed correctly?



There's a Twitter thread going around that claims there are signs of voter fraud due to anomalies in the election vote count data set. You can download the dataset here and find the script to generate the analysis here.

Was the dataset analysed correctly? If yes, are these 'anomalies' significant enough to indicate fraud? If no, in what way is the analysis lacking?

PS: I'm not looking to start a political discussion, or to make any claims of fraud myself.


Posted 2020-11-10T08:43:24.510

Reputation: 149

Question was closed 2020-11-10T21:59:48.053


Somebody asked almost the same question on SkepticsSE.

– Erwan – 2020-11-10T21:22:10.777




While the data comes from the NYTimes and seems legit, the presentation is intentionally misleading and the subsequent assertions are baseless. I say "intentionally" because an unbiased and reputable analysis would not draw such major allegations from the data presented. The data neither proves nor disproves voter fraud, so the unfounded assertions are simply disinformation.

Before diving into the data

We are going to see many more such threads/posts in the coming months and years, so I need to address this bit before delving into the question.

Consider the following whenever you view anything online:

How this Twitter thread is disinformation (step-by-step)

1. False premise

He/she first establishes that, in total, batches of votes do not conspicuously favor one candidate or the other (they do so via the tweets below). However, this aggregate plot does not show how the D/R ratio of vote batches is distributed over time, so it could mislead viewers into believing that they should expect a consistent 50/50 Democrat/Republican split throughout the vote-counting process, which cannot be assumed from this information. The "anonymous data scientist" seems to encourage this logical fallacy by using it as his/her premise and posting his/her follow-up tweet to the plot (image below).

enter image description here

2. Misleading visualizations

He/she then establishes that the initial counts are "random" and noisy, as one might expect from a 50/50 split. However, these batches do not have sizes associated with them! A "batch" may be 100K votes or it could be fewer than 10. A more informative portrayal of the data would have made the size of each circle/point reflect the size of the batch (a bubble chart) rather than using a plain scatter plot. Although this plot may technically be accurate, it can still be quite misleading, particularly because the "anonymous data scientist" made no note of this deficiency (the lack of batch size information). But for the sake of argument, let's go with it.
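To make the batch-size point concrete, here is a minimal sketch in Python. The batches are invented for illustration and are not taken from the NYTimes data. It compares the unweighted average of the per-batch D shares (roughly what the eye estimates from a plain scatter plot) with the D share of the pooled votes.

```python
# Hypothetical batches as (democratic_votes, republican_votes).
# These numbers are invented for illustration only.
batches = [
    (6, 4),            # tiny batch, 60% D
    (3, 7),            # tiny batch, 30% D
    (2, 8),            # tiny batch, 20% D
    (70_000, 30_000),  # one large batch, 70% D
]


def d_share(d, r):
    """Democratic share of the two-party vote."""
    return d / (d + r)


# Every batch counts equally, regardless of how many ballots it holds.
unweighted = sum(d_share(d, r) for d, r in batches) / len(batches)

# Pool all ballots first, so each batch counts in proportion to its size.
total_d = sum(d for d, _ in batches)
total_r = sum(r for _, r in batches)
weighted = d_share(total_d, total_r)

print(f"unweighted mean of batch shares: {unweighted:.3f}")
print(f"share of pooled votes:           {weighted:.3f}")
```

With these made-up numbers the unweighted average is below 50% D, while the pooled votes are about 70% D. Equal-sized dots hide exactly this difference.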

enter image description here

3. Unsubstantiated claims and ignoring legitimate possibilities

Now they start talking about an "anomaly" and introducing conspiratorial allegations. They note the jump toward D, given the D/R ratio per batch in Milwaukee at some point in time. There are many legitimate explanations for this.

For example, a high/dense population such as a city (likely Democrat-leaning) would of course take more time to collect/count ballots, so they may not have started their reporting until after low/sparse populations (likely Republican-leaning). This can cause a sudden jump in favor of Democrats.

Also, the majority of the early ballots are in favor of Republicans. This is to be expected: the candidate who consistently attempted to delegitimize mail-in ballots would of course receive fewer mail-in ballots from his followers. By contrast, Democrats were encouraged to vote by mail (and they did), making the later mail-in vote counts lean Democrat. This is another factor that could explain why the initial vote counts are skewed toward Republicans in most states when compared to the final counts.
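The reporting-order explanation can be illustrated with a small simulation. This is only a sketch under assumed parameters: the 40% and 70% D shares, the batch sizes, and the "rural first, urban later" order are all invented, not estimated from the dataset. No ballots are added or altered, yet the D share jumps partway through the count.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible


def make_batches(n_batches, batch_size, p_dem):
    """Simulate batches in which each ballot is D with probability p_dem."""
    batches = []
    for _ in range(n_batches):
        dem = sum(random.random() < p_dem for _ in range(batch_size))
        batches.append((dem, batch_size - dem))
    return batches


# Assumed, invented parameters: rural batches report first, urban ones later.
rural = make_batches(n_batches=50, batch_size=500, p_dem=0.40)
urban = make_batches(n_batches=50, batch_size=500, p_dem=0.70)
reported = rural + urban  # reporting order only; nothing is tampered with


def dem_share(batches):
    """D share of all ballots in the given batches."""
    dem = sum(d for d, _ in batches)
    total = sum(d + r for d, r in batches)
    return dem / total


early = dem_share(reported[:50])
late = dem_share(reported[50:])
print(f"D share in first half of batches:  {early:.3f}")
print(f"D share in second half of batches: {late:.3f}")
```

A plot of these batches over time would show a sudden shift toward D at the midpoint, caused purely by which areas reported when.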

enter image description here

Benji Albert

Posted 2020-11-10T08:43:24.510

Reputation: 2 179

I think the OP's suspicion of fraud was a combination of the jumps and the fact that these jumps mostly occurred in the battleground states, so how would that factor into the analysis? That being said, I greatly appreciate your explanation and the mental checklist at the beginning. – JansthcirlU – 2020-11-10T20:40:09.893


I read the thread but didn't analyze the data.

It's very difficult to answer this question in any conclusive way: assuming the graphs are correct, interpreting them is a highly risky/subjective game because there are so many hidden factors: the way votes are collected at different times, in different places which have different population densities, under different state laws and procedures...

I would simply emphasize that in statistics the word "anomaly" only describes data points which deviate from the norm (the regular pattern). Note that this is quite a vague notion (how far from the norm is an anomaly?), and more importantly that statistical analysis by itself does not explain why anomalies happen; it can only detect them. The explanation must rely on what is usually called "expert knowledge", i.e. indications which are not present in the data itself, obtained by human analysis of how the data was obtained.
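As a rough illustration of how vague "anomaly" is, here is a minimal sketch using a z-score rule on invented per-batch D shares. The z-score rule and the thresholds are my own choice for the example, not something used in the thread. Whether the unusual batch counts as an anomaly depends entirely on the threshold one picks.

```python
from statistics import mean, stdev

# Invented per-batch D shares; the last value is deliberately unusual.
shares = [0.48, 0.51, 0.50, 0.47, 0.52, 0.49, 0.53, 0.50, 0.46, 0.70]

mu = mean(shares)
sigma = stdev(shares)  # sample standard deviation


def flag_anomalies(values, threshold):
    """Return the values whose absolute z-score exceeds the threshold."""
    return [v for v in values if abs(v - mu) / sigma > threshold]


for threshold in (1.0, 2.0, 3.0):
    print(threshold, flag_anomalies(shares, threshold))
```

With these values the 0.70 batch is flagged at thresholds 1 and 2 but not at 3. In none of the cases does the rule say anything about why that batch is different.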

The linked analysis is quite interpretative and possibly biased:

  • The "slight drift from D to R" in almost all states is "likely due to outlying rural areas having more R votes. These outlying areas take longer to ship their ballots to the polling centers." This explanation is based on expert knowledge, with no way for a non-expert to check its validity.
  • The Wisconsin shift is explained by "Around 3am Wisconsin time, a fresh batch of 169k new absentee ballots arrived. They were supposed to stop accepting new ballots, but eh, whatever I guess." One may note again the use of external knowledge (and the suggestion that something illegal happened).
  • The explanation for this case looks like forensic analysis: "quite possibly bc additional ballots were added to the batch, either through backdating or ballot manufacturing or software tampering. This of this being kind of analogous to carbon-14 dating, but for ballot batch authenticity." Yet all of this is hypothetical; there could be other explanations.
  • About the Pennsylvania shift: "But then as counting continues, the D to R ratio in mail-in ballots inexplicably begin "increasing". Again, this should not happen, and it is observed almost nowhere else in the country, because all of the ballots are randomly shuffled..." Saying that this shift is "inexplicable" is interpretative: not having an explanation doesn't mean there isn't one. As for "the ballots are randomly shuffled": not as far as I know. The ballots are collected by county, and different counties can have different distributions.
  • ...

My point is: there's no way to know whether this analysis is correct just from the data, because most of the author's conclusions are based on external explanations.


Posted 2020-11-10T08:43:24.510

Reputation: 12 600

So if I'm not mistaken, there's no strong case for voter fraud because that's just one possible explanation for the results in the analysis, is that correct? If so, is it possible that the methodology to analyse the data was faulty or contrived to show these particular results? Or is it inevitable that such a big data set can be interpreted in many - though not necessarily wrong - ways? – JansthcirlU – 2020-11-10T20:48:02.893

@JansthcirlU yes, correct: the author mixes analysis of the data proper with their own interpretations/explanations, so their conclusions rely mostly on their subjective interpretation. It's also quite clear that the author is looking for some kind of fraud favoring Biden; it's not a neutral interpretation. From just the thread, I can't say if the methodology is correct or not. In any case a serious analysis would be much more complex than this, given all the external factors to take into account. It's obvious that the author doesn't really try to explain the anomalies, they just want to ... – Erwan – 2020-11-10T21:09:25.647

... conclude that there is fraud. – Erwan – 2020-11-10T21:09:36.307