I just had Excel compute the percentage of each nucleotide appearing as the first letter of the pair at each position in a raw DNA file. Then I had it compute the percentage of each nucleotide appearing as the second letter.
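For anyone who wants to reproduce the tally outside Excel, here is a minimal Python sketch of the same count. It assumes the genotypes have already been read into (first letter, second letter) pairs; the function name `position_percentages` is made up for illustration, and anything that is not A, C, G, or T in either slot is treated as a no-call and skipped.

```python
from collections import Counter

# The four letters counted; any other symbol is treated as a no-call (assumption).
NUCLEOTIDES = ("A", "C", "G", "T")


def position_percentages(pairs):
    """Return (first_pct, second_pct): percentage of each nucleotide
    in the first and in the second slot, over called positions only."""
    first = Counter()
    second = Counter()
    for a, b in pairs:
        # Skip the position unless both slots hold a real call.
        if a in NUCLEOTIDES and b in NUCLEOTIDES:
            first[a] += 1
            second[b] += 1
    total = sum(first.values())
    if total == 0:
        raise ValueError("no called positions")

    def pct(counts):
        return {n: 100.0 * counts[n] / total for n in NUCLEOTIDES}

    return pct(first), pct(second)


# Tiny made-up example, not real data:
example = [("A", "T"), ("A", "A"), ("C", "G"), ("0", "0"), ("T", "T")]
first_pct, second_pct = position_percentages(example)
print(first_pct)
print(second_pct)
```

How the pairs get loaded depends on the file layout of the kit, so that part is left out here.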
I got the following percentages (in positions with calls):
The first thing that stands out to me is that this looks decidedly non-random. The second is that the A and T percentages in the first position are very close to the T and A percentages in the second position, and similarly for C and G. So while the data in each position seem non-random, the percentages of A, T, C, and G look much closer to random when all of the letters are grouped together, regardless of position. (I haven't yet tried a chi-squared test to see whether they are believably random.)
Question 1: Does everybody observe this? Or is it specific to this kit?
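The chi-squared check mentioned above could be sketched roughly as follows. This assumes "random" means each of the four letters is equally likely (25% each), which may not be the right baseline for real DNA. The function name `chi_squared_uniform` is hypothetical, and the input is a dict of pooled raw counts, not percentages.

```python
def chi_squared_uniform(counts):
    """Chi-squared goodness-of-fit statistic for observed counts
    against equal expected counts in every category."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no observations")
    expected = total / len(counts)
    return sum((obs - expected) ** 2 / expected for obs in counts.values())


# Critical value for 3 degrees of freedom (4 categories - 1) at the 5% level.
CRITICAL_DF3_P05 = 7.815

# Made-up counts for illustration only, not real data:
pooled = {"A": 30, "C": 20, "G": 20, "T": 30}
stat = chi_squared_uniform(pooled)
print(stat, "reject equal frequencies" if stat > CRITICAL_DF3_P05 else "consistent with equal frequencies")
```

With hundreds of thousands of positions, even small departures from 25% will push the statistic past the critical value, so the size of the deviation matters as much as the test result.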
Question 2: How can there be such a tight relationship between what's happening in each position?
Some background for Question 2: As far as I can tell, sequencing machines have no ability to determine which letter at a given position came from which chromosome. That is, the first letters at consecutive positions won't necessarily correspond to nucleotides on the same chromosome; the letters in the first position can jump back and forth between chromosomes. Or at least, that is what I concluded from comparing the data files of a brother and sister: in areas where there was a half-match, the matching letter would jump back and forth between the first and second position. But if that is the case, I'd expect the percentage of A's occurring in the first position to equal the percentage in the second, and likewise for T's, etc.