## Percentages of nucleotides in raw DNA


I just had Excel compute the percentage of each nucleotide appearing as the first letter of the call at each position in a raw DNA file. Then I had it compute the percentage of each nucleotide appearing as the second letter.

I got the following percentages (in positions with calls):

| Nucleotide | 1st position (%) | 2nd position (%) |
|---|---|---|
| A | 31.05 | 16.30 |
| T | 16.21 | 31.02 |
| C | 31.11 | 21.78 |
| G | 21.64 | 30.89 |

The first thing that stands out to me is that this looks decidedly non-random. The second is that the A and T percentages in the first spot are very close to the T and A percentages in the second spot, and similarly with the C and G. So while the data in each position seems non-random, the percentages of A, T, C, and G look much closer to random when all of the letters are grouped together, regardless of position. (I haven't yet tried a chi-squared test to see if they are believably random.)
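To make the two observations concrete, here is a small sketch that pools the percentages quoted above and measures how closely each first-position letter tracks its complement in the second position. It assumes both positions contain the same number of letters (one per called position), so that pooling is a simple average; that is an assumption about the file, not something checked here.

```python
# Percentages quoted in the question, by letter position within each call.
first = {"A": 31.05, "T": 16.21, "C": 31.11, "G": 21.64}
second = {"A": 16.30, "T": 31.02, "C": 21.78, "G": 30.89}

# Assumption: each position holds the same number of letters,
# so the pooled percentage is the plain mean of the two columns.
pooled = {base: (first[base] + second[base]) / 2 for base in "ATCG"}

# Compare each first-position letter with its complement in the second position.
complement = {"A": "T", "T": "A", "C": "G", "G": "C"}
gaps = {base: abs(first[base] - second[complement[base]]) for base in "ATCG"}

for base in "ATCG":
    print(f"{base}: pooled {pooled[base]:.3f}%, "
          f"gap to 2nd-position {complement[base]}: {gaps[base]:.2f}")
```

The pooled values come out at roughly 23.6 to 23.7% for A and T and 26.3 to 26.4% for C and G, and every first-versus-complement gap is under a quarter of a percentage point.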

Question 1: does everybody observe this? Is it special to this kit?

Question 2: How can there be such a tight relationship between what's happening in each position?

Some background for Question 2: It seems that sequencing machines have no ability to determine, at each position, which letter came from which chromosome. That is, the first letters at consecutive positions won't necessarily correspond to nucleotides on the same chromosome; the letters in the first position can jump back and forth between chromosomes. Or at least, so I concluded from comparing the data files of a brother and sister: in areas where there was a half-match, the matching letter would jump back and forth between the first and second position. But if that is the case, I'd expect to find equal percentages of A's occurring in the first position as in the second, equal percentages of T's, etc.


Please see my very recent article (I wrote it 2 days ago) titled Comparing Raw Data from 5 DNA Testing Companies. Among the other analyses in it, I compared the number of result values reported for each company:

You will see that homozygous values (AA, CC, GG, TT) occur more often than heterozygous ones (AC, AG, AT, CG, CT and GT), and the order is generally meaningless because they can't determine which allele is from the forward strand of the paternal chromosome versus which is from the forward strand of the maternal chromosome. (Only the forward strands of each chromosome pair are read.)

Most often then, companies order the two alleles alphabetically (my "Standardized" column), but for some unknown reason, they are not always consistent about it. Thus allele letters earlier in the alphabet (A and C) will tend to be in first position more often than letters later in the alphabet (G and T), which is what you are observing as well.

e.g. In my table above, you'll see that two companies use CT and two use TC and one uses both. But the majority of any company's values are alphabetically ordered.
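As a toy illustration of why alphabetical ordering skews the first position, the sketch below sorts every possible ordered pair of letters and counts which letter ends up first and second. The `standardize` helper is a hypothetical name, and weighting all 16 pairs equally is purely an assumption for illustration; real genotype frequencies are far from uniform.

```python
from collections import Counter
from itertools import product

def standardize(genotype: str) -> str:
    """Return the two allele letters in alphabetical order, e.g. 'TC' -> 'CT'."""
    return "".join(sorted(genotype))

first, second = Counter(), Counter()

# Toy assumption: every ordered pair of letters is equally likely.
for a, b in product("ACGT", repeat=2):
    call = standardize(a + b)
    first[call[0]] += 1
    second[call[1]] += 1

print("first position: ", dict(first))
print("second position:", dict(second))
```

Under that assumption A lands in the first position in 7 of the 16 pairs and T in only 1, with the mirror image in the second position, which is the direction of the skew described above.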

Cool post. The alphabetization explains the lower percentages in the G and T categories. But the "duality" between A and T and C and G, reflecting the duality of the two strands in a single chromosome, is even more pronounced in your table. If you take a given count for some allele pair XY and replace each letter with its "dual", swapping any A with T or vice-versa and C with G or vice-versa, the resulting pair will have a count very close to the XY count. The counts for the two "self-dual" cases, AT and CG, are far lower than the others. What accounts for this? – Barry – 2018-09-03T11:43:42.307
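The "dual" operation described in this comment can be sketched as follows. The names `dual` and `standardize` are hypothetical, and the result is put back in alphabetical order so that it can be compared against a standardized column of values.

```python
from itertools import combinations_with_replacement

# Swap A <-> T and C <-> G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def standardize(genotype: str) -> str:
    """Return the two allele letters in alphabetical order."""
    return "".join(sorted(genotype))

def dual(genotype: str) -> str:
    """Replace each letter with its complement, then re-standardize."""
    return standardize(genotype.translate(COMPLEMENT))

# The 10 unordered two-letter genotypes: AA, AC, AG, AT, CC, CG, CT, GG, GT, TT.
genotypes = ["".join(p) for p in combinations_with_replacement("ACGT", 2)]

dual_of = {g: dual(g) for g in genotypes}
self_dual = [g for g in genotypes if dual_of[g] == g]

print(dual_of)
print("self-dual:", self_dual)
```

This pairs AA with TT, CC with GG, AC with GT, and AG with CT, and leaves AT and CG as the only genotypes that map to themselves.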

Every SNP has its own set of possible values, with one or two of them happening 95% of the time. I would presume that, of the 700,000 SNPs tested, AT and CG are simply rarer results. – lkessler – 2018-09-03T14:22:54.940

They are just as rare in my own file, and presumably we are unrelated. So it seems maybe to be a general phenomenon. But my question remains: why the similarity of counts between "dual" allele combinations? And from that observation, it seems perhaps no accident that the rarest combinations are the self-dual ones. – Barry – 2018-09-03T19:12:55.917

Yes, I think they might be rarer for everyone. The similarity of dual allele counts may be simply because they are shown in the order they are found and there's a 50% chance one will be found before the other. – lkessler – 2018-09-03T21:06:08.197