## DNA results from Ancestry

7

A friend of mine took a DNA test with Ancestry.com a while back. Now I (a trained mathematician) am being asked to help interpret the results, and I am finding myself totally out of my depth.

The main contributions come from three regions: A at 36%, B at 35%, and C at 23%. The parents are known to have migrated from regions A and B, respectively, and from relatively homogeneous populations, so I was wondering about region C. My take is that the percentages ought to hew closely to integer multiples of $(1/2)^n$; a fraction of 1/3 seems unlikely. So I read the 36% and 35% as both really being approximations of 3/8, and the 23% as an approximation of 1/4. I then figured that each parent is 3/4 of their respective base region and 1/4 of region C. Interpreting this further, I ventured the guess that, on each of the maternal and paternal sides, one of the four great-grandparents came from region C.

Does any of this make sense? Is there any place where I could read up on the mathematics of genetics?

(FWIW, Ancestry also gives confidence bounds. For region A these are 21% and 49%, for region B they are 28% and 42%, and for region C they are 15% and 30%. Given that the percentages are highly correlated --- they have to add up to something close to 1, for instance --- this seems consistent with my interpretation.)
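As a quick sanity check on that last remark, here is a minimal sketch that tests whether the guessed dyadic fractions fall inside the reported bounds. It reads the 23% as 1/4, which is the value implied by each parent being 1/4 region C; the names `reported`, `guess`, and `within_bounds` are just illustrative.

```python
from fractions import Fraction

# Reported (estimate, lower bound, upper bound) in percent, as quoted in the post.
reported = {
    "A": (36, 21, 49),
    "B": (35, 28, 42),
    "C": (23, 15, 30),
}

# Guessed "true" dyadic fractions: an assumption of the post, not an Ancestry output.
guess = {"A": Fraction(3, 8), "B": Fraction(3, 8), "C": Fraction(1, 4)}


def within_bounds(region):
    """True if the guessed fraction lies inside the reported confidence bounds."""
    _, low, high = reported[region]
    return low <= guess[region] * 100 <= high


for region in reported:
    print(region, float(guess[region] * 100), within_bounds(region))

# The three guessed fractions account for the whole genome.
print(sum(guess.values()))
```

Passing this check only shows that the guess is not contradicted by the bounds; intervals this wide are compatible with many other splits as well.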

6

The family tree of ancestors is a binary tree (although as you move up, you will start to see nodes that are identical -- this is pedigree collapse). So if you think of each ancestor as making an equivalent "contribution", then yes, the total contribution from ancestors from one place should be a dyadic rational number.

But Ancestry is determining ethnicity by analyzing DNA, not by analyzing the family tree. Each parent contributes half of the DNA to a child, but already at the level of grandparents, it is usually the case that some of the grandparents contribute more DNA to the grandchild than others. So even if it makes sense to model the percentage of ethnicity as a sum of contributions from each ancestor at a certain number of generations back, the resulting distribution for the ethnicity is not restricted to dyadics.
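To illustrate why grandparents need not contribute exactly 25% each, here is a rough simulation sketch. The model is a deliberate simplification and every parameter is an assumption: 22 chromosomes of equal genetic length (1.5 Morgans each), crossovers placed as a Poisson process at one per Morgan, and no interference or sex differences. The function name `grandparent_share` is hypothetical.

```python
import random


def grandparent_share(n_chromosomes=22, morgans=1.5, rng=random):
    """Simulated fraction of a child's autosomal DNA coming from one grandparent.

    Toy model (all assumptions): equal-length chromosomes and crossovers
    placed as a Poisson process with rate 1 per Morgan.
    """
    from_grandparent = 0.0
    for _ in range(n_chromosomes):
        # Draw crossover positions along this chromosome.
        crossovers = []
        x = rng.expovariate(1.0)
        while x < morgans:
            crossovers.append(x)
            x += rng.expovariate(1.0)
        # The transmitted copy starts on a randomly chosen grandparental strand
        # and switches strands at every crossover.
        on_grandparent = rng.random() < 0.5
        previous = 0.0
        for position in crossovers + [morgans]:
            if on_grandparent:
                from_grandparent += position - previous
            on_grandparent = not on_grandparent
            previous = position
    # This parent supplies half of the child's DNA in total.
    return 0.5 * from_grandparent / (n_chromosomes * morgans)


rng = random.Random(0)
shares = [grandparent_share(rng=rng) for _ in range(1000)]
print(min(shares), sum(shares) / len(shares), max(shares))
```

The average comes out near 25%, but individual runs scatter around it, which is the point: the per-ancestor contribution is a random variable, not a fixed power of 1/2.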

Regarding the mathematics of genetics, I'd love such a source myself. But the first thing to do before getting into the math is to learn all of the complications that would go into building reasonable models of DNA descent. For instance, when you look at the Ancestry DNA data, you will find that segments have lengths given in units called centiMorgans (cM). DNA is measured at sporadic places along its length. If two people have exactly the same sequence of A, T, G, or C along some portion, then the length of that segment is measured and reported as a match. You might guess that if two segments cover the same number of positions of DNA, then they will have the same cM. But that is not the case. First of all, because the positions of measurement are not equally spread, a match at 500 consecutive positions on one segment may actually be over a longer strand of DNA than a match at 500 consecutive positions on a different segment. Furthermore, it is known that some areas of certain chromosomes recombine faster than others. If two people match on a segment that recombines very quickly, you may expect it to come from a more recent common ancestor than a common segment with the same number of measured positions and the same amount of DNA which is known to recombine more slowly.

Then there's the problem with how DNA is measured. At each measured position, a nucleotide reading is recorded for each of the two corresponding places on paired chromosomes. The result might be AA, or AT, or even ?C (where ? means "no-call", meaning the measurement failed to give a result). If you look at the raw data file, you may see three consecutive positions recorded as AA, AG, and CT. That does not mean that one chromosome showed AAC while the other showed AGT. The measurement process does not know, for a given pair, which chromosome produced which member of the pair. So it can often happen by chance that over a huge number of potential matches and a huge amount of DNA in each match, you can find a sequence of 500 consecutive positions or so where one person has a letter that matches one of the two letters of another person at the same positions, but where the letters that match at each position are actually jumping back and forth between the chromosomes. These people would be declared a match, even though the match isn't arising from a single strand of DNA. This is a false positive, called Identical by State, as opposed to Identical by Descent.
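A toy sketch of how such a false positive arises, using the three positions from the example above. The second person's strands are made up for illustration, the function names are hypothetical, and no-calls are not handled. This is not how any testing company's matching algorithm is actually implemented; it only shows the "shares at least one letter at every position" test that unphased data permits.

```python
def unphased(strand_a, strand_b):
    """Collapse two strands into per-position letter pairs, discarding which strand is which."""
    return ["".join(sorted(pair)) for pair in zip(strand_a, strand_b)]


def shares_letter(reading1, reading2):
    """True if two unphased readings have at least one letter in common."""
    return bool(set(reading1) & set(reading2))


def longest_half_identical_run(person1, person2):
    """Length of the longest run of consecutive positions sharing a letter."""
    best = current = 0
    for reading1, reading2 in zip(person1, person2):
        current = current + 1 if shares_letter(reading1, reading2) else 0
        best = max(best, current)
    return best


# True (normally unobservable) strands; person 2 is an invented example.
person1_strands = ("AAC", "AGT")
person2_strands = ("ATC", "CGG")

person1 = unphased(*person1_strands)  # AA, AG, CT as in the text
person2 = unphased(*person2_strands)

# Every position shares a letter, so the raw data looks like a match ...
print(longest_half_identical_run(person1, person2))
# ... yet no complete strand is actually shared between the two people.
print(set(person1_strands) & set(person2_strands))
```

Here the shared letters come from the first, second, and first strand of person 2 in turn, which is the "jumping back and forth" described above.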

Another complication is introduced by endogamy, which causes significant pedigree collapse. If a person's parents are siblings, for instance, then that person got all of their DNA from just two grandparents instead of four. Thus, even though they are two generations away from those grandparents, the expected DNA contribution of each grandparent to the grandchild is instead comparable to what a parent would provide. Using a regression algorithm or the like to predict how closely related the grandchild is to a grandparent would overestimate the closeness of the relationship, predicting a parent/child relationship instead. This is an extreme case, but endogamy causes lots of problems with predicting relationships in common practice. For instance, it makes it significantly harder for people with Ashkenazi Jewish heritage to use DNA to test hypotheses about their family trees.
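The sibling-parents case can be checked with a small path-counting sketch: an ancestor's expected contribution is the sum of $(1/2)^k$ over every line of descent of length $k$ leading to them. The pedigree dictionaries and the name `expected_share` are hypothetical, and the calculation gives expected values only, ignoring the randomness of recombination.

```python
from fractions import Fraction


def expected_share(pedigree, person, ancestor):
    """Expected fraction of `person`'s DNA that traces back to `ancestor`.

    `pedigree` maps each person to a tuple of their two parents; anyone
    missing from it is treated as a founder with unknown parents.
    """
    if person == ancestor:
        return Fraction(1)
    parents = pedigree.get(person, ())
    # Each parent passes on half, so average the parents' shares.
    return sum((expected_share(pedigree, p, ancestor) for p in parents), Fraction(0)) / 2


# Ordinary pedigree: four distinct grandparents.
ordinary = {
    "child": ("father", "mother"),
    "father": ("pat_grandpa", "pat_grandma"),
    "mother": ("mat_grandpa", "mat_grandma"),
}

# Extreme endogamy: the parents are full siblings.
collapsed = {
    "child": ("father", "mother"),
    "father": ("grandpa", "grandma"),
    "mother": ("grandpa", "grandma"),
}

print(expected_share(ordinary, "child", "pat_grandpa"))
print(expected_share(collapsed, "child", "grandpa"))
```

In the collapsed pedigree the grandparent's expected share equals that of a parent, which is why a predictor trained on ordinary pedigrees would guess the wrong relationship.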

I haven't investigated the ethnicity algorithms -- I spend my time trying to use DNA to identify cousins and find common ancestors. But one reason I haven't approached ethnicity is that I've seen lots of anecdotal accounts that ethnicity results aren't reproducible -- testing the same person at different companies gives very different results. So you should take the percentages you gave with a grain of salt.

The main concept in finding cousins with DNA is predicting relationship closeness from DNA results. I see people use two approaches. One is gathering data from pairs of people about their relationships and their DNA comparisons to build distributions of DNA matches for each type of relationship (Blaine Bettinger has done the most visible such analysis). The other is selecting a model for how recombination happens, creating an initial population, and simulating the propagation of DNA through generations. In the simulation, you know exactly how each person is related to every other person.

Both have their drawbacks, but I'm inclined to trust the simulation method more. The data used in the first method is self-reported, from different companies with different standards, and people are not necessarily using good standards of proof for the relationships they are submitting. Garbage in, garbage out. But perhaps for some of the closest relationships, the data may be pretty good. The second method is only as good as the model of recombination and the coding of the simulation, and that's the sort of thing I'd like to find a good book about.

2

You're way overthinking this. The test is reporting ethnicity percentages based on sets of specific DNA markers. Those translate fairly directly to actual numbers and do not require rounding to account for number of ancestors. While it is possible (sometimes common) for someone to only have markers from one ethnic group of the list used by the testing service (every service has their own list and they also change over time), nobody (with rare exceptions like isolated populations) has just one ancestry and it all depends on what markers you use.

For example, I am 94-100% Ashkenazi (most common European Jewish) over all the major testing sites. This is correct for my recent ancestry, going back a few hundred years. But if you look at different types of markers, you get what's called deep ancestry. Jews came to Europe with the Romans 1000-2000 years ago (some came later of course and maybe some came earlier but I don't know). There were a lot of intermarriages with Europeans, then the population stabilized and just stayed Jewish (a lot more complex than I'm making it out to be, it's just an example). My point is that the way my ancestry is calculated makes an enormous difference. My deep ancestry is about 1/3 European and the rest from the Levant, Northern African, West Asian, etc.

Since you don't give examples, I don't know how homogeneous your friend's ancestral populations really are. You might be surprised at how often this assumption is wrong.

Here are some resources for you to learn more:

Start there and see where it takes you. There's a lot of great info out there.

2

The other answers are correct with respect to the complications of ethnicity estimation.

But going back to your simple mathematics: even if ethnicity were passed down exactly 50% from each parent and were measured perfectly, almost any percentage could still be matched very closely.

That's because:

• Each parent would be 50%. You have 2 of them.
• Each grandparent would be 25%. You have 4 of them.
• Each great-grandparent would be 12.5%. You have 8 of them.
• Each great-great-grandparent would be 6.25%. You have 16 of them.
• Each g3-grandparent would be 3.125%. You have 32 of them.
• Each g4-grandparent would be 1.5625%. You have 64 of them.
• Each g5-grandparent would be 0.78125%. You have 128 of them.
• Etc...

There could be pedigree collapse, where the same person fills more than one position in your tree, but let's ignore that.

If you need to get to ethnicities of 36%, 35% and 23%, then:

• The 36% could have come from one grandparent (25%), one great-great-grandparent (6.25%), one g3-grandparent (3.125%) and one g4-grandparent (1.5625%). Total = 35.9375%
• The 35% could have come from one grandparent (25%), one great-great-grandparent (6.25%), one g3-grandparent (3.125%) and one g5-grandparent (0.78125%). Total = 35.15625%
• The 23% could have come from one great-grandparent (12.5%), one great-great-grandparent (6.25%), one g3-grandparent (3.125%) and one g5-grandparent (0.78125%). Total = 22.65625%
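These combinations are just the binary expansions of the percentages, rounded to the nearest 1/128 (seven generations back). A minimal sketch, assuming one single-ethnicity ancestor per generation used and exact halving; the function name `dyadic_ancestors` is made up for illustration:

```python
from fractions import Fraction


def dyadic_ancestors(percent, max_generation=7):
    """Round `percent` to the nearest multiple of 1/2**max_generation and
    list the generations (1 = parent, 2 = grandparent, ...) that each
    supply one ancestor to reach that total."""
    units = round(Fraction(percent) / 100 * 2**max_generation)
    generations = [
        g for g in range(1, max_generation + 1)
        if (units >> (max_generation - g)) & 1
    ]
    total = float(Fraction(units, 2**max_generation) * 100)
    return generations, total


for percent in (36, 35, 23):
    print(percent, dyadic_ancestors(percent))
```

Generation 2 is a grandparent, 3 a great-grandparent, 4 a great-great-grandparent, and 5, 6, 7 are the g3-, g4- and g5-grandparents, so the output reproduces the three combinations listed above. Raising `max_generation` allows ever closer approximations.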

This can be drawn like this, with the colors highlighting the ancestors who were 100% from an ethnicity and passed it down. The percentages represent the amount the person received from them, assuming exactly half was obtained from each parent:

The red ancestors total 35.9375% (36%)

The green ancestors total 35.15625% (35%)

The yellow ancestors total 22.65625% (23%)

And the grey ancestors are the unspecified ethnicities and total 6.25% (6%)