Explaining and interpreting YFull raw data statistics?


YFull displays several statistics about the BAM / Full Genomes sample used to create the YFull report.

Unfortunately there is no explanation of what the statistics really mean in terms of impacts on interpretation or in terms of quality.



The NoCall numbers also do not match, which makes the same kit a little more confusing to interpret.

The YFull.com FAQ provides no explanation. The FullGenomes.com FAQ provides none either, although it does charge differently based on the resolution of the test you buy, and each resolution produces different numbers of SNPs and different coverage.

The ISOGG NextGen Testing comparison matrix gives some explanation of the depth (x) / resolution differences: its comparison chart shows that FG's 10x test is 'about' equivalent to the FTDNA BigY, while the FG 20x, 30x, and Elite tests provide much more detail. But neither that page nor the ISOGG YFull page provides an explanation of these statistics.

I ask because across the 8 FamilyTreeDNA.com BigY tests I have completed, the resolution and file sizes have been all over the map and not necessarily consistent.


I know BigY is a 'scan and see what it can detect' test, but I am not sure whether this is a sample quality issue, a testing issue, or no issue at all. I am absolutely still getting value from the tests; I would just like to better understand the statistics being presented.

Question: Has anyone found a good explanation of each of these statistics presented and what they mean in terms of quality and impact on interpretation?


Posted 2016-05-24T19:51:32.323

Reputation: 5 469



I've been doing a bit of research on the subject myself and might be of some assistance.

ChrY BAM file size is straightforward: it is pretty much just the file size of the raw data that maps to the Y chromosome, including position information and quality scores.

The "Reads" refers to the number of separate lengths of DNA read by the sequencer. In the case of BigY, they use the Illumina HiSeq2000 that does paired end reads at 2x100bp (when prepared, the DNA sample is fragmented into lengths of around 250 to 500 base pairs, and the first and last 100 base pairs of each fragment are sequenced). For your example, there are 9,867,748 100bp reads in the ChrY portion of the BAM file.

For mapped vs. unmapped, sometimes measured reads don't align with the reference human genome and a position in the genome can't be assigned. I can't recall if FTDNA removes unmapped reads from the BAM, or if YFull just considers reads on the Y chromosome (which are likely mapped) for the raw data numbers. Either way, all the reads used in the analysis are considered mapped.

So at 9,867,748 reads and 100bp per read, that is 986,774,800 base pairs sequenced. When mapped, there is usually some overlap, and when using a BAM viewing tool like IGV this is represented like the figure on the YFull report below "medium depth coverage", where each white bar is a read of a certain length in its presumed position and the orange line is the single base pair being examined. A good result has more overlapping reads (this is called depth), hopefully with identical results. More depth and concurrence in the data improve confidence that a result is accurate.
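The arithmetic and the idea of depth can be sketched in a few lines of Python. This is only an illustration: the function names and the toy read positions are made up by me, and a real BAM viewer works from the alignment records rather than a hand-written list.

```python
def total_sequenced_bases(read_count, read_length_bp):
    """Total bases sequenced, ignoring any overlap between reads."""
    return read_count * read_length_bp


def depth_at(position, reads):
    """Count how many reads cover a given position.

    `reads` is a list of (start, length) tuples; a read covers
    positions start .. start + length - 1.
    """
    return sum(1 for start, length in reads if start <= position < start + length)


# Figures quoted in the post: 9,867,748 reads of 100 bp each.
print(f"{total_sequenced_bases(9_867_748, 100):,}")  # 986,774,800

# Toy reads (hypothetical positions, for illustration only).
toy_reads = [(1, 100), (40, 100), (90, 100), (250, 100)]
print(depth_at(95, toy_reads))   # 3 overlapping reads at position 95
print(depth_at(200, toy_reads))  # 0, a gap that would show up as a no-call
```

A position with a depth of 3 where all three reads agree is more trustworthy than one covered by a single read, which is the point of the "depth" statistics.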

Taking this overlap into account, the amount of the Y chromosome covered here (at a minimum of 1 read to a maximum of 8008 reads, with a mean of 85 reads and a median of 59 reads) is 13,461,595 bases. The analysis uses the GRCh37/hg19 version of the reference human genome, in which the Y chromosome has an overall length of about 59 megabase pairs. Of this, certain portions, including heterochromatic regions like the centromere and a large section of the "q" or long arm of the Y, are unreadable/unmappable with current technology because of their structure and repetitiveness. The remaining readable length is about 26 megabase pairs. Your length coverage percentage and no-calls on the "raw data" page use the 26 megabase value, while the report measures against the 59 megabase value.
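To make the two denominators concrete, here is a rough Python sketch. The 26 Mbp and 59 Mbp lengths are the rounded values quoted above, not the exact figures YFull uses, so the percentages will only approximate what the site displays. The function name is my own.

```python
# Rounded lengths from the explanation above (assumptions, not YFull's exact values).
READABLE_Y_BP = 26_000_000   # approximate readable portion of the Y
FULL_Y_BP = 59_000_000       # approximate full Y length in GRCh37/hg19

covered_bp = 13_461_595      # bases covered by at least one read in this example


def coverage_percent(covered, denominator):
    """Percentage of `denominator` bases that are covered."""
    return 100.0 * covered / denominator


print(f"vs readable Y: {coverage_percent(covered_bp, READABLE_Y_BP):.1f}%")  # about 51.8%
print(f"vs full Y:     {coverage_percent(covered_bp, FULL_Y_BP):.1f}%")      # about 22.8%
```

The same covered length gives very different percentages depending on the denominator, which is one plausible reason the coverage and no-call figures differ between the two pages.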

Finally, of this readable euchromatic part of the Y, only portions are useful for calling SNPs/STRs with high confidence. A total of nearly 3 megabases at the ends can be discounted since they are pseudoautosomal regions, which can recombine during meiosis. Ampliconic regions (~10 Mbp) have large portions that reappear multiple times on the Y with 99%+ similarity and can be difficult to place. The X-transposed region (~4 Mbp), which was carried over from the X chromosome only a couple of million years ago, is 99% similar to the X and can be difficult to segregate from X chromosome reads. The most useful and reliable results sit in the X-degenerate region, basically what remains of the ancestral version of the chromosome from when it stopped recombining and diverged from the X chromosome. Filtering out the less reliable regions leaves 8,473,821 bp in "good regions", which YFull refers to as the "combBED area" and where it applies its mutation rate per year per base pair to predict the age of branches. For your example, 7,564,885 bp of sufficient-quality reads are in their combBED area, and any SNPs that occur there are used in the age estimations. Their paper from 2015 is available on the site and covers this as well.
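As a rough illustration of how the combBED numbers feed into age estimates, here is a small Python sketch. The two base-pair counts come from the text above. The mutation rate is a placeholder I chose for the example, so substitute the value given in YFull's paper. The function names are mine, and this is not YFull's actual procedure.

```python
COMBBED_TOTAL_BP = 8_473_821     # size of the combBED area, as quoted above
covered_combbed_bp = 7_564_885   # this kit's usable coverage inside that area


def combbed_fraction(covered_bp, total_bp=COMBBED_TOTAL_BP):
    """Fraction of the combBED area this kit covers."""
    return covered_bp / total_bp


def expected_years_per_snp(covered_bp, rate_per_bp_per_year):
    """Average years between SNPs within the covered region.

    With a constant per-base rate, the covered region accumulates
    covered_bp * rate SNPs per year, so the inverse is years per SNP.
    """
    return 1.0 / (covered_bp * rate_per_bp_per_year)


# Placeholder rate (an assumption for illustration, not taken from this post).
EXAMPLE_RATE_PER_BP_PER_YEAR = 0.82e-9

print(f"{combbed_fraction(covered_combbed_bp):.1%}")  # about 89.3% of the combBED area
print(f"{expected_years_per_snp(covered_combbed_bp, EXAMPLE_RATE_PER_BP_PER_YEAR):.0f} years per SNP")
```

The takeaway is that a kit covering less of the combBED area sees fewer SNPs per unit of time, so each SNP it does report stands for a longer span of years.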

Here's my understanding when comparing the FTDNA and FGC products. BigY targets what FTDNA refers to as the "gold standard" regions (described in FTDNA's whitepaper on the subject), covering about 11 Mbp (BigY reads do overlap beyond this), while FGC tries to cover more of the readable Y. FGC's Y Elite 2.0/2.1 also has a read length of 250 bp, and the Whole Genome Sequence is 150 bp. This longer read length helps improve read mapping, call accuracy, and gap closure. As before, lower read depth means lower confidence, and reads are not evenly distributed, so a wider-coverage 10x WGS ends up with coverage on the Y similar to BigY once it is filtered through a certain quality threshold.

Sorry for the wall of text, but I hope this helps!


Posted 2016-05-24T19:51:32.323

Reputation: 41

Hi, welcome to G&FH.SE! If you need to edit your answer (e.g. to add links to resources like FTDNA's whitepaper) you can use the edit link underneath. If you'd like to learn more about the site, there is information in the [help] center and on our companion site [Meta]. Or for a quick overview, the [tour] is available. Have fun exploring the site! – Jan Murphy – 2016-05-25T00:13:29.723