Title: The human telomere-to-telomere genomic reference reduces short read mapping errors
Legend: DNA libraries generated from 235 postmortem human brain samples were sequenced using 150 bp, paired-end Illumina chemistry to a target depth of 30X coverage. Reads were aligned to the linear GRCh38 alt-free reference genome, the linear telomere-to-telomere hs1 reference genome, or the draft human pangenome. Pangenomic alignments were created in either GRCh38 coordinates or T2T coordinates. Linear alignments were created with bwa-mem2, while pangenomic alignments were created with vg Giraffe and surjected to linear coordinates with vg surject. Mosdepth was used to calculate average depth per contig, while samtools stats was used to generate per-sample alignment statistics. A) The percentage of total library reads that aligned with a mapping quality of 0, indicating that multiple equally likely alignments were present. Each point represents one aligned library. The alignment software is indicated on the x axis, while the reference genome/coordinate system used is indicated by point color. B) The percentage of total library reads that were “properly paired” as defined by samtools. To be properly paired, the reads must align inwardly oriented with a predicted insert size less than six standard deviations away from the average insert size. C) Violin plot of average depth of coverage on chromosome Y. Overlaid points illustrate the average depth of coverage per female or male sample. Box plots indicate 5th, 25th, 50th, 75th and 95th percentile values. D) Average autosomal depth of coverage obtained after read alignment, separated by genomic coordinates and genomic format (pangenomic or linear).
Citation: Lalli J., Kovner R., Sestan N., Sanders S., Werling D. Use of pangenomic, telomere-to-telomere references reduces short read mapping errors. (2024) Manuscript in preparation.
Abstract: The recent publication of a complete, telomere-to-telomere reference human genome (T2T-CHM13) has given researchers the ability to investigate genetic variation in previously inaccessible regions of the genome such as telomeres, centromeres, and highly repetitive regions. In addition, the recent publication of a draft human pangenome promises to capture a representative portion of human diversity in one reference data object, allowing researchers to map short reads from highly diverse human samples in a less biased manner. Work with individual, well-characterized reference datasets has shown that using these references reduces mapping error and increases the accuracy and precision of variant calling, even in difficult-to-call regions of the genome. Nevertheless, it is unclear how the use of these references would affect cohort-level variant calling, or how ancestry would impact the effect of reference choice on downstream variant calling performance.
In this study, we took a cohort of 235 human samples and aligned 30X short-read data from each participant to either a linear GRCh38 reference or a linear T2T-CHM13 reference. These alignments were produced either via the traditional GATK germline short variant pipeline or via mapping with vg Giraffe to the draft human pangenome, followed by surjection to either reference coordinate system. The cohort is highly diverse, with a roughly even distribution of individuals of American Admixed, African, and European ancestries. We then examine how choice of reference affects alignment. (Work is ongoing to identify the impact of reference choice on variant calling.)
We find that aligning to a T2T reference results in a substantial reduction in off-target mapping, along with an increase in mapping rate. For example, the number of reads aligning to chromosome Y in female samples drops 100-fold. Genomic coverage is also more even on a chromosome-by-chromosome basis with a T2T reference. However, the inclusion of highly repetitive regions of the genome does increase the number of ambiguously mapped reads (defined as reads with a mapping quality, MQ, of 0). Use of a pangenomic alignment method increases the read mapping rate, though the subsequent surjection process produces a net reduction in coverage depth because reads mapping to large genomic insertions are dropped from the resulting linear alignments. Pangenomic alignment appears to provide enough reference variation in otherwise repetitive genomic regions to reduce the ambiguous mapping rate to standard GATK-GRCh38 levels.
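The two read-level metrics used here (percentage of reads with MQ 0 and percentage of properly paired reads) can be derived from the summary ("SN") lines of samtools stats output. The following is a minimal Python sketch, not the authors' analysis code. The function names are hypothetical, the use of "raw total sequences" as the denominator for "total library reads" is an assumption, and the numbers in the example are made up purely for illustration.

```python
def parse_samtools_stats_sn(text):
    """Collect the tab-separated 'SN' summary lines of samtools stats into a dict."""
    summary = {}
    for line in text.splitlines():
        if not line.startswith("SN\t"):
            continue  # skip comments and the other sections (FFQ, IS, COV, ...)
        fields = line.split("\t")
        key = fields[1].rstrip(":")
        try:
            summary[key] = float(fields[2])
        except (IndexError, ValueError):
            continue  # ignore malformed or non-numeric entries
    return summary


def alignment_percentages(summary):
    """Express MQ0 and properly paired reads as a percentage of all reads.

    Assumption: 'raw total sequences' stands in for "total library reads".
    """
    total = summary["raw total sequences"]
    return {
        "pct_mq0": 100.0 * summary["reads MQ0"] / total,
        "pct_properly_paired": 100.0 * summary["reads properly paired"] / total,
    }


# Illustrative input only: these counts are invented, not study data.
example_stats = (
    "# Summary Numbers.\n"
    "SN\traw total sequences:\t1000\n"
    "SN\treads properly paired:\t970\t# proper-pair bit set\n"
    "SN\treads MQ0:\t50\t# mapped and MQ=0\n"
)

example_result = alignment_percentages(parse_samtools_stats_sn(example_stats))
print(example_result)
```

In practice the text would be read from one samtools stats output file per alignment, so that the same sample can be compared across reference genomes and aligners.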
Investigator: Donna Werling, PhD
About the Lab: Donna Werling is interested in characterizing sex-differential risk mechanisms in autism spectrum disorder (ASD). During her doctoral work in the laboratory of Dan Geschwind at the University of California, Los Angeles, Werling used functional genomics, human genetics and bioinformatics approaches to understand the relationship between sex and genetic risk in ASD. Visit The Werling Lab for more information.