Wednesday, April 06, 2022

Genetic variation and the complete human genome sequence

The new complete human genome sequence adds an extra 8% of DNA sequence that's a source of variation in the human population. The sequence also corrects some errors in the current standard reference genome.

This is my fifth post on the complete telomere-to-telomere sequence of the human genome in cell line CHM13 (T2T-CHM13). There were six papers in the April 1st edition of Science. My posts on all six papers are listed at the bottom of this post.

The focus of this paper is on variation within the human population. The authors found one million additional variants in the complete T2T-CHM13 sequence that were not found in the current standard reference genome (GRCh38). In addition, the authors identify thousands of spurious single nucleotide polymorphisms (SNPs) and correct many inaccuracies in the standard reference genome.

Aganezov, S., Yan, S.M., Soto, D.C., Kirsche, M., Zarate, S., Avdeyev, P., Taylor, D.J., Shafin, K., Shumate, A., Xiao, C. et al. (2022) A complete reference genome improves analysis of human genetic variation. Science 376:54. [doi: 10.1126/science.abl3533]

Compared to its predecessors, the Telomere-to-Telomere CHM13 genome adds nearly 200 million base pairs of sequence, corrects thousands of structural errors, and unlocks the most complex regions of the human genome for clinical and functional study. We show how this reference universally improves read mapping and variant calling for 3202 and 17 globally diverse samples sequenced with short and long reads, respectively. We identify hundreds of thousands of variants per sample in previously unresolved regions, showcasing the promise of the T2T-CHM13 reference for evolutionary and biomedical discovery. Simultaneously, this reference eliminates tens of thousands of spurious variants per sample, including reduction of false positives in 269 medically relevant genes by up to a factor of 12. Because of these improvements in variant discovery coupled with population and functional genomic resources, T2T-CHM13 is positioned to replace GRCh38 as the prevailing reference for human genetics.

There's no doubt that the T2T-CHM13 sequence is more complete than the current reference standard. However, there's an enormous amount of annotation information tied to the GRCh38 sequence that won't be easy to transfer to T2T-CHM13. This information is crucial for current GWAS that aim to identify genetic diseases, and it's also an accurate record of all the variants that have been identified to date. It's possible to transfer that information to the T2T-CHM13 genome, but given that the nucleotide numbering system is different and that there are corrections to the GRCh38 genome, it's going to take a lot of time and effort to replace GRCh38 with a new standard reference genome.
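To give a rough sense of why different nucleotide numbering makes the transfer laborious, here's a minimal Python sketch of coordinate "liftover" between two assemblies. Everything in it is an assumption made for illustration: the `Block` type, the `lift` function, and the block coordinates are invented and do not correspond to real GRCh38 or T2T-CHM13 positions. Real pipelines rely on genome-wide alignments (for example, chain files used by tools such as UCSC liftOver) and have to deal with strands, multiple chromosomes, and rearrangements.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional, Sequence


class Block(NamedTuple):
    """One gap-free aligned stretch shared by the old and new reference."""
    src_start: int  # 0-based start in the old reference
    dst_start: int  # 0-based start in the new reference
    length: int     # number of bases aligned without gaps


# Invented blocks for illustration only (hypothetical coordinates).
BLOCKS = [
    Block(0, 0, 1000),
    Block(1000, 1500, 2000),  # 500 bases of new sequence precede this block
    Block(3200, 3500, 800),   # old positions 3000-3199 have no counterpart
]


def lift(pos: int, blocks: Sequence[Block] = BLOCKS) -> Optional[int]:
    """Map an old-reference position to the new reference.

    Returns None when the position falls outside every aligned block,
    which is the case where an annotation cannot be transferred
    automatically and needs manual review.
    """
    starts = [b.src_start for b in blocks]  # blocks must be sorted by src_start
    i = bisect_right(starts, pos) - 1       # last block starting at or before pos
    if i < 0:
        return None
    block = blocks[i]
    offset = pos - block.src_start
    if offset >= block.length:
        return None
    return block.dst_start + offset


if __name__ == "__main__":
    for old_pos in (500, 1000, 3100, 3200):
        print(old_pos, "->", lift(old_pos))
```

Even in this toy version, a variant at old position 1000 ends up with a different number in the new assembly, and a variant at old position 3100 has no new coordinate at all. Multiply that by every annotated variant and gene in the existing databases and you can see where the time and effort goes.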

The authors of this paper are clearly in favor of replacing GRCh38 and they recognize some of the problems.

Given these advances, we advocate for a rapid transition to the T2T-CHM13 genome as a reference. Although we appreciate that transitioning institutional databases, pipelines, and clinical knowledge from GRCh38 to T2T-CHM13 will require substantial bioinformatics and clinical effort, we provide several resources to advance this goal. On a practical level, improvements to large genomic regions, such as entire p-arms of the acrocentric chromosomes, and the discovery of clinically relevant genes and disease-causing variants justify the labor and cost required to incorporate T2T-CHM13 into basic science and clinical genomic studies. On a technical level, T2T-CHM13 simplifies genome analysis and interpretation because it consists of 23 complete linear sequences and is free of “patch,” unplaced, or unlocalized sequences. Many of the corrections introduced by T2T-CHM13 were previously noted and addressed by the GRC as “fix patches,” but few studies use these existing resources.

This assumes that the T2T-CHM13 sequence is as good as it's ever going to get and no "patches" or corrections will ever be needed. That's not a good assumption.

If all the data from GRCh38 is transferred to T2T-CHM13, then there's no question that the complete sequence would be a better tool for genome-wide association studies (GWAS), but that's not the only consideration. GWAS have been useful in identifying some diseases with obvious genetic defects, but over the past ten years the results have not been particularly useful. There are weak associations with several polygenic traits but it's not clear that these are leading to any significant treatments. The addition of extra markers isn't going to make much of a difference. We have to ask ourselves whether the trouble and expense of switching reference genomes (assuming it can be done) is going to pay off in the long run.

There are similar issues with studying human genetic variation. We already have enough markers in the current standard genome to address global diversity and human evolution. There's no way to justify spending millions of dollars for a slight improvement in studying genetic variation.

If I had tens of millions of dollars, I would rather spend it on educating genomics researchers about evolution and critical thinking in an effort to improve the quality of the scientific literature. For example, I would give some of it (one million?) to a scientist who could write a book about junk DNA. :-)


