Sunday, April 03, 2022

Epigenetic markers in the last 8% of the human genome sequence

The newly sequenced part of the human genome contains the same kinds of chromatin regions as the rest of the genome, and they don't tell us very much about which regions are functional and which ones are junk.

This is my second post on the complete telomere-to-telomere sequence of the human genome in cell line CHM13 (T2T-CHM13). There were six papers in the April 1st edition of Science. My posts on all six papers are listed at the bottom of this post.

Gershman et al. mapped histone modifications and methylation patterns in about 225 Mb of DNA that is found in the T2T-CHM13 sequence but not in the standard reference sequence (GRCh38). They were primarily interested in correlating epigenetic markers with the expression of 2680 putative new genes in the extra sequence. As they say in the paper, "These gene predictions require detailed study for functionality and validation."

Gershman, A., Sauria, M.E., Hook, P.W., Hoyt, S.J., Razaghi, R., Koren, S., Altemose, N., Caldas, G.V., Vollger, M.R., Logsdon, G.A. et al. (2022) Epigenetic patterns in a complete human genome. Science 376:58. [doi: 10.1126/science.abj5089]

The completion of a telomere-to-telomere human reference genome, T2T-CHM13, has resolved complex regions of the genome, including repetitive and homologous regions. Here, we present a high-resolution epigenetic study of previously unresolved sequences, representing entire acrocentric chromosome short arms, gene family expansions, and a diverse collection of repeat classes. This resource precisely maps CpG methylation (32.28 million CpGs), DNA accessibility, and short-read datasets (166,058 previously unresolved chromatin immunoprecipitation sequencing peaks) to provide evidence of activity across previously unidentified or corrected genes and reveals clinically relevant paralog-specific regulation. Probing CpG methylation across human centromeres from six diverse individuals generated an estimate of variability in kinetochore localization. This analysis provides a framework with which to investigate the most elusive regions of the human genome, granting insights into epigenetic regulation.

This work extends the ENCODE results to cover the extra 8% of the genome. The authors mapped chromatin features that may be associated with biologically relevant promoter regions, and they found such features for 57 of the 2680 putative genes. Twenty of these putative genes were (putative) lncRNAs and 19 were pseudogenes. Three of them look like possible protein-coding genes.
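To put those counts in perspective, here is a minimal Python sketch that turns them into percentages. It uses only the numbers quoted above; the variable names and the `percent` helper are my own illustrative choices, and I'm assuming the three categories aren't meant to be an exhaustive breakdown of the 57.

```python
# Counts quoted in this post, as summarized from Gershman et al.
PUTATIVE_GENES = 2680        # putative new genes in the extra sequence
WITH_PROMOTER_MARKS = 57     # putative genes with promoter-like chromatin features

# Breakdown of the 57 as given above (assumed not to be exhaustive).
breakdown = {"lncRNA": 20, "pseudogene": 19, "protein-coding": 3}


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


share_with_marks = percent(WITH_PROMOTER_MARKS, PUTATIVE_GENES)
unclassified = WITH_PROMOTER_MARKS - sum(breakdown.values())

print(f"{share_with_marks}% of putative genes have promoter-like marks")
for label, count in breakdown.items():
    print(f"{label}: {count} ({percent(count, WITH_PROMOTER_MARKS)}% of the 57)")
print(f"not categorized in this post: {unclassified}")
```

In other words, only about 2% of the putative genes show these promoter-like features, and only three of them look like possible protein-coding genes.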

The authors refer to all these features as epigenetic markers although it's obvious (to me) that most of them have nothing to do with genuine regulation. It's another case where the terminology is inconsistent and tends to imply function when it shouldn't. My main beef with this paper is the complete lack of discussion about the functional relevance of the markers and the implicit assumption that merely identifying a transcript is good evidence for function. For example, here's what they say in the Discussion.

Of the previously unresolved genes, we found 57 with evidence of active promoters, including H3K4me3 or H3K27ac marks, in more than one cell type. We found 82 genes with a single cell type supporting active promoters, providing evidence that these previously unresolved gene annotations are functionally active across tissues. With more data from different tissue types, we may identify even more functional genes.

Now, don't get me wrong, I agree that evidence of expression is a requirement in identifying functional genes. But there's an additional step that's required and that's the demonstration that the transcript has a biological function. This is not just nitpicking since there is overwhelming evidence (IMHO) for thousands of transcripts that are junk RNA, and thus the DNA sequence that's transcribed is not a gene. The absence of any sensible discussion of this fact is very troubling but it's a characteristic of such papers. They avoid any mention of junk DNA but at least they are careful not to make any claims about the proportion of the genome that's functional.


