Saturday, May 14, 2022

Editing the Wikipedia article on non-coding DNA

I decided to edit the Wikipedia article on non-coding DNA by adding new sections on "Noncoding genes," "Promoters and regulatory sequences," "Centromeres," and "Origins of replication." That didn't go over very well with the Wikipedia police so they deleted the sections on "Noncoding genes" and "Origins of replication." (I'm trying to restore them so you may see them come back when you check the link.)

I also decided to re-write the introduction to make it more accurate but my version has been deleted three times in favor of the original version you see now on the website. I have been threatened with being reported to Wikipedia for disruptive edits.

The introduction has been restored to the version that talks about the ENCODE project and references Nessa Carey's book. I tried to move that paragraph to the section on the ENCODE project and I deleted the reference to Carey's book on the grounds that it is not scientifically accurate [see Nessa Carey doesn't understand junk DNA]. The Wikipedia police have restored the original version three times without explaining why they think we should mention the ENCODE results in the introduction to an article on non-coding DNA and without explaining why Nessa Carey's book needs to be referenced.

The group that's objecting includes Ramos1990, Qzd, and Trappist the monk. (I am Genome42.) They seem to be part of a group that is opposed to junk DNA and resists the creation of a separate article for junk DNA. They want junk DNA to be part of the article on non-coding DNA for reasons that they don't/won't explain.

The main problem is the confusion between "noncoding DNA" and "junk DNA." Some parts of the article are reasonably balanced but other parts imply that any function found in noncoding DNA is a blow against junk DNA. The best way to solve this problem is to have two separate articles: one on noncoding DNA and its functions and another on junk DNA. There has been a lot of resistance to this among the current editors and I can only assume that this is because they don't see the distinction. I tried to explain it in the discussion thread on splitting by pointing out that we don't talk about non-regulatory DNA, non-centromeric DNA, non-telomeric DNA, or non-origin DNA and there's no confusion about the distinction between these parts of the genome and junk DNA. So why do we single out noncoding DNA and get confused?

It looks like it's going to be a challenge to fix the current Wikipedia page(s) and even more of a challenge to get a separate entry for junk DNA.

Here is the warning that I have received from Ramos1990.

Your recent editing history shows that you are currently engaged in an edit war; that means that you are repeatedly changing content back to how you think it should be, when you have seen that other editors disagree. To resolve the content dispute, please do not revert or change the edits of others when you are reverted. Instead of reverting, please use the talk page to work toward making a version that represents consensus among editors. The best practice at this stage is to discuss, not edit-war. See the bold, revert, discuss cycle for how this is done. If discussions reach an impasse, you can then post a request for help at a relevant noticeboard or seek dispute resolution. In some cases, you may wish to request temporary page protection.

Being involved in an edit war can result in you being blocked from editing—especially if you violate the three-revert rule, which states that an editor must not perform more than three reverts on a single page within a 24-hour period. Undoing another editor's work—whether in whole or in part, whether involving the same or different material each time—counts as a revert. Also keep in mind that while violating the three-revert rule often leads to a block, you can still be blocked for edit warring—even if you do not violate the three-revert rule—should your behavior indicate that you intend to continue reverting repeatedly.

I guess that's very clear. You can't correct content to the way you think it should be as long as other editors disagree. I explained the reason for all my changes in the "history" but none of the other editors have bothered to explain why they reverted to the old version. Strange.


Friday, April 15, 2022

Most lncRNAs are junk

A hard-hitting review will be published in Annual Review of Genomics and Human Genetics. It shows that the case for large numbers of functional lncRNAs is grossly exaggerated.

A long-time Sandwalk reader (Ole Kristian Tørresen) alerted me to a paper that's coming out next October in Annual Review of Genomics and Human Genetics. (Thank you, Ole.) The authors of the review are Chris Ponting from the University of Edinburgh (Edinburgh, Scotland, UK) and Wilfried Haerty at the Earlham Institute in Norwich, UK. They have been arguing the case for junk DNA for the past two decades but most of their arguments are ignored. This paper won't be so easy to ignore because it makes the case forcefully and critically reviews all the false claims for function. I'm going to quote a few juicy parts because I know that many of you will not be able to access the preprint.

Friday, April 08, 2022

The structures of centromeres

The new complete human genome sequence gives us a first-time look at the structures of human centromeres.

This is my sixth post on the complete telomere-to-telomere sequence of the human genome in cell line CHM13 (T2T-CHM13). There were six papers in the April 1st edition of Science. My posts on all six papers are listed at the bottom of this post.

The new long-read and ultra-long-read sequencing techniques have revealed the organization of centromeric regions of human chromosomes. The basic structure of these regions has been known for many years [Centromere DNA] but the overall arrangement of the various repeats and the large-scale organization of the centromere was not clear.

The core functional regions of centromeres consist of multiple copies of tandemly repeated alpha-satellite sequences. These are 171 bp AT-rich sequences that serve as attachment sites for kinetochore proteins. The kinetochore proteins interact with spindle fibers that pull the chromosomes to the opposite ends of a dividing cell. The core region is surrounded by pericentromeric regions containing additional repeats (mostly HSat2 and HSat3). The alpha-satellite repeats take up almost 3% of the genome and the pericentromeric repeats occupy an additional 3%.1 That's why centromeres are a major component of the functional part of the human genome. (Centromeres are classic examples of functional noncoding DNA and knowledgeable scientists have known about them for half a century.2)

Altemose, N., Logsdon, G.A., Bzikadze, A.V., Sidhwani, P., Langley, S.A., Caldas, G.V., Hoyt, S.J., Uralsky, L., Ryabov, F.D., Shew, C.J. et al. (2021) Complete genomic and epigenetic maps of human centromeres. Science 376:56. [doi: 10.1126/science.abl4178]

Existing human genome assemblies have almost entirely excluded repetitive sequences within and near centromeres, limiting our understanding of their organization, evolution, and functions, which include facilitating proper chromosome segregation. Now, a complete, telomere-to-telomere human genome assembly (T2T-CHM13) has enabled us to comprehensively characterize pericentromeric and centromeric repeats, which constitute 6.2% of the genome (189.9 megabases). Detailed maps of these regions revealed multimegabase structural rearrangements, including in active centromeric repeat arrays. Analysis of centromere-associated sequences uncovered a strong relationship between the position of the centromere and the evolution of the surrounding DNA through layered repeat expansions. Furthermore, comparisons of chromosome X centromeres across a diverse panel of individuals illuminated high degrees of structural, epigenetic, and sequence variation in these complex and rapidly evolving regions.

The details of the organization of each centromere aren't important. There's a lot of variation between centromeres on different chromosomes and between specific centromeres in different individuals. The authors looked at the organization of X chromosome centromeres in a variety of different individuals from different parts of the world. As expected, there was considerable variation and, as expected, there was more variation within Africans than in all other populations combined.

It shouldn't come as a surprise to find that the authors want more T2T sequences.

This high degree of satellite DNA polymorphism underlines the need to produce T2T assemblies from genetically diverse individuals, to fully capture the extent of human variation in these regions, and to shed light on their recent evolution.

I really hope the granting agencies don't fall for this. It would be much better to spend the resources on exploring the biological function of splice variants (alternative splicing?) and putative noncoding genes in order to resolve the junk DNA controversy. It would also help to devote some of this money to the proper education of science undergraduates.

The authors claim to have discovered 676 genes and pseudogenes within the centromeres. They claim that this includes 23 protein-coding genes and 141 lncRNA genes. They present evidence that three of these genes might have a function, which means that 161/164 of these "genes" are "putative" genes until we see evidence of function.3


1. It's unlikely that most of this 6% is absolutely required for the proper functioning of the centromeres because there are many individuals with much less centromere DNA. That's why I only attribute about 1% of the genome to functional centromere sequence.

2. Unknowledgeable scientists continue to be shocked when they discover that noncoding DNA can have a biological function. This is because they weren't taught properly as undergraduates.

3. I don't understand why so many scientists are unable to see the difference between a putative gene and a real gene.

Wednesday, April 06, 2022

Genetic variation and the complete human genome sequence

The new complete human genome sequence adds an extra 8% of DNA sequence that's a source of variation in the human population. The sequence also corrects some errors in the current standard reference genome.

This is my fifth post on the complete telomere-to-telomere sequence of the human genome in cell line CHM13 (T2T-CHM13). There were six papers in the April 1st edition of Science. My posts on all six papers are listed at the bottom of this post.

Tuesday, April 05, 2022

Two different views of the history of molecular biology

How can different molecular biologists have such opposite views of the history of their field?

I'm posting links to two papers without comment. One of them is from my friend and colleague Alex Palazzo and the other is from James Shapiro who is not my friend or colleague. Both papers have been published in reputable peer-reviewed journals.

Transcription activity in repeat regions of the human genome

A detailed examination of the new complete human genome reveals that 54% of it consists of various repetitive elements. Some of them are transcribed and some aren't.

This is my fourth post on the complete telomere-to-telomere sequence of the human genome in cell line CHM13 (T2T-CHM13). There were six papers in the April 1st edition of Science. My posts on all six papers are listed at the bottom of this post.

The fourth paper extends the ENCODE-type analysis of the T2T-CHM13 sequence by focusing on repeats.

Hoyt, S.J., Storer, J.M., Hartley, G.A., Grady, P.G., Gershman, A., de Lima, L.G., Limouse, C., Halabian, R., Wojenski, L., Rodriguez, M. et al. (2021) From telomere to telomere: the transcriptional and epigenetic state of human repeat elements. Science 376:57. [doi: 10.1126/science.abk3112]

Mobile elements and repetitive genomic regions are sources of lineage-specific genomic innovation and uniquely fingerprint individual genomes. Comprehensive analyses of such repeat elements, including those found in more complex regions of the genome, require a complete, linear genome assembly. We present a de novo repeat discovery and annotation of the T2T-CHM13 human reference genome. We identified previously unknown satellite arrays, expanded the catalog of variants and families for repeats and mobile elements, characterized classes of complex composite repeats, and located retroelement transduction events. We detected nascent transcription and delineated CpG methylation profiles to define the structure of transcriptionally active retroelements in humans, including those in centromeres. These data expand our insight into the diversity, distribution, and evolution of repetitive regions that have shaped the human genome.

The most useful part of this paper is the complete analysis of all repetitive elements in the T2T-CHM13 genome. This gives us, for the first time, a complete picture of a human genome. The exact values of the various components aren't important because there's considerable variation within the human population but the big picture is informative.

These are the percentages of the human genome occupied by the different classes of repetitive DNA.

  • SINEs 12.8%
  • Retrotransposon 0.15%
  • LINEs 20.7%
  • LTRs 8.8%
  • DNA transposons 3.6%
  • simple repeats 8%

The total comes to 54%. There are other estimates that are higher because of a more lenient cutoff value for sequence similarity but this gives you a pretty good idea of what the genome looks like. Most of the transposon-related sequence consists of fragments of once active transposons so the fraction of the genome consisting of true selfish DNA capable of transposing is a small fraction of this 54%.
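If you want to check the arithmetic yourself, here's a quick sketch in Python. It only uses the six percentages listed above; the variable and function names are my own, and the "share of repeats" calculation is just a convenience for seeing which classes dominate, not something taken from the paper.

```python
# Percentages of the T2T-CHM13 genome for each repeat class, as listed above.
repeat_classes = {
    "SINEs": 12.8,
    "Retrotransposon": 0.15,
    "LINEs": 20.7,
    "LTRs": 8.8,
    "DNA transposons": 3.6,
    "simple repeats": 8.0,
}


def total_percent(classes):
    """Sum the genome percentages of all repeat classes."""
    return sum(classes.values())


def share_of_repeats(classes):
    """Express each class as a percentage of the total repetitive fraction."""
    total = total_percent(classes)
    return {name: 100.0 * pct / total for name, pct in classes.items()}


total = total_percent(repeat_classes)
shares = share_of_repeats(repeat_classes)

print(f"Total repetitive fraction: {total:.2f}% of the genome")
# List the classes from largest to smallest share of the repetitive fraction.
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name:16s} {share:5.1f}% of repeats")
```

The listed values add up to 54.05%, which rounds to the 54% quoted above, and LINEs and SINEs together make up well over half of the repetitive fraction.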

We have every reason to believe that most of this DNA is junk DNA based on several lines of evidence developed over the past 50 years but most of the authors of this paper are reluctant to reach that conclusion so the fact that these repetitive sequences might be junk isn't mentioned in the paper. Instead, the authors concentrate on mapping CpG methylation sites and transcribed regions. They refer to this as "functional annotation" but they don't provide a definition of function.

We provide a high-confidence functional annotation of repeats across the human genome.

As you might expect, the repeat elements that retain vestiges of promoters are often transcribed and this includes adjacent genomic sequences that are found near these promoters (e.g. near LTRs). The long stretches of short tandem repeats (e.g. satellite DNA) do not contain any sequences that resemble promoters so these regions are not transcribed. (The authors seem to be a bit surprised by this result.) Further work is needed to decide how much of this DNA is truly functional and which parts contribute to human uniqueness. Naturally, that will require much more ENCODE-type work and T2T sequencing of other primates.

Finally, our work demonstrates the need to increase efforts toward achieving T2T-level assemblies for nonhuman primates to fully understand the complexity and impact of repeat-derived genomic innovations that define primate lineages, including humans. Although we find repeat variants that appear enriched or specific to the human lineage, in the absence of T2T-level assemblies from other primate species, we cannot truly attribute these elements to specific human phenotypes. Thus, the extent of variation described herein highlights the need to expand the effort to create human and nonhuman primate pan-genome references to support exploration of repeats that define the true extent of human variation.

This will cost millions of dollars. I suspect the grant applications have already been sent.



Monday, April 04, 2022

If you were a Harvard freshman you could take a course on the dark matter of the genome.

Check out this freshperson seminar course on Parts Unknown: The Dark Matter of the Genome at Harvard. It is offered by Amanda J. Whipple of the Department of Molecular and Cellular Biology. She works on noncoding RNAs in the brain. Harvard likes to think of itself as one of the top universities in the world so this seminar course must be an example of world class critical thinking.

Heaven help us if this is what future American leaders are being taught.

Did you know that genes, traditionally defined as DNA encoding protein, only account for two percent of the entire human genome? What is the purpose of the remaining 98% of the genome? Is it simply “junk DNA”? This seminar will explore the large portion of our genome that has been neglected by scientists for many years because its purpose was not known. We will examine research findings which demonstrate non-coding sequences, previously assigned as “junk DNA”, play crucial roles in the development and maintenance of a healthy organism. We will further discuss how these non-coding sequences are promising targets for drug design and disease diagnosis. We will then visit a local research laboratory (either virtually or in person as deemed appropriate) and engage with active scientists regarding the scientific research enterprise.

A thorough understanding of the human genome not only provides a foundation for any student interested in the life sciences, it enables one to engage more deeply in related political and societal debates, which is expected to become even more central as scientists further uncover the dark matter of our genomes.

Setting aside the sarcasm, how did we get to a stage where a prominent researcher at one of the top research universities in the world could write such a course description?



Sunday, April 03, 2022

Karen Miga and the telomere-to-telomere consortium

Karen Miga deserves a lot of the credit for the complete human genome sequence.

Karen Miga is a professor at the University of California, Santa Cruz, and she's been working for several years on sequencing the repetitive regions of the genome. She is a co-founder of the telomere-to-telomere consortium that just published a complete sequence of the human genome. She made a significant contribution to long-read (~20 kb) and ultra-long-read (>100 kb) sequencing and that's a major technological achievement that's worthy of prizes.

Read the interview on CBC (Canada) Quirks & Quarks at Scientists sequence complete, gap-free human genome for the first time and watch the YouTube video.


Miga did her Ph.D. with Huntington Willard at Duke University. Hunt has been working on centromeres for more than 40 years and some of my colleagues may remember him when he was a professor at the University of Toronto in the Department of Medical Genetics.



What do we do with two different human genome reference sequences?

It's going to be extremely difficult, perhaps impossible, to merge the new complete human genome sequence with the current standard reference genome.

The source DNA for the new telomere-to-telomere (T2T) human genome sequence was a cell line derived from a molar pregnancy. This meant that the DNA was essentially haploid, thus avoiding the complications of sequencing diploid DNA which contains two highly similar but different genomes. The cell line, CHM13, lacks a Y chromosome but that's trivial since a complete T2T sequence of a Y chromosome will soon be published and it can be added to the T2T-CHM13 genome sequence [Telomere-to-telomere sequencing of a complete human genome].

Segmental duplications in the human genome

The new completed human genome sequence contains some previously unknown large duplications (segmental duplications).

This is my third post on the complete telomere-to-telomere sequence of the human genome in cell line CHM13 (T2T-CHM13). There were six papers in the April 1st edition of Science. My posts on all six papers are listed at the bottom of this post.

Epigenetic markers in the last 8% of the human genome sequence

The newly sequenced part of the human genome contains the same chromatin regions as the rest of the genome and they don't tell us very much about which regions are functional and which ones are junk.

This is my second post on the complete telomere-to-telomere sequence of the human genome in cell line CHM13 (T2T-CHM13). There were six papers in the April 1st edition of Science. My posts on all six papers are listed at the bottom of this post.

A complete human genome sequence (2022)

The first complete human genome sequence has finally been published.

This is my first post on the complete telomere-to-telomere sequence of the human genome in cell line CHM13 (T2T-CHM13). There were six papers in the April 1st edition of Science. My posts on all six papers are listed at the bottom of this post.

Friday, April 01, 2022

Illuminating dark matter in human DNA?

A few months ago, the press office of the University of California at San Diego issued a press release with a provocative title ...

Illuminating Dark Matter in Human DNA - Unprecedented Atlas of the "Book of Life"

The press release was posted on several prominent science websites and Facebook groups. According to the press release, much of the human genome remains mysterious (dark matter) even 20 years after it was sequenced. According to the senior author of the paper, Bing Ren, we still don't understand how genes are expressed and how they might go awry in genetic diseases. He says,

A major reason is that the majority of the human DNA sequence, more than 98 percent, is non-protein-coding, and we do not yet have a genetic code book to unlock the information embedded in these sequences.

We've heard that story before and it's getting very boring. We know that 90% of our genome is junk, about 1% encodes proteins, and another 9% contains lots of functional DNA sequences, including regulatory elements. We've known about regulatory elements for more than 50 years so there's nothing mysterious about that component of noncoding DNA.

Wednesday, March 30, 2022

John Mattick's new book

John Mattick and Paulo Amaral have written a book that promotes their views on the content of the human genome. It will be available next August. Their main thesis is that the human genome is full of genes for regulatory RNAs and there's very little junk. A secondary theme is that some very smart scientists have been totally wrong about molecular biology and molecular evolution for the past fifty years.

I pretty much know what's going to be in the book [see John Mattick presents his view of genomes]. I also know that most of his claims don't stand up to close scrutiny but that's not going to prevent it from being touted as a true paradigm shift. (It's actually a paradigm shaft.) I suspect it's going to get favorable reviews in Science and Nature.

John Mattick presents his view of genomes

John Mattick has a new book coming out in August where he defends the notion that most of our genome is full of genes for functional noncoding RNAs. We have a pretty good idea what he's going to say. This is a talk he gave at Oxford on May 17, 2019.

Here are a few statements that should pique your interest.

  • (0:57) He says that his upcoming book is tentatively titled "the misunderstandings of molecular biology."
  • (1:11) He says that "the assumption has been very deeply embedded from the time of the lac operon on that genes equated to proteins."
  • (2:30) There have been three "surprises" in molecular biology: (1) introns, (2) eukaryotic genomes are full of 'selfish' DNA, and (3) "gene number does not scale with developmental complexity."
  • (4:30) It is unjustified to assume that transposon-related sequences are junk and that assumption leads to misinterpretation of neutral evolution.
  • (6:00) The view that evolution of regulatory sequences is mostly responsible for developmental complexity (Evo-Devo) has never been justified.
  • (8:45) A lot of obtuse theoretical discussion about how the number of regulatory protein-coding genes increases quadratically as the total number of protein-coding genes increases in a bacterial genome. At some point there would have to be more protein-coding regulatory genes than total protein-coding genes, so that limits the evolution of bacteria.
  • (13:40) The proportion of noncoding DNA increases with developmental complexity, topping out at humans.
  • (14:00) The vast majority of the genome in complex organisms is differentially transcribed in different cells and different tissues.
  • (14:15) The whole genome is alive on both strands.
  • (14:20) There are two possibilities: junk RNA or abundant functional transcripts and that explains complex organisms.
  • Mattick then takes several minutes to document the fact that there are abundant transcripts, a fact that has been known for the better part of sixty years but he does not mention that. All of his statements carry the implicit assumption that these transcripts are functional.
  • (20:20) He makes the boring, and largely irrelevant, point that most disease-associated loci are located in noncoding regions (GWAS). He's responding to a critic who asked why, if these things (transcripts) are real, don't we see genetic evidence of it.
  • (24:00) Noncoding RNAs have all of the characteristics of functional RNAs with an emphasis on the fact that their expression is often only detected in specific cell types.
  • (31:50) It has now been shown that everything that protein transcription factors can do can be done by noncoding RNA.
  • (32:15) "I want to say to you that conservation is totally misunderstood." Apparently, lack of conservation implies nothing about function.
  • (41:00) RNAs control phase separation. There's a whole other level of cell organization that we never dreamed of. (Ironically, he gives nucleoli as an example of something we never dreamed of.)
  • (42:36) "This is called soft metaphysics, and it's just come into biology, and it's spectacular in its implications."
  • (46:25) Almost every lncRNA is alternatively spliced in mice and humans.
  • (46:30) There's more alternative splicing in human protein-coding genes than in mouse protein-coding genes but the extra splicing in humans is mostly in the 5' untranslated region. (I'm sure it has nothing to do with the fact that tons more RNA-Seq experiments have been done on human tissues.) "We think this is due to the increased sophistication of the regulation of these genes for the evolution of cognition."
  • (48:00) At least 20% of the human genome is evolutionarily conserved at the level of RNA structure and this does not require any assumptions.
  • (55:00) The talk ends at 55 minutes. That's too bad because I'm sure Mattick had a dozen more slides explaining why all of those transcripts are functional, as opposed to the few selected examples he picked. I'm sure he also had a lot of data refuting all of the evidence in favor of junk DNA but he just ran out of time.

I don't know if there were questions but, if there were, I bet that none of them challenged Mattick's main thesis.