Friday, April 08, 2022

The structures of centromeres

The new complete human genome sequence gives us a first-time look at the structures of human centromeres.

This is my sixth post on the complete telomere-to-telomere sequence of the human genome in cell line CHM13 (T2T-CHM13). There were six papers in the April 1st edition of Science. My posts on all six papers are listed at the bottom of this post.

The new long-read and ultra-long-read sequencing techniques have revealed the organization of centromeric regions of human chromosomes. The basic structure of these regions has been known for many years [Centromere DNA] but the overall arrangement of the various repeats and the large-scale organization of the centromere was not clear.

The core functional regions of centromeres consist of multiple copies of tandemly repeated alpha-satellite sequences. These are 171 bp AT-rich sequences that serve as attachment sites for kinetochore proteins. The kinetochore proteins interact with spindle fibers that pull the chromosomes to the opposite ends of a dividing cell. The core region is surrounded by pericentromeric regions containing additional repeats (mostly HSat2 and HSat3). The alpha-satellite repeats take up almost 3% of the genome and the pericentromeric repeats occupy an additional 3%.1 That's why centromeres are a major component of the functional part of the human genome. (Centromeres are classic examples of functional noncoding DNA and knowledgeable scientists have known about them for half a century.2)

Altemose, N., Logsdon, G.A., Bzikadze, A.V., Sidhwani, P., Langley, S.A., Caldas, G.V., Hoyt, S.J., Uralsky, L., Ryabov, F.D., Shew, C.J. et al. (2022) Complete genomic and epigenetic maps of human centromeres. Science 376:56. [doi: 10.1126/science.abl4178]

Existing human genome assemblies have almost entirely excluded repetitive sequences within and near centromeres, limiting our understanding of their organization, evolution, and functions, which include facilitating proper chromosome segregation. Now, a complete, telomere-to-telomere human genome assembly (T2T-CHM13) has enabled us to comprehensively characterize pericentromeric and centromeric repeats, which constitute 6.2% of the genome (189.9 megabases). Detailed maps of these regions revealed multimegabase structural rearrangements, including in active centromeric repeat arrays. Analysis of centromere-associated sequences uncovered a strong relationship between the position of the centromere and the evolution of the surrounding DNA through layered repeat expansions. Furthermore, comparisons of chromosome X centromeres across a diverse panel of individuals illuminated high degrees of structural, epigenetic, and sequence variation in these complex and rapidly evolving regions.
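The two figures quoted in the abstract (6.2% and 189.9 megabases) can be cross-checked with a little arithmetic. The sketch below is only a back-of-the-envelope calculation; the variable names are my own and the result is limited by the rounding in the abstract.

```python
# Figures quoted in the abstract of Altemose et al.
repeat_fraction = 0.062     # pericentromeric + centromeric repeats, as a fraction of the genome
repeat_megabases = 189.9    # the same repeats, in megabases

# Genome size implied by those two numbers (an estimate, not a measured value)
implied_genome_mb = repeat_megabases / repeat_fraction

# Size of a 1% slice of that genome, for comparison
one_percent_mb = implied_genome_mb * 0.01

print(f"Implied genome size: {implied_genome_mb:.0f} Mb")
print(f"1% of the genome:    {one_percent_mb:.1f} Mb")
```

The implied total is a little over 3,000 Mb, so each percentage point of the genome corresponds to roughly 30 Mb of sequence.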

The details of the organization of each centromere aren't important. There's a lot of variation between centromeres on different chromosomes and between specific centromeres in different individuals. The authors looked at the organization of X chromosome centromeres in a variety of different individuals from different parts of the world. As expected, there was considerable variation and, as expected, there was more variation within Africans than in all other populations combined.

It shouldn't come as a surprise to find that the authors want more T2T sequences.

This high degree of satellite DNA polymorphism underlines the need to produce T2T assemblies from genetically diverse individuals, to fully capture the extent of human variation in these regions, and to shed light on their recent evolution.

I really hope the granting agencies don't fall for this. It would be much better to spend the resources on exploring the biological function of splice variants (alternative splicing?) and putative noncoding genes in order to resolve the junk DNA controversy. It would also help to devote some of this money to the proper education of science undergraduates.

The authors claim to have discovered 676 genes and pseudogenes within the centromeres. They claim that this includes 23 protein-coding genes and 141 lncRNA genes. They present evidence that three of these genes might have a function, which means that 161/164 of these "genes" are "putative" genes until we see evidence of function.3


1. It's unlikely that most of this 6% is absolutely required for the proper functioning of the centromeres because there are many individuals with much less centromere DNA. That's why I only attribute about 1% of the genome to functional centromere sequence.

2. Unknowledgeable scientists continue to be shocked when they discover that noncoding DNA can have a biological function. This is because they weren't taught properly as undergraduates.

3. I don't understand why so many scientists are unable to see the difference between a putative gene and a real gene.

Wednesday, April 06, 2022

Genetic variation and the complete human genome sequence

The new complete human genome sequence adds an extra 8% of DNA sequence that's a source of variation in the human population. The sequence also corrects some errors in the current standard reference genome.

This is my fifth post on the complete telomere-to-telomere sequence of the human genome in cell line CHM13 (T2T-CHM13). There were six papers in the April 1st edition of Science. My posts on all six papers are listed at the bottom of this post.

Tuesday, April 05, 2022

Two different views of the history of molecular biology

How can different molecular biologists have such opposite views of the history of their field?

I'm posting links to two papers without comment. One of them is from my friend and colleague Alex Palazzo and the other is from James Shapiro who is not my friend or colleague. Both papers have been published in reputable peer-review journals.

Transcription activity in repeat regions of the human genome

A detailed examination of the new complete human genome reveals that 54% of it consists of various repetitive elements. Some of them are transcribed and some aren't.

This is my fourth post on the complete telomere-to-telomere sequence of the human genome in cell line CHM13 (T2T-CHM13). There were six papers in the April 1st edition of Science. My posts on all six papers are listed at the bottom of this post.

The fourth paper extends the ENCODE-type analysis of the T2T-CHM13 sequence by focusing on repeats.

Hoyt, S.J., Storer, J.M., Hartley, G.A., Grady, P.G., Gershman, A., de Lima, L.G., Limouse, C., Halabian, R., Wojenski, L., Rodriguez, M. et al. (2022) From telomere to telomere: the transcriptional and epigenetic state of human repeat elements. Science 376:57. [doi: 10.1126/science.abk3112]

Mobile elements and repetitive genomic regions are sources of lineage-specific genomic innovation and uniquely fingerprint individual genomes. Comprehensive analyses of such repeat elements, including those found in more complex regions of the genome, require a complete, linear genome assembly. We present a de novo repeat discovery and annotation of the T2T-CHM13 human reference genome. We identified previously unknown satellite arrays, expanded the catalog of variants and families for repeats and mobile elements, characterized classes of complex composite repeats, and located retroelement transduction events. We detected nascent transcription and delineated CpG methylation profiles to define the structure of transcriptionally active retroelements in humans, including those in centromeres. These data expand our insight into the diversity, distribution, and evolution of repetitive regions that have shaped the human genome.

The most useful part of this paper is the complete analysis of all repetitive elements in the T2T-CHM13 genome. This gives us, for the first time, a complete picture of a human genome. The exact values of the various components aren't important because there's considerable variation within the human population but the big picture is informative.

These are the percentages of the human genome occupied by the different classes of repetitive DNA.

  • SINEs 12.8%
  • Retrotransposon 0.15%
  • LINEs 20.7%
  • LTRs 8.8%
  • DNA transposons 3.6%
  • simple repeats 8%

The total comes to 54%. There are other estimates that are higher because of a more lenient cutoff value for sequence similarity but this gives you a pretty good idea of what the genome looks like. Most of the transposon-related sequence consists of fragments of once active transposons so the fraction of the genome consisting of true selfish DNA capable of transposing is a small fraction of this 54%.
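The sum is easy to verify. Here is a minimal sketch that adds up the percentages from the list above; the dictionary name and the grouping of everything except simple repeats as "transposon-related" are my own choices for illustration, not categories defined in the paper.

```python
# Percentages of the T2T-CHM13 genome, copied from the list above
repeat_classes = {
    "SINEs": 12.8,
    "Retrotransposon": 0.15,
    "LINEs": 20.7,
    "LTRs": 8.8,
    "DNA transposons": 3.6,
    "simple repeats": 8.0,
}

# Total fraction of the genome occupied by all listed repeat classes
total = sum(repeat_classes.values())

# Everything except the simple repeats (illustrative grouping, see lead-in)
transposon_related = total - repeat_classes["simple repeats"]

print(f"All repeats:        {total:.2f}%")
print(f"Transposon-related: {transposon_related:.2f}%")
```

The listed values add up to just over 54%, with about 46% coming from the transposon-related classes.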

We have every reason to believe that most of this DNA is junk DNA based on several lines of evidence developed over the past 50 years but most of the authors of this paper are reluctant to reach that conclusion so the fact that these repetitive sequences might be junk isn't mentioned in the paper. Instead, the authors concentrate on mapping CpG methylation sites and transcribed regions. They refer to this as "functional annotation" but they don't provide a definition of function.

We provide a high-confidence functional annotation of repeats across the human genome.

As you might expect, the repeat elements that retain vestiges of promoters are often transcribed and this includes adjacent genomic sequences that are found near these promoters (e.g. near LTRs). The long stretches of short tandem repeats (e.g. satellite DNA) do not contain any sequences that resemble promoters so these regions are not transcribed. (The authors seem to be a bit surprised by this result.) Further work is needed to decide how much of this DNA is truly functional and which parts contribute to human uniqueness. Naturally, that will require much more ENCODE-type work and T2T sequencing of other primates.

Finally, our work demonstrates the need to increase efforts toward achieving T2T-level assemblies for nonhuman primates to fully understand the complexity and impact of repeat-derived genomic innovations that define primate lineages, including humans. Although we find repeat variants that appear enriched or specific to the human lineage, in the absence of T2T-level assemblies from other primate species, we cannot truly attribute these elements to specific human phenotypes. Thus, the extent of variation described herein highlights the need to expand the effort to create human and nonhuman primate pan-genome references to support exploration of repeats that define the true extent of human variation.

This will cost millions of dollars. I suspect the grant applications have already been sent.



Monday, April 04, 2022

If you were a Harvard freshman you could take a course on the dark matter of the genome.

Check out this freshperson seminar course on Parts Unknown: The Dark Matter of the Genome at Harvard. It is offered by Amanda J. Whipple of the Department of Molecular and Cellular Biology. She works on noncoding RNAs in the brain. Harvard likes to think of itself as one of the top universities in the world so this seminar course must be an example of world class critical thinking.

Heaven help us if this is what future American leaders are being taught.

Did you know that genes, traditionally defined as DNA encoding protein, only account for two percent of the entire human genome? What is the purpose of the remaining 98% of the genome? Is it simply “junk DNA”? This seminar will explore the large portion of our genome that has been neglected by scientists for many years because its purpose was not known. We will examine research findings which demonstrate non-coding sequences, previously assigned as “junk DNA”, play crucial roles in the development and maintenance of a healthy organism. We will further discuss how these non-coding sequences are promising targets for drug design and disease diagnosis. We will then visit a local research laboratory (either virtually or in person as deemed appropriate) and engage with active scientists regarding the scientific research enterprise.

A thorough understanding of the human genome not only provides a foundation for any student interested in the life sciences, it enables one to engage more deeply in related political and societal debates, which is expected to become even more central as scientists further uncover the dark matter of our genomes.

Setting aside the sarcasm, how did we get to a stage where a prominent researcher at one of the top research universities in the world could write such a course description?



Sunday, April 03, 2022

Karen Miga and the telomere-to-telomere consortium

Karen Miga deserves a lot of the credit for the complete human genome sequence.

Karen Miga is a professor at the University of California, Santa Cruz, and she's been working for several years on sequencing the repetitive regions of the genome. She is a co-founder of the telomere-to-telomere consortium that just published a complete sequence of the human genome. She made a significant contribution to long-read (~20 kb) and ultra-long-read (>100 kb) sequencing and that's a major technological achievement that's worthy of prizes.

Read the interview on CBC (Canada) Quirks & Quarks at "Scientists sequence complete, gap-free human genome for the first time" and watch the YouTube video.


Miga did her Ph.D. with Huntington Willard at Duke University. Hunt has been working on centromeres for more than 40 years and some of my colleagues may remember him from when he was a professor at the University of Toronto in the Department of Medical Genetics.



What do we do with two different human genome reference sequences?

It's going to be extremely difficult, perhaps impossible, to merge the new complete human genome sequence with the current standard reference genome.

The source DNA for the new telomere-to-telomere (T2T) human genome sequence was a cell line derived from a molar pregnancy. This meant that the DNA was essentially haploid, thus avoiding the complications of sequencing diploid DNA which contains two highly similar but different genomes. The cell line, CHM13, lacks a Y chromosome but that's trivial since a complete T2T sequence of a Y chromosome will soon be published and it can be added to the T2T-CHM13 genome sequence [Telomere-to-telomere sequencing of a complete human genome].

Segmental duplications in the human genome

The new completed human genome sequence contains some previously unknown large duplications (segmental duplications).

This is my third post on the complete telomere-to-telomere sequence of the human genome in cell line CHM13 (T2T-CHM13). There were six papers in the April 1st edition of Science. My posts on all six papers are listed at the bottom of this post.

Epigenetic markers in the last 8% of the human genome sequence

The newly sequenced part of the human genome contains the same chromatin regions as the rest of the genome and they don't tell us very much about which regions are functional and which ones are junk.

This is my second post on the complete telomere-to-telomere sequence of the human genome in cell line CHM13 (T2T-CHM13). There were six papers in the April 1st edition of Science. My posts on all six papers are listed at the bottom of this post.

A complete human genome sequence (2022)

The first complete human genome sequence has finally been published.

This is my first post on the complete telomere-to-telomere sequence of the human genome in cell line CHM13 (T2T-CHM13). There were six papers in the April 1st edition of Science. My posts on all six papers are listed at the bottom of this post.

Friday, April 01, 2022

Illuminating dark matter in human DNA?

A few months ago, the press office of the University of California at San Diego issued a press release with a provocative title ...

Illuminating Dark Matter in Human DNA - Unprecedented Atlas of the "Book of Life"

The press release was posted on several prominent science websites and Facebook groups. According to the press release, much of the human genome remains mysterious (dark matter) even 20 years after it was sequenced. According to the senior author of the paper, Bing Ren, we still don't understand how genes are expressed and how they might go awry in genetic diseases. He says,

A major reason is that the majority of the human DNA sequence, more than 98 percent, is non-protein-coding, and we do not yet have a genetic code book to unlock the information embedded in these sequences.

We've heard that story before and it's getting very boring. We know that 90% of our genome is junk, about 1% encodes proteins, and another 9% contains lots of functional DNA sequences, including regulatory elements. We've known about regulatory elements for more than 50 years so there's nothing mysterious about that component of noncoding DNA.

Wednesday, March 30, 2022

John Mattick's new book

John Mattick and Paulo Amaral have written a book that promotes their views on the content of the human genome. It will be available next August. Their main thesis is that the human genome is full of genes for regulatory RNAs and there's very little junk. A secondary theme is that some very smart scientists have been totally wrong about molecular biology and molecular evolution for the past fifty years.

I pretty much know what's going to be in the book [see John Mattick presents his view of genomes]. I also know that most of his claims don't stand up to close scrutiny but that's not going to prevent it from being touted as a true paradigm shift. (It's actually a paradigm shaft.) I suspect it's going to get favorable reviews in Science and Nature.

John Mattick presents his view of genomes

John Mattick has a new book coming out in August where he defends the notion that most of our genome is full of genes for functional noncoding RNAs. We have a pretty good idea what he's going to say. This is a talk he gave at Oxford on May 17, 2019.

Here are a few statements that should pique your interest.

  • (0:57) He says that his upcoming book is tentatively titled "the misunderstandings of molecular biology."
  • (1:11) He says that "the assumption has been very deeply embedded from the time of the lac operon on that genes equated to proteins."
  • (2:30) There have been three "surprises" in molecular biology: (1) introns, (2) eukaryotic genomes are full of 'selfish' DNA, and (3) "gene number does not scale with developmental complexity."
  • (4:30) It is unjustified to assume that transposon-related sequences are junk and that assumption leads to misinterpretation of neutral evolution.
  • (6:00) The view that evolution of regulatory sequences is mostly responsible for developmental complexity (Evo-Devo) has never been justified.
  • (8:45) A lot of obtuse theoretical discussion about how the number of regulatory protein-coding genes increases quadratically as the total number of protein-coding genes increases in a bacterial genome. At some point there would have to be more protein-coding regulatory genes than total protein-coding genes, so that limits the evolution of bacteria.
  • (13:40) The proportion of noncoding DNA increases with developmental complexity, topping out at humans.
  • (14:00) The vast majority of the genome in complex organisms is differentially transcribed in different cells and different tissues.
  • (14:15) The whole genome is alive on both strands.
  • (14:20) There are two possibilities: junk RNA or abundant functional transcripts and that explains complex organisms.
  • Mattick then takes several minutes to document the fact that there are abundant transcripts, a fact that has been known for the better part of sixty years, but he does not mention that. All of his statements carry the implicit assumption that these transcripts are functional.
  • (20:20) He makes the boring, and largely irrelevant, point that most disease-associated loci are located in noncoding regions (GWAS). He's responding to a critic who asked why, if these things (transcripts) are real, we don't see genetic evidence of it.
  • (24:00) Noncoding RNAs have all of the characteristics of functional RNAs with an emphasis on the fact that their expression is often only detected in specific cell types.
  • (31:50) It has now been shown that everything that protein transcription factors can do can be done by noncoding RNA.
  • (32:15) "I want to say to you that conservation is totally misunderstood." Apparently, lack of conservation imputes nothing about function.
  • (41:00) RNAs control phase separation. There's a whole other level of cell organization that we never dreamed of. (Ironically, he gives nucleoli as an example of something we never dreamed of.)
  • (42:36) "This is called soft metaphysics, and it's just come into biology, and it's spectacular in its implications."
  • (46:25) Almost every lncRNA is alternatively spliced in mice and humans.
  • (46:30) There's more alternative splicing in human protein-coding genes than in mouse protein-coding genes but the extra splicing in humans is mostly in the 5' untranslated region. (I'm sure it has nothing to do with the fact that tons more RNA-Seq experiments have been done on human tissues.) "We think this is due to the increased sophistication of the regulation of these genes for the evolution of cognition."
  • (48:00) At least 20% of the human genome is evolutionarily conserved at the level of RNA structure and this does not require any assumptions.
  • (55:00) The talk ends at 55 minutes. That's too bad because I'm sure Mattick had a dozen more slides explaining why all of those transcripts are functional, as opposed to the few selected examples he picked. I'm sure he also had a lot of data refuting all of the evidence in favor of junk DNA but he just ran out of time.

I don't know if there were questions but, if there were, I bet that none of them challenged Mattick's main thesis.


Saturday, March 26, 2022

Science communication in the modern world

Science editors asked young scientists to imagine what kind of course they would have created if they could go back to a time before the pandemic [A pandemic education]. Three of the courses were about science communication.

COM 145: Identification, analysis, and communication of scientific evidence

This course focuses on developing the skills required to translate scientific evidence into accessible information for the general public, especially under circumstances that lead to the intensification of fear and misinformation. Discussions will cover the principles of the scientific method, as well as its theoretical and practical relevance in counteracting the dissemination of pseudoscience, particularly on social media. This course discusses chapters from Carl Sagan’s book The Demon-Haunted World, certain peer-reviewed and retracted papers, and materials related to key science issues, such as the anti-vaccine movement. For the final project, students will comprehensibly communicate a scientific topic to the public.

Camila Fonseca Amorim da Silva, University of Sao Paulo, Sao Paulo, Brazil

COM 198: Everyday science communication

As scientific discoveries become increasingly specialized, the lack of understanding by the general public undermines trust in scientists and causes the spread of misinformation. This course will be taught by scientists and communication specialists who will provide students with a toolset to explain scientific concepts, as well as their own research projects, to the general public. Upon completion of this course, students will be able to explain to their grandparents that viruses exist even though they can’t see them, convince their neighbors that vaccines don’t contain tracking devices, and explain the concept of exponential growth to governmental officials.

Anna Uzonyi, Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, Israel.

COM 232: Introduction to talking to regular people

Communicating science is difficult. Many scientists, having immersed themselves in the language of their field, have completely forgotten how to talk to regular people. This course hones introductory science communication skills, such as how to talk about scary things without generating mass panic, how to calmly discourage the hoarding of paper hygiene products, and how to explain why scientific knowledge changes over time. The final project will include cross examination from law school faculty, who are otherwise completely uninvolved with the course and possess minimal scientific training. Recommended for science majors who are unable to discuss impactful scientific findings without citing a P value.

Joseph Michael Cusimano, Bernard J. Dunn School of Pharmacy, Shenandoah University, Winchester, VA, USA.

They sound like interesting courses but my own take on science communication is somewhat different. I think it's very difficult for practicing scientists to communicate effectively with the general public so I tend to view science communication at several different levels. My goal is to communicate with an audience of scientists, science journalists, and people who are already familiar with science. The idea is to make sure that this intermediate group understands the scientific facts in my field and to make sure they are familiar with the major controversies.

My hope is that this intermediate group will disseminate this information to their less-informed friends and relatives and, more importantly, stop the spread of misinformation whenever they hear it.

Take junk DNA for example. It's very difficult to convince the average person that 90% of our genome is junk because the idea is so counter-intuitive and contrary to the popular counter-narratives. However, I have a chance of convincing the intermediate group, including science journalists and other scientists, who can follow the scientific arguments. If I succeed, they will at least stop spreading misinformation and false narratives and start presenting alternatives to their audiences.


Monday, March 14, 2022

Junk DNA

My book manuscript has been reviewed by some outside experts and they seem to have convinced my editor that my book is worth publishing. I hope we can get it finished soon. It would be nice to publish it in September on the 10th anniversary of the ENCODE disaster.

Meanwhile, I keep scanning the literature for mentions of junk DNA to see if scientists are finally coming to their senses. Apparently not, and that's a good thing because it means that my book is still needed. Here's the opening paragraph from a recent review of lncRNAs. The authors are in the Department of Medicine at the Medical College of Georgia, in Augusta, Georgia (USA).

Ghanam, A.R., Bryant, W.B. and Miano, J.M. (2022) Of mice and human-specific long noncoding RNAs. Mammalian Genome:1-12. [doi: 10.1007/s00335-022-09943-2]

Approximately ninety-eight percent of our genome is noncoding. Contrary to initial descriptions of this vast sea of sequence comprising “junk DNA” (Ohno 1972), comparative genomics and various next-generation sequencing studies have revealed millions of transcription factor binding sites (TFBS) (Vierstra et al. 2020) and tens of thousands of noncoding genes, most notably the class of long noncoding RNAs (LncRNAs), defined currently as processed transcripts of length > 200 base pairs with no protein-coding capacity (Rinn and Chang 2020; Statello et al. 2021). The widespread transcription of LncRNAs and abundance of regulatory sequences such as enhancers support the concept of a genome that is largely functional (ENCODE Project Consortium 2012). Such a dynamic genome should not be surprising given the complex nature of gene expression and gene function necessary for embryonic and postnatal development as well as disease processes.

  • No reasonable scientist, especially Susumu Ohno, ever said that all noncoding DNA was junk.
  • There are millions of transcription factor binding sites but most of them are spurious binding sites that have nothing to do with regulation. They simply reflect the expected behavior of typical DNA binding proteins in a large genome full of junk DNA.
  • Nobody has demonstrated that there are tens of thousands of noncoding genes. There may be tens of thousands of transcripts but that's not the same thing since you have to prove that those transcripts are functional before you can say that they come from genes.
  • There is currently no evidence to support the concept of a genome that is largely functional in spite of what the ENCODE researchers might have said ten years ago.
  • Such a genome would be very surprising, if it were true, given what we know about genomes, evolution, and basic biochemistry.

Except for those few minor details—I hope I'm not being too picky—that's a pretty good way to start a review of lncRNAs. :-)