More Recent Comments

Showing posts with label Genome. Show all posts
Showing posts with label Genome. Show all posts

Friday, April 08, 2022

The structures of centromeres

The new complete human genome sequence gives us a first-time look at the structures of human centromeres.

This is my sixth post on the complete telomere-to-telomere sequence of the human genome in cell line CHM13 (T2T-CHM13). There were six papers in the April 1st edition of Science. My posts on all six papers are listed at the bottom of this post.

The new long-read and ultra-long-read sequencing techniques have revealed the organization of centromeric regions of human chromosomes. The basic structure of these regions has been known for many years [Centromere DNA] but the overall arrangement of the various repeats and the large scale organizaton of the centromere was not clear.

The core functional regions of centromeres consist of multiple copies of tandemly repeated alpha-satellite sequences. These are 171 bp AT-rich sequences that serve as attachment sites for kinetocore proteins. The kinetochore proteins interact with spindle fibers that pull the chromosomes to the opposite ends of a dividing cell. The core region is surrounded by pericentromeric regions containing additional repeats (mostly HSat2 and HSat3). The alpha-satellite repeats take up almost 3% of the genome and the pericentromeric repeats occupy an additional 3%.1 That's why centromeres are a major component of the functional part of the human genome. (Centromeres are classic examples of functional noncoding DNA and knowledgeable scientists have known about them for half a century.2

Altemose, N., Logsdon, G.A., Bzikadze, A.V., Sidhwani, P., Langley, S.A., Caldas, G.V., Hoyt, S.J., Uralsky, L., Ryabov, F.D., Shew, C.J. and et al. (2021) Complete genomic and epigenetic maps of human centromeres. Science 376:56. [doi: 10.1126/science.abl4178]

Existing human genome assemblies have almost entirely excluded repetitive sequences within and near centromeres, limiting our understanding of their organization, evolution, and functions, which include facilitating proper chromosome segregation. Now, a complete, telomere-to-telomere human genome assembly (T2T-CHM13) has enabled us to comprehensively characterize pericentromeric and centromeric repeats, which constitute 6.2% of the genome (189.9 megabases). Detailed maps of these regions revealed multimegabase structural rearrangements, including in active centromeric repeat arrays. Analysis of centromere-associated sequences uncovered a strong relationship between the position of the centromere and the evolution of the surrounding DNA through layered repeat expansions. Furthermore, comparisons of chromosome X centromeres across a diverse panel of individuals illuminated high degrees of structural, epigenetic, and sequence variation in these complex and rapidly evolving regions.

The details of the organization of each centromere aren't important. There's a lot of variation between centromeres on different chromosomes and between specific centromeres in different individuals. The authors looked at the organization of X chromosome centromeres in a variety of different individuals from different parts of the world. As expected, there was considerable variation and, as expected, there was more variation within Africans than in all other populations combined.

It shouldn't come as a surprise to find that the authors want more T2T sequences.

This high degree of satellite DNA polymorphism underlines the need to produce T2T assemblies from genetically diverse individuals, to fully capture the extent of human variation in these regions, and to shed light on their recent evolution.

I really hope the granting agencies don't fall for this. It would be much better to spend the resources on exploring the biological function of splice variants (alternative splicing?) and putative noncoding genes in order to resolve the junk DNA controversy. It would also help to devote some of this money to the proper education of science undergraduates.

The authors claim to have discovered 676 genes and pseudogenes within the centromeres. They claim that this includes 23 protein coding genes and 141 lncRNAs genes. They present evidence that three of these genes might have a function which means that 161/164 of these "genes" are "putative" genes until we see evidence of function.3


1. It's unlikely that most of this 6% is absolutely required for the proper functioning of the centromeres because there are many individuals with much less centromere DNA. That's why I only attribute about 1% of the genome to functional centromere sequence.

2. Unknowledgeable scientists continue to be shocked when they discover that noncoding DNA can have a biological function. This is because they weren't taught properly as undergraduates.

3. I don't understand why so many scientists are unable to see the difference between a putative gene and a real gene.

Wednesday, April 06, 2022

Genetic variation and the complete human genome sequence

The new complete human genome sequence adds an extra 8% of DNA sequence that's a source of variation in the human population. The sequence also corrects some errors in the current standard reference genome.

This is my fifth post on the complete telomere-to-telomere sequence of the human genome in cell line CHM13 (T2T-CHM13). There were six papers in the April 1st edition of Science. My posts on all six papers are listed at the bottom of this post.

Tuesday, April 05, 2022

Transcription activity in repeat regions of the human genome

A detailed examination of the new complete human genome reveals that 54% of it consists of various repetitive elements. Some of them are transcribed and some aren't.

This is my fourth post on the complete telomere-to-telomere sequence of the human genome in cell line CHM13 (T2T-CHM13). There were six papers in the April 1st edition of Science. My posts on all six papers are listed at the bottom of this post.

The fourth paper extends the ENCODE-type analysis of the T2T-CHM13 sequence by focusing on repeats.

Hoyt, S.J., Storer, J.M., Hartley, G.A., Grady, P.G., Gershman, A., de Lima, L.G., Limouse, C., Halabian, R., Wojenski, L., Rodriguez, M. et al. (2021) From telomere to telomere: the transcriptional and epigenetic state of human repeat elements. Science 376:57. [doi: 10.1126/science.abk3112]

Mobile elements and repetitive genomic regions are sources of lineage-specific genomic innovation and uniquely fingerprint individual genomes. Comprehensive analyses of such repeat elements, including those found in more complex regions of the genome, require a complete, linear genome assembly. We present a de novo repeat discovery and annotation of the T2T-CHM13 human reference genome. We identified previously unknown satellite arrays, expanded the catalog of variants and families for repeats and mobile elements, characterized classes of complex composite repeats, and located retroelement transduction events. We detected nascent transcription and delineated CpG methylation profiles to define the structure of transcriptionally active retroelements in humans, including those in centromeres. These data expand our insight into the diversity, distribution, and evolution of repetitive regions that have shaped the human genome.

The most useful part of this paper is the complete analysis of all repetitive elements in the T2T-CHM13 genome. This gives us, for the first time, a complete picture of a human genome. The exact values of the various components aren't important because there's considerable variation with the human population but the big picture is informative.

These are the percentages of the human genome occupied by the different classes of repetitive DNA.

  • SINEs 12.8%
  • Retrotransposon 0.15%
  • LINEs 20.7%
  • LTRs 8.8%
  • DNA transposons 3.6%
  • simple repeats 8%

The total comes to 54%. There are other estimates that are higher because of a more lenient cutoff value for sequence similarity but this gives you a pretty good idea of what the genome looks like. Most of the transposon-related sequence consists of fragments of once active transposons so the fraction of the genome consisting of true selfish DNA capable of transposing is a small fraction of this 54%.

We have every reason to believe that most of this DNA is junk DNA based on several lines of evidence developed over the past 50 years but most of the authors of this paper are reluctant to reach that conclusion so the fact that these repetitive sequences might be junk isn't mentioned in the paper. Instead, the authors concentrate on mapping CpG methylation sites and transcribed regions. They refer to this as "functional annotation" but they don't provide a definition of function.

We provide a high-confidence functional annotation of repeats across the human genome.

As you might expect, the repeat elements that retain vestiges of promoters are often transcribed and this includes adjacent genomic sequences that are found near these promoter (e.g. near LTRs). The long stretches of short tandem repeats (e.g. satellite DNA) do not contain any sequences that resemble promoters so these regions are not transcribed. (The authors seem to be a bit surprised by this result.) Further work is needed to decide how much of this DNA is truly functional and which parts contribute to human uniqueness. Naturally, that will require much more ENCODE-type work and T2T sequencing of other primates.

Finally, our work demonstrates the need to increase efforts toward achieving T2T-level assemblies for nonhuman primates to fully understand the complexity and impact of repeat-derived genomic innovations that define primate lineages, including humans. Although we find repeat variants that appear enriched or specific to the human lineage, in the absence of T2T-level assemblies from other primate species, we cannot truly attribute these elements to specific human phenotypes. Thus, the extent of variation described herein highlights the need to expand the effort to create human and nonhuman primate pan-genome references to support exploration of repeats that define the true extent of human variation.

This will cost millions of dollars. I suspect the grant applications have already been sent.



Monday, April 04, 2022

If you were a Harvard freshman you could take a course on the dark matter of the genome.

Check out this freshperson seminar course on Parts Unknown: The Dark Matter of the Genome at Harvard. It is offfered by Amanda J. Whipple of the Department of Molecular and Cellular Biology. She works on noncoding RNAs in the brain. Harvard likes to think of itself as one of the top universities in the world so this seminar course must be an example of world class critical thinking.

Heaven help us if this is what future American leaders are being taught.

Did you know that genes, traditionally defined as DNA encoding protein, only account for two percent of the entire human genome? What is the purpose of the remaining 98% of the genome? Is it simply “junk DNA”? This seminar will explore the large portion of our genome that has been neglected by scientists for many years because its purpose was not known. We will examine research findings which demonstrate non-coding sequences, previously assigned as “junk DNA”, play crucial roles in the development and maintenance of a healthy organism. We will further discuss how these non-coding sequences are promising targets for drug design and disease diagnosis. We will then visit a local research laboratory (either virtually or in person as deemed appropriate) and engage with active scientists regarding the scientific research enterprise.

A thorough understanding of the human genome not only provides a foundation for any student interested in the life sciences, it enables one to engage more deeply in related political and societal debates, which is expected to become even more central as scientists further uncover the dark matter of our genomes.

Setting aside the sarcasm, how did we get to a stage where a prominent researcher at one of the top research universities in the world could write such a course description?



Sunday, April 03, 2022

Karen Miga and the telomere-to-telomere consortium

Karen Miga deserves a lot of the credit for the complete human genome sequence.

Karen Miga is a professor at the University of California, Santa Cruz, and she's been working for several years on sequencing the repetitive regions of the genome. She is a co-founder of the telomere-to-telomere consortium that just published a complete sequece of the human genome. She made a signficant contribution to long-read (~20 Kb) and ultra-long-read (>100 kb) sequencing and that's a major technological achievement that's worthy of prizes.

Read the interview on CBC (Canada) Quirks & Quarks at Scientists sequence complete, gap-free human genome for the first time and watch the YouTube video.


Miga did her Ph.D. with Huntington Willard at Duke University. Hunt has been working on centromeres for more than 40 yeas years and some of my colleagues may remember him when he was a professor at the University of Toronto in the Department of Medical Genetics.



What do we do with two different human genome reference sequences?

It's going to be extremely difficult, perhaps impossible, to merge the new complete human genome sequence with the current standard reference genome.

The source DNA for the new telomere-to-telomere (T2T) human genome sequence was a cell line derived from a molar pregnancy. This meant that the DNA was essentially haploid, thus avoiding the complications of sequencing diploid DNA which contains two highly similar but different genomes. The cell line, CHM13, lacks a Y chromosome but that's trivial since a complete T2T sequence of a Y chromosome will soon be published and it can be added to the T2T-CHM13 genome sequence [Telomere-to-telomere sequencing of a complete human genome].

Segmental duplications in the human genome

The new completed human genome sequence contains some previously unknown large duplicatons (segmental duplications).

This is my third post on the complete telomere-to-telomere sequence of the human genome in cell line CHM13 (T2T-CHM13). There were six papers in the April 1st edition of Science. My posts on all six papers are listed at the bottom of this post.

Epigenetic markers in the last 8% of the human genome sequence

The newly sequenced part of the human genome contains the same chromatin regions as the rest of the genome and they don't tell us very much about which regions are functional and which ones are junk.

This is my second post on the complete telomere-to-telomere sequence of the human genome in cell line CHM13 (T2T-CHM13). There were six papers in the April 1st edition of Science. My posts on all six papers are listed at the bottom of this post.

A complete human genome sequence (2022)

The first complete human genome sequence has finally been published.

This is my first post on the complete telomere-to-telomere sequence of the human genome in cell line CHM13 (T2T-CHM13). There were six papers in the April 1st edition of Science. My posts on all six papers are listed at the bottom of this post.

Friday, April 01, 2022

Illuminating dark matter in human DNA?

A few months ago, the press office of the University of California at San Diego issued a press release with a provocative title ...

Illuminating Dark Matter in Human DNA - Unprecedented Atlas of the "Book of Life"

The press release was posted on several prominent science websites and Facebook groups. According to the press release, much of the human genome remains mysterious (dark matter) even 20 years after it was sequenced. According to the senior author of the paper, Bing Ren, we still don't understand how genes are expressed and how they might go awry in genetic diseases. He says,

A major reason is that the majority of the human DNA sequence, more than 98 percent, is non-protein-coding, and we do not yet have a genetic code book to unlock the information embedded in these sequences.

We've heard that story before and it's getting very boring. We know that 90% of our genome is junk, about 1% encodes proteins, and another 9% contains lots of functional DNA sequences, including regulatory elements. We've known about regulatory elements for more than 50 years so there's nothing mysterious about that component of noncoding DNA.

Wednesday, March 30, 2022

John Mattick's new book

John Mattick and Paulo Amaral have written a book that promotes their views on the content of the human genome. It will be available next August. Their main thesis is that the human genome is full of genes for regulatory RNAs and there's very little junk. A secondary theme is that some very smart scientists have been totally wrong about molecular biology and molecular evolution for the past fifty years.

I pretty much know what's going to be in the book [see John Mattick presents his view of genomes]. I also know that most of his claims don't stand up to close scrutiny but that's not going to prevent it from being touted as a true paradigm shift. (It's actually a paradigm shaft.) I suspect it's going to get favorable reviews in Science and Nature.

John Mattick presents his view of genomes

John Mattick has a new book coming out in August where he defends the notion that most of our genome is full of genes for functonal noncoding RNAs. We have a pretty good idea what he's going to say. This is a talk he gave at Oxford on May 17, 2019.

Here are a few statements that should pique your interest.

  • (0:57) He says that his upcoming book is tentatively titled "the misunderstandings of molecular biology."
  • (1:11) He says that "the assumption has been very deeply embedded from the time of the lac operon on that genes equated to proteins."
  • (2:30) There have been three "surprises" in molecuular biology: (1) introns, (2) eukaryotic genomes are full of 'selfish' DNA, and (3) "gene number does not scale with developmental complexity."
  • (4:30) It is an unjustified assumption to assume that transposon-related seqences are junk and that leads to misinterpretation of neutral evolution.
  • (6:00) The view that evolution of regulatory sequences is mostly responsible for developmental complexity (Evo-Devo) has never been justified.
  • (8:45) A lot of obtuse theoretical discussion about how the number of regulatory protein-coding genes increases quadratically as the total number of protein-coding genes increase in a bacterial genome but at some point there has to be more protein-coding regulatory genes than total protein-coding genes so that limits the evolution of bacteria.
  • (13:40) The proportion of noncoding DNA increases with developmental complexity, topping out at humans.
  • (14:00) The vast majority of the genome in complex organisms is differentially transcribed in different cells and different tissues.
  • (14:15) The whole genome is alive on both strands.
  • (14:20) There are two possibilities: junk RNA or abundant functional transcripts and that explains complex organisms.
  • Mattick then takes several minutes to document the fact that there are abundant transcripts— a fact that has been known for the better part of sixty years but he does not mention that. All of his statements carry the implicit assumption that these transcripts are functional.
  • (20:20) He makes the boring, and largely irelevant, point that most disease-associated loci are located in noncoding regions (GWAS). He's responding to a critic who asked why, if these things (transcripts) are real, don't we see genetic evidence of it.
  • (24:00) Noncoding RNAs have all of the characteristics of functional RNAs with an emphasis on the fact that their expression is often only detected in specific cell types.
  • (31:50) It has now been shown that everything that protein transcription factors can do can be done by noncoding RNA.
  • (32:15) "I want to say to you that conservation is totally misunderstood." Apparently, lack of conservation imputes nothing about function.
  • (41:00) RNAs control phase separation. There's a whole other level of cell organization that we never dreamed of. (Ironically, he gives nucleoli as an example of something we never dreamed of.)
  • (42:36) "This is called soft metaphysics, and it's just come into biology, and it's spectacular in its implications."
  • (46:25) Almost every lncRNA is alternatively spliced in mice and humans.
  • (46:30) There's more alternative splicing in human protein-coding genes than in mice protein-coding genes but the extra splicing in humans is mostly in the 5' untranslated region. (I'm sure it has nothing to do with the fact that tons more RNA-Seq experiments have been done on human tissues.) "We think this is due to the increased sophistication of the regulation of these genes for the evolution of cognition."
  • (48:00) At least 20% of the human genome is evolutionarily conserved at the level of RNA structure and this does not require any assumptions.
  • (55:00) The talk ends at 55 minutes. That's too bad because I'm sure Mattick had a dozen more slides explaining why all of those transcripts are functional, as opposed to the few selected examples he picked. I'm sure he also had a lot of data refuting all of the evidence in favor of junk DNA but he just ran out of time.

I don't know if there were questions but, if there were, I bet that none of them challenged Mattick's main thesis.


Wednesday, November 03, 2021

What's in your genome?: 2021

This is an updated version of what's in your genome based on the latest data. The simple version is ...

about 90% of your genome is junk

Sunday, May 30, 2021

Telomere-to-telomere sequencing of a complete human genome

Here's a paper that has recently been posted on the preprint server bioRxiv.

Nurk et al. (2021) The complete sequence of a human genome. [doi: 10.1101/2021.05.26.445798]

I usually don't like to comment on preprints but this one is surely going to be published somewhere and it's important.

The authors have sequenced the entire chromosomes (telomere-to-telomere) of the 22 autosomes and the X chromosome of the cell line CHM13. The cell line is a complete hydatiform mole, which means it is derived from a molar pregnancy where a sperm combines with an egg cell that has lost its nucleus. The sperm DNA duplicates giving rise to cells that have two identical copies of each chromosome. The karyotype of the CHM13 cell line is 46,XX. The advantage of sequencing the DNA from such cell lines is that the interpretation of the sequencing results is not complicated by the heterogeneity of normal diploid cell lines. This was important because the focus of this study was on sequencing repetitive regions of the chromosomes and most chromosome pairs have different numbers of repeats.

Thursday, April 08, 2021

On the accuracy of genomics in detecting disease variants

Several diseases, such as cancers, are caused by the presence of deleterious alleles that affect the function of a gene. In the case of cancer, most of the mutations are somatic cell mutations—mutations that have occurred after fertilization. These mutations will not be passed on to future generations. However, there are some variants that are present in the germline and these will be inherited. A small percentage of these variants will cause cancer directly but most will just indicate a predisposition to develop cancer.

There are a host of other diseases that have a genetic component and the responsible alleles can also be present in the germline or due to somatic cell mutations.

Over the past fifty years or so there has been a lot of hype associated with the latest technological advances and the ability to detect deleterious germline mutations. The general public has been repeatedly told that we will soon be able to identify all disease-causing alleles and this will definitely lead to incredible medical advances in treating these diseases. Just yesterday, for example, I posted an article on predictions made by The National Genome Research Institute (USA) who predicts that by 2030,

The clinical relevance of all encountered genomic variants will be readily predictable, rendering the diagnostic designation ‘variant of uncertain significance (VUS)’ obsolete.

Similar predictions, in various forms, were made when the human genome project got under way and at various time afterword. First there was the 1000 genomes project then there was the 100,000 genome project and, of course, ENCODE. The problem is that genomics hasn't lived up to these expectations and there's a very good reason for that: it's because the problem is a lot more difficult than it seems.

One of the Facebook groups that I follow (Modern Genetics & Technology)1 alerted me to a recent paper in JAMA that addressed the problem of genomics accuracy and the prediction of pathogenic variants. I'm posting the complete abstract so you can see the extent of the problem.

AlDubayan, S.H., Conway, J.R., Camp, S.Y., Witkowski, L., Kofman, E., Reardon, B., Han, S., Moore, N., Elmarakeby, H. and Salari, K. (2020) Detection of Pathogenic Variants With Germline Genetic Testing Using Deep Learning vs Standard Methods in Patients With Prostate Cancer and Melanoma. JAMA 324:1957-1969. [doi: 10.1001/jama.2020.20457]

Importance Less than 10% of patients with cancer have detectable pathogenic germline alterations, which may be partially due to incomplete pathogenic variant detection.

Objective To evaluate if deep learning approaches identify more germline pathogenic variants in patients with cancer.

Design Setting, and Participants A cross-sectional study of a standard germline detection method and a deep learning method in 2 convenience cohorts with prostate cancer and melanoma enrolled in the US and Europe between 2010 and 2017. The final date of clinical data collection was December 2017.

Exposures Germline variant detection using standard or deep learning methods.

Main Outcomes and Measures The primary outcomes included pathogenic variant detection performance in 118 cancer-predisposition genes estimated as sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). The secondary outcomes were pathogenic variant detection performance in 59 genes deemed actionable by the American College of Medical Genetics and Genomics (ACMG) and 5197 clinically relevant mendelian genes. True sensitivity and true specificity could not be calculated due to lack of a criterion reference standard, but were estimated as the proportion of true-positive variants and true-negative variants, respectively, identified by each method in a reference variant set that consisted of all variants judged to be valid from either approach.

Results The prostate cancer cohort included 1072 men (mean [SD] age at diagnosis, 63.7 [7.9] years; 857 [79.9%] with European ancestry) and the melanoma cohort included 1295 patients (mean [SD] age at diagnosis, 59.8 [15.6] years; 488 [37.7%] women; 1060 [81.9%] with European ancestry). The deep learning method identified more patients with pathogenic variants in cancer-predisposition genes than the standard method (prostate cancer: 198 vs 182; melanoma: 93 vs 74); sensitivity (prostate cancer: 94.7% vs 87.1% [difference, 7.6%; 95% CI, 2.2% to 13.1%]; melanoma: 74.4% vs 59.2% [difference, 15.2%; 95% CI, 3.7% to 26.7%]), specificity (prostate cancer: 64.0% vs 36.0% [difference, 28.0%; 95% CI, 1.4% to 54.6%]; melanoma: 63.4% vs 36.6% [difference, 26.8%; 95% CI, 17.6% to 35.9%]), PPV (prostate cancer: 95.7% vs 91.9% [difference, 3.8%; 95% CI, –1.0% to 8.4%]; melanoma: 54.4% vs 35.4% [difference, 19.0%; 95% CI, 9.1% to 28.9%]), and NPV (prostate cancer: 59.3% vs 25.0% [difference, 34.3%; 95% CI, 10.9% to 57.6%]; melanoma: 80.8% vs 60.5% [difference, 20.3%; 95% CI, 10.0% to 30.7%]). For the ACMG genes, the sensitivity of the 2 methods was not significantly different in the prostate cancer cohort (94.9% vs 90.6% [difference, 4.3%; 95% CI, –2.3% to 10.9%]), but the deep learning method had a higher sensitivity in the melanoma cohort (71.6% vs 53.7% [difference, 17.9%; 95% CI, 1.82% to 34.0%]). The deep learning method had higher sensitivity in the mendelian genes (prostate cancer: 99.7% vs 95.1% [difference, 4.6%; 95% CI, 3.0% to 6.3%]; melanoma: 91.7% vs 86.2% [difference, 5.5%; 95% CI, 2.2% to 8.8%]).

Conclusions and Relevance Among a convenience sample of 2 independent cohorts of patients with prostate cancer and melanoma, germline genetic testing using deep learning, compared with the current standard genetic testing method, was associated with higher sensitivity and specificity for detection of pathogenic variants. Further research is needed to understand the relevance of these findings with regard to clinical outcomes.

It's really difficult to understand this paper since there are many terms that I'd have to research more thoroughly; for example, does "germline whole-exon sequencing" mean that only sperm or egg DNA was sequenced and that every single exon in the entire genome was sequenced? Were exons in noncoding genes also sequenced?

I found it much more useful to look at the accompanying editorial by Gregory Feero.

Feero, W.G. (2020) Bioinformatics, Sequencing Accuracy, and the Credibility of Clinical Genomics. JAMA 324:1945-1947. [doi: 10.1001/jama.2020.19939]

Ferro explains that the main problem is distinguishing real pathogenic variants from false positives and this can only be accomplished by first sequencing and assembling the DNA and then using various algorithms to focus on important variants. Then there's the third step.

The third step, which often requires a high level of clinical expertise, sifts through detected potentially deleterious variations to determine if any are relevant to the indication for testing. For example, exome sequencing ordered for a patient with unexplained cardiomyopathy might harbor deleterious variants in the BRCA1 gene which, while a potentially important incidental finding, does not provide a plausible molecular diagnosis for the cardiomyopathy. The complexity of the bioinformatics tools used in these 3 steps is considerable.

It's that third step that's analyzed in the AlDubayan et al. paper and one of the tools used is a deep-learning (AI) algorithm. However, the training of this algorithm requiries considerable clinical expertise and testing it requires a gold standard set of variants to serve as an internal control. As you might have guessed, that gold standard doesn't exist because the whole point of the genomics is to identify perviously unknown deleterious alleles.

Ferro warns us that "clinical genome sequencing remains largely unregulated and accuracy is highly dependant on the expertise of individual testing laboratories." He concludes that genomics still has a long way to go.

The genomics community needs to act as a coherent body to ensure reproducibility of outcomes from clinical genome or exome sequencing, or provide transparent quality metrics for individual clinical laboratories. Issues related to achieving accuracy are not new, are not limited to bioinformatics tools, and will not be surmounted easily. However, until analytic and clinical validity are ensured, conversations about the potential value that genome sequencing brings to clinical situations will be challenging for clinical centers, laboratories that provide sequencing services, and consumers. For the foreseeable future, nongeneticist clinicians should be familiar with the quality of their chosen genome-sequencing laboratory and engage expert advice before changing patient management based on a test result.

I'm guessing that Gregory Feero doesn't think that in nine years (2030) "The clinical relevance of all encountered genomic variants will be readily predictable."


1. I do NOT recommend this group. It's full of amateurs who resist leaning and one of it's main purposes is to post copies of pirated textbooks in its files. The group members get very angry when you tell them that what they are doing is illegal!

Wednesday, April 07, 2021

Bold predictions for human genomics by 2030

After spending several years working on a book about the human genome I've come to the realization that the field of genomics is not delivering on its promise to help us understand what's in your genome. In fact, genomics researchers have by and large impeded progress by coming up with false claims that need to be debunked.

My view is not widely shared by today's researchers who honestly believe they have made tremendous progress and will make even more as long as they get several billion dollars to continue funding their research. This view is nicely summarized in a Scientific American article from last fall that's really just a precis of an article that first appeared in Nature. The Nature article was written by employees of the National Human Genome Research Institute (NHGRI) at the National Institutes of Health in Bethesda, MD, USA (Green et al., 2020). Its purpose is to promote the work that NHGRI has done in the past and to summarize its strategic vision for the future. At the risk of oversimplifying, the strategic vision is "more of the same."

Green, E.D., Gunter, C., Biesecker, L.G., Di Francesco, V., Easter, C.L., Feingold, E.A., Felsenfeld, A.L., Kaufman, D.J., Ostrander, E.A. and Pavan, W.J. and 20 others (2020) Strategic vision for improving human health at The Forefront of Genomics. Nature 586:683-692. [doi: 10.1038/s41586-020-2817-4]

Starting with the launch of the Human Genome Project three decades ago, and continuing after its completion in 2003, genomics has progressively come to have a central and catalytic role in basic and translational research. In addition, studies increasingly demonstrate how genomic information can be effectively used in clinical care. In the future, the anticipated advances in technology development, biological insights, and clinical applications (among others) will lead to more widespread integration of genomics into almost all areas of biomedical research, the adoption of genomics into mainstream medical and public-health practices, and an increasing relevance of genomics for everyday life. On behalf of the research community, the National Human Genome Research Institute recently completed a multi-year process of strategic engagement to identify future research priorities and opportunities in human genomics, with an emphasis on health applications. Here we describe the highest-priority elements envisioned for the cutting-edge of human genomics going forward—that is, at ‘The Forefront of Genomics’.

What's interesting are the predictions that the NHGRI makes for 2030—predictions that were highlighted in the Scientific American article. I'm going to post those predictions without comment other than saying that I think they are mostly bovine manure. I'm interested in hearing your comments.

Bold predictions for human genomics by 2030

Some of the most impressive genomics achievements, when viewed in retrospect, could hardly have been imagined ten years earlier. Here are ten bold predictions for human genomics that might come true by 2030. Although most are unlikely to be fully attained, achieving one or more of these would require individuals to strive for something that currently seems out of reach. These predictions were crafted to be both inspirational and aspirational in nature, provoking discussions about what might be possible at The Forefront of Genomics in the coming decade.

  1. Generating and analysing a complete human genome sequence will be routine for any research laboratory, becoming as straightforward as carrying out a DNA purification.
  2. The biological function(s) of every human gene will be known; for non-coding elements in the human genome, such knowledge will be the rule rather than the exception.
  3. The general features of the epigenetic landscape and transcriptional output will be routinely incorporated into predictive models of the effect of genotype on phenotype.
  4. Research in human genomics will have moved beyond population descriptors based on historic social constructs such as race.
  5. Studies that involve analyses of genome sequences and associated phenotypic information for millions of human participants will be regularly featured at school science fairs.
  6. The regular use of genomic information will have transitioned from boutique to mainstream in all clinical settings, making genomic testing as routine as complete blood counts.
  7. The clinical relevance of all encountered genomic variants will be readily predictable, rendering the diagnostic designation ‘variant of uncertain significance (VUS)’ obsolete.
  8. An individual’s complete genome sequence along with informative annotations will, if desired, be securely and readily accessible on their smartphone.
  9. Individuals from ancestrally diverse backgrounds will benefit equitably from advances in human genomics.
  10. Breakthrough discoveries will lead to curative therapies involving genomic modifications for dozens of genetic diseases.

I predict that nine years from now (2030) we will still be dealing with scientists who think that most of our genome is functional; that most human protein-coding genes produce many different proteins by alternative splicing; that epigenetics is useful; that there are more noncoding genes than protein-coding genes; that the leading scientists in the 1960 and 70s were incredibly stupid to suggest junk DNA; that almost every transcription factor binding site is biologically relevant; that most transposon-related sequences have a mysterious (still unknown) function; that it's still a mystery why humans are so much more complex than chimps; and that genomics will eventually solve all problems by 2040.

Why in the world, you might ask, would we still be dealing with issues like that? Because of genomics.


Saturday, April 03, 2021

"Dark matter" as an argument against junk DNA

Opponents of junk DNA have been largely unsuccessful in demonstrating that most of our genome is functional. Many of them are vaguely aware of the fact that "no function" (i.e. junk) is the default hypothesis and the onus is on them to come up with evidence of function. In order to shift, or obfuscate, this burden of proof they have increasingly begun to talk about the "dark matter" of the genome. The idea is to pretend that most of the genome is a complete mystery so that you can't say for certain whether it is junk or functional.

One of the more recent attempts appears in the "Journal Club" section of Nature Reviews Genetics. It focuses on repetitive DNA.

Before looking at that article, let's begin by summarizing what we already know about repetitive DNA. It includes highly repetitive DNA consisting of mutliple tandem repeats of short sequences such as ATATATATAT... or CGACGACGACGA ... or even longer repeats. Much of this is located in centromeric regions of the chromosome and I estimate that functional highly repetitve regions make up about 1% of the genome.[see Centromere DNA and Telomeres]

The other part of repetitive DNA is middle repetitive DNA, which is largely composed of transposons and endogenous viruses, although it includes ribosomal RNA genes and origins of replication. Most of these sequences are dispersed as single copies throughout the genome. It's difficult to determine exactly how much of the genome consists of these middle repetitive sequences but it's certainly more than 50%.

Almost all of the transposon- and virus-related sequences are defective copies of once active transposons and viruses. Most of them are just fragments of the originals. They are evolving at the neutral rate so they look like junk and they behave like junk.1 That's not selfish DNA because is doesn't transpose and it's not "dark matter." These fragments have all the characterstics of nonfunctional junk in our genome.

We know that the C-value paradox is mostly explained by differing amounts of repetitive DNA in different genomes and this is consistent with the idea that they are junk. We know that less that 10% of our genome is conserved and this fits in with that conclusion. Finally, we know that genetic load arguments indicate that most our genome must be impervious to mutation. Combined, these are all powerful bits of evidence and logic in favor of repetitive sequences being mostly junk DNA.

Now let's look at what Neil Gemmell says in this article.

Gemmell, N.J. (2021) Repetitive DNA: genomic dark matter matters. Nature Reviews Genetics:1-1. [doi: 10.1038/s41576-021-00354-8]

"Repetitive DNA sequences were found in hundreds of thousands, and sometimes millions, of copies in the genomes of most eukaryotes. while widespread and evolutionarily conserved, the function of these repeats was unknown. Provocatively, Britten and Kohne concluded 'a concept that is repugnant to us is that about half of the DNA of higher organisms is trivial or permanently inert.'”"

That's from Britten and Kohne (1968) and it's true that more than 50 years ago those workers didn't like the idea of junk DNA. Britten argued that most of this repetitive DNA was likely to be involved in regulation. Gemmell goes on to describe centromeres and telomeres and mentions that most repetitive DNA was thought to be junk.

"... the idea that much of the genome is junk, maintained and perpetuated by random chance, seemed as broadly unsatisfactory to me as it had to the original authors. Enthralled by the mystery of why half our genome is repetitive DNA, I have followed this field ever since."

Gemmell is not alone. In spite of all the evidence for junk DNA, the majority of scientists don't like the fact that most of our genome is junk. Here's how he justifies his continued skepticism.

"But it was not until the 2000s, as full eukaryotic genome sequences emerged, that we discovered that the repetitive non-coding regions of our genome harbour large numbers of promoters, enhancers, transcription factor binding sites and regulatory RNAs that control gene expression. More recently, the importance of repetitive DNA in both structural and regulatory processes has emerged, but much remains to be discovered and understood. It is time to shine further light on this genomic dark matter."

This appears to be the ENCODE publicity campaign legacy rearing its ugly head once more. Most Sandwalk readers know that the presence of transcription factor binding sites, RNA polymerase binding sites, and junk RNA is exactly what one would predict from a genome full of defective transposons. Most of us know that a big fat sloppy genome is bound to contain millions of spurious binding sites for transcription factors so this says nothing about function.

Apparently Gemmell's skepticism doesn't apply to the ENCODE results so he still thinks that all those bits and pieces of transposons are mysterious bits of dark matter that could be several billion base pairs of functional DNA. I don't know what he imagines they could be doing.


Photo Credit: The photo shows human chromosomes labelled with a telomere probe (yellow), from Christoher Counter at Duke University.

1. In my book, I cover this in a section called "If it walks like a duck ..." It's a form of abductive reasoning.

Britten, R. and Kohne, D. (1968) Repeated Sequences in DNA. Science 161:529-540. [doi: 10.1126/science.161.3841.529]

Friday, April 02, 2021

Off to the publisher!

The first draft of my book is ready to be sent to my publisher.

Text by Laurence A. Moran

Cover art and figures by Gordon L. Moran

  • 11 chapters
  • 112,000 words (+ preface and glossary)
  • about 370 pages (estimated)
  • 26 figures
  • 305 notes
  • 400 references

©Laurence A. Moran


Tuesday, February 16, 2021

The 20th anniversary of the human genome sequence:
6. Nature doubles down on ENCODE results

Nature has now published a series of articles celebrating the 20th anniversary of the publication of the draft sequences of the human genome [Genome revolution]. Two of the articles are about free access to information and, unlike a similar article in Science, the Nature editors aren't shy about mentioning an important event from 2001; namely, the fact that Science wasn't committed to open access.

By publishing the Human Genome Project’s first paper, we worked with a publicly funded initiative that was committed to data sharing. But the journal acknowledged there would be challenges to maintaining the free, open flow of information, and that the research community might need to make compromises to these principles, for example when the data came from private companies. Indeed, in 2001, colleagues at Science negotiated publishing the draft genome generated by Celera Corporation in Rockville, Maryland. The research paper was immediately free to access, but there were some restrictions on access to the full data.

Friday, February 12, 2021

The 20th anniversary of the human genome sequence:
5. 90% of our genome is junk

This is the fifth (and last) post in celebration of the 20th anniversary of publishing the draft sequence. The first four posts dealt with: (1) the way Science chose to commemorate the occasion [Access to the data]; (2) finishing the sequence; (3) the number of genes; and (4) the amount of functional DNA in the genome.

Back in 2001, knowledgeable scientists knew that most of the human genome is junk and the sequence confirmed that knowledge. Subsequent work on the human genome over the past 20 years has provided additional evidence of junk DNA so that we can now be confident that something like 90% of our genome is junk DNA. Here's a list of data and arguments that support that claim.