
Showing posts with label Genomics. Show all posts

Monday, February 16, 2026

Carl Zimmer writes about AlphaGenome

We may not know a lot about how artificial intelligence (AI) algorithms work but the one thing we do know is that they are only as good as their databases. If you ask an AI program to tell you when Charles Darwin was born then chances are good it's going to give you the correct answer because that information is in Wikipedia and lots of other reliable online sources.

However, if you ask it to tell you how many genes are in the human genome it will not give you the correct answer. The correct answer is that we don't know for sure because it depends on how you define a gene and how many non-coding genes there are using various definitions. That's not the answer you will get. (I personally believe that there are only about 1000 non-coding genes but I don't expect a good "intelligence" program to favor my view over others. I DO expect it to not favor other opinions over mine.)

I just asked ChatGPT and it told me that there are tens of thousands of non-coding genes based on the Human Genome Project plus GENCODE and Ensembl annotations. This is correct ... and misleading. It's giving the best answer it can based on the databases it searches. However, many of us are skeptical of the GENCODE and Ensembl annotations and for good reason. They tend to err on the side of inclusion in order to avoid false negatives. In other words, they don't want to risk ignoring a real biologically relevant feature for lack of evidence so they deliberately risk including a lot of false positives. This is why those databases include a lot of questionable features such as non-coding genes, multiple transcription start sites, multiple splice variants, and tons of potential regulatory elements.

Along comes AlphaGenome. It's an AI program designed to scan those GENCODE and Ensembl databases to identify important features that might play a role in genetic diseases. What could possibly go wrong? [How intelligent is artificial intelligence?] [Will AlphaGenome from Google DeepMind help us understand the human genome?]

The average science writer jumped all over the original announcement of AlphaGenome to let us all know that artificial intelligence was going to solve the problem of the mysterious genome. Apparently the complexity of the human genome has astonished scientists ever since the first human genome sequence was published 25 years ago.1 The typical article on AlphaGenome fits nicely into the common theme that AI is soon going to rule the world.

That's why I was excited to pick up my copy of the New York Times yesterday and see that Carl Zimmer had written about AlphaGenome. Finally, an intelligent, highly respected, science writer was going to give us the truth. Here's the article that I saw in my version of the paper. (It was originally published several weeks ago on January 28, 2026.)

What a disappointment! Zimmer goes with the hype about AlphaGenome and repeats some of the tropes that he has avoided in the past. For example, he writes about how alternative splicing can create hundreds of different proteins from a single gene and how regulatory sequences can lie thousands or millions of base pairs away from a gene. (There's no question that this is true for a small number of transcription factor binding sites but the vast majority are close to the promoter.)

Zimmer gives an example showing that AlphaGenome identified a regulatory sequence for a gene called TAL1, implying that the program will help decipher the rest of the genome. The general tone of the newspaper article is that AlphaGenome will be of great help to scientists who want to understand the human genome.

I checked the online version of Carl Zimmer's article in order to prepare for this blog post. I was surprised to see that there were lots of things in the online version that weren't in the newspaper article. For example, Zimmer quotes my colleague Alex Palazzo saying that everybody uses AlphaFold to study proteins. Later in the article, Zimmer notes, "But the more scientists studied the human genome, the more complicated and messy it turned out to be." The newspaper article left out the words "and messy" and that's significant because junk DNA supporters like Alex Palazzo often refer to the human genome as "messy" and full of junk DNA. That's a very different perspective from that of opponents of junk DNA, who emphasize things like "complicated" and "mysterious."2

Zimmer has an even more revealing section that's in the online version but not the newspaper version.

Peter Koo, a computational biologist at Cold Spring Harbor Laboratory in New York who was not involved in the project, said that AlphaGenome represented an important step forward in applying artificial intelligence to the genome. “It’s an engineering marvel,” he said.

But Dr. Koo and other outside experts cautioned that it represented just one step on a long road ahead. “This is not AlphaFold, and it’s not going to win the Nobel Prize,” said Mark Gerstein, a computational biologist at Yale.

AlphaGenome will be useful. Dr. Gerstein said that he would probably add it to his toolbox for exploring DNA, and others expect to follow suit. But not all scientists trust A.I. programs like AlphaGenome to help them understand the genome.

“I see no value in them at all right now,” said Steven Salzberg, a computational biologist at Johns Hopkins University. “I think there are a lot of smart people wasting their time.”

The end of the online article is quite different from the final paragraphs of the newspaper article. In the newspaper article, Zimmer describes the TAL1 result then ends it with the paragraph starting with "In reality." I've highlighted that paragraph in the quotations below from the online version.

The AlphaGenome researchers shared their TAL1 predictions with Dr. Marc Mansour, a hematologist at University College London who spent years uncovering the leukemia-driving mutations with lab experiments.

“It was quite mind-blowing,” Dr. Mansour said. “It really showed how powerful this is.”

But, Dr. Mansour noted, AlphaGenome’s predictive powers fade the farther its gaze strays from a particular gene. He is now using AlphaGenome in his cancer research but does not blindly accept its results.

“These prediction tools are still prediction tools,” he said. “We still need to go to the lab.”

Dr. Salzberg of Johns Hopkins is less sanguine about AlphaGenome, in part because he thinks its creators put too much trust in the data they trained it on. Scientists who study splice sites don’t agree on which sites are real and which are genetic mirages. As a result, they have created databases that contain different catalogs of splice sites.

“The community has been working for 25 years to try to figure out what are all the splice sites in the human genome, and we’re still not really there,” Dr. Salzberg said. “We don’t have an agreed-upon gold-standard set.”

Dr. Pollard also cautioned that AlphaGenome was a long way from being a tool that doctors could use to scan the genomes of patients for threats to their health. It predicts only the effects of a single mutation on one standard human genome.

In reality, any two people have millions of genetic differences in their DNA. Assessing the effects of all those variations throughout a patient’s body remains far beyond AlphaGenome’s industrial-strength power.

“It is a much, much harder problem — and yet that’s the problem we need to solve if we want to use a model like this for health care,” Dr. Pollard said.

The net effect of these differences is to transform the article from one that promotes AlphaGenome in the newspaper version to one that's far more skeptical in the online version. I believe that the online version is far more accurate and reflects the high standard that I expect from Carl Zimmer. I'm assuming that the newspaper article was edited for the New York Times supplement that I read and I'm assuming that Zimmer did not approve of that edit.

Note: The cartoon was generated by ChatGPT in response to the request, "draw a cartoon illustrating GIGO - garbage in garbage out."

Note: The photo is from 10 years ago when Carl was in Toronto working on his junk DNA article for The New York Times [Is Most of Our DNA Garbage?]. That's Alex Palazzo on the left, then me, Ryan Gregory, and Carl Zimmer on the right.


1. Most knowledgeable scientists were not astonished to learn that 90% of our genome really is junk and there are fewer than 30,000 genes.

2. See the last chapter of my book: "Chapter 11: Zen and the Art of Coping with a Sloppy Genome."

Friday, December 19, 2025

How many lncRNA genes in the human genome? (2025)

There is considerable controversy over the total number of genes in the human genome. The number of protein-coding genes is pretty well established at somewhere between 19,500 and 20,000. It's the number of non-coding genes that's disputed.

There's general agreement on the number of well-defined small RNA genes such as snRNAs, snoRNAs, microRNAs etc. Similarly, the number of ribosomal RNA and tRNA genes is known. The problem is with identifying genuine long non-coding RNA genes (lncRNA genes). Estimates vary from fewer than 20,000 to more than 200,000 but most of these estimates fail to define what they mean by "gene." Many scientists seem to think that any detectable transcript must come from a gene.

This doesn't make any sense since we know that spurious transcripts exist and they don't come from genes by any meaningful definition of gene. The only reasonable definition of a molecular gene is a DNA sequence that's transcribed to produce a functional product.1

The idea that spurious, non-functional transcripts exist has been described in the scientific literature for many decades. One of my favorite descriptions is in a paper by Ponting and Haerty (2022) quoting a 2013 paper by Ulitsky and Bartel.

The cellular transcriptional machinery does not perfectly discriminate cryptic promoters from functional gene promoters. This machinery is abundant and so can engage sites momentarily depleted of nucleosomes and rapidly initiate transcription. The chance occurrence of splice sites can then facilitate the capping, splicing, and polyadenylation of long transcripts. A very large number of such rare RNA species are detectable in RNA-sequencing experiments whose properties are virtually indistinguishable from those of bona fide lncRNAs. Consequently, “a sensible [null] hypothesis is that most of the currently annotated long (typically >200 nt) noncoding RNAs are not functional, i.e., most impart no fitness advantage, however slight” (Ulitsky and Bartel, 2013: p. 26).

The important point here is that the correct null hypothesis is that these transcripts don't have a biologically relevant function and the burden of proof is on researchers to demonstrate function before assigning them to a genuine gene. My colleagues at the University of Toronto made the same point in a paper published in 2015.

In the absence of sufficient evidence, a given ncRNA should be provisionally labeled as non-functional. Subsequently, if the ncRNA displays features/activities beyond what one would expect for the null hypothesis, then we can reclassify the ncRNA in question as being functional. (Palazzo and Lee, 2015)

There are a number of well-defined lncRNAs that have been shown to have distinct reproducible functions. The key question is how many of these biologically relevant lncRNA genes exist in the human genome. I struggled with the answer to this question when I was writing my book. I finally decided to make a generous estimate of 5000 non-coding genes and that implies several thousand lncRNA genes (p. 127). I now think that estimate was far too generous and there are probably fewer than 1000 genuine lncRNA genes.

I have not scoured the literature for all the examples of human lncRNAs having good evidence of function but my impression is that there are only a few hundred. This post was incited by a recent publication by researchers from the Hospital for Sick Children and the University of Toronto (Toronto, Canada) who characterized another functional lncRNA called CISTR-ACT that plays a role in regulating cell size (Kiriakopulos et al., 2025).

I was prompted to revisit this controversy by the accompanying press release that said ...

Unlike genes that encode for proteins, CISTR-ACT is a long non-coding RNA (or lncRNA) and is part of the non-coding genome, the largely unexplored part that makes up 98 per cent of our DNA. This research helps show that the non-coding genome, often dismissed as ‘junk DNA’, plays an important role in how cells function.

We're used to this kind of misinformation2 in press releases but I thought it would be a good idea to read the paper. As I expected, there's nothing in the paper about junk DNA but here's the first sentence of the introduction.

The human genome contains more long non-coding RNAs (lncRNAs) than protein-coding genes (GENCODE v49) which regulate genes and chromatin scaffolding.

The latest version of GENCODE (Release 49) claims that there are 35,899 lncRNA genes. This is the only reference in the Kiriakopulos et al. paper to the number of lncRNA genes. There's no mention of the controversy and none of the papers that discuss the controversy are referenced.

The GENCODE number is close to the latest version of Ensembl, which lists 35,042 lncRNA genes. I couldn't find any good explanation for these numbers or for the definition of "gene" that they are using but what's interesting is how these numbers are climbing every year; for example, a paper from two years ago listed a number of sources and you can see that the RefSeq and GENCODE numbers are much smaller than today's numbers (Amaral et al., 2023).3

We intend to provoke alternative interpretation of questionable evidence and thorough inquiry into unsubstantiated claims.

Ponting and Haerty (2022)

It's perfectly acceptable to state your preferred view on lncRNAs when you publish a paper. The authors of the recent paper may want to believe that there are more lncRNA genes than protein-coding genes but I think it's important for them to define what they mean by "gene" when they make such a claim. What's not acceptable, in my opinion, is to ignore a genuine scientific controversy by not mentioning in the introduction that there are other legitimate views.

It's a shame that they didn't do that because their paper is a good example of the hard work that needs to be done in order to demonstrate that a particular lncRNA has a biologically relevant function.

In closing, I want to emphasize the recent review by Ponting and Haerty (2022)4 that points out the importance of the problem and the kinds of experiments that need to be done in order to establish that a given RNA comes from a real gene. This is how a scientific controversy should be addressed. Here's the abstract of that paper ...

Do long noncoding RNAs (lncRNAs) contribute little or substantively to human biology? To address how lncRNA loci and their transcripts, structures, interactions, and functions contribute to human traits and disease, we adopt a genome-wide perspective. We intend to provoke alternative interpretation of questionable evidence and thorough inquiry into unsubstantiated claims. We discuss pitfalls of lncRNA experimental and computational methods as well as opposing interpretations of their results. The majority of evidence, we argue, indicates that most lncRNA transcript models reflect transcriptional noise or provide minor regulatory roles, leaving relatively few human lncRNAs that contribute centrally to human development, physiology, or behavior. These important few tend to be spliced and better conserved but lack a simple syntax relating sequence to structure and mechanism, and so resist simple categorization. This genome-wide view should help investigators prioritize individual lncRNAs based on their likely contribution to human biology.


1. See Wikipedia: Gene; What Is a Gene?; Definition of a gene (again); Must a Gene Have a Function?.

2. No knowledgeable scientist ever said that all non-coding DNA was junk. We've known about non-coding genes for more than half-a-century.

3. See How many genes in the human genome (2023)?

4. See Most lncRNAs are junk

Amaral, P., Carbonell-Sala, S., De La Vega, F.M., Faial, T., Frankish, A., Gingeras, T., Guigo, R., Harrow, J.L., Hatzigeorgiou, A.G., Johnson, R. et al. (2023) The status of the human gene catalogue. Nature 622:41-47. [doi: 10.1038/s41586-023-06490-x]

Kiriakopulos et al. (2025) LncRNA CISTR-ACT regulates cell size in human and mouse by guiding FOSL2. Nature communications: (in press). [doi: 10.1038/s41467-025-67591-x]

Palazzo, A.F. and Lee, E.S. (2015) Non-coding RNA: what is functional and what is junk? Frontiers in genetics 6:2(1-11). [doi: 10.3389/fgene.2015.00002]

Ponting, C.P. and Haerty, W. (2022) Genome-Wide Analysis of Human Long Noncoding RNAs: A Provocative Review. Annual review of genomics and human genetics 23. [doi: 10.1146/annurev-genom-112921-123710]

Ulitsky, I. and Bartel, D.P. (2013) lincRNAs: genomics, evolution, and mechanisms. Cell 154:26-46. [doi: 10.1016/j.cell.2013.06.020]

Thursday, December 11, 2025

How many regulatory sites in the human genome?

The current best model of the human genome is that only 10% is functional and 90% is junk. This model was first developed over half a century ago (see Junk DNA). From the very beginning, the model recognized that regulatory sequences would make up a significant proportion of the functional elements but early suggestions that most of the repetitive DNA would turn out to be involved in regulation were rejected.

As more and more data accumulated on regulatory sequences, it became apparent that most regulatory sequences of pol II (RNA polymerase II) genes could be found in relatively short regions of DNA just upstream of the transcription start site. It also became apparent that for each transcription factor there were thousands of transcription factor binding sites even though only a small number were actually involved in genuine gene regulation.1

Monday, November 24, 2025

Evolution explains the differences between the human and chimpanzee genomes

If you align similar regions of the human and chimpanzee genomes they turn out to be about 98.6% identical in nucleotide sequence. The total number of differences amounts to 44 million base pairs (bp). If the differences are due to mutations that have occurred since divergence from a common ancestor, then there would be 22 million mutations in each lineage.
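The arithmetic in the paragraph above can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, the equal split between lineages is the assumption stated in the text, and the "implied aligned length" is a back-calculation from the two quoted figures rather than a number given in the post.

```python
# Figures quoted in the paragraph above.
IDENTITY = 0.986          # fraction of aligned positions that are identical
TOTAL_DIFFERENCES = 44e6  # differing base pairs between the two genomes


def aligned_length(differences: float, identity: float) -> float:
    """Aligned sequence length implied by a difference count and a fractional identity."""
    return differences / (1.0 - identity)


def per_lineage(differences: float) -> float:
    """Split the differences equally between the two lineages since the common ancestor."""
    return differences / 2.0


if __name__ == "__main__":
    length = aligned_length(TOTAL_DIFFERENCES, IDENTITY)
    print(f"implied aligned length: {length / 1e9:.2f} billion bp")
    print(f"mutations per lineage: {per_lineage(TOTAL_DIFFERENCES) / 1e6:.0f} million")
```

The implied aligned length comes out a little over 3 billion bp, which is in line with the size of the human genome, so the two quoted figures are consistent with each other.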

The mutation rate is approximately 100 new mutations per generation. Most of these will be neutral mutations that have no effect on the survival of the individual and almost all of them will be lost within a few generations. A small number of these neutral mutations will become fixed in the population and it's these fixed mutations that produce most of the changes in the genome of evolving populations. According to the neutral theory of population genetics, the number of fixed neutral mutations corresponds to the mutation rate. Thus, in every evolving population there will be 100 new fixed mutations per generation.
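The statement that the number of fixed neutral mutations corresponds to the mutation rate follows from a standard textbook argument, sketched here. The symbols are assumptions introduced for illustration and do not appear in the post: $N$ is the (constant) number of diploid individuals, $\mu$ is the number of new neutral mutations per genome copy per generation, and $k$ is the number of neutral mutations fixed per generation.

```latex
\begin{align}
  \text{new neutral mutations entering the population per generation} &= 2N\mu \\
  \text{probability that one new neutral mutation is eventually fixed} &= \frac{1}{2N} \\
  k &= 2N\mu \cdot \frac{1}{2N} = \mu
\end{align}
```

The population size cancels, which is why the rate of fixation of neutral mutations depends only on the mutation rate and not on how large the population is.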

Monday, May 19, 2025

A new higher mutation rate in humans includes indels in repetitive DNA regions

Theme

Mutation

-definition
-mutation types
-mutation rates
-phylogeny
-controversies

There are three ways of estimating the human mutation rate. The Biochemical Method is based on the known error rate of DNA replication and the average number of cell divisions between generations. It gives a rate of about 130 mutations per generation.

The Phylogenetic Method assumes that a large fraction of mammalian genomes is evolving at the neutral rate because it is junk DNA. Since we know that the rate of fixation of neutral alleles is equal to the mutation rate, we can estimate the mutation rate if we know the total number of nucleotide differences between two species (e.g. humans and chimpanzees) and the approximate time of divergence from a common ancestor. This gives an estimate of about 112 mutations per generation.
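The Phylogenetic Method reduces to a short calculation, sketched below. The function name and all three input values are hypothetical and chosen only to show the arithmetic: the divergence time and generation time in particular are round-number assumptions, not the inputs behind the estimate of 112 quoted above.

```python
def phylogenetic_mutation_rate(
    differences_between_species: float,
    divergence_years: float,
    generation_time_years: float,
) -> float:
    """Estimate mutations per generation from neutral divergence between two species.

    Assumes the differences are neutral, that they accumulated equally in both
    lineages, and that the neutral fixation rate equals the mutation rate.
    """
    per_lineage = differences_between_species / 2.0
    generations = divergence_years / generation_time_years
    return per_lineage / generations


if __name__ == "__main__":
    # Hypothetical inputs (assumptions, not values taken from this post):
    rate = phylogenetic_mutation_rate(
        differences_between_species=44e6,  # nucleotide differences between the species
        divergence_years=5e6,              # assumed time since the common ancestor
        generation_time_years=25.0,        # assumed average generation time
    )
    print(f"estimated mutations per generation: {rate:.0f}")
```

The estimate is sensitive to the assumed divergence time and generation time, so modest changes in either input shift the result by tens of mutations per generation.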

Friday, September 29, 2023

Evelyn Fox Keller (1936 - 2023) and junk DNA

Evelyn Fox Keller died a few days ago (Sept. 22, 2023). She was a professor of History and Philosophy of Science at the Massachusetts Institute of Technology (Cambridge, MA, USA). Most of the obituaries praise her for her promotion of women scientists and her critiques of science as a male-dominated discipline. More recently, she turned her attention to molecular biology and genomics and many philosophers (and others) seem to think that she made notable contributions in that area as well.

Sunday, April 03, 2022

Karen Miga and the telomere-to-telomere consortium

Karen Miga deserves a lot of the credit for the complete human genome sequence.

Karen Miga is a professor at the University of California, Santa Cruz, and she's been working for several years on sequencing the repetitive regions of the genome. She is a co-founder of the telomere-to-telomere consortium that just published a complete sequence of the human genome. She made a significant contribution to long-read (~20 kb) and ultra-long-read (>100 kb) sequencing and that's a major technological achievement that's worthy of prizes.

Read the interview on CBC (Canada) Quirks & Quarks at Scientists sequence complete, gap-free human genome for the first time and watch the YouTube video.


Miga did her Ph.D. with Huntington Willard at Duke University. Hunt has been working on centromeres for more than 40 years and some of my colleagues may remember him when he was a professor at the University of Toronto in the Department of Medical Genetics.



Wednesday, March 30, 2022

John Mattick presents his view of genomes

John Mattick has a new book coming out in August where he defends the notion that most of our genome is full of genes for functional noncoding RNAs. We have a pretty good idea what he's going to say. This is a talk he gave at Oxford on May 17, 2019.

Here are a few statements that should pique your interest.

  • (0:57) He says that his upcoming book is tentatively titled "the misunderstandings of molecular biology."
  • (1:11) He says that "the assumption has been very deeply embedded from the time of the lac operon on that genes equated to proteins."
  • (2:30) There have been three "surprises" in molecular biology: (1) introns, (2) eukaryotic genomes are full of 'selfish' DNA, and (3) "gene number does not scale with developmental complexity."
  • (4:30) It is an unjustified assumption to assume that transposon-related sequences are junk and that leads to misinterpretation of neutral evolution.
  • (6:00) The view that evolution of regulatory sequences is mostly responsible for developmental complexity (Evo-Devo) has never been justified.
  • (8:45) A lot of obtuse theoretical discussion about how the number of regulatory protein-coding genes increases quadratically as the total number of protein-coding genes increases in a bacterial genome but at some point there have to be more protein-coding regulatory genes than total protein-coding genes so that limits the evolution of bacteria.
  • (13:40) The proportion of noncoding DNA increases with developmental complexity, topping out at humans.
  • (14:00) The vast majority of the genome in complex organisms is differentially transcribed in different cells and different tissues.
  • (14:15) The whole genome is alive on both strands.
  • (14:20) There are two possibilities: junk RNA or abundant functional transcripts and that explains complex organisms.
  • Mattick then takes several minutes to document the fact that there are abundant transcripts— a fact that has been known for the better part of sixty years but he does not mention that. All of his statements carry the implicit assumption that these transcripts are functional.
  • (20:20) He makes the boring, and largely irrelevant, point that most disease-associated loci are located in noncoding regions (GWAS). He's responding to a critic who asked why, if these things (transcripts) are real, don't we see genetic evidence of it.
  • (24:00) Noncoding RNAs have all of the characteristics of functional RNAs with an emphasis on the fact that their expression is often only detected in specific cell types.
  • (31:50) It has now been shown that everything that protein transcription factors can do can be done by noncoding RNA.
  • (32:15) "I want to say to you that conservation is totally misunderstood." Apparently, lack of conservation imputes nothing about function.
  • (41:00) RNAs control phase separation. There's a whole other level of cell organization that we never dreamed of. (Ironically, he gives nucleoli as an example of something we never dreamed of.)
  • (42:36) "This is called soft metaphysics, and it's just come into biology, and it's spectacular in its implications."
  • (46:25) Almost every lncRNA is alternatively spliced in mice and humans.
  • (46:30) There's more alternative splicing in human protein-coding genes than in mice protein-coding genes but the extra splicing in humans is mostly in the 5' untranslated region. (I'm sure it has nothing to do with the fact that tons more RNA-Seq experiments have been done on human tissues.) "We think this is due to the increased sophistication of the regulation of these genes for the evolution of cognition."
  • (48:00) At least 20% of the human genome is evolutionarily conserved at the level of RNA structure and this does not require any assumptions.
  • (55:00) The talk ends at 55 minutes. That's too bad because I'm sure Mattick had a dozen more slides explaining why all of those transcripts are functional, as opposed to the few selected examples he picked. I'm sure he also had a lot of data refuting all of the evidence in favor of junk DNA but he just ran out of time.

I don't know if there were questions but, if there were, I bet that none of them challenged Mattick's main thesis.


Friday, April 09, 2021

Should we teach genomics and evolution to medical students?

Rama Singh,1 a biology professor at McMaster University in Hamilton (Ontario, Canada), has just published an interesting article on The Conversation website. It's titled "Medical schools need to prepare doctors for revolutionary advances in genetics." You can read the full article yourself but let me highlight the last few paragraphs to start the discussion.

Future physicians will be part of health networks involving medical lab technicians, data analysts, disease specialists and the patients and their family members. The physician would need to be knowledgeable about the basic principles of genetics, genomics and evolution to be able to take part in the chain of communication, information sharing and decision-making process.

This would require a more in-depth knowledge of genomics than generally provided in basic genetics courses.

Much has changed in genetics since the discovery of DNA, but much less has changed how genetics and evolution are taught in medical schools.

In 2013-14 a survey of course curriculums in American and Canadian medical schools showed that while most medical schools taught genetics, most respondents felt the amount of time spent was insufficient preparation for clinical practice as it did not provide them with sufficient knowledge base. The survey showed that only 15 per cent of schools covered evolutionary genetics in their programs.

A simple viable solution may require that all medical applicants entering medical schools have completed rigorous courses in genetics and genomics.

Here's the problem. I've just finished research on a book about modern evolution and genomics so I think I know a little bit about the subject. I'm also on the editorial board of a journal that publishes research on biochemistry and molecular biology education. I've written a biochemistry textbook and I have far too many years of experience trying to teach this material to graduate students and undergraduates at the University of Toronto. I can safely say that we (university teachers) have done a horrible job of teaching evolution and genomics to our students. We have turned out an entire generation of students who don't understand modern molecular evolution and don't understand what's in your genome.

What this means is that there's an extremely small pool of students who have completed "rigorous courses in genetics and genomics." Nobody will be able to apply to medical school. I doubt that we could teach this material to medical students with or without the appropriate background.

But you don't have to take my word for it. Some people have tried to teach this material to health science workers so we can see how it's working at that level. Take a look at the Genomics Education Programme supported by the NHS in the United Kingdom. They have a series of short videos and longer lessons that are designed to educate health care specialists. Here's the blurb that defines their objective.

Rapid advances in technology and understanding mean that genomics is now more relevant than ever before. As genomics increasingly becomes a part of mainstream NHS care, all healthcare professionals, and not just genomics specialists, need to have a good understanding of its relevance and potential to impact the diagnosis, treatment and management of people in our care.

In 2014, Health Education England (HEE) launched a four-year £20 million Genomics Education Programme (GEP) to ensure that our 1.2 million-strong NHS workforce has the knowledge, skills and experience to keep the UK at the heart of the genomics revolution in healthcare.

Funding for the programme has since been extended to enable us to continue our work in providing co-ordinated national direction of education and training in genomics and developing resources for a wide range of professionals.

They describe genes as 'coding' genes that build proteins. There's no mention of noncoding genes. They define a genome as "both genes (coding) and non-coding DNA." They also say that your genome is all of the DNA in your cells (46 chromosomes, 23 pairs). I don't see anything in their education packages that covers modern molecular evolution. In one of the packages they say,

The term ‘junk DNA’ has been used since the 1970s to describe non-coding regions of the genome, but today it is considered inaccurate and misleading. The term ‘junk’ suggests that 98% of the genome has no use, but in recent years, studies and projects have used advances in technology to shed light on these regions and have come to different conclusions about how much of the genome has a biological function.

Here's a link to a short video called What is a genome?. I recommend that you watch it to see the level that these experts think is suitable for health care professionals in the UK and to see the level of expertise of those who made the video. This is what seven years of work by experts and £20 million will get you.

All of this tells me that teaching genomics and evolution to medical students is going to be a lot more difficult than Rama Singh imagines. Not only would we have to counter several years of misinformation but we would have to rely on teachers who probably don't understand either topic.

Let's start by teaching these things correctly to biology and biochemistry majors. That's going to be hard enough for now.


1. Full disclosure: Rama and I shared an NSERC grant in 1981 on genetic variation in Drosophila.

Thursday, April 08, 2021

On the accuracy of genomics in detecting disease variants

Several diseases, such as cancers, are caused by the presence of deleterious alleles that affect the function of a gene. In the case of cancer, most of the mutations are somatic cell mutations—mutations that have occurred after fertilization. These mutations will not be passed on to future generations. However, there are some variants that are present in the germline and these will be inherited. A small percentage of these variants will cause cancer directly but most will just indicate a predisposition to develop cancer.

There are a host of other diseases that have a genetic component and the responsible alleles can also be present in the germline or due to somatic cell mutations.

Over the past fifty years or so there has been a lot of hype associated with the latest technological advances and the ability to detect deleterious germline mutations. The general public has been repeatedly told that we will soon be able to identify all disease-causing alleles and this will definitely lead to incredible medical advances in treating these diseases. Just yesterday, for example, I posted an article on predictions made by the National Human Genome Research Institute (USA), which predicts that by 2030,

The clinical relevance of all encountered genomic variants will be readily predictable, rendering the diagnostic designation ‘variant of uncertain significance (VUS)’ obsolete.

Similar predictions, in various forms, were made when the Human Genome Project got under way and at various times afterward. First there was the 1000 Genomes Project, then there was the 100,000 Genomes Project and, of course, ENCODE. The problem is that genomics hasn't lived up to these expectations and there's a very good reason for that: the problem is a lot more difficult than it seems.

One of the Facebook groups that I follow (Modern Genetics & Technology)1 alerted me to a recent paper in JAMA that addressed the problem of genomics accuracy and the prediction of pathogenic variants. I'm posting the complete abstract so you can see the extent of the problem.

AlDubayan, S.H., Conway, J.R., Camp, S.Y., Witkowski, L., Kofman, E., Reardon, B., Han, S., Moore, N., Elmarakeby, H. and Salari, K. (2020) Detection of Pathogenic Variants With Germline Genetic Testing Using Deep Learning vs Standard Methods in Patients With Prostate Cancer and Melanoma. JAMA 324:1957-1969. [doi: 10.1001/jama.2020.20457]

Importance Less than 10% of patients with cancer have detectable pathogenic germline alterations, which may be partially due to incomplete pathogenic variant detection.

Objective To evaluate if deep learning approaches identify more germline pathogenic variants in patients with cancer.

Design, Setting, and Participants A cross-sectional study of a standard germline detection method and a deep learning method in 2 convenience cohorts with prostate cancer and melanoma enrolled in the US and Europe between 2010 and 2017. The final date of clinical data collection was December 2017.

Exposures Germline variant detection using standard or deep learning methods.

Main Outcomes and Measures The primary outcomes included pathogenic variant detection performance in 118 cancer-predisposition genes estimated as sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). The secondary outcomes were pathogenic variant detection performance in 59 genes deemed actionable by the American College of Medical Genetics and Genomics (ACMG) and 5197 clinically relevant mendelian genes. True sensitivity and true specificity could not be calculated due to lack of a criterion reference standard, but were estimated as the proportion of true-positive variants and true-negative variants, respectively, identified by each method in a reference variant set that consisted of all variants judged to be valid from either approach.

Results The prostate cancer cohort included 1072 men (mean [SD] age at diagnosis, 63.7 [7.9] years; 857 [79.9%] with European ancestry) and the melanoma cohort included 1295 patients (mean [SD] age at diagnosis, 59.8 [15.6] years; 488 [37.7%] women; 1060 [81.9%] with European ancestry). The deep learning method identified more patients with pathogenic variants in cancer-predisposition genes than the standard method (prostate cancer: 198 vs 182; melanoma: 93 vs 74); sensitivity (prostate cancer: 94.7% vs 87.1% [difference, 7.6%; 95% CI, 2.2% to 13.1%]; melanoma: 74.4% vs 59.2% [difference, 15.2%; 95% CI, 3.7% to 26.7%]), specificity (prostate cancer: 64.0% vs 36.0% [difference, 28.0%; 95% CI, 1.4% to 54.6%]; melanoma: 63.4% vs 36.6% [difference, 26.8%; 95% CI, 17.6% to 35.9%]), PPV (prostate cancer: 95.7% vs 91.9% [difference, 3.8%; 95% CI, –1.0% to 8.4%]; melanoma: 54.4% vs 35.4% [difference, 19.0%; 95% CI, 9.1% to 28.9%]), and NPV (prostate cancer: 59.3% vs 25.0% [difference, 34.3%; 95% CI, 10.9% to 57.6%]; melanoma: 80.8% vs 60.5% [difference, 20.3%; 95% CI, 10.0% to 30.7%]). For the ACMG genes, the sensitivity of the 2 methods was not significantly different in the prostate cancer cohort (94.9% vs 90.6% [difference, 4.3%; 95% CI, –2.3% to 10.9%]), but the deep learning method had a higher sensitivity in the melanoma cohort (71.6% vs 53.7% [difference, 17.9%; 95% CI, 1.82% to 34.0%]). The deep learning method had higher sensitivity in the mendelian genes (prostate cancer: 99.7% vs 95.1% [difference, 4.6%; 95% CI, 3.0% to 6.3%]; melanoma: 91.7% vs 86.2% [difference, 5.5%; 95% CI, 2.2% to 8.8%]).

Conclusions and Relevance Among a convenience sample of 2 independent cohorts of patients with prostate cancer and melanoma, germline genetic testing using deep learning, compared with the current standard genetic testing method, was associated with higher sensitivity and specificity for detection of pathogenic variants. Further research is needed to understand the relevance of these findings with regard to clinical outcomes.
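For readers who don't work with these four measures every day, here is a minimal sketch of how sensitivity, specificity, PPV, and NPV are computed from counts of true and false calls. The `Counts` class, the `detection_metrics` function, and the numbers are all hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Hypothetical tallies of variant calls compared against a reference set."""
    tp: int  # pathogenic variants correctly called pathogenic
    fp: int  # benign variants wrongly called pathogenic
    tn: int  # benign variants correctly called benign
    fn: int  # pathogenic variants that were missed


def detection_metrics(c: Counts) -> dict[str, float]:
    """Return the four standard diagnostic measures as fractions between 0 and 1."""
    return {
        "sensitivity": c.tp / (c.tp + c.fn),  # share of real positives that were found
        "specificity": c.tn / (c.tn + c.fp),  # share of real negatives that were found
        "ppv": c.tp / (c.tp + c.fp),          # share of positive calls that are real
        "npv": c.tn / (c.tn + c.fn),          # share of negative calls that are real
    }


# Made-up counts for illustration only.
example = Counts(tp=90, fp=10, tn=40, fn=10)
metrics = detection_metrics(example)

for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Note that every one of these fractions depends on knowing which calls are truly positive or negative, which is exactly the information the authors say they lacked.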

It's really difficult to understand this paper since there are many terms that I'd have to research more thoroughly; for example, does "germline whole-exon sequencing" mean that only sperm or egg DNA was sequenced and that every single exon in the entire genome was sequenced? Were exons in noncoding genes also sequenced?

I found it much more useful to look at the accompanying editorial by Gregory Feero.

Feero, W.G. (2020) Bioinformatics, Sequencing Accuracy, and the Credibility of Clinical Genomics. JAMA 324:1945-1947. [doi: 10.1001/jama.2020.19939]

Feero explains that the main problem is distinguishing real pathogenic variants from false positives and this can only be accomplished by first sequencing and assembling the DNA and then using various algorithms to focus on important variants. Then there's the third step.

The third step, which often requires a high level of clinical expertise, sifts through detected potentially deleterious variations to determine if any are relevant to the indication for testing. For example, exome sequencing ordered for a patient with unexplained cardiomyopathy might harbor deleterious variants in the BRCA1 gene which, while a potentially important incidental finding, does not provide a plausible molecular diagnosis for the cardiomyopathy. The complexity of the bioinformatics tools used in these 3 steps is considerable.

It's that third step that's analyzed in the AlDubayan et al. paper and one of the tools used is a deep-learning (AI) algorithm. However, the training of this algorithm requires considerable clinical expertise and testing it requires a gold standard set of variants to serve as an internal control. As you might have guessed, that gold standard doesn't exist because the whole point of genomics is to identify previously unknown deleterious alleles.
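The abstract says the authors worked around the missing gold standard by building a reference set from all variants judged valid by either method. Here's a rough sketch of what that implies, as I read it. The variant names and the `estimated_sensitivity` function are hypothetical, and the real analysis is certainly more involved than this.

```python
# Hypothetical variant calls that were judged valid after review.
standard_valid = {"varA", "varB", "varC"}
deep_valid = {"varB", "varC", "varD", "varE"}

# With no independent gold standard, the reference set is just the union
# of the validated calls from the two methods being compared.
reference = standard_valid | deep_valid


def estimated_sensitivity(found: set[str], reference_set: set[str]) -> float:
    """Fraction of the reference variants that one method detected."""
    return len(found & reference_set) / len(reference_set)


standard_est = estimated_sensitivity(standard_valid, reference)
deep_est = estimated_sensitivity(deep_valid, reference)

print(f"standard method: {standard_est:.0%}")
print(f"deep learning method: {deep_est:.0%}")
```

The catch is that a pathogenic variant missed by both methods never makes it into the reference set, so both estimates can only overstate the true sensitivity.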

Feero warns us that "clinical genome sequencing remains largely unregulated and accuracy is highly dependent on the expertise of individual testing laboratories." He concludes that genomics still has a long way to go.

The genomics community needs to act as a coherent body to ensure reproducibility of outcomes from clinical genome or exome sequencing, or provide transparent quality metrics for individual clinical laboratories. Issues related to achieving accuracy are not new, are not limited to bioinformatics tools, and will not be surmounted easily. However, until analytic and clinical validity are ensured, conversations about the potential value that genome sequencing brings to clinical situations will be challenging for clinical centers, laboratories that provide sequencing services, and consumers. For the foreseeable future, nongeneticist clinicians should be familiar with the quality of their chosen genome-sequencing laboratory and engage expert advice before changing patient management based on a test result.

I'm guessing that Gregory Feero doesn't think that in nine years (2030) "The clinical relevance of all encountered genomic variants will be readily predictable."


1. I do NOT recommend this group. It's full of amateurs who resist learning and one of its main purposes is to post copies of pirated textbooks in its files. The group members get very angry when you tell them that what they are doing is illegal!