I recently reported that Google's AI program does a horrible job of summarizing the junk DNA controversy. [The scary future of AI is revealed by how it deals with junk DNA] That led to a discussion about the "intelligence" in artificial intelligence and whether AI was capable of distinguishing between accurate and inaccurate data.
Google DeepMind is an artificial intelligence research laboratory headquartered in London, UK. Two of its researchers, Demis Hassabis and John Jumper, were awarded the 2024 Nobel Prize in Chemistry for developing AlphaFold, a program that predicts the tertiary structure of proteins.
DeepMind is now turning its attention to the human genome by developing a program called AlphaGenome, which is billed as "AI for better understanding the genome." The goal of the program is summarized in a document titled AlphaGenome: advancing regulatory variant effect prediction with a unified DNA sequence model. Here's the abstract.
Deep learning models that predict functional genomic measurements from DNA sequence are powerful tools for deciphering the genetic regulatory code. Existing methods trade off between input sequence length and prediction resolution, thereby limiting their modality scope and performance. We present AlphaGenome, which takes as input 1 megabase of DNA sequence and predicts thousands of functional genomic tracks up to single base pair resolution across diverse modalities – including gene expression, transcription initiation, chromatin accessibility, histone modifications, transcription factor binding, chromatin contact maps, splice site usage, and splice junction coordinates and strength. Trained on human and mouse genomes, AlphaGenome matches or exceeds the strongest respective available external models on 24 out of 26 evaluations on variant effect prediction. AlphaGenome’s ability to simultaneously score variant effects across all modalities accurately recapitulates the mechanisms of clinically-relevant variants near the TAL1 oncogene. To facilitate broader use, we provide tools for making genome track and variant effect predictions from sequence.
It's not easy to figure out what they're trying to do, but here's my best guess. I think they're going to build a reference genome containing all of the information about function that they can find in various databases. Then they will compare large input sequences (1 Mb, or one million base pairs) to that reference genome and examine any variants in the input sequence to predict how they will affect function. It's a way of detecting genetic diseases.
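To make that workflow concrete, here's a minimal sketch of what variant effect prediction from sequence looks like in general: predict functional tracks for a 1 Mb reference window and for the same window with the alternate allele substituted in, then compare the two predictions. The predict_tracks function below is a made-up stand-in for a sequence-to-function model, not the actual AlphaGenome interface; it returns random numbers so the example runs on its own.

```python
# Hedged sketch of variant effect prediction from sequence. `predict_tracks`
# is a hypothetical placeholder for a model like the one described in the
# preprint, NOT the real AlphaGenome API.
import numpy as np

WINDOW = 1_000_000  # 1 Mb of sequence context, as in the preprint

def predict_tracks(sequence: str) -> dict[str, np.ndarray]:
    """Placeholder model: one predicted value per base for each modality."""
    rng = np.random.default_rng(abs(hash(sequence)) % (2**32))
    modalities = ["RNA-seq", "DNase-seq", "splice_site_usage"]
    return {m: rng.random(len(sequence)) for m in modalities}

def score_variant(ref_window: str, pos: int, alt_base: str) -> dict[str, float]:
    """Return, per modality, the largest predicted change caused by the variant."""
    alt_window = ref_window[:pos] + alt_base + ref_window[pos + 1:]
    ref_pred = predict_tracks(ref_window)
    alt_pred = predict_tracks(alt_window)
    return {m: float(np.max(np.abs(alt_pred[m] - ref_pred[m]))) for m in ref_pred}

# Toy usage: a random "reference" window with one substitution in the middle.
toy_ref = "".join(np.random.default_rng(0).choice(list("ACGT"), WINDOW))
print(score_variant(toy_ref, WINDOW // 2, "T"))
```

The real model predicts thousands of tracks across many cell types; the point of the sketch is only the reference-versus-alternate comparison that sits at the heart of variant scoring.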
Lots of other programs do similar things, but they are limited to small input sequences. The advantage of AlphaGenome, according to its developers, is that by using 1 Mb sequences they will be able to detect the effects of variants responsible for long-range interactions.
Also, some other programs focus on specific functions such as splice sites, regulatory sequences, and 3D chromatin organization, whereas AlphaGenome incorporates all of those functions into a single multimodal prediction program at base pair resolution.
Here we present AlphaGenome, a model that unifies multimodal prediction, long sequence context, and base-pair resolution into a single framework. The model takes 1 megabase (Mb) of DNA sequence as input and predicts a diverse range of genome tracks across numerous cell types.
In order for this to work, AlphaGenome must have a solid reference sequence that identifies functional regions of the genome. That's because in order to predict the effect of variants, you need to be sure that the variants (mutations) are affecting a sequence that's biologically relevant. Differences that affect spurious splice sites or spurious regulatory sequences in junk DNA need to be ignored.
I don't know how DeepMind is going to tell the difference between functional elements and junk. There's nothing in the document that addresses this issue. The methods section indicates that they are using the standard human reference genome (GRCh38.p14) as annotated by Release 46 of GENCODE. That particular GENCODE version lists 28,000 non-coding genes and 19,411 protein-coding genes. There are 89,581 transcripts from those protein-coding genes (4-5 transcripts per gene). I don't know how many "regulatory" sites are listed and I don't know if other sites such as origins of replication are included. A more recent version of GENCODE (Release 49) lists 43,462 non-coding genes and 211,446 transcripts from protein-coding genes (11 transcripts per gene). I'm skeptical of all those numbers.
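If you want to check numbers like these yourself, here's a minimal sketch, assuming you have a local copy of a GENCODE annotation GTF (the file name below is an assumption). It tallies gene biotypes and counts transcripts of protein-coding genes; note that GENCODE's own "non-coding gene" total groups specific biotypes and excludes pseudogenes, so the raw biotype breakdown printed here won't map one-to-one onto the headline numbers.

```python
# Minimal sketch: tally GENCODE gene biotypes and protein-coding transcripts
# from a locally downloaded annotation file (file name is an assumption).
import gzip
import re
from collections import Counter

GTF = "gencode.v46.annotation.gtf.gz"  # assumed local copy of Release 46

gene_types = Counter()
coding_transcripts = 0

with gzip.open(GTF, "rt") as fh:
    for line in fh:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        feature, attributes = fields[2], fields[8]
        if feature not in ("gene", "transcript"):
            continue
        biotype = re.search(r'gene_type "([^"]+)"', attributes).group(1)
        if feature == "gene":
            gene_types[biotype] += 1
        elif biotype == "protein_coding":
            coding_transcripts += 1

protein_coding_genes = gene_types["protein_coding"]
print("genes by biotype:", gene_types.most_common(10))
print(f"transcripts from protein-coding genes: {coding_transcripts}")
print(f"transcripts per protein-coding gene: {coding_transcripts / protein_coding_genes:.1f}")
```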
The Methods section indicates that the "training data" includes some other highly suspect databases.
Training Data
To generate training targets for AlphaGenome, RNA-seq and epigenomic datasets were sourced from the ENCODE, FANTOM5, and GTEx consortia. Specifically, this included RNA-seq, PRO-cap, DNase-seq, ATAC-seq, transcription factor (TF) and histone ChIP-seq data from ENCODE; RNA-seq data from GTEx; CAGE data from Fantom5 portals; and genomic contact maps from the 4D Nucleome portal.
It would be nice to see DeepMind incorporate data from all sequenced human genomes (> 500,000) in order to determine which sites were under purifying selection and which ones seem to tolerate multiple mutations. That would be a valuable contribution since it would identify real functional elements of the genome and eliminate most of the ones in the standard databases.
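As a rough illustration of that idea, here's a toy sketch of window-level constraint detection: compare the number of variants observed across a large cohort with the number expected under a neutral mutation model and flag windows that are strongly depleted. All of the numbers below are invented for the illustration; a real analysis would use population call sets such as gnomAD and a properly calibrated mutation model.

```python
# Toy sketch of constraint detection from population variation data.
# Every quantity here is simulated; nothing comes from a real call set.
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(1)
n_windows = 1_000  # e.g. 1 kb windows tiled across a region

# Expected variant count per window under neutrality (assumed values that
# would normally come from local mutation rates and the cohort size).
expected = rng.uniform(30.0, 80.0, n_windows)

# Simulate observations: 90% of windows are neutral, 10% are constrained
# (purifying selection removes most variants before they can be observed).
constrained = rng.random(n_windows) < 0.10
observed = rng.poisson(np.where(constrained, 0.2 * expected, expected))

# Flag windows whose observed count is improbably low under neutrality.
p_low = poisson.cdf(observed, expected)
flagged = p_low < 0.01

print(f"flagged {flagged.sum()} of {n_windows} windows as constrained; "
      f"{(flagged & constrained).sum()} of those were simulated as constrained")
```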
I don't think that AlphaGenome is going to help us understand the human genome. In fact, I think it's just going to perpetuate the misinformation in the current databases.
As you might expect, the popular press has jumped on the announcement by Google DeepMind. Here's an article published in Nature in June 2025: DeepMind’s new AlphaGenome AI tackles the ‘dark matter’ in our DNA.
Tool aims to solve the mystery of non-coding sequences — but is still in its infancy.
Nearly 25 years after scientists completed a draft human genome sequence, many of its 3.1 billion letters remain a puzzle. The 98% of the genome that is not made of protein-coding genes — but which can influence their activity — is especially vexing.
An artificial intelligence (AI) model developed by Google DeepMind in London could help scientists to make sense of this ‘dark matter’, and see how it might contribute to diseases such as cancer and influence the inner workings of cells. The model, called AlphaGenome, is described in a 25 June preprint.
“This is one of the most fundamental problems not just in biology — in all of science,” said Pushmeet Kohli, the company’s head of AI for science, at a press briefing.
I don't see anything in the AlphaGenome preprint suggesting that they are about to solve the mystery of our genome's "dark matter." It would be great if an artificial intelligence program looked at all the data and confirmed that 90% of our genome is junk, but I'm not holding my breath.

5 comments:
{It would be nice to see DeepMind incorporate data from all sequenced human genomes (> 500,000) in order to determine which sites were under purifying selection and which ones seem to tolerate multiple mutations.}
That won't help much. By now, scientists have detected many cis-regulatory regions (like enhancers) and so many non-coding RNAs (like centromeric RNAs, telomeric RNAs, enhancer RNAs) that are poor in primary sequence conservation but show conservation in function, structure, size, position, and network interactions.
Your criteria are outdated and insufficient for detecting function.
@Mehrshad: Give me your best estimate of the percentage of the genome taken up by functional elements that show no evidence of purifying selection.
//"so many non -coding RNAs that are poor in primary sequence conservation but show conservation in function"//
How many are "so many"? [Citation needed]
Is this going to be another one of those where the paper you bring doesn't support the claim you make?
In this study, the authors investigated how cis-regulatory elements (CREs), particularly enhancers, are conserved across large evolutionary distances despite extensive sequence divergence. Using embryonic heart tissue from mouse and chicken, they generated high-confidence enhancer catalogs based on chromatin accessibility and histone modification signatures. They identified tens of thousands of enhancers in each species (∼30,000 in mouse and ∼22,000 in chicken) and then assessed conservation using conventional sequence alignment approaches. When enhancer conservation was evaluated solely by direct sequence alignment, only ~10% of mouse enhancers showed detectable sequence conservation in chicken, meaning that ~90% lacked recognizable primary sequence similarity across this deep evolutionary split.
Citation: Conservation of regulatory elements with highly diverged sequences across large evolutionary distances
In another study, the authors performed an exhaustive functional scan of noncoding sequences around the zebrafish phox2b locus to test how well metrics of evolutionary sequence constraint detect regulatory elements. They assayed 48 noncoding intervals spanning ∼40.7 kb using a GFP reporter in zebrafish embryos. Out of 20 sequence-conserved intervals, 13 (61%) showed enhancer activity, while among 13 non-conserved intervals, 4 (31%) were also functional, demonstrating that regulatory activity is not restricted to conserved sequences.
Comparisons with common conservation metrics (phastCons, AVID, MLAGAN, SLAGAN, PipMaker, WebMCS) revealed that 29–71% of experimentally validated enhancers were missed, indicating that standard sequence constraint methods fail to capture a substantial fraction of functional regulatory elements.
The study provides direct evidence that some enhancers are functional despite lacking detectable sequence conservation.
Citation: Metrics of sequence constraint overlook regulatory sequences in an exhaustive analysis at phox2b