Here's the problem. I can only access the cheap versions of AI such as ChatGPT and Scite Assistant but I can also see the results of Google's Generative AI whenever I do a Google search. Chris has access to more sophisticated versions so that's what he might be referring to when he says they operate at the Ph.D. level of intelligence.
The ones I see are not very intelligent. They are mostly sophisticated aggregators that comb the internet for information published by humans and then summarize that information, with an emphasis on how frequently a given human opinion appears in their databases. Since there's a lot of misinformation out there, this often means that the AI algorithm just propagates that misinformation, especially if it's a very popular misconception (see question #8, below).
The greatest danger of AI right now is not that it is smarter than humans—it's that it's as stupid as the humans that it relies on to give it the correct answers.
There have been some attempts to improve the current crop of AI algorithms by having them put more emphasis on reliable data. For example, some of them seem to weigh Wikipedia opinions more than typical press releases and some of them mistakenly believe that opinions published in the scientific literature are the gold standard of accuracy. This attempt is laudable but it hardly counts as true artificial intelligence since the results are being guided by humans who can independently evaluate the accuracy of the information that the program is aggregating.
I've seen articles that extol the virtues of AI by claiming that an AI algorithm can now pass all undergraduate exams in the sciences and that AI-generated essays are getting high grades in university courses. To me this says more about the quality of essays and exams than it does about the power of AI. I've yet to see an AI algorithm that could have passed an exam in one of my undergraduate biochemistry or molecular biology courses. And the idea that an AI program could have written a decent essay in one of my courses is ludicrous because all of my essay topics were based on controversial and contentious issues in the field and I've never seen an AI app that can deal effectively with controversy.
I've published lots of examples where the common AI programs botch the definition of junk DNA and fail to report the current views of many knowledgeable experts. Let's look at another example produced by Google's Generative AI when you search for "gene." Here's the opening statement that you see at the top of the page.
A gene is the fundamental unit of heredity, a segment of DNA that contains the instructions for building a specific protein or a functional RNA molecule. These instructions are passed down from parents to offspring, determining traits and characteristics.
This definition is misleading because it doesn't distinguish between a gene as a hereditary object and the molecular gene. With reference to the molecular gene, it doesn't tell us whether to include regions such as UTRs in protein-coding genes and introns in all genes. Contrast this with the Wikipedia definition, written by human beings.
In biology, the word gene has two meanings. The Mendelian gene is a basic unit of heredity. The molecular gene is a sequence of nucleotides in DNA that is transcribed to produce a functional RNA. There are two types of molecular genes: protein-coding genes and non-coding genes.
Note that the concise definition is a DNA sequence that is transcribed.1
It's true that if you scroll down below the opening statement on Google you'll encounter more information, but it's just a hodgepodge of bits of information retrieved from the internet. There's no coherence of thought and no attempt at coming up with a definition that accounts for the main areas of contention, such as whether regulatory elements are part of the gene or whether introns count. Google's Generative AI doesn't look very intelligent to me.

How will we know when AI reaches the intelligence level that's expected of a Ph.D. student? Obviously it will depend on the field of study, since the level of intelligence expected in some fields might be quite different from that expected in other fields. It will also depend on the university, since some universities in my field clearly don't have the same standards that I'm used to.
I can only evaluate the result for areas where I think I'm knowledgeable enough to serve on Ph.D. exam committees so I made a list of thirteen questions that I would expect a Ph.D. student to answer in an intelligent manner.2
(I note that in some cases, the current AI answers are quite a bit better than the answers from a few years ago (e.g. #1, #6 and #10). I suspect this is because programmers are restricting the sources that the AI program is using in order to come up with better answers from human experts.)
If you think you've come across an AI algorithm that can perform at the Ph.D. level on most of these questions then please post the answers here. I'm especially interested in the more expensive programs that I can't afford.
- What is a gene?
- How much junk DNA is there in the human genome?
- What percentage of human genes can produce multiple functional products by alternative splicing?
- What is the most common mechanism of evolution at the molecular level?
- What is the oldest biochemical pathway?
- Many biochemical reactions have a positive standard Gibbs free energy change. How do those reactions proceed in a typical cell?
- Why is most of the human genome transcribed?
- What's the best definition of photosynthesis?
- What's the best current explanation of the origin of life?
- Do you think the Three Domain Hypothesis is a valid description of the tree of life?
- Why is there an approximate molecular clock?
- Why can humans synthesize proline but not valine?
- What is the Central Dogma of Molecular Biology?
Image Credit: The image was created by ChatGPT in response to the request, "draw an image of the role of AI in biology."
1. Don't quibble about RNA genes. That's covered later on in the Wikipedia article.
2. I'm not demanding that the answer is the correct answer (where "correct" is the answer that I would give!). What I'm asking is that the answer address the important issues and present the controversies in a fair and rational manner. It should then draw a conclusion that makes logical sense. Students would be given the question in advance so they could demonstrate their ability to do research.
35 comments:
Whatever the percentage, all alternative transcripts are functional. Even noisy transcripts are part of the regulation process known as regulated unproductive splicing (RUS).
I asked Copilot (Microsoft, free) about proline and valine (your Q12). It briefly described the synthetic pathways and noted that humans lack the acetolactate synthase and isomeroreductase enzymes. So (if accurate) that answers one aspect of "why?"
Copilot then introduced essential and non-essential amino acids and suggested that the ready availability of valine in a protein-rich diet meant there was no pressure on humans to maintain the synthetic pathway for valine. That's a different aspect of "why?"
I'm no biochemist, and that's obviously a short undergraduate level answer, but I'm happy with that answer … until you tell me why I shouldn't be!
Also, AI chatbots target ordinary questions; the requirements of examination questions are different.
I tried out Elon Musk's xAI Grok, which I find gives the most long-winded answers of the AIs that don't require a subscription!
I asked the following question:
Using expert biochemical knowledge, address the important issues and present the controversies in a fair and rational manner and then draw a conclusion that makes logical sense to answer the question: What percentage of human genes can produce multiple functional products by alternative splicing?
The answer is a bit repetitive and comes out with a very high estimate (70-80%) on the key issue, namely what proportion of the alternative splices are functional, though it does try to address the issue of functionality.
The answer is too long to post as a single comment so I will place it in two comments:
Part 1
Grok answered:
Addressing the Question: What Percentage of Human Genes Can Produce Multiple Functional Products by Alternative Splicing?
Alternative splicing is a fundamental biochemical process in eukaryotic gene expression, allowing a single gene to produce multiple mRNA isoforms, which can translate into distinct protein products with diverse functions. This mechanism significantly expands the proteomic diversity without requiring an increase in the number of genes. To estimate the percentage of human genes capable of producing multiple functional products through alternative splicing, we need to consider biochemical mechanisms, genomic data, and ongoing controversies in the field. Below, I will address the key issues, present relevant controversies, and provide a reasoned conclusion.
Key Issues in Alternative Splicing and Functional Products
Definition of Alternative Splicing and Functional Products:
Alternative splicing involves the selective inclusion or exclusion of exons or introns in the final mRNA transcript, resulting in isoforms that may differ in coding sequences, untranslated regions (UTRs), or regulatory elements.
A "functional product" typically refers to a protein isoform with a distinct biological role, such as altered enzymatic activity, binding affinity, localization, or stability. However, some isoforms may affect mRNA stability or translation efficiency without producing a functional protein, complicating the definition.
Scale of Alternative Splicing in Humans:
The human genome contains approximately 20,000–25,000 protein-coding genes, based on current annotations from databases like Ensembl and GENCODE (as of 2025).
High-throughput sequencing technologies (e.g., RNA-seq) have revealed that most human genes undergo some form of alternative splicing, but not all isoforms are functional.
Detection and Quantification:
RNA-seq and transcriptomic studies provide evidence of splicing events, but detecting low-abundance isoforms or tissue-specific variants remains challenging.
Bioinformatic pipelines (e.g., STAR, Kallisto) and databases (e.g., Ensembl, RefSeq) annotate splice variants, but discrepancies in annotation criteria can affect estimates.
Functional Relevance:
Not all splice variants produce functional proteins. Some isoforms may be non-coding, degraded via nonsense-mediated decay (NMD), or produce truncated proteins with no biological role.
Experimental validation (e.g., proteomics, functional assays) is required to confirm functionality, but such studies lag behind transcriptomic data due to technical limitations.
Controversies in Estimating the Extent of Alternative Splicing
Estimates of Genes Undergoing Alternative Splicing:
Early microarray-based studies suggested that 40–60% of human genes undergo alternative splicing. However, RNA-seq studies have increased estimates to 90–95% or higher, as reported in landmark studies (e.g., Pan et al., 2008; Wang et al., 2008).
Controversy arises over whether all detected isoforms are biologically relevant. Some researchers argue that many splice variants are "transcriptional noise" with no functional role, while others contend that even low-abundance isoforms may have context-specific functions (e.g., in development or disease).
Part 2:
Defining "Functional" Products:
A major debate centers on what constitutes a "functional" product. Some isoforms may subtly modulate protein function (e.g., altering a single domain), while others may be non-functional or deleterious. For example, alternative splicing in the TP53 gene can produce isoforms with opposing roles in cancer progression.
The lack of comprehensive proteomic data makes it difficult to confirm whether all mRNA isoforms translate into functional proteins. Mass spectrometry studies often detect fewer isoforms than predicted by transcriptomics, suggesting a gap between transcription and translation.
Tissue-Specific and Context-Dependent Splicing:
Alternative splicing is highly tissue-specific and regulated by developmental stage, environmental signals, or disease states. This variability complicates global estimates of genes producing multiple functional products.
For example, neuronal genes (e.g., Dscam in Drosophila, though less extreme in humans) can produce thousands of isoforms, but only a subset may be functional in specific contexts.
Bioinformatics and Annotation Challenges:
Different databases (Ensembl, RefSeq, GENCODE) report varying numbers of splice variants due to differences in annotation pipelines and criteria for defining a "gene" or "transcript."
Overestimation of functional isoforms may occur if bioinformatic tools include low-abundance or artefactual transcripts, while underestimation may result from incomplete sequencing depth or tissue sampling.
Current Evidence and Estimates
Based on recent genomic and transcriptomic studies (up to 2025), the following points provide a foundation for estimating the percentage of human genes producing multiple functional products via alternative splicing:
High Prevalence of Splicing: Studies using RNA-seq (e.g., ENCODE, GTEx) estimate that 90–95% of human multi-exon genes undergo alternative splicing, producing multiple mRNA isoforms. This is supported by data from Wang et al. (2008) and subsequent analyses in GENCODE (release 44, 2023), which annotate multiple transcripts for most protein-coding genes.
Functional Isoforms: While most genes produce multiple mRNA isoforms, not all translate into functional proteins. Proteomic studies (e.g., Kim et al., 2014; Wilhelm et al., 2014) suggest that 50–70% of multi-exon genes produce at least two distinct protein isoforms detectable by mass spectrometry, though this is likely an underestimate due to limitations in proteomic sensitivity.
Conservative Estimates: Accounting for tissue-specificity and functional validation, a reasonable estimate is that 60–80% of human genes produce multiple functional protein products through alternative splicing. This range aligns with reviews (e.g., Kelemen et al., 2013; Baralle & Giudice, 2017) that balance transcriptomic and proteomic evidence.
Conclusion
Based on the available evidence, approximately 70–80% of human genes are capable of producing multiple functional products through alternative splicing. This estimate reflects the high prevalence of alternative splicing (90–95% of multi-exon genes) tempered by the fact that not all isoforms are functional or translated into proteins. The range accounts for ongoing controversies regarding the definition of "functional" and limitations in proteomic validation. Advances in single-cell RNA-seq, long-read sequencing (e.g., PacBio, Oxford Nanopore), and high-sensitivity proteomics will likely refine this estimate in the future, potentially increasing it as more tissue-specific and low-abundance isoforms are characterized.
This conclusion balances the optimism of transcriptomic data with the caution required by proteomic and functional validation, providing a rational and evidence-based answer to the question.
Oh, boy, that is not a smart answer.
The first part is only true if you define "functional" as "has some effect on the cell (measurable or not)". However, only a small percentage of conserved alternative transcripts produce protein isoforms that are useful for the cell.
The vast majority of splice events are not conserved and the proteins they generate are little more than biological noise. There are a lot of these alternative transcripts (and proteins) because neither translation initiation nor alternative splicing is particularly efficient.
Regulated unproductive splicing does not apply to all transcript noise, just a small proportion of conserved NMD transcripts.
To clarify, my reply was to the first comment by anonymous.
Sequence conservation is not a good indication of transcript functionality, as there are many examples of non-coding RNAs that show practically no conservation, like Xist RNA. And also, the spliceosome is an efficient and flexible machine; how dare you say it's inefficient?
@mehrshad.
Last time I looked, non-coding RNA wasn't (by definition) coding.
Will I get struck down for pointing out the obvious inefficiency then?
@mehrshad
The absolute numbers are irrelevant (it doesn't matter that there are "many" functional but poorly conserved transcripts), it's the degree of correlation that matters. If among 10 million poorly conserved transcripts 99.9% are nonfunctional, it doesn't matter that the remaining 0.1% are functional (0.1% of 10 million would still be 10 000, aka "many"). And if 99.9% of highly conserved transcripts are functional, then sequence conservation is in fact a perfectly good INDICATION of transcript functionality.
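To make that arithmetic concrete, here is a minimal sketch using the hypothetical counts from the argument above (the 20,000 figure for highly conserved transcripts is an illustrative assumption, not a measured number):

```python
# Illustrative numbers only, taken from the argument above (not real data).
poorly_conserved = 10_000_000                                  # hypothetical pool of poorly conserved transcripts
functional_poorly_conserved = int(poorly_conserved * 0.001)    # 0.1% functional = 10,000 ("many")

highly_conserved = 20_000                                      # hypothetical pool of highly conserved transcripts
functional_highly_conserved = int(highly_conserved * 0.999)    # 99.9% functional = 19,980

# What matters for using conservation as an indicator is the conditional probability,
# not the absolute number of functional-but-unconserved transcripts.
print(f"P(functional | conserved)     = {functional_highly_conserved / highly_conserved:.3f}")
print(f"P(functional | not conserved) = {functional_poorly_conserved / poorly_conserved:.3f}")
print(f"Functional but poorly conserved transcripts: {functional_poorly_conserved:,}")
```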
By the way, this answer:
"Proteomic studies (e.g., Kim et al., 2014; Wilhelm et al., 2014) suggest that 50–70% of multi-exon genes produce at least two distinct protein isoforms detectable by mass spectrometry"
... is an interesting example.
ChatGPT initially invented its references, but both of these are real and large-scale studies. Unfortunately, neither detected multiple splice isoforms for anywhere close to 50-70% of multi-exon genes.
Kim et al. claimed 2,450 genes with isoform-specific peptides and Wilhelm et al. didn't give a number, but the figures they show mean that it could not have been more than 1,300 and was probably much lower.
Which means that ChatGPT has invented the 50-70% figure.
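As a rough sanity check, assuming something like 20,000 multi-exon protein-coding genes (the ballpark quoted in the Grok answer above), the figures cited for those two papers work out to far less than 50-70%:

```python
# Rough sanity check of the 50-70% claim, using the figures cited in this thread.
# The gene count is the ballpark quoted in the Grok answer above; treat it as an assumption.
multi_exon_genes = 20_000

kim_2014_genes = 2_450       # genes with isoform-specific peptides, as cited above
wilhelm_2014_max = 1_300     # upper bound inferred above from their figures

print(f"Kim et al. (2014):      {kim_2014_genes / multi_exon_genes:.1%} of genes")     # ~12%
print(f"Wilhelm et al. (2014): <={wilhelm_2014_max / multi_exon_genes:.1%} of genes")   # <= ~7%
# Either way, nowhere near the 50-70% asserted in the AI answer.
```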
@Zammo See: It's trivial to note that animals have lost some of the genes required to synthesize valine. That's just a fact that can be found in any undergraduate biochemistry textbook.
Why have animals lost these genes? If it's just because there's plenty of valine in the diet of mammals then why haven't they also lost the ability to synthesize the other amino acids because they are also abundant in the diet?
A good biochemistry student could address this question by looking at all the data on amino acid pathways and by understanding the role of chance and contingency in evolution. I don't see any AI program that's capable of that kind of intelligent response.
You said, "AI chatbots target ordinary questions, the requirements of examination questions are different." Are you suggesting that a true artificial intelligence algorithm can't come up with the correct answer to examination questions that intelligent human students can answer?
I put the questions to "o3", the most advanced agent available via ChatGPT, in the context of its Deep Research feature, which conducts extensive web searches for approximately fifteen minutes.
https://chatgpt.com/share/6831a0b2-3244-8001-a7b2-3a111007a87e
The answers in the link are an interesting mix! As a layman who's tried to educate myself via reading Sandwalk for many years, I find that some answers aren't bad, but others are poorer quality. I'd love to see a detailed "Fisking" of the answers... or just a mark and a comment for each answer from an expert like Prof. Moran. But I suspect it would be a lot of work to get right into the details.
One area of entertainment with ChatGPT is to ask it to render a labelled picture of something. It can take a while, but the results are absolutely bizarre. One of my favorites is labelled pictures of eukaryotic cells. To quickly see examples, one can do a Google search for such images. Actually, they would make lovely posters.
@Mitchell: Thanks for those answers. Most of them are very similar to the ones that the standard version of ChatGPT would give but some are better. I wonder what sources the “Deep Research” version is looking at?
The general impression I get is that the best answers are similar to what you would find in the best undergraduate textbooks. I see very little evidence of significant intelligence or critical thinking but in some cases the answer has avoided the most common misconceptions and that’s an improvement.
I’ll give you a grade for each question assuming that the answer was submitted by a graduate student in biochemistry/molecular biology. Most of the answers are somewhat incoherent and many of them include contradictory statements. I expect better writing from graduate students but I’ll try and ignore that in order to focus on the information in the answer.
@Mitchell: Question #1 (What is a gene?) F
The answer doesn’t distinguish between the Mendelian definition of a gene and the molecular gene. It doesn’t define the boundaries of a molecular gene - the better definition is that a gene is a transcribed region.
In the beginning and at the end it says that a gene includes regulatory sequences but in the middle it says that distant regulatory elements aren’t included. There’s no justification for including (or excluding) regulatory sequences from the definition.
The main problem is that the answer devotes too much attention to alternative splicing and puts far too much emphasis on Mark Gerstein’s 2007 paper. I don’t think there are very many knowledgeable scientists who quote that paper. We usually don’t think that ENCODE researchers are reliable sources.
Gerstein, M.B., Bruce, C., Rozowsky, J.S., Zheng, D., Du, J., Korbel, J.O., Emanuelsson, O., Zhang, Z.D., Weissman, S. and Snyder, M. (2007) What is a gene, post-ENCODE? History and updated definition. Genome Research 17:669-681. https://doi.org/10.1101/gr.6339607
@Mitchell: Question #2 (How much junk DNA?) F
No graduate student is going to get a passing grade in my course if they don’t know the difference between junk DNA and non-coding DNA and the true history of the subject. Statements such as “It’s important to note that “junk” is not a precise term – some non-coding DNA is critical (e.g. regulatory enhancers, noncoding RNA genes), and some is likely neutral or nearly neutral sequence that persists without function.” are not acceptable. I expect graduate students to come up with a definition of junk DNA that they can defend.
There’s no critical discussion of the evidence for and against the idea that most of our genome is junk. The good news is that this version of ChatGPT acknowledges that most of our genome is probably junk.
Also, Quanta Magazine is not a reliable source.
@Mitchell: Question #3 (How much alternative splicing?) F
The answer is wrong. I’d give a barely passing grade to a student who intelligently discusses the possibility that most splice variants are junk but gives reasons for tentatively rejecting that idea. But you do not get a pass if you don’t even realize that there’s a controversy.
@Mitchell: Question #4 (Most common mechanism of evolution?) D
The answer is correct although it should have discussed whether it applies to all organisms or just those that have a lot of junk DNA. Also, the answer should have made clear what definition of evolution it was using.
It doesn’t get more than a “D” because the answer is rambling and because it clearly doesn’t understand the difference between mutation and random genetic drift and which one counts as a mechanism of evolution. It gets it right in one part of the answer but not another.
@Mitchell: Question #5 (Oldest biochemical pathway?) ?
I don’t know how to grade this answer. It makes sense that a carbon fixing pathway such as the Wood–Ljungdahl pathway (the acetyl-CoA pathway) is one of the oldest pathways. I’d give high marks to any student who named this pathway and defended the logic.
However, the ChatGPT response also talks about glycolysis as one of the oldest pathways. This is the standard misconception in the scientific literature and biochemistry textbooks but there's a problem. You can't have a pathway for degrading glucose until you have a pathway for making glucose (gluconeogenesis). The ChatGPT answer seems to appreciate the problem but doesn't follow through with the logic.
Also, the ChatGPT answer seems to think that glycolysis is some sort of magical net energy producing pathway in bacteria and the earliest forms of life. This is a misconception based on the unfortunate emphasis on animal metabolism (human) in undergraduate biochemistry courses.
@Mitchell: Question #6 (Free energy change.) B-
This is a pretty good answer. ChatGPT seems to understand the difference between the standard free energy change and the actual free energy change inside the cell. It’s a little confused about “coupling” and it adds extraneous information about enzymes - that’s mainly why it doesn’t get an “A.” Also, I’m not happy about the way ChatGPT refers to exergonic and endergonic reactions. How do you define these terms inside a cell where reactions can go in both directions?
The sources are chemistry texts and Wikipedia. I wonder how this version of ChatGPT knows which sources to rely on for correct information?
(Students whose main area of study is metabolism would not get as high a grade.)
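For anyone who wants the distinction between the standard and the actual free energy change made explicit, here is a minimal numerical sketch with made-up values (illustrative only, not measured cellular concentrations):

```python
import math

# Actual free energy change: deltaG = deltaG0' + R*T*ln(Q)
# All values below are hypothetical, chosen only to illustrate the principle.
R = 8.314                  # gas constant, J/(mol*K)
T = 310.0                  # ~37 degrees C, in kelvin
delta_G0_prime = 7_000.0   # a positive *standard* free energy change, J/mol

# If the cell keeps the mass-action ratio Q (products over reactants) low,
# the R*T*ln(Q) term is negative and can outweigh the positive standard term.
Q = 0.01
delta_G = delta_G0_prime + R * T * math.log(Q)

print(f"deltaG0' = {delta_G0_prime / 1000:+.1f} kJ/mol")
print(f"deltaG   = {delta_G / 1000:+.1f} kJ/mol  (negative, so the reaction proceeds as written)")
```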
@Mitchell: Question #7 (Why is most of the human genome transcribed?) F
The most important part of the answer is that the transcription of known genes accounts for more than 40% of the human genome if you define a gene as a DNA sequence that’s transcribed. Most of this is introns but it’s important to note this point.
The question then becomes what about the remaining 40% or so of the human genome that's transcribed? ChatGPT says that some unspecified amount of this is due to "tens of thousands" of non-coding genes but that's not true. The second reason for pervasive transcription is that regulatory sequences are bidirectionally transcribed, but no amount is given. ChatGPT also notes that there is a certain (unspecified) amount of transcriptional noise.
The answer is a mix of bits of information gleaned from the scientific literature with no attempt to put it together into a coherent answer. It also gets the history wrong - we’ve known since the early 1970s that most of the genome is transcribed.
@Mitchell: Question #8 (What’s the best definition of photosynthesis?) F
Carbon fixation (e.g. the Calvin cycle) is not a necessary part of photosynthesis. Water is not the only electron donor, so oxygen production is not a necessary part of photosynthesis. The best definition of photosynthesis has to apply to all photosynthetic organisms.
National Geographic and the Khan Academy are not reliable sources of information.
@Mitchell: Question #9 (What’s the best current explanation of the origin of life?) F
ChatGPT can’t explain the differences between RNA world, primordial soup, hydrothermal vents, Miller-Urey, meteorites, and warm little ponds in order to come up with a coherent and critical examination of the leading explanations.
@Larry Very interesting to see your assessments. I hope to make a follow-up comment on AI capabilities and cognitive styles, a few days from now.
Like Mitchell I am finding your grades fascinating. I hope you continue for all 13 questions. Thank you.
@Mitchell: Question #10 (Is the Three Domain Hypothesis valid?) B
ChatGPT recognizes that eukaryotes can’t be a separate domain of life because the first eukaryote arose from WITHIN the Archaea branch of the tree. This fact has been documented in the scientific literature for about 25 years but the idea of Three Domains is still widespread in textbooks and in popular science literature. I’m pleased that the AI program is able to give more credence to reliable sources.
However, that’s only part of the story so I’m being a bit generous by awarding a B grade. The other half of the story is that half of eukaryotic genomes are derived from the alphaproteobacteria lineage that’s within the Bacteria Domain. Eukaryotes arose from a fusion of and archaeal cell and a bacterial cell. The bacterial ancestor survives as mitochondria but many of its genes were transferred to the nucleus.
The fact that eukaryotes have two different ancestors is another nail in the coffin for the Three Domain Hypothesis. This is mentioned in the ChatGPT answer but the AI program clearly doesn’t recognize its significance.
@Mitchell: Question #11 (why is there an appropriate molecular clock?) B
This is a pretty good explanation but it could be much better written.
I'm assuming that "appropriate" is a typo for "approximate", but did the typo happen before or after submitting the question to the AI?
@Mitchell: Question #12 (Why can humans synthesize proline but not valine?) ?
This is one of those questions that could be answered in at least two ways. I try to avoid such questions but this time I messed up.
The simple answer is that humans can’t make valine because they are missing some of the enzymes in the valine biosynthesis pathway but they have all the enzymes required to make proline. ChatGPT answers this part correctly.
My intent was to go beyond that simple answer and ask students to explain why humans have lost the ability to synthesize some amino acids and not others. The ChatGPT answer attempts to answer that part of the question but doesn't really come to grips with the problem. The answer states correctly that the branched-chain biosynthesis pathway is more complex than some others but then goes on to state that animals might have found it to be "metabolically cheaper" to obtain valine from the diet rather than synthesize it directly. That doesn't make sense because the same logic applies to all the other amino acid biosynthetic pathways.
I’d give a B+ grade to student lawyers who insist that the simple answer must be counted as correct. I’d give a C- grade to the answer that I expected. There is no “correct” answer to the deeper “why” question - it’s a way of evaluating the ability to think critically.
@Mitchell: Question #13 (What is the Central Dogma of Molecular Biology?) F
You can’t get a passing grade on this question unless you correctly define the Central Dogma in the way Francis Crick did in his 1958 and 1970 papers AND explain why it differs from the common DNA → RNA → protein definition.
There’s enough information out there to alert ChatGPT to the problem of conflicting definitions and it’s interesting to note that it quotes a correct Crick version. However, it’s clear that the current algorithm isn’t capable of handling the conflict and mostly reverts to defending the popular version.
It’s interesting that ChatGPT extracts a quote from Wikipedia as “often stated as ‘DNA makes RNA, and RNA makes protein’” when the intent of that phrase was to illustrate the INCORRECT version of the Central Dogma. (I know this because I wrote much of the Wikipedia article.) This tells us that the current version of ChatGPT can find information and can, to some extent, recognize which sources are more reliable but it still has trouble dealing with controversy and nuance. It’s still a long way from exhibiting a Ph.D. level of intelligence.
Wikipedia: Central dogma of molecular biology
https://en.wikipedia.org/wiki/Central_dogma_of_molecular_biology
Larry - Sorry it's taken me so long to get back to this.
I just saw a comment by a mathematician on GPT-o3's performance when discussing mathematics:
https://thezvi.substack.com/p/ai-120-while-o3-turned-pro?open=false#%C2%A7on-your-marks
"The model often resembles a keen graduate student who has read extensively and can name-drop many results and authors. While this can seem impressive initially, it quickly becomes apparent to experts that the student hasn’t deeply internalized the material and is mostly regurgitating what they’ve read. The model exhibits a similar behavior—good at recognizing relevant material, but unable to extend or apply it in a novel way."
Does that sound like what you saw in these answers?
@Mitchell
Yes, that pretty much sums up the problem with current AI programs. They can retrieve information but they aren't "intelligent" in the sense that I use the word.
I think one way to get at the difference is that current AIs don't reflect on what they "learn". They "memorize" all kinds of things, and they even develop templates of generalization which allow them to produce novelty, but all of that happens unreflectively. In these newer AIs that are called reasoning models, a kind of reflection happens during the "chain of thought" process, in that they plan what to say and weigh up whether it's a good plan before they say it. So reflection happens when they answer questions, but not when they are trained - and perhaps this just isn't enough reflection to equal the performance of someone who truly learns and thinks about the material as they learn.
If they were reflecting throughout the training process, they might actually develop individual opinions, which could be a good thing or a bad thing... There's a new AI architecture from China called Absolute Zero in which it's supposed to guide its own training (the "zero" refers to zero presuppositions); that *might* produce something closer to the human capacity to reflect as you're learning, and not just when you're applying that knowledge.