Monday, November 24, 2025

Evolution explains the differences between the human and chimpanzee genomes

If you align similar regions of the human and chimpanzee genomes they turn out to be about 98.6% identical in nucleotide sequence. The total number of differences amounts to 44 million base pairs (bp). If the differences are due to mutations that have occurred since divergence from a common ancestor, then there would be 22 million mutations in each lineage.

The mutation rate is approximately 100 new mutations per generation. Most of these will be neutral mutations that have no effect on the survival of the individual and almost all of them will be lost within a few generations. A small number of these neutral mutations will become fixed in the population and it's these fixed mutations that produce most of the changes in the genome of evolving populations. According to the neutral theory of population genetics, the number of fixed neutral mutations corresponds to the mutation rate. Thus, in every evolving population there will be 100 new fixed mutations per generation.

This means that fixation of 22 million mutations would take 220,000 generations. The average generation time of humans and chimps is 27.5 years so this corresponds to about 6 million years. That's close to the time that humans and chimps diverged according to the fossil record. What this means is that evolutionary theory is able to explain the differences between the human and chimpanzee genomes; it has explanatory power. It could have been falsified if the number of differences between the human and chimp genomes had been very different from what this calculation predicts.
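The arithmetic in this argument is simple enough to check in a few lines. The following Python sketch is only an illustration: the function and parameter names are made up for the example, and the inputs are the rounded values used above (44 million differences, 100 fixed mutations per generation, 27.5-year generations).

```python
def divergence_time_years(total_differences, fixed_per_generation, generation_years):
    """Estimate time since the common ancestor under the simple neutral model."""
    # Differences accumulate along both lineages, so each lineage gets half.
    per_lineage = total_differences / 2
    # Under neutrality the fixation rate equals the mutation rate.
    generations = per_lineage / fixed_per_generation
    return generations * generation_years


years = divergence_time_years(44_000_000, 100, 27.5)
print(f"{years / 1e6:.2f} million years")  # 6.05 million years
```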

There is no other explanation that accounts for the data.

Background

The technology for sequencing proteins was developed by Fred Sanger in the 1950s and got him his first Nobel Prize in 1958. By the beginning of the 1960s homologous proteins such as hemoglobin and cytochrome c had been sequenced from a number of different species. The amino acid sequences of these proteins could be aligned and it soon became evident that the proteins from some species were much more similar than the proteins from other species. Furthermore, the similarities seemed to correspond to the inferred evolutionary relationships.

It was possible to construct trees showing the relationship between those amino acid sequences. One of the earliest trees was published by Emanuel Margoliash using cytochrome c proteins (Margoliash, 1963). The figure (above) illustrates the relationship of the various sequences from different species. The numbers on the branches represent the number of different amino acids between the sequences at the tips of the branches and the closest node. (I'm showing a later version here from Fitch and Margoliash (1967). This is a very famous tree that's found in many textbooks. The version shown here is from Mulligan (2008).)

Note that there's only one difference between the sequence of the human protein and that of the monkey and there are many more differences between the human and other mammals. The differences between humans and insects and fungi are even greater. This strongly suggests that what we're looking at is an evolutionary relationship between species. The remarkable and unexpected result is that the number of changes seems to correspond to the time of divergence of these various species and Margoliash noted that this relatively constant rate of change over time is what makes it possible to construct a robust tree.

Similar trees were constructed from hemoglobin sequences by Emile Zuckerkandl and Linus Pauling and they were the first ones to use the term "molecular clock" to identify the relatively constant rate of change in amino acid sequences over time. The history of this idea is fascinating and I strongly recommend reading Gregory Morgan's article from 1998 (Morgan, 1998). The molecular clock concept is clearly one of the most important discoveries in evolution in the last half of the 20th century.

Explaining the molecular clock was challenging in the early 1960s since it didn't seem to be consistent with evolution by natural selection. If all the amino acid substitutions are due to beneficial alleles that become fixed by natural selection in each species then there didn't seem to be any obvious reason why such changes should be constant over long periods of time. The explanation came from the development of the neutral theory of evolution by Kimura and others in the late 1960s.

The neutral theory was based on observations that most changes in the amino acid sequences of proteins were neutral with respect to fitness. This meant that fixation of neutral alleles was due to random genetic drift. Population genetics had shown that the rate of fixation was dependent only on the mutation rate so that as long as the mutation rate per generation is relatively constant over time then there should be a relatively constant rate of change giving rise to an approximate molecular clock. [The Modern Molecular Clock]
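The population genetics result mentioned here can be written out in one line. This is the standard textbook derivation for a diploid population, given as a sketch: the symbols N (population size), μ (neutral mutation rate per gene copy per generation), and k (rate of fixation per generation) are notation introduced for this illustration and do not appear in the post.

```latex
% New neutral mutations arising each generation among 2N gene copies: 2N\mu
% Probability that any one new neutral mutation eventually drifts to fixation: 1/(2N)
k = \underbrace{2N\mu}_{\text{new mutations per generation}}
    \times
    \underbrace{\frac{1}{2N}}_{\text{probability of fixation}}
  = \mu
```

The population size cancels, which is why the rate of change depends only on the mutation rate.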

By the late 1970s, the concept of a molecular clock had been extended to RNA sequences, especially ribosomal RNAs (rRNA). Since rRNA sequences are very similar in all species this enabled scientists to construct very large trees that included all known species. It was this data that led to the discovery of two different Kingdoms of bacteria: Bacteria and Archaea.

Later on the comparisons used whole genomes and the extra data enabled more precise estimations of divergence times.

Now let's look at the data used to explain the difference between the human and chimp genome.

Percent Similarity

I used 98.6% similar in the calculation. This value is based on aligning about 2.1 billion base pairs of similar regions in the two genomes. It does not account for regions that do not align because of duplications, insertions, and deletions. There are about 26,000 regions that don't align and they range in size from just a few base pairs to over 1000 bp. If you count all the differences in those duplications, insertions, and deletions, then you can get percent similarities of much less than 98.6% (Yoo et al. 2025).

This is deceptive because what we are interested in is the number of mutations that have occurred and a 1000 bp insertion is not 1000 different mutations; it's only one mutation. This is why the percent similarity in aligned regions is much closer to a true estimate of the number of mutations. [What's the Difference Between a Human and Chimpanzee?]

The earliest data on the difference between the human and chimp genomes depended on the rate of hybridization of DNA from each species and gave rise to the common view that the DNA from the two species is 98.5% similar (see Britten, 2002). The first direct comparison of substantial amounts of human and chimp genome sequences indicated a difference of 1.4% (Britten, 2002). The first sequenced chimpanzee genome showed a difference of 1.23% (The Chimpanzee Sequencing and Analysis Consortium, 2005). Subsequent analyses indicated a range of 1.1-1.4% (Rogers and Gibbs, 2014).

Genome Size

In order to calculate the total number of mutations you have to multiply the 1.4% difference by the size of the genome. The best current estimate of the human genome size is 3.1 billion bp and the size of the recently completed chimpanzee genome is 3.14 billion bp (Yoo et al., 2025). (0.014 × 3.1 × 10^9 = 43.4 × 10^6; I rounded up to 44 million)
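As a quick check on this multiplication, here is a hedged Python sketch that repeats it for the percent differences quoted in the previous section (1.1%, 1.23%, and 1.4%). The function name is illustrative, and applying every percentage to a 3.1 billion bp genome is a simplifying assumption of the sketch.

```python
GENOME_SIZE_BP = 3_100_000_000  # human genome size used in the post


def total_differences(percent_difference, genome_size_bp=GENOME_SIZE_BP):
    """Number of differing base pairs implied by a percent difference."""
    return round(percent_difference / 100 * genome_size_bp)


for pct in (1.1, 1.23, 1.4):
    print(f"{pct}% -> {total_differences(pct) / 1e6:.1f} million bp")
# 1.1% -> 34.1 million bp
# 1.23% -> 38.1 million bp
# 1.4% -> 43.4 million bp
```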

Mutation Rate

There's a lot of data on the mutation rate in humans. Different papers give values that cluster around 100 mutations per generation or slightly less. I used 100 mutations per generation for simplicity. The calculated time of divergence doesn't change very much with slightly different values. [Parental age and the human mutation rate] [Human mutation rates] [Human mutation rates - what's the right number?]
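To see how little the estimate moves, the sketch below recomputes the divergence time for a few mutation rates. The rates of 90 and 95 are illustrative stand-ins for "slightly less than 100", not values taken from the cited papers; the 22 million mutations per lineage and the 27.5-year generation time come from the post.

```python
MUTATIONS_PER_LINEAGE = 22_000_000  # half of the 44 million differences
GENERATION_YEARS = 27.5             # average of the chimp and human estimates


def divergence_million_years(mutations_per_generation):
    """Divergence time in millions of years for a given neutral fixation rate."""
    generations = MUTATIONS_PER_LINEAGE / mutations_per_generation
    return generations * GENERATION_YEARS / 1e6


# 90 and 95 are hypothetical rates chosen only to show the sensitivity.
for rate in (90, 95, 100):
    print(f"{rate} per generation -> {divergence_million_years(rate):.2f} million years")
# 90 per generation -> 6.72 million years
# 95 per generation -> 6.37 million years
# 100 per generation -> 6.05 million years
```

All three results fall between 6 and 7 million years.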

Generation Time

The actual time of divergence depends on estimates of the number of generations and the generation time of the evolving populations. Most people think that human generation times are about 20 years based on the time that a couple has their first baby but that's not the same as the average generation time. The real generation time is the average number of years between the time that a man and woman are born and the time of the birth of two children who survive to reproduce. And it's not just the age of the mother that counts; the age of the father is very important because most mutations occur in males during spermatogenesis.

I used the data from Langergraber et al. (2012) who estimated that the generation time of chimpanzees in the wild averaged 25 years and the human generation time is about 30 years. (I used the average, 27.5 years.) The human generation time is consistent with genealogical studies but since I have an extensive genealogy of my family I decided to check to see if this is correct. There are two important members of my family who share a common ancestor who was born in 1350. They are 18th cousins and the generation time is about 31 years.
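The genealogy check can be sketched the same way. An nth cousin pair shares a common ancestor n + 1 generations back, so 18th cousins are 19 generations removed from their shared ancestor. The post does not give the cousins' birth years, so the 1940 used below is a purely hypothetical value for illustration, and the function names are made up for this sketch.

```python
def generations_to_common_ancestor(cousin_degree):
    """First cousins share grandparents (2 generations back), and so on."""
    return cousin_degree + 1


def mean_generation_time(ancestor_birth_year, descendant_birth_year, cousin_degree):
    """Average years per generation along one line of descent."""
    span = descendant_birth_year - ancestor_birth_year
    return span / generations_to_common_ancestor(cousin_degree)


# 1350 is from the post; 1940 is a hypothetical birth year, not from the post.
print(round(mean_generation_time(1350, 1940, 18), 1))  # 31.1
```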

Coalescence and Divergence

I used simple arithmetic to calculate the time that humans and chimpanzees last shared a common ancestor. This assumes that both lineages had reasonably-sized populations where the rate of fixation could be modeled. There are more sophisticated algorithms that use coalescent theory to model divergence times. These algorithms take into account effective population sizes (Ne) and other factors such as whether the fixed alleles are independent and whether some of them have reverted.

Most of the more sophisticated calculations predict similar divergence times of around 6 million years. For example, Yoo et al. (2025) estimated a time between 5.5 and 6.3 million years ago with an average effective population size of 198,000.


1. Assuming a genome size of 3.1 billion base pairs (bp) in each species. 1.4% × 3.1 × 10^9 bp = 43.4 million bp, rounded up to 44 million to simplify calculations.

2. A mutation rate of about 100 mutations per generation is consistent with all the data from various sources but many scientists emphasize the direct measurements which tend to give a somewhat lower value. The difference doesn't seriously affect the overall calculation; any reasonable rate is still consistent with a divergence time of 5-7 million years. [See Parental age and the human mutation rate.]

Britten, R.J. (2002) Divergence between samples of chimpanzee and human DNA sequences is 5%, counting indels. Proceedings of the National Academy of Sciences 99:13633-13635. [doi: 10.1073/pnas.172510699]

Fitch, W.M. and Margoliash, E. (1967) Construction of phylogenetic trees. Science 155:279–284.

Langergraber, K.E., et al. (2012) Generation times in wild chimpanzees and gorillas suggest earlier divergence times in great ape and human evolution. Proc. Natl. Acad. Sci. (USA) 109:15716-15721. [doi: 10.1073/pnas.1211740109]

Margoliash, E. (1963) Primary structure and evolution of cytochrome c. Proc. Natl. Acad. Sci. USA 50:672-679.

Morgan, G. (1998) Emile Zuckerkandl, Linus Pauling, and the Molecular Evolutionary Clock, 1959-1965. J. Hist. Biol. 31:155-178. [PDF]

Mulligan, P.K. (2008) Proteins, evolution of in AccessScience, ©McGraw-Hill Companies.

Rogers, J. and Gibbs, R.A. (2014) Comparative primate genomics: emerging patterns of genome content and dynamics. Nature Reviews Genetics 15:347-359. [doi: 10.1038/nrg3707]

The Chimpanzee Sequencing and Analysis Consortium (2005) Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437:69-87. [doi: 10.1038/nature0407]

Yoo, D. et al. (2025) Complete sequencing of ape genomes. Nature 641:401-418. [doi: 10.1038/s41586-025-08816-3]

33 comments :

Joe Felsenstein said...

... except for 5% or so of the genome which is subject to purifying selection, which greatly reduces the rate of substitution, and sometimes advantageous mutations, which can increase the rate of substitution.

Mehrshad said...

Neutral theory has never had (and still doesn't have) any good scientific evidence. Kimura just proposed it to save Darwin's theory from Haldane's dilemma (or the waiting time problem), which predicts an enormous time for cooperative mutations to appear and become fixed by mutation and natural selection, especially in humans, who have a small population size.

Anonymous said...

Light gonna be here any minute now talkin some nonsense bout humans evolving from Neanderthals and their 99.7% similar DNA.

Light said...

How does a constant mutation rate square with punctuated equilibrium?

John Harshman said...

You might as well ask how apples square with aardvarks. With every post, you reveal more of your ignorance of evolution.

Light said...

Nice try John.

John Harshman said...

You keep using those words. I do not think they mean what you think they mean.

Light said...

How does a constant mutation rate square with punctuated equilibrium?
Is it because the mutation rate is at the molecular level and punctuated equilibrium is at the organism level? And different rules apply at each?

Larry Moran said...

@Joe Felsenstein: Let's assume that about 10% of the genome is functional and therefore it is under purifying selection. This means that the rate of substitution in those regions is less than the neutral rate.

It does not mean that there are no substitutions at all in those regions. I don't know the average neutral substitution rate in functional regions of the genome. Do you have a good estimate of that value? For example, what percentage of mutations in typical coding regions are effectively neutral?

I get your point that less than 100% of the human and chimp genomes are evolving at the neutral rate but I question whether this "greatly reduces the rate of substitution." Do you have data on this? It seems to me that this is a minor effect that doesn't make much difference to the big picture given the probable error rates associated with the other variables.

What I'm more worried about is whether the sequences of the standard reference genomes really represent the alleles that have become fixed in the populations. I'm certain that this isn't correct because both the human and chimp populations carry an enormous amount of variation at millions of sites. But can we use the standard reference genomes anyway because the "noise" cancels out in the end?

I'd like to incorporate the answer to that question in the post. In fact, that's why I've delayed putting up this post for more than 10 months. I posted the incomplete version because many creationists are making a big deal about the difference between the chimp and human genomes and they (and most defenders of evolution) seem to be unaware of the correct explanation for most of the differences.

Larry Moran said...

From time to time I learn some new facts about biology that seem to conflict with what I thought I knew. My first reaction to that apparent conflict is to assume that my understanding was incorrect so I do some research to see whether that's true.

For example, let's say I learn for the first time that there's a relatively constant rate of mutation due to the intrinsic error rate of DNA replication and repair. This seems to be in conflict with my understanding of punctuated equilibria. My reaction would be to go back and study punctuated equilibria to see if I understood it correctly.

The creationist reaction is to assume that scientists are stupid and evolution is wrong.

John Harshman said...

Just to clarify: is a constant rate of mutation really in conflict with your understanding of punctuated equilibria? And when you say "mutation" do you really mean "substitution" or "fixation"? Of course there's no conflict in any case, given that most of your genome is junk, and a punctuation event involves only a small number of alleles. And of course most evolutionary biologists don't think PE is a thing anyway.

Joe Felsenstein said...

Larry, to your concern: "I get your point that less than 100% of the human and chimp genomes are evolving at the neutral rate but I question whether this 'greatly reduces the rate of substitution.'"

Sorry, I meant greatly reduces the rate of substitution in those (functional) regions of the genome, not in the whole genome. I wasn't clear enough.

Light said...

How does a constant mutation rate square with punctuated equilibrium?
Is it because the mutation rate is at the molecular level and punctuated equilibrium is at the organism level? And different rules apply at each?
Selection applies at the organism level and genetic drift applies at the molecular level.

Light said...

As a sidenote: that is what I would predict. Different levels with different rules.

John Harshman said...

I detect a slight glimmer of understanding here, which ought to be encouraged. It's not the mutation rate that should concern us but the fixation rate, and those are the same only under neutrality. Selection is primarily at the organism level, with the entire genotype contributing. Punctuated equilibria, if that were really a thing, would be a population-level phenomenon but would also involve selection at the individual level. Genetic drift also happens at the population level, since it's about changes in allele frequencies, and of course those changes are the summations of individual reproductive success at the individual level. Anyway, it's the same rules at the same level, more or less. It's just that only a small part of the genome is under selection, and even in those parts, changes are mostly neutral.

Larry Moran said...

I will continue to delete all irrelevant comments. Don't bother commenting unless you have something substantive to contribute.

Larry Moran said...

There is no mechanism on blogspot for banning Light/Doug Dobney. I can't prevent him from posting comments so what I will do is delete all of his comments as soon as I see them. There's no point in retaining the comments of anyone who replies to him because the context is missing.

I strongly recommend that you do not feed the troll or encourage him by responding to his ridiculous comments.

gert korthof said...

Larry, you deleted a comment where I pointed with a link to the exact location in a video where your book is shown. Why?

Larry Moran said...

@gert korthof You posted a link to a creationist video arguing against junk DNA using very stupid arguments. I do not consider that relevant. I'm well aware of that video and all other creationist videos claiming that they were the ones who predicted that most of the genome would be functional and their prediction turned out to be correct.

Larry Moran said...

Doug Dobney/Light was banned from this blog many years ago because he spammed many articles. I will continue that ban by deleting his comments and those of anyone who replies to him.

However, I note that he has recently asked some reasonable questions about the amount of DNA devoted to regulation so here's my summary of the relevant information with references to page numbers in my book.

protein-coding genes (~38%, p. 126, p.151)
coding DNA (~1%, p. 151)

Most of the protein coding genes consist of introns and introns are junk DNA (pp. 151-154)

regulatory sequences (<0.2%, pp. 128-129, pp. 264-296)

Here's a link to a post from last year that explains my estimate of the amount of DNA devoted to regulation and why speculations such as 8% or 20% are ridiculous.

Transcription factors and their binding sites
https://sandwalk.blogspot.com/2024/03/nils-walter-disputes-junk-dna-8.html

Larry Moran said...

Doug Dobney is a well-known internet kook who used to live in Peterborough, Ontario, Canada. He may still be there for all I know.

I was willing to tolerate some of the spam comments he posted but the last straw was when he sent letters to the chair of my department and the dean of the Faculty of Medicine complaining about my blog.

A New Moderation Policy: Doug Dobney Is Banned on Sandwalk

Anonymous said...

For completeness, more recent analyses report values of 8.2% (7.1–9.2%). Intro offers reasonable background. Rands et al., PLoS Genet 2014. PMID 25057982.

As a layperson in this area, I struggle to distinguish between regions of the genome described as ‘functional’ with relatively higher end (~15%) values and those under weak or under-constrained selection.

Larry Moran said...

@Anonymous: You are referring to this paper.

Rands, C.M., Meader, S., Ponting, C.P. and Lunter, G. (2014) 8.2% of the Human Genome Is Constrained: Variation in Rates of Turnover across Functional Element Classes in the Human Lineage. PLOS Genetics 10:e1004525. doi: 10.1371/journal.pgen.1004525

There are other, more recent, papers that attempt to identify the amount of conserved DNA or the amount that is under purifying selection. It's difficult to get an exact value since most of the techniques are not very good at identifying short functional stretches of DNA such as biologically relevant transcription factor binding sites.

However, the latest attempts from the analysis of 150,000 complete genomes do get down to that level. The results show, once again, that less than 10% of the genome is under purifying selection and that's consistent with the model of 90% junk DNA from half-a-century ago.

Identifying functional DNA (and junk) by purifying selection

Daniel Rokhsar said...

“What I'm more worried about is whether the sequences of the standard reference genomes really represent the alleles that have become fixed in the populations. I'm certain that this isn't correct because both the human and chimp populations carry an enormous amount of variation at millions of sites. But can we use the standard reference genomes anyway because the "noise" cancels out in the end?”

If I understand your question, I think the answer is that the difference between a single human chromosome and a single chimpanzee chromosome includes any mutations that have occurred on either lineage since the last common ancestor of that sequence. (Recombination makes the age of that ancestral sequence variable from site to site but there is an average.) Some of these mutational changes are fixed in each species (either by selection or drift) and would be there no matter which specific representative individual you chose. Others are more recent and are polymorphic in the species. The specific recent mutations may differ from individual to individual, but are always there and count towards the divergence.

But maybe I am missing your question?

Larry Moran said...

@Daniel Rokhsar: I think your explanation is okay but it doesn't quite address the issue.

The time of divergence is based on the theory that the fixation rate is equal to the mutation rate but we aren't always looking at mutations that have become fixed in the population. Some of them might be polymorphic and some of them might be somatic cell mutations.

I don't think the exceptions have a serious impact on the overall calculation but it's an issue that needs to be addressed.

Larry Moran said...

I updated the section on genome size and added information on generation time and coalescence & divergence.

John Harshman said...

Do anyone's calculations take into account the male mutation rate vs. the female rate, the former much greater than the latter? And that the male rate increases with age, since the sperm population is always turning over? One could also make estimates based on autosomes, X (presumably slower) and Y (presumably faster) loci. I wonder how such estimates of divergence time would match up.

Larry Moran said...

@John Harshman: Check the links under "Mutation Rate." The value of 100 mutations per generation takes into account the much higher mutation rate in males. It's difficult to model the increase in mutation rate in older men but it's not likely to have much of an influence on the divergence calculation since we're using 27.5 years as an average generation time.

Anonymous said...

"What I'm more worried about is whether the sequences of the standard reference genomes really represent the alleles that have become fixed in the populations. I'm certain that this isn't correct because both the human and chimp populations carry an enormous amount of variation at millions of sites. But can we use the standard reference genomes anyway because the "noise" cancels out in the end?"

I would think that each human is genetically equidistant to any given chimp, and vice versa. However, immediately after the human and chimp lineages diverged the difference between the genomes from each lineage would still retain much of the variation they inherited from the common ancestral population. In the human population, there are about 4-5 million substitution mutations separating any two given humans. This could be added on to the number of mutations that occurred once the lineages went their separate ways.

Joe Felsenstein said...

The standard reference sequence is not intended to consist of those alleles at each site that are fixed, or even the most frequent. It is in effect the sequence of one haphazardly sampled individual.

Larry Moran said...

@Joe Felsenstein: Actually, the original human standard reference genome is a composite of several different individuals from the area around Buffalo (New York, USA) although much of it came from a single male with the code name RP11 (p. 126 in my book.) When different sequences overlapped, the most common allele was chosen.

I think that in cases where the sequences were heterozygous an arbitrary choice was made.

The original final draft version has been updated many times and annotations now show where there are variants at particular sites, especially those within genes. I think there have been cases where the original version was changed to reflect the more common allele.

Given the known mutation rate and the size of the current human population, we can be certain that every possible site contains multiple alleles (A, T, G, or C). In other words, strictly speaking there are only a few totally fixed alleles in the human population except for the small percentage at extremely conserved loci.

We would like to know what the sequences of our ancient ancestors from 100,000 years ago looked like when the population was much smaller but we don't have that data.

I don't think this issue affects the overall calculation of divergence times but it's worth pointing out the problem so that we don't deliberately mislead people by exaggerating the evidence.


Anonymous said...

The difference between a chimp sequence and a human sequence is the accumulated mutation along each lineage since the last common ancestor of the two sequences. This varies from site to site because of recombination, but there is some average time T.

The expected number of differences between the two sequences is then 2uT allowing for mutations that occurred on each lineage.

Between any human and any chimp sequence (could be the references) these 2uT mutations can be put into two categories. Some are old mutations that are fixed between species, either by selection or drift, and some are recent and are polymorphic within chimp or human.

The fixed type (substitutions) would be the same no matter what sequence you chose to represent human and chimp. The polymorphic type vary in position among individuals but the number of these sites is roughly the same no matter which representative you chose for that species.

If you are most interested in the fixed differences, then you’d have to sample multiple human and chimp genomes to isolate the fixed differences from the intraspecific polymorphism. But since most variants are rare, even a few genomes of each species are enough.

Even comparing diploids helps. If A and B are alleles at a site then there are four possibilities

Human:Chimp
AA : BB ~ fixed difference. There is a small chance that you’re unlucky and sequenced a homozygote of a rare segregating allele
AB : AA ~ polymorphic site in human, not fixed difference
AA : AB ~ polymorphism in chimp
AB : AB is very rare and is a candidate for balancing selection or an old polymorphism in the chimp human ancestor that has not yet become fixed
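The four cases listed above can be expressed as a small classifier. This is a minimal sketch, not a real variant-calling workflow: the function name, the two-letter genotype strings, and the extra "no difference" label for identical homozygotes are assumptions added for illustration.

```python
def classify_site(human, chimp):
    """Classify one site from two diploid genotypes such as "AA" and "AB"."""
    human_alleles, chimp_alleles = set(human), set(chimp)
    human_het = len(human_alleles) == 2
    chimp_het = len(chimp_alleles) == 2
    if human_het and chimp_het:
        return "polymorphic in both (rare)"
    if human_het:
        return "polymorphic in human"
    if chimp_het:
        return "polymorphic in chimp"
    # Both homozygous: either the same allele or a candidate fixed difference.
    if human_alleles == chimp_alleles:
        return "no difference"
    return "candidate fixed difference"


for pair in (("AA", "BB"), ("AB", "AA"), ("AA", "AB"), ("AB", "AB")):
    print(pair, "->", classify_site(*pair))
```

The result for AA : BB is labeled a candidate because, as noted above, a single individual could be a homozygote for a rare segregating allele.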

Anonymous said...

“We would like to know what the sequences of our ancient ancestors from 100,000 years ago looked like when the population was much smaller but we don't have that data.”

From surveys like the “1000 genomes” project, we can use the most frequently observed (major) allele at each locus as a proxy for the state of the ancestral population. Since the vast majority of sites have major allele frequency ~1 this is generally unambiguous. Relative to a “reference” genome this involved flipping ~4-5 m sites. Loci that have alternative major alleles in different human populations are very rare; in these cases one of them typically matched the chimp allele and could be used.

Of course the 100,000 year old ancestral population also had standing sequence variation. Some of these ancestral variants are still around as variation in today’s population; these are typically variants shared across continents. Other variation in the ancestral population was rare then and is gone now, having been mostly lost by drift.
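The major-allele proxy described in this comment can be illustrated with a toy example. This is a sketch under stated assumptions: the function name, the dictionary of per-population allele strings, and the sample data are all invented for illustration and do not reflect the actual 1000 Genomes file formats or tooling.

```python
from collections import Counter


def ancestral_proxy(alleles_by_population, chimp_allele):
    """Pick a proxy ancestral allele for one site, or None if unresolved."""
    # Major (most frequent) allele within each sampled population.
    majors = {
        Counter(alleles).most_common(1)[0][0]
        for alleles in alleles_by_population.values()
    }
    if len(majors) == 1:
        return majors.pop()
    # Populations disagree: fall back on whichever major allele matches chimp.
    return chimp_allele if chimp_allele in majors else None


# Toy data, invented for illustration only.
site = {"pop1": "AAAAAG", "pop2": "GGGGAG"}
print(ancestral_proxy(site, chimp_allele="G"))  # G
```

If two alleles are exactly tied within a population, `Counter.most_common` returns whichever was seen first, so a real analysis would need an explicit rule for ties.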