Monday, November 24, 2025

Evolution explains the differences between the human and chimpanzee genomes

If you align similar regions of the human and chimpanzee genomes, they turn out to be about 98.6% identical in nucleotide sequence. The total number of differences amounts to 44 million base pairs (bp). If the differences are due to mutations that have occurred since divergence from a common ancestor, then there would be 22 million mutations in each lineage.

The mutation rate is approximately 100 new mutations per generation. Most of these will be neutral mutations that have no effect on the survival of the individual and almost all of them will be lost within a few generations. A small number of these neutral mutations will become fixed in the population and it's these fixed mutations that produce most of the changes in the genome of evolving populations. According to the neutral theory of population genetics, the number of fixed neutral mutations corresponds to the mutation rate. Thus, in every evolving population there will be 100 new fixed mutations per generation.

This means that fixation of 22 million mutations would take 220,000 generations. The average generation time of humans and chimps is 27.5 years so this corresponds to about 6 million years. That's close to the time that humans and chimps diverged according to the fossil record. What this means is that evolutionary theory is able to explain the differences between the human and chimp genomes; it has explanatory power. It could have been falsified if the number of differences between the two genomes had been very different.
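The arithmetic in the last few paragraphs is easy to check. Here's a minimal Python sketch that uses only the numbers quoted above; the function and variable names are just illustrative choices, and the even split of differences between the two lineages is the same simplifying assumption made in the text.

```python
# Back-of-the-envelope divergence-time estimate using the figures quoted above.
TOTAL_DIFFERENCES_BP = 44_000_000   # total human-chimp differences (from the text)
FIXED_PER_GENERATION = 100          # neutral mutations fixed per generation (from the text)
GENERATION_TIME_YEARS = 27.5        # average generation time (from the text)


def divergence_estimate(total_differences, fixed_per_generation, generation_time):
    """Return (mutations per lineage, generations, years) since the split."""
    per_lineage = total_differences / 2              # assume an even split between lineages
    generations = per_lineage / fixed_per_generation
    years = generations * generation_time
    return per_lineage, generations, years


per_lineage, generations, years = divergence_estimate(
    TOTAL_DIFFERENCES_BP, FIXED_PER_GENERATION, GENERATION_TIME_YEARS
)
print(f"{per_lineage:,.0f} mutations per lineage")
print(f"{generations:,.0f} generations")
print(f"{years / 1e6:.2f} million years")
```

With these inputs the script reports 22,000,000 mutations per lineage, 220,000 generations, and 6.05 million years.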

There is no other explanation that accounts for the data.

Background

The technology for sequencing proteins was developed by Fred Sanger in the 1950s and got him his first Nobel Prize in 1958. By the beginning of the 1960s homologous proteins such as hemoglobin and cytochrome c had been sequenced from a number of different species. The amino acid sequences of these proteins could be aligned and it soon became evident that the proteins from some species were much more similar than the proteins from other species. Furthermore, the similarities seemed to correspond to the inferred evolutionary relationships.

It was possible to construct trees showing the relationship between those amino acid sequences. One of the earliest trees was published by Emanuel Margoliash using cytochrome c proteins (Margoliash, 1963). The figure (above) illustrates the relationship of the various sequences from different species. The numbers on the branches represent the number of different amino acids between the sequences at the tips of the branches and the closest node. (I'm showing a later version here from Fitch and Margoliash (1967). This is a very famous tree that's found in many textbooks. The version shown here is from Mulligan (2008).)

Note that there's only one difference between the sequence of the human protein and that of the monkey, and there are many more differences between the human and other mammals. The differences between humans and insects and fungi are even greater. This strongly suggests that what we're looking at is an evolutionary relationship between species. The remarkable and unexpected result is that the number of changes seems to correspond to the time of divergence of these various species, and Margoliash noted that this relatively constant rate of change over time is what makes it possible to construct a robust tree.

Similar trees were constructed from hemoglobin sequences by Emile Zuckerkandl and Linus Pauling and they were the first ones to use the term "molecular clock" to identify the relatively constant rate of change in amino acid sequences over time. The history of this idea is fascinating and I strongly recommend reading Gregory Morgan's article from 1998 (Morgan, 1998). The molecular clock concept is clearly one of the most important discoveries in evolution in the last half of the 20th century.

Explaining the molecular clock was challenging in the early 1960s since it didn't seem to be consistent with evolution by natural selection. If all the amino acid substitutions are due to beneficial alleles that become fixed by natural selection in each species, then there didn't seem to be any obvious reason why such changes should be constant over long periods of time. The explanation came from the development of the neutral theory of evolution by Kimura and others in the late 1960s.

The neutral theory was based on observations that most changes in the amino acid sequences of proteins were neutral with respect to fitness. This meant that fixation of neutral alleles was due to random genetic drift. Population genetics had shown that the rate of fixation was dependent only on the mutation rate, so as long as the mutation rate per generation is relatively constant over time there should be a relatively constant rate of change, giving rise to an approximate molecular clock. [The Modern Molecular Clock]
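The reason the fixation rate depends only on the mutation rate can be shown in three lines. This is a sketch of the standard population genetics argument, not something spelled out in this post. The symbols are assumptions introduced here for illustration: N is the size of a diploid population, μ is the neutral mutation rate per genome copy per generation, and k is the rate at which neutral mutations are fixed.

```latex
\begin{align}
\text{new neutral mutations arising per generation} &= 2N\mu \\
\text{probability that any one of them is eventually fixed by drift} &= \frac{1}{2N} \\
k &= 2N\mu \times \frac{1}{2N} = \mu
\end{align}
```

The population size cancels: a large population produces more new mutations, but each one has a correspondingly smaller chance of drifting to fixation.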

By the late 1970s, the concept of a molecular clock had been extended to RNA sequences, especially ribosomal RNAs (rRNA). Since rRNA sequences are very similar in all species, this enabled scientists to construct very large trees that included all known species. It was this data that led to the discovery of two different Kingdoms of bacteria: Bacteria and Archaea.

Later on the comparisons used whole genomes and the extra data enabled more precise estimations of divergence times.

Now let's look at the data used to explain the differences between the human and chimp genomes.

Percent Similarity

I used 98.6% similar in the calculation. This value is based on aligning about 2.1 billion base pairs of similar regions in the two genomes. It does not account for regions that do not align because of duplications, insertions, and deletions. There are about 26,000 regions that don't align and they range in size from just a few base pairs to over 1000 bp. If you count all the differences in those duplications, insertions, and deletions, then you can get percent similarities of much less than 98.6%.

This is deceptive because what we are interested in is the number of mutations that have occurred and a 1000 bp insertion is not 1000 different mutations; it's only one mutation. This is why the percent similarity in aligned regions is much closer to a true estimate of the number of mutations. [What's the Difference Between a Human and Chimpanzee?]
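A toy example makes the distinction concrete. The numbers below are hypothetical, chosen only to show how a single large insertion inflates a per-base-pair count; they are not measurements from either genome.

```python
# Toy illustration: counting differing base pairs vs. counting mutation events.
# All numbers are hypothetical and chosen only to make the contrast visible.
REGION_LENGTH_BP = 100_000
substitutions = 1_400            # single-base differences, one mutation each
indel_lengths_bp = [1_000]       # one 1000 bp insertion: a single mutation

# Treating every base pair inside an indel as a separate difference
differing_bp = substitutions + sum(indel_lengths_bp)
# Treating each indel as the one mutation it actually was
mutation_events = substitutions + len(indel_lengths_bp)

percent_aligned_only = 100 * substitutions / REGION_LENGTH_BP
percent_counting_indel_bp = 100 * differing_bp / REGION_LENGTH_BP

print(f"differing base pairs: {differing_bp}")
print(f"mutation events:      {mutation_events}")
print(f"difference in aligned sequence only: {percent_aligned_only:.2f}%")
print(f"difference counting indel base pairs: {percent_counting_indel_bp:.2f}%")
```

In this made-up region the per-base-pair count jumps from 1.40% to 2.40%, yet the number of mutations only goes from 1,400 to 1,401.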

The earliest data on the difference between the human and chimp genomes depended on the rate of hybridization of DNA from each species and gave rise to the common view that the DNA from the two species is 98.5% similar (see Britten, 2002). The first direct comparison of substantial amounts of human and chimp genome sequences indicated a difference of 1.4% (Britten, 2002). The first sequenced chimpanzee genome showed a difference of 1.23% (The Chimpanzee Sequencing and Analysis Consortium, 2005). Subsequent analyses indicated a range of 1.1-1.4% (Rogers and Gibbs, 2014).

Genome Size

In order to calculate the total number of mutations you have to multiply the 1.4% difference by the size of the genome. The best current estimate of the human genome size is 3.1 billion bp and it's reasonable to assume that the final version of the chimpanzee genome will be close to this size. (0.014 × 3.1 × 10^9 = 43.4 × 10^6; I rounded up to 44 million.)
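To see how much the choice of percent difference matters, here's a short sketch that applies the same multiplication to the values quoted in the previous section. The helper name and the labels are illustrative only.

```python
GENOME_SIZE_BP = 3.1e9   # human genome size used in the text


def total_differences_bp(percent_difference, genome_size_bp=GENOME_SIZE_BP):
    """Convert a percent difference in aligned sequence to a total bp count."""
    return percent_difference / 100 * genome_size_bp


# Percent differences quoted in the text for aligned regions
estimates = {
    "low end of the published range": 1.1,
    "first chimpanzee genome (2005)": 1.23,
    "value used in this post": 1.4,
}

for label, percent in estimates.items():
    total = total_differences_bp(percent)
    print(f"{label}: {percent}% -> {total / 1e6:.1f} million bp")
```

Across that range the total comes out at roughly 34 to 43 million bp, so the 44 million used in the calculation sits at the upper end.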

Mutation Rate

There's a lot of data on the mutation rate in humans. Different papers give values that cluster around 100 mutations per generation or slightly less. I used 100 mutations per generation for simplicity. The calculated time of divergence doesn't change very much with slightly different values. [Parental age and the human mutation rate] [Human mutation rates] [Human mutation rates - what's the right number?]
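A quick sensitivity check shows how the estimate moves with the mutation rate. The 22 million mutations per lineage and the 27.5 year generation time come from the calculation at the top of the post; the rates in the loop are hypothetical values bracketing 100 and are not taken from any particular paper.

```python
MUTATIONS_PER_LINEAGE = 22_000_000   # from the calculation at the top of the post
GENERATION_TIME_YEARS = 27.5         # from the top of the post


def divergence_time_my(mutations_per_generation):
    """Divergence time in millions of years for a given per-generation rate."""
    generations = MUTATIONS_PER_LINEAGE / mutations_per_generation
    return generations * GENERATION_TIME_YEARS / 1e6


# Hypothetical rates bracketing 100; not taken from any specific paper.
for rate in (80, 90, 100, 110, 120):
    print(f"{rate:>3} mutations/generation -> {divergence_time_my(rate):.2f} million years")
```

Lower rates push the divergence time back and higher rates pull it forward, but over this range the answer stays within a couple of million years of the 6 million year estimate.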

Generation Time


1. Assuming a genome size of 3.1 billion base pairs (bp) in each species. 1.4% × 3.1 × 10^9 bp = 43.4 million bp, rounded up to 44 million to simplify calculations.

2. A mutation rate of about 100 mutations per generation is consistent with all the data from various sources but many scientists emphasize the direct measurements, which tend to give a somewhat lower value. The difference doesn't seriously affect the overall calculation: any reasonable rate is still consistent with a divergence time of 5-7 million years. [See Parental age and the human mutation rate.]

Britten, R.J. (2002) Divergence between samples of chimpanzee and human DNA sequences is 5%, counting indels. Proceedings of the National Academy of Sciences 99:13633-13635. [doi: 10.1073/pnas.172510699]

Fitch, W.M. and Margoliash, E. (1967) Construction of phylogenetic trees. Science 155:279–284.

Margoliash, E. (1963) Primary structure and evolution of cytochrome c. Proc. Natl. Acad. Sci. USA 50:672-679.

Morgan, G. (1998) Emile Zuckerkandl, Linus Pauling, and the Molecular Evolutionary Clock, 1959-1965. J. Hist. Biol. 31:155-178. [PDF]

Mulligan, P.K. (2008) Proteins, evolution of in AccessScience, ©McGraw-Hill Companies.

Rogers, J. and Gibbs, R.A. (2014) Comparative primate genomics: emerging patterns of genome content and dynamics. Nature Reviews Genetics 15:347-359. [doi: 10.1038/nrg3707]

The Chimpanzee Sequencing and Analysis Consortium (2005) Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437:69-87. [doi: 10.1038/nature0407]

1 comment:

  1. ... except for 5% or so of the genome which is subject to purifying selection, which greatly reduces the rate of substitution, and sometimes advantageous mutations, which can increase the rate of substitution.
