Monday, January 23, 2012

What's the Difference Between a Human and Chimpanzee?

The number of differences between the human and chimpanzee genomes is consistent with Neutral Theory and fixation by random genetic drift.

How Many Differences?

You can estimate the total number of single nucleotide differences by measuring the rate of hybridization of human and chimpanzee DNA, a technique developed by Dave Kohne and Roy Britten over forty years ago. This technique was applied to human and chimp DNA and the results indicated that the two genomes differed by about 1.5% (reviewed in Britten, 2002). That corresponds to 45 million bp in a genome of 3 billion bp.

This value of 1.5%, rounded up to 2%, gave rise to the widely quoted statement that humans and chimps are 98% identical. Britten (2002) challenged that number by pointing out that the human and chimp genomes differ by a large number of insertions and deletions (indels) that could not have been detected in hybridization studies. He claimed that an additional 3.4% of the genome differed due to indels. That means the real difference between humans and chimps is closer to 5% and we are only 95% identical!
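The arithmetic behind these percentages is easy to check. The sketch below assumes the round genome size of 3 billion bp used above; the variable names are purely illustrative.

```python
GENOME_SIZE_BP = 3_000_000_000  # round genome size used in the post

snp_fraction = 0.015    # ~1.5% single nucleotide divergence (hybridization estimate)
indel_fraction = 0.034  # ~3.4% additional divergence attributed to indels

# Convert the single nucleotide divergence into a count of base pairs
snp_bp = snp_fraction * GENOME_SIZE_BP

# Add the indel contribution to get the total divergence
total_fraction = snp_fraction + indel_fraction

print(f"Single nucleotide differences: {snp_bp / 1e6:.0f} million bp")
print(f"Total divergence: {total_fraction:.1%}, identity: {1 - total_fraction:.1%}")
```

The sum comes to just under 5%, which is where the "95% identical" figure comes from.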

Much of the difference is due to insertion and deletion of members of gene families. One study shows that the human genome has 689 genes not present in the chimp genome and chimps have 729 genes not present in humans [Mammalian Gene Families: Humans and Chimps Differ by 6%]. That's a total of 1,418 complete genes that are only found in one of the species.

At first glance this looks like 689 completely new genes have evolved in the human lineage since it diverged from our common ancestor with chimpanzees, but looks can be deceiving. These genes are members of gene families and all that's happened is that 689 orthologous genes have either arisen by duplication in the human lineage or been lost by deletion in the chimp lineage, or 689 new paralogous genes have been "born" by gene duplication (or some combination).

Much better data is available today than in 2002 when Britten wrote his paper. We now know by direct comparison that there are at least 30 million single nucleotide differences between human and chimp genomes. There are about 90 million base pairs of difference due to insertions and deletions (Marques-Bonet et al., 2009). The indels (insertions and deletions) may represent only 90,000 mutational events if the average length of an insertion/deletion is 1 kb (1000 bp). In fact, more than 75% of indels are less than 5 bp (Britten, 2002) so the actual number of mutational events is in the millions. Many of these are undoubtedly due to sequence errors. The latest studies indicate that humans and chimps differ by only 26,500 large indels (>80 bp) (Polavarapu et al., 2011). To a first approximation, the single nucleotide differences are a good measure of the total number of mutational events that have occurred in the two lineages. (underlined portion added on Jan. 25, 2012 - LAM)
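To see how strongly the number of indel events depends on the assumed average indel length, here is a minimal sketch. Only the 90 million bp total and the 1 kb case come from the text above; the other mean lengths are hypothetical values chosen for illustration, and the function name is made up for this example.

```python
INDEL_BP = 90_000_000  # total bp of indel differences quoted above

def implied_events(total_bp, mean_indel_length_bp):
    """Number of indel events implied by a bp total and an assumed mean length."""
    return total_bp / mean_indel_length_bp

# 1000 bp is the case discussed in the post; the shorter means are hypothetical
for mean_len in (1000, 100, 10):
    events = implied_events(INDEL_BP, mean_len)
    print(f"mean length {mean_len:>4} bp -> {events:,.0f} events")
```

Since most indels are very short, the real count lands at the high end of this range, not near 90,000.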

Polymorphisms

It's worth noting that many of the differences between the human and chimp genomes are polymorphic within their respective populations. In other words, the variant alleles have not become fixed in the population. This affects the calculations of mutation rate since that calculation assumes that an allele has become fixed in the population by random genetic drift.

The polymorphisms include SNPs, of course, and that's the basis of many studies that look for specific haplotypes associated with disease. At least one of the variants at a given polymorphic locus in humans will be different from the nucleotide in the chimp reference genome. Deletions in the human and chimp genomes can also be polymorphic. Copy number variants (CNVs) in humans have been characterized in a number of studies (Campbell et al. 2011). In terms of total nucleotides, there is more variation in copy number than in single nucleotide polymorphisms (Alkan et al., 2011).

Are the Differences Neutral?

We would like to know if the differences between the human and chimp genomes are neutral alleles or if natural selection has played an important role in fixing these differences. Nobody doubts that many of the changes we see are adaptive in one or other of the lineages but can we recognize those important adaptive changes in a sea of possible neutral changes?

Several lines of evidence suggest that most of the changes are non-adaptive. First, since most (~90%) of the genome is junk, and most of the differences are located in junk DNA, it follows that most of the new alleles had no effect on function.

Second, if we look at the pattern of changes this is what we see for one of the human chromosomes.


The percent identity between humans and chimps fluctuates between 98% and 99% identity and the differences are pretty evenly scattered throughout chromosome 7. Remember, most of that DNA is junk.

Calculating the rate of evolution in terms of nucleotide substitutions seems to give a value so high that many of the mutations must be neutral ones.

Motoo Kimura (1968)
The third line of evidence has to do with the mutation rate and fixation in the two lineages. The mutation rate in humans is about 130 mutations per generation based on our knowledge of the biochemistry of DNA replication [Mutation Rates], a value that's consistent with recent direct measurements [Human Y Chromosome Mutation Rates] [Direct Measurement of Human Mutation Rate]. Michael Lynch (2010) bases his estimate of human mutation rates on a number of other studies. He comes up with a value of about 80 new mutations per generation.

In an evolving population the rate of fixation of neutral alleles is equal to the mutation rate [Random Genetic Drift and Population Size]. How many mutations would we expect in the human lineage since it diverged from a common ancestor with chimpanzees if all of the fixed alleles were neutral? The two species diverged about 5 million years ago. The average generation time in the human lineage is about ten years, so that means 500,000 generations. If the rate of mutation is about 100 new mutations per generation, then we would expect to see about 50 million new mutations in the human lineage. The actual number is about 22.5 million (half of 45 million). We're certainly in the right ballpark.
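The ballpark comparison can be reproduced in a few lines. This is only a sketch of the back-of-the-envelope calculation, using the divergence time, generation time, and mutation rate estimates quoted in this post; the function and variable names are illustrative.

```python
DIVERGENCE_YEARS = 5_000_000  # time since the human-chimp split (figure from the post)
GENERATION_YEARS = 10         # average generation time assumed in the post
OBSERVED_TOTAL = 45_000_000   # single nucleotide differences between the two genomes

generations = DIVERGENCE_YEARS // GENERATION_YEARS
observed_human_lineage = OBSERVED_TOTAL / 2  # half of the differences per lineage

def expected_neutral_substitutions(mutations_per_generation, n_generations):
    """Under neutrality the fixation rate equals the mutation rate, so the
    expected fixed differences = mutations per generation x generations."""
    return mutations_per_generation * n_generations

# 80 (Lynch), 100 (round number used above), 130 (biochemical estimate)
for rate in (80, 100, 130):
    expected = expected_neutral_substitutions(rate, generations)
    ratio = expected / observed_human_lineage
    print(f"{rate} mutations/generation -> {expected / 1e6:.0f} million expected, "
          f"{ratio:.1f}x the observed {observed_human_lineage / 1e6:.1f} million")
```

All three mutation rate estimates give an expectation within a small factor of the observed number, which is all the "right ballpark" argument requires.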

The actual mutation rate may be lower than we calculate.

We're certainly safe in concluding that the number of differences between humans and chimps is consistent with Neutral Theory and we should accept this as the null hypothesis.


Alkan, C., Coe, B.P., and Eichler, E.E. (2011) Genome structural variation discovery and genotyping. Nat. Rev. Genet. 12:363-376. [PubMed]

Britten, R.J. (2002) Divergence between samples of chimpanzee and human DNA sequences is 5%, counting indels. Proc. Natl. Acad. Sci. (USA) 99:13633-13636.

Campbell, C.D., Sampas, N., Tsalenko, A., Sudmant, P.H., Kidd, J.M., Malig, M., Vu, T.H., Vives, L., Tsang, P., Bruhn, L., and Eichler, E.E. (2011) Population-genetic properties of differentiated human copy-number polymorphisms. Am J Hum Genet. 88:317-32. [PubMed]

Marques-Bonet, T., Ryder, O.A., and Eichler, E.E. (2009) Sequencing primate genomes: what have we learned? Annu. Rev. Genomics Hum. Genet. 10:355-386. [PubMed]

Lynch, M. (2010) Rate, molecular spectrum, and consequences of human mutation. Proc. Natl. Acad. Sci. (USA) 107:961-968. [PubMed]

Polavarapu, N., Arora, G., Mittal, V.K., McDonald, J.F. (2011) Characterization and potential functional significance of human-chimpanzee large INDEL variation. Mob. DNA 2:13. [PubMed] [doi:10.1186/1759-8753-2-13]

33 comments:

  1. You forgot the reference for Polavarapu et al. I'd be interested, since the original chimp genome paper estimated (presumably by counting them in the alignment) 5 million indels. That's such a huge discrepancy that it needs a serious explanation. Now from my experience, 26500 would be a very surprisingly low number, and 5 million would be in the expected ballpark. Unless birds are radically different in this way from mammals, which would also be very surprising.

    1. John,

      Sorry 'bout that.

      I added the reference and modified the text because the number reported in the paper (26,509) only refers to indels larger than 80 bp.

      It's true that there are millions of indel differences between humans and chimps. What we don't know is how many of those are real and how many are due to sequence errors. The chimp sequence is not very good.

      Are the bird genome sequences so good that you can rely on small indels of 1-5 bp?

    2. I don't actually see how you can get such sequence errors in a genome that has -- what? -- 10x coverage or better. Now my usual experience is with just 1x coverage of many species, and there's a suspiciously high level of consistency in small indel placement among close relatives. So yeah, I think most of them should be just fine. (Of course I'm talking about Sanger sequencing, and pyro/454 sequencing has a much higher error rate for both base and indel calls. But the genomes in question were done the old way.) Is this really an issue? I've got to suppose that a very small proportion of reported indels are sequencing (or, more likely, amplification) errors.

    3. Typical error rates for "finished" genome sequences are on the order of 10^-4 or one error in 10,000 base pairs. That's about 300,000 sequence errors in the genome. The original chimp genome sequence was far from being "finished" when it was first published.

      First draft genomes typically have error rates of more than one in 1000 meaning that something like 3 million errors would be counted as differences in the comparison with the human genome. Many of these would be small deletions.

      Furthermore, the human genome was not a "finished" genome when the comparison was first done so there would have been sequence errors there as well.

    4. Most of those errors would be base calls. A small proportion would be indels. We get a tiny fraction of 5 million. Not worth thinking about in a first approximation.

      In fact I would expect polymorphism to be a bigger source of "error". Or, if we're talking about one individual, heterozygosity. How did the genome projects handle heterozygosity anyway?

  2. Clearly you haven't read the latest cutting edge research on this subject, published in the following leading peer reviewed journal:

    http://www.answersingenesis.org/articles/arj/v4/n1/blastin

    1. Well that pretty much settles it, don't you think?

      The creationists say that chimps and humans are only 80% identical because all of the unfinished part of the genome sequences must be different.

      Time to re-write the textbooks! :-)

  3. We're certainly safe in concluding that the number of differences between humans and chimps is consistent with Neutral Theory and we should accept this as the null hypothesis.

    Not so safe to conclude that. It may be a good null hypothesis for the great majority of them, but the ones involved in visible morphological differences, for example, have such a large phenotypic effect that it is hard to believe that those changes are invisible to selection.

    1. I think we can agree that 99.9% of all differences are neutral since they don't involve any change in visible morphology.

      The question is, what percentage of those alleles that DO result in a visible morphology have been fixed by natural selection? I'm guessing that number would be less than 10%. What number do you prefer?

    2. Population genetic theory says that natural selection will be effective if the selection coefficient for that allele substitution exceeds the reciprocal of 4N. As the population size N was in the vicinity of 100,000, that says that selection is effective if s > 0.0000025. So I think the fraction is more like 90%.

      Changes that do not result in any change in visible morphology include some that change nerve function or physiology, so let's not go overboard. And no, just because we declare one pattern to be a null hypothesis does not mean we get to forget about other possibilities.

    3. The probability of fixation is about 2s. If s is 0.0000025 then the probability of fixation is about 0.000005 or 0.0005%. In other words, the allele will be lost 99.999% of the time.

      The time to fixation is (2/s)ln(2N). For an allele with such a low fixation coefficient the time will be about one million generations or approximately 10 million years.

      The selection coefficients are going to have to be much higher if we're going to explain a significant number of morphological changes as adaptations.

      In class, I ask my students to look around them at all of the heritable variation in morphological traits that we see today in a modern multicultural society. How much of that variation is likely to be adaptive or thought to have been adaptive in the recent past (<50,000 years)?

      Not much, if you are being honest. Isn't it likely that much of the morphological variation that became fixed in the past was also non-adaptive?

      That doesn't rule out significant adaptations. I'm just trying to get people to realize that the vast majority of changes at the molecular level are non-adaptive and that may also be true for most morphological changes.

    4. The time to fixation is (2/s)ln(2N). For an allele with such a low fixation coefficient the time will be about one million generations or approximately 10 million years.

      Ah, this is where population genetics really confuses me! The 'simplified' time to fixation of a neutral allele is 4Ne generations, ie 400,000 (only 2 gens out from the 'complicated' time). Yet here we have an allele with a positive selection coefficient taking two and a half times as long?

      Again, the neutral chance of fixation is 1/N, so values of 2s below that become effectively neutral - they can't be less likely than a neutral allele to fix? ie, is the minimum value of s that would behave non-neutrally not >0.000005?

      I know it doesn't make a huge difference. And, in the wild, the computationally necessary assumption of constancy for s values over such long time periods is, I suspect, always overturned by more local effects - fluctuations in s itself.

    5. Of course, in deriving the minimum non-neutral selection coefficient, I may have forgotten to consider diploidy, ie 1/2N not 1/N ...

  4. "These genes are members of gene families and all that's happened is that 689 orthologous genes have either arisen by duplication in the human lineage or been lost by deletion in the chimp lineage."
    Did you mean paralogous? I thought orthologous genes were the same gene, whereas paralogous is where a gene duplicates, and then later on becomes different?

  5. We should expect the overall average number to fit the Neutral Theory, however that's averaged over the genome. Individual loci can have large deviations well into the positive selection range (dn/ds>1).

  6. "the difference between a human and a chimpanzee"

    is not the same as:

    "The number of differences between the human and chimpanzee genomes"

    An ultra-reductionist, DNA-centrist title ;o)

    1. So, are you saying that ... gasp ... there's something else besides DNA?!

  7. Oh, and according to that same chimp genome paper there are 35 million point mutation differences between chimps and humans. Is that changed too?

    1. There are many papers and many estimates. I don't know which one is more likely to be correct, do you?

    2. Haven't read Polavarapu et al. yet (and thanks for posting the reference), but the chimp genome paper was based on an actual count over the entire published alignment. It's hard to see how two alignments (if that's what it is) could be different enough to account for a greater than 3 orders of magnitude difference. At the least, this demands some explanation.

    3. OK, I'm back, having looked at Polavarapu et al. They do not in fact say there are only 26,500 indels. They limit their search to indels between 80 and 12,000 bp. That is, they ignore all indels under 80 bp. Like many other phenomena, indels display a hollow curve distribution. Indels over 80 bp are very rare compared to shorter indels. It's likely, in fact, that 1bp indels are an absolute majority of indel events.

      So there you are, discrepancy explained: the chimp genome paper was counting all indels, while Polavarapu et al. were counting only the largest indels.

  8. It may be a good null hypothesis for the great majority of them, but the ones involved in visible morphological differences, for example, have such a large phenotypic effect that it is hard to believe that those changes are invisible to selection.

    That's how a null hypothesis works: if you can prove, or at the very least strongly demonstrate, that they are due to selection, great. But you start by assuming neutral and work toward demonstrating selection, not starting with selection and going the other way.

    And you always have a null hypothesis, whether or not you state one.

  9. good information ... I have read and will be added to my personal knowledge... thanks

  10. The probability of fixation is about 2s. If s is 0.0000025 then the probability of fixation is about 0.000005 or 0.0005%. In other words, the allele will be lost 99.999% of the time.

    The time to fixation is (2/s)ln(2N). For an allele with such a low fixation coefficient the time will be about one million generations or approximately 10 million years.

    The selection coefficients are going to have to be much higher if we're going to explain a significant number of morphological changes as adaptations.


    There are two issues: significant numbers of morphological changes, or significant numbers of loci differing between humans and chimps. You've got me on the latter, but I think you are in a terribly weak position on the former. Most of the differences will be due to the allele substitutions of larger effect, because these both have higher probability of fixation and have more effect on the character. Thus the changes that account for a morphological difference are 4x as likely to be due to a substitution that has 2x larger effect.

    In class, I ask my students to look around them at all of the heritable variation in morphological traits that we see today in a modern multicultural society. How much of that variation is likely to be adaptive or thought to have been adaptive in the recent past (<50,000 years)?

    Not much, if you are being honest. Isn't it likely that much of the morphological variation that became fixed in the past was also non-adaptive?


    Well, how much of it would have how much effect on fitness? You seem to know (somehow). Considering a trait that is a quantitative character, and considering how much genetic variance it has and how large the selective effect is of a one-standard-deviation change, you can do some calculations. But it looks like you are relying on impressions.

    That doesn't rule out significant adaptations. I'm just trying to get people to realize that the vast majority of changes at the molecular level are non-adaptive and that may also be true for most morphological changes.

    I agree about the molecules, anyway.

    1. Well, how much of it would have how much effect on fitness? You seem to know (somehow).

      I don't know but I want students to think about the possibility that much of it may be non-adaptive. There's a huge misconception out there that all (or almost all) morphological changes have to affect fitness and my goal is to challenge that assumption.

      We usually stick out our tongues and try to roll them. Many students can't roll their tongues so we discuss whether tongue-rolling ability affects fitness. Then we talk about the shape of our lips, our earlobes, English ankles, and male pattern baldness. How many of these heritable morphological traits affect fitness?

      The point is not to dispute the fact that many morphological differences between closely related species are adaptive. The point is to question whether all of it is adaptive and what we should adopt as our null hypothesis when trying to decide.

      In humans, the hair on our heads just keeps growing and growing. That doesn't happen in the other apes. How do we explain this difference? Do we immediately assume that there must be some adaptive significance to having long hair and beards or do we assume that it could be non-adaptive, putting the onus on the adaptationists to prove their case?

      And what about the relative lack of hair on the rest of our bodies? Is that an accident due to neoteny or does it require some elaborate just-so story about sun and savannahs?

  11. We usually stick out our tongues and try to roll them. Many students can't roll their tongues so we discuss whether tongue-rolling ability affects fitness. Then we talk about the shape of our lips, our earlobes, English ankles, and male pattern baldness. How many of these heritable morphological traits affect fitness?

    The point is not to dispute the fact that many morphological differences between closely related species are adaptive. The point is to question whether all of it is adaptive and what we should adopt as our null hypothesis when trying to decide.


    Many of these changes can also be side effects of selection for other traits. My ability to roll my tongue is the result of changes in muscles in the tongue, changes that may have noticeable effects on fitness for other reasons than the issue of rolling.

    The question was not whether all morphological differences are neutral -- surely they are not. Previously we were discussing whether the fraction neutral was more like 90% or more like 10%.

    1. Joe, don't take this the wrong way but when you say ...
      Many of these changes can also be side effects of selection for other traits. My ability to roll my tongue is the result of changes in muscles in the tongue, changes that may have noticeable effects on fitness for other reasons than the issue of rolling.

      Why do you feel it's necessary to come up with an adaptive explanation of some sort? Why not assume that it's accidental until there's evidence to the contrary?

      Is there any evidence at all that people who can roll their tongues are more (or less) fit than people who can't?

  12. Why do you feel it's necessary to come up with an adaptive explanation of some sort? Why not assume that it's accidental until there's evidence to the contrary?

    Great, so the less information we have, the more strongly we can conclude in favor of neutrality ...

    1. I said nothing about "concluding" anything. In the absence of any evidence to the contrary I assume, as a working hypothesis, that the trait is neutral. That's all I assume. It's the null hypothesis. I will happily change my mind as soon as I see some evidence of adaptation.

      Adaptationists, on the other hand, assume that the trait is adaptive and then they "assume" some kind of adaptive just-so story to explain their assumption. They reject neutrality out of hand. I don't know why they do that, do you?

    2. I think what I would do is to see differences in characters, and guess that differences big enough for me to see are big enough to make some relevant fitness difference to the organism. I would suspect non-neutrality but be suspicious of any particular just-so-story, particularly given the real possibility that these changes are side-effects of changes in other characters.

      For molecular changes I would certainly suspect neutrality, but of course that depends on exactly where they are -- for example, if a change was in the active site of an important enzyme my eyebrows would raise.

  13. As I mentioned in the blog post you linked to (regarding Demuth's paper about gene families), the author of that study made some huge mistakes, and they were pointed out by Laurent Duret:

    http://www.plosone.org/annotation/listThread.action?root=8729

    Basically, Demuth's paper should be discarded.

    I have tried to replicate Duret's objection, but unfortunately the format of the gene family identifiers has changed, and there is no mapping between the old format and the new.

  14. Hello,

    This reference might help..

    http://www.nature.com/nature/journal/v429/n6990/abs/nature02564.html

    Cheers,
    Karl
