Sunday, October 04, 2015

Genetic variation in human populations

The Human Genome Project produced a high quality reference genome that serves as a standard to measure genetic variation. Every new human genome that's sequenced can be compared with the reference genome to detect differences due to mutation. It's possible to build large databases of genetic variation by sequencing genomes from different populations. Genetic variation can be used to infer evolutionary history and to test theories of population genetics. Detailed maps of genetic variation can also be used to infer selection (genetic sweeps) and distinguish it from random genetic drift.

In addition to this basic science, the analysis of multiple human genomes can be used to map genetic disease loci through association of various haplotypes with disease. The technique is called a genome-wide association study (GWAS). The same technology can be used to map other phenotypes to identify the genes responsible.

The 1000 Genomes Project Consortium has just published their latest efforts in a recent issue of Nature (Oct. 1, 2015) (The 1000 Genomes Project Consortium, 2015; Sudmant et al., 2015). They looked at the genomes of 2,504 individuals from 26 different populations in Africa, East Asia, South Asia, Europe, and the Americas.


The idea is to identify variants that are segregating in humans. Single nucleotide polymorphisms (SNPs) are difficult to identify because the error rate of sequencing is significant. When comparing a new genome sequence to the reference genome you don't know whether a single base change is due to sequencing error or a genuine variant unless you have a high quality sequence. Most of the 2,504 genome sequences are not of sufficiently high quality to be certain that the false positive rate is low, but by sequencing multiple genomes it becomes feasible to identify variants that are shared by more than one individual within a population.
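The reasoning about shared variants can be illustrated with a rough sketch. It rests on assumptions that do not come from the papers: a hypothetical per-site error rate of 1 in 1,000, errors that are independent between genomes, and an error that is equally likely to produce any of the three wrong bases. Real sequencing errors are not fully independent, so this only shows the direction of the effect.

```python
# Illustrative only: why a variant seen in two genomes is more believable
# than a variant seen in one. All numbers here are assumptions.

ASSUMED_ERROR_RATE = 1e-3  # hypothetical per-site error rate, not from the papers


def same_error_in_both(per_site_error_rate):
    """Chance that two independently sequenced genomes show the SAME
    erroneous base at a given site.

    Both reads must be wrong (p * p), and the second wrong base must
    match the first (1 of 3 possible wrong bases, assumed equally likely).
    """
    return per_site_error_rate ** 2 / 3


single = ASSUMED_ERROR_RATE
shared = same_error_in_both(ASSUMED_ERROR_RATE)
print(f"error in one genome:        {single:.1e}")
print(f"same error in two genomes:  {shared:.1e}")
```

Under these assumptions a false call shared by two genomes is thousands of times less likely than a false call in one genome, which is why low-quality sequences can still yield reliable shared variants.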

Recall that every human genome has about 100 new mutations so that even brothers and sisters will differ at 200 sites. The 1000 Genomes Consortium looks at the frequency of alleles in a population to determine whether the genetic variation is significant. They use a preliminary cutoff of 0.5%, which means that a variant (mutation) has to be present in 5 out of 1000 genomes in order to count as a variant that's segregating within the population. They estimate that 95% of SNPs meeting this threshold are true variants. For small insertions and deletions the accuracy is about 80%.
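The two pieces of arithmetic in this paragraph can be written out as a minimal sketch. The function names are hypothetical, and the cutoff is applied to a simple count of genomes carrying the variant, as the text describes it, rather than to whatever statistic the Consortium actually computes.

```python
from fractions import Fraction

# Preliminary frequency cutoff from the text: 0.5%, i.e. 5 in 1000 genomes.
CUTOFF = Fraction(5, 1000)


def passes_cutoff(carrier_count, genomes_sampled, cutoff=CUTOFF):
    """True if the fraction of sampled genomes carrying the variant
    meets the cutoff. Fractions avoid floating-point edge cases."""
    return Fraction(carrier_count, genomes_sampled) >= cutoff


def sibling_differences(new_mutations_per_genome=100):
    """Each sibling carries its own set of new mutations, so two siblings
    differ at roughly twice that number of sites from new mutations alone."""
    return 2 * new_mutations_per_genome


print(passes_cutoff(5, 1000))   # exactly at the cutoff
print(passes_cutoff(4, 1000))   # just below the cutoff
print(sibling_differences())
```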

For variants at lower frequency, additional sequencing to a depth of >30X coverage was done and the putative variant was compared against other databases of genetic variation. The predicted accuracy of variants at 0.1% frequency is about 75%.

Given those limitations, the results of the studies are very informative. Looking at single base pair changes and small indels (insertions and deletions), the typical human genome (yours and mine) differs from the standard reference genome at about 4.5 million sites. That's about 0.14% of our genomes. Humans and chimpanzees differ by about 1.4% or ten times more.
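These percentages are easy to check. The sketch below assumes a haploid genome size of about 3.2 billion base pairs, which is a round figure chosen for illustration and not a number taken from the papers.

```python
# Back-of-envelope check of the percentages quoted above.
GENOME_SIZE_BP = 3.2e9           # assumed haploid genome size (round figure)
VARIANT_SITES = 4.5e6            # sites differing from the reference (from the text)
HUMAN_CHIMP_DIVERGENCE = 0.014   # 1.4% (from the text)


def fraction_variant(variant_sites, genome_size):
    """Fraction of the genome that differs from the reference."""
    return variant_sites / genome_size


human_fraction = fraction_variant(VARIANT_SITES, GENOME_SIZE_BP)
ratio = HUMAN_CHIMP_DIVERGENCE / human_fraction

print(f"human vs. reference: {human_fraction:.2%}")
print(f"human-chimp difference is about {ratio:.0f} times larger")
```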

SNPs and small indels account for 99.9% of variants. The others are "structural variants" consisting of large deletions, copy number variants, Alu insertions, LINE L1 insertions, other transposon insertions, mitochondrial DNA insertions (NUMTs), and inversions. The typical human genome has about 2,300 of these structural variants, of which about 1,000 are large deletions.

Most of these variants are in junk DNA regions but the typical human genome carries about 10,000-12,000 variants that affect the sequence of a protein. Many of these will be neutral and some of the ones that have detrimental effects will be heterozygous and recessive. The average person has 24-30 variants that are associated with genetic disease. (These are known detrimental alleles. If you get your genome sequenced, you will learn that you carry about 30 harmful alleles that you can pass on to your children.)

The Consortium reports that the typical genome has variants at about 500,000 sites mapping to untranslated regions of mRNA (UTRs), insulators, enhancers, and transcription factor binding sites. I assume they are using the ENCODE data here so we need to take it with a large grain of salt. Most of these sites are not biologically relevant.

As expected, common variants are distributed in populations all over the world. These are the result of mutations that arose several hundred thousand years ago and reached significant frequencies before the present-day populations separated. However, 86% of all variants are restricted to a single continental group. These are the result of mutations that occurred after the present-day populations split.

The African populations contain more genetic variation than the Asian and European populations. Again, this is expected since the European and Asian groups split from within the African group after Africans had been evolving on that continent for thousands of years. The differences are not great: typical Africans differ at about 4.3 million SNPs while typical Europeans and Asians differ at only 3.5 million SNPs.

Only a small number of loci show evidence of selective sweeps, or recent selection (adaptation). This indicates that most of the differences between local ethnic groups are not associated with adaptation. The exceptions are SLC24A5 (skin pigmentation), HERC2 (eye color), LCT (lactose tolerance), and FADS (fat metabolism).


Sudmant, P.H., Rausch, T., Gardner, E.J., Handsaker, R.E., Abyzov, A., Huddleston, J., Zhang, Y., Ye, K., Jun, G., Hsi-Yang Fritz, M., Konkel, M.K., Malhotra, A., Stutz, A.M., Shi, X., Paolo Casale, F., Chen, J., Hormozdiari, F., Dayama, G., Chen, K., Malig, M., Chaisson, M.J.P., Walter, K., Meiers, S., Kashin, S., Garrison, E., Auton, A., Lam, H.Y.K., Jasmine Mu, X., Alkan, C., Antaki, D., Bae, T., Cerveira, E., Chines, P., Chong, Z., Clarke, L., Dal, E., Ding, L., Emery, S., Fan, X., Gujral, M., Kahveci, F., Kidd, J.M., Kong, Y., Lameijer, E.-W., McCarthy, S., Flicek, P., Gibbs, R.A., Marth, G., Mason, C.E., Menelaou, A., Muzny, D.M., Nelson, B.J., Noor, A., Parrish, N.F., Pendleton, M., Quitadamo, A., Raeder, B., Schadt, E.E., Romanovitch, M., Schlattl, A., Sebra, R., Shabalin, A.A., Untergasser, A., Walker, J.A., Wang, M., Yu, F., Zhang, C., Zhang, J., Zheng-Bradley, X., Zhou, W., Zichner, T., Sebat, J., Batzer, M.A., McCarroll, S.A., The 1000 Genomes Project Consortium, Mills, R.E., Gerstein, M.B., Bashir, A., Stegle, O., Devine, S.E., Lee, C., Eichler, E.E., and Korbel, J.O. (2015) An integrated map of structural variation in 2,504 human genomes. Nature, 526(7571), 75-81. [doi: 10.1038/nature15394]

The 1000 Genomes Project Consortium (2015) A global reference for human genetic variation. Nature, 526(7571), 68-74. [doi: 10.1038/nature15393]

50 comments :

  1. Replies
    1. With 10-12000 variations in the protein coding region from standard (if I have this right) being on average 10% of expected value based on the predicted 1 to 2% of the genome being protein coding regions. (6000 av variations in CR divided by 4.5M in NCR) Do we think this is because of error detection and correction in protein coding regions?

    2. Mutations in coding regions have an increased risk of being detrimental and so these are more likely to be selected out.

    3. Hi Aceofspades: Attached is a paper that involves an experiment that exposes the genomes of different grasses (corn, wheat, rice, etc.) to transposable elements. The percentage that land in the protein coding region is consistent with the percentages above. This experiment is without selection.

      Comparison of class 2 transposable elements at superfamily resolution reveals conserved and distinct features in cereal grass genomes. doi: 10.1186/1471-2164-14-71.

    4. I think protein coding regions are more susceptible to receiving TEs because it's easier for TEs to get into euchromatin. I don't think (though could be wrong) that mutation itself is distributed asymmetrically wrt coding/non-coding. Hard to see a mechanism, at least, since exon boundaries are only apparent during transcription/translation, not during replication/recombination where most mutation takes place.

    5. Hi Allan:
      The data in this experiment was quite clear that the introns were receiving a disproportionate number of MITEs (TEs). Also the data showed that expression was increased when MITEs were installed in introns. I am thinking possibly error correction, especially if you think protein coding regions are more susceptible, but I'm not sure.

    6. Not sure what mechanism of error correction can remove TEs from exons but leave them in introns. How is selection not involved here?

    7. I am not sure either. No selection because experiment is on a single generation of grass.

    8. The experiment shows insertions that are already in the genome and presumably have been there for many generations. Consider an insertion even 1000 generations ago. If it's been selected against, it probably didn't survive to be observed today. There's your selection: not in the present, but in the past. And that's what makes transposons more common in introns than in exons, not some unknown error correction mechanism.

    9. Whoa. I looked at the abstract. That isn't even an experiment. It's a comparison of the genomes of four different grass species. The abstract isn't about shared TEs but about common features of TE distribution, and it never mentions exons at all. At any rate, the TEs in those four species would almost all have been exposed to selection (if there were any) for a very long time, so the distribution reflects selection as well as mutation.

    10. https://youtu.be/IOXvZXtc93U
      Here is the experiment described on YouTube. Look starting 45 min in and let me know what you think. Thanks.

    11. Sorry, don't look at random youtube videos. Do you have a real reference to this experiment?

    12. Hi John, the attached paper has data that is closer to what caught my eye. I will contact Susan to see if I can get more detail. The charts I found interesting are just after the abstract.
      Naito K, Zhang F, Tsukiyama T, Saito H, Hancock CN, Richardson AO, Okumoto Y, Tanisaka T, Wessler SR. (2009). Unexpected consequences of a sudden and massive transposon amplification on rice gene expression. Nature. 461(7267):1130-4.

    13. Interesting, and weird. They are however talking about biased insertion, not biased repair.

    14. Even if it were a single-generational study chucking TEs randomly at a genome, selection would be expected to bias the distribution away from exons and towards introns, and to localise within introns.

      The paper discusses distributions within introns, which certainly looks like selection, since the nearer the splice site the more disruptive.

    15. Hi Allan and John
      If you look at the 2009 paper, the data that caught my eye was the insertion bias in the 5' and 3' regions (as Allan mentions) as well as introns and exons. I have written to Susan and asked more about her methods to understand what she means by expected value. Allan: I am trying to figure out why the old insertions (blue, pink) and new insertions show the same tendency.

    16. Hi Allan and John: Here is Susan's response. I am looking through the experiments to see if this is consistent with the data. The conclusion is from multiple papers on her website.

      With regard to the low % of exon insertions for the mPing element - this is due to the sequence preference of mPing - that is, it prefers to insert into AT rich DNA and rice exons are GC rich. As mentioned in the talk - mPing is a "successful" TE - meaning that it can attain high copy numbers because it causes little harm. One of the reasons it causes little harm is that it avoids insertion into rice exons. In contrast, mPing transposes more frequently into exons in transgenic soybean. The reason for this is that soybean exons are less GC rich than rice exons. So... if mPing were a TE in the soybean genome it is unlikely to be as successful as mPing is in rice. Does that make sense?

    17. It does make sense. And it refutes your claim about repair mechanisms. So, thanks for bringing up an interesting phenomenon, but it shows natural selection acting on TE insertion preferences, as a parasite that doesn't kill its host becomes more successful. Was there some other point you wanted to make about this?

    18. Hi John...this quote from the 2009 paper is a loose end that may be solved in later papers...if I cannot find an answer I will contact Susan again.

      <55%)17. However, mPing does not avoid the (G+C)-rich 5' untranslated region and is enriched just upstream of the transcription start site. An understanding of the mechanisms underlying these preferences is beyond the scope of this study as they may be influenced by other factors such as chromatin structure18,19, which, so far, has not been thoroughly characterized in rice.

    19. Yes, it does look like selection against insertion into exons - the less disruptive TEs remain, 'selected' for that quality. It is, though, a mechanism of exon 'detection' I hadn't considered.

    20. John, Allan: this caught my eye from the Jan 2011 Wessler paper... Taxonomic Distribution of Superfamilies. The mapping of the superfamily presence or absence along the eukaryotic tree of life (32, 33) revealed that 15 of the 17 superfamilies exist in at least two of the five eukaryotic supergroups surveyed here (Fig. 3 and Table S2). Because there is little evidence for the horizontal transfer of TEs between eukaryotic supergroups, this distribution strongly supports the view that the origin of most, if not all, superfamilies predates the divergence of eukaryotic supergroups (34)

    21. If you quote, try to give complete citations. And also try to explain why you're quoting. Are you just giving us stuff you find interesting, so far regarding TEs, or are you trying to make some point?

    22. Just sharing interesting stuff. I agree error correction is unlikely at this point because the TEs appear to migrate to the proper location. Here is the link: http://wesslerlab.ucr.edu/pdf/yuan_pnas.pdf

  2. Recall that every human genome has about 100 new mutations so that even brothers and sisters will differ at 200 sites.

    I'm pretty sure you meant *twin* brothers and sisters.

    Replies
    1. I don't think that's right either since the majority of those 100 mutations happen in the male germ cells.

      Naturally identical twins would share these mutations.

    2. Identical twins would have the same mutations but each individual child would have its own set of mutations because the eggs and sperm are each the product of many independent replications. They might share a few mutations that occurred before segregation of the germ line cells and in some cases the two egg cells or the two sperm cells might have a very recent ancestor in the germ line.

      However, as a general rule, each brother and sister will have about 100 new mutations and they would differ at about 200 sites.

    3. For some reason my brain filtered out the bit about brothers and sisters and I assumed Diogenes was talking about identical twins.

    4. But Larry, you forget to count the genetic variation due to meiosis (that is, a random sampling of maternal and paternal alleles), which it seems to me is going to completely swamp those new mutations. The brother and sister will differ at 200 sites from new mutations, but lots more from simple inheritance.

    5. @John Harshman,

      Good point. You are correct. I should have added that they will differ at many other sites by differential inheritance of preexisting variation as you describe.

  3. the typical human genome (yours and mine) differs from the standard reference genome at about 4.5 million sites. That's about 0.14% of our genomes. Humans and chimpanzees differ by about 1.4% or ten times more.

    So the differences between humans and chimps are 10x larger than the differences between the average person and a reference genome. If Young Earth creationism were true, and all variations within humans evolved since Adam and Eve 6,000 years ago, then it would take at most 10x longer for chimps and humans to diverge from a common ancestor. So if YEC were true, it would take just 60,000 years for human and chimp to diverge from a common ancestor.

    Piece of cake.

    Replies
    1. Ah but even though we humans differ among ourselves by 10% of the human-chimp difference, we are all still humans, aren't we? This PROVES that genes don't matter, as comrade Wells has been saying all along, therefore GOD!

    2. Diogenes, you forgot that the genomes of one pair of chimps changed very rapidly about 4500 years ago as they walked through Syria, Israel, Egypt, Sudan, etc to get to the Congo.

    3. Are they talking about the same part of the genome? That is, the 1.4% refers to 'protein coding' regions, and the 0.14% refers to the whole thing? I read somewhere that the ape genome is 10% larger than the human, and most comparisons just look at the subset of DNA that is 'protein coding'. I could be wrong, just wondering.

  4. As a yEC I can address one point.
    If humans change because of different mechanism(s) then the genetic fingerprints would show this also.
    So its not the only option that african genes are most different because of time.
    it would also be that way is , upon migration to africa from somewhere else, they changed the most. They had the most attributes change relative to a original common human single tribe. simply that. The rest of mankind changed but less so due to the areas influence.
    all that is seen is a genetic score. The reason for it is not seen in the score.
    There are creationist options to explain these things.

    Replies
    1. byers said: "There are creationist options to explain these things."

      Every creationist of every flavor throughout the history of humans has conjured up a version of impossible fairy tale "creationist options" or believed in a version that someone else conjured up. Your version of "creationist options" is just one of billions, and there's no evidence that supports yours or any other.

    2. "it would also be that way is , upon migration to africa from somewhere else, they changed the most. "

      So how come they didn't become marsupials?

    3. So how come they didn't become marsupials?

      Oh, silly you!

      They didn't become marsupials because they went in the wrong direction.

    4. The Whole Truth
      I wish it was billions but not my crowd.
      The only evidence or data anyone works with is a genetic score. A summary of genes place right now. Then its speculated/investigated/concluded this is how and why there are genetic differences in different degrees of comparison.
      Thats all.
      So some say Man came from Africa and time is the origin for more gene differences in Africa and in comparison to the rest of mankind.
      yet the obvious thing is how different Africans look. All men look different YET there is, I say, more of attributes in looks amongst Africans compared to mankind.
      then with the idea of a common single human group and then having the Africans actually migrate into Africa and be influenced by the area.
      This influence by mechanism(s), producing more genetic change quickly and so more gene difference.
      another point would also be that it was not a single group that moved to africa but many groups of people with already different languages. This also would make more gene diversity.
      The rest of us also were influenced into genetic change but it was less as the areas we migrated to didn't have such needs for adaptation.
      i think this makes more sense, debunks this gene uniformity concept, and is including genesis boundaries.
      Anyways its all a genetic result we look at. Then figuring it out.
      Other option(s).

  5. Speaking of mutations:

    Single neuron may carry over 1,000 mutations

    http://www.sciencedaily.com/releases/2015/10/151001153931.htm

  6. I've been trying to read your blog and keep my mouth shut but it's hard. Professor Moran what is the difference in the Discovery Institute and the Sandwalk blog? Neither of you do any science you can claim as your own. You both take the work of others and twist it to fit your take on the discussion. You have about 50 consistent followers, as do they, all the while the real scientific community doesn't care. It's a miracle from God himself that glass houses could withstand this punishment. You all are smart men, much smarter than me. Why are you wasting your gifts? This is crazy, you talk about God's people more than door to door Mormons. So weird.

    Replies
    1. What the hell in this recent of Larry's posts motivated this idiotic response?

      "I've been trying to read your blog and keep my mouth shut but it's hard. Professor Moran what is the difference in the Discovery Institute and the Sandwalk blog?"

      Most of the people here are employed in real laboratories or scientific institutions that do their own research. Most of the people at the discovery institute are lawyers, politicians and theologians.

      "Neither of you do any science you can claim as your own."

      I beg to differ. Many people here do in fact do science and research but it is of course often collaborations between several people, so can't really be claimed as belonging to any one single person.

      "You both take the work of others and twist it to fit your take on the discussion."

      Could you give an example from this most recent post of Larry's?

      "You have about 50 consistent followers, as do they, all the while the real scientific community doesn't care."

      Many of the recurring commenters here are PART OF the "real scientific community".

      "It's a miracle from God himself that glass houses could withstand this punishment."

      Praise The Lord!

      "You all are smart men, much smarter than me. Why are you wasting your gifts?"

      Are you sure you aren't so un-smart you are not even in a position to judge whether all these much smarter people are wasting their gifts?

      "This is crazy, you talk about God's people more than door to door Mormons..."

      ... and constantly demonstrate where they are wrong, and this bothers you so you come here to complain about it.

    2. Professor Moran what is the difference in the Discovery Institute and the Sandwalk blog?

      The Discovery Institute is in Seattle and it has lots of employees and millions of dollars in funding from rich creationist businessmen.

      I live in Toronto and I work on Sandwalk all by myself. Believe it or not, I don't have millions of dollars to spend on writing about science.

      Given this huge discrepancy in resources, you should expect that the Discovery Institute will publish much more accurate and up-to-date information about evolution.

    3. Beau,

      One of the huge differences is that the IDiots have only one thing in their mind: God-did-it! God-did-it! They're delusional buffoons (and/or make a living out of lying about science).

      Then, Larry is a professor who makes a living out of understanding science. Larry writes a textbook on biochemistry for actual students in science. So Larry carries a huge responsibility to present the science as it is, while the IDiots' responsibility is to present twisted representations of science (and other kinds of religious propaganda) that have to please their religious beliefs and those of their followers and donors.

  7. Robert said The rest of mankind changed but less so due to the areas influence.
    all that is seen is a genetic score. The reason for it is not seen in the score.
    There are creationist options to explain these things.


    Let's see you explain those things. Although that will only be your personal "lines of reasoning" - as always far removed from reality.

  8. Did they find any parts of the DNA that fixated differently in one group? That is, I have seen ancestry deduced mainly by non-degenerate correlations, not just because it's convenient, but because there is no precedent for a subset of DNA that would prove you are a pygmy, or Khoisan, or Australian aborigine (all mere examples), with 100% certainty? I would think there might be some subset (though, they aren't big among 23andMe users).

  9. Hi Larry, I see you highlight the error rate of the sequencing methods and how the error rate can affect the validity of the results. I have a couple of questions; first, where did you find that “They estimate that 95% of SNPs meeting this threshold are true variants. For small insertions and deletions the accuracy is about 80%”? I have been looking through the paper but can’t find those numbers …

    Since the sequencing was done at an average 7.4x depth (meaning they read 7.4 times every single nucleotide on average for every single person) and 65.7× depth for the exome (meaning they read on average 65.7 times the same nucleotide that encodes for a protein of every single person), how important do you think the error rates are? Do you think their algorithms call for variants even when the data cannot confidently call it a variant?

    Also, do you agree with the definition that a SNP is just a single nucleotide variant (SNV) with a frequency higher than 1%? Other people use the 0.5% threshold for calling a SNV a SNP, and others don’t even take into account the frequency of that variant to call it SNP.
    Thanks for your interesting post!
    Fernando.

  10. Do you think one of the take home messages of the articles is that there should not be only one reference genome?
    I think many of the figures (especially this one http://www.nature.com/nature/journal/v526/n7571/fig_tab/nature15394_SF5.html ) suggest that more info about your genome can be inferred if we used a "regional" or "local" reference genome ...

  11. In the line of error rates, how many of the called "single cell mutations" do you think were due to error in this paper?
    http://www.sciencemagazinedigital.org/sciencemagazine/2_october_2015?sub_id=A6HNdXNVpcYV&folio=94&pg=110#pg110

  12. With 100 mutations per genome, the copying fidelity is 1 - 1/30,000,000 = 1 - 0.00000003 = 0.99999997

    If copying fidelity had to evolve gradually, what would happen with organisms having a CF of say 0.90 or 0.99?

    They would die because of genomic meltdown within <10 generations, resp. <25 generations.

    Genomes had to start with nearly perfect CF.

    There is no way out. Genomes were frontloaded.

    Frontloading is the new theory.

  13. Well, I think we are talking about different things.
    To start with, it is not true that copying fidelity has to be "nearly perfect" for survival. In fact many viruses such as HIV or HCV base their survival on a high mutation rate that makes the immune system fail because of new viral quasispecies.
    In any case, I was talking about the error rate of the sequencing method (which was addressed in this post as a way of saying the results should be taken with care). At least that was my understanding of this post, and the reason why I asked.
    In the Science paper I linked the error rate has to be way higher because they had to amplify single cell genomes.
