Saturday, March 23, 2024

More genomes, more variation

The "All of Us Research Program" is an American effort to sequence one million genomes. The stated goal is to study human genetic variants and link them to genetic diseases. The study is complementary to similar studies in Great Britain, Iceland, and Japan, but the American team hopes to include more diversity in their study by recruiting people from different ethnic backgrounds.

All of Us published the results from almost 250,000 genome sequences in a recent issue of Nature (All of Us Research Program Investigators, 2024). They found one billion variants, of which 275 million had not been seen before.

Recall that the UK study (UK Biobank) emphasized the importance of variation in determining whether a given region of DNA was functional or not. They noted that regions that were constrained (i.e. fewer variants) were likely under purifying selection whereas regions that accumulated variants were likely junk [Identifying functional DNA (and junk) by purifying selection]. Their results indicated that only about 10% of the genome was constrained and that's consistent with the view that 90% of our genome is junk. The American study did not address this issue so we don't know how it relates to the junk DNA controversy.

Note that if 90% of our genome is junk then that represents 2.8 billion base pairs and the potential for more than 8 billion variants in the human population.[1] Some of these will be quite frequent in different groups just by chance but most of them will be quite rare. We'll have to wait and see how this all pans out when more genomes are sequenced. The idea of increasing the detection of unusual variants by sequencing more diverse populations is a good one but the real key is just more genome sequences.
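Here's a rough sketch of the arithmetic behind those numbers. It rests on two assumptions of mine that aren't spelled out above: a haploid genome size of about 3.1 billion base pairs, and counting only single-nucleotide substitutions, where each position can change to any of the three other bases. Insertions and deletions are ignored, and the function name is just for illustration.

```python
# Back-of-the-envelope estimate of possible variants in junk DNA.
# ASSUMPTIONS (not from the All of Us paper):
#   - haploid genome size of roughly 3.1 billion bp
#   - only single-nucleotide substitutions are counted (no indels)
GENOME_SIZE_BP = 3_100_000_000
JUNK_PERCENT = 90          # the "90% junk" figure used in the post
ALT_BASES_PER_SITE = 3     # A, C, G, T minus the reference base


def junk_variant_potential(genome_size_bp=GENOME_SIZE_BP,
                           junk_percent=JUNK_PERCENT,
                           alt_bases=ALT_BASES_PER_SITE):
    """Return (junk base pairs, possible single-nucleotide variants)."""
    # Integer arithmetic avoids floating-point rounding on large counts.
    junk_bp = genome_size_bp * junk_percent // 100
    return junk_bp, junk_bp * alt_bases


junk_bp, max_snvs = junk_variant_potential()
print(f"{junk_bp / 1e9:.2f} billion junk bp, "
      f"up to {max_snvs / 1e9:.2f} billion single-nucleotide variants")
```

Under these assumptions the junk fraction comes to about 2.8 billion base pairs, and three possible substitutions per site gives a little over 8 billion potential variants.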

One of the things you can do with this data is to cluster the variants according to the self-identified ethnic group of the participants and All of Us didn't hesitate to do this. They even identified the clusters as races, proving once again that there are clear genetic differences between these groups, just as you would expect. Given the sensitive nature of this fact, you would also expect a lot of criticism on the internet and that's what happened.


1. I'm defining a "variant" as a difference from the reference genome sequence. I'm aware of the terminology issue but it's not important here. There will also be a large number of variants in the functional regions.

All of Us Research Program Investigators (2024) Genomic data in the All of Us Research Program. Nature 627:340. [doi: 10.1038/s41586-023-06957-x].

10 comments:

  1. They even identified the clusters as races, proving once again that there are clear genetic differences between these groups, just as you would expect.

    I bet the clear differences arise because the clines separating these regions are poorly sampled. African Americans, for example, are almost entirely descended from coastal West Africans, which is seriously undersampling African genomes. Bet there's not much from Central Asia either. And so on. I suppose you can call those isolated samples "races" if you like, but I'm not seeing it.

  2. You have to wonder how people are able to self-identify remarkably accurately without being able to read their own genomes for something that doesn't exist. I mean what is it we are saying that doesn't exist?

    If I say I believe I am "white" or "european", am I not simply saying that I am, by appearance, more similar to people who have been living in a certain geographical area for a substantial amount of time, and that this is true genetically too? Isn't this just what people mean when they accurately (in a way also reflected by genetic similarity or relatedness) estimate their own ethnicity or "race"?

  3. "They even identified the clusters as races, proving once again that there are clear genetic differences between these groups, just as you would expect."

    Do humans have genetic variation that is geographically correlated? Yes. Are there clear genetic differences between 'groups' such as socially defined 'races'? No.

    The All of Us study has been roundly criticized (e.g., https://liorpachter.wordpress.com/2024/02/26/all-of-us-failed/) for their portrayal of human race. The key issue is the usage of UMAP, which is a dimensionality-reduction algorithm that exaggerates unshared variance to aid visualization. In doing so, it renders the distance between groups meaningless (see this explainer: https://pair-code.github.io/understanding-umap/). Furthermore, as is the case in all STRUCTURE plots, when you cluster by pre-designated k-values instead of inferring them (or even when you do infer them, if you assume there is no clinal pattern of relatedness), you will, by necessity, find discrete patterns. You forced the data to be so. It should be noted that the creator of STRUCTURE, Jonathan Pritchard, has also spoken out against the All of Us portrayal of human variation (https://twitter.com/jkpritch/status/1759769445759893832).

    If you plot this *exact* same data on a PCA instead of using UMAP, the pattern is a characteristic horseshoe without any discrete breaks at all. At no point could you say "ah look, clear genetic breaks!". There simply are none, and the UMAP method promotes a false view of human genetic variation by arbitrarily exaggerating differences.

    Further reading:
    https://www.nature.com/articles/d41586-024-00568-w

  4. @Zach Hancock: I understand the criticism and the fact that UMAP was not the best way to illustrate the genetic differences between the various human populations.

    But let's not let that confuse the real issue. You say, "Are there clear genetic differences between 'groups' such as socially defined 'races'? No."

    That's only true if you are quibbling about "clear genetic differences" or "socially defined races." If you compare the allele frequencies in a group of people from Japan and a group of people from Nigeria, you will have no trouble identifying a cluster of genetic markers that can be used to distinguish those two groups with incredible accuracy.

  5. If you compare the allele frequencies in a group of people from Japan and a group of people from Nigeria, you will have no trouble identifying a cluster of genetic markers that can be used to distinguish those two groups with incredible accuracy.

    Sure, if you choose two geographically distant populations. Nobody disagrees that there's geographically structured variation. The question is whether there are discrete "races". Try finding the races in a continuous transect rather than just two endpoints.

  6. @John Harshman: What is your stance on the existence of genetically distinct populations within the species Homo sapiens? Are you saying that they don't exist because you can always find some examples of intermediates that share the genetic characteristics of more than one population?

    Or are you placing the emphasis on "discrete" in order to deny that there's any one allele that is strictly confined to one of the traditional races and never found in the other?

  7. @Larry Moran: Of course there are geographically correlated allele frequencies, as I stated. This is a simple fact of limited dispersal and exists in virtually all organisms. But that does not constitute a "race" in a biological sense. Taxonomically, a "race" is a discrete unit in which all members are more related within than between. Here, "discrete" means there are genetic breaks between them that are identifiable via PCA or some similar metric.

    Humans do not have any such breaks. In your example, if you sampled humans continuously from Nigeria to Japan at no point would you be able to draw a line and say "now we've gone from 1 race to a different one". Human ancestry is continuous with respect to geography, as has been shown many times (e.g., Ramachandran et al. 2005). Again, identifying geographic variants is not the same as correlating them to social races, because every locality on earth has unique variation (as does every family!). So unless we want to define each village as constituting a "race", we need to do better than geographic correlation.

    There's no way to objectively define or disentangle "race" from "intermediate" in a true continuum. Anywhere you start iterating ancestry with distance, you'd find the same linear pattern. You are either forced to state there are no biological races (as most geneticists long ago conceded) or designate so many races that the concept, at least taxonomically, loses all meaning and ceases to reflect anything we mean socially.

  8. Are you saying that they don't exist because you can always find some examples of intermediates that share the genetic characteristics of more than one population?

    I can't do better than Zach's answer for that one.

    Or are you placing the emphasis on "discrete" in order to deny that there's any one allele that is strictly confined to one of the traditional races and never found in the other?

    As far as I know there are no private alleles in any human population. Certainly the traditional races are genetically meaningless.

  9. I mostly agree with @John Harshman and @Zach Hancock here, but there is another thing that must be pointed out.

    I see that the ‘race’ category is denied on the grounds that there are no distinct genetic groups (the population structure is clinal). However, this rather misses the mark. It assumes that ‘race’ is objectively defined in terms of genetics (or anything), such that it can be refuted on its own terms. It’s not. The very concept itself was invented to serve a particular socio-political purpose, and I really don’t need to tell you what that is. Anyone should know enough history to understand this.

    What is ‘race’ really? It means whatever it needs to mean to serve that purpose. It’s a moving goal post by design. That’s also why we got absurdities like the “one-drop-rule” or why someone like Barack Obama can call himself ‘black’, but can never call himself ‘white’ without raising an eyebrow. This makes no sense when you think about it in terms of genetics, but pointing that out doesn’t matter. It’s not about genetics. Never has been. Instead, genetics was appropriated into ‘race’ only when it (and the idea that it constitutes the ‘biological essence’) became common in the public sphere. This was meant to give ‘race’ a veneer of objectivity. This was extremely successful. To this day, it keeps on confusing a lot of people (even those who are well meaning) such that many get baffled whenever they hear someone point out that ‘race’ is a social construct instead of a biological fact.

    And when pushed on this, many (again, even those who are well meaning) will probably say something like... “Well, that’s all good and all, but that’s not how *I* use the word ‘race’. When *I* say ‘race’, I mean it in a way that is consistent with what we know about genetics” ...and they say that without even realising that they are making my point for me. Race means whatever it needs to mean. No more, no less.



    "'When I use a word,' Humpty Dumpty said in rather a scornful tone, 'it means just what I choose it to mean — neither more nor less.'"
    - Lewis Carroll

    “Race is the child of racism, not the father.”
    - Ta-Nehisi Coates

  10. Then again, outside of Homo sapiens, "race" does have a biological meaning, roughly the same as "subspecies". Most often it requires some degree of geographic isolation or at least a narrow hybrid zone. If we try to apply this meaning to humans, we are unable to find any races or subspecies. But it works OK in a lot of bird species.
