Saturday, March 23, 2024

More genomes, more variation

The "All of Us Research Program" is an American effort to sequence one million genomes. The stated goal is to study human genetic variants and link them to genetic diseases. The study is complementary to similar studies in Great Britain, Iceland, and Japan, but the American team hopes to include more diversity in their study by recruiting people from different ethnic backgrounds.

All of Us published the results from almost 250,000 genome sequences in a recent issue of Nature (All of Us Research Program Investigators, 2024). They found one billion variants of which 275 million had not been seen before.

Recall that the UK study (UK Biobank) emphasized the importance of variation in determining whether a given region of DNA was functional or not. They noted that regions that were constrained (i.e. fewer variants) were likely under purifying selection whereas regions that accumulated variants were likely junk [Identifying functional DNA (and junk) by purifying selection]. Their results indicated that only about 10% of the genome was constrained and that's consistent with the view that 90% of our genome is junk. The American study did not address this issue so we don't know how it relates to the junk DNA controversy.

Note that if 90% of our genome is junk then that represents 2.8 billion base pairs and the potential for more than 8 billion variants in the human population.[1] Some of these will be quite frequent in different groups just by chance but most of them will be quite rare. We'll have to wait and see how this all pans out when more genomes are sequenced. The idea of increasing the detection of unusual variants by sequencing more diverse populations is a good one but the real key is just more genome sequences.
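The arithmetic behind that estimate can be sketched as follows. This is a rough illustration, not a calculation taken from the paper: the 3.1 billion bp genome size is an assumed round figure, and only single-nucleotide substitutions are counted (three alternative bases per site), so insertions, deletions, and larger rearrangements are ignored.

```python
# Back-of-the-envelope count of possible single-nucleotide variants in junk DNA.
# ASSUMPTIONS (not from the post): a haploid genome of about 3.1 billion bp,
# and exactly three alternative bases at every position.

GENOME_SIZE_BP = 3_100_000_000   # assumed haploid genome size
JUNK_PERCENT = 90                # fraction of the genome taken to be junk
ALT_BASES_PER_SITE = 3           # e.g. a reference A can become C, G, or T

# Integer arithmetic keeps the result exact.
junk_bp = GENOME_SIZE_BP * JUNK_PERCENT // 100
possible_snvs = junk_bp * ALT_BASES_PER_SITE

print(f"Junk DNA: {junk_bp / 1e9:.2f} billion bp")
print(f"Possible single-nucleotide variants: {possible_snvs / 1e9:.2f} billion")
```

Under these assumptions the junk fraction comes to roughly 2.8 billion bp and the number of possible single-nucleotide variants to a little over 8 billion, which matches the figures above.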

One of the things you can do with this data is to cluster the variants according to the self-identified ethnic group of the participants and All of Us didn't hesitate to do this. They even identified the clusters as races, proving once again that there are clear genetic differences between these groups, just as you would expect. Given the sensitive nature of this fact, you would also expect a lot of criticism on the internet and that's what happened.


1. I'm defining a "variant" as a difference from the reference genome sequence. I'm aware of the terminology issue but it's not important here. There will also be a large number of variants in the functional regions.

All of Us Research Program Investigators (2024) Genomic data in the All of Us Research Program. Nature 627:340. [doi: 10.1038/s41586-023-06957-x].

8 comments:

John Harshman said...

They even identified the clusters as races, proving once again that there are clear genetic differences between these groups, just as you would expect.

I bet the clear differences arise because the clines separating these regions are poorly sampled. African Americans, for example, are almost entirely descended from coastal West Africans, which is seriously undersampling African genomes. Bet there's not much from Central Asia either. And so on. I suppose you can call those isolated samples "races" if you like, but I'm not seeing it.

Mikkel Rumraket Rasmussen said...

You have to wonder how people are able to self-identify remarkably accurately without being able to read their own genomes for something that doesn't exist. I mean what is it we are saying that doesn't exist?

If I say I believe I am "white" or "european", am I not simply saying that I am, by appearance, more similar to people who have been living in a certain geographical area for a substantial amount of time, and that this is true genetically too? Isn't this just what people mean when they accurately (in a way also reflected by genetic similarity or relatedness) estimate their own ethnicity or "race"?

Zach Hancock said...

"They even identified the clusters as races, proving once again that there are clear genetic differences between these groups, just as you would expect."

Do humans have genetic variation that is geographically correlated? Yes. Are there clear genetic differences between 'groups' such as socially defined 'races'? No.

The All of Us study has been roundly criticized (e.g., https://liorpachter.wordpress.com/2024/02/26/all-of-us-failed/) for their portrayal of human race. The key issue is the usage of UMAP, which is a clustering algorithm that exacerbates unshared variance for aid in visualization. In doing so, it renders the distance between groups as meaningless (see their own documentation: https://pair-code.github.io/understanding-umap/). Furthermore, as is the case in all STRUCTURE plots, when you cluster by pre-designated k-values instead of inferring them (or even when you do if you assume there is no clinal pattern of relatedness), you will, by necessity, find discrete patterns. You forced the data to be so. It should be noted that the creator of STRUCTURE, Jonathan Pritchard, has also spoken out against the All of Us portrayal of human variation (https://twitter.com/jkpritch/status/1759769445759893832).

If you plot this *exact* same data on a PCA instead of using UMAP, the pattern is a characteristic horseshoe without any discrete breaks at all. At no point could you say "ah look, clear genetic breaks!". There simply are none, and the UMAP method promotes a false view of human genetic variation by arbitrarily exacerbating differences.

Further reading:
https://www.nature.com/articles/d41586-024-00568-w

Larry Moran said...

@Zach Hancock: I understand the criticism and the fact that UMAP was not the best way to illustrate the genetic differences between the various human populations.

But let's not let that confuse the real issue. You say, "Are there clear genetic differences between 'groups' such as socially defined 'races'? No."

That's only true if you are quibbling about "clear genetic differences" or "socially defined races." If you compare the allele frequencies in a group of people from Japan and a group of people from Nigeria, you will have no trouble identifying a cluster of genetic markers that can be used to distinguish those two groups with incredible accuracy.

John Harshman said...

If you compare the allele frequencies in a group of people from Japan and a group of people from Nigeria, you will have no trouble identifying a cluster of genetic markers that can be used to distinguish those two groups with incredible accuracy.

Sure, if you choose two geographically distant populations. Nobody disagrees that there's geographically structured variation. The question is whether there are discrete "races". Try finding the races in a continuous transect rather than just two endpoints.

Larry Moran said...

@John Harshman: What is your stance on the existence of genetically distinct populations within the species Homo sapiens? Are you saying that they don't exist because you can always find some examples of intermediates that share the genetic characteristics of more than one population?

Or are you placing the emphasis on "discrete" in order to deny that there's any one allele that is strictly confined to one of the traditional races and never found in the other?

Zach Hancock said...

@Larry Moran: Of course there are geographically correlated allele frequencies, as I stated. This is a simple fact of limited dispersal and exists in virtually all organisms. But that does not constitute a "race" in a biological sense. Taxonomically, a "race" is a discrete unit in which all members are more related within than between. Here, "discrete" means there are genetic breaks between them that are identifiable via PCA or some similar metric.

Humans do not have any such breaks. In your example, if you sampled humans continuously from Nigeria to Japan at no point would you be able to draw a line and say "now we've gone from 1 race to a different one". Human ancestry is continuous with respect to geography, as has been shown many times (e.g., Ramachandran et al. 2005). Again, identifying geographic variants is not the same as correlating them to social races, because every locality on earth has unique variation (as does every family!). So unless we want to define each village as constituting a "race", we need to do better than geographic correlation.

There's no way to objectively define or disentangle "race" from "intermediate" in a true continuum. Anywhere you start iterating ancestry with distance, you'd find the same linear pattern. You are either forced to state there are no biological races (as most geneticists long ago conceded) or designate so many races that the concept, at least taxonomically, loses all meaning and ceases to reflect anything we mean socially.
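The transect argument in this comment can be illustrated with a toy model. This is only a sketch under loud assumptions: the endpoint allele frequencies, the number of sampling points, and the strictly linear cline are all hypothetical values chosen for illustration, not data from All of Us or from Ramachandran et al. Because the continuity is built into the model, it shows what the argument claims rather than testing it.

```python
# Toy isolation-by-distance model: allele frequencies change linearly
# along a transect between two distant populations.
# ASSUMPTION: every number below is hypothetical and purely illustrative.

N_DEMES = 11  # hypothetical sampling points, endpoints included

# Hypothetical (start, end) allele frequencies for five loci.
ENDPOINT_FREQS = [
    (0.10, 0.90),
    (0.80, 0.20),
    (0.30, 0.70),
    (0.95, 0.55),
    (0.50, 0.50),
]


def deme_frequencies(position):
    """Allele frequencies at a relative position (0.0 to 1.0) on the transect."""
    return [start + (end - start) * position for start, end in ENDPOINT_FREQS]


def mean_abs_difference(freqs_a, freqs_b):
    """Mean absolute allele-frequency difference between two demes."""
    return sum(abs(a - b) for a, b in zip(freqs_a, freqs_b)) / len(freqs_a)


demes = [deme_frequencies(i / (N_DEMES - 1)) for i in range(N_DEMES)]

# Distance between the two ends of the transect.
endpoint_distance = mean_abs_difference(demes[0], demes[-1])

# Distance between each pair of neighbouring demes.
neighbor_distances = [
    mean_abs_difference(demes[i], demes[i + 1]) for i in range(N_DEMES - 1)
]

print(f"Endpoints differ by {endpoint_distance:.3f}")
for i, d in enumerate(neighbor_distances):
    print(f"Deme {i} to deme {i + 1}: {d:.3f}")
```

In this model the two endpoints are easy to tell apart (a mean frequency difference of about 0.44), yet every step between neighbours is the same small size (about 0.044), so there is no step at which one could place a boundary. Whether real human data look like this is an empirical question that the toy model cannot settle.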

John Harshman said...

Are you saying that they don't exist because you can always find some examples of intermediates that share the genetic characteristics of more than one population?

I can't do better than Zach's answer for that one.

Or are you placing the emphasis on "discrete" in order to deny that there's any one allele that is strictly confined to one of the traditional races and never found in the other?

As far as I know there are no private alleles in any human population. Certainly the traditional races are genetically meaningless.