
Friday, March 29, 2019

Are multiple transcription start sites functional or mistakes?

If you look in the various databases you'll see that most human genes have multiple transcription start sites. The evidence for the existence of these variants is solid—they exist—but it's not clear whether the minor start sites are truly functional or whether they are just due to mistakes in transcription initiation. They are included in the databases because annotators are unable to distinguish between these possibilities.

Let's look at the entry for the human triosephosphate isomerase gene (TPI1; Gene ID 7167).


The correct mRNA is NM_000365.5, third from the top. (Trust me on this!) The three other variants have different transcription start sites: two of them are upstream and one is downstream of the major site. Are these variants functional or are they simply transcription initiation errors? This is the same problem that we dealt with when we looked at splice variants. In that case I concluded that most splice variants are due to splicing errors and true alternative splicing is rare.

This is not a difficult question to answer when you are looking at specific well-characterized genes such as TPI1. The three variants are present at very low concentrations, they are not conserved in other species, and they encode variant proteins that have never been detected. It seems reasonable to go with the null hypothesis; namely, that they are non-functional transcripts due to errors in transcription initiation.

However, this approach is not practical for every one of the 25,000 genes in the human genome, so several groups have looked for a genomics experiment that will address the question. I'd like to recommend a recent paper in PLoS Biology that tries to do this in a very clever way. It's also a paper that does an excellent job of explaining the controversy in a way that all scientific papers should copy.1
Xu, C., Park, J.-K., and Zhang, J. (2019) Evidence that alternative transcriptional initiation is largely nonadaptive. PLoS Biology, 17(3), e3000197. [doi: 10.1371/journal.pbio.3000197]

Abstract

Alternative transcriptional initiation (ATI) refers to the frequent observation that one gene has multiple transcription start sites (TSSs). Although this phenomenon is thought to be adaptive, the specific advantage is rarely known. Here, we propose that each gene has one optimal TSS and that ATI arises primarily from imprecise transcriptional initiation that could be deleterious. This error hypothesis predicts that (i) the TSS diversity of a gene reduces with its expression level; (ii) the fractional use of the major TSS increases, but that of each minor TSS decreases, with the gene expression level; and (iii) cis-elements for major TSSs are selectively constrained, while those for minor TSSs are not. By contrast, the adaptive hypothesis does not make these predictions a priori. Our analysis of human and mouse transcriptomes confirms each of the three predictions. These and other findings strongly suggest that ATI predominantly results from molecular errors, requiring a major revision of our understanding of the precision and regulation of transcription. [my emphasis - LAM]

Author summary

Multiple surveys of transcriptional initiation showed that mammalian genes typically have multiple transcription start sites such that transcription is initiated from any one of these sites. Many researchers believe that this phenomenon is adaptive because it allows production of multiple transcripts, from the same gene, that potentially vary in function or post-transcriptional regulation. Nevertheless, it is also possible that each gene has only one optimal transcription start site and that alternative transcriptional initiation arises primarily from molecular errors that are slightly deleterious. This error hypothesis makes a series of predictions about the amount of transcription start site diversity per gene, relative uses of the various start sites of a gene, among-tissue and across-species differences in start site usage, and the evolutionary conservation of cis-regulatory elements of various start sites, all of which are verified in our analyses of genome-wide transcription start site data from the human and mouse. These findings strongly suggest that alternative transcriptional initiation largely reflects molecular errors instead of molecular adaptations and require a rethink of the precision and regulation of transcription.
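To make the error hypothesis concrete, predictions (i) and (ii) come down to two simple quantities you can compute from CAGE-style tag counts for each TSS of a gene: a measure of TSS diversity and the fractional use of the major TSS. Here's a minimal sketch of those quantities (my own illustration with invented counts, not the authors' pipeline; I'm using Shannon entropy as one possible diversity measure):

```python
# Sketch only: per-TSS tag counts for one gene (invented numbers).
from math import log

tss_counts = {"TSS_major": 950, "TSS_minor1": 30, "TSS_minor2": 15, "TSS_minor3": 5}

total = sum(tss_counts.values())                  # proxy for expression level
fractions = {t: c / total for t, c in tss_counts.items()}

major_fraction = max(fractions.values())          # fractional use of the major TSS
# Shannon entropy of TSS usage as one possible measure of TSS diversity
tss_diversity = -sum(f * log(f) for f in fractions.values() if f > 0)

print(f"expression proxy: {total} tags")
print(f"major TSS fraction: {major_fraction:.3f}")
print(f"TSS diversity (entropy): {tss_diversity:.3f}")
```

The error hypothesis predicts that, across genes, the diversity measure should fall and the major-TSS fraction should rise as expression increases.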
I'm not going to describe the experimental results; if you're interested you can read the paper yourself. Instead, I want to focus on the way the authors present the problem and how it could be resolved.

One of the important issues in these kinds of problems is not whether there are well-established cases where the phenomenon is responsible for functional alternatives but whether the phenomenon is widespread. In this case, we know of specific examples of genes with multiple transcription start sites (TSS) that have a well-established function. The authors include a brief summary of these examples and conclude with an important caveat.
Nevertheless, alternative TSSs with verified benefits account for only a tiny fraction of all known TSSs, while the vast majority of TSSs have unknown functions. More than 90,000 TSSs are annotated for approximately 20,000 human protein-coding genes in ENSEMBL genome reference consortium human build 37 (GRCh37). Recent surveys using high-throughput sequencing methods such as deep cap analysis gene expression (deepCAGE) showed that human TSSs are much more abundant than what has been annotated. Are most TSSs of a gene functionally distinct, and is ATI generally adaptive? While this possibility exists, here we propose and test an alternative, nonadaptive hypothesis that is at least as reasonable as the adaptive hypothesis. Specifically, we propose that there is only one optimal TSS per gene and that other TSSs arise from errors in transcriptional initiation that are mostly slightly deleterious. This hypothesis is based on the consideration that transcriptional initiation has a limited fidelity, and harmful ATI may not be fully suppressed by natural selection if the harm is sufficiently small or if the cost of fully suppressing harmful ATI is even larger than the benefit from suppressing it.
This is how scientific papers should be written, but too often we see scientists who assume that because some variants are functional it must mean that all variants are functional. They don't bother to mention the possibility that some could be functional but most are not.

Why is it important to decide whether multiple transcription start sites are functional? The simple answer is that it's always better to know the truth, but there's more to it than that. Because these variants are included in the sequence databases, they are usually assumed to be functional. Let's say someone wants to look at 5' UTR sequences in order to see if there are specific signals that control RNA stability. In the case of the TPI1 gene (see above) they will get four different results because there are four different transcription start sites, and the programs that scan the databases aren't able to recognize that three of these might be artifacts. That's a problem.
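To see the problem in practice, imagine a naive script that pulls the annotated TPI1 mRNAs and reports the region upstream of the CDS as the 5' UTR; it will return a different answer for every annotated variant without flagging any of them as suspect. A rough sketch, assuming Biopython and network access (only the major RefSeq accession is hard-coded here; the variant accessions would come from the Gene ID 7167 record):

```python
# Rough sketch: fetch annotated TPI1 mRNAs and report their 5' UTR lengths.
# A naive database scan like this treats every variant as equally valid.
from Bio import Entrez, SeqIO

Entrez.email = "you@example.com"  # placeholder; NCBI requires an email address

# NM_000365 is the major TPI1 mRNA; extend this list with the other variant
# accessions listed under Gene ID 7167 to get one 5' UTR per annotated start site.
accessions = ["NM_000365"]

for acc in accessions:
    handle = Entrez.efetch(db="nucleotide", id=acc, rettype="gb", retmode="text")
    record = SeqIO.read(handle, "genbank")
    handle.close()
    cds = next(f for f in record.features if f.type == "CDS")
    utr5 = record.seq[: int(cds.location.start)]  # everything 5' of the CDS
    print(acc, len(utr5), "nt 5' UTR")
```

Nothing in the record tells the script which start site is the real one; that judgment has to come from somewhere else.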

It also affects the definition of a gene and the amount of DNA devoted to genes. If the longest transcript is taken as the true size of the gene, as it often is, then this misrepresents the true nature of the gene. There's no easy way to fix this problem unless we pay annotators to closely examine each individual gene to figure out which transcripts are functional and which ones are not. They've done this for many splice variants, which is why many have been removed from the sequence databases, but it's a labor-intensive and expensive task.

Up until now, most scientists have not been aware that there's a problem. As is the case with alternative splicing and other phenomena, the average scientist just assumes that the variants in the databases represent true functional alternatives that contribute to gene expression. The authors of this paper (Xu et al., 2019) want to alert everyone to the distinct possibility that their results with transcription start sites raise a much more general concern that needs to be addressed. That's why they say,
Our results on ATI echo recent findings about a number of phenomena that increase transcriptome diversity, including alternative polyadenylation, alternative splicing, and several forms of RNA editing. They have all been shown to be largely the results of molecular errors instead of adaptive regulatory mechanisms. Together, these findings reveal the astonishing imprecision of key molecular processes in the cell, contrasting the common view of an exquisitely perfected cellular life.
Read that last sentence very carefully because it addresses what I think is the main problem. It's a question of contradictory worldviews that color one's interpretation of the data. If you think that life is exquisitely designed (by natural selection) then you tend to look at all variants as part of an extremely complex system that fine-tunes gene expression. On the other hand, if you think that the drift-barrier hypothesis is valid then you tend to discount the power of natural selection to weed out all transcription and splicing errors and you see biochemistry as an inherently messy process.


1. I've been highly critical of papers about junk DNA and alternative splicing because they often ignore the fact that there is a controversy. They do not mention that there is solid evidence for junk DNA and solid evidence that alternative splicing is uncommon.

42 comments:

Joe Felsenstein said...

I'm very glad you are taking a skeptical look at the meaning of splice sites and transcription sites. This is badly needed as a counterweight to all those molecular biologists who have a mystical belief that every site in the genome is in a state that is precisely meaningful, which has the happy implication that granting agencies must give us lots more money to find out what all those sites are doing.

Christopher B said...

How important is the 5' UTR for stability? It is well established that the 3' UTR generally gets modified by recruitment of the cleavage and polyadenylation specificity factor (CPSF) to the highly conserved AAUAAA site, and the RNA is cut and a poly-A tail of ~250 nucleotides (with exceptions) is added. This tail and the proteins binding to it are, as far as I know, the main factors contributing to mRNA stability. Once the poly-A tail gets shortened it acts as a signal for factors to decap the 5' UTR and then the mRNA is completely degraded. Are there mechanisms where decapping is the initiating step?

That question was just out of curiosity; otherwise I think being skeptical about the thousands of RNA byproducts from various sources "all" being functional is warranted.

Rosie Redfield said...

Thanks Larry, this is excellent!

Larry Moran said...

I don’t know the answer to your question. I just made up an example to illustrate how misleading information in the sequence databases could have serious implications.

Michael Tress said...

"the programs that scan the databases aren't able to recognize that three of these might be artifacts"

If they were to use APPRIS (available for multiple genomes and versions) they wouldn't have that problem. APPRIS selects principal isoforms for a range of organisms and reference annotations. In this case it selects NM_000365.5 as the principal isoform and the other two as minor (http://appris.bioinfo.cnio.es/#/database/id/homo_sapiens/7167?as=hg38&sc=refseq&ds=rs109v28). A tool like APPRIS is only going to be more important as databases expand (which they will). As a taster, have a look at the Ensembl annotation for the same gene; they have eight TSS to go with the three in RefSeq (http://www.ensembl.org/Homo_sapiens/Location/View?db=core;g=ENSG00000111669;r=12:6867021-6871014).

By the way, there are a number of genes that do have multiple likely functional TSS. PLEC is a great example. So is the UDP glucuronosyltransferase family 1A gene, and the protocadherin A and protocadherin G genes, all three of which have unfortunately been split into multiple distinct genes in the reference annotations.

Unknown said...

I have not had time to digest the full gist of your blog. The entries I have perused seem to be fixated on “Junk” DNA.

I found your methodology skewed.

What one might find for one might not apply to all.

Also the term reductionism readily came to mind.

If you are looking for some stimulation – Life's Greatest Secret – Cobb, In Search of Cell History – Harold, and The Secret Life of Chaos – Turing – are suggested.

It is ironic that Paulding found the work of Watson and Crick trivial (so easy a caveman could do it – maybe even an 24), and pined more for an explanation of how proteins provide for cell structure.

Larry Moran said...

Who is Paulding?

John Harshman said...

That would be Lindus Paulding, double winnder of the Nobdel Pridze.

Larry Moran said...

That can’t be right because Linus Pauling was Watson & Crick’s major competitor in the race to decipher the structure of DNA. He published an incorrect prediction a few months before the Watson & Crick paper in 1953. It makes no sense that Linus Pauling would say that the W&C work was trivial and so easy a caveman could do it. In fact, we know that he greatly admired Watson & Crick’s achievement and regretted his mistakes in his paper.

Unknown must be quoting someone named Paulding who thinks that Watson, Crick, and Linus Pauling are all stupid.

Gnomon said...

Joe Felsenstein's insistence on neutrality "has the happy implication that granting agencies must give us lots more money" to use his Maximum Likelihood method, which requires the assumption of neutrality for nearly all sites, to draw all sorts of fanciful evolutionary trees.

Joe Felsenstein said...

I see that Gnomon doesn't understand the inference of phylogenies, either.

Gnomon said...

Joe Felsenstein,

I cite from your highly influential 1981 paper (Evolutionary Trees from DNA Sequences: A Maximum Likelihood Approach) on the Maximum Likelihood method to show how your method depends on the neutral model. It remains possible though that I have missed something obvious and important and so please feel free to correct me.

Felsenstein: "Computation is enormously facilitated if we can assume that changes at different sites in the sequence are probabilistic events which are independent. This is a restrictive assumption, but practical computation does not appear feasible without it." … "We assume that after speciation two lineages evolve independently, and that the same stochastic process of base substitution applies in all lineages." … "The model used here is highly idealized, and the precision of the statistical inferences must be reduced by a factor representing one's skepticism of the assumptions involved. The absence of deletions and insertions, as well as of constraints on amino acid substitution, are particular sources of concern."

If bases have functions and are under natural selection and physiological selection, their fixation or substitution would not be stochastic, probabilistic, and independent events. But if they are neutral, then yes. So, your "idealized" assumptions are unreal, and they would only be realistic if most bases are neutral.

Also, the reality is that some species can tolerate more non-harmful mutations than others. A gene in bacteria can tolerate more variations than its ortholog in monkeys (many mutations may be lethal to monkeys but not to bacteria, whereas few mutations are lethal to bacteria but not to monkeys). Therefore, the sequence difference between two species may be contributed mostly by one of the two, rather than by the two equally. Your method does not take this into account, and would only apply to cases where the two species have no differences in the proportions of tolerable mutations or in functional/physiological constraints on the sequences. Hence, your method is more appropriate for neutral DNAs.

John Harshman said...

It's so cool when Gnomon lectures Joe on phylogenetics.

Joe Felsenstein said...

It is very common for people using ML methods to take into account variation among sites in rate of substitution. How does that fit into neutrality? Does neutrality also allow for some purifying selection, more at some sites than at others?
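[For reference: "rate variation among sites" in ML methods means the per-site likelihoods are averaged over a distribution of rates, usually a discretized gamma. A standard textbook form (my summary, not specific to either commenter's analysis) is

L(T) = \prod_{i=1}^{N} \int_0^{\infty} \Pr(D_i \mid T, r)\, g(r;\alpha)\, dr \;\approx\; \prod_{i=1}^{N} \frac{1}{K} \sum_{k=1}^{K} \Pr(D_i \mid T, r_k),

where D_i is the data at site i, T is the tree with branch lengths, g(r;α) is a gamma density with mean 1, and r_1, …, r_K are the K discrete rate categories. The outer product is the independence-across-sites assumption quoted earlier; a small α puts most sites at very low rates (strong purifying selection) and a few at high rates, so the model does not assume that all sites evolve neutrally or at the same rate.]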

Gnomon said...

Indeed. ML methods do not require a molecular clock, which apparently may lead people to believe that these methods are also independent of the neutral theory, as the two are nearly equivalent (the molecular clock was claimed by Kimura/Ohta to be “the strongest evidence” for the neutral theory and the neutral theory was inspired by the molecular clock). However, the neutral theory is more than the molecular clock and in fact does not explain the details of the molecular clock as openly acknowledged by many such as Ayala (e.g., it predicts equal rates as measured in generations and yet the molecular clock is measured in years). The ML methods are in many respects dependent on the neutral assumption as I explained above. On the other hand, the methods do not rely on the molecular clock, which however is not in conflict with the dependency on the neutral theory as the molecular clock and the neutral theory are not exactly equivalent. Regardless, both the molecular clock and the neutral theory were mis-inspired by the most astonishing finding in evolution, the genetic equidistance result of Margoliash in 1963, and their mistake was to treat saturated maximum distance as linear.

ML methods require evolutionary models. In the case of amino acid substitution models, such as giving a higher probability to an R-K change than to an R-L change, the probability matrices of change were derived from observing large numbers of protein alignments. Mismatches in such alignments are largely due to functional selection rather than neutral drift, in my opinion (an R-K change being more common than an R-L change is an indication of functional selection). If all changes were neutral, an amino acid should have an equal probability of changing to any of the other 19 amino acids. So, here, ML methods do not use the neutral model. By being dependent on the neutral model in some aspects but on its opposite in others, the ML methods lack self-consistency.

What is the alternative then? I am afraid that one simply cannot use evolutionary models, as no one really knows the chain of events from the ancestor to the extant species, which can be extremely hard to model realistically. That leaves us only with the distance methods, which in my opinion can qualify for the job when used properly. First, using slow evolving genes still at the linear phase of change will solve the issue of saturation, and we need to get rid of the approaches involving unreliable corrections of saturated distances in fast evolving genes, as is commonly done. Second, changes in such slow evolving genes qualify as neutral, because they are under neither positive nor negative selection. They are too slow to meet adaptive needs to be positively selected, and their occurrence per se indicates a lack of strong negative selection. Hence, an amino acid may truly have a near equal probability of changing to any other. Finally, slow evolving genes may vary less dramatically among species. Plus, we can always restrict our analyses to closely related species (do a human-ape tree, followed by an ape-monkey tree, followed by a monkey-prosimian tree). Here then, the distance methods would be completely dependent on the neutral theory and self-consistent. The genes that qualify for use in such methods are an extreme minority of the genome. The neutral sites are relatively rare in the genome.

John Harshman said...

Gnomon does not seem to understand the point about site-to-site rate variation, +I +gamma, and such. In turn, I don't understand why distance methods should be fine if likelihood methods are not. And I had forgotten that he rejects the idea of junk DNA. Pathological science.

Gnomon said...

The junk DNA notion is being pounded every week by papers from bench biologists. Correspondingly, the maximum genetic diversity theory is being proven every week. I am afraid that soon it will be all over and the trash can will collect the following ill-fated human inventions: the molecular clock, the neutral theory in the broad sense, junk DNA, the Out of Africa model, and most phylogenetic trees.

Here is the latest heavyweight evidence disproving the junk DNA notion. https://www.sciencedaily.com/releases/2019/04/190418131320.htm (Taming the genome's 'jumping' sequences.)

Julien Pontis, Evarist Planet, Sandra Offner, Priscilla Turelli, Julien Duc, Alexandre Coudray, Thorold W. Theunissen, Rudolf Jaenisch, Didier Trono. Hominoid-Specific Transposable Elements and KZFPs Facilitate Human Embryonic Genome Activation and Control Transcription in Naive Human ESCs. Cell Stem Cell, 2019; DOI: 10.1016/j.stem.2019.03.012

Gnomon said...

To estimate the neutral fraction of the human genome, what is typically done is to compare with transposable elements (TEs). By assuming TEs to be neutral, other sequences with similar purifying selection would be classified as neutral/junk. For example, an influential paper by Ponting and Hardison concluded that ~90% of the human genome is junk and stated their methods: "Many evolutionary methods (that estimate functional fractions of the genome) assume the absence of purifying selection in a fraction of genomic sequences. Often these are 'ancestral repeats' (ARs), which are aligned transposable element–derived sequences present in the last common ancestor of the species under consideration". Therefore, the new research paper concluding functions for TEs (nearly all of them), which represent 50% of the human genome sequence, has finally and effectively killed the junk DNA notion.

Ponting and Hardison, 2011 What fraction of the human genome is functional? Genome Research 21(11): 1769–1776.

Larry Moran said...

Pontis et al. (2019) imagine that millions and millions of degenerative TE elements are maintained in the human genome in order to facilitate evolution. They imagine that every few million years a TE insertion will prove to be beneficial in the human population and that the benefit will be sufficient to lead to fixation. In their minds, this is an explanation for devoting >50% of the genome to neutrally evolving TE's.

The alternative explanation, which I favor, is that TE's are junk DNA whose individual detrimental effects are nearly neutral in species with small populations.

Pontis et al. (2019) report that fewer than one in one thousand degenerative TE elements are transcriptionally active but many of those are activated in early embryogenesis when a large proportion of the chromatin is in an open domain and the promoters are accessible to transcription factor binding. Many human TE's contain a certain transcription factor binding site that happens to be activated then down-regulated by another factor, thus reducing the level of transcription during embryogenesis.

The authors claim that this down regulation, which I consider to be an accident of evolution, is evidence that human cells are reducing the detrimental effect of these TE's in order to preserve large numbers for future evolution.

Joe Felsenstein said...

@John: You're right, he doesn't understand the point about site-to-site rate variation. But that's OK, I don't understand Gnomon's "Maximum genetic diversity" theory either.

Gnomon said...

John and Joe: I just happened to find the rate variations among sites to be much more consistent with functional selection than with that modeled by Gamma. Why would anyone take the Gamma model seriously when it is mostly adopted for convenience?

From slides by Felsenstein (http://evolution.gs.washington.edu/gs541/2003/lecture34.pdf):

Unrealistic aspects of the model (Gamma):
- There is no reason, aside from mathematical convenience, to assume that the Gamma is the right distribution. A common variation is to assume there is a separate probability f0 of having rate 0.
- Rates at different sites appear to be correlated, which this model does not allow.
- Rates are not constant throughout evolution – they change with time.

Gnomon said...

Larry: It is always good to hear alternative explanations. However, to consider that HUNDREDS (not just one or a few) of KLZF protein factors all act in concert in an accidental fashion to control TE transcription in a very timely precise manner is beyond imagination. If random accidents, wouldn't one expect a complete mess as a result of hundreds of proteins each acting randomly? Besides, how do you test such a hypothesis? It is not science if it is not testable. The neutral school should really think hard about testing their position experimentally. Computational tests have been done indeed, but they are basically meaningless tests as they need to first assume neutrality for certain sequences, most often TEs, the archetype of junk DNA (in their imagination). Why would people be so uncritical of the claims and assumptions of the neutral school? It of course all started with the invention of the molecular clock. To see why the molecular clock was grossly and mindlessly mistaken from the start, I recommend one of my many papers on the topic: Huang (2009) The Overlap Feature of the Genetic Equidistance Result: A Fundamental Biological Phenomenon Overlooked for Nearly Half of a Century. Biological Theory 5(1):40-52

I would appreciate any comments on the paper. The two sides of the debate need to first understand each other's positions in order to carry on a productive and intelligent conversation. As things stand, I understand most of the neutral school but few if any from the other side understand mine. There is no good excuse for this, as mine is the only alternative to the molecular clock, so it is not as if there are hundreds of alternatives one must invest time in to pick the correct one. In addition, the molecular clock is widely known to be unrealistic. My paper explains the real reason for this: it mistreats saturated maximum distance as linear. Maximum genetic diversity is a theory but is foremost a statement of fact: genetic distance and genetic diversity are mostly at saturation maximum levels today. It is an easily verifiable fact. The neutral school needs to either challenge it with tests or accept it, rather than just ignore it.

Larry Moran said...

@Gnomon

I don't think you understand the concept of the null hypothesis and why it's so important in science. Consider the following claim, "... HUNDREDS (not just one or a few) of KLZF protein factors all act in concert in an accidental fashion to control TE transcription in a very timely precise manner."

That's a specific claim and the burden of proof lies with those making such a claim. What is the evidence that this phenomenon is really an example of precise regulation? One test is to ask if it is conserved ... but it is not conserved.

You are trying to shift the burden of proof by demanding that someone prove that this is NOT a biologically relevant example of precise regulation instead of an accident of evolution. That's not how science works.

The authors of the paper started with the unproven assumption that TE's serve some adaptive purpose and this leads them to interpret their data as support of their assumption. They don't even consider the possibility that their underlying assumption is incorrect and their conclusion is just another example of confirmation bias.

Larry Moran said...

@Gnomon

I have read all your papers and I find them mostly incomprehensible. The parts I understand are unconvincing. You seem to be attacking strawmen and that's consistent with your comments on this blog.

You have been asked repeatedly to explain your ideas in a comprehensible manner and you have always ducked the question by referring us back to the very papers that prompted us to ask for clarification.

Mikkel Rumraket Rasmussen said...

"However, to consider HUNDREDS(not just one or a few) of KLZF protein factors all act in concert in an accidental fashion to control TE transcription in a very timely precise manner is beyond imagination. If random accidents, wouldn’t one would expect a complete mess as a result of hundreds of proteins each acting randomly? Besides, how do you test such your hypothesis? It is not science if it is not testable."

"acting in concert", "timely precise manner", "random accidents"?

Confirmed creationist. The only people who speak like this are creationists; this guy is a creationist. You will see this exact same nonsense, even the same phrases, used by people like Bill Cole, Sal Cordova, and "phoodoo" over on The Skeptical Zone.

Joe Felsenstein said...

Gnomon states his Maximum Genetic Diversity theory thusly: genetic distance and genetic diversity are mostly at saturation maximum levels today. This is flatly wrong. In a typical human, for example, there is one site heterozygous per 1000. That is very far short of the maximum possible heterozygosity. And if, in any group, the DNA sequences changed so much that the differences between species reached saturation, then we would be completely unable to use DNA sequences to detect relationships in their phylogeny. Which is not true. Whatever Gnomon's MGD hypothesis means, it is wrong. Has it been explained clearly? Just look at how widely it is understood by the average molecular evolutionist. Not.

John Harshman said...

I think his theory involves changes in the number of sites free to vary so that the nature of saturation changes between species, which accounts in his mind for different genetic distances among species. Why that would result in a consistent nested hierarchy is unclear.

Gnomon said...

John has got the maximum genetic diversity (MGD) theory basically correct. In a sequence alignment with humans, there is a nested hierarchy with humans less and less related to increasingly less complex species. As less complex species evolved earlier, the nested hierarchy of gene identity shows correlations with two different parameters, complexity and time. This was apparent when the first such alignments were done in the early 1960s, as in the case of cytochrome C in the paper by Margoliash in 1963. Identity to humans decreases in the order of chickens, fishes, and yeasts. If one focused only on the time correlation, as did Margoliash, Zuckerkandl and Pauling, one would conclude that protein non-identity is determined only by time of separation, as if the substitution rate is constant and the same among species (hence the molecular clock was born). Seemingly consistent with this, the same alignment data also showed that yeasts are equidistant to fishes, chickens, and humans, or that fishes are equidistant to chickens and humans (hence the equidistance result was born). On the other hand, if one focused on the complexity parameter and ignored time, one would find a nice correlation of sequence identity with species complexity. One also finds that a simple species is equidistant to all more complex species.

The correlation with complexity makes sense as closer physiology should mean closer gene identity and less complex species should be able to tolerate more mutation variations (think HIV and bacteria). Genomes have two types of sequences, functional and neutral, both of which show correlation with time in their variations. The neutral sequence is explained by the neutral theory. The variation in functional sequence is correlated with physiology (as explained by the MGD theory) and indirectly with time, as physiological complexity is correlated with time, with simpler physiology having evolved earlier. Functional sequences are under physiological/natural selection and variations in them would quickly reach saturation maximum or optimum, because lower than optimum would mean less fitness such as poor immunity and so quick elimination. So, variations in functional DNAs are expected to be under positive selection to quickly reach optimum/maximum level.

It is easy to tell optimum/maximum distances from linear distances. Imagine a 100 amino acid protein with only 1 neutral site. In a multispecies alignment involving at least 3 species, if one finds only one species with a mutation at this neutral site while all other species have the same non mutated residue, there is no saturation. However, if one finds that nearly every species has a unique amino acid, one would conclude mutation saturation as there are multiple independent substitution events among different species at the same site and repeated mutations at the same site do not increase distance. We have termed those sites with repeated mutations “overlap” sites. So, a diagnostic criterion for saturated maximum distance is the proportion of overlap sites. It turns out that the original cytochrome C alignments of Margoliash have high proportions of overlap sites and so are in fact maximum distances (so was the hemoglobin result of Zuckerkandl and Pauling). Thus, the interpretation of that result as molecular clock or a linear distance phenomenon (mutations always increase distances and distances always correlate with time) was mistaken, as overlapped mutations at the same sites should mean no increase of distance with time.

We have verified that the vast majority of proteins show maximum distances and only a small proportion, the slowest evolving, are still at the linear phase of changes.
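[To make the "overlap site" criterion above concrete, here is a literal-minded sketch of the tally it implies — my reading of the description, not code from any published method. It flags alignment columns where two or more sequences carry non-consensus residues; a strict version would need a tree to establish that the substitutions are truly independent. The alignment is invented.

```python
# Sketch: count "overlap" columns, i.e. columns where >=2 sequences differ
# from the consensus residue, as a rough proxy for repeated mutation.
from collections import Counter

alignment = [          # equal-length sequences, one per species (made up)
    "MKTAYIAKQR",
    "MKTGYIAKQR",
    "MKSAYIAKHR",
    "MKCAYIAKWR",
]

def overlap_tally(seqs):
    n_sites = len(seqs[0])
    overlap = 0
    variable = 0
    for i in range(n_sites):
        column = [s[i] for s in seqs]
        consensus, _ = Counter(column).most_common(1)[0]
        n_diff = sum(1 for aa in column if aa != consensus)
        if n_diff > 0:
            variable += 1
        if n_diff >= 2:          # two or more lineages hit the same site
            overlap += 1
    return overlap, variable, n_sites

ov, var, n = overlap_tally(alignment)
print(f"{ov} overlap sites out of {var} variable sites ({n} total)")
```

A high proportion of overlap sites is what the comment above treats as the signature of saturated (maximum) distances.]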

Gnomon said...

Joe is right that the human genetic diversity level is about 1 in 1000 bp. We have verified that this level is at optimum or maximum because a slight increase would mean diseases or negative selection. Using genome-wide SNP data from thousands of subjects, we have nearly 10 papers in the past decade showing that patient groups (from many different diseases) had higher genetic diversity levels than matched control groups. In addition, saturation can be inferred by another independent approach. Different racial groups are known to share most SNPs (Lewontin 1972). However, this sharing is mostly in the fast evolving DNAs and decreases according to the evolutionary rates of DNAs, which means that sharing is a saturation phenomenon or a result of independent mutations hitting the same sites.

John Harshman said...

You have managed to confuse a nested hierarchy with a ladder.

robert d said...

I apologize. I misspelled Pauling. I also, as pointed out, exaggerated. However, I was vaguely recalling a paper by Pauling where he states that while the discovery of the genetic code

4 3 20 +20,000

was interesting, the real enigma was how proteins are arranged to form cells.

John Harshman said...

Yeah, and those grapes are probably sour anyway.

Joe Felsenstein said...

I'm interested to hear that the genetic code is (4 x 3 x 20) + 20,000 = 20,240. Or perhaps that the genetic code is 4,320 + 20,000 = 24,320.

But anyway we know that how proteins are arranged to form cells is not the real enigma, the real enigma is how cells react to each other. And so on.

Gnomon said...

Our new preprint showed that, contrary to naïve expectations, observed amino acid variants in slow evolving proteins were enriched in conservative changes and were more neutral, lending further support to our slow clock method of phylogenetic inference, which led to the Out of East Asia model of modern humans.

Wang, M., Wang, D., Yu, J., and Huang, S. (2019) Enrichment in conservative amino acid changes among fixed and standing missense variations in slow evolving proteins. https://www.biorxiv.org/content/10.1101/644666v1

John Harshman said...

Conservative amino acid changes being more common would actually fit my naive expectations. So is this all part of your attempt to show that modern humans originated in China? It's all a nationalist message?

Gnomon said...

Please cut the politics nonsense. Perhaps you are a genius with a much better understanding of things than the field and you can show us some track record where you have expressed your expectations. Truth is that we have seen in the past two years comments by numerous reviewers on our human origins paper that used missense SNPs from slow evolving genes. Most reviewers expressed a common concern on the neutrality of these SNPs. Here is one typical example. Reviewer 1: "The authors suggest that an analysis based on arguments derived from the neutral theory should be confined to non-synonymous sites in slowly evolving genes but these sites are exactly those that traditionally are deemed least likely to be governed by neutral processes."

John Harshman said...

It's all a matter of what you're comparing it to. Non-synonymous sites in slowly evolving genes should be evolving less neutrally than most other sites in most other sequences, but the changes that do occur in those genes should be closer to neutrality than the changes that don't. All that should indeed be obvious, and I don't think you would find anyone to doubt it.

Why use these sites rather than, say, introns?

Mikkel Rumraket Rasmussen said...

"Perhaps you are a genius with a much better understanding of things than the field and you can show us some track record where you have expressed your expectations."

I'm a total nobody with no formal training in evolutionary biology, and even I would have expected that. I would naively predict more conservative amino acid substitutions to have smaller functional consequences, hence smaller fitness effects, so if most mutations are deleterious, conservative changes should be more likely to slip through purifying selection than more radical ones. It seems to me this follows straightforwardly. I'd predict an overabundance of conservative changes for fast-evolving proteins too.

Gnomon said...

That is not the point. Of course all agree that observed changes are more neutral than the changes that do not occur. But I have not seen any paper other than our own that suggests that observed changes in slow evolving genes are more neutral than those in fast evolving genes. In fact, the field generally believes that stronger purifying selection applies to all sites in slow evolving genes, including those with observed changes, relative to those in fast evolving genes. Here is one example, He et al.: "However, from the viewpoint of purifying selection, this data processing method (using slow evolving genes) leads to genes/sites under neutrality being excluded and genes/sites under strong impacts of purifying selection being retained."

He, C., Liang, D., and Zhang, P. (2019) Evaluating the Impact of Purifying Selection on Species-level Molecular Dating. bioRxiv https://doi.org/10.1101/622209

Gnomon said...

Introns are not used because they show too little conservation for alignment to be possible or informative. It is plainly obvious that the field has not advocated using the slowest evolving genes in phylogenetic inferences on the grounds that the observed changes in those genes are more neutral. So long as a protein can produce informative alignments among the species concerned, the field will use it, and no one but us is calling for selecting only the slowest evolving among proteins that can give informative alignments. From the mutation saturation point of view, rather than neutrality, the field does prefer slower evolving genes, but it has yet to realize that the vast majority of genes are at maximum/optimum mutation saturation.

John Harshman said...

Interesting, then, that I have always used introns in my phylogenetic research, and I have little trouble aligning them. There's literature showing that introns have more phylogenetic information than exons.

And that should be fairly obvious too. If your proteins are only experiencing neutral changes at a few sites, those should become saturated fairly quickly, losing all information.

Gnomon said...

Our new results demonstrate strong evidence for mutation saturation in human genetic diversity. Racial groups are thought to have no genetic basis, as indicated by common sharing of genetic variations. But contrary to assumptions by the field, our new results show that such sharing is in fact due to parallel mutations rather than a very recent common ancestor and admixture.

For the data figure, see my tweet: https://twitter.com/shi_huang5