Sandwalk: How many proteins in the human proteome?

Tuesday, December 06, 2016

How many proteins in the human proteome?

Humans have about 25,000 genes. About 20,000 of these genes are protein-coding genes.¹ That means, of course, that humans make at least 20,000 proteins. Not all of them are different since the number of protein-coding genes includes many duplicated genes and gene families. We would like to know how many different proteins there are in the human proteome.

The latest issue of Science contains an insert with a chart of the human proteome produced by The Human Protein Atlas. Publication was timed to correspond with release of a new version of the Cell Atlas at the American Society for Cell Biology meeting in San Francisco. The Cell Atlas maps the location of about 12,000 proteins in various tissues and organs. Mapping is done primarily by looking at whether or not a gene is transcribed in a given tissue.

A total of 7367 genes (60%) are expressed in all tissues. These "housekeeping" genes correspond to the major metabolic pathways and the gene expression pathway (e.g. RNA polymerase subunits, ribosomal proteins, DNA replication proteins). Most of the remaining genes are tissue-specific or developmentally specific.

This is all very interesting but it doesn't answer the most important question; namely, how many different proteins are there? The question is important for two reasons: (1) we need to know how many of the putative protein-coding genes actually encode a biologically functional protein, and (2) we need to know how many genes produce different versions of a protein through mechanisms such as alternative splicing.

The Human Protein Atlas is silent on both these issues so we need to look elsewhere.

One of the editors at Science, Sean Sanders, agrees about the importance of these questions since he introduces the chart like this,

Our DNA might provide the blueprint for how to build our bodies, but it is the proteins that really do the heavy lifting. While there are around 20,000 genes encoded in our DNA, the total number of proteins is estimated to be many times more—possibly as many as a million. This is because a single gene might produce multiple variants of a particular protein through, for example, alternative splicing of the messenger RNA. Posttranslational modification of the nascent protein, such as phosphorylation and glycosylation, may also significantly or subtly change its function, yielding many possible protein variants.

I think this is very misleading. I think most protein-coding genes produce only a single functional polypeptide and that single polypeptide is modified post-translationally in only a limited number of ways to produce just one, or a very few, functional variants.

Let's not quibble about post-translational modifications. Let's concentrate on the total number of different polypeptides produced by protein-coding genes. Speculation about the role of alternative splicing has been rampant in the scientific literature for more than twenty years. The standard myth is that humans make many different polypeptides (proteins) from each gene due to alternative splicing. Speculations range from about 100,000 to almost one million.

I call these "speculations" because that's what they are. There's no data to support such claims. They are often just wishful thinking based on The Deflated Ego Problem. That's the problem created when human exceptionalists realized that humans had about the same number of genes as "lower" organisms such as fruit flies and plants. These workers attempted to salvage their deflated egos by proposing all kinds of workarounds to compensate for the low number of genes and explain why humans could be so much more complex with just 25,000 genes. One of those excuses is alternative splicing.

One of my recent projects has been to research this issue to see what the data actually says. I still live in a fact-based world so facts are important to me. The results will be in Chapter 3 of a book I'm working on. Here's the summary ...

How many different proteins?

There have been many studies of the human proteome based largely on mass spec analyses of different tissues [see How many different proteins are made in a typical human cell? and How many proteins do humans make?]. The results of these studies don't agree on the number of proteins. That's because the techniques are difficult. There are many examples of predicted proteins that don't actually exist (false positives) and real proteins that aren't detected (false negatives) (Paik et al., 2016).

The HUPO Human Proteome Project (HPP) attempts to collate and analyze all the data and create a well-supported database of all human proteins. The latest version of this database has 19,467 predicted protein-coding genes, of which 16,518 are supported by solid evidence ("confident protein identifications"). That leaves 2,949 "missing" proteins (Omenn et al., 2016). These missing proteins are likely to be proteins produced transiently during development or proteins restricted to cells that weren't analyzed in the mass spec studies.
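
For readers who want to check the arithmetic, here is a minimal Python sketch using only the two HPP figures quoted above. The variable names are my own, chosen for illustration; nothing here comes from the HPP database itself.

```python
# Figures quoted above from the HPP metrics (Omenn et al., 2016).
predicted_genes = 19_467     # predicted protein-coding genes
confident_proteins = 16_518  # "confident protein identifications"

# Proteins predicted from the genome but not yet confidently detected.
missing_proteins = predicted_genes - confident_proteins

# Share of predicted protein-coding genes backed by solid protein evidence.
confirmed_fraction = confident_proteins / predicted_genes

print(f"missing proteins: {missing_proteins:,}")
print(f"confirmed: {confirmed_fraction:.1%}")
```

The subtraction gives the 2,949 "missing" proteins mentioned above, and the ratio shows that roughly 85% of the predicted protein-coding genes already have a confidently identified protein.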

Other databases tend to have fewer proteins; for example, the Human Protein Atlas only looked at 12,000 protein-coding genes. I think we can be confident there are somewhere between 19,000 and 20,000 protein-coding genes that actually produce functional polypeptides.

Splice variants

It's very difficult to detect most of the proteins predicted by looking at splice variants. That's because proteome analyses can only pick up a variant protein if it includes a new exon or modifies the reading frame. However, there are thousands of predictions of this sort in the splice variant databases—they haven't been detected (with minor exceptions). Is it time to conclude that they haven't been detected because they don't exist?

Ongoing curation of the human genome reference sequence has resulted in eliminating most of the splice variants for most genes. The curators have concluded that these variants probably represent splicing errors and not true alternative splicing. The latest version of RefSeq, for example, lists only one or two splice variants per gene. It predicts about 40,000 different proteins from 20,000 genes. GENCODE predicts 80,000 different proteins due to possible alternative splicing. These predictions are far below the most optimistic speculations of the past.
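
To put these numbers on a common scale, the short Python sketch below converts each total into an average number of proteins per gene. It assumes a round figure of 20,000 protein-coding genes and uses the totals quoted in this post; the dictionary labels and the helper function name are hypothetical, chosen only for illustration.

```python
GENES = 20_000  # approximate number of protein-coding genes (round figure)

# Total numbers of different proteins, as quoted in this post.
predicted_proteins = {
    "RefSeq": 40_000,
    "GENCODE": 80_000,
    "speculation (low end)": 100_000,
    "speculation (high end)": 1_000_000,
}

def proteins_per_gene(total_proteins, genes=GENES):
    """Average number of different proteins per protein-coding gene."""
    return total_proteins / genes

for source, total in predicted_proteins.items():
    print(f"{source}: {proteins_per_gene(total):.0f} proteins per gene")
```

On these assumptions, RefSeq works out to about 2 proteins per gene and GENCODE to about 4, compared with 5 to 50 per gene for the older speculations.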

Most of these predictions have not been confirmed by actually detecting a polypeptide variant being produced by alternative splicing. When you look closely at individual genes you quickly see that most of these predictions don't make any sense. When protein structural biologists analyze these predictions, they usually conclude that the predicted proteins are not functional, even if they exist. Here's what a group of structural biologists concluded when they examined the predictions made by the ENCODE pilot study back in 2007. Tress et al. conclude,

Alternative premessenger RNA splicing enables genes to generate more than one gene product. Splicing events that occur within protein coding regions have the potential to alter the biological function of the expressed protein and even to create new protein functions. Alternative splicing has been suggested as one explanation for the discrepancy between the number of human genes and functional complexity. Here, we carry out a detailed study of the alternatively spliced gene products annotated in the ENCODE pilot project. We find that alternative splicing in human genes is more frequent than has commonly been suggested, and we demonstrate that many of the potential alternative gene products will have markedly different structure and function from their constitutively spliced counterparts. For the vast majority of these alternative isoforms, little evidence exists to suggest they have a role as functional proteins, and it seems unlikely that the spectrum of conventional enzymatic or structural functions can be substantially extended through alternative splicing.

Keep in mind that the predicted proteins have not been detected. You can't disprove alternative splicing based on the absence of evidence, so the best you can do is to apply common sense as Tress et al. (2007) are doing. There's no evidence that the human genome produces 100,000 different polypeptides due to alternative splicing and plenty of evidence suggesting this is unlikely to be true.

Michael Tress and his colleagues followed up this study by looking at more recent predictions (Tress et al., 2016). They conclude,

Alternative splicing is commonly believed to be a major source of cellular protein diversity. However, although many thousands of alternatively spliced transcripts are routinely detected in RNA-seq studies, reliable large-scale mass spectrometry-based proteomics analyses identify only a small fraction of annotated alternative isoforms. The clearest finding from proteomics experiments is that most human genes have a single main protein isoform, while those alternative isoforms that are identified tend to be the most biologically plausible: those with the most cross-species conservation and those that do not compromise functional domains. Indeed, most alternative exons do not seem to be under selective pressure, suggesting that a large majority of predicted alternative transcripts may not even be translated into proteins.

This is not an isolated example. There's a growing consensus among experts that most splice variants are due to splicing errors and true alternative splicing is not widespread. There's no solid evidence that humans make 100,000 or even 40,000 different polypeptides from only 20,000 protein-coding genes. It's time to stop spreading myths about alternative splicing and time to start dealing with facts and evidence.

1. This is a ball-park figure. The actual number of proven protein-coding genes is closer to 19,000.

Omenn, G.S., Lane, L., Lundberg, E.K., Beavis, R.C., Overall, C.M., and Deutsch, E.W. (2016) Metrics for the Human Proteome Project 2016: Progress on identifying and characterizing the human proteome, including post-translational modifications. Journal of Proteome Research, 15:3951-3960. [doi: 10.1021/acs.jproteome.6b00511]

Paik, Y.-K., Overall, C.M., Deutsch, E.W., Hancock, W.S., and Omenn, G.S. (2016) Progress in the Chromosome-Centric Human Proteome Project as Highlighted in the Annual Special Issue IV. Journal of Proteome Research, 15:3945-3950. [doi: 10.1021/acs.jproteome.6b00803]

Tress, M.L., Martelli, P.L., Frankish, A., Reeves, G.A., Wesselink, J.J., Yeats, C., Ísólfur Ólason, P., Albrecht, M., Hegyi, H., and Giorgetti, A. (2007) The implications of alternative splicing in the ENCODE protein complement. Proceedings of the National Academy of Sciences, 104:5495-5500. [doi: 10.1073/pnas.0700800104]

Tress, M.L., Abascal, F., and Valencia, A. (2016) Alternative Splicing May Not Be the Key to Proteome Complexity. Trends in Biochemical Sciences [doi: 10.1016/j.tibs.2016.08.008]

10 comments :

Anonymous said...: There are 1.5-2 million proteins produced if you include antibodies and T-cell receptors; Tuesday, December 06, 2016 2:59:00 PM
Larry Moran said...: Good quibble!; Tuesday, December 06, 2016 3:03:00 PM
Anonymous said...: Whether it's a quibble depends on the context of the discussion. If one's discussing the number of proteins required to make a human it's a quibble.

I'm surprised the people making the 'lotsa proteins' argument haven't brought up the topic of antibodies as a way to make it seem more plausible.; Tuesday, December 06, 2016 3:28:00 PM
Federico Abascal said...: You can think also of post-translational modifications, protein-protein interaction, etc... all that increases the complexity of the proteome too.

Very interesting article, I can't avoid to like it :-); Wednesday, December 07, 2016 5:15:00 AM
Eric said...: My gut instinct is that splice variants are probably the result of something similar to low copy number RNA transcripts. As Larry has mentioned elsewhere, biology is messy. No protein has 100% specificity, meaning that splice variants, like off-target transcription, are going to happen. To use an analogy, every factory throws out a certain percentage of their product due to manufacturing errors. The null hypothesis should be errors in mRNA maturation, but we all know that some scientists are more than willing to ditch the null hypothesis.; Thursday, December 08, 2016 1:20:00 PM
Chas Peterson said...: Thanks for an interesting and informative article. Your view makes more sense to me.
These big gee-whiz numerical factoids take on a life of their own far too often. Another I saw debunked recently was the idea that human bodies contain 100 trillion cells! And 10 times that many bacterial cells!!! Careful consideration of the data suggests 30 trillion human cells (90% of them erythrocytes!) and 30-50 trillion bacteria. "The numbers are similar enough that each defecation event may flip the ratio to favor human cells over bacteria."
http://www.sciencedirect.com/science/article/pii/S0092867416000532; Friday, December 09, 2016 6:38:00 PM
anonymous said...: On the subject of post-translational modification not being a "quibble"

http://tinyurl.com/zus35tm has a somewhat disingenuous graphic which some present may find provocative.; Monday, December 12, 2016 9:05:00 AM
Unknown said...: grch38 gtf file has over 152,000 "protien_coding" transcripts annotated.... it cant all be bullshit... 19-20k is deff not the answer to "how many proteins can the human genome make?"; Friday, July 30, 2021 2:40:00 PM
Larry Moran said...: 152,000 protein-coding mRNAs is definitely bullshit. The whole point of this post was to present the evidence for one protein per gene for most of the 20,000 protein-coding genes.

The idea that each gene can make several different proteins by alternative splicing is a myth without any substantive supporting evidence.; Friday, July 30, 2021 6:06:00 PM
Michael Tress said...: There are slightly more than 137,000 predicted proteins translated from these mRNAs, if you combine RefSeq and Ensembl/GENCODE. BUT the real number of different proteins is substantially lower because (a) Ensembl/GENCODE's set includes many unfinished fragments that if extended would be sequence identical and (b) RefSeq annotates in common SNVs, which makes many of the sequences slightly different from the equivalent Ensembl/GENCODE sequence, when in reality it is the same protein. Also more than 45,000 sequences from RefSeq are automatic predictions, often without supporting evidence. So, there are a lot fewer than 152,000 predicted proteins in the human genome. And it is quite clear that most coding genes have a single important protein isoform (Ezkurdia et al, 2015, Most highly expressed protein-coding genes have a single dominant isoform)

How many alternative transcripts are translated into functionally important proteins? Well, we don't know, but there are indications that it is not many more than 10,000, and might be fewer. Conserved alternative splice variants are clearly likely to be functionally relevant and we find that 5% of events have a last common ancestor in fish (Martinez-Gomez et al, 2021, The clinical importance of tandem exon duplication-derived substitutions), many of which have arisen from tandem duplications. However, three quarters of alternative transcripts are primate-derived (Rodriguez et al, 2020, An analysis of tissue-specific alternative splicing at the protein level).

Lack of conservation does not preclude functional importance, of course, but there is little evidence to support that these splice variants have a role, even if translated. For example, a large number of these primate exons are derived from SINE Alu regions. While we found that some of these alternative exons are translated and had even been incorporated into the main isoform of a handful of genes (Martinez-Gomez et al, 2020, Few SINEs of life: Alu elements have little evidence for biological relevance despite elevated translation), we did not find any evidence of purifying selection even for the few cases with peptide support. So, the vast majority of annotated alternative transcripts are unlikely to have cellular roles.; Wednesday, August 04, 2021 12:01:00 PM
