Tuesday, December 06, 2016

How many proteins in the human proteome?

Humans have about 25,000 genes. About 20,000 of these genes are protein-coding genes.1 That means, of course, that humans make at least 20,000 proteins. Not all of them are different since the number of protein-coding genes includes many duplicated genes and gene families. We would like to know how many different proteins there are in the human proteome.

The latest issue of Science contains an insert with a chart of the human proteome produced by The Human Protein Atlas. Publication was timed to correspond with release of a new version of the Cell Atlas at the American Society of Cell Biology meeting in San Francisco. The Cell Atlas maps the location of about 12,000 proteins in various tissues and organs. Mapping is done primarily by looking at whether or not a gene is transcribed in a given tissue.

A total of 7367 genes (60%) are expressed in all tissues. These "housekeeping" genes correspond to the major metabolic pathways and the gene expression pathway (e.g. RNA polymerase subunits, ribosomal proteins, DNA replication proteins). Most of the remaining genes are tissue-specific or developmentally specific.

This is all very interesting but it doesn't answer the most important question: how many different proteins are there? The question has two parts: (1) how many of the putative protein-coding genes actually encode a biologically functional protein, and (2) how many genes produce different versions of a protein through mechanisms such as alternative splicing?

The Human Protein Atlas is silent on both these issues so we need to look elsewhere.

One of the editors at Science, Sean Sanders, agrees about the importance of these questions since he introduces the chart like this,
Our DNA might provide the blueprint for how to build our bodies, but it is the proteins that really do the heavy lifting. While there are around 20,000 genes encoded in our DNA, the total number of proteins is estimated to be many times more—possibly as many as a million. This is because a single gene might produce multiple variants of a particular protein through, for example, alternative splicing of the messenger RNA. Posttranslational modification of the nascent protein, such as phosphorylation and glycosylation, may also significantly or subtly change its function, yielding many possible protein variants.
I think this is very misleading. I think most protein-coding genes produce only a single functional polypeptide and that single polypeptide is modified post-translationally in only a limited number of ways to produce just one, or a very few, functional variants.

Let's not quibble about post-translational modifications. Let's concentrate on the total number of different polypeptides produced by protein-coding genes. Speculation about the role of alternative splicing has been rampant in the scientific literature for more than twenty years. The standard myth is that humans make many different polypeptides (proteins) from each gene due to alternative splicing. Speculations range from about 100,000 to almost one million.

I call these "speculations" because that's what they are. There's no data to support such claims. They are often just wishful thinking based on The Deflated Ego Problem. That's the problem created when human exceptionalists realized that humans had about the same number of genes as "lower" organisms such as fruit flies and plants. These workers attempted to salvage their deflated egos by proposing all kinds of workarounds to compensate for the low number of genes and explain why humans could be so much more complex with just 25,000 genes. One of those excuses is alternative splicing.

One of my recent projects has been to research this issue to see what the data actually says. I still live in a fact-based world so facts are important to me. The results will be in Chapter 3 of a book I'm working on. Here's the summary ...

How many different proteins?

There have been many studies of the human proteome based largely on mass spec analyses of different tissues [see How many different proteins are made in a typical human cell? and How many proteins do humans make?]. The results of these studies don't agree on the number of proteins. That's because the techniques are difficult. There are many examples of predicted proteins that don't actually exist (false positives) and real proteins that aren't detected (false negatives) (Paik et al., 2016).

The HUPO Human Proteome Project (HPP) attempts to collate and analyze all the data and create a well-supported database of all human proteins. The latest version of this database has 19,467 predicted protein-coding genes, of which 16,518 are supported by solid evidence ("confident protein identifications"). That leaves 2,949 "missing" proteins (Omenn et al., 2016). These missing proteins are likely to be proteins produced transiently during development or proteins restricted to cells that weren't analyzed in the mass spec studies.
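The arithmetic behind the "missing" proteins is easy to check. The short sketch below uses only the two counts quoted above from Omenn et al. (2016); the variable names are just illustrative.

```python
# Counts quoted above from the HPP metrics (Omenn et al., 2016).
predicted_genes = 19_467   # predicted protein-coding genes
confident = 16_518         # genes with confident protein identifications

# "Missing" proteins are predicted genes without confident protein evidence.
missing = predicted_genes - confident
fraction_confirmed = confident / predicted_genes

print(f"missing proteins: {missing}")
print(f"fraction confirmed: {fraction_confirmed:.1%}")
```

In other words, roughly 85% of the predicted protein-coding genes already have solid protein-level support.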

Other databases tend to have fewer proteins; for example, the Human Protein Atlas only looked at 12,000 protein-coding genes. I think we can be confident there are somewhere between 19,000 and 20,000 protein-coding genes that actually produce functional polypeptides.

Splice variants

It's very difficult to detect most of the proteins predicted by looking at splice variants. That's because proteome analyses can only pick up a variant protein if it includes a new exon or modifies the reading frame. However, there are thousands of predictions of this sort in the splice variant databases—they haven't been detected (with minor exceptions). Is it time to conclude that they haven't been detected because they don't exist?
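Here is a toy illustration of why detection depends on variant-specific sequence. Mass spec identifies peptides, so a variant can only be told apart from the main isoform if it yields at least one peptide the main isoform doesn't. The sequences below are invented for the example, and the digestion rule is a deliberate simplification (cleave after every K or R, ignoring the proline rule and missed cleavages); it is a sketch, not how any particular proteomics pipeline works.

```python
import re

def tryptic_peptides(seq, min_len=5):
    """Simplified in-silico trypsin digest: cleave after every K or R."""
    pieces = re.findall(r"[^KR]*[KR]|[^KR]+$", seq)
    # Very short peptides are rarely informative, so drop them.
    return {p for p in pieces if len(p) >= min_len}

# Hypothetical exon-encoded segments (made up for illustration).
exon_a = "MSTAVLEKGDF"
extra_exon = "TPLLQRAEEW"   # present only in the alternative variant
exon_b = "SVNDKLLTEGR"

main_isoform = exon_a + exon_b
alt_isoform = exon_a + extra_exon + exon_b

main_peps = tryptic_peptides(main_isoform)
alt_peps = tryptic_peptides(alt_isoform)

shared = main_peps & alt_peps     # cannot distinguish the two isoforms
alt_only = alt_peps - main_peps   # the only evidence for the variant

print("shared:", sorted(shared))
print("variant-specific:", sorted(alt_only))
```

Even in this toy case, most of the signal is shared. Evidence for the variant rests entirely on the peptides that span the extra exon.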

Ongoing curation of the human genome reference sequence has resulted in eliminating most of the splice variants for most genes. The curators have concluded that these variants probably represent splicing errors and not true alternative splicing. The latest version of RefSeq, for example, lists only one or two splice variants per gene. It predicts about 40,000 different proteins from 20,000 genes. GENCODE predicts 80,000 different proteins due to possible alternative splicing. These predictions are far below the most optimistic speculations of the past.
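To put these numbers side by side, here is a quick calculation of proteins per gene implied by each figure. It uses the round numbers from this post (20,000 genes, the RefSeq and GENCODE predictions, and the 100,000 to one million range of speculations); the labels are only illustrative.

```python
genes = 20_000  # approximate number of protein-coding genes

# Predicted or speculated totals of different proteins, as quoted in this post.
totals = {
    "RefSeq": 40_000,
    "GENCODE": 80_000,
    "speculation (low)": 100_000,
    "speculation (high)": 1_000_000,
}

# Average number of different proteins per gene implied by each total.
per_gene = {name: total / genes for name, total in totals.items()}

for name, ratio in per_gene.items():
    print(f"{name}: {ratio:g} proteins per gene")
```

The curated databases imply an average of two to four proteins per gene, while the high-end speculation would require fifty.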

Most of these predictions have not been confirmed by actually detecting a polypeptide variant being produced by alternative splicing. When you look closely at individual genes you quickly see that most of these predictions don't make any sense. When protein structural biologists analyze these predictions, they usually conclude that the predicted proteins are not functional, even if they exist. Here's what a group of structural biologists concluded when they examined the predictions made by the ENCODE pilot study back in 2007. Tress et al. conclude,
Alternative premessenger RNA splicing enables genes to generate more than one gene product. Splicing events that occur within protein coding regions have the potential to alter the biological function of the expressed protein and even to create new protein functions. Alternative splicing has been suggested as one explanation for the discrepancy between the number of human genes and functional complexity. Here, we carry out a detailed study of the alternatively spliced gene products annotated in the ENCODE pilot project. We find that alternative splicing in human genes is more frequent than has commonly been suggested, and we demonstrate that many of the potential alternative gene products will have markedly different structure and function from their constitutively spliced counterparts. For the vast majority of these alternative isoforms, little evidence exists to suggest they have a role as functional proteins, and it seems unlikely that the spectrum of conventional enzymatic or structural functions can be substantially extended through alternative splicing.
Keep in mind that the predicted proteins have not been detected. You can't disprove alternative splicing on the basis of an absence of evidence, so the best you can do is to apply common sense, as Tress et al. (2007) are doing. There's no evidence that the human genome produces 100,000 different polypeptides due to alternative splicing and plenty of evidence suggesting this is unlikely to be true.

Michael Tress and his colleagues followed up this study by looking at more recent predictions (Tress et al., 2016). They conclude,
Alternative splicing is commonly believed to be a major source of cellular protein diversity. However, although many thousands of alternatively spliced transcripts are routinely detected in RNA-seq studies, reliable large-scale mass spectrometry-based proteomics analyses identify only a small fraction of annotated alternative isoforms. The clearest finding from proteomics experiments is that most human genes have a single main protein isoform, while those alternative isoforms that are identified tend to be the most biologically plausible: those with the most cross-species conservation and those that do not compromise functional domains. Indeed, most alternative exons do not seem to be under selective pressure, suggesting that a large majority of predicted alternative transcripts may not even be translated into proteins.
This is not an isolated example. There's a growing consensus among experts that most splice variants are due to splicing errors and true alternative splicing is not widespread. There's no solid evidence that humans make 100,000 or even 40,000 different polypeptides from only 20,000 protein-coding genes. It's time to stop spreading myths about alternative splicing and time to start dealing with facts and evidence.

1. This is a ball-park figure. The actual number of proven protein-coding genes is closer to 19,000.

Omenn, G.S., Lane, L., Lundberg, E.K., Beavis, R.C., Overall, C.M., and Deutsch, E.W. (2016) Metrics for the Human Proteome Project 2016: Progress on identifying and characterizing the human proteome, including post-translational modifications. Journal of Proteome Research, 15:3951-3960. [doi: 10.1021/acs.jproteome.6b00511]

Paik, Y.-K., Overall, C.M., Deutsch, E.W., Hancock, W.S., and Omenn, G.S. (2016) Progress in the Chromosome-Centric Human Proteome Project as Highlighted in the Annual Special Issue IV. Journal of Proteome Research, 15:3945-3950. [doi: 10.1021/acs.jproteome.6b00803]

Tress, M.L., Martelli, P.L., Frankish, A., Reeves, G.A., Wesselink, J.J., Yeats, C., Ísólfur Ólason, P., Albrecht, M., Hegyi, H., and Giorgetti, A. (2007) The implications of alternative splicing in the ENCODE protein complement. Proceedings of the National Academy of Sciences, 104:5495-5500. [doi: 10.1073/pnas.0700800104]

Tress, M.L., Abascal, F., and Valencia, A. (2016) Alternative Splicing May Not Be the Key to Proteome Complexity. Trends in Biochemical Sciences [doi: 10.1016/j.tibs.2016.08.008]


Anonymous said...

There are 1.5-2 million proteins produced if you include antibodies and T-cell receptors

Larry Moran said...

Good quibble!

Anonymous said...

Whether it's a quibble depends on the context of the discussion. If one's discussing the number of proteins required to make a human, it's a quibble.

I'm surprised the people making the 'lotsa proteins' argument haven't brought up the topic of antibodies as a way to make it seem more plausible.

Federico Abascal said...

You can think also of post-translational modifications, protein-protein interaction, etc... all that increases the complexity of the proteome too.

Very interesting article, I can't avoid to like it :-)

Eric said...

My gut instinct is that splice variants are probably the result of something similar to low copy number RNA transcripts. As Larry has mentioned elsewhere, biology is messy. No protein has 100% specificity, meaning that splice variants, like off-target transcription, are going to happen. To use an analogy, every factory throws out a certain percentage of their product due to manufacturing errors. The null hypothesis should be errors in mRNA maturation, but we all know that some scientists are more than willing to ditch the null hypothesis.

Chas Peterson said...

Thanks for an interesting and informative article. Your view makes more sense to me.
These big gee-whiz numerical factoids take on a life of their own far too often. Another I saw debunked recently was the idea that human bodies contain 100 trillion cells! And 10 times that many bacterial cells!!! Careful consideration of the data suggests 30 trillion human cells (90% of them erythrocytes!) and 30-50 trillion bacteria. "The numbers are similar enough that each defecation event may flip the ratio to favor human cells over bacteria."

anonymous said...

On the subject of post-translational modification not being a "quibble" has a somewhat disingenuous graphic which some present may find provocative.

Unknown said...

The GRCh38 GTF file has over 152,000 "protein_coding" transcripts annotated.... it can't all be bullshit... 19-20k is deff not the answer to "how many proteins can the human genome make?"

Larry Moran said...

152,000 protein-coding mRNAs is definitely bullshit. The whole point of this post was to present the evidence for one protein per gene for most of the 20,000 protein-coding genes.

The idea that each gene can make several different proteins by alternative splicing is a myth without any substantive supporting evidence.

Michael Tress said...

There are slightly more than 137,000 predicted proteins translated from these mRNAs, if you combine RefSeq and Ensembl/GENCODE. BUT the real number of different proteins is substantially lower because (a) Ensembl/GENCODE's set includes many unfinished fragments that if extended would be sequence identical and (b) RefSeq annotates in common SNVs, which makes many of the sequences slightly different from the equivalent Ensembl/GENCODE sequence, when in reality it is the same protein. Also more than 45,000 sequences from RefSeq are automatic predictions, often without supporting evidence. So, there are a lot fewer than 152,000 predicted proteins in the human genome. And it is quite clear that most coding genes have a single important protein isoform (Ezkurdia et al, 2015, Most highly expressed protein-coding genes have a single dominant isoform).

How many alternative transcripts are translated into functionally important proteins? Well, we don't know, but there are indications that it is not many more than 10,000, and might be fewer. Conserved alternative splice variants are clearly likely to be functionally relevant and we find that 5% of events have a last common ancestor in fish (Martinez-Gomez et al, 2021, The clinical importance of tandem exon duplication-derived substitutions), many of which have arisen from tandem duplications. However, three quarters of alternative transcripts are primate-derived (Rodriguez et al, 2020, An analysis of tissue-specific alternative splicing at the protein level).

Lack of conservation does not preclude functional importance, of course, but there is little evidence to support that these splice variants have a role, even if translated. For example, a large number of these primate exons are derived from SINE Alu regions. While we found that some of these alternative exons are translated and had even been incorporated into the main isoform of a handful of genes (Martinez-Gomez et al, 2020, Few SINEs of life: Alu elements have little evidence for biological relevance despite elevated translation), we did not find any evidence of purifying selection even for the few cases with peptide support. So, the vast majority of annotated alternative transcripts are unlikely to have cellular roles.