This jumble of different kinds of genes makes it difficult to estimate the total number of genes in the human genome. The current estimates are about 20,000 protein-coding genes and about 5,000 genes for functional RNAs.
Aside from the obvious highly conserved genes for ubiquitous RNAs (rRNA, tRNAs etc.), protein-coding genes are the easiest to recognize from looking at a genome sequence. If the protein is expressed in many different species then the exon sequences will be conserved and it's easy for a computer program to identify the gene. The tough part comes when the algorithm predicts a new protein-coding gene based on an open reading frame spanning several presumed exons. Is it a real gene?
There are about 20,000 potential protein-coding genes in the human genome. Proteins for 85% of these genes have been detected by various assays. Over the years, the number of protein-coding genes in the human genome has dropped from the initial reports of about 30,000 to about 20,000. All these estimates were in line with the predictions of knowledgeable experts based on data going back to the 1960s.1
This drop in the number of predicted genes is fairly typical of genome studies. The gene-finding algorithms tend to over-predict the number of protein-coding genes and subsequent annotation leads to a "finished" genome sequence that eliminates all the false positives. This is one reason why you shouldn't trust the initial estimates of gene number in newly published genomes. Most of them will never be "finished" so the number of genes will always be inflated by a large number of false positives. Many of these unconfirmed genes will be ORFans—possible genes that are unique to that species. Don't get excited about ORFans unless you are dealing with a finished, well-annotated, genome.
The most obvious way of confirming a potential protein-coding gene is to find and confirm synthesis of a functional protein. Hundreds of human genes have been intensely studied over the years and in those cases there's no doubt at all about their existence. Several thousand more genes have been matched to proteins in various ways.
That still left thousands of putative genes with no evidence that they actually make a protein.
Advancing technology in the past 15 years has made it possible to identify individual proteins even though they may be present in just a few copies per cell. The technique relies on mass-spectrometry and the existence of a well-annotated genome sequence [Genomics, Proteomics and Mass Spectrometry] [Biochemistry and Mass Spectrometry].
Here's a brief description of how it works. Take any tissue or group of cells and isolate all the proteins. Digest them with a protease, an enzyme that chops proteins into small pieces by cutting at specific sites. Trypsin is commonly used in these studies; it catalyzes cleavage of the peptide bond on the carbonyl side of lysine (K) and arginine (R) residues, producing peptide fragments that end with either lysine or arginine.
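The digestion step can be mimicked on a computer. The following is a minimal sketch of an in-silico trypsin digest, assuming the simplest possible rule: cut after every K or R. The function name `trypsin_digest` and the toy sequence are made up for illustration, and real digestion software usually adds refinements (for example, trypsin generally does not cut when the next residue is proline, and some sites are missed).

```python
def trypsin_digest(protein):
    """Split a protein sequence after every K or R (simplified trypsin rule)."""
    peptides = []
    current = []
    for residue in protein:
        current.append(residue)
        if residue in "KR":
            # Cleavage happens on the carbonyl side of K/R,
            # so the K or R stays at the end of the fragment.
            peptides.append("".join(current))
            current = []
    if current:
        # The C-terminal fragment need not end in K or R.
        peptides.append("".join(current))
    return peptides


# Hypothetical toy sequence, not a real protein.
toy_protein = "MAGKTLRSEVKDW"
fragments = trypsin_digest(toy_protein)
print(fragments)
```

Every fragment except possibly the last one ends in K or R, which is the property the mass-spec search relies on.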
The next step involves "weighing" all of the thousands of peptides in a mass spectrometer. You can visualize the process by imagining that the mixture of peptides is sprayed into one end of a long tube surrounded by an electromagnetic field that makes the charged peptides "fly" to the far end of the tube.2 Small peptides fly faster than larger ones so if you measure the time of flight (TOF) you can calculate the mass of the peptides. That's why it's called "mass spectrometry."
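The link between flight time and mass can be written down for an idealized textbook case. This is only a sketch under stated assumptions, much simpler than any real instrument: an ion of mass $m$ and charge $ze$ is accelerated through a potential difference $U$ and then drifts through a field-free tube of length $L$. The symbols $U$, $L$, $z$ and $t$ are introduced here for illustration.

```latex
\begin{align}
  z e U &= \tfrac{1}{2} m v^{2}
    && \text{(energy gained equals kinetic energy)} \\
  v &= \sqrt{\frac{2 z e U}{m}}
    && \text{(lighter ions move faster)} \\
  t &= \frac{L}{v} = L \sqrt{\frac{m}{2 z e U}}
    && \text{(time of flight)} \\
  \frac{m}{z} &= \frac{2 e U\, t^{2}}{L^{2}}
    && \text{(mass-to-charge ratio from the measured time)}
\end{align}
```

Strictly speaking, what comes out of the measurement is the mass-to-charge ratio, so the mass follows only once the charge of the ion is known.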
At the end of the experiment you have a huge list of the molecular weights of thousands and thousands of individual peptides. Here comes the fun part.
If you look at the coding sequence of a gene you can predict all of the peptides that will be produced by digesting with various proteases. For example, one of the predicted fragments of the human serum albumin protein is FKDLGEENFK (ends in lysine=K). This has an exact molecular mass of 1226.59 so if you find a peptide of that size in your mixture then it probably means that human serum albumin was present. In practice you want to identify several unique peptide masses for each protein just to be sure.
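The mass lookup can be sketched in a few lines. This is a rough illustration, not what the proteomics groups actually run: it sums standard monoisotopic residue masses, adds one water for the free ends of the peptide, and compares the singly protonated value with a list of observed masses. The function names, the 0.05 Da tolerance and the "observed" list are assumptions for illustration; only the peptide FKDLGEENFK and the value 1226.59 come from the text. With these standard masses the neutral peptide comes out near 1225.60 and the singly protonated ion near 1226.61, so the figure quoted above appears to correspond to the protonated ion.

```python
# Standard monoisotopic residue masses in daltons.
MONOISOTOPIC = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O added for the free N- and C-termini
PROTON = 1.00728   # mass of the extra proton on a singly charged ion


def peptide_mass(peptide):
    """Monoisotopic mass of the neutral peptide."""
    return sum(MONOISOTOPIC[aa] for aa in peptide) + WATER


def protonated_mass(peptide):
    """Mass of the singly protonated ion, [M+H]+."""
    return peptide_mass(peptide) + PROTON


def match_peptides(observed, predicted, tolerance=0.05):
    """Pair each observed mass with predicted peptides within the tolerance."""
    hits = []
    for mass in observed:
        for peptide in predicted:
            if abs(protonated_mass(peptide) - mass) <= tolerance:
                hits.append((mass, peptide))
    return hits


# Hypothetical measurement list; only 1226.59 is taken from the text.
observed = [842.51, 1226.59, 2045.03]
predicted = ["FKDLGEENFK"]
hits = match_peptides(observed, predicted)
print(round(peptide_mass("FKDLGEENFK"), 2), hits)
```

A single mass match like this is weak evidence on its own, which is why several unique peptides per protein are wanted before calling a protein present.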
This technique only works if you have a reliable gene sequence and it only works on the genome level if you have a "finished" genome sequence, properly annotated so you can identify all potential reading frames. It only works if you have good computer programs and lots of memory and storage. It only works if you have very good mass spectrometers and experts who know how to use them. We do a simple experiment like this in our undergraduate biochemistry lab (3rd year) using a purified protein from a bacterial species with a small genome. It doesn't always work. That's several orders of magnitude easier than identifying all the proteins in a species with a large complex genome.
All those things can be found at several top-notch locations around the world. Recently two groups have published the results of massive mass-spec analyses of human proteins from many different tissues in order to determine the full extent of the human proteome.
Kim, M.-S., Pinto, S.M., Getnet, D., Nirujogi, R.S., Manda, S.S., Chaerkady, R., Madugundu, A.K., Kelkar, D.S., Isserlin, R., Jain, S., Thomas, J.K., Muthusamy, B., Leal-Rojas, P., Kumar, P., Sahasrabuddhe, N.A., Balakrishnan, L., Advani, J., George, B., Renuse, S., Selvan, L.D.N., Patil, A.H., Nanjappa, V., Radhakrishnan, A., Prasad, S., Subbannayya, T., Raju, R., Kumar, M., Sreenivasamurthy, S.K., Marimuthu, A., Sathe, G.J., Chavan, S., Datta, K.K., Subbannayya, Y., Sahu, A., Yelamanchi, S.D., Jayaram, S., Rajagopalan, P., Sharma, J., Murthy, K.R., Syed, N., Goel, R., Khan, A.A., Ahmad, S., Dey, G., Mudgal, K., Chatterjee, A., Huang, T.-C., Zhong, J., Wu, X., Shaw, P.G., Freed, D., Zahari, M.S., Mukherjee, K.K., Shankar, S., Mahadevan, A., Lam, H., Mitchell, C.J., Shankar, S.K., Satishchandra, P., Schroeder, J.T., Sirdeshmukh, R., Maitra, A., Leach, S.D., Drake, C.G., Halushka, M.K., Prasad, T.S.K., Hruban, R.H., Kerr, C.L., Bader, G.D., Iacobuzio-Donahue, C.A., Gowda, H., and Pandey, A. (2014) A draft map of the human proteome. Nature, 509:575-581. [doi: 10.1038/nature13302]
Wilhelm, M., Schlegl, J., Hahne, H., Gholami, A.M., Lieberenz, M., Savitski, M.M., Ziegler, E., Butzmann, L., Gessulat, S., Marx, H., Mathieson, T., Lemeer, S., Schnatbaum, K., Reimer, U., Wenschuh, H., Mollenhauer, M., Slotta-Huspenina, J., Boese, J.-H., Bantscheff, M., Gerstmair, A., Faerber, F., and Kuster, B. (2014) Mass-spectrometry-based draft of the human proteome. Nature, 509:582-587. [doi: 10.1038/nature13319]
About 2,400 genes are "housekeeping" genes whose protein products are found in all cells. About 1,500 genes were only expressed in one of the 30 tissues and cell types analyzed.
According to Kim et al. there are still several thousand potential protein-coding genes that have not been confirmed by detecting a protein product. In addition, they found 44 new ORFs that have not been annotated in the latest release of the human genome. These are potential new genes but the authors caution that the proteins may not have a function. Only 144 of the roughly 15,000 pseudogenes in the human genome produced a polypeptide. This is not unexpected since recently inactivated genes might still produce nonfunctional proteins or protein fragments. It reminds us that cells can produce junk proteins as well as junk DNA.
Wilhelm et al. estimate that the human genome contains 10,000-12,000 core genes that are ubiquitously expressed (= "housekeeping" genes).
These authors also looked at potential coding regions in long non-coding RNAs (lncRNAs). There are more than 21,000 potentially functional lncRNAs in the human genome although almost nobody believes that they are all functional (but see [3,000 new genes discovered in the human genome - dark matter revealed]). Wilhelm et al. found that 404 of these potential lncRNA genes encode detectable peptides. Here's what they say about them ...
"The biological significance of translated lincRNAs and [other RNAs] is not clear at present. These may constitute proteins 'in evolution' representing hitherto undiscovered biology or arise by stochastic chance marking such proteins as 'biological noise.'"
The results from these two groups indicate that we know almost all the protein-coding genes in the human genome (~20,000) and there are likely to be very few undiscovered protein-coding genes.
Some scientists are skeptical. Ezkurdia et al. (2014) think that both studies over-estimate the number of genes that are expressed. Their arguments are convincing but I don't think it makes a huge difference.
Their new database incorporates the data from the two Nature papers (Kim et al., 2014; Wilhelm et al., 2014) and comes up with different numbers of confirmed proteins. The Human Proteome Project (HPP) is an extension of C-HPP. Their most recent database has 16,491 "confident" proteins based on mass-spec and other types of experiments (Omenn et al., 2015). The pie-chart3 below summarizes the data.
The lead paper in the journal summarizes all of the work and highlights two papers that discuss ways of identifying new genes (Paik et al., 2015). They issue this caution that's undoubtedly directed at workers outside of the C-HPP Consortium.
These two papers provide a thorough examination of the challenges of claiming, confirming, and validating peptide findings and protein matches. We recommend that all investigators scrutinize the discussion sections of these papers and apply the guidelines to their own data sets and other publicly available data sets. Such quality assurance will be subjected to open discussion at the HUPO 2015 Congress in Vancouver. Claims of novel translated products from pseudogenes or long noncoding RNAs require at least as great scrutiny as missing proteins from genes with transcripts or homologies, including use of class-specific FDRs.
They identified proteins from 17,132 genes out of 20,344 (84%) in at least one study but only 13,841 that were confirmed in at least two studies. Of these, 8,874 (44%) were detected in all tissues. These housekeeping genes encode mitochondrial proteins, basic metabolism proteins, structural proteins including many membrane proteins, and proteins required for transcription, translation, DNA replication, and processing/modification.
Once again, the bottom line is that most of the potential protein-coding genes have been confirmed. There's still some doubt about some of the putative protein-coding genes that have no supporting evidence. The human genome contains at least 17,000-18,000 protein-coding genes but probably no more than 21,000.
1. There's a widespread myth going around that the experts were "shocked" to discover only 30,000 genes when the human genome sequence was completed. See False History and the Number of Genes, and Facts and Myths Concerning the Historical Estimates of the Number of Genes in the Human Genome.
2. This "electrospray mass spectrometry" has now been mostly replaced by "matrix-assisted laser desorption ionization" or MALDI. When coupled to time of flight measurements the technique is known as MALDI-TOF. The details of the technique aren't important.
3. Don't forget that my wife's direct ancestor, William Playfair, invented pie charts! [Bar Graphs, Pie Charts, and Darwin]
Ezkurdia, I., Vázquez, J., Valencia, A., and Tress, M. (2014) Analyzing the First Drafts of the Human Proteome. Journal of Proteome Research, 13(8), 3854-3855. [doi: 10.1021/pr500572z]
Omenn, G. S., Lane, L., Lundberg, E. K., Beavis, R. C., Nesvizhskii, A. I., and Deutsch, E. W. (2015) Metrics for the Human Proteome Project 2015: Progress on the Human Proteome and Guidelines for High-Confidence Protein Identification. Journal of Proteome Research, 14(9), 3452-3460. [doi: 10.1021/acs.jproteome.5b00499]
Uhlén, M., Fagerberg, L., Hallström, B. M., Lindskog, C., Oksvold, P., Mardinoglu, A., Sivertsson, Å., Kampf, C., Sjöstedt, E., and Asplund, A. (2015) Tissue-based map of the human proteome. Science, 347(6220), 1260419. [doi: 10.1126/science.1260419]