Monday, February 06, 2012

How Much of Our Genome Is Sequenced?

I'm getting ready for a class on the size and composition of the human genome so I thought I'd check to see the latest estimate of its size. Recall that in an earlier posting I concluded that the size of the human genome was 3,200,000,000 bp (3,200,000 kb, 3,200 Mb, 3.2 Gb) [How Big Is the Human Genome?].

You might think that all you have to do is check out the human genome websites and look up the exact size. That doesn't work because not all of the human genome has been sequenced and organized into a contiguous assembly of 24 different strands (one for each chromosome). So that prompts the question, how much of the human genome has actually been sequenced?1

The latest assembly is GRCh37 Patch Release 7 (GRCh37.p7), released on Feb. 3, 2012. If you look at the data for this assembly you will see an estimate of the "Total Sequenced Bases in the Assembly." The number is 3,173,036,847 bp or 3.17 Gb. This value is close to estimates of the genome size from the years before the first draft of the genome sequence was published.

I was suspicious of this number since we know that there are many gaps in the human genome sequence. The largest gaps cover highly repetitive parts of the genome—mostly around the centromeres and other heterochromatic regions. There were also gaps at the locations of several gene clusters (e.g. ribosomal RNA genes) where it's impossible to determine the exact number of copies. In the case of ribosomal RNA gene clusters, these gaps have now been closed.

Deanna Church posted a few comments on my earlier posting. She's with the Genome Reference Consortium (GRC). That's the group responsible for updating the human genome. Deanna explained that "Total Sequenced Bases in the Assembly" is not an accurate representation of the truth.2 What it actually means is total sequenced bases plus estimated sizes of the gaps. In other words, it's a good estimate of the size of the genome.

So, how much of the genome is actually sequenced and organized into "scaffolds," or contiguous stretches of DNA? You can see the actual numbers by clicking on Ungapped Lengths on the NCBI website.

The total number of sequenced base pairs that have been organized into scaffolds and placed on a particular chromosome is 2,861,332,606 bp. An additional 6,110,758 bp have been sequenced but the blocks of sequence cannot be placed in the assembly. Most of this unassigned sequence is on chromosomes 1, 4, 9, and 17 but some of it can't even be associated with a particular chromosome.

If we assume that the true haploid genome size is 3.2 Gb, or 3,200 Mb, then the sequenced and assigned part of the genome represents 89.4% and the unassigned sequenced part is 0.2%, for a total of 89.6%.
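The arithmetic behind these percentages can be checked with a short sketch. It uses only the figures quoted above; the variable names and the `percent` helper are my own, and the 3.2 Gb denominator is the assumed genome size, not a measured value.

```python
# Figures quoted in the post for GRCh37.p7
placed_bp = 2_861_332_606    # sequenced and placed on a chromosome
unplaced_bp = 6_110_758      # sequenced but not placed in the assembly
genome_bp = 3_200_000_000    # assumed true haploid genome size (3.2 Gb)


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


placed_pct = percent(placed_bp, genome_bp)
unplaced_pct = percent(unplaced_bp, genome_bp)
total_pct = percent(placed_bp + unplaced_bp, genome_bp)

print(f"placed:   {placed_pct:.1f}%")    # 89.4%
print(f"unplaced: {unplaced_pct:.1f}%")  # 0.2%
print(f"total:    {total_pct:.1f}%")     # 89.6%
```

Changing `genome_bp` shows how sensitive the "percent sequenced" figure is to the assumed genome size.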

We can say that only 90% of the human genome has been sequenced and the remaining 10% falls into 357 gaps scattered throughout the genome. (Every chromosome has unsequenced gaps but some have more than others and it doesn't depend on the size of the chromosome.)

The Wellcome Trust Sanger Institute is part of the Genome Reference Consortium but it maintains its own website on the human genome [Whole Genome]. The data on the e!Ensembl page refers to build GRCh37.p5 from Feb. 2009 but it also says the data was updated in Dec. 2011.

According to the Sanger Institute, the size of the sequenced genome is 3,283,984,159 bp and the "golden path length" is 3,101,804,739 bp. I've tried to find out what these numbers mean but if the information is present on the Ensembl website then it's very well hidden.

Are you interested in the number of genes? Here's the data from Ensembl. It indicates that the human genome contains 33,399 genes! [What Is a Gene?] [What is a gene, post-ENCODE?] This value is calculated by including 12,523 genes that make an RNA product that's not translated. It is almost certainly a highly inflated number.

The data indicates that there are 181,744 gene transcripts or between 5 and 9 transcripts per gene depending on how you count the genes. I don't believe there are this many biologically functional transcripts per gene. I think the actual number is much closer to one (1) [Genes and Straw Men].
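Here is a minimal sketch of where the "between 5 and 9" range could come from. It assumes the two ways of counting are (a) all 33,399 genes and (b) the genes left after subtracting the 12,523 untranslated RNA genes. That is my reading of the figures, not something stated on the Ensembl page, and the variable names are my own.

```python
# Figures quoted in the post from Ensembl
total_genes = 33_399
untranslated_rna_genes = 12_523
transcripts = 181_744

# Assumption: the alternative gene count excludes the untranslated RNA genes
remaining_genes = total_genes - untranslated_rna_genes  # 20,876

per_gene_all = transcripts / total_genes
per_gene_remaining = transcripts / remaining_genes

print(f"transcripts per gene (all genes):       {per_gene_all:.1f}")        # 5.4
print(f"transcripts per gene (remaining genes): {per_gene_remaining:.1f}")  # 8.7
```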

1. It certainly doesn't "beg the question." That means something else entirely [Begging the Question].

2. That's a euphemism for "It's a lie!"


  1. I struggle to keep up, but I do appreciate your postings, Prof. Moran. But I have to ask - 24? I thought there were 23?

    (22 pairs of autosomes and one pair of sex chromosomes)

    Sorry if it is a stupid question.

    1. 22 pairs of autosomes and a pair of sex chromosomes. The sequences of the X and Y chromosomes bring it to 24.

    2. 22 autosomes plus two different sex chromosomes (X & Y) equals 24 chromosomes


  2. "The golden path is the length of the reference assembly. It consists of the sum of all top-level sequences in the seq_region table, omitting any redundant regions such as haplotypes and PARs (pseudoautosomal regions)."

    In other words, the golden path length = total sequenced bases plus estimated sizes of the gaps. The term "golden path" comes from restriction mapping and refers to the best assembly of the various restriction fragments in order. Since they quote 3,101,804,739 bp rather than 3,173,036,847 bp, I presume they're using an older genome build.

    I'm not quite sure how they then get up to a total of 3,283,984,159 bp. From the glossary it looks as though the higher total includes duplicated regions like the pseudoautosomal regions of the sex chromosomes, and perhaps other large contigs like MHC haplotypes.

  3. How do you know the "12,523 genes that make an RNA product that's not translated" really aren't translated? This has been bugging me for a while.

  4. There are an awful lot of RNA genes. I guess mice have as many as humans?

  5. So, even after sequencing more than 30,000 human genomes (not all deep sequencing, though), we still have not covered all regions?

    Hmm, there is something seriously wrong with how the projects are being managed. I would have guessed that someone on the teams would be interested in putting together a complete model genome, even if the model was bound to "evolve" as more sequences were obtained.

    1. Many of the regions may be covered by sequence, but are difficult to assemble due to the biological complexity of the genome and the technical limitations of next generation sequencing.
      On the difficulty of assembling NGS data:
      On the complexity of the human genome:
      How the human assembly is now being managed:

      Also see:

    2. Thanks! Very useful (I knew about the problems and the complexity, but I did not know about the genome reference projects).

  6. Note: The 'Total number of sequenced bases in the assembly' represents the sequenced bases and the gaps, but the sequenced bases also include regions of the genome that contain structural variants. For example, there are now 8 representations of the MHC region in the assembly: the one incorporated in the chromosome, plus 7 additional alternate paths. The reference assembly is no longer just a flattened representation of a haploid genome, as we know this does not accurately represent biology.
    I encourage you to read our paper that describes the assembly model:
    But the short answer is that adding up all of the bases in the assembly isn't going to give you the size of the genome. The current assembly only attempts to represent the euchromatic portion of the genome, so large, important, but difficult-to-sequence regions are represented by Ns. Additionally, complex regions for which we have good data may have >1 representation.

  7. Is it really fair to ask about the size of "the" human genome (only half joking)? I wonder how much we should expect genome size to vary between individuals based on large scale variation that has already been studied (e.g. Copy Number Variants), or other repetitive sequences? Do you know if anyone has ever studied variation in centromere or telomere size between individuals?

    On a different thought, regarding the sex chromosomes, I think it might be a little misleading to count the length of each autosome once, and then add the length of both the X and the Y. Considering the total number of bases sequenced in some haploid reference human genome (e.g. 3.2Gb) implies that doubling it would equal the total expected number of bases, but the sex chromosomes throw a wrench into the mix. To be correct, there should probably be two haploid estimates (although given the relatively small size of the Y, they wouldn't differ too much), one including the X, and one including the Y (because no healthy haploid genome would contain both the X and the Y).

    But maybe I'm just nit-picking. :)

  8. mathbionerd: you bring up a really interesting point. Work from Evan Eichler's group shows that about 50% of gaps in the euchromatic part of the assembly are polymorphic. The reference assembly is really a mosaic and not meant to represent any one person. However, if one wanted to get a rough approximation of a haploid genome size, using the stats for the Primary assembly is the way to go, and we could continue to nitpick about the sex chromosomes.

  9. May I suggest adding something about synteny to the content of your course? IMO synteny maps impressively display the relatedness among e.g. mammalian species and the degree of chromosomal rearrangements that took place during evolution.

  10. In an introductory Bioinformatics course held in my training programme (GTPB) a naïve student "accidentally landed" in the p arm of Chr 13 (Human). No sequences are available to play with. Actually, in classes we display the image of the human karyotype from genome browsers and we often forget that the little caps over the p-arms of Chr 13, 14, 15, 21 and 22 represent a substantial number of base pairs, almost their full lengths, for which the sequence has not been assembled yet. The reasons for this are experimental (lack of data, due to chromatin) and computational (low complexity makes assembly difficult).
    Not having transcripts in such a large region has implications at several levels.

    For the whole of the Human Genome
    Assembly: GRCh37.p7, Feb 2009
    Database version: 67.37
    Base Pairs: 3,287,209,763
    Golden Path Length: 3,101,804,739

    the Golden Path Length is less than 6% shorter than the total base pairs.
    A gross measure, certainly, but not as small as a layman would guess from reading the news in the popular press.

    What strikes me most is that this fact is often ignored in the literature, where, for example, statements about copy numbers and copy number variations refer to genes found in the sequence that is known and made available, not to the still unassembled sequences in human, where the numbers could be different...

  11. There was so much useful info (potential explanations for statements in the literature that appear contradictory) in this post and some of the replies that I just had to say thank you. So Thanks.

  12. Very informative blog post! I have one question though: You say "In the case of ribosomal RNA gene clusters, these gaps have now been closed," but in the linked post you say "The reported human genome sequences do not contain the region of the 5S RNA genes. It is impossible to clone and assemble fragments of repeated DNA sequences." So which is it? Have the gaps been closed or is there only meta information about copy numbers and the like?

    1. I was thinking of the operons for the large ribosomal RNAs (45S) when I said that the gaps have been closed. There are five clusters in the human genome. They are at 13p12, 14p12, 15p12, 21p12, and 22p12. There are about 140 repeats in each cluster.

      I don't know if that's also true of the much smaller 5S RNA genes. I doubt that there's a reliable sequence spanning that cluster. (It's on chromosome 1 at 1q42.) There are about 100 copies with a repeat length of 2.2 kb.

  13. Hello Prof, we have learned so much about the human genome that we now know non-coding regions are very, very important. I feel that saying the difference between human and chimp is only 2% is not correct. What do you think?

    1. I've known that non coding regions are very, very important since 1968.

      I'm not concerned about your "feelings" with respect to the difference between humans and chimpanzees. Deal with the facts.

  14. I did a count of the total number of codons in Chromosome 1, and the number of unsequenced codons (NNN). My figures are
    82972321 codons in total, of which 6154457 are NNN
    So 7.4% of Chromosome 1 is as yet unsequenced. Does anyone know when or where these hidden sequences might be available?

    1. Most of these "codons" are highly repetitive sequences that are hard to deal with, especially to assemble. Don't count on anything in the near future. And of course they aren't codons, which live in exons.

    2. The latest build is GRCh38.p10 (Jan. 6, 2017). The total amount of sequenced DNA is 3,080,585,178 bp. The best estimate of the total size of the genome as judged by the remaining 875 gaps is 3,241,953,429 bp. Thus, 95% of the genome has been sequenced.

      The remaining 5% is mostly highly repetitive sequence and it's very unlikely that an exact sequence of those remaining gaps will ever be published. It's not worth the effort.

      Miga, K. H. (2015). Completing the human genome: the progress and challenge of satellite DNA assembly. Chromosome Research, 23(3), 421-426. doi: 10.1007/s10577-015-9488-2
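
The triplet count described in comment 14 above can be sketched roughly as follows. This is not the commenter's actual script: the `count_triplets` function and the toy sequence are hypothetical. It assumes the sequence is read in non-overlapping triplets from the first base, so a run of Ns that is out of frame yields mixed triplets such as "TNN" that are not counted.

```python
def count_triplets(sequence):
    """Split a sequence into non-overlapping triplets starting at position 0.

    Returns (total_triplets, triplets_equal_to_NNN). Trailing bases that do
    not fill a complete triplet are ignored.
    """
    seq = sequence.upper()
    total = len(seq) // 3
    unknown = sum(1 for i in range(0, total * 3, 3) if seq[i:i + 3] == "NNN")
    return total, unknown


# Toy, made-up sequence (not real chromosome data)
toy = "NNNNNNACGTNNNACG"
toy_total, toy_unknown = count_triplets(toy)
print(toy_total, toy_unknown)  # 5 2 (the second N run falls out of frame)

# Fraction implied by the figures quoted in comment 14
reported_total = 82_972_321
reported_nnn = 6_154_457
reported_pct = 100 * reported_nnn / reported_total
print(f"{reported_pct:.1f}% of triplets are NNN")  # 7.4%
```

Counting individual N bases rather than triplets would avoid the frame dependence, which fits the reply's point that these triplets are not really codons.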