Thursday, March 24, 2011

How Big Is the Human Genome?

The earliest direct estimates of the size of the human genome clustered around 3,000 Mb (megabase pairs) or 3.0 × 10⁹ bp (base pairs). The textbooks settled on about 3,200 Mb, based mostly on reassociation kinetics. According to those results from the 1970s, roughly 10% of the genome consists of highly repetitive DNA, 25-30% is moderately repetitive, and the rest is unique sequence DNA.

A study by Morton (1991) looked at all of the estimates of genome size that had been published to date and concluded that the average size of the haploid genome in females is 3,227 Mb. This includes a complete set of autosomes and one X chromosome. The sum of autosomes plus a Y chromosome comes to 3,122 Mb. The average is about 3,200 Mb, which corresponds to 3.5 pg (picograms), and that's the value on Ryan Gregory's Animal Genome Size Database.
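The unit arithmetic here can be checked with a short sketch. It assumes the commonly cited conversion of roughly 978 Mb per picogram of DNA, which is not stated in the post; the function names are illustrative only. With that factor, 3.5 pg works out closer to 3,400 Mb than to 3,200 Mb, which matches the "3.5pg (= 3.4Gb)" figure quoted in the comments.

```python
# Assumption: 1 pg of double-stranded DNA is about 978 Mb.
# This is a commonly cited conversion factor, not a value from the post.
MB_PER_PG = 978.0


def pg_to_mb(picograms):
    """Convert a DNA mass in picograms to megabase pairs."""
    return picograms * MB_PER_PG


def mb_to_pg(megabases):
    """Convert a genome size in megabase pairs to picograms."""
    return megabases / MB_PER_PG


female_haploid_mb = 3227  # autosomes + X (Morton, 1991)
male_haploid_mb = 3122    # autosomes + Y (Morton, 1991)
average_mb = (female_haploid_mb + male_haploid_mb) / 2

print(f"average of the two haploid sets: {average_mb:.1f} Mb")
print(f"3,200 Mb is about {mb_to_pg(3200):.2f} pg")
print(f"3.5 pg is about {pg_to_mb(3.5):.0f} Mb")
```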

In the past decade or so the common assumption about the size of the human genome has dropped to about 3,000 Mb. This is because the draft sequence of the human genome came in at 2,800 Mb and the so-called "finished" sequence was still considerably less than 3,200 Mb. Most people didn't realize that there were significant gaps in the draft sequence and in the "finished" sequence.

The latest information on the human genome from the human genome consortium is 3,156,105,057 bp (3,156 Mb) (Build 37 version 2, patch 2=GRCh37.p3 (November 2010)). I believe this build still has gaps around the centromeres of the chromosomes. That region consists of highly repetitive sequences that are almost impossible to clone and sequence. These regions, also known as heterochromatin, were not targets of the original sequencing project. Their total size was estimated at 198 Mb (International Human Genome Sequencing Consortium, 2004) corresponding to about 6% of the genome.

The estimate may have been too large to begin with and, in addition, I'm pretty sure that some of these heterochromatic regions are included in the total size of Build 37 v2. That means that the total size of the human genome is very likely to be ~3,200 Mb or 3.2 × 10⁹ bp.
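A rough check of this reasoning, using only the numbers quoted above (the variable names are illustrative): the 198 Mb estimate is about 6% of either total, and adding all of it to the Build 37 figure would overshoot 3,200 Mb. That is consistent with the view that the estimate is too large, or that some heterochromatin is already counted in the build.

```python
BUILD37_BP = 3_156_105_057   # GRCh37.p3 total quoted in the post
HETEROCHROMATIN_MB = 198     # estimate from the 2004 consortium paper
TEXTBOOK_MB = 3200           # long-standing textbook value

build37_mb = BUILD37_BP / 1_000_000

# Heterochromatin as a share of each candidate genome size.
share_of_build = HETEROCHROMATIN_MB / build37_mb
share_of_textbook = HETEROCHROMATIN_MB / TEXTBOOK_MB

# Upper bound if none of the heterochromatin were already in Build 37.
upper_bound_mb = build37_mb + HETEROCHROMATIN_MB

print(f"Build 37: {build37_mb:.0f} Mb")
print(f"198 Mb is {share_of_build:.1%} of Build 37")
print(f"198 Mb is {share_of_textbook:.1%} of 3,200 Mb")
print(f"Build 37 plus all heterochromatin: {upper_bound_mb:.0f} Mb")
```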

[Image Credit: Wikipedia: Creative Commons Attribution 2.0 Generic license]

Morton, N.E. (1991) Parameters of the Human Genome. Proc. Natl. Acad. Sci. (USA) 88:7474-7476 [free article on PubMed Central]

International Human Genome Sequencing Consortium (2004) Finishing the euchromatic sequence of the human genome. Nature 431:931-945 [doi:10.1038/nature03001]


  1. Yay, that's the number that I have been using.

  2. Hi Larry,

    It's actually a tough question (though recall that even the human chromosome number was debated for some time). The issue is that sequencing is actually a very bad way to estimate total genome size. On the other hand, our other methods (e.g., Feulgen densitometry, flow cytometry) are all relative estimates made using a standard of "known" (i.e., generally accepted) genome size. Human is more often a standard than a study subject, and we use 3.5pg (= 3.4Gb) in the database simply because that was widely used in the past and we can easily correct all the estimates for other species based on it if we use a single value for each standard. I suspect 3.5pg is a bit high, but you're absolutely correct that we don't entirely know how much has been missed in the sequencing programs. In any case, the human genome is very average in size for a mammal at 3.2-3.4Gb-ish.

  3. [Part 1 of 2]

    Forgive my ignorance, but I am trying to make sense of different descriptions of the human genome, and different descriptions of the information needed to fully specify a large mammal, like a man.

    You talk about 3.5 Gb for the genome, and, giving Ray Kurzweil the benefit of the doubt, 50 million bytes after lossless compression.

    If someone made extravagant claims about a computer program that runs on some unknown hardware and unknown OS, I would be unamused if they handed me a thumb-drive containing the compressed binary executable, and nothing more. This single file would demonstrate nothing.

    I would demand the original source code, the specification for the code (including the business decisions the code is meant to automate, at the very least), some documentation demonstrating that I can move back and forth between points in the specification and the source code lines encoding that part of the specification, and the code for the automated tests (so an automated test can demonstrate what changes to the code will still keep it within specification, at the very least).

    And maybe the same for some of the libraries and hardware - maybe needing the full specification of the libraries, OS, and hardware if they all are very novel, quite unlike any I have worked with before.

    So there would be a dramatic explosion of information needed, moving from the binary executable to a bare minimum specification of a computer program as defined above.

  4. [Part 2 of 2]

    In the debate between PZ and Kurzweil, PZ makes this point:


    Let me give you a few specific examples of just how wrong Kurzweil's calculations are. Here are a few proteins that I plucked at random from the NIH database; all play a role in the human brain.

    First up is RHEB (Ras Homolog Enriched in Brain). It's a small protein, only 184 amino acids, which Kurzweil pretends can be reduced to about 12 bytes of code in his simulation. Here's the short description.

    MTOR (FRAP1; 601231) integrates protein translation with cellular nutrient status and growth signals through its participation in 2 biochemically and functionally distinct protein complexes, MTORC1 and MTORC2. MTORC1 is sensitive to rapamycin and signals downstream to activate protein translation, whereas MTORC2 is resistant to rapamycin and signals upstream to activate AKT (see 164730). The GTPase RHEB is a proximal activator of MTORC1 and translation initiation. It has the opposite effect on MTORC2, producing inhibition of the upstream AKT pathway (Mavrakis et al., 2008).

    Got that? You can't understand RHEB until you understand how it interacts with three other proteins, and how it fits into a complex regulatory pathway.


    I am inclined to grant PZ the point, and say his understanding of the immensity of the task outstrips Kurzweil's understanding.

    Would the explosion of information needed to move from the complete genome to the complete specification of a large mammal be on the same order of the explosion of information needed to move from the binary executable to a bare minimum specification of a computer program as defined above? Did I capture the gist of it, or am I hopelessly mistaken?

  5. How do genome size estimates based on sequencing handle repetitive regions that are likely to collapse into a single contig? Do they look at read depth across those regions (i.e., similar to Eichler et al.'s early identification of segmental duplicates)?

  6. It's a small protein, only 184 amino acids, which Kurzweil pretends can be reduced to about 12 bytes of code in his simulation.

    This should qualify for the prize in most retarded understanding of biology by a non-biologist. Kurzweil has probably gone senile from popping close to a thousand pills every week in an effort to live forever.

    A 184 aa protein contains so much information that we at present cannot handle it! The folding problem can in principle be reduced to a decryption task and at present we cannot predict protein folding with any degree of reliability without cheating. "12 bytes"!

    One can pretend to not pay attention to this information because, supposedly, folded state is encoded in the sequence. Not quite! Without correct interactions with the rest of the cell, folding typically fails. (Ask anyone expressing mammalian proteins in bacteria).

    And then, even if one disregards folding, this 184 aa protein interacts with >15,000 other proteins (and carbohydrates, and nucleic acids, and various small molecules). Yep, many thousands. Most of these interactions are weak and fleeting, playing no major role, but in aggregate they all matter because in sum total they make up a cell. An imaginary experiment of wiping out all of the "non-specific" interactions would most likely result in a non-functional cell or, at the very least, a very different cell.

  7. The size of the human genome is difficult to estimate. Different individuals can have large scale sequence differences so the size of the genome in one individual will differ in size from the genome in another individual.
    The original assembly was meant to be a haploid representation of the euchromatic genome. The GRC is now trying to represent large scale structural diversity, so some regions are now represented by >1 path.
    The number you use in the post represents gap estimates (including heterochromatic gaps) as well as the sequence from the alternate alleles. There are 2.86 Mb of sequenced bases in the Primary assembly (the non-redundant haploid representation of the assembly). The GRC is working on trying to represent segmental duplication, and while there is still work to do the representation is good in some regions. Regions that are being worked on are being tracked by the GRC and are available from the GRC website.

  8. grc says,

    The number you use in the post represents gap estimates (including heterochromatic gaps) as well as the sequence from the alternate alleles. There are 2.86 Mb of sequenced bases in the Primary assembly (the non-redundant haploid representation of the assembly).

    Thanks. I'm not an expert on the human genome sequence but neither am I uninformed. If I'm having trouble figuring out the size of the human genome doesn't that suggest a problem?

    Why don't you have a clear and concise answer to the question on the NCBI website? The number I quoted (3,156,105,057 bp) is given as "Total Sequenced Bases in Assembly."

    Are you now telling me that this is a lie because it includes gap estimates?

    Could you tell me what the current estimate of genome size is and how it breaks down into actual sequence (2.86 Mb?) and estimates of missing sequence? Is it 3.16 Mb?

    BTW, who are you?

  9. Larry,
    My name is Deanna Church and I work with the GRC. The GRC is tasked with producing the human (and mouse and zebrafish) reference assemblies. I wasn't trying to suggest you were lying, it is just that assembly statistics can be complicated. If you look at the GRC statistics page:

    We do give an explanation of how the stats are calculated. Although I think you make a good point in that we should list the Total bases in the assembly and the Total bases in the Primary Assembly (which is meant to represent the haploid genome). We are working on publishing a paper describing our efforts but in reviewing our web page, I see we could certainly add some additional explanatory text and examples of both alternate loci and PATCHES. I'm certainly open to any suggestions that will make the site clearer!

  10. I would like to have a copy of the human genome on DVD. LZW compressed if possible. How many DVDs would it take?

    1. Less than 1 DVD, even without any compression. There's ~3.2 Gbp (giga-basepairs) of information. Even if you use a whole byte for each, that's only 3.2 GB (gigabytes) of data, and a DVD holds anything from 4.7 to 8.7 GB depending on whether you're talking single-layer or double-layer.

      You can bring that down by a factor of 4 with more efficient encoding (you only need two bits to encode each base pair, not a whole byte). With even minimally effective compression, it would easily fit on a single CD with room to spare.

  11. M.Gaber
    I'm not a biologist but would like to ask this theoretical question. In the future, if I have the complete human genome for a particular man on a DVD or CD or whatever, is it possible (theoretically) to reproduce this genome and put it in a human egg and get a baby clone of this man?

  12. I have a feeling that something is missing. How can human complexity be reduced to less than 3.2 GB?

  13. Mohamad Gaber,

    I think because those 3.2 GB of information produce molecules that interact in complex ways in a living system. There are feedback loops among molecules that cause different portions of that information to be expressed at different times over the life of the organism, creating essentially, an unlimited level of complexity. Basically, it's not how much information is in the genome, but how that information is used (regulatory sequences play a large role in that).

  14. It's not Gigabytes (GB) but Giga bits. In computers there are 8 bits per byte, but computer bits are binary where DNA has 3 base pairs. So the human genome would take up 573MB (3.2*1.5*1000*1000*1000/8/1024/1024). If that doesn't seem like a lot then remember that you could store 300 million written pages in that same space.
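
The storage arithmetic in the DVD discussion above can be sketched as follows. DNA has four bases (A, C, G, T), so each position needs log2(4) = 2 bits. This is a minimal illustration, assuming a genome of ~3.2 Gbp and a sequence containing only those four letters; real assembly files also contain `N` and other ambiguity codes, which this sketch does not handle. The names `storage_bytes`, `pack`, and `unpack` are hypothetical helpers, not part of any genomics library.

```python
import math

GENOME_BP = 3_200_000_000  # ~3.2 Gbp, the figure used in the discussion

BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def storage_bytes(n_bases, bits_per_base):
    """Bytes needed to store n_bases at a fixed number of bits per base."""
    return math.ceil(n_bases * bits_per_base / 8)


def pack(seq):
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        # Left-align a short final chunk so unpack reads it correctly.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data, n_bases):
    """Recover the first n_bases bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 3])
    return "".join(bases[:n_bases])


one_byte_per_base = storage_bytes(GENOME_BP, 8)
two_bits_per_base = storage_bytes(GENOME_BP, 2)
print(f"1 byte per base: {one_byte_per_base / 1e9:.1f} GB")
print(f"2 bits per base: {two_bits_per_base / 1e9:.1f} GB")

packed = pack("ACGTAC")
print(len(packed), unpack(packed, 6))
```

The sequence length has to be stored alongside the packed bytes, because the final byte may contain padding.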