Sandwalk: How Big Is the Human Genome?

Thursday, March 24, 2011

How Big Is the Human Genome?

The earliest direct estimates of the size of the human genome clustered around 3,000 Mb (megabase pairs), or 3.0 × 10⁹ bp (base pairs). The textbooks settled on about 3,200 Mb, based mostly on reassociation kinetics. According to those results from the 1970s, roughly 10% of the genome consists of highly repetitive DNA, 25-30% is moderately repetitive, and the rest is unique-sequence DNA.

A study by Morton (1991) looked at all of the estimates of genome size that had been published to date and concluded that the average size of the haploid genome in females is 3,227 Mb. This includes a complete set of autosomes and one X chromosome. The sum of autosomes plus a Y chromosome comes to 3,122 Mb. The average is about 3,200 Mb, which roughly corresponds to 3.5 pg (picograms), and that's the value in Ryan Gregory's Animal Genome Size Database.
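As a rough cross-check of the picogram figure, here is a minimal sketch of the conversion. It assumes a factor of about 978 Mb per picogram of DNA, which is not given in the post, and the function names are made up for illustration.

```python
# Assumption (not from the post): 1 pg of double-stranded DNA is about 978 Mb.
MB_PER_PG = 978.0


def pg_to_mb(picograms: float) -> float:
    """Convert a DNA mass in picograms to megabase pairs."""
    return picograms * MB_PER_PG


def mb_to_pg(megabases: float) -> float:
    """Convert a genome size in megabase pairs to picograms."""
    return megabases / MB_PER_PG


# Morton's (1991) haploid estimates as quoted in the post.
female_haploid_mb = 3227  # autosomes + X
male_haploid_mb = 3122    # autosomes + Y
average_mb = (female_haploid_mb + male_haploid_mb) / 2

print(f"average of the two estimates: {average_mb} Mb")
print(f"3,200 Mb is about {mb_to_pg(3200):.2f} pg")
print(f"3.5 pg is about {pg_to_mb(3.5):.0f} Mb")
```

Under that assumed factor, 3,200 Mb comes out nearer 3.3 pg, and 3.5 pg nearer 3,400 Mb, so the two figures agree only approximately.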

In the past decade or so the common assumption about the size of the human genome has dropped to about 3,000 Mb. This is because the draft sequence of the human genome came in at 2,800 Mb and the so-called "finished" sequence was still considerably less than 3,200 Mb. Most people didn't realize that there were significant gaps in the draft sequence and in the "finished" sequence.

The latest information on the human genome from the human genome consortium is 3,156,105,057 bp (3,156 Mb) (Build 37 version 2, patch 2=GRCh37.p3 (November 2010)). I believe this build still has gaps around the centromeres of the chromosomes. That region consists of highly repetitive sequences that are almost impossible to clone and sequence. These regions, also known as heterochromatin, were not targets of the original sequencing project. Their total size was estimated at 198 Mb (International Human Genome Sequencing Consortium, 2004) corresponding to about 6% of the genome.

The estimate may have been too large to begin with and, in addition, I'm pretty sure that some of these heterochromatic regions are included in the total size of Build 37 v2. That means that the total size of the human genome is very likely to be ~3,200 Mb or 3.2 ×10⁹ bp.
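The percentages above are easy to check. This is only a sketch using the numbers quoted in the post; the variable names are illustrative.

```python
# Figures quoted in the post.
HETEROCHROMATIN_MB = 198           # estimated unsequenced heterochromatin (IHGSC, 2004)
BUILD37_TOTAL_BP = 3_156_105_057   # Build 37 total reported in the post
TEXTBOOK_MB = 3200                 # textbook genome size

build37_mb = BUILD37_TOTAL_BP / 1e6

# Heterochromatin as a share of each candidate total.
share_of_textbook = HETEROCHROMATIN_MB / TEXTBOOK_MB
share_of_build37 = HETEROCHROMATIN_MB / build37_mb

# How far Build 37 falls short of the textbook value.
shortfall_mb = TEXTBOOK_MB - build37_mb

print(f"share of 3,200 Mb: {share_of_textbook:.1%}")
print(f"share of Build 37: {share_of_build37:.1%}")
print(f"Build 37 shortfall: {shortfall_mb:.1f} Mb")
```

Either way the heterochromatin estimate is about 6% of the total, and Build 37 is within roughly 44 Mb of the textbook figure.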

[Image Credit: Wikipedia: Creative Commons Attribution 2.0 Generic license]

Morton, N.E. (1991) Parameters of the Human Genome. Proc. Natl. Acad. Sci. (USA) 88:7474-7476 [free article on PubMed Central]

International Human Genome Sequencing Consortium (2004) Finishing the euchromatic sequence of the human genome. Nature 431:931-945 [doi:10.1038/nature03001]

16 comments :

Reed A. Cartwright said...: Yay, that's the number that I have been using.; Thursday, March 24, 2011 1:25:00 PM
T Ryan Gregory said...: Hi Larry,

It's actually a tough question (though recall that even the human chromosome number was debated for some time). The issue is that sequencing is actually a very bad way to estimate total genome size. On the other hand, our other methods (e.g., Feulgen densitometry, flow cytometry) are all relative estimates made using a standard of "known" (i.e., generally accepted) genome size. Human is more often a standard than a study subject, and we use 3.5pg (= 3.4Gb) in the database simply because that was widely used in the past and we can easily correct all the estimates for other species based on it if we use a single value for each standard. I suspect 3.5pg is a bit high, but you're absolutely correct that we don't entirely know how much has been missed in the sequencing programs. In any case, the human genome is very average in size for a mammal at 3.2-3.4Gb-ish.; Thursday, March 24, 2011 1:27:00 PM
manuel moe g said...: [Part 1 of 2]

Forgive my ignorance, but I am trying to make sense of different descriptions of the human genome, and different descriptions of the information needed to fully specify a large mammal, like a man.

You talk about 3.5 Gb for the genome, and, giving Ray Kurzweil the benefit of the doubt, 50 million bytes after loss-less compression.

If someone made extravagant claims about a computer program that runs on some unknown hardware and unknown OS, I would be unamused if they handed me a thumb-drive containing the compressed binary executable, and nothing more. This single file would demonstrate nothing.

I would demand the original source code, the specification for the code (including the business decisions the code is meant to automate, at the very least), some documentation demonstrating that I can move back and forth between points in the specification and the source code lines encoding that part of the specification, and the code for the automated tests (so an automated test can demonstrate what changes to the code will still keep it within specification, at the very least).

And maybe the same for some of the libraries and hardware - maybe needing the full specification of the libraries, OS, and hardware if they are all very novel, quite unlike any I have worked with before.

So there would be a dramatic explosion of information needed, moving from the binary executable to a bare minimum specification of a computer program as defined above.; Thursday, March 24, 2011 2:49:00 PM
manuel moe g said...: [Part 2 of 2]

In the debate between PZ and Kurzweil, PZ makes this point:

http://scienceblogs.com/pharyngula/2010/08/ray_kurzweil_does_not_understa.php

"""

Let me give you a few specific examples of just how wrong Kurzweil's calculations are. Here are a few proteins that I plucked at random from the NIH database; all play a role in the human brain.

First up is RHEB (Ras Homolog Enriched in Brain). It's a small protein, only 184 amino acids, which Kurzweil pretends can be reduced to about 12 bytes of code in his simulation. Here's the short description.

MTOR (FRAP1; 601231) integrates protein translation with cellular nutrient status and growth signals through its participation in 2 biochemically and functionally distinct protein complexes, MTORC1 and MTORC2. MTORC1 is sensitive to rapamycin and signals downstream to activate protein translation, whereas MTORC2 is resistant to rapamycin and signals upstream to activate AKT (see 164730). The GTPase RHEB is a proximal activator of MTORC1 and translation initiation. It has the opposite effect on MTORC2, producing inhibition of the upstream AKT pathway (Mavrakis et al., 2008).

Got that? You can't understand RHEB until you understand how it interacts with three other proteins, and how it fits into a complex regulatory pathway.

"""

I am inclined to grant PZ the point, and say his understanding of the immensity of the task outstrips Kurzweil's understanding.

Would the explosion of information needed to move from the complete genome to the complete specification of a large mammal be on the same order as the explosion of information needed to move from the binary executable to a bare minimum specification of a computer program as defined above? Did I capture the gist of it, or am I hopelessly mistaken?; Thursday, March 24, 2011 2:49:00 PM
Rich said...: How do genome size estimates based on sequencing handle repetitive regions that are likely to collapse into a single contig? Do they look at read depth across those regions (i.e., similar to Eichler et al.'s early identification of segmental duplicates)?; Thursday, March 24, 2011 3:05:00 PM
DK said...: It's a small protein, only 184 amino acids, which Kurzweil pretends can be reduced to about 12 bytes of code in his simulation.

This should qualify for the prize in most retarded understanding of biology by a non-biologist. Kurzweil has probably gone senile from popping close to a thousand pills every week in an effort to live forever.

A 184 aa protein contains so much information that we at present cannot handle it! The folding problem can in principle be reduced to a decryption task, and at present we cannot predict protein folding with any degree of reliability without cheating. "12 bytes"!

One can pretend to not pay attention to this information because, supposedly, folded state is encoded in the sequence. Not quite! Without correct interactions with the rest of the cell, folding typically fails. (Ask anyone expressing mammalian proteins in bacteria).

And then, even if one disregards folding, this 184 aa protein interacts with >15,000 other proteins (and carbohydrates, and nucleic acids, and various small molecules). Yep, many thousands. Most of these interactions are weak and fleeting, playing no major role, but in aggregate they all matter because in sum total they make up a cell. An imaginary experiment of wiping out all of the "non-specific" interactions would most likely result in a non-functional cell or, at the very least, a very different cell.; Thursday, March 24, 2011 5:47:00 PM
grc said...: The size of the human genome is difficult to estimate. Different individuals can have large scale sequence differences so the size of the genome in one individual will differ in size from the genome in another individual.
The original assembly was meant to be a haploid representation of the euchromatic genome. The GRC is now trying to represent large scale structural diversity, so some regions are now represented by >1 path.
The number you use in the post represents gap estimates (including heterochromatic gaps) as well as the sequence from the alternate alleles. There are 2.86 Gb of sequenced bases in the Primary assembly (the non-redundant haploid representation of the assembly). The GRC is working on trying to represent segmental duplication, and while there is still work to do, the representation is good in some regions. Regions that are being worked on are being tracked by the GRC and are available from the GRC website.; Thursday, March 24, 2011 9:11:00 PM
Larry Moran said...: grc says,

The number you use in the post represents gap estimates (including heterochromatic gaps) as well as the sequence from the alternate alleles. There are 2.86 Gb of sequenced bases in the Primary assembly (the non-redundant haploid representation of the assembly).

Thanks. I'm not an expert on the human genome sequence but neither am I uninformed. If I'm having trouble figuring out the size of the human genome doesn't that suggest a problem?

Why don't you have a clear and concise answer to the question on the NCBI website? The number I quoted (3,156,105,057 bp) is given as "Total Sequenced Bases in Assembly."

Are you now telling me that this is a lie because it includes gap estimates?

Could you tell me what the current estimate of genome size is and how it breaks down into actual sequence (2.86 Gb?) and estimates of missing sequence? Is it 3.16 Gb?

BTW, who are you?; Friday, March 25, 2011 11:34:00 AM
grc said...: Larry,
My name is Deanna Church and I work with the GRC. The GRC is tasked with producing the human (and mouse and zebrafish) reference assemblies. I wasn't trying to suggest you were lying; it is just that assembly statistics can be complicated. If you look at the GRC statistics page:
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/data/index.shtml

We do give an explanation of how the stats are calculated. Although I think you make a good point in that we should list the Total bases in the assembly and the Total bases in the Primary Assembly (which is meant to represent the haploid genome). We are working on publishing a paper describing our efforts but in reviewing our web page, I see we could certainly add some additional explanatory text and examples of both alternate loci and PATCHES. I'm certainly open to any suggestions that will make the site clearer!; Thursday, April 07, 2011 6:57:00 AM
Anonymous said...: I would like to have a copy of the human genome on DVD. LZW compressed if possible. How many DVDs would it take?; Thursday, January 19, 2012 8:22:00 AM
Peter said...: Less than 1 DVD, even without any compression. There's ~3.2 Gbp (giga-basepairs) of information. Even if you use a whole byte for each, that's only 3.2 GB (gigabytes) of data, and a DVD holds anything from 4.7 to 8.5 GB depending on whether you're talking single-layer or double-layer.

You can bring that down by a factor of 4 with more efficient encoding (you only need two bits to encode each base pair, not a whole byte). With even minimally effective compression, it would easily fit on a single CD with room to spare.; Monday, February 06, 2012 2:57:00 PM
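The two-bits-per-base idea can be sketched in a few lines. This assumes a sequence containing only A, C, G and T (real assemblies also contain N and other ambiguity codes), and `pack_bases` is a made-up helper, not part of any genome toolkit.

```python
# Two-bit code for each base; the particular assignment is arbitrary.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack_bases(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte.

    A final chunk shorter than four bases is zero-padded on the right.
    """
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        byte <<= 2 * (4 - len(chunk))  # pad a short final chunk
        out.append(byte)
    return bytes(out)


GENOME_BASES = 3_200_000_000                 # ~3.2 Gbp, as in the comment
bytes_one_per_base = GENOME_BASES            # one byte per base
bytes_two_bits = GENOME_BASES * 2 // 8       # two bits per base

print(pack_bases("ACGT").hex())
print(f"one byte per base: {bytes_one_per_base / 1e9:.1f} GB")
print(f"two bits per base: {bytes_two_bits / 1e6:.0f} MB")
```

At two bits per base the total is about 800 MB, which is slightly more than a nominal 700 MB CD holds, so the CD claim does depend on getting at least a little further compression.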
mohamad gaber said...: M.Gaber
I'm not a biologist but would like to ask this theoretical question. In the future, if I have the complete human genome for a particular man on a DVD or CD or whatever, is it possible (theoretically) to reproduce this genome and put it in a human egg and get a baby clone of this man?; Monday, March 18, 2013 8:29:00 AM
mohamad gaber said...: I have a feeling that something is missing. How can human complexity be reduced to less than 3.2 GB?; Monday, March 18, 2013 8:37:00 AM
Larry Moran said...: Yes.; Monday, March 18, 2013 12:15:00 PM
Biological Sciences said...: Mohamad Gaber,

I think it's because those 3.2 GB of information produce molecules that interact in complex ways in a living system. There are feedback loops among molecules that cause different portions of that information to be expressed at different times over the life of the organism, creating, essentially, an unlimited level of complexity. Basically, it's not how much information is in the genome, but how that information is used (regulatory sequences play a large role in that).; Tuesday, May 21, 2013 3:18:00 PM
Unknown said...: It's not gigabytes (GB) but gigabits. In computers there are 8 bits per byte, but computer bits are binary where DNA has 3 base pairs. So the human genome would take up 573MB (3.2*1.5*1000*1000*1000/8/1024/1024). If that doesn't seem like a lot, then remember that you could store 300 million written pages in that same space.; Friday, July 12, 2013 9:38:00 PM
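The arithmetic in this comment can be checked directly. The sketch below reproduces the commenter's formula and, for comparison, assumes the standard four-letter DNA alphabet (A, C, G, T), which needs log2(4) = 2 bits per base rather than 1.5. The function name is illustrative.

```python
import math

GENOME_BASES = 3.2e9  # ~3.2 Gbp, as used throughout the thread


def size_in_mib(bases: float, bits_per_base: float) -> float:
    """Storage needed for `bases` symbols at `bits_per_base`, in MiB (1024*1024 bytes)."""
    total_bytes = bases * bits_per_base / 8
    return total_bytes / 1024 / 1024


# The commenter's figure: 1.5 bits per base.
commenter_mib = size_in_mib(GENOME_BASES, 1.5)

# Assumption: four distinct bases, so log2(4) = 2 bits per base.
bits_for_four_bases = math.log2(4)
four_base_mib = size_in_mib(GENOME_BASES, bits_for_four_bases)

print(f"1.5 bits per base: {commenter_mib:.1f} MiB")
print(f"{bits_for_four_bases:.0f} bits per base: {four_base_mib:.1f} MiB")
```

With two bits per base the uncompressed total is about 763 MiB (800 MB in decimal units), which matches the factor-of-4 reduction from one byte per base described in the earlier comment.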
