Sandwalk: How Much of Our Genome Is Sequenced?

Monday, February 06, 2012

How Much of Our Genome Is Sequenced?

I'm getting ready for a class on the size and composition of the human genome so I thought I'd check to see the latest estimate of its size. Recall that in an earlier posting I concluded that the size of the human genome was 3,200,000,000 bp (3,200,000 kb, 3,200 Mb, 3.2 Gb) [How Big Is the Human Genome?].

You might think that all you have to do is check out the human genome websites and look up the exact size. That doesn't work because not all of the human genome has been sequenced and organized into a contiguous assembly of 24 different strands (one for each chromosome). So that prompts the question, how much of the human genome has actually been sequenced?¹

The latest assembly is GRCh37 Patch Release 7 (GRCh37.p7), released on Feb. 3, 2012. If you look at the data for this assembly you will see an estimate of the "Total Sequenced Bases in the Assembly." The number is 3,173,036,847 bp or 3.17 Gb. This value is close to estimates of the genome size from the years before the first draft of the genome sequence was published.

I was suspicious of this number since we know that there are many gaps in the human genome sequence. The largest gaps cover highly repetitive parts of the genome—mostly around the centromeres and other heterochromatic regions. There were also gaps at the locations of several gene clusters (e.g. ribosomal RNA genes) where it's impossible to determine the exact number of copies. In the case of ribosomal RNA gene clusters, these gaps have now been closed.

Deanna Church posted a few comments on my earlier posting. She's with the Genome Reference Consortium (GRC). That's the group responsible for updating the human genome. Deanna explained that "Total Sequenced Bases in the Assembly" is not an accurate representation of the truth.² What it actually means is total sequenced bases plus estimated sizes of the gaps. In other words, it's a good estimate of the size of the genome.

So, how much of the genome is actually sequenced and organized into "scaffolds," or contiguous stretches of DNA? You can see the actual numbers by clicking on Ungapped Lengths on the NCBI website.

The total number of sequenced base pairs that have been organized into scaffolds and placed on a particular chromosome is 2,861,332,606 bp. An additional 6,110,758 bp have been sequenced but the blocks of sequence cannot be placed in the assembly. Most of this unassigned sequence is on chromosomes 1, 4, 9, and 17 but some of it can't even be associated with a particular chromosome.

If we assume that the true haploid genome size is 3.2 Gb, or 3,200 Mb, then the sequenced and assigned part of the genome represents 89.4% and the unassigned sequenced part is 0.2%, for a combined 89.6%.

We can say that only 90% of the human genome has been sequenced and the remaining 10% falls into 357 gaps scattered throughout the genome. (Every chromosome has unsequenced gaps but some have more than others and it doesn't depend on the size of the chromosome.)
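As a quick check on the arithmetic, here is a minimal Python sketch that recomputes the sequenced fractions from the ungapped lengths quoted above. The 3.2 Gb denominator is an assumption (the estimate used in this post), and the variable names are purely illustrative. With these inputs the placed scaffolds come to about 89.4% of the genome, the unplaced sequence to about 0.2%, and the two together to about 89.6%.

```python
# Ungapped lengths quoted in the post for GRCh37.p7.
PLACED_BP = 2_861_332_606    # sequenced and placed on a chromosome
UNPLACED_BP = 6_110_758      # sequenced but not placed in the assembly

# Assumption: the true haploid genome size is 3.2 Gb.
ASSUMED_GENOME_BP = 3_200_000_000


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


placed_pct = percent(PLACED_BP, ASSUMED_GENOME_BP)
unplaced_pct = percent(UNPLACED_BP, ASSUMED_GENOME_BP)
total_pct = percent(PLACED_BP + UNPLACED_BP, ASSUMED_GENOME_BP)
gap_pct = 100.0 - total_pct  # what is left over must lie in the gaps

print(f"placed on chromosomes: {placed_pct:.1f}%")
print(f"sequenced, unplaced:   {unplaced_pct:.1f}%")
print(f"total sequenced:       {total_pct:.1f}%")
print(f"estimated gaps:        {gap_pct:.1f}%")
```

Changing `ASSUMED_GENOME_BP` shows how sensitive the "90% sequenced" figure is to the genome size estimate used as the denominator.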

The Wellcome Trust Sanger Institute is part of the Genome Reference Consortium but it maintains its own website on the human genome [Whole Genome]. The data on the Ensembl page refers to build GRCh37.p5 from Feb. 2009 but it also says the data was updated in Dec. 2011.

According to the Sanger Institute, the size of the sequenced genome is 3,283,984,159 bp and the "golden path length" is 3,101,804,739 bp. I've tried to find out what these numbers mean but if the information is present on the Ensembl website then it's very well hidden.

Are you interested in the number of genes? Here's the data from Ensembl. It indicates that the human genome contains 33,399 genes! [What Is a Gene?] [What is a gene, post-ENCODE?] This value is calculated by including 12,523 genes that make an RNA product that's not translated. That is almost certainly a highly inflated number.

The data indicates that there are 181,744 gene transcripts or between 5 and 9 transcripts per gene depending on how you count the genes. I don't believe there are this many biologically functional transcripts per gene. I think the actual number is much closer to one (1) [Genes and Straw Men].
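The "between 5 and 9" range can be reproduced with a short Python sketch. It assumes, as a guess at how the range was derived, that the two ways of counting are (a) all 33,399 Ensembl genes and (b) only the genes left after removing the 12,523 whose RNA product is not translated. The variable names are illustrative.

```python
# Ensembl figures quoted in the post.
TOTAL_GENES = 33_399
UNTRANSLATED_RNA_GENES = 12_523
TRANSCRIPTS = 181_744

# Assumption: the lower gene count simply excludes the untranslated-RNA genes.
REMAINING_GENES = TOTAL_GENES - UNTRANSLATED_RNA_GENES

per_all_genes = TRANSCRIPTS / TOTAL_GENES
per_remaining_genes = TRANSCRIPTS / REMAINING_GENES

print(f"genes after removing untranslated-RNA genes: {REMAINING_GENES:,}")
print(f"transcripts per gene (all genes):       {per_all_genes:.1f}")
print(f"transcripts per gene (remaining genes): {per_remaining_genes:.1f}")
```

Under these assumptions the ratios are roughly 5.4 and 8.7 transcripts per gene, which matches the range quoted above.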

1. It certainly doesn't "beg the question." That means something else entirely [Begging the Question].

2. That's a euphemism for "It's a lie!"

23 comments :

burntloafer said...: I struggle to keep up, but I do appreciate your postings, Prof. Moran. But I have to ask - 24? I thought there were 23?

(22 pairs of autosomes and one pair of sex chromosomes)

Sorry if it is a stupid question.; Monday, February 06, 2012 11:58:00 AM
Seth said...: 22 pairs of autosomes and a pair of sex chromosomes. The sequences of the X and Y chromosomes bring it to 24.; Monday, February 06, 2012 1:12:00 PM
Larry Moran said...: 22 autosomes plus two different sex chromosomes (X & Y) equals 24 chromosomes; Monday, February 06, 2012 1:25:00 PM
Peter said...: http://www.ensembl.org/Help/Glossary

"The golden path is the length of the reference assembly. It consists of the sum of all top-level sequences in the seq_region table, omitting any redundant regions such as haplotypes and PARs (pseudoautosomal regions)."

In other words, the golden path length = total sequenced bases plus estimated sizes of the gaps. The term "golden path" comes from restriction mapping and refers to the best assembly of the various restriction fragments in order. Since they quote 3,101,804,739 bp rather than 3,173,036,847 bp, I presume they're using an older genome build.

I'm not quite sure how they then get up to a total of 3,283,984,159 bp. From the glossary it looks as though the higher total includes duplicated regions like the pseudoautosomal regions of the sex chromosomes, and perhaps other large contigs like MHC haplotypes.; Monday, February 06, 2012 2:50:00 PM
Larry Moran said...: Thanks. Clear as mud. :-); Monday, February 06, 2012 3:12:00 PM
Anonymous said...: How do you know the "12,523 genes that make an RNA product that's not translated" really aren't translated? This has been bugging me for a while.; Monday, February 06, 2012 6:51:00 PM
Atheistoclast said...: There are an awful lot of RNA genes. I guess mice have as many as humans?; Monday, February 06, 2012 7:46:00 PM
Anonymous said...: So, even after sequencing more than 30,000 human genomes (not all deep sequencing though), we still have not covered all regions?

Hum, there is something seriously wrong with how the projects are being managed. I would have guessed that someone in the teams would be interested in putting together a complete model genome. Even if the model was bound to "evolve" as more sequences were obtained.; Monday, February 06, 2012 8:19:00 PM
Deanna Church said...: Note: The 'Total number of sequenced bases in the assembly' represents the sequenced bases and the gaps- but, the sequenced bases also include regions of the genome that contain structural variants. For example, there are 8 representations of the MHC region in the assembly now- the one incorporated in the chromosome, plus 7 additional alternate paths. The reference assembly is no longer just a flattened representation of a haploid genome as we know this does not accurately represent biology.
I encourage you to read our paper that describes the assembly model: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3130012/
But, the short answer is adding up all of the bases in the assembly isn't going to give you the size of the genome. The current assembly only attempts to represent the euchromatic portion of the genome, so large, important but difficult to sequence regions are represented by Ns. Additionally complex regions for which we have good data may have >1 representation.; Monday, February 06, 2012 9:59:00 PM
mathbionerd said...: Is it really fair to ask about the size of "the" human genome (only half joking)? I wonder how much we should expect genome size to vary between individuals based on large scale variation that has already been studied (e.g. Copy Number Variants), or other repetitive sequences? Do you know if anyone has ever studied variation in centromere or telomere size between individuals?

On a different thought, regarding the sex chromosomes, I think it might be a little misleading to count the length of each autosome once, and then add the length of both the X and the Y. Considering the total number of bases sequenced in some haploid reference human genome (e.g. 3.2Gb) implies that doubling it would equal the total expected number of bases, but the sex chromosomes throw a wrench into the mix. To be correct, there should probably be two haploid estimates (although given the relatively small size of the Y, they wouldn't differ too much), one including the X, and one including the Y (because no healthy haploid genome would contain both the X and the Y).

But maybe I'm just nit-picking. :); Monday, February 06, 2012 10:58:00 PM
Deanna Church said...: mathbionerd: you bring up a really interesting point. Work from Evan Eichler's group shows that about 50% of gaps in the euchromatic part of the assembly are polymorphic. The reference assembly is really a mosaic and not meant to represent any one person. However, if one wanted to get a rough approximation of a haploid genome size, using the stats for the Primary assembly is the way to go, and we could continue to nitpick about the sex chromosomes.; Tuesday, February 07, 2012 9:01:00 AM
Deanna Church said...: Many of the regions may be covered by sequence, but are difficult to assemble due to the biological complexity of the genome and the technical limitations of next generation sequencing.
On the difficulty of assembling NGS data:
http://www.ncbi.nlm.nih.gov/pubmed/22147368
http://www.ncbi.nlm.nih.gov/pubmed/21926179
On the complexity of the human genome
http://www.ncbi.nlm.nih.gov/pubmed/21030649
How the human assembly is now being managed:
http://www.ncbi.nlm.nih.gov/pubmed/21750661

Also see: http://genomereference.org; Tuesday, February 07, 2012 12:52:00 PM
Anonymous said...: Thanks! Very useful (I knew about the problems and the complexity, but I did not know about the genome reference projects).; Tuesday, February 07, 2012 3:09:00 PM
Anonymous said...: May I suggest adding something about synteny to the content of your course? IMO synteny maps impressively display the relatedness among e.g. mammalian species and the degree of chromosomal rearrangements that took place during evolution.; Tuesday, February 07, 2012 11:27:00 PM
Pfern said...: In an introductory Bioinformatics course held in my training programme (GTPB) a naïve student "accidentally landed" in the p arm of Chr 13 (Human). No sequences are available to play with. Actually, in classes we display the image of the human karyotype from genome browsers and we often forget that the little caps over the p-arms of Chr 13, 14, 15, 21 and 22 represent a substantial number of base pairs, almost their full lengths, for which the sequence has not been assembled yet. The reasons for this are lack of experimental data (due to chromatin) and computational (low complexity makes assembly difficult).
Not having transcripts in such a large region has implications at several levels.

For the whole of the Human Genome
Assembly: GRCh37.p7, Feb 2009
Database version: 67.37
Base Pairs: 3,287,209,763
Golden Path Length: 3,101,804,739

the Golden Path Length is less than 6% shorter than the total base pairs.
A gross measure, certainly, but not as small as a layman would guess from reading the news in the popular press.

What strikes me most is that this fact is often ignored in the literature, where, for example, statements about copy numbers and copy number variations refer to genes found in the sequence that is known and made available, not to the still unassembled sequences in human, where the numbers could be different...; Saturday, July 28, 2012 5:50:00 AM
Unknown said...: There was so much useful info (potential explanations for statements in the literature that appear contradictory) in this post and some of the replies that I just had to say thank you. So Thanks.; Tuesday, November 19, 2013 4:13:00 PM
Unknown said...: Very informative blog post! I have one question though: You say "In the case of ribosomal RNA gene clusters, these gaps have now been closed.", but in the linked post you say "The reported human genome sequences do not contain the region of the 5S RNA genes. It is impossible to clone and assemble fragments of repeated DNA sequences.". So what is it? Have the gaps been closed or is there only meta information about copy numbers and the like?; Friday, September 25, 2015 4:40:00 AM
Larry Moran said...: I was thinking of the operons for the large ribosomal RNAs (45S) when I said that the gaps have been closed. There are five clusters in the human genome. They are at 13p12, 14p12, 15p12, 21p12, and 22p12. There are about 140 repeats in each cluster.

I don't know if that's also true of the much smaller 5S RNA genes. I doubt that there's a reliable sequence spanning that cluster. (It's on chromosome 1 at 1q42.) There are about 100 copies with a repeat length of 2.2 kb.; Friday, September 25, 2015 9:55:00 AM
Aahaa said...: Hello Prof, we have learned so much about the human genome that now we know non-coding regions are very, very important. I feel that saying the difference between human and chimp is only 2% is not correct. What do you think?; Thursday, December 22, 2016 6:54:00 AM
Larry Moran said...: I've known that non coding regions are very, very important since 1968.

I'm not concerned about your "feelings" with respect to the difference between humans and chimpanzees. Deal with the facts.; Thursday, December 22, 2016 7:38:00 AM
Craig Paardekooper said...: I did a count of the total number of codons in Chromosome 1, and the number of unsequenced codons (NNN). My figures are
82,972,321 codons in total, of which 6,154,457 are NNN
So 7.4% of Chromosome 1 is as yet unsequenced. Does anyone know when or where these hidden sequences might be available?; Monday, March 27, 2017 2:58:00 AM
John Harshman said...: Most of these "codons" are highly repetitive sequences that are hard to deal with, especially to assemble. Don't count on anything in the near future. And of course they aren't codons, which live in exons.; Monday, March 27, 2017 11:04:00 AM
Larry Moran said...: The latest build is GRCh38.p10 (Jan. 6, 2017). The total amount of sequenced DNA is 3,080,585,178 bp. The best estimate of the total size of the genome as judged by the remaining 875 gaps is 3,241,953,429 bp. Thus, 95% of the genome has been sequenced.

The remaining 5% is mostly highly repetitive sequence and it's very unlikely that an exact sequence of those remaining gaps will ever be published. It's not worth the effort.

Miga, K. H. (2015). Completing the human genome: the progress and challenge of satellite DNA assembly. Chromosome Research, 23(3), 421-426. doi: 10.1007/s10577-015-9488-2; Monday, March 27, 2017 3:03:00 PM

