
Sunday, August 25, 2019

How much of the human genome has been sequenced?

It's been more than seven years since I posted information on how much of the human genome has been sequenced [How Much of Our Genome Is Sequenced?]. At that time, the latest version of the human reference genome was GRCh37.p7 (Feb. 3, 2012) and 89.6% of the genome had been sequenced. It's time to update that information.

We have a pretty good idea of the size of the human genome based on quantitative Feulgen staining (1940-1980) and reassociation kinetic experiments from the 1970s (Morton, 1991). We can safely assume that the correct size of the human genome is close to 3,200,000,000 bp (3,200,000 kb, 3,200 Mb, 3.2 Gb) [How Big Is the Human Genome?]. That's the value cited most often in the literature. However, the actual values calculated by Morton (1991) were 3.227 Gb for the haploid female genome and less than that for the haploid male genome. The human reference genome contains all 22 autosomes plus one copy of the X chromosome and one copy of the Y chromosome. This gives a total of 3.286 Gb.

You might think that all you have to do is check out the human genome websites and look up the exact size. That doesn't work because not all of the human genome has been sequenced and organized into a contiguous assembly of 24 different strands (one for each chromosome). The latest assembly is GRCh38 Patch Release 13 (GRCh38.p13), released on Feb. 28, 2019. If you look at the data for this assembly you will see an estimate of the "Total Bases in the Assembly." The number is 3,272,116,950 bp or 3.27 Gb. This value is close to estimates of the genome size from the years before the first draft of the genome sequence was published. It includes a number of regions whose exact sequence is not known but where the approximate size can be estimated (the reference sequence contains long stretches of "N"). The actual number of identified nucleotides in the reference genome is 3,110,748,599 or 95.1% of the total. There are 875 gaps in the current reference genome.
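The percentage quoted above is simple arithmetic on the two assembly totals. Here's a quick check in Python that uses only the numbers given in this post (the variable names are mine, not anything from the assembly report):

```python
# Figures quoted in this post for the latest assembly
total_bases = 3_272_116_950       # "Total Bases in the Assembly", includes runs of N
identified_bases = 3_110_748_599  # nucleotides whose identity is known

# Bases that are present in the assembly only as placeholders
unknown_bases = total_bases - identified_bases

# Share of the assembly made up of identified nucleotides
percent_identified = 100 * identified_bases / total_bases

print(f"Unknown bases: {unknown_bases:,}")
print(f"Identified: {percent_identified:.1f}% of the assembly")
```

The difference between the two totals is 161,368,351 bp, so roughly 161 Mb of the reference is still represented by N.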

So, the answer to the question is that about 95% of the human genome has been sequenced, an increase of about 5.5 percentage points (from 89.6%) over the past seven years.

The diagram below shows the data from an earlier version of the human genome (GRCh38/hg38-b37-hg19; December 2017) but it shows the regions where sequence information is missing. The two main loci are the centromeres of all chromosomes (pink) and the pericentromeric regions (blue), especially those in the short arms of several small chromosomes. These are depicted as a series of unknown nucleotides (N) in the reference genome.
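If you want to see what "a series of unknown nucleotides (N)" looks like in practice, here is a minimal sketch that finds runs of N in a sequence string. The function name and the toy sequence are made up for illustration; a real chromosome would be read from a FASTA file, and the number of N runs you find that way won't necessarily match the official gap count.

```python
import re

def unknown_runs(sequence):
    """Return (start, end) pairs, 0-based and end-exclusive, for each run of N."""
    return [m.span() for m in re.finditer(r"[Nn]+", sequence)]

# Hypothetical toy sequence, not real genome data
toy = "ACGTNNNNNACGTTNNACG"

gaps = unknown_runs(toy)
unknown = sum(end - start for start, end in gaps)

print(gaps)
print(f"{unknown} of {len(toy)} bases unknown")
```

Summing the run lengths over every chromosome is one way to arrive at a total for the unidentified part of an assembly.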

The parts of the genome that are missing contain abundant short repeats that are difficult to assemble into a single contiguous stretch of chromosome. The latest sequencing technology can produce reads of several thousand base pairs of fairly accurate sequence so in theory it may be possible to generate enough overlapping long reads to cover some of the highly repetitive stretches. Some scientists doubt that the effort will be worth it (Miga, 2015).

Miga, K.H. (2015) Completing the human genome: the progress and challenge of satellite DNA assembly. Chromosome Research, 23:421-426. [doi: 10.1007/s10577-015-9488-2]

Morton, N.E. (1991) Parameters of the human genome. Proceedings of the National Academy of Sciences, 88:7474-7476. [doi: 10.1073/pnas.88.17.7474]


John Harshman said...

Quibble: the amount that's been sequenced is surely 100%, many times over. The problem is with assembly.

Vince said...

A start to full human chromosome sequencing has been made. End-to-end sequencing of the X chromosome.

Unknown said...

Quick typo in first mention of 3.2Mb in the second paragraph. Assuming it is all just repetitive elements the unknown regions, getting a glimpse of any important functional role would give more motivation for sequencing it... I for one would give a crack at it if I knew of any hints at functional importance.

Unknown said...

Unfortunately I don't seem to have a name. I didn't realize until today. I'm Freeman, one of your undergrads in biochemistry many years ago.

Larry Moran said...

Thanks. I fixed the typos.

Larry Moran said...

This post is the setup for posts on the sequence of the Y chromosome centromere, the sequences of the X chromosome, and the amount of junk in the centromere regions. Be patient.