Comments on Sandwalk: How Much of Our Genome Is Sequenced?

The latest build is GRCh38.p10 (Jan. 6, 2017). The...

2017-03-27T15:03:05.022-04:00

The latest build is GRCh38.p10 (Jan. 6, 2017). The total amount of sequenced DNA is 3,080,585,178 bp. The best estimate of the total size of the genome as judged by the remaining 875 gaps is 3,241,953,429 bp. Thus, 95% of the genome has been sequenced.
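As a quick check of that arithmetic, here is a minimal Python sketch. The function name is only illustrative; the two figures are the ones quoted above.

```python
def fraction_sequenced(sequenced_bp: int, estimated_total_bp: int) -> float:
    """Return the fraction of the estimated genome size that is sequenced."""
    return sequenced_bp / estimated_total_bp


# Figures quoted above for GRCh38.p10.
sequenced = 3_080_585_178        # total sequenced DNA, in bp
estimated_total = 3_241_953_429  # estimated genome size including gaps, in bp

frac = fraction_sequenced(sequenced, estimated_total)
print(f"{frac:.1%} sequenced, {1 - frac:.1%} still in gaps")
```

The result rounds to the 95% and 5% figures used here.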

The remaining 5% is mostly highly repetitive sequence and it's very unlikely that an exact sequence of those remaining gaps will ever be published. It's not worth the effort.

Miga, K. H. (2015). Completing the human genome: the progress and challenge of satellite DNA assembly. Chromosome Research, 23(3), 421-426. doi: 10.1007/s10577-015-9488-2

Most of these "codons" are highly repeti...

2017-03-27T11:04:42.246-04:00

Most of these "codons" are highly repetitive sequences that are hard to deal with, especially to assemble. Don't count on anything in the near future. And of course they aren't codons, which live in exons.

I did a count of the total number of codons in Chr...

2017-03-27T02:58:46.409-04:00

I did a count of the total number of codons in Chromosome 1, and the number of unsequenced codons (NNN). My figures are
82972321 codons in total, of which 6154457 are NNN
So 7.4% of Chromosome 1 is as yet unsequenced. Does anyone know when or where these hidden sequences might be available?
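For anyone who wants to repeat a count like this, a rough Python sketch follows. It assumes the chromosome is in a plain FASTA file and that "codons" here means consecutive non-overlapping triplets read from the first base, which is only a guess at how the figures above were obtained. The function name and the file name in the comment are hypothetical.

```python
from typing import Iterable, Tuple


def count_triplets(lines: Iterable[str]) -> Tuple[int, int]:
    """Count non-overlapping triplets and how many of them are 'NNN'.

    `lines` is any iterable of FASTA lines. Header lines (starting with '>')
    and blank lines are skipped. Triplets are read in frame from the first
    base, carrying leftover bases across line breaks.
    """
    total = 0
    unsequenced = 0
    carry = ""  # 0-2 leftover bases that have not yet filled a triplet
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        seq = carry + line.upper()
        usable = len(seq) - len(seq) % 3
        for i in range(0, usable, 3):
            total += 1
            if seq[i:i + 3] == "NNN":
                unsequenced += 1
        carry = seq[usable:]
    return total, unsequenced


# Tiny in-memory example. A real run would pass an open file instead,
# e.g. open("chr1.fa") (hypothetical file name).
example = [">demo", "NNNNNNAC", "GTNNNA"]
total, unsequenced = count_triplets(example)
print(total, unsequenced, f"{unsequenced / total:.1%}")
```

Note that a triplet containing only one or two N bases is not counted as unsequenced in this sketch, so the percentage slightly understates the gap content.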

I've known that non coding regions are very, v...

2016-12-22T07:38:36.642-05:00

I've known that non coding regions are very, very important since 1968.

I'm not concerned about your "feelings" with respect to the difference between humans and chimpanzees. Deal with the facts.

Hello Prof, We learned so much about Human genome ...

2016-12-22T06:54:38.924-05:00

Hello Prof, We have learned so much about the human genome that we now know non-coding regions are very, very important. I feel that saying the difference between human and chimp is only 2% is not correct. What do you think?

I was thinking of the operons for the large riboso...

2015-09-25T09:55:33.670-04:00

I was thinking of the operons for the large ribosomal RNAs (45S) when I said that the gaps have been closed. There are five clusters in the human genome. They are at 13p12, 14p12, 15p12, 21p12, and 22p12. There are about 140 repeats in each cluster.

I don't know if that's also true of the much smaller 5S RNA genes. I doubt that there's a reliable sequence spanning that cluster. (It's on chromosome 1 at 1q42.) There are about 100 copies with a repeat length of 2.2 kb.
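To give a sense of scale, here is a back-of-the-envelope Python sketch using only the approximate numbers quoted above. The function name is illustrative, and the results are rough estimates rather than measured sizes.

```python
def cluster_span_bp(copies: int, repeat_length_bp: int) -> int:
    """Approximate span of a tandem repeat cluster, in base pairs."""
    return copies * repeat_length_bp


# 5S RNA genes: about 100 copies, each repeat about 2.2 kb (2,200 bp).
five_s_span_bp = cluster_span_bp(copies=100, repeat_length_bp=2_200)

# 45S ribosomal RNA genes: five clusters with about 140 repeats each.
total_45s_repeats = 5 * 140

print(f"5S cluster: roughly {five_s_span_bp / 1000:.0f} kb")
print(f"45S repeats across all five clusters: roughly {total_45s_repeats}")
```

So the 5S cluster would span something like 220 kb of nearly identical tandem repeats, which is why assembling it is so difficult.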

Very informative blog post! I have one question th...

2015-09-25T04:40:03.408-04:00

Very informative blog post! I have one question though: You say "In the case of ribosomal RNA gene clusters, these gaps have now been closed.", but in the linked post you say "The reported human genome sequences do not contain the region of the 5S RNA genes. It is impossible to clone and assemble fragments of repeated DNA sequences.". So what is it? Have the gaps been closed or is there only meta information about copy numbers and the like?

There was so much useful info (potential explanati...

2013-11-19T16:13:49.213-05:00

There was so much useful info (potential explanations for statements in the literature that appear contradictory) in this post and some of the replies that I just had to say thank you. So Thanks.

In an introductory Bioinformatics course held in m...

2012-07-28T05:50:26.794-04:00

In an introductory Bioinformatics course held in my training programme (GTPB) a naïve student "accidentally landed" in the p arm of Chr 13 (Human). No sequences are available to play with. Actually, in classes we display the image of the human karyotype from genome browsers and we often forget that the little caps over the p-arms of Chr 13, 14, 15, 21 and 22 represent a substantial number of base pairs, almost their full lengths, for which the sequence has not been assembled yet. The reasons for this are experimental (lack of data, due to chromatin) and computational (low complexity makes assembly difficult).
Not having transcripts in such a large region has implications at several levels.

For the whole of the Human Genome
Assembly: GRCh37.p7, Feb 2009
Database version: 67.37
Base Pairs: 3,287,209,763
Golden Path Length: 3,101,804,739

The Golden Path Length is a little less than 6% shorter than the total base pairs.
A gross measure, certainly, but not as small as a layman would guess from reading the news in the popular press.
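That percentage can be reproduced from the two Ensembl figures listed above. This is just a minimal Python sketch of the arithmetic, with illustrative variable names.

```python
# Figures quoted above for GRCh37.p7 (Ensembl database version 67.37).
base_pairs = 3_287_209_763          # "Base Pairs"
golden_path_length = 3_101_804_739  # "Golden Path Length"

difference_bp = base_pairs - golden_path_length
shortfall = difference_bp / base_pairs

print(f"Difference: {difference_bp:,} bp ({shortfall:.2%} of total base pairs)")
```

The difference works out to about 185 million bp, or roughly 5.6% of the total.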

What strikes me most is that this fact is often ignored in the literature, where, for example, statements about copy numbers and copy number variations refer to genes found in the sequence that is known and made available, not to the still unassembled sequences in human, where the numbers could be different...

May I suggest to add something about synteny to th...

2012-02-07T23:27:03.112-05:00

May I suggest adding something about synteny to the content of your course? IMO synteny maps impressively display the relatedness among e.g. mammalian species and the degree of chromosomal rearrangements that took place during evolution.

Thanks! Very useful (I knew about the problems and...

2012-02-07T15:09:03.139-05:00

Thanks! Very useful (I knew about the problems and the complexity, but I did not know about the genome reference projects).

Many of the regions may be covered by sequence, bu...

2012-02-07T12:52:26.769-05:00

Many of the regions may be covered by sequence, but are difficult to assemble due to the biological complexity of the genome and the technical limitations of next generation sequencing.
On the difficulty of assembling NGS data:
http://www.ncbi.nlm.nih.gov/pubmed/22147368
http://www.ncbi.nlm.nih.gov/pubmed/21926179
On the complexity of the human genome
http://www.ncbi.nlm.nih.gov/pubmed/21030649
How the human assembly is now being managed:
http://www.ncbi.nlm.nih.gov/pubmed/21750661

Also see: http://genomereference.org

mathbionerd: you bring up a really interesting poi...

2012-02-07T09:01:54.033-05:00

mathbionerd: you bring up a really interesting point. Work from Evan Eichler's group shows that about 50% of gaps in the euchromatic part of the assembly are polymorphic. The reference assembly is really a mosaic and not meant to represent any one person. However, if one wanted to get a rough approximation of a haploid genome size, using the stats for the Primary assembly is the way to go, and we could continue to nitpick about the sex chromosomes.

Is it really fair to ask about the size of "t...

2012-02-06T22:58:56.515-05:00

Is it really fair to ask about the size of "the" human genome (only half joking)? I wonder how much we should expect genome size to vary between individuals based on large scale variation that has already been studied (e.g. Copy Number Variants), or other repetitive sequences? Do you know if anyone has ever studied variation in centromere or telomere size between individuals?

On a different thought, regarding the sex chromosomes, I think it might be a little misleading to count the length of each autosome once, and then add the length of both the X and the Y. Considering the total number of bases sequenced in some haploid reference human genome (e.g. 3.2Gb) implies that doubling it would equal the total expected number of bases, but the sex chromosomes throw a wrench into the mix. To be correct, there should probably be two haploid estimates (although given the relatively small size of the Y, they wouldn't differ too much), one including the X, and one including the Y (because no healthy haploid genome would contain both the X and the Y).

But maybe I'm just nit-picking. :)

Note: The 'Total number of sequenced bases in ...

2012-02-06T21:59:40.317-05:00

Note: The 'Total number of sequenced bases in the assembly' represents the sequenced bases and the gaps, but the sequenced bases also include regions of the genome that contain structural variants. For example, there are 8 representations of the MHC region in the assembly now: the one incorporated in the chromosome, plus 7 additional alternate paths. The reference assembly is no longer just a flattened representation of a haploid genome, as we know this does not accurately represent biology.
I encourage you to read our paper that describes the assembly model: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3130012/
But the short answer is that adding up all of the bases in the assembly isn't going to give you the size of the genome. The current assembly only attempts to represent the euchromatic portion of the genome, so large, important but difficult to sequence regions are represented by Ns. Additionally, complex regions for which we have good data may have more than one representation.

So, not even after sequencing more than 30,000 hum...

2012-02-06T20:19:45.164-05:00

So, even after sequencing more than 30,000 human genomes (not all deep sequencing though), we still have not covered all regions?

Hum, there is something seriously wrong with how the projects are being managed. I would have guessed that someone in the teams would be interested in putting together a complete model genome. Even if the model was bound to "evolve" as more sequences were obtained.

There are an awful lot of RNA genes. I guess mice ...

2012-02-06T19:46:21.673-05:00

There are an awful lot of RNA genes. I guess mice have as many as humans?

How do you know the "12,523 genes that make a...

2012-02-06T18:51:18.658-05:00

How do you know the "12,523 genes that make an RNA product that's not translated" really aren't translated? This has been bugging me for a while.

Thanks. Clear as mud. :-)

2012-02-06T15:12:05.467-05:00

Thanks. Clear as mud. :-)

http://www.ensembl.org/Help/Glossary "The go...

2012-02-06T14:50:12.353-05:00

http://www.ensembl.org/Help/Glossary

"The golden path is the length of the reference assembly. It consists of the sum of all top-level sequences in the seq_region table, omitting any redundant regions such as haplotypes and PARs (pseudoautosomal regions)."

In other words, the golden path length = total sequenced bases plus estimated sizes of the gaps. The term "golden path" comes from restriction mapping and refers to the best assembly of the various restriction fragments in order. Since they quote 3,101,804,739 bp rather than 3,173,036,847 bp, I presume they're using an older genome build.

I'm not quite sure how they then get up to a total of 3,283,984,159 bp. From the glossary it looks as though the higher total includes duplicated regions like the pseudoautosomal regions of the sex chromosomes, and perhaps other large contigs like MHC haplotypes.

22 autosomes plus two different sex chromosomes (X...

2012-02-06T13:25:27.878-05:00

22 autosomes plus two different sex chromosomes (X & Y) equals 24 chromosomes

22 pairs of autosomes and a pair of sex chromosome...

2012-02-06T13:12:20.004-05:00

22 pairs of autosomes and a pair of sex chromosomes. The sequences of the X and Y chromosomes bring it to 24.

I struggle to keep up, but I do appreciate your po...

2012-02-06T11:58:22.322-05:00

I struggle to keep up, but I do appreciate your postings, Prof. Moran. But I have to ask - 24? I thought there were 23?

(22 pairs of autosomes and one pair of sex chromosomes)

Sorry if it is a stupid question.