Tuesday, August 27, 2019

First complete sequence of a human chromosome

A paper announcing the first complete sequence of a human chromosome has recently been posted on the bioRxiv server.

Miga, K. H., Koren, S., Rhie, A., Vollger, M. R., Gershman, A., Bzikadze, A., Brooks, S., Howe, E., Porubsky, D., Logsdon, G. A., et al. (2019) Telomere-to-telomere assembly of a complete human X chromosome. bioRxiv, 735928. doi: 10.1101/735928

Abstract: After nearly two decades of improvements, the current human reference genome (GRCh38) is the most accurate and complete vertebrate genome ever produced. However, no one chromosome has been finished end to end, and hundreds of unresolved gaps persist. The remaining gaps include ribosomal rDNA arrays, large near-identical segmental duplications, and satellite DNA arrays. These regions harbor largely unexplored variation of unknown consequence, and their absence from the current reference genome can lead to experimental artifacts and hide true variants when re-sequencing additional human genomes. Here we present a de novo human genome assembly that surpasses the continuity of GRCh38, along with the first gapless, telomere-to-telomere assembly of a human chromosome. This was enabled by high-coverage, ultra-long-read nanopore sequencing of the complete hydatidiform mole CHM13 genome, combined with complementary technologies for quality improvement and validation. Focusing our efforts on the human X chromosome, we reconstructed the ∼2.8 megabase centromeric satellite DNA array and closed all 29 remaining gaps in the current reference, including new sequence from the human pseudoautosomal regions and cancer-testis ampliconic gene families (CT-X and GAGE). This complete chromosome X, combined with the ultra-long nanopore data, also allowed us to map methylation patterns across complex tandem repeats and satellite arrays for the first time. These results demonstrate that finishing the human genome is now within reach and will enable ongoing efforts to complete the remaining human chromosomes.
The authors focused their efforts on the X chromosome from a cell line that is effectively haploid, so it has only one version of each chromosome. This is important because the missing regions of the chromosomes in the current reference genome consist of long stretches of repetitive DNA, and there is considerable variation in the human population at these sites [see How much of the human genome has been sequenced?]. In diploid cells the two homologues will almost certainly differ, making it difficult to assign sequenced DNA to the correct chromosome.

In the current release of the human reference genome (GRCh38), the assembled sequence of the X chromosome has three large gaps: one at the centromere (CENX in the figure above) and two at large segmental duplications (DMRTC1 and another near the tip of the long arm). In addition, there are 26 smaller gaps in the sequence, for a total of 29.

The authors employed new sequencing technology to generate ultra-long reads of more than 100,000 bp. These sequences are less accurate than the shorter reads that were used to generate the reference genome but that limitation can be overcome by generating a large number of overlapping reads that cancel out the errors. In this case, they produced a whole genome sequence from 39x coverage combined with shorter reads from a previous 70x coverage to give an overall accuracy of at least 99.99%.
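As a rough illustration of why deep coverage compensates for noisy reads, here is a back-of-envelope sketch in Python. It assumes, unrealistically, that every read errs independently at each base with the same probability. The 10% per-read error rate and the function name are illustrative assumptions, not figures or methods from the paper.

```python
from math import ceil, comb


def majority_error_upper_bound(depth, per_read_error):
    """Probability that at least half of `depth` independent reads are wrong at one base.

    This is an upper bound on the error of a simple majority vote, because the
    wrong reads would also have to agree on the same wrong base.
    """
    threshold = ceil(depth / 2)
    return sum(
        comb(depth, k) * per_read_error**k * (1 - per_read_error) ** (depth - k)
        for k in range(threshold, depth + 1)
    )


# Hypothetical 10% per-read error rate, evaluated at a few depths
# including the 39x coverage mentioned in the post.
for depth in (1, 5, 15, 39):
    print(depth, majority_error_upper_bound(depth, 0.10))
```

Under these assumptions the bound falls steeply as depth increases. Real nanopore errors are partly systematic rather than independent, which is one reason the authors also relied on shorter reads and complementary technologies for polishing and validation.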

The result was extensive closure of existing gaps with the exception of the centromeric region. In some chromosomes the only missing sequence was at the centromere. In the case of the X chromosome, there were three large contigs shown in orange and blue in the figure. One of them spanned the centromere region in the diagram but this is misleading since that region has been collapsed in the reference genome. There is actually a large gap in the top orange contig. The two other gaps, at the junctions of the orange and blue contigs, span the segmental duplications. Note that all the other gaps in the GRCh38 reference genome were closed in the initial assembly.

The two gaps at the segmental duplications were closed by manually assembling the data from the ultra-long reads and confirming the assembly with data from other techniques. (The assembly software couldn't handle the assembly.)

The centromere region was the major challenge because it consists of about 2.8 Mb (2800 Kb) of highly repetitive satellite DNA containing thousands of copies of α-satellite monomers (about 171 bp each) and other AT-rich repeats that are much shorter [Centromere DNA]. The long sequence reads were correctly aligned and assembled by identifying site-specific single-nucleotide variants and using them as anchors to create a contiguous array. This was the same technique used last year to sequence the centromere of the Y chromosome (Jain et al., 2018).
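To make the anchoring idea concrete, here is a toy sketch in Python. It is not the authors' pipeline: it assumes the repeat copies are already aligned, equal in length, and free of sequencing errors. The function name and the example sequences are hypothetical. It only shows how a base found in exactly one copy can mark that copy's place in an otherwise identical array.

```python
from collections import Counter


def unique_variant_anchors(copies):
    """Return (copy_index, position, base) for each base seen in exactly one copy.

    `copies` is a list of equal-length, pre-aligned repeat units (an assumption
    made for this toy example).
    """
    anchors = []
    for pos, column in enumerate(zip(*copies)):
        counts = Counter(column)
        if len(counts) == 1:
            continue  # every copy agrees here, so the site is uninformative
        for idx, base in enumerate(column):
            if counts[base] == 1:
                anchors.append((idx, pos, base))
    return anchors


# Four made-up repeat copies; two of them carry a private single-base variant.
copies = [
    "ACGTACGTAC",
    "ACGTACCTAC",  # copy 1 has C instead of G at position 6
    "ACGTACGTAC",
    "TCGTACGTAC",  # copy 3 has T instead of A at position 0
]
anchors = unique_variant_anchors(copies)
print(anchors)
```

A long read that carries one of these private variants can be placed at that copy, and reads that span two or more anchors fix the order of the copies between them.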

The figure below shows the Y-chromosome centromere in order to illustrate the complexity of the centromeric region.

The central part of the centromere sequence contains 52 higher order repeats (HOR) of α-satellite sequence. Each one contains about 34 monomers (light blue). In addition, there are three stretches of variant HOR regions that do not match the more common HOR (purple). The central region is surrounded by a pericentromeric region consisting of highly diverged α-satellite sequences (AT-rich DNA, dark blue). This is a typical arrangement for human centromeres except that the pericentromeric regions are often larger and the HORs contain different numbers of α-satellite monomers. (The most common HOR in the X chromosome is a 12-mer.)
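The numbers quoted above can be checked with some quick arithmetic, sketched here in Python. The calculation assumes every monomer is exactly 171 bp and, for the X chromosome, that the whole 2.8 Mb array is made of the 12-mer HOR. Both are simplifications, and the function names are made up for the example.

```python
MONOMER_BP = 171  # approximate length of one α-satellite monomer, from the post


def hor_array_bp(hor_copies, monomers_per_hor, monomer_bp=MONOMER_BP):
    """Total length in bp of an array of identical higher order repeats."""
    return hor_copies * monomers_per_hor * monomer_bp


def hor_copies_in_array(array_bp, monomers_per_hor, monomer_bp=MONOMER_BP):
    """Number of HOR copies that would fill an array of the given length."""
    return array_bp / (monomers_per_hor * monomer_bp)


# Y chromosome: 52 HOR copies of about 34 monomers each.
y_core_bp = hor_array_bp(52, 34)

# X chromosome: roughly 2.8 Mb array, most common HOR is a 12-mer.
x_hor_copies = hor_copies_in_array(2_800_000, 12)

print(f"Y core array: about {y_core_bp:,} bp")
print(f"X array: about {x_hor_copies:,.0f} copies of the 12-mer HOR")
```

On these assumptions the central Y array comes to about 0.3 Mb, while the X array would hold well over a thousand copies of its 12-mer HOR, which gives a sense of how much larger the X centromere problem was.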

The first successful assembly of a human chromosome is a significant achievement, but the significance lies less in assembling the centromeric region than in closing all the other gaps. In fact, the true significance of the paper is in achieving high-quality ultra-long sequence reads of 100 Kb to >1000 Kb and in producing enough of these to achieve an average of 39-fold coverage of the entire genome. The authors conclude that by some quality metrics their new genome sequence is better than the current reference standard.

Image Credit: The drawing of a centromere is from Alberts et al. (2002) Figure 4-50.

Jain, M., Olsen, H.E., Turner, D.J., Stoddart, D., Bulazel, K.V., Paten, B., Haussler, D., Willard, H.F., Akeson, M., and Miga, K.H. (2018) Linear assembly of a human centromere on the Y chromosome. Nature Biotechnology, 36:321-327. doi: 10.1038/nbt.4109


John Harshman said...

Could you explain why the cell line is effectively haploid?

Larry Moran said...

The cell line is CHM13hTERT derived from a molar pregnancy (hydatidiform mole). This happens when a sperm fertilizes an egg that has no nucleus. The sperm chromosomes are duplicated as the cells divide so the cell line derived from such a tissue contains two sets of identical chromosomes.

The karyotype of the CHM13 line is 46,XX.