Thursday, July 09, 2020

Structure and expression of the SARS-CoV-2 (coronavirus) genome

Coronaviruses are RNA viruses, which means that their genome is RNA, not DNA. All of the coronaviruses have similar genomes but I'm sure you are mostly interested in SARS-CoV-2, the virus that causes COVID-19. The first genome sequence of this virus was determined by Chinese scientists in early January and it was immediately posted on a public server [GenBank MN908947]. The viral RNA came from a patient in intensive care at the Wuhan Yin-Tan Hospital (China). The paper was accepted on Jan. 20th and it appeared in the Feb. 3rd issue of Nature (Zhou et al. 2020).

By the time the paper came out, several universities and pharmaceutical companies had already constructed potential therapeutics and several others had already cloned the genes and were preparing to publish the structures of the proteins.¹

By now there are dozens and dozens of sequences of SARS-CoV-2 genomes from isolates in every part of the world. They are all very similar because the mutation rate in these RNA viruses is not high (about 10⁻⁶ per nucleotide per replication). The original isolate has a total length of 29,891 nt not counting the poly(A) tail. Note that these RNA viruses are about four times larger than a typical retrovirus; they are the largest known RNA viruses.
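To see what those two numbers imply together, here is a back-of-the-envelope sketch. It uses only the genome length and mutation rate quoted above and assumes, as a simplification, that copying errors are independent and equally likely at every site. The variable names are just illustrative.

```python
GENOME_LENGTH_NT = 29_891   # original isolate, not counting the poly(A) tail
MUTATION_RATE = 1e-6        # per nucleotide per replication (figure quoted above)

# Expected number of new mutations each time one genome is copied.
expected_mutations = GENOME_LENGTH_NT * MUTATION_RATE

# Probability that a copy carries no new mutation at all,
# assuming independent errors at every site (a simplification).
p_error_free = (1 - MUTATION_RATE) ** GENOME_LENGTH_NT

print(f"expected mutations per copy: {expected_mutations:.3f}")
print(f"fraction of error-free copies: {p_error_free:.3f}")
```

Under these assumptions, roughly 0.03 new mutations arise per copy, so about 97% of newly copied genomes are identical to their template. That fits with the isolates being so similar to each other.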

The RNA genome that's inside the virus particle looks very much like a typical eukaryotic mRNA molecule. It has a 5′ cap and a 3′ poly(A) tail of about 40-50 nucleotides. This RNA is translated by the host protein synthesis components as soon as it is injected into the cell.

The genome contains a number of genes where the word "gene" is used to define the open reading frame of the proteins produced by the virus. The initial translation products are two large polyproteins that are subsequently cleaved by proteases to produce smaller proteins. Most of the time the viral RNA is translated to give the 1a polyprotein (~460 kDa) that is subsequently cleaved to produce 11 distinct non-structural proteins (nsps). Sometimes the ribosomes stall near the stop codon when they encounter a frameshift element (FSE) containing a "slippery site" that causes the ribosomes to slip back one nucleotide (a −1 frameshift). This shifts the reading frame so that the stop codon is no longer read, and translation continues into the 1b gene. The large 1ab polyprotein (~780 kDa) produces another five proteins after cleavage.
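The effect of a frameshift on the reading frame is easier to see with a toy example. The sketch below assumes the −1 model, in which the ribosome backs up one nucleotide at the end of the slippery site and carries on in the new frame. The 26-nucleotide RNA is invented for illustration, and the motif UUUAAAC is the slippery sequence commonly reported for coronaviruses rather than something taken from this post. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, built from the usual UCAG ordering ('*' = stop).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(rna, start=0):
    """Translate from `start` until a stop codon or the end of the RNA."""
    protein = []
    for i in range(start, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def translate_with_minus1_shift(rna, slippery="UUUAAAC"):
    """Translate in frame 0 up to the end of the slippery site,
    then back up one nucleotide and continue in the -1 frame."""
    site = rna.find(slippery)
    if site == -1:
        return translate(rna)
    shift_at = site + len(slippery)
    if shift_at % 3 != 0:
        # Toy restriction: the site must end on a frame-0 codon boundary.
        raise ValueError("slippery site is not aligned with frame 0")
    upstream = translate(rna[:shift_at])
    downstream = translate(rna, start=shift_at - 1)
    return upstream + downstream


# Invented RNA: frame 0 hits a UAA stop shortly after the slippery site.
TOY_RNA = "AUGGCUUUAAACGGGUAAUGGCCUGA"

short_product = translate(TOY_RNA)                   # analogous to 1a
long_product = translate_with_minus1_shift(TOY_RNA)  # analogous to 1ab
print(short_product, long_product)
```

Without the shift, the toy ribosome stops at the UAA codon and makes the short product. With the shift, that UAA is split across two codons in the new frame, so translation runs on to the next in-frame stop. Real frameshifting is probabilistic and depends on RNA structure downstream of the slippery site; none of that is modeled here.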

The functions of many (but not all) of these proteins have been discovered. Nsp12 is an RNA-dependent RNA polymerase (RdRp). This is the enzyme that copies the viral RNA to produce more infectious RNAs, but it also produces a number of other transcripts (see below). RdRp is part of a large replication-transcription complex (RTC) that includes a number of accessory proteins (nsp2, nsp4, nsp6, nsp7+nsp8, nsp9, and nsp10). The exact functions of all these accessory proteins haven't been worked out in detail.

Nsp3 is a papain-like protease (PLpro) and nsp5 is a 3C-like cysteine protease (3CLpro). They are responsible for cleaving polyproteins 1a and 1ab.

Nsp13 is a 5′→3′ helicase (Hel) that's required for transcription. Nsp14 is a 3′→5′ exonuclease involved in proofreading. Nsp15 appears to be a uridine-specific endonuclease and nsp16 is an S-adenosylmethionine methyltransferase.

The open reading frames at the 3′ end of the viral RNA cannot be translated because of the stop codon at the end of the 1ab "gene." Production of these proteins (e.g. S, M, E etc.) has to wait until later in the life cycle of the virus, after the assembly of the RTC. As we shall see shortly, the synthesis of these late proteins involves a complicated process that requires production of many different transcripts.

The injected virus RNA is a (+) strand so production of new viral RNA requires two rounds of transcription. First, the RTC binds to the 3′ end of the (+) strand and copies it all the way to the 5′ end, producing a (-) strand. This strand is then copied to produce new (+) strands that can be incorporated into new virus particles. The new (+) strands also act as messenger RNA to produce more 1a and 1ab polyproteins.²
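The two rounds of copying amount to taking the reverse complement twice. Here is a minimal sketch of that idea, using a made-up RNA fragment and a hypothetical helper named `copy_strand`. It ignores the cap, the poly(A) tail, and everything about how the complex actually initiates.

```python
# Watson-Crick pairing for RNA.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def copy_strand(template_5to3):
    """Return the strand synthesized from a template, written 5'->3'.

    The polymerase reads the template 3'->5', so the product is the
    reverse complement of the template.
    """
    return template_5to3.translate(COMPLEMENT)[::-1]


plus_strand = "AUGGCUUUAAACGGG"          # invented fragment of a (+) strand
minus_strand = copy_strand(plus_strand)  # first round: (+) -> (-)
new_plus = copy_strand(minus_strand)     # second round: (-) -> (+)

print(minus_strand)
print(new_plus == plus_strand)
```

The (-) strand has a different sequence and cannot serve as a message for the viral proteins, which is why the second round is needed to regenerate (+) strands.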

Transcription from the 3′ end of the (+) strand also produces a group of subgenomic RNAs (sgRNAs). The 3′ end contains a number of transcription-regulating sequences (TRS-B) consisting of a 10 nucleotide AU-rich stretch of RNA. There is another TRS (TRS-L) at the 5′ end next to a leader sequence (L). When the RTC encounters a TRS it will pause and this may cause it to switch and continue transcription at TRS-L. This produces an sgRNA consisting of a stretch from the 3′ end (body) joined to the leader sequence at the 5′ end (leader).
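The leader-to-body joining described above can be sketched with a toy genome. Everything in this example is invented: the 10-nucleotide AU-rich motif standing in for the TRS, the leader, and the G/C-only gene bodies (chosen so the motif cannot occur inside them by accident). For readability the sketch works in (+)-sense coordinates, even though the switch happens while the complementary strand is being made.

```python
TRS = "AUUAAUAUUA"      # hypothetical AU-rich TRS motif, not the real one
LEADER = "GGCCGGCC"     # invented leader sequence

# Invented gene bodies, listed in 5'->3' order along the genome.
ORF1AB = "GCGCGCGCGCGC"
LATE_GENES = [("S", "CCGGCC"), ("M", "GGGCCC"), ("N", "CGCGCG")]

# Leader, TRS-L, ORF1ab, then a TRS-B in front of each late gene.
genome = LEADER + TRS + ORF1AB + "".join(TRS + body for _, body in LATE_GENES)


def find_all(seq, motif):
    """Return the start index of every occurrence of motif in seq."""
    positions, i = [], seq.find(motif)
    while i != -1:
        positions.append(i)
        i = seq.find(motif, i + 1)
    return positions


def subgenomic_rnas(genome, trs):
    """Join the leader to the body that starts at each TRS-B.

    The first TRS is treated as TRS-L; each later one is a TRS-B.
    Every sgRNA is leader + (TRS-B through the 3' end of the genome).
    """
    sites = find_all(genome, trs)
    leader = genome[:sites[0]]
    return [leader + genome[pos:] for pos in sites[1:]]


sgrnas = subgenomic_rnas(genome, TRS)
for (name, _), sg in zip(LATE_GENES, sgrnas):
    print(name, len(sg), sg)
```

Each sgRNA shares the same leader and the same 3′ end, so the set is nested: the S sgRNA is the longest and contains all the downstream genes, while the N sgRNA is the shortest. In each one, the gene immediately after the leader is the one that sits at the 5′ end and is available for translation.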

The example shown below shows template switching between a TRS-B located at the 5′ end of the S gene and TRS-L to produce an S sgRNA. This sgRNA is then transcribed to produce an mRNA that can be translated to produce S protein.

Each of the genes at the 3′ end of the virus genome is associated with a TRS-B sequence so transcription from the 3′ end produces nine different sgRNAs corresponding to the nine functional genes. (Open reading frame 10 is not a functional gene.) The figure on the right is from Kim et al. (2020).

Some of these "late" genes are required for assembly of new virus particles. S is the gene for the trimeric spike protein that mediates attachment of the virus to the ACE2 receptor on the surface of the host cell. M is a membrane glycoprotein; it is the most abundant structural protein. E is the envelope protein. N is the nucleocapsid protein that binds RNA and helps package it into the virus particle.

Reading frame 3 seems to produce two proteins, 3a and 3b. It's likely that 3a is an ion channel protein on the virus surface. Proteins 7 and 8 are additional viral assembly proteins. I don't know the function of protein 6 and I'm not sure if anyone else knows. Many coronaviruses don't make protein 6.

DISCLAIMER: I am not an expert on coronaviruses. Everything in this post is stuff I have learned in the past few days from reading published papers. Feel free to correct all the mistakes I have made.

1. The behavior of these Chinese scientists doesn't match with the conspiracy theory that China engineered this virus—perhaps they weren't in on the conspiracy? :-)

2. I don't know how the transcription complex manages to copy right to the ends of the viral RNA. It seems to involve some complicated RNA secondary structures but I didn't bother reading the relevant papers.

References and Bibliography

Bar-On, Y.M., Flamholz, A., Phillips, R. and Milo, R. (2020) Science Forum: SARS-CoV-2 (COVID-19) by the numbers. eLife 9: e57309. [doi: 10.7554/eLife.57309]

Kim, D., Lee, J-Y., Yang, J-S., Kim, J.W., Kim, V.N., and Chang, H. (2020) The Architecture of SARS-CoV-2 Transcriptome. Cell 181: 914-921. [doi: 10.1016/j.cell.2020.04.011]

Fung, T.S. and Liu, D.X. (2019) Human coronavirus: host-pathogen interaction. Annual Review of Microbiology 73: 529-557. [doi: 10.1146/annurev-micro-020518-115759]

Sawicki, S.G., Sawicki, D.L. and Siddell, S.G. (2007) A contemporary view of coronavirus transcription. Journal of Virology 81(1): 20-29. [doi: 10.1128/JVI.01358-06]

Zhou, P., Yang, X.-L., Wang, X.-G., Hu, B., Zhang, L., Zhang, W., Si, H.-R., Zhu, Y., Li, B. and Huang, C.-L. (2020) A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 579(7798): 270-273. [doi: 10.1038/s41586-020-2012-7]


  1. Wouldn't it be more appropriate to use nt rather than be?

  2. Sorry, the autocorrection turned bp into be.

    1. Yes, nucleotides is better. I thought I had made all the corrections but I clearly missed a lot. Fixed now.

  3. Though you state that the Coronavirus currently responsible for this pandemic seems to have a low mutation rate, I still wonder how different have the most divergent strains of the current Coronavirus pandemic become so far?
    In absolute numbers, how many mutations separate them from their common ancestor as it was back in December-January, at this point?

    1. From my Facebook post ...

      "SARS-CoV-2 is evolving at a normal rate from the initial isolate first identified and sequenced in China. Tung Phan at the University of Pittsburgh (USA) has compared the sequences of 86 strains from all over the world and he has identified a total of 93 mutations including three deletions (see figure below). Forty-two of these mutations are missense mutations located in the ORF1ab polyprotein gene, the spike surface glycoprotein gene, the matrix protein gene, and the nucleocapsid protein gene. No mutations in the envelope protein genes were found.

      Phan, T. (2020). "Genetic diversity and evolution of SARS-CoV-2." Infection, genetics and evolution 81: 104260. [doi: j.meegid.2020.104260]"

    2. The most divergent SARS-CoV-2 strain found so far, from Senegal, reportedly contains 24 mutations.

  4. There is an interesting post about the sequence and structure of Covid19 written by Carl Zimmer in the New York Times.