Friday, October 28, 2011

The Core Genome

Hundreds of genomes have been sequenced. It should be relatively easy to search all these genomes to identify those genes that are found in every single species. This small class of genes should represent the core genome—the genes that were probably present in the first living cell.

Turns out it's not that easy. For one thing, you have to remove parasitic bacteria from your set of genomes because these species could easily be getting by without some essential genes that are supplied by their hosts. Next you have to make sure you have a huge variety of different species that cover all possible forms of life. In practice, this means that you need about 300 different genomes, mostly bacteria.
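The basic search is easy to picture as a set intersection. Here's a minimal sketch, assuming each genome has already been reduced to a set of ortholog family names. The genome names, gene labels, and the `core_genome` function are all made up for illustration; real pipelines need an ortholog-detection step that is much harder than this.

```python
# Hypothetical toy data: genome name -> set of ortholog family labels.
# None of these sets reflect real annotations.
genomes = {
    "free_living_A": {"rpoB", "rpsC", "aldolase_II", "PDC"},
    "free_living_B": {"rpoB", "rpsC", "aldolase_I", "PFOR"},
    "free_living_C": {"rpoB", "rpsC", "aldolase_I", "aldolase_II", "PDC"},
    "parasite_D": {"rpoB"},  # relies on its host for everything else
}
parasites = {"parasite_D"}


def core_genome(genomes, exclude=frozenset()):
    """Return the families present in every genome that is not excluded."""
    gene_sets = [genes for name, genes in genomes.items() if name not in exclude]
    if not gene_sets:
        return set()
    return set.intersection(*gene_sets)


print(sorted(core_genome(genomes)))                     # parasites included
print(sorted(core_genome(genomes, exclude=parasites)))  # parasites removed
```

In this toy example, leaving the parasite in shrinks the core to a single family, which is why those genomes have to be removed first.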

I'm reading The Logic of Chance: The Nature and Origin of Biological Evolution, by Eugene Koonin. This is just one of many books that are critical of the most popular views of evolution. Most of these books are written by kooks or religious nutters but some of them are valid scientific critiques of modern evolutionary theory. Koonin's book is one of those and I agree with most of what he has to say. One of his topics is genome evolution.

As Koonin describes it, the first genome comparisons looked at Haemophilus influenzae and Mycoplasma genitalium, two species of bacteria that are only distantly related. There were about 240 orthologous genes found in both species.1 The first surprise was that this core set was missing some very important members that should have been there.

Some essential metabolic reactions must have been catalyzed by enzymes in the very first cells but the Haemophilus enzyme isn't present in Mycoplasma and vice versa. It took a bit of digging but eventually the problem was solved with the discovery of different enzymes that carried out the same reaction. The genes for these enzymes are completely unrelated.

As more and more genomes were sequenced, the size of the core genome set shrank until today it comprises fewer than 100 genes. Most of these genes are genes for the three ribosomal RNAs, about 30 tRNAs, and a few other essential RNA molecules. There are only about 33 protein-encoding genes in the universal core set. They include genes for the three large RNA polymerase subunits and 30 proteins required for translation (mostly ribosomal proteins).

DNA polymerase isn't in the core set because some species of bacteria have unusual DNA polymerases that replicate DNA just fine but are unrelated to the enzymes found in most cells. There are multiple, unrelated versions of the aminoacyl tRNA synthetases, the enzymes that attach an amino acid to its cognate tRNA. Some species have one version and other species have the second version. Some species have both. In any case, no single synthetase gene is found in every species so it's not part of the core set.

Koonin refers to this observation as non-orthologous gene displacement (NOGD). He envisages a scenario where a cell with gene X takes up a copy of a non-orthologous gene (gene Y) that catalyzes the same reaction. Over time the newly acquired gene displaces the original version. In this way a non-orthologous version (e.g. gene Y) could have arisen after the formation of the first cell and spread to a variety of different species by horizontal gene transfer. The scenario doesn't rule out the possibility that the two non-orthologous versions could have arisen independently in two separate origins of life but this seems less likely.

Let's look at a couple of examples. Biochemistry textbook writers have known for decades that there are different versions of some common metabolic genes.2 The aldolase enzyme in gluconeogenesis and glycolysis is a classic. Some species have the class I enzyme/gene while others have the class II enzyme/gene. Some species have both.

This is an example of convergent evolution. The enzymes have different mechanisms and, as you can see from the figure, completely different structures. It doesn't seem to matter if a species has a class I enzyme or a class II enzyme since both enzymes are very good at catalyzing the fusion of two three-carbon molecules into a six-carbon fructose molecule or cleaving the six-carbon molecule in the reverse reaction.

The pyruvate dehydrogenase complex (PDC) is a huge enzyme that catalyzes an important metabolic reaction making acetyl-CoA, the substrate for the citric acid cycle. It seemed likely that every single species would have the genes for all of the PDC subunits but many species of bacteria were missing the entire complex. They have a different enzyme, pyruvate:ferredoxin oxidoreductase, that catalyzes a similar reaction. The enzymes have completely different mechanisms and are unrelated.

In this case, we have reason to believe that the enzyme requiring ferredoxin is more primitive and the more common pyruvate dehydrogenase complex evolved later. The PDC genes displaced the gene for pyruvate:ferredoxin oxidoreductase in many, but not all, species. That's why the genes for neither enzyme are part of the core set.
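To see why NOGD knocks a function out of the core set even though every species can carry out the reaction, here's a small sketch that compares a core defined by gene families with a core defined by functions. The species, the family labels, and the family-to-function mapping are hypothetical and simplified (PDC and the ferredoxin enzyme are lumped under one function label even though their reactions are only similar).

```python
# Hypothetical mapping from ortholog family to the reaction it carries out.
family_function = {
    "aldolase_I": "aldolase reaction",
    "aldolase_II": "aldolase reaction",
    "PDC": "pyruvate to acetyl-CoA",
    "PFOR": "pyruvate to acetyl-CoA",
    "rpoB": "transcription",
}

# Hypothetical species, each reduced to a set of family labels.
genomes = {
    "species_1": {"aldolase_I", "PDC", "rpoB"},
    "species_2": {"aldolase_II", "PFOR", "rpoB"},
    "species_3": {"aldolase_I", "aldolase_II", "PDC", "rpoB"},
}


def core_families(genomes):
    """Families found in every species (the ortholog-based core set)."""
    return set.intersection(*genomes.values())


def core_functions(genomes, family_function):
    """Functions every species can perform, whichever family supplies them."""
    function_sets = [
        {family_function[family] for family in families}
        for families in genomes.values()
    ]
    return set.intersection(*function_sets)


def displaced_functions(genomes, family_function):
    """Universal functions that no single universal family accounts for."""
    covered = {family_function[family] for family in core_families(genomes)}
    return core_functions(genomes, family_function) - covered


print(sorted(core_families(genomes)))
print(sorted(displaced_functions(genomes, family_function)))
```

In this toy data only the RNA polymerase gene survives in the family-based core, while both metabolic functions are universal but supplied by unrelated families. That's the pattern NOGD leaves behind.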

We don't know whether the existing core set of fewer than 100 genes truly represents genes that were present in the first living cell or whether these genes completely displaced the original versions. The fact that many of them are part of large operons might have made it easier for them to spread by horizontal gene transfer (the selfish operon model).

The bottom line is that attempts to reconstruct the genome of the first cell have failed because of NOGD and we now have to incorporate that concept into our way of thinking about early evolution. The good news is that the evolution of completely new genes seems to be much easier than we first imagined. We even have examples of three or four completely different enzymes carrying out the same reaction.3

1. Koonin refers to conserved genes as Clusters of Orthologous Genes or COGs. The COG approach actually counts conserved domains rather than entire genes but the differences aren't great so I'll just refer to them as genes.

2. That is, those textbook writers who emphasize comparative biochemistry or an evolutionary approach to biochemistry. Some textbooks just cover human (mammalian) biochemistry so they won't even mention whether bacteria do biochemistry.

3. I'm not sure how the Intelligent Design Creationists explain these observations. Maybe there were several different designers who each came up with their ideal solution to the problem? Maybe there was only one designer who just got a kick out of making different versions of the same enzyme activity but got bored at only two or three?


  1. In practice, this means that you need about 300 different genomes, mostly bacteria.
    What do you think to the hypothesis that viruses should be included in the mix - i.e. that substantial chunks of the bacterial / archaeal / eukaryotic gene repertoires may ultimately have a viral origin?


  2. A minor point of clarification: The Core Genome Hypothesis is the view that what makes asexual bacterial species is a shared set of core genes. It is not widely adopted, but this might cause some small confusion.

    Wertz, J. E., C. Goldstone, D. M. Gordon, and M. A. Riley. 2003. A molecular phylogeny of enteric bacteria and implications for a bacterial species concept. J Evol Biol 16 (6):1236-1248.

  3. Koonin has one of the strangest accents I've ever heard. This, coupled with his bizarre vocal mannerisms, had me transfixed and mesmerised while watching a video of one of his lectures at the From RNA to Humans Symposium.

    The lecture was quite good too.

  4. It's Mycoplasma genitalium, not Micrococcus genitalium, isn't it?

  5. Larry, is "isozymes" a good term to refer to these convergent non-homologous enzymes that perform the same function?