Lateral gene transfer (LGT), or horizontal gene transfer (HGT), is widespread in bacteria. It leads to the creation of pangenomes for many bacterial species where different subpopulations contain different subsets of genes that have been incorporated from other species. It also leads to confusing phylogenetic trees such that the history of bacterial evolution looks more like a web of life than a tree [The Web of Life].
Bacterial-like genes are also found in eukaryotes. Many of them are related to genes found in the ancestors of modern mitochondria and chloroplasts and their presence is easily explained by transfer from the organelle to the nucleus. Eukaryotic genomes also contain examples of transposons that have been acquired from bacteria. That's also easy to understand because we know how transposons jump between species.
Tuesday, November 07, 2017
Contaminated genome sequences
The authors of the original draft of the human genome sequence claimed that hundreds of genes had been acquired from bacteria by lateral gene transfer (LGT) (Lander et al., 2001). This claim was abandoned when the "finished" sequence was published a few years later (International Human Genome Consortium, 2004) because others had shown that the data was easily explained by differential gene loss in other lineages or by bacterial contamination in the draft sequence (see Salzberg, 2017).
Thursday, November 02, 2017
Parental age and the human mutation rate
Mutation
-definition
-mutation types
-mutation rates
-phylogeny
-controversies
Mutations are mostly due to errors in DNA replication. We have a pretty good idea of the accuracy of DNA replication: the overall error rate is about 10⁻¹⁰ per bp. There are about 30 cell divisions in females between zygote and formation of all egg cells. In males, there are about 400 mitotic cell divisions between zygote and formation of sperm cells. Using these average values, we can calculate the number of mutations per generation. It works out to about 130 mutations per generation [Estimating the Human Mutation Rate: Biochemical Method].
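The biochemical calculation can be sketched in a few lines. This is only a rough illustration: the error rate and the numbers of cell divisions come from the paragraph above, but the haploid genome size is not given there, so the 3.2 × 10⁹ bp used here is an assumption (and the function name is made up for this example).

```python
def mutations_per_generation(error_rate_per_bp, haploid_genome_bp,
                             female_divisions, male_divisions):
    """Expected new mutations in a zygote from replication errors alone."""
    # Mutations introduced each time one haploid genome is copied.
    per_replication = error_rate_per_bp * haploid_genome_bp
    # The egg and the sperm each carry the errors of their own lineage.
    return per_replication * (female_divisions + male_divisions)


# 1e-10 errors per bp, 30 female and 400 male divisions are from the text.
# The 3.2e9 bp haploid genome size is an ASSUMPTION, not stated above.
estimate = mutations_per_generation(1e-10, 3.2e9, 30, 400)
print(f"about {estimate:.0f} mutations per generation")
```

With a genome size between 3.0 and 3.2 Gb the result lands in the range of roughly 130 to 140, which is the ballpark quoted above.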
This value is similar to the estimate from comparing the sequences of different species (e.g. human and chimpanzee) based on the number of differences and the estimated time of divergence. This assumes that most of the genome is evolving at the rate expected for fixation of neutral alleles. This phylogenetic method gives a value of about 112 mutations per generation [Estimating the Human Mutation Rate: Phylogenetic Method].
The third way of measuring the mutation rate is to directly compare the genome sequence of a child and both parents (trios). After making corrections for false positives and false negatives, this method yields values of 60-100 mutations per generation depending on how the data is manipulated [Estimating the Human Mutation Rate: Direct Method]. The lower values from the direct method call into question the dates of the split between the various great ape lineages. This controversy has not been resolved [Human mutation rates] [Human mutation rates - what's the right number?].
It's clear that males contribute more to evolution than females. There's about a ten-fold difference in the number of cell divisions in the male line compared to the female line; therefore, we expect there to be about ten times more mutations inherited from fathers. This difference should depend on the age of the father since the older the father the more cell divisions required to produce sperm.
This effect has been demonstrated in many publications. A maternal age effect has also been postulated but that's been more difficult to prove. The latest study of Icelandic trios helps to nail down the exact effect (Jónsson et al., 2017).
The authors examined 1,548 trios consisting of parents and at least one offspring. They analyzed 2,682 Mb of genome sequence (84% of the total genome) and discovered an average of 70 mutation events per child.¹ This gives an overall mutation rate of 83 mutations per generation with an average generation time of 30 years. This is consistent with previous results.
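The step from 70 observed mutations to 83 per generation is just a correction for the fraction of the genome that was analyzed. Here is a minimal sketch of that scaling; the function name is hypothetical, and the per-year figure is my own back-of-envelope extension that assumes mutations accumulate evenly over a 30-year generation.

```python
def scale_to_whole_genome(observed_mutations, fraction_covered):
    """Extrapolate a mutation count from the analyzed part of the genome."""
    return observed_mutations / fraction_covered


# 70 mutations per child in 84% of the genome (both values from the text).
per_generation = scale_to_whole_genome(70, 0.84)

# ASSUMPTION: spread evenly over a 30-year generation time.
per_year = per_generation / 30

print(f"{per_generation:.0f} per generation, {per_year:.1f} per year")
```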
Jónsson et al. looked at 225 cases of three generation data in order to make sure that the mutations were germline mutations and not somatic cell mutations. They plotted the numbers of mutations against the age of the father and mother to produce the following graph from Figure 1 of their paper.
Look at parents who are 30 years old. At this age, females contribute about 10 mutations and males contribute about 50. This is only a five-fold difference, much less than we expect from the number of cell divisions. This suggests that the initial estimates of 400 cell divisions in males might be too high.
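The mismatch is easy to see when the two ratios are put side by side. The sketch below uses only the round numbers quoted in this post. The last line is a deliberately naive back-calculation: it assumes that mutations scale strictly with the number of cell divisions and that the figure of 30 divisions in females is right, neither of which is established here.

```python
def fold_difference(male_value, female_value):
    """How many times larger the male value is than the female value."""
    return male_value / female_value


# Expected from cell divisions (400 in males, 30 in females).
expected_ratio = fold_difference(400, 30)

# Observed at parental age 30 (about 50 paternal, 10 maternal mutations).
observed_ratio = fold_difference(50, 10)

# NAIVE ASSUMPTION: mutations are proportional to cell divisions only.
implied_male_divisions = observed_ratio * 30

print(f"expected {expected_ratio:.1f}-fold, observed {observed_ratio:.1f}-fold")
print(f"implied male cell divisions: {implied_male_divisions:.0f}")
```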
An age effect on mutations from the father is quite apparent and expected. A maternal age effect has previously been hypothesized but this is the first solid data that shows such an effect. The authors speculate that oocytes accumulate mutations with age, particularly mutations due to strand breakage.
¹ Of these, 93% were single nucleotide changes and 7% were small deletions or insertions.
Jónsson, H., Sulem, P., Kehr, B., Kristmundsdottir, S., Zink, F., Hjartarson, E., Hardarson, M.T., Hjorleifsson, K.E., Eggertsson, H.P., and Gudjonsson, S.A. (2017) Parental influence on human germline de novo mutations in 1,548 trios from Iceland. Nature, 549:519-522. [doi: 10.1038/nature24018]
Saturday, October 28, 2017
Creationists questioning pseudogenes: the GULO pseudogene
This is the second post discussing creationist¹ papers on pseudogenes. The first post addressed a paper by Jeffrey Tomkins on the β-globin pseudogene [Creationists questioning pseudogenes: the beta-globin pseudogene]. This post covers another paper by Tomkins claiming that the GULO pseudogenes in various primate species are not derived from a common ancestor but instead have been deactivated independently in each lineage.
The Tomkins article was published in 2014 in Answers Research Journal, a publication that describes itself like this:
ARJ is a professional, peer-reviewed technical journal for the publication of interdisciplinary scientific and other relevant research from the perspective of the recent Creation and the global Flood within a biblical framework.
Saturday, October 14, 2017
Creationists questioning pseudogenes: the beta-globin pseudogene
Jonathan Kane recently (Oct. 6, 2017) posted an article on The Panda's Thumb where he claimed that Young Earth Creationists often don't get enough credit for raising serious issues about evolution [Five principles for arguing against creationism].
He mentioned some articles about pseudogenes as prime examples. I asked him for references and he responded with two articles by Jeffrey Tomkins that were published on the Answers in Genesis website. The first was on the β-globin pseudogene and the second was on the GULO pseudogene. Both articles claim that these DNA sequences aren't really pseudogenes because they have functions.
I'll deal with the β-globin pseudogene in this post and the GULO pseudogene in a subsequent post.
Wednesday, September 13, 2017
Sequencing human diploid genomes
Monday, September 11, 2017
What's in Your Genome?: Chapter 4: Pervasive Transcription (revised)
I'm working (slowly) on a book called What's in Your Genome?: 90% of your genome is junk! The first chapter is an introduction to genomes and DNA [What's in Your Genome? Chapter 1: Introducing Genomes]. Chapter 2 is an overview of the human genome. It's a summary of known functional sequences and known junk DNA [What's in Your Genome? Chapter 2: The Big Picture]. Chapter 3 defines "genes" and describes protein-coding genes and alternative splicing [What's in Your Genome? Chapter 3: What Is a Gene?].
Chapter 4 is all about pervasive transcription and genes for functional noncoding RNAs. I've finally got a respectable draft of this chapter. This is an updated summary; the first version is at: What's in Your Genome? Chapter 4: Pervasive Transcription.
Saturday, September 09, 2017
Cold Spring Harbor tells us about the "dark matter" of the genome (Part I)
This is a podcast from Cold Spring Harbor [Dark Matter of the Genome, Pt. 1 (Base Pairs Episode 8)]. The authors try to convince us that most of the genome is mysterious "dark matter," not junk. The main theme is that the genome contains transposons that could play an important role in evolution and disease.
Wednesday, August 30, 2017
Experts meet to discuss non-coding RNAs - fail to answer the important question
There's a reason why this question is important. It's because we have every reason to believe that spurious transcription is common in large genomes like ours. Spurious, or accidental, transcription occurs when the transcription initiation complex binds nonspecifically to sites in the genome that are not real promoters. Spurious transcription also occurs when the initiation complex (RNA polymerase plus factors) fires in the wrong direction from real promoters. Binding and inappropriate transcription are aided by the binding of transcription factors to nonpromoter regions of the genome, a well-known feature of all DNA binding proteins [see Are most transcription factor binding sites functional?].
Friday, August 25, 2017
How much of the human genome is devoted to regulation?
One of the common rationalizations is to speculate that while humans may have "only" 25,000 genes they are regulated and controlled in a much more sophisticated manner than the genes in other species. It's this extra level of control that makes humans special. Such speculations have been around for almost fifty years but they have gained in popularity since publication of the human genome sequence.
In some cases, the extra level of regulation is thought to be due to abundant regulatory RNAs. This means there must be tens of thousands of extra genes expressing these regulatory RNAs. John Mattick is the most vocal proponent of this idea and he won an award from the Human Genome Organization for "proving" that his speculation is correct! [John Mattick Wins Chen Award for Distinguished Academic Achievement in Human Genetic and Genomic Research]. Knowledgeable scientists know that Mattick is probably wrong. They believe that most of those transcripts are junk RNAs produced by accidental transcription at very low levels from non-conserved sequences.
Friday, July 14, 2017
Revisiting the genetic load argument with Dan Graur
The genetic load argument is one of the oldest arguments for junk DNA and it's one of the most powerful arguments that most of our genome must be junk. The concept dates back to J.B.S. Haldane in the late 1930s but the modern argument traditionally begins with Hermann Muller's classic paper from 1950. It has been extended and refined by him and many others since then (Muller, 1950; Muller, 1966).
Sunday, July 02, 2017
Confusion about the number of genes
[According to Ensembl86] the human genome encodes 58,037 genes, of which approximately one-third are protein-coding (19,950), and yields 198,093 transcripts. By comparison, the mouse genome encodes 48,709 genes, of which half are protein-coding (22,018 genes), and yields 118,925 transcripts overall.
The very latest Ensembl estimates (April 2017) for Homo sapiens and Mus musculus are similar. The difference in gene numbers between mouse and human is not significant according to the authors ...
The discrepancy in total number of annotated genes between the two species is unlikely to reflect differences in underlying biology, and can be attributed to the less advanced state of the mouse annotation.
This is correct but it doesn't explain the other numbers. There's general agreement on the number of protein-coding genes in mammals. They all have about 20,000 such genes. There is no agreement on the number of genes for functional noncoding RNAs. In its latest build, Ensembl says there are 14,727 lncRNA genes, 5,362 genes for small noncoding RNAs, and 2,222 other genes for noncoding RNAs. The total number of non-protein-coding genes is 22,311.
There is no solid evidence to support this claim. It's true there are many transcripts resembling functional noncoding RNAs but claiming these identify true genes requires evidence that they have a biological function. It would be okay to call them "potential" genes or "possible" genes but the annotators are going beyond the data when they decide that these are actually genes.
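For anyone who wants to check the gene counts quoted above, here is a quick tally. It only re-adds the numbers given in this post (the Ensembl 86 figures from the review and the noncoding categories from the latest build). The dictionary keys are just labels for this example, not Ensembl's own category names.

```python
# Noncoding gene counts quoted from the latest Ensembl build.
noncoding_genes = {
    "lncRNA": 14_727,
    "small noncoding RNA": 5_362,
    "other noncoding RNA": 2_222,
}
total_noncoding = sum(noncoding_genes.values())

# Protein-coding fractions from the Ensembl 86 numbers in the review.
human_fraction = 19_950 / 58_037
mouse_fraction = 22_018 / 48_709

print(f"noncoding genes: {total_noncoding:,}")
print(f"protein-coding: human {human_fraction:.0%}, mouse {mouse_fraction:.0%}")
```

The human fraction comes out close to one-third, as stated, while the mouse fraction is somewhat under one-half.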
Breschi et al. mention the number of transcripts. I don't know what method Ensembl uses to identify a functional transcript. Are these splice variants of protein-coding genes?
The rest of the review discusses the similarities between human and mouse genes. They point out, correctly, that about 16,000 protein-coding genes are orthologous. With respect to lncRNAs they discuss all the problems in comparing human and mouse lncRNA and conclude that "... the current catalogues of orthologous lncRNAs are still highly incomplete and inaccurate." There are several studies suggesting that only 1,000-2,000 lncRNAs are orthologous. Unfortunately, there's very little overlap between the two most comprehensive studies (189 lncRNAs in common).
There are two obvious possibilities. First, it's possible that these RNAs are just due to transcriptional noise and that's why the ones in the mouse and human genomes are different. Second, all these RNAs are functional but the genes have arisen separately in the two lineages. This means that about 10,000 genes for biologically functional lncRNAs have arisen in each of the genomes over the past 100 million years.
Breschi et al. don't discuss the first possibility.
Breschi, A., Gingeras, T.R., and Guigó, R. (2017) Comparative transcriptomics in human and mouse. Nature Reviews Genetics [doi: 10.1038/nrg.2017.19]
Genome size confusion
Breschi, A., Gingeras, T.R., and Guigó, R. (2017) Comparative transcriptomics in human and mouse. Nature Reviews Genetics [doi: 10.1038/nrg.2017.19]
Cross-species comparisons of genomes, transcriptomes and gene regulation are now feasible at unprecedented resolution and throughput, enabling the comparison of human and mouse biology at the molecular level. Insights have been gained into the degree of conservation between human and mouse at the level of not only gene expression but also epigenetics and inter-individual variation. However, a number of limitations exist, including incomplete transcriptome characterization and difficulties in identifying orthologous phenotypes and cell types, which are beginning to be addressed by emerging technologies. Ultimately, these comparisons will help to identify the conditions under which the mouse is a suitable model of human physiology and disease, and optimize the use of animal models.
I was confused by the comments made by the authors when they started comparing the human and mouse genomes. They said,
The most recent genome assemblies (GRC38) include 3.1 Gb and 2.7 Gb for human and mouse respectively, with the mouse genome being 12% smaller than the human one.
I think this statement is misleading. The size of the human genome isn't known with precision but the best estimate is 3.2 Gb [How Big Is the Human Genome?]. The current "golden path length" according to Ensembl is 3,096,649,726 bp [Human assembly and gene annotation]. It's not at all clear what this means and I've found it almost impossible to find out; however, I think it approximates the total amount of sequenced DNA in the latest assembly plus an estimate of the size of some of the gaps.
The golden path length for the mouse genome is 2,730,871,774 bp. [Mouse assembly and gene annotation]. As is the case with the human genome, this is NOT the genome size. Not as much mouse DNA sequence has been assembled into a contiguous and accurate assembly as is the case with humans. The total mouse sequence is at about the same stage the human genome assembly was a few years ago.
If you look at the mouse genome assembly data you see that 2,807,715,301 bp have been sequenced and there are 79,356,856 bp in gaps. That's about 2.89 Gb, which doesn't match the golden path length and doesn't match the past estimates of the mouse genome size.
We don't know the exact size of the mouse genome. It's likely to be similar to that of the human genome but it could be a bit larger or a bit smaller. The point is that it's confusing to say that the mouse genome is 12% smaller than the human one. What the authors could have said is that less of the mouse genome has been sequenced and assembled into accurate contigs.
If you go to the NCBI site for Homo sapiens you'll see that the size of the genome is 3.24 Gb. The comparable size for Mus musculus is 2.81 Gb. That's about 13% smaller than the human genome size (put the other way around, the human figure is about 15% larger than the mouse one). How accurate is that?
There's a problem here. With all this sequence information, and all kinds of other data, it's impossible to get an accurate scientific estimate of the total genome sizes.
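Part of the confusion is that the percentages shift depending on which numbers you pick and which genome you use as the denominator. The sketch below simply redoes the arithmetic with the figures quoted in this post (Ensembl golden path lengths, the mouse assembly totals, and the NCBI sizes). The helper function is hypothetical and nothing here is an independent estimate of genome size.

```python
def percent_smaller(smaller, larger):
    """Percent by which `smaller` falls short of `larger`."""
    return 100 * (larger - smaller) / larger


# Ensembl golden path lengths (bp), as quoted above.
human_golden = 3_096_649_726
mouse_golden = 2_730_871_774

# Mouse assembly: sequenced bases plus gaps (bp), as quoted above.
mouse_total = 2_807_715_301 + 79_356_856

golden_diff = percent_smaller(mouse_golden, human_golden)
ncbi_diff = percent_smaller(2.81, 3.24)           # mouse relative to human
ncbi_diff_vs_mouse = 100 * (3.24 - 2.81) / 2.81   # human relative to mouse

print(f"golden path: mouse is {golden_diff:.1f}% smaller")
print(f"mouse sequenced + gaps: {mouse_total / 1e9:.3f} Gb")
print(f"NCBI: {ncbi_diff:.1f}% smaller, or {ncbi_diff_vs_mouse:.1f}% larger")
```

The golden path lengths give a difference of about 12%, the mouse sequenced-plus-gaps total is about 2.887 Gb, and the NCBI sizes give about 13% or about 15% depending on the denominator.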
[Image Credit: Wikipedia: Creative Commons Attribution 2.0 Generic license]
Wednesday, March 08, 2017
What's in Your Genome? Chapter 4: Pervasive Transcription
I'm working (slowly) on a book called What's in Your Genome?: 90% of your genome is junk! The first chapter is an introduction to genomes and DNA [What's in Your Genome? Chapter 1: Introducing Genomes]. Chapter 2 is an overview of the human genome. It's a summary of known functional sequences and known junk DNA [What's in Your Genome? Chapter 2: The Big Picture]. Chapter 3 defines "genes" and describes protein-coding genes and alternative splicing [What's in Your Genome? Chapter 3: What Is a Gene?].
Chapter 4 is all about pervasive transcription and genes for functional noncoding RNAs.
Chapter 4: Pervasive Transcription
- How much of the genome is transcribed?
- How do we know about pervasive transcription?
- Different kinds of noncoding RNAs
- Box 4-1: Long noncoding RNAs (lncRNAs)
- Understanding transcription
- Box 4-2: Revisiting the Central Dogma
- What the scientific papers don’t tell you
- Box 4-3: John Mattick proves his hypothesis?
- On the origin of new genes
- The biggest blow to junk?
- Box 4-4: How do you tell if it’s functional?
- Biochemistry is messy
- Evolution as a tinkerer
- Box 4-5: Dealing with junk RNA
- Change your worldview
What's in Your Genome? Chapter 3: What Is a Gene?
I'm working (slowly) on a book called What's in Your Genome?: 90% of your genome is junk! The first chapter is an introduction to genomes and DNA [What's in Your Genome? Chapter 1: Introducing Genomes]. Chapter 2 is an overview of the human genome. It's a summary of known functional sequences and known junk DNA [What's in Your Genome? Chapter 2: The Big Picture]. Here's the TOC entry for Chapter 3: What Is a Gene? The goal is to define "gene" and determine how many protein-coding genes are in the human genome. (Noncoding genes are described in the next chapter.)
Chapter 3: What Is a Gene?
- Defining a gene
- Box 3-1: Philosophers and genes
- Counting Genes
- Misleading statements about the number of genes
- Introns and the evolution of split genes
- Introns are mostly junk
- Box 3-2: Yeast loses its introns
- Alternative splicing
- Box 3-2: Competing databases
- Alternative splicing and disease
- Box 3-3: The false logic of the argument from complexity
- Gene families
- The birth & death of genes
- Box 3-4: Real orphans in the human genome
- Different kinds of pseudogenes
- Box 3-5: Conserved pseudogenes and Ken Miller’s argument against intelligent design
- Are they really pseudogenes?
- How accurate is the genome sequence?
- The Central Dogma of Molecular Biology
- ENCODE proposes a “new” definition of “gene”
- What is noncoding DNA?
- Dark matter
Monday, March 06, 2017
What's in Your Genome? Chapter 2: The Big Picture
I'm working (slowly) on a book called What's in Your Genome?: 90% of your genome is junk! I thought I'd post the TOC for each chapter as I finish the first drafts. Here's chapter 2.
Chapter 2: The Big Picture
- How much of the genome has been sequenced?
- Whose genome was sequenced?
- How many genes?
- Pseudogenes
- Regulatory sequences
- Origins of replication
- Centromeres
- Telomeres
- Scaffold Attachment regions (SARs)
- Transposons
- Viruses
- Mitochondrial DNA (NumtS)
- How much of our genome is functional?
What's in Your Genome? Chapter 1: Introducing Genomes
I'm working (slowly) on a book called What's in Your Genome?: 90% of your genome is junk! I thought I'd post the TOC for each chapter as I finish the first drafts. Here's chapter 1.
Chapter 1: Introducing Genomes
- The genome war
- What is DNA?
- Chromatin
- How big is your genome?
- Active genes?
- What do you need to know?
Sunday, February 12, 2017
ENCODE workshop discusses function in 2015
A reader directed me to a 2015 ENCODE workshop with online videos of all the presentations [From Genome Function to Biomedical Insight: ENCODE and Beyond]. The workshop was sponsored by the National Human Genome Research Institute in Bethesda, Md (USA). The purpose of the workshop was ...
- Discuss the scientific questions and opportunities for better understanding genome function and applying that knowledge to basic biological questions and disease studies through large-scale genomics studies.
- Consider options for future NHGRI projects that would address these questions and opportunities.
Friday, January 06, 2017
Genetic variation in the human population
With a current population size of over 7 billion, the human population should contain a huge amount of genetic variation. Most of it resides in junk DNA so it's of little consequence. We would like to know more about the amount of variation in functional regions of the genome because it tells us something about population genetics and evolutionary theory.
A recent paper in Nature (Aug. 2016) looked at a large dataset of 60,706 individuals. They sequenced the protein-coding regions of all these people to see what kind of variation existed (Lek et al., 2016) (ExAC). The group included representatives from all parts of the world although it was heavily weighted toward Europeans. The authors used a procedure called "principal component analysis" (PCA) to cluster the individuals according to their genetic characteristics. The analysis led to the typical clustering by "population clusters." (That term is used to avoid the words "race" and/or "subspecies.")
Thursday, January 05, 2017
Birth and death of genes in a hybrid frog genome
De novo genes¹ are quite rare but genome duplications are quite common. Sometimes the duplicated regions contain genes so the new genome contains two copies of a gene that was formerly present in only one copy. "Common" in this sense means on a scale of millions of years. Michael Lynch and his colleague have calculated that the rate of fixed gene duplication is about 0.01 per gene per million years (Lynch and Conery, 2003 a,b; Lynch 2007). Since a typical vertebrate has more than 20,000 genes, this means that 200 genes will be duplicated and fixed every million years.
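The figure of 200 follows directly from multiplying the per-gene rate by the number of genes. A minimal sketch, using the rate and gene count quoted above (the function name and the optional time-span parameter are mine, added only for illustration):

```python
def fixed_duplications(rate_per_gene_per_myr, gene_count, million_years=1):
    """Expected number of gene duplications fixed over a span of time."""
    return rate_per_gene_per_myr * gene_count * million_years


# 0.01 per gene per million years and 20,000 genes are from the text.
per_million_years = fixed_duplications(0.01, 20_000)
print(f"about {per_million_years:.0f} fixed duplications per million years")
```

Since 20,000 is a lower bound on the number of genes in a typical vertebrate, the real expectation would be somewhat higher.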
The initial duplication event is likely to be deleterious since there will now be redundant DNA in the genome. The slightly deleterious allele (duplication) can be purged by negative selection in species with large population sizes (e.g. bacteria). But in species with smaller populations, natural selection is not powerful enough to eliminate slightly deleterious alleles so the duplication persists and may become fixed in the population.