The concept of a gene is a fundamental part of the fields of genetics, molecular biology, evolution and all the rest of biology. Gene concepts can be divided into two main categories: abstract and physical. Abstract genes are the kind we refer to when we talk about genes “for” a certain trait, including many genetic diseases. Most geneticists and many evolutionary biologists use an abstract gene concept.
Philosophers have coined the term “Gene-P” for the abstract gene concept. The “P” stands for “phenotype,” indicating that this gene concept defines a gene by its phenotypic effects and not its physical structure.
Physical genes consist of stretches of DNA with a beginning and an end. These are molecular genes that can be cloned and sequenced. Philosophers call them “Gene-D” where “D” stands for “development”—a very unfortunate choice.
This essay describes various modern definitions of physical genes (Gene-D). I like to define a gene as “a DNA sequence that’s transcribed” but that’s a bit too brief for a formal definition. We need to include something that restricts the definition of gene to those entities that are biologically significant. Hence,
A gene is a DNA sequence that is transcribed to produce a functional product.
This eliminates those parts of the chromosome that are transcribed by accident or error. These regions are significant in large genomes; in fact, the confusion between accidental transcripts and real transcripts is responsible for the overestimates of gene number in many genome projects. (In technical parlance, most ESTs are artifacts and the sequences they come from are not genes.)
We could refine the definition by including RNA genes but that’s such an insignificant percentage of all genes that the refinement is hardly worth it. As we shall see, there are more significant limitations to the definition.
This “DNA sequence that’s transcribed” definition describes a physical entity. Let’s examine a simple molecular gene to see how the definition applies.
This is a simple bacterial protein-encoding gene. The horizontal line represents a stretch of double-stranded DNA with the rectangular part being the gene. The gene is copied into RNA as shown by the arrow below the gene. This process is called transcription. Transcription begins when the transcription enzyme (RNA polymerase) binds to a promoter region (P) and starts copying the DNA beginning at the initiation site (i). The DNA is copied until a termination site (t) is reached at the end of the gene. According to my preferred definition of a gene, it starts at “i” and ends at “t.”
The part of the gene that’s transcribed includes the coding region, shown in black. This is the part of the gene that contains sequential codons specifying the amino acid sequence of the protein. At the beginning of the gene, called the 5ʹ (5-prime) end, there’s a short stretch of sequence that will be transcribed but not translated into protein. This 5ʹ untranslated region (5ʹ UTR) will contain various signals for starting protein synthesis.
The other end of the gene is called the 3ʹ (3-prime) end and there’s almost always a stretch that’s transcribed but not translated (3ʹ UTR). The 3ʹ UTR contains signals that cause transcription termination and also signals that regulate translation.
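The layout described above can be sketched as a small data structure. This is only an illustration: the class name `BacterialGene`, its field names, and the coordinates are all hypothetical, and positions are assumed to be 0-based, half-open offsets along one strand of DNA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BacterialGene:
    """Toy model of a bacterial protein-encoding gene (hypothetical names).

    All positions are 0-based, half-open offsets on one DNA strand.
    """
    initiation: int   # "i": first transcribed position
    cds_start: int    # first base of the coding region
    cds_end: int      # one past the last base of the coding region
    termination: int  # "t": one past the last transcribed position

    def gene_span(self):
        # The gene runs from the initiation site to the termination site.
        return (self.initiation, self.termination)

    def five_prime_utr(self):
        # Transcribed but not translated, upstream of the coding region.
        return (self.initiation, self.cds_start)

    def coding_region(self):
        return (self.cds_start, self.cds_end)

    def three_prime_utr(self):
        # Transcribed but not translated, downstream of the coding region.
        return (self.cds_end, self.termination)


# Made-up coordinates, purely for illustration.
gene = BacterialGene(initiation=100, cds_start=130, cds_end=1030, termination=1100)
print(gene.gene_span())        # extent of the gene under the "transcribed" definition
print(gene.five_prime_utr())
print(gene.three_prime_utr())
```

Under this sketch the promoter sits outside `gene_span()`, because it lies upstream of the initiation site and is not transcribed.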
There are regions upstream of the promoter that control whether or not the gene is transcribed. These regions are called regulatory regions. They may contain binding sites for various proteins that will attach there in order to enhance the binding of RNA polymerase to the promoter. One of the differences between my preferred definition of a gene and others is that some other definitions include the promoter and the regulatory region.
There are two problems with such definitions. First, they’re not consistent with standard usage when we talk about the regulation of gene expression. We don’t say that only “part” of a gene is transcribed, which would be correct if we included the regulatory region in our definition of a gene. How often have we heard anyone say that regulatory sequences control the expression of part of the gene? That doesn’t make sense.
Second, by including regulatory sequences in the definition of a gene the actual extent of the gene becomes ill-defined. For most genes, we don’t know where all the regulatory sequences are located so we don’t know for sure where the gene begins or ends. Furthermore, there are some regulatory sequences, especially in eukaryotes, that are not contiguous with the gene and this leads to “genes” that are split into various pieces. It’s much easier to use a definition like “a DNA sequence that’s transcribed” because it defines a start and an end.
The organization of a typical eukaryote gene is shown below.
The main difference between this type of gene and a typical bacterial gene is the presence of introns and exons. These genes are transcribed from an initiation site to a termination site just like bacterial genes. When the RNA transcript is finished it undergoes an additional step called RNA processing. In that step, parts of the original transcript are spliced out and discarded. These parts correspond to the introns in the gene, shown as thinner rectangular regions within the gene.
Note that the coding region (black) can be interrupted by these introns so the final messenger RNA (mRNA) cannot be translated until RNA processing is completed. The important point for our purposes is that the introns are part of the gene since they are transcribed.
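A toy example may make the relationship between the primary transcript and the mRNA more concrete. The function `splice`, the sequence, and the intron coordinates below are invented for illustration; real splicing depends on sequence signals and cellular machinery that this sketch ignores.

```python
def splice(primary_transcript, introns):
    """Remove introns and join the remaining exons.

    `introns` is a list of (start, end) pairs, 0-based and half-open,
    assumed not to overlap.
    """
    exons = []
    position = 0
    for start, end in sorted(introns):
        exons.append(primary_transcript[position:start])  # keep the exon
        position = end                                    # skip the intron
    exons.append(primary_transcript[position:])           # final exon
    return "".join(exons)


# Made-up primary transcript: exons in upper case, introns in lower case.
primary = "AUGGCC" + "guaaguuuucag" + "UUUGGG" + "guaagcccag" + "UAA"
introns = [(6, 18), (24, 34)]

mrna = splice(primary, introns)
print(len(primary), len(mrna))
print(mrna)
```

The whole primary transcript, introns included, corresponds to the gene; only the spliced product is translated.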
My preferred definition has been used by molecular biologists for many decades but there are several other definitions that have been popular over the years. All of them have good points and bad points. I’ve already dealt with the definition that includes regulatory regions.
Some people still prefer a gene definition that corresponds to one used over half a century ago; namely, a gene is a sequence that encodes a polypeptide. This is the so-called one gene:one protein definition. It’s very old-fashioned. We’ve known for years that there are genes that do not encode proteins in spite of the fact that we commonly show protein-encoding genes whenever we describe typical genes. (As I did above.) There are genes for transfer RNA (tRNA), genes for ribosomal RNA, and genes for a large heterogeneous class of small RNAs. None of them have coding regions. The transcript is the functional product, often after RNA processing.
Because this old-fashioned definition is rarely used, the examples of alternative splicing producing different proteins pose no problem for modern definitions. These modern definitions refer to the transcript as the important product and not a protein.
There are exceptions to every generality in biology. Here’s a short list of gene examples that do not conform to my preferred definition.
Operons: In some cases adjacent “genes” are transcribed together to produce a large initial transcript containing several coding regions. In other cases the primary transcript is subsequently cleaved to produce multiple functional RNAs. In these cases it doesn’t make sense to refer to the co-transcribed genes as a single “gene.” Instead, we identify the stretches of DNA that correspond to a single functional unit as the “gene.” Thus, the lac operon contains three “genes” and the ribosomal RNA operons contain two, three, or four genes.
Trans-splicing: There are examples of “genes” that are split into pieces. The transcript from one piece is joined to the transcript from another to produce a functional RNA.
Overlapping Genes: Some “genes” overlap. This means that a single stretch of DNA can be part of two and, in at least one case, three genes.
RNA Editing: In some cases the primary transcript is extensively edited before it becomes functional. In the most extreme cases nucleotides are inserted and deleted. What this means is that the information content of the “gene” is insufficient to ensure a functional product and the assistance of other “genes” is required.