Sunday, January 28, 2007

What Is a Gene?

(Other definitions are at Discovering Biology in a Digital World, Pharyngula, and Greg Laden.)

The concept of a gene is a fundamental part of the fields of genetics, molecular biology, evolution and all the rest of biology. Gene concepts can be divided into two main categories: abstract and physical. Abstract genes are the kind we refer to when we talk about genes “for” a certain trait, including many genetic diseases. Most geneticists and many evolutionary biologists use an abstract gene concept.

Philosophers have coined the term “Gene-P” for the abstract gene concept. The “P” stands for “phenotype,” indicating that this gene concept defines a gene by its phenotypic effects and not its physical structure.

Physical genes consist of stretches of DNA with a beginning and an end. These are molecular genes that can be cloned and sequenced. Philosophers call them “Gene-D” where “D” stands for “development”—a very unfortunate choice.

This essay describes various modern definitions of physical genes (Gene-D). I like to define a gene as “a DNA sequence that’s transcribed” but that’s a bit too brief for a formal definition. We need to include something that restricts the definition of gene to those entities that are biologically significant. Hence,
A gene is a DNA sequence that is transcribed to produce a functional product.
This eliminates those parts of the chromosome that are transcribed by accident or error. These regions are significant in large genomes; in fact, the confusion between accidental transcripts and real transcripts is responsible for the overestimates of gene number in many genome projects. (In technical parlance, most ESTs are artifacts and the sequences they come from are not genes.)

We could refine the definition by including RNA genes but that’s such an insignificant percentage of all genes that the refinement is hardly worth it. As we shall see, there are more significant limitations to the definition.

This "DNA sequence that's transcribed" definition describes a physical entity. Let’s examine a simple molecular gene to see how the definition applies.

This is a simple bacterial protein-encoding gene. The horizontal line represents a stretch of double-stranded DNA with the rectangular part being the gene. The gene is copied into RNA as shown by the arrow below the gene. This process is called transcription. Transcription begins when the transcription enzyme (RNA polymerase) binds to a promoter region (P) and starts copying the DNA beginning at the initiation site (i). The DNA is copied until a termination site (t) is reached at the end of the gene. According to my preferred definition of a gene, it starts at “i” and ends at “t.”

The part of the gene that’s transcribed includes the coding region, shown in black. This is the part of the gene that contains sequential codons specifying the amino acid sequence of the protein. At the beginning of the gene, called the 5ʹ (5-prime) end, there’s a short stretch of sequence that will be transcribed but not translated into protein. This 5ʹ untranslated region (5ʹ UTR) will contain various signals for starting protein synthesis.

The other end of the gene is called the 3ʹ (3-prime) end and there’s almost always a stretch that’s transcribed but not translated (3ʹ UTR). The 3ʹ UTR contains signals that cause transcription termination and also signals that regulate translation.
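As a rough illustration of the “starts at i and ends at t” definition, the layout described above can be sketched as coordinates on a stretch of DNA. This is only a toy model: the class name, field names, and coordinates are hypothetical and chosen for illustration, not taken from any real gene or software library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BacterialGene:
    """Toy model of a bacterial protein-encoding gene.

    Coordinates are hypothetical, 0-based, and end-exclusive.
    The promoter lies upstream of `initiation` and is deliberately
    not part of the gene under this definition.
    """
    initiation: int    # "i": first transcribed position
    coding_start: int  # first position of the coding region
    coding_end: int    # one past the last position of the coding region
    termination: int   # "t": one past the last transcribed position

    def __post_init__(self):
        # The coding region must sit inside the transcribed region.
        if not (self.initiation <= self.coding_start
                <= self.coding_end <= self.termination):
            raise ValueError("expected i <= coding start <= coding end <= t")

    @property
    def five_prime_utr(self):
        # Transcribed but not translated, upstream of the coding region.
        return (self.initiation, self.coding_start)

    @property
    def three_prime_utr(self):
        # Transcribed but not translated, downstream of the coding region.
        return (self.coding_end, self.termination)

    @property
    def length(self):
        # The gene is everything from "i" to "t".
        return self.termination - self.initiation


# Made-up coordinates for illustration only.
gene = BacterialGene(initiation=100, coding_start=130,
                     coding_end=1030, termination=1100)
print(gene.five_prime_utr, gene.three_prime_utr, gene.length)
```

The point of the sketch is that the gene has a definite start and end, and that both untranslated regions fall inside it because they are transcribed.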

There are regions upstream of the promoter that control whether or not the gene is transcribed. These regions are called regulatory regions. They may contain binding sites for various proteins that will attach there in order to enhance the binding of RNA polymerase to the promoter. One of the differences between my preferred definition of a gene and others is that some other definitions include the promoter and the regulatory region.

There are two problems with such definitions. First, they’re not consistent with standard usage when we talk about the regulation of gene expression. We don’t say that only “part” of a gene is transcribed, which would be correct if we included the regulatory region in our definition of a gene. How often have we heard anyone say that regulatory sequences control the expression of part of the gene? That doesn’t make sense.

Second, by including regulatory sequences in the definition of a gene the actual extent of the gene becomes ill-defined. For most genes, we don’t know where all the regulatory sequences are located so we don’t know for sure where the gene begins or ends. Furthermore, there are some regulatory sequences, especially in eukaryotes, that are not contiguous with the gene and this leads to “genes” that are split into various pieces. It’s much easier to use a definition like “a DNA sequence that’s transcribed” because it defines a start and an end.

The organization of a typical eukaryote gene is shown below.

The main difference between this type of gene and a typical bacterial gene is the presence of introns and exons. These genes are transcribed from an initiation site to a termination site just like bacterial genes. When the RNA transcript is finished it undergoes an additional step called RNA processing. In that step, parts of the original transcript are spliced out and discarded. These parts correspond to the introns in the gene, shown as thinner rectangular regions within the gene.

Note that the coding region (black) can be interrupted by these introns so the final messenger RNA (mRNA) cannot be translated until RNA processing is completed. The important point for our purposes is that the introns are part of the gene since they are transcribed.
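The splicing step can be sketched in a few lines, assuming the intron positions are simply given. In a real cell they are recognized by the splicing machinery from signals in the transcript, which this sketch does not attempt to model. The function name and the short sequence are invented for illustration.

```python
def splice(primary_transcript, introns):
    """Remove introns from a primary transcript and join the exons.

    `introns` is a list of (start, end) intervals, 0-based and
    end-exclusive, giving the parts of the transcript to discard.
    """
    exons = []
    position = 0
    for start, end in sorted(introns):
        if start < position or end > len(primary_transcript) or start >= end:
            raise ValueError(
                "introns must be non-empty, non-overlapping, and in range")
        exons.append(primary_transcript[position:start])  # keep the exon
        position = end                                    # skip the intron
    exons.append(primary_transcript[position:])           # final exon
    return "".join(exons)


# Invented example: two short exons interrupted by one intron.
primary = "AUGGCC" + "GUAAGUUUUCAG" + "AAAUAA"
mrna = splice(primary, [(6, 18)])
print(mrna)
```

The primary transcript is longer than the mature mRNA, and the gene, as defined here, corresponds to the full primary transcript, introns included.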

My preferred definition has been used by molecular biologists for many decades but there are several other definitions that have been popular over the years. All of them have good points and bad points. I’ve already dealt with the definition that includes regulatory regions.

Some people still prefer a gene definition that corresponds to one used over half a century ago; namely, a gene is a sequence that encodes a polypeptide. This is the so-called one gene:one protein definition. It’s very old-fashioned. We’ve known for years that there are genes that do not encode proteins in spite of the fact that we commonly show protein-encoding genes whenever we describe typical genes. (As I did above.) There are genes for transfer RNA (tRNA), genes for ribosomal RNA, and genes for a large heterogeneous class of small RNAs. None of them have coding regions. The transcript is the functional product, often after RNA processing.

Because this old-fashioned definition is rarely used, the examples of alternative splicing producing different proteins pose no problem for modern definitions. These modern definitions refer to the transcript as the important product and not a protein.

There are exceptions to every generality in biology. Here’s a short list of gene examples that do not conform to my preferred definition.

Operons: In some cases adjacent “genes” are transcribed together to produce a large initial transcript containing several coding regions. In other cases the primary transcript is subsequently cleaved to produce multiple functional RNAs. In these cases it doesn’t make sense to refer to the co-transcribed genes as a single “gene.” Instead, we identify the stretches of DNA that correspond to a single functional unit as the “gene.” Thus, the lac operon contains three “genes” and the ribosomal RNA operons contain two, three, or four genes.

Trans-splicing: There are examples of “genes” that are split into pieces. The transcript from one piece is joined to the transcript from another to produce a functional RNA.

Overlapping Genes: Some “genes” overlap. This means that a single stretch of DNA can be part of two and, in at least one case, three genes.

RNA Editing: In some cases the primary transcript is extensively edited before it becomes functional. In the most extreme cases nucleotides are inserted and deleted. What this means is that the information content of the “gene” is insufficient to ensure a functional product and the assistance of other “genes” is required.


John S. Wilkins said...

One philosopher - Lenny Moss - uses (and coined) the Gene-P and Gene-D terminology. It hasn't really caught on (yet). It's based on the old preformationist/epigeneticist distinction of the 17thC, in ways I don't entirely understand.

And Lenny trained as a molecular biologist.

Peter Ellis said...

Nice concise definition of a gene - it's what I was aiming at, but put better.

I don't see why you exclude RNA genes though - you could simply alter the definition to read "nucleic acid sequence" rather than "DNA sequence". Currently you're left with the rather odd proposition that the smallpox virus has genes, but the SARS virus doesn't, for example. A retrovirus doesn't have genes when you catch it, but then *does* have genes once it integrates into the genome.

I'm also not 100% sure that it wouldn't be better to call an operon a single gene, if we're aiming for a definition rooted in nucleic acid events. Deciding that an operon represents several "genes" is superimposing a polypeptide-based definition onto a nucleic acid-based framework.

If you call an operon several genes, then logically why would the same not apply to cases where a single polypeptide chain is produced, and then cleaved into fragments with different activities? That's effectively the same phenomenon, occurring post-translationally. So why wouldn't you call the stretch of nucleic acids for each fragment a separate gene?

SPARC said...

Just to make clear my identity: Unfortunately my comments here appear under my google account name (is there a way to change this?). Normally I comment as SPARC. So, you already know my concerns about your definition. I would summarize them like this:

Nothing in transcription makes sense except in the light of regulatory sequences.

Larry Moran said...

John Wilkins says,

One philosopher - Lenny Moss - uses (and coined) the Gene-P and Gene-D terminology. It hasn't really caught on (yet).

I thought it was good to mention the distinction between the two basic gene concepts. I was basing it on the papers by Moss and also on those by Paul Griffiths and Karola Stotz, who also discuss the terms Gene-P and Gene-D. Perhaps you've heard of Paul Griffiths? :-)

(For the benefit of others, John Wilkins works with Paul Griffiths in Brisbane, Australia. Griffiths is a leading expert on the gene concept.)

Larry Moran said...

Martin (SPARC) says,

Nothing in transcription makes sense except in the light of regulatory sequences.

I agree with this sentiment. Regulatory sequences control the expression of the gene. Or, do you think of them as controlling the expression of part of the gene?

Just because regulatory sequences are important does not mean they have to be included in the definition of a gene.

Larry Moran said...

Peter Ellis says,

I don't see why you exclude RNA genes though - you could simply alter the definition to read "nucleic acid sequence" rather than "DNA sequence". Currently you're left with the rather odd proposition that the smallpox virus has genes, but the SARS virus doesn't, for example.

I'm not "excluding" RNA genes—I'm simply relegating them to the category of exception to the rule. It's true that we could substitute "nucleic acid" for "DNA" in the definition but I think that weakens the definition considerably.

The tricky part about definitions in biology is that they can almost never be airtight. What we're usually looking for is a generality that conveys the truth about most of the things we're defining. In this case we're trying to describe a typical gene and in 99.99% of the cases, that gene is made of DNA.

The other problem is to reconcile a "definition" with general usage. While I agree with you that we could call an entire operon a "gene" this doesn't really make a lot of sense in light of the fact that nobody would ever agree with us. Like it or not, molecular biologists will continue to refer to the β-galactosidase gene and not the β-galactosidase fragment of the lac gene.

Greg Laden said...


This is a good definition of a gene. I like much of it.

I started to comment on the question of regulatory bits (I agree with you) and ended up with comments extensive enough that I made my own post:

I do think regulatory regions are not genetic any more than the boardroom at the Ford Plant over in Saint Paul is a pickup truck. You need lots of things to make a pickup truck, but those things do not become the pickup truck.

This is why the Verizon commercial is funny and not real life. All those people in "the network" following you around really do work for Verizon (well, they are actors, but...) but they are not part of your cell phone.

I happen to think the same of non-coding RNA sequences in the DNA, an idea that is either terribly old fashioned or very very modern. I'm still thinking about it.

Anonymous said...

you guys are so clever :| im an undergraduate at the university of nottingham, England, studying BSc Biochemistry. my tutor set me an essay to write : "WHAT IS A GENE?". so confused :-(

solitarybee said...

In defining a gene, it is easy to focus on the coding sequence that results in the functional protein. However, perhaps we should consider it as a system, where the trigger and feedback systems that modulate its activity are taken into account. It does, after all, sit in a context, and if you take the gene out of its context, as in a naive attempt to GM an organism, the chances are you are heading for a fail. Of course, defining the full context is a bit of an art.

Physeter said...

Thanks for your valuable and spammy advice.

I like your definition, but I have a further question (hope you're still reading comments in old posts): what is a "function" in biology?

Larry Moran said...

I don't know if I can come up with a catchy definition of "function." What I mean is that the transcript or its product has to do some biochemical duty in order to qualify. It doesn't have to be an essential function but it has to make a difference of some sort.

Physeter said...

Thanks Larry. I understood what you meant, didn't want to imply you were ambiguous. I think the concept of biological function is an interesting issue by itself. Most of the definitions I've heard around are restricted to evolution, and more specifically to traits evolved by NS. That doesn't convince me.
If you find it to be interesting, take a look at the Wikipedia article. I can't make sense of the first paragraph ("part of a question"???).

Paul said...

My two cents worth, the "raison d'être" for any maintained and active gene system is purely and simply an environmental 'intervention'.
Be it providing this indirectly as part of a support structure for a higher function (e.g. in photosynthesis), or directly in coding the actual active site as an enzyme.

Tim Tyler said...

Genes should be what genetics studies - and genetics is the science of inheritance and variation in living organisms. Organisms do not *have* to use nucleic acids for inheritance - that is simply a local historical accident - so genes should not have to be made out of DNA.

Unknown said...

Help please ... I am a complete amateur but interested. I am teaching myself about "Life" in general and how it came to be ... and am at a very basic level.

My question at the moment is from the studying I have done so far. I have ended up thinking of alleles as a collection of physical entities and genes as descriptive of the physical manifestation of the alleles' properties (i.e. not physical but descriptive of a pattern). Am I adrift?

PhillsBlog said...

A gene is something between a nucleotide and a chromosome. Do genes have well-defined boundaries? If not, then what is an allele?