Wednesday, January 10, 2024

Benjamin Lewin's new book and his view of the human genome

I was a big fan of Benjamin Lewin. Back in the 1970s he published the first volumes of what was to become Genes, the authoritative textbook of molecular biology. I admired his ability to understand the latest experiments and put the results in the appropriate context.

Later on, when he founded the journal Cell, his editorials and other writings were always insightful. His editorial judgement was impeccable—he always published the very best papers in molecular biology.1

After a lengthy hiatus, Lewin has returned to popular science writing with Inside Science: Revolutions in Biology and its Impact. The book is a critique of how science is done and how scientific dogma can inhibit scientific progress. He returns to the theme of focusing on the broad picture of science that characterized his earlier views, but now he wants to emphasize the different ways of interpreting data and how they have changed. His "revolutions" include reverse transcriptase, introns, prions, genomes, CRISPR, epigenetics, and stem cells.

Here's how he describes his book.

Lewin brings these general principles to life by considering the history of the genetics revolution, from the discovery of the double helical structure of DNA to the sequencing of the human genome and the possibilities of gene editing today. History shows us that each period of progress in science relied on dogmas that often advanced but sometimes retarded progress, and that views of reality often changed suddenly and dramatically. One example is the current critical reassessment of epigenetics that is raising the possibility that there may be factors in inheritance extraneous to DNA. The book concludes by asking if the reductionist manifesto that has dominated biology for the past half century can continue to hold, and revisits the much-debated question: What is science?

The book is a major disappointment. Most of it is just a rehash of the current most popular views of the history of molecular biology without any of the insights that he published when these events were happening. His critiques are just the standard diatribes about reproducibility, publish or perish, fraud, and big science. If you are interested in these subjects then you should consult Science Fictions by Stuart Ritchie or The Scientific Attitude by Lee McIntyre, both of which are far more insightful than Lewin's latest book.

There is very little critical analysis of the major "paradigm shifts" that he covers with the exception of epigenetics where Lewin expresses a modest amount of skepticism. Having lived through most of these events, as did Lewin, I question whether they were actually perceived as paradigm shifts or just extensions in our understanding of genes and gene expression.

Let me give an example of one of the revolutions that Lewin discusses. You won't be surprised to see that I've chosen the human genome and estimates of the number of genes. Here's how Lewin describes the events leading up to the publication of the human genome sequence and the interpretation of the results (pp. 221-222).

Humans are more complicated than other organisms, so it seemed obvious they should contain more genes. When the project was proposed for sequencing the human genome, it was thought that 100,000 genes would be found. It was a surprise (and perhaps a disappointment?) that we have barely more than a worm, the same as a mouse, and fewer than a mustard plant.

Looking back, you can see how the assumptions of the era were misleading. It was thought that most of the genome would comprise genes coding for proteins, and that most of the RNAs would be intermediates in the production of those proteins. Neither was correct. In fact, there are about an equal number of sequences coding for RNAs as for proteins (although in most cases the RNAs have no known function). It took analysis of the actual sequence to come to grips with the issue. Less than 1% of the human genome codes for protein, and <25% is implicated directly in gene expression. There has been a lot of speculation as to why there is so much DNA in the genome, but the straight answer is that we still don't know.

Lewin publishes this figure on page 228 with the following legend, "Estimates for the number of human protein-coding genes dropped steadily from 100,000 in 1990 to the end of the century, and since then have oscillated around 20,000."

I expected much better from Benjamin Lewin. Back in the 1970s he wrote extensively about the number of genes and he was well aware of the fact that only a small percentage of the human genome was coding DNA. He knew about the mutation load arguments; he knew about the C-value paradox; he knew that most of the genome was repetitive DNA; he knew that there were about 10,000 mRNAs in most cells; he knew about introns and transposons (Lewin 1974a, 1974b).

Lewin tended to favor the upper end of gene number estimates and throughout the 1980s he consistently speculated that humans had about 50,000 genes. He was always a proponent of regulatory DNA and favored the idea that a large percentage of the genome was devoted to regulation. However, at no time did he subscribe to the idea that humans had 100,000 genes and he knew very well that many experts were predicting about 30,000 genes. When he says, "It was thought that most of the genome would comprise genes coding for proteins ..." he is misrepresenting history and his own contribution to that history.

Also, the old Benjamin Lewin would never say that the number of non-coding genes is about equal to the number of coding genes. The old Benjamin Lewin was much more skeptical about scientific claims and required that they be supported by evidence. The old Benjamin Lewin would have read the literature on gene numbers and come across this figure (below) from Hatje et al. (2019). It shows the real estimates of gene number from the past 60 years. (See: How many protein-coding genes in the human genome? (2))

Lewin has never been a fan of junk DNA and throughout the 1970s and 1980s he avoided using the term. However, that doesn't mean he was unaware of the concept. So, he is being somewhat disingenuous when he says:

There has been a lot of speculation as to why there is so much DNA in the genome, but the straight answer is that we still don't know.

This would have been an excellent time to make a further point about "dogma" and to point out that for the past fifty years many top scientists have been advocating a small number of genes and lots of junk DNA. The old Benjamin Lewin would have done that and admitted that he, himself, might have been ignoring data because he was adhering to an incorrect dogma about the evolution of genomes.

1. For example, he published three of my papers in 1978-79 without asking for any significant revisions! :-)

Lewin, B. (1974a) "Chapter 17: Genes and Gene Number" in Gene Expression-2: Eukaryotic Chromosomes, John Wiley & Sons, Ltd. pp. 479-502.

Lewin, B. (1974b) Sequence Organization of Eukaryotic DNA: Defining the Unit of Gene Expression. Cell 1:107-111. [doi: 10.1016/0092-8674(74)90125-1]

Hatje, K., Mühlhausen, S., Simm, D., and Kollmar, M. (2019) The Protein-Coding Human Genome: Annotating High-Hanging Fruits. BioEssays 1900066. [doi: 10.1002/bies.201900066]


Robert Byers said...

Yes they always say that better new ideas were and are hampered by old dogma held by an old guard. This is what organized creationism says about evolutionary biology claims or any claims in origin subjects. It always confirms why creationism is on good ground not being impressed with dogma. Yes ideas change suddenly. This will happen in origin subjects I predict.

Stewart Hinsley said...

"Humans are more complicated than other organisms, so it seemed obvious they should contain more genes."

I'm willing to entertain the hypothesis that vertebrates are more complex than other organisms (though arthropods strike me as pretty complex), but I've never understood the assumption that humans are more complicated than other tetrapods.

I can think of a couple of reasons why humans might have fewer genes, rather than more. Homeothermic animals don't need to have cells that work at different temperatures; poikilothermic animals might need to have multiple variants of proteins that work in different temperature ranges. Autotrophs (such as plants) have to synthesise molecules that animals obtain in their diet, and therefore need additional enzymes. (Though bacteria thriving with much smaller genomes argues against the significance of these factors.)

Larry Moran said...

Robert Byers said, "... it always confirms why creationism is on good ground not being impressed with dogma."

You broke my irony meter. I'll send you the bill.

Eric Falkenstein said...

It seems you disagree with conventional opinion in your field--lay and professional--as much as non-conventional creationists. If you and your readership are the sole purveyors of a profound truth, you guys should generate some tangible new, true, and important results. Otherwise, you are arguing about semantics, emphasis, and speculative interpretations. That can be interesting, but it would imply that greater empathy would be requisite.

Joe Felsenstein said...

@Eric Falkenstein: Larry's views are mainstream among people working in molecular evolution, who overwhelmingly support his view that about 90% of the genome in most eukaryotes is junk DNA. Larry's book explains why this is consistent with the evidence. It is the people who don't work on molecular evolution, many of whom have accepted the nonexistence of junk DNA, who need to show that their shallowly-evidenced view leads to anything.

Larry Moran said...

@Eric Falkenstein: It's hard to be empathetic toward scientists who repeatedly misrepresent the history of their field in order to advance their own agenda and who consistently ignore (lack of empathy) all those scientists who disagree with them.

Look at the two figures in my post. Which one do you think is correct?

Neil Taylor said...

Professor Moran, you often use a phrase about knowledgeable scientists/biochemists and how, many years prior to the genome project providing reasonably definitive evidence on the number of protein coding genes, they had postulated that the number of genes was close to the actual number.

The figure from Pertea and Salzberg (2010) does show this.

However it also shows that there were scientists publishing estimates of the number of protein coding genes that were two or three times (or more) higher than the actual number.

These scientists were wrong; and I realise peer review is a minimum standard, but was this simply two separate fields not communicating with each other, or was it an ongoing and active controversy with papers citing and critiquing each other?

You give the impression that large numbers of scientists, even while publishing in the field, were not knowledgeable.

I've read your book (thank you for your efforts - I really enjoyed it) but am still a little confused how a field can be so unaware over such a long period of time - the dichotomy seems to have existed since the 1960s.

Benjamin Lewin's figure is incomplete. It excludes those who consistently predicted approximately the correct number.

Were these (knowledgeable) scientists voices in the wilderness? Why do you think they have been ignored?

The deflated ego problem seems quite a strange reason for them to be side-lined. What other reasons were stopping their ideas gaining wider currency, even today?

Eric Falkenstein said...

@Larry Moran: The main difference you note is about the percentage and nature of non-functional DNA, or in Lewin's words, 'why is there so much DNA in the genome?' Given that protein-coding genes take up such a small percent of the DNA, whether it is 20k or 100k doesn't address that point significantly (1%, 5%). The conventional opinion is that much (40%? 80%) of the non-protein-coding DNA is doing something, where you estimate it is more like 10%, highlighting tests that would adjudicate these very different numbers seems more fruitful.

Larry Moran said...

@Neil Taylor: I made a mistake in my original post when I attributed the second figure to the wrong people. It's from Hatje et al. (2019). I've corrected my post.

The original estimates of the number of genes were based mostly on four lines of evidence: (1) Mutation load predicting about 30,000 genes. (2) Hybridization of mRNA to DNA (R0t curves), which gave an estimate of 10-20,000 protein-coding genes. (3) C0t curves showing that half of the genome was repetitive DNA that was unlikely to contain genes. (4) Estimates of gene number in Drosophila showing that there were probably fewer than 10,000 genes in fruit flies. (We now know that those estimates were too low.)

All of these lines were discussed in the scientific literature and the conclusion was that the human genome had fewer than 50,000 genes - probably closer to 30,000. There were knowledgeable scientists who disputed the reliability of the data (including Lewin) but they did not publish alternative predictions.

The fourth point on the Hatje et al. figure (60,000 genes) comes from Vogel (1964) but he didn't consider any of the four lines of evidence mentioned above. The human genome estimate of 100,000 in 1990 comes from a back-of-the-envelope calculation by Wally Gilbert who also didn't look at any of the earlier data. No knowledgeable scientists believed that number in 1990. We assumed that it was hype used to sell the human genome project.

The high estimates in the mid 1990s are based on EST data but that data was clearly flawed and few scientists accepted it except for Venter and the TIGR team. It was dismissed by two of the three gene number papers published in 2000 before the draft human genome sequence was announced.

I think the main reason why many scientists were surprised to learn in 2001 that we have only 30,000 genes (now down to 25,000) is because they weren't following the literature and had never read the papers from the 1970s. Most of them were unfamiliar with molecular evolution and C0t curves so they didn't know the evidence for a smaller number of genes. After that date (2001), the resistance to accepting the actual number of genes in spite of the evidence can be attributed to wishful thinking and the deflated ego problem.

Larry Moran said...

Eric Falkenstein says, "The conventional opinion is that much (40%? 80%) of the non-protein-coding DNA is doing something, where you estimate it is more like 10%, highlighting tests that would adjudicate these very different numbers seems more fruitful."

I agree that there are a great many scientists who think that most of our genome is functional. Very few of them are aware of the fact that this conclusion is controversial and hardly any of them would be able to defend their view in a serious debate.

I attribute this to the propaganda spread by scientists like the ENCODE group and their journalist allies who have consistently ignored all of the evidence for junk DNA. In addition, their misunderstanding is enhanced by an incorrect view of evolution. They are under the impression that all of evolution is due to natural selection and that means junk DNA shouldn't be there.

One of the most obvious tests is to look at how much of the genome is conserved or how much is under purifying selection. Both tests give the same answer: 10%. Opponents of junk DNA have to do a lot of hand-waving in order to dismiss this data and argue that much of the non-conserved DNA could still be functional.

The main problem is that when it comes to actual evidence of function we can't attribute more than about 5% to the functional proportion of the genome. Anyone who says that it should be 40% or 80% is making a claim that they can't support by the usual standards of evidence that we expect of scientists.

Michael Tress said...

Indeed, the group that published the 120,000 gene prediction just before the release of the draft human genome sequence quickly released a correction to their published estimate where they admitted calculation errors and provided a revised estimate.

"These improved estimates provide a lower bound of 56,960 and an upper bound of 81,273 genes in the human genome."

Hatje et al. should have used the revised figures in their graph. The erroneous 120,000 point is a real outlier.

Graham Jones said...

I think that since the human genome project, there has been a sort of streetlight effect in biology. Molecular sequencing technologies shone a very strong spotlight on a particular area of biology. The human genome project was like a gold rush. Lots of scientists got excited and developed the skills to analyse this data. Now they want the excitement to continue, and they want to continue to use their skills. They are determined to find answers where it is easiest for them to look.

Larry Moran said...

@Graham Jones: I like your description of the "streetlight effect" as practiced by genomics researchers. May I borrow it? Should I attribute it to you?

Graham Jones said...

I'm glad you like it! Feel free to use it with or without attribution as you see fit.

Ted said...

"Streetlight" comes from an old anecdote of a drunk looking for his keys under the street light, even though he knows he lost them elsewhere. When asked why he is deliberately looking in the wrong place, he replies "because the light is much better here."