More Recent Comments

Tuesday, April 25, 2023

Happy DNA Day 2023!

It was 70 years ago today that the famous Watson and Crick paper was published in Nature along with papers by Franklin & Gosling and Wilkins, Stokes, & Wilson. There's a great deal of misinformation circulating about this discovery so I wrote up a brief history of the events based largely on Horace Freeland Judson's book The Eighth Day of Creation. Every biochemistry and molecular biology student must read this book or they don't qualify as an informed scientist. However, if you are not a biochemistry student then you might enjoy my short version.

Some practising scientists might also enjoy refreshing their memories so they have an accurate view of what happened in case their students ask questions.

The Story of DNA (Part 1)

Where Rosalind Franklin teaches Jim and Francis something about basic chemistry.

The Story of DNA (Part 2)

Where Jim and Francis discover the secret of life.

Here's the latest version of Rosalind Franklin's contribution written by Matthew Cobb and Nathaniel Comfort: What Rosalind Franklin truly contributed to the discovery of DNA's structure. If you want to know the accurate version of her history then this is a must-read. Cobb is working on a biography of Crick and Comfort is writing a biography of Watson.

Here are some other posts that might interest you on DNA Day.



Saturday, March 25, 2023

ChatGPT lies about junk DNA

I asked ChatGPT some questions about junk DNA and it made up a Francis Crick quotation and misrepresented the view of Susumu Ohno.

We have finally restored the Junk DNA article on Wikipedia. (It was deleted about ten years ago when Wikipedians decided that junk DNA doesn't exist.) One of the issues on Wikipedia is how to deal with misconceptions and misunderstandings while staying within the boundaries of Wikipedia culture. Wikipedians have an aversion to anything that looks like editorializing so you can't just say something like, "Nobody ever said that all non-coding DNA was junk." Instead, you have to find a credible reference to someone else who said that.

I've been trying to figure out how far the misunderstandings of junk DNA have spread so I asked ChatGPT (from OpenAI) again.

Wednesday, March 08, 2023

A small crustacean with a very big genome

The Antarctic krill genome is the largest animal genome sequenced to date.

Antarctic krill (Euphausia superba) is a species of small crustacean (about 6 cm long) that lives in large swarms in the seas around Antarctica. It is one of the most abundant animals on the planet in terms of biomass and numbers of individuals.

It was known to have a large genome with abundant repetitive DNA sequences making assembly of a complete genome very difficult. Recent technological advances have made it possible to sequence very long fragments of DNA that span many of the repetitive regions and allow assembly of a complete genome (Shao et al. 2023).

The project involved 28 scientists from China (mostly), Australia, Denmark, and Italy. To give you an idea of the effort involved, they listed the sequencing data that was collected: 3.06 terabases (Tb) of PacBio long-read sequences, 734.99 Gb of PacBio circular consensus sequences, 4.01 Tb of short reads, and 11.38 Tb of Hi-C reads. The assembled genome is 48.1 Gb, which is considerably larger than that of the African lungfish (40 Gb), which until now was the largest fully sequenced animal genome.
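To put those numbers in perspective, you can divide each dataset by the size of the assembled genome to get an approximate depth of coverage. This is my own back-of-the-envelope arithmetic, not a calculation from the paper:

```python
# Rough coverage estimates for the krill assembly: coverage = bases sequenced
# divided by genome size. The figures are the ones quoted above.
genome_size = 48.1e9  # assembled genome size in base pairs

datasets = {
    "PacBio long reads": 3.06e12,
    "PacBio CCS reads": 734.99e9,
    "short reads": 4.01e12,
    "Hi-C reads": 11.38e12,
}

for name, bases in datasets.items():
    print(f"{name}: ~{bases / genome_size:.0f}x coverage")
# PacBio long reads: ~64x, CCS: ~15x, short reads: ~83x, Hi-C: ~237x
```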

The current draft has 28,834 protein-coding genes and an unknown number of noncoding genes. About 92% of the genome is repetitive DNA that's mostly transposon-related sequences. However, there is an unusual amount of highly repetitive DNA organized as long tandem repeats and this made the assembly of the complete genome quite challenging.

The protein-coding genes in the Antarctic krill are longer than in other species due to the insertion of repetitive DNA into introns but the increase in intron size is less than expected from studies of other large genomes such as lungfish and Mexican axolotl. It looks like more of the genome expansion has occurred in the intergenic DNA compared to these other species.

This study supports the idea that genome expansion is mostly due to the insertion and propagation of repetitive DNA sequences. Some of us think that the repetitive DNA is mostly junk DNA, but in this case it seems unusual that there would be so much junk in the genome of a species with such a huge population size (about 350 trillion individuals). The authors were aware of this problem, but they were able to calculate an effective population size because they had sequence data from individuals all around Antarctica. The effective population size (Ne) turned out to be one billion times smaller than the census population size, indicating that the population of krill had been much smaller in the recent past. Their data strongly suggest that this smaller population existed only 10 million years ago.
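The authors estimated Ne from genome-wide polymorphism data. I don't know their exact method, but the textbook neutral-theory calculation gives the flavor of it; every number below is a made-up placeholder, not a value from the paper:

```python
# Illustrative only: effective population size from nucleotide diversity
# under the neutral expectation pi = 4 * Ne * mu (so Ne = pi / (4 * mu)).
pi = 0.004   # hypothetical pairwise nucleotide diversity (placeholder)
mu = 1e-9    # hypothetical mutation rate per site per generation (placeholder)

Ne = pi / (4 * mu)
print(f"Ne ~ {Ne:.1e}")                    # ~1.0e+06 with these placeholders

census = 3.5e14  # ~350 trillion krill (figure quoted above)
print(f"census / Ne ~ {census / Ne:.1e}")  # a gap of many orders of magnitude
```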

The authors don't mention junk DNA. They seem to favor the idea that large genomes are associated with crustaceans that live in polar regions and that large genomes may confer a selective advantage.


Shao, C., Sun, S., Liu, K., Wang, J., Li, S., Liu, Q., Deagle, B.E., Seim, I., Biscontin, A., Wang, Q. et al. (2023) The enormous repetitive Antarctic krill genome reveals environmental adaptations and population insights. Cell 186:1-16. [doi: 10.1016/j.cell.2023.02.005]

Friday, March 03, 2023

Do you understand the scientific literature?

I'm finding it increasingly difficult to understand the scientific literature even in subjects that I've been following for decades. Is it just because I'm getting too old to keep up?

Here's an example of a paper that I'd like to understand but after reading the abstract and the introduction I gave up. I'll quote the first paragraph of the introduction to see if any Sandwalk readers can do better.

I'm not talking about the paper being a complete mystery; I can figure out roughly what it's about. What I'm thinking is that the opening paragraph could have been written in a way that makes the goals of the research much more comprehensible to the average scientifically literate reader.

Weiner, D. J., Nadig, A., Jagadeesh, K. A., Dey, K. K., Neale, B. M., Robinson, E. B., ... & O’Connor, L. J. (2023) Polygenic architecture of rare coding variation across 394,783 exomes. Nature 614:492-499. [doi: 10.1038/s41586-022-05684-z]

Genome-wide association studies (GWAS) have identified thousands of common variants that are associated with common diseases and traits. Common variants have small effect sizes individually, but they combine to explain a large fraction of common disease heritability. More recently, sequencing studies have identified hundreds of genes containing rare coding variants, and these variants can have much larger effect sizes. However, it is unclear how much heritability rare variants explain in aggregate, or more generally, how common-variant and rare-variant architecture compare: whether they are equally polygenic; whether they implicate the same genes, cell types and genetically correlated risk factors; and whether rare variants will contribute meaningfully to population risk stratification.

The first question that comes to mind is whether the variant that's associated with a common disease is the cause of that disease or merely linked to the actual cause. In other words, are the associated variants responsible for the "effect size"? It sounds like the answer is "yes" in this case. Has that been firmly established in the GWAS field?


Thursday, March 02, 2023

"You like me!"

The endorsements for my book are in.

One of the last steps in publishing a book is to collect endorsements—favorable statements from famous people who urge you to buy the book. These short endorsements will appear in the front of the book and on the book jacket (dust jacket). They may also appear on various websites in order to promote sales.

The trick is to send the book out for review to as many people as possible and hope that one or two will like it well enough to say something nice. I'm pleased to report that there were, indeed, a few people who liked the book well enough to endorse it.



The title of this post is from Sally Field's acceptance speech on winning the Academy Award for best actress in 1985. She said, "I can't deny the fact that you like me. Right now, you like me!"

Wednesday, March 01, 2023

Definition of a gene (again)

The correct definition of a molecular gene isn't difficult but getting it recognized and accepted is a different story.

When writing my book on junk DNA I realized that there was an issue with genes. The average scientist, and consequently the average science writer, has a very confused picture of genes and the proper way to define them. The issue shouldn't be confusing for Sandwalk readers since we've covered that ground many times in the past. I think the best working definition of a gene is, "A gene is a DNA sequence that is transcribed to produce a functional product" [What Is a Gene?]

Saturday, February 25, 2023

How Intelligent Design Creationists try to deal with the similarity between human and chimp genomes

The initial measurement of the difference between the human and chimp genomes was based on aligning 2.4 billion base pairs in the two genomes. This gave a difference of 1.23% by counting base pair substitutions and small deletions and insertions (indels). However, if you look at larger indels, including genes, you can come up with bigger values because you can count the total number of base pairs in each indel; for example, a deletion of 1,000 bp will be equivalent to 1,000 SNPs.
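Here's a toy example of why the two ways of counting give such different percentages. All the numbers are invented for illustration; they are not taken from the actual human-chimp alignments:

```python
# Toy illustration: the percent difference between two genomes depends on
# whether an indel counts as one event or as all of its base pairs.
# These counts are invented, not real alignment data.
aligned_bp = 1_000_000
substitutions = 12_000   # single-base differences
indel_events = 30        # number of insertion/deletion events
indel_bp = 15_000        # total base pairs contained in those indels

diff_as_events = (substitutions + indel_events) / aligned_bp
diff_as_bp = (substitutions + indel_bp) / aligned_bp

print(f"indels counted as events:     {diff_as_events:.2%}")  # ~1.20%
print(f"indels counted by base pair:  {diff_as_bp:.2%}")      # ~2.70%
```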

Thursday, February 16, 2023

What are the best Nobel Prizes in biochemistry & molecular biology since 1945?

The 2022 Nobel Prize in Physiology or Medicine went to Svante Pääbo “for his discoveries concerning the genomes of extinct hominins and human evolution”. It's one of a long list of Nobel Prizes awarded for technological achievement. In most cases, the new techniques led to a better understanding of science and medicine.

Since World War II, there have been significant advances in our understanding of biology but most of these have come about by the slow and steady accumulation of knowledge and not by paradigm-shifting breakthroughs. These advances don't often get recognized by the Nobel Prize committees because it's difficult to single out any one individual or any single experiment that merits a Nobel Prize. In some cases the Nobel Prize committees have tried to recognize major advances by picking out leaders who have made important contributions over a number of years, but their choices don't always satisfy others in the field. One of the notable successes is the awarding of Nobel Prizes to Max Delbrück, Alfred D. Hershey and Salvador E. Luria “for their discoveries concerning the replication mechanism and the genetic structure of viruses” (Nobel Prize in Physiology or Medicine 1969). Another is Edward B. Lewis, Christiane Nüsslein-Volhard and Eric F. Wieschaus “for their discoveries concerning the genetic control of early embryonic development” (Nobel Prize in Physiology or Medicine 1995).

Birds of a feather: epigenetics and opposition to junk DNA

There's an old saying that birds of a feather flock together. It means that people with the same interests tend to associate with each other. Its extended meaning refers to the fact that people who believe in one thing (X) tend to also believe in another (Y). It usually means that X and Y are both questionable beliefs and it's not clear why they should be associated.

I've noticed an association between those who promote epigenetics far beyond its reasonable limits and those who reject junk DNA in favor of a genome that's mostly functional. There's no obvious reason why these two beliefs should be associated with each other but they are. I assume it's related to the idea that both beliefs are presumed to be radical departures from the standard dogma so they reinforce the idea that the author is a revolutionary.

Or maybe it's just that sloppy thinking in one field means that sloppy thinking is the common thread.

Here's an example from Chapter 4 of a 2023 edition of the Handbook of Epigenetics (Third Edition).

The central dogma of life had clearly established the importance of the RNA molecule in the flow of genetic information. The understanding of transcription and translation processes further elucidated three distinct classes of RNA: mRNA, tRNA and rRNA. mRNA carries the information from DNA and gets translated to structural or functional proteins; hence, they are referred to as the coding RNA (RNA which codes for proteins). tRNA and rRNA help in the process of translation among other functions. A major part of the DNA, however, does not code for proteins and was previously referred to as junk DNA. The scientists started realizing the role of the junk DNA in the late 1990s and the ENCODE project, initiated in 2003, proved the significance of junk DNA beyond any doubt. Many RNA types are now known to be transcribed from DNA in the same way as mRNA, but unlike mRNA they do not get translated into any protein; hence, they are collectively referred to as noncoding RNA (ncRNA). The studies have revealed that up to 90% of the eukaryotic genome is transcribed but only 1%–2% of these transcripts code for proteins, the rest all are ncRNAs. The ncRNAs less than 200 nucleotides are called small noncoding RNAs and greater than 200 nucleotides are called long noncoding RNAs (lncRNAs).

In case you haven't been following my blog posts for the past 17 years, allow me to briefly summarize the flaws in that paragraph.

  • The central dogma has nothing to do with whether most of our genome is junk
  • There was never, ever, a time when knowledgeable scientists defended the idea that all noncoding DNA is junk
  • ENCODE did not "prove the significance of junk DNA beyond any doubt"
  • Not all transcripts are functional; most of them are junk RNA transcribed from junk DNA

So, I ask the same question that I've been asking for decades. How does this stuff get published?


Wednesday, February 15, 2023

The Maud Menten Center

Several people have commented on Facebook about an interview with Matthew Cobb [Reflections on the Double Helix's Platinum Anniversary]. Cobb has written several books that you all should have read by now. You may also be familiar with his name from Jerry Coyne's website since he is mentioned there quite a lot.

Cobb is currently writing a biography of Francis Crick and it's the reference to Crick that's attracting attention.

Watson and Crick had something that very few of us have, which is a very relaxed relationship that allowed arguing and talking and debating and being prepared to have crazy ideas that the other person could shoot down.

This is something that Crick continued to do with Nobelist Sydney Brenner, PhD, for over 20 years. Every morning, they would just yak at each other for hours on end and talk rubbish. And most of it was rubbish! But every now and again, there would be something really insightful that they had seen in an article that they could develop further.

And that’s not exactly possible in today’s world, which focuses on how many Nature, Science, or Cell articles are on a CV. So, they were living in a very different world that was much more relaxed. But I think if we could reinject a bit of that back into science—more freedom and time to explore—I think everybody would benefit.

I think we need an institute that's similar to the Institute for Advanced Study in Princeton except that it would be focused on biochemistry and molecular biology, including genomics and molecular evolution. It would be a place for theorists and it would encourage the kind of interactions that Cobb is talking about. I think we should create that institute in Toronto and call it the Maud Menten Center after Maud L. Menten, who got her undergraduate medical degree at the University of Toronto in 1907 and later on (1911) was awarded the advanced medical degree equivalent to a Ph.D.1 Most of you will know her from her work on the theory of enzyme kinetics with Leonor Michaelis. Michaelis-Menten kinetics is one of the most important contributions to biochemistry in the 20th century.

The Maud Menten Center should have a permanent staff of several smart scientists and facilities for a large number of visiting scientists who could spend their sabbaticals at the institute. There are major corporations in Toronto that could be encouraged to contribute to supporting research in this field but a lot of the support might come from various levels of government. The Canadian model is the Perimeter Institute for Theoretical Physics in Waterloo, Ontario, Canada.


1. There's a lot of confusion about the various degrees Menten got at the University of Toronto but with the help of a few people in the alumni office we were able to sort it out [The mystery of Maud Menten].

David Allis (1951 - 2023) and the "histone code"

C. David Allis died on January 8, 2023. You can read about his history of awards and accomplishments in the Nature obituary with the provocative subtitle Biologist who revolutionized the chromatin and gene-expression field. This refers to his work on histone acetyltransferases (HATs) and his ideas about the histone code.

The key paper on the histone code is,

Strahl, B. D., and Allis, C. D. (2000) The language of covalent histone modifications. Nature, 403:41-45. [doi: 10.1038/47412]

Histone proteins and the nucleosomes they form with DNA are the fundamental building blocks of eukaryotic chromatin. A diverse array of post-translational modifications that often occur on tail domains of these proteins has been well documented. Although the function of these highly conserved modifications has remained elusive, converging biochemical and genetic evidence suggests functions in several chromatin-based processes. We propose that distinct histone modifications, on one or more tails, act sequentially or in combination to form a ‘histone code’ that is read by other proteins to bring about distinct downstream events.

They are proposing that the various modifications of histone proteins can be read as a sort of code that's recognized by other factors that bind to nucleosomes and regulate gene expression.

This is an important contribution to our understanding of the relationship between chromatin structure and gene expression. Nobody doubts that transcription is associated with an open form of chromatin that correlates with demethylation of DNA and covalent modifications of histones, and nobody doubts that there are proteins that recognize modified histones. However, the key question is what comes first: the binding of transcription factors followed by changes to the DNA and histones, or changes to DNA and histones that open the chromatin so that transcription factors can bind? These two models are referred to as the histone code model and the recruitment model.

Strahl and Allis did not address this controversy in their original paper; instead, they concentrated on what happens after histones become modified. That's what they mean by "downstream events." Unfortunately, the histone code model has been appropriated by the epigenetics cult and they do not distinguish between cause and effect. For example,

The “histone code” is a hypothesis which states that DNA transcription is largely regulated by post-translational modifications to these histone proteins. Through these mechanisms, a person’s phenotype can change without changing their underlying genetic makeup, controlling gene expression. (Shahid et al., 2022)

The language used by fans of epigenetics strongly implies that it's the modification of DNA and histones that is the primary event in regulating gene expression and not the sequence of DNA. The recruitment model states that regulation is primarily due to the binding of transcription factors to specific DNA sequences that control regulation and then lead to the epiphenomenon of DNA and histone modification.

The unauthorized expropriation of the histone code hypothesis should not be allowed to diminish the contribution of David Allis.


Wikipedia vs experts and a proposal for "arbitrators"

Wikipedia is a not-for-profit crowdsourced encyclopedia that's open to anybody who wants to contribute. This is both a strength and a weakness but the weaknesses are becoming important in an age of fake news and misinformation. The rules of Wikipedia mean that amateurs can insert any information into science articles as long as it is backed by a reliable source. But "reliable sources" include the popular press and books that may or may not report the scientific consensus accurately. When knowledgeable experts try to correct information, or put it into the proper context, they are often opposed by Wikipedia administrators who have a built-in bias against experts—a bias that's not entirely unjustified but much abused. Consequently, scientists often get frustrated trying to deal with the rules and traditions of Wikipedians because these rules are very different than the standards in the scientific community.

Here's an interesting article by Piotr Konieczny on From Adversaries to Allies? The Uneasy Relationship between Experts and the Wikipedia Community.

Sunday, February 12, 2023

Happy Darwin Day! 2023

Charles Darwin, the greatest scientist who ever lived, was born on this day in 1809 [Darwin still spurs tributes, debates] [Happy Darwin Day!] [Darwin Day 2017]. Darwin is mostly famous for two things: (1) he described and documented the evidence for evolution and common descent and (2) he provided a plausible scientific explanation of evolution—the theory of natural selection. He put all this in a book, On the Origin of Species by Means of Natural Selection, published in 1859—a book that spurred a revolution in our understanding of the natural world. (You can still buy a first edition copy of the book but it will cost you several hundred thousand dollars.)

Friday, February 10, 2023

I finished my book!

The last few weeks have been hectic. I got the first page proofs about a month ago and my job was to proofread those pages and prepare the index. Proofreading is tedious and the pressure is on because after that you can only make minor changes. I found several serious errors that I had made so that was a bit deflating. In one of them, I completely screwed up the calculations of the size of a typical protein-coding gene and the amount of the genome that is devoted to genes.

The index was hard because I wanted to have as comprehensive an index as possible. Publishers usually give you the option of hiring someone to do the index and taking the charges out of your royalty. If you look at some of the books on your shelf you can easily pick out the ones that were indexed by someone who doesn't understand the material.

The next step is "second pages," the almost-final version of the book. I had to proofread that version knowing that this was the last chance to fix anything. This includes the index, which I was seeing for the first time in a formatted version after the copy editor had dealt with it. I found 55 issues in this version that had to be fixed. This included several copy edit changes that I had missed on two previous passes (e.g., "cyIne" instead of "cytosine"). That's very scary—I wonder how many others we've missed?1

That's it for me! The book is now "locked" for me. The publisher will deal with the issues I found, and a few others, and then send it off to the printer at the end of next week. It's still scheduled for release on May 16th. You can't see much about my book on the US version of Amazon but there's more on the Canadian version [What's in Your Genome].


1. There have always been some errors in my textbooks but all the copy editors had degrees in biochemistry so there were lots of people who could catch mistakes. Publishing a trade book is an entirely different kettle of fish because none of the people who worked on the book knew anything about the subject and most had no science background whatsoever.

Thursday, February 02, 2023

Not in the room where it happens!

My 9th great grandfather is John Banks (1619-1685) who was born in Essex (England) and moved to the colony of Connecticut in 1634. He married Marie Tainter (1619-1667) and settled down in Greenwich, Fairfield, Connecticut. I descend from his son John Banks (1650-1699).

My 8th great grandfather had a sister named Hannah Banks (1654-1684) and she married Daniel Burr (1660-1727). This was exciting because I knew that Daniel Burr was the grandfather of Aaron Burr, who became the third vice president of the United States. But that wasn't Aaron's main claim to fame because he also murdered Alexander Hamilton and that caused a musical named after the victim to erupt in New York City.

Could it be that I was a distant cousin of the man who sang songs like "The room where it happens"?

Alas, no. Daniel Burr and my distant cousin, Hannah Banks, were only married for a few years before she died. Burr then married Mary Sherwood1 but she had only two children (Eleanor and Hannah) before she died. Daniel Burr's third wife, Jane "Elizabeth" Pinkley, is the mother of Rev. Aaron Burr who is the father of the vice president and successful duelist.


1. I'm also related to Mary Sherwood.

How big is the human genome (2023)?

There are several different ways to describe the human genome but the most common one focuses on the DNA content of the nucleus in eukaryotes; it does not include mitochondrial and chloroplast DNA. The standard reference genome sequence consists of one copy of each of the 22 autosomes plus one copy of the X chromosome and one copy of the Y chromosome. That's the definition of genome that I will use here.

The earliest direct estimates of the size of the human genome relied on Feulgen staining. The stain is quantitative so a properly conducted procedure gives you the weight of DNA in the nucleus. According to these measurements, the standard diploid content of the human nucleus is 7.00 pg and the haploid content is 3.50 pg [See Ryan Gregory's Animal Genome Size Database].

Since the structure of DNA is known, we can estimate the average mass of a base pair. It is 650 daltons, or 1086 × 10^-24 g/bp. The size of the human genome in base pairs can be calculated by dividing the total mass of the haploid genome by the average mass of a base pair.

                        3.5 pg ÷ (1086 × 10^-12 pg/bp) = 3.2 × 10^9 bp
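As a quick check, here is the same arithmetic in Python (a minimal sketch using only the numbers quoted above):

```python
# Genome size estimated from Feulgen-staining mass measurements.
haploid_mass_pg = 3.5    # picograms of DNA per haploid nucleus
bp_mass_pg = 1086e-12    # ~650 daltons per base pair, expressed in picograms

genome_size_bp = haploid_mass_pg / bp_mass_pg
print(f"{genome_size_bp:.2e} bp")   # ~3.22e+09 bp, i.e. about 3.2 Gb
```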

The textbooks settled on this value of 3.2 Gb by the late 1960s since it was confirmed by reassociation kinetics. According to C0t analysis results from that time, roughly 10% of the genome consists of highly repetitive DNA, 25-30% is moderately repetitive and the rest is unique sequence DNA (Britten and Kohne, 1968).

A study by Morton (1991) looked at all of the estimates of genome size that had been published to date and concluded that the average size of the haploid genome in females is 3,227 Mb. This includes a complete set of autosomes and one X chromosome. The sum of the autosomes plus a Y chromosome comes to 3,122 Mb. The average is about 3,200 Mb, which is similar to most other estimates.

These estimates mean that the standard reference genome should be more than 3,227 Mb since it has to include all of the autosomes plus an X and a Y chromosome. The Y chromosome is about 60 Mb giving a total estimate of 3,287 Mb or 3.29 Gb.

The standard reference genome

The common assumption about the size of the human genome in the past two decades has dropped to about 3,000 Mb because the draft sequence of the human genome came in at 2,800 Mb and the so-called "finished" sequence was still considerably less than 3,200 Mb. Most people didn't realize that there were significant gaps in the draft sequence and in the "finished" sequence, so the actual size is larger than the amount of sequence. The latest estimate of the size of the human genome from the Genome Reference Consortium is 3,099,441,038 bp (3,099 Mb) (Build 38, patch 14 = GRCh38.p14, February 2022). This includes an actual sequence of 2,948,318,359 bp and an estimate of the size of the remaining gaps. The total size estimates have been steadily dropping from >3.2 Gb to just under 3.1 Gb.
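The size of the gap estimate is easy to back out from those two numbers (my arithmetic, using the figures quoted above):

```python
# How much of GRCh38.p14 is estimated gap rather than assembled sequence?
total_bp = 3_099_441_038      # GRCh38.p14 total, including estimated gaps
sequenced_bp = 2_948_318_359  # actual assembled sequence

gap_bp = total_bp - sequenced_bp
print(f"gaps: {gap_bp:,} bp ({gap_bp / total_bp:.1%} of the total)")
# gaps: 151,122,679 bp (4.9% of the total)
```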

The telomere-to-telomere assembly

The first complete sequence of a human genome was published in April, 2022 [The complete human genome sequence (2022)]. This telomere-to-telomere (T2T) assembly of every autosome and one X chromosome came in at 3,055 Mb (3.06 Gb). If you add in the Y chromosome, it comes to 3.12 Gb, which is very similar to the estimate for GRCh38.p14 (3.10 Gb). Based on all the available data, I think it's safe to say that the size of the human genome is about 3.1 Gb and not the 3.2 Gb that we've been using up until now.

Variations in genome size

Everything comes with a caveat and human genome size is no exception. The actual size of your genome may be different from mine and different from everyone else's, including your close relatives. This is because of the presence or absence of segmental duplications that can change the size of a human genome by as much as 200 Mb. It's possible to have a genome that's smaller than 3.0 Gb or one that's larger than 3.3 Gb without affecting fitness.

Nobody has figured out a good way to incorporate this genetic variation data into the standard reference genome by creating a sort of pangenome such as those we see in bacteria. The problem is that more and more examples of segmental duplications (and deletions) are being discovered every year, so annotating those changes is a nightmare. In fact, it's a major challenge just to reconcile the latest telomere-to-telomere sequence (T2T-CHM13) and the current standard reference genome [What do we do with two different human genome reference sequences?].



Britten, R. and Kohne, D. (1968) Repeated Sequences in DNA. Science 161:529-540. [doi: 10.1126/science.161.3841.529]

Morton, N.E. (1991) Parameters of the Human Genome. Proc. Natl. Acad. Sci. (USA) 88:7474-7476 [free article on PubMed Central]

International Human Genome Sequencing Consortium (2004) Finishing the euchromatic sequence of the human genome. Nature 431:931-945 [doi:10.1038/nature03001]

Saturday, January 28, 2023

ChatGPT won't pass my exams!

Here are a few questions for ChatGPT and its answers. The AI program takes the most common information on the web and spews it back at you. It cannot tell which information is correct or which information is more accurate.

It's easy to recognize that these answers were written by something that's not very good at critical thinking. I agree with other professors that they mimic typical undergraduate answers but I disagree that these answers would get them a passing grade.

ChatGPT shares one very important feature that's common in undergraduate answers to essay questions: it gives you lots of unnecessary information that's not directly relevant to the question.

It's important to note that (lol) these ChatGPT answers share another important feature with many of the answers on my exams: they look very much like BS!

Monday, January 23, 2023

Read a short preview of What's in Your Genome

Check out this website, What's in Your Genome if you want to see a preview of my book (Preface, Prologue, part of Chapter 1).

The book will be released on May 16, 2023. We are currently working on the proofs prior to typesetting. You can preorder the hardcover version on Amazon.ca (Canada) for $39.95 (Cdn). I don't know when the electronic version will be available. You can also preorder on Amazon.co.uk for £26.99.

You can't yet preorder on Amazon.com and there's no information on that site about availability. I don't know about other Amazon sites.



Monday, January 02, 2023

Jupiter weighs two quettagrams

New names for very large and very small weights and sizes have been adopted.

Last November's meeting of the General Conference on Weights and Measures wasn't covered by the major media outlets so you probably don't know that the mass of an electron is now one rontogram and the diameter of the universe is about one ronnameter [SI units get new prefixes for huge and tiny numbers].1

The official SI prefixes for very large things are now ronna (10^27) and quetta (10^30) and the prefixes for very small things are ronto (10^-27) and quecto (10^-30).
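A quick check of the title's claim, using the standard value for Jupiter's mass (the arithmetic is mine):

```python
# Is the mass of Jupiter really about two quettagrams?
jupiter_mass_kg = 1.898e27   # standard value for Jupiter's mass
quettagram_in_g = 1e30       # 1 Qg = 10^30 g

jupiter_mass_g = jupiter_mass_kg * 1000
print(f"Jupiter: ~{jupiter_mass_g / quettagram_in_g:.1f} Qg")  # ~1.9 quettagrams
```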

This is annoying because we've just gotten used to zetta, yotta, zepto, and yocto (adopted in 1991). I suspect that the change was prompted by the huge storage capacity of your latest smartphone (several yottabytes) and the wealth of the world's richest people (several zeptocents). Or maybe it was the price of houses in Toronto. Or something like that. In any case, we needed to prepare for kilo or mega increases.

The bad news is that the latest additions used up the last two available letters of the alphabet so if things get any bigger or smaller we may have to add a few more letters to the alphabet.


1. A friendly reader has pointed out that my title should have been "The mass of Jupiter is two quettagrams." My bad.

Sunday, January 01, 2023

The function wars are over

In order to have a productive discussion about junk DNA we needed to agree on how to define "function" and "junk." Disagreements over the definitions spawned the Function Wars that became intense over the past decade. That war is over and now it's time to move beyond nitpicking about terminology.

The idea that most of the human genome is composed of junk DNA arose gradually in the late 1960s and early 1970s. The concept was based on a lot of evidence dating back to the 1940s and it gained support with the discovery of massive amounts of repetitive DNA.

Various classes of functional DNA were known back then including: regulatory sequences, protein-coding genes, noncoding genes, centromeres, and origins of replication. Other categories have been added since then but the total amount of functional DNA was not thought to be more than 10% of the genome. This was confirmed with the publication of the human genome sequence.

From the very beginning, the distinction between functional DNA and junk DNA was based on evolutionary principles. Functional DNA was the product of natural selection and junk DNA was not constrained by selection. The genetic load argument was a key feature of Susumu Ohno's conclusion that 90% of our genome is junk (Ohno, 1972a; Ohno, 1972b).

A multivalent mRNA flu vaccine

Scientists have recently developed an mRNA-lipid nanoparticle flu vaccine that protects against all known flu variants.

There are four types of influenza viruses but only the A and B viruses cause significant problems [Types of Influenza Viruses]. Influenza A viruses are the ones that have caused pandemics in the past, and the current flu vaccinations are unlikely to offer much protection since they are only directed toward the specific subtypes that are predicted to cause problems in the next flu season. The next global flu pandemic will probably come from a new influenza A virus that nobody predicted.

Thursday, December 22, 2022

Junk DNA, TED talks, and the function of lncRNAs

Most of our genome is transcribed but so far only a small number of these transcripts have a well-established biological function.

The fact that most of our genome is transcribed has been known for 50 years but it only became widely known with the publication of ENCODE's preliminary results in 2007 (ENCODE, 2007). The ENCODE scientists referred to this as "pervasive transcription" and this label has stuck.

By the end of the 1970s we knew that much of this transcription was due to introns. The latest data show that protein-coding genes and known noncoding genes occupy about 45% of the genome, and most of that is intron sequences that are mostly junk. That leaves 30-40% of the genome that is transcribed at some point, producing something like one million transcripts of unknown function.

Wednesday, December 21, 2022

A University of Chicago history graduate student's perspective on junk DNA

A new master's thesis on the history of junk DNA has been posted. It's from the Department of History at the University of Chicago.

My routine scan for articles on junk DNA turned up the abstract of an M.A. thesis on the history of junk DNA: Requiem for a Gene: The Problem of Junk DNA for the Molecular Paradigm. The supervisor is Professor Emily Kern in the Department of History at the University of Chicago. I've written to her to ask for a copy of the thesis and for permission to ask her, and her student, some questions about the thesis. No reply so far.

Here's the abstract of the thesis.

“Junk DNA” has been at the center of several high-profile scientific controversies over the past four decades, most recently in the disputes over the ENCODE Project. Despite its prominence in these debates, the concept has yet to be properly historicized. In this thesis, I seek to redress this oversight, inaugurating the study of junk DNA as a historical object and establishing the need for an earlier genesis for the concept than scholars have previously recognized. In search of a new origin story for junk, I chronicle developments in the recognition and characterization of noncoding DNA sequences, positioning them within existing historiographical narratives. Ultimately, I trace the origin of junk to 1958, when a series of unexpected findings in bacteria revealed the existence of significant stretches of DNA that did not encode protein. I show that the discovery of noncoding DNA sequences undermined molecular biologists’ vision of a gene as a line of one-dimensional code and, in turn, provoked the first major crisis in their nascent field. It is from this crisis, I argue, that the concept of junk DNA emerged. Moreover, I challenge the received narrative of junk DNA as an uncritical reification of the burgeoning molecular paradigm. By separating the history of junk DNA from its mythology, I demonstrate that the conceptualization of junk DNA reveals not the strength of molecular biological authority but its fragility.

It looks like it might be a history of noncoding DNA but I won't know for certain until I see the entire thesis. It's only available to students and staff at the University of Chicago.


Sunday, December 18, 2022

Protein concentrations in E. coli are mostly controlled at the level of transcription initiation

The most important step in the regulation of protein-coding genes in E. coli is the rate of binding of RNA polymerase to the promoter region.

A group of scientists at the University of California at San Diego and their European collaborators looked at the concentrations of proteins and mRNAs of about 2000 genes in E. coli. They catalogued these concentrations under several different growth conditions in order to determine whether the level of protein being expressed from each of these genes correlated with transcription rate, translation rate, mRNA stability or other levels of gene expression.

The paper is very difficult to understand because the authors are primarily interested in developing mathematical formulae to describe their results. They expect you to understand their equations even though they don't explain the parameters very well. A lot of important information is in the supplements and I couldn't be bothered to download and read them. I don't think the math is anywhere near as important as the data and the conclusions.
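For what it's worth, the conclusion doesn't require the paper's formalism. In the standard two-stage model of gene expression, steady-state protein concentration is directly proportional to the transcription initiation rate, which is why the promoter is the natural control point. The sketch below is that textbook model with made-up parameter values, not the authors' equations:

```python
# Textbook two-stage model of bacterial gene expression (not the paper's
# formalism). mRNA:    dm/dt = k_tx - d_m * m
#             protein: dp/dt = k_tl * m - d_p * p
# At steady state: m = k_tx / d_m and p = k_tl * m / d_p, so protein level
# scales linearly with the transcription initiation rate k_tx.
k_tx = 0.5      # transcripts initiated per second (made-up value)
d_m = 0.005     # mRNA decay rate per second (~2 min half-life)
k_tl = 0.1      # proteins made per mRNA per second (made-up value)
d_p = 0.0003    # protein dilution rate per second (~40 min doubling time)

m_ss = k_tx / d_m
p_ss = k_tl * m_ss / d_p
print(f"steady state: ~{m_ss:.0f} mRNAs, ~{p_ss:.0f} proteins per cell")
# Doubling k_tx doubles the protein level; that's regulation at initiation.
```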

Friday, December 16, 2022

Publishing a science book - Lesson #1: The publisher is always right about everything

Don't bother trying to reason with a publisher. All of them have different views on proper style and every single one of them is absolutely certain that their style is the only correct one.

I'm in the middle of the copyedit stage of my book. This is the stage where a copyeditor goes through your manuscript and makes any corrections to spelling and grammar. This is a lot of work for any copyeditor having to deal with one of my manuscripts and I greatly appreciate the effort. My book is a lot better now than it was a few weeks ago. (Who knew that there was only one l in canceled?)

It's also the stage where the publisher imposes their particular style on the manuscript and that can be a problem. I'll document some of the issues in subsequent posts but to give you an example, consider the titles of books in the reference list. I wrote it like this: The Selfish Gene and Molecular and Genome Evolution. This is not in line with my publisher's handbook of style so the titles were converted to lowercase as in: The selfish gene and Molecular and genome evolution. I objected, pointing to numerous other science books that used the same titles that are on the covers of the books and suggesting that my readers were more familiar with The Selfish Gene than with The selfish gene.

I was overruled by my publisher who noted that they make their style choices for good reasons—it's for "consistency, clarity, and ease of reading." I assume that publishers, such as Oxford, would make the same argument while insisting that the title should be The Selfish Gene.

In case you ever find yourself in this position, you should keep in mind that your contract will almost certainly say that the publisher has complete control of your book and they can make any changes they want as long as it doesn't affect the meaning of what you wrote.

Here's what it says in my contract, "The Publisher shall publish the Author's work in whatever style and format it thinks most suitable ... While the Publisher may, in its sole discretion, consult the Author with respect to said style and format, the Publisher retains the right to make all final decisions on matters of format, design, selling price and marketing."

I was aware of some issues with inappropriate covers and titles in the past so I had an extra sentence added to the contract that said, "The Publisher and Author will discuss and agree upon the title and cover design." It's a good thing I put that in because the publisher was pressuring me to change the title of the book and I was able to resist.

Authors can't win most fights over style and format. I've been discussing the publishing of science books with a number of other authors over the past few months and several of them told me not to bother trying to argue with a publisher because they will never give in. They have a set style for all books and they won't make an exception for an individual author no matter how good an argument you make.

I didn't listen to those other authors. Silly me.

I'm thinking of trying to write a standard set of guidelines that scientists could put into their contracts to cover the most egregious style restrictions. It might be helpful if all science writers would insist on inserting these guidelines into their contracts.


Can the AI program ChatGPT pass my exam?

There's a lot of talk about ChatGPT and how it can prepare lectures and get good grades on undergraduate exams. However, ChatGPT is only as good as the information that's popular on the internet and that's not always enough to get a good grade on my exam.

ChatGPT is an artificial intelligence (AI) program that's designed to answer questions using a style and language that's very much like the responses you would get from a real person. It was developed by OpenAI, a tech company in San Francisco. You can create an account and log in to ask any question you want.

Several professors have challenged it with exam questions and they report that ChatGPT would easily pass their exams. I was skeptical, especially when it came to answering questions on controversial topics where there is no clear answer. I also suspected that ChatGPT would get its answers from the internet and this means that popular, but incorrect, views would likely be part of ChatGPT's response.

Here are my questions and the AI program's answers. It did quite well in some cases but not so well in others. My main concern is that programs like this might be judged to be reliable sources of information despite the fact that the real source is suspect.

Monday, December 12, 2022

Did molecular biology make any contribution to evolutionary theory?

Some evolutionary biologists think—incorrectly, in my opinion—that molecular biology has made no contributions to our understanding of evolution.

PNAS published a series of articles on Gregor Mendel and one of them caught my eye. Here's what Nicholas Barton wrote in his article The "New Synthesis".

During the 1960s and 1970s, there were further conceptual developments—largely independent of the birth of molecular biology during the previous two decades (15). First, there was an understanding that adaptations cannot be explained simply as being “for the good of the species” (16, 17). One must explain how the genetic system (including sexual reproduction, recombination, and a fair meiosis, with each copy of a gene propagating with the same probability) is maintained through selection on individual genes, and remains stable despite mutations that would disrupt the system (17, 19, 20). Second, and related to this, there was an increased awareness of genetic conflicts that arise through sexual reproduction; selfish elements may spread through biased inheritance, even if they reduce individual fitness (19, 21, 22). In the decade following the discovery that DNA carries genetic information, all the fundamental principles of molecular biology were established: the flow of information from sequences of DNA through RNA to protein, the regulation of genes by binding to specific sequences in promoters, and the importance of allostery in allowing arbitrary regulatory networks (23, 24). Yet, the extraordinary achievements of molecular biology had little effect on the conceptual development of evolutionary biology. Conversely, although evolutionary arguments were crucial in the founding of molecular biology, they have had rather little influence in the half-century since (e.g., ref. 25). Of course, molecular biology has revealed an astonishing range of adaptations that demand explanation—for example, the diversity of biochemical pathways, that allow exploitation of almost any conceivable resource, or the efficiency of molecular machines such as the ribosome, which translates the genetic code. Technical advances have brought an accelerating flood of data, most recently, giving us complete genome sequences and expression patterns from any species. Yet, arguably, no fundamentally new principles have been established in molecular biology, and, in evolutionary biology, despite sophisticated theoretical advances and abundant data, we still grapple with the same questions as a century or more ago.

This does not seem fair to me. I think that neutral theory, nearly neutral theory, and the importance of random genetic drift relied heavily on work done by molecular biologists. Similarly, the development of dating techniques using DNA and protein sequences is largely the work of molecular biologists. It wasn't the adaptationists or the paleontologists who discovered that humans and chimpanzees shared a common ancestor 5-7 million years ago and it wasn't either of those groups who discovered the origin of mitochondria.

And some of us are grappling with the idea that most of our genome is junk DNA, a question that never would have occurred to evolutionary biologists from a century ago.

Barton knows all about modern population genetics and the importance of neutral theory because later on he says,

If we consider a single allele, then we can see it as “effectively neutral” if its effect on fitness is less than ∼1/2Ne. This idea was used by Ohta (54) in a modification of the neutral theory, to suggest why larger populations might be less diverse than expected (because a smaller fraction of mutations would be effectively neutral), and why rates of substitution might be constant per year rather than per generation (because species with shorter generation times might tend to have large populations, and have a smaller fraction of effectively neutral mutations that contribute to long-term evolution). Lynch (21) has applied this concept to argue that molecular adaptations that are under weak selection cannot be established or maintained in (relatively) smaller populations, imposing a “drift barrier” to adaptation. Along the same lines, Kondrashov (55) has argued that deleterious mutations with Nes ≈ 1 will accumulate, steadily degrading the population. Both ideas seem problematic if we view adaptation as due to optimization of polygenic traits: Organisms can be well adapted even if drift dominates selection on individual alleles, and, under a model of stabilizing selection on very many traits, any change that degrades fitness can be compensated.

Barton may think that the drift-barrier hypothesis is "problematic" but it certainly seems like a significant advance that owes something to molecular biology.

What do you think? Do you agree with Barton that "the extraordinary achievements of molecular biology had little effect on the conceptual development of evolutionary biology"?


Friday, December 02, 2022

Sequencing both copies of your diploid genome

New techniques are being developed to obtain the complete sequences of both copies (maternal and paternal) of a typical diploid individual.

The first two sequences of the human genome were published twenty years ago by the International Human Genome Project and by a company called Celera Genomics. The published sequences were a consensus using DNA from multiple individuals so the final result didn't represent the sequence of any one person. Furthermore, since each of us has inherited separate genomes from our mother and father, our DNA is actually a mixture of two different haploid genomes. Most published genome sequences are an average of these two separate genomes where the choice of nucleotide at any one position is arbitrary.

The first person to have a complete genome sequence was James Watson in 2007 but that was a composite genome sequence. Craig Venter's genome sequence was published a few months later and it was the first complete genome sequence containing separate sequences of each of his 46 chromosomes (one set of 23 from each of his parents). In today's language, we refer to this as a diploid sequence.

The current reference sequence is based on the data published by the public consortium (International Human Genome Project)—nobody cares about the Celera sequence. Over the years, more and more sequencing data has been published and this has been incorporated into the standard human reference genome in order to close most gaps and improve the accuracy. The current version is called GRCh38.p14, from February 3, 2022. It's only 95% complete because it's missing large stretches of repetitive DNA, especially in the centromere regions and at the ends of each chromosome (telomeric regions).

The important point for this discussion is that GRCh38 is not representative of the genomes of most people on Earth because there has been a bias in favor of sequencing European genomes. (Some variants are annotated in the reference genome but this can't continue.) Many scientists are interested in the different kinds of variants present in the human population so they would like to create databases of genomes from diverse populations.

The first complete, telomere-to-telomere (T2T), human genome sequence was published last year [A complete human genome sequence (2022)]. It was made possible by advances in sequencing technology that generated long reads of 10,000 bp and ultra-long reads of up to 1,000,000 bp [Telomere-to-telomere sequencing of a complete human genome]. The DNA is from a CHM13 cell line that has identical copies of each chromosome so there's no ambiguity due to differences in the maternal and paternal copies. The full name of this sequence is CHM13-T2T.

The two genomes (GRCh38 and CHM13) can't be easily merged so right now there are competing reference genomes [What do we do with two different human genome reference sequences?].

The techniques used to sequence the CHM13 genome make it possible to routinely obtain diploid genome sequences from a large number of individuals because overlapping long reads can link markers on the same chromosome and distinguish between the maternal and paternal chromosomes. However, in practice, the error rate of long read sequencing made assembly of separate chromosomes quite difficult. Recent advances in the accuracy of long read sequencing have been developed by PacBio, and this high fidelity sequencing (PacBio HiFi sequencing) promises to change the game.

The Human Pangenome Reference Consortium has tackled the problem by sequencing the genome of an Ashkenazi man (HG002) and his parents (HG003, the father, and HG004, the mother) using the latest sequencing techniques. They then asked the genome community to submit their assemblies using their best software in a kind of "assembly bakeoff." They got 23 responses.

Jarvis, E. D., Formenti, G., Rhie, A., Guarracino, A., Yang, C., Wood, J., et al. (2022) Semi-automated assembly of high-quality diploid human reference genomes. Nature, 611:519-531. [doi: 10.1038/s41586-022-05325-5]

The current human reference genome, GRCh38, represents over 20 years of effort to generate a high-quality assembly, which has benefitted society. However, it still has many gaps and errors, and does not represent a biological genome as it is a blend of multiple individuals. Recently, a high-quality telomere-to-telomere reference, CHM13, was generated with the latest long-read technologies, but it was derived from a hydatidiform mole cell line with a nearly homozygous genome. To address these limitations, the Human Pangenome Reference Consortium formed with the goal of creating high-quality, cost-effective, diploid genome assemblies for a pangenome reference that represents human genetic diversity. Here, in our first scientific report, we determined which combination of current genome sequencing and assembly approaches yield the most complete and accurate diploid genome assembly with minimal manual curation. Approaches that used highly accurate long reads and parent–child data with graph-based haplotype phasing during assembly outperformed those that did not. Developing a combination of the top-performing methods, we generated our first high-quality diploid reference assembly, containing only approximately four gaps per chromosome on average, with most chromosomes within ±1% of the length of CHM13. Nearly 48% of protein-coding genes have non-synonymous amino acid changes between haplotypes, and centromeric regions showed the highest diversity. Our findings serve as a foundation for assembling near-complete diploid human genomes at scale for a pangenome reference to capture global genetic variation from single nucleotides to structural rearrangements.

We don't need to get into all the details but there are a few observations of interest.

  • All of the attempted assemblies were reasonably good but the best ones had to make use of the parental genomes to resolve discrepancies.
  • Some assemblies began by separating the HG002 (child) sequences into two separate groups based on their similarity to one of the parents (see the sketch after this list). Others generated assemblies without using the parental data and then fixed any problems by using the parental genomes and a technique called "graph-based phasing." The second approach was better.
  • All of the final assemblies were contaminated with varying amounts of E. coli and yeast DNA and/or various adaptor DNA sequences that were not removed by filters. All of them were contaminated with mitochondrial DNA that did not belong in the assembled chromosomes.
  • The most common sources of assembly errors were: (1) missing joins where large stretches of DNA should have been brought together, (2) misjoins where two large stretches (contigs) were inappropriately joined, (3) incorrect inversions, and (4) false duplications.
  • The overall accuracy of the best assemblies was one base pair error in 100,000 bp (10^-5).
  • Using the RefSeq database of 27,225 genes, most assemblies captured almost all of these confirmed and probable genes but several hundred were not complete and many were missing.
  • No chromosome was complete telomere-to-telomere (T2T) but most were nearly complete, including the complicated centromere and telomere regions.
  • The two genomes (paternal and maternal) differed at 2.6 million SNPs (single nucleotides), 631,000 small structural variations (<50 bp), and 11,600 large structural variations (>50 bp).
  • The consortium used the best assembly algorithm to analyze the genomes of an additional 47 individuals. They began with the same coverage used for HG002; namely, 35X coverage. (Each stretch of DNA was sequenced 35 times on average - about equal amounts in both directions.) This was not successful so they had to increase the coverage to 130X to get good assemblies. They estimate that each additional diploid sequence will require 50-60X coverage. This kind of coverage would have been impossible in the 1990s when the first human genome was assembled but now it's fairly easy as long as you have the computer power and storage to deal with it.
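For the curious, here is a highly simplified sketch of the trio-binning idea mentioned in the list above: the child's reads are classified using k-mers that occur in only one parent's genome. Real pipelines use much larger k-mers, counting thresholds, and graph-based phasing; everything below, including the sequences, is invented for illustration:

```python
# Toy trio binning: assign a child's reads to the maternal or paternal
# haplotype using k-mers unique to one parent. Illustrative only.
def kmers(seq, k=5):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def bin_read(read, dad_only, mom_only):
    d = len(kmers(read) & dad_only)
    m = len(kmers(read) & mom_only)
    if d > m:
        return "paternal"
    if m > d:
        return "maternal"
    return "unassigned"  # no informative k-mers; resolve later by phasing

# Made-up parental sequences for the example.
dad = "ACGTACGTTTGACCA"
mom = "ACGTACGAATGACCA"
dad_only = kmers(dad) - kmers(mom)
mom_only = kmers(mom) - kmers(dad)

print(bin_read("CGTTTGACC", dad_only, mom_only))  # paternal
print(bin_read("CGAATGACC", dad_only, mom_only))  # maternal
```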


Thursday, December 01, 2022

University of Michigan biochemistry students edit Wikipedia

Students in a special topics course at the University of Michigan were taught how to edit a Wikipedia article in order to promote function in repetitive DNA and downplay junk.

The Wikipedia article on Repeated sequence (DNA) was heavily edited today by students who were taking an undergraduate course at the University of Michigan. One of the student leaders, Rasberry Neuron, left the following message on the "Talk" page.

This page was edited for a course assignment at the University of Michigan. The editing process included peer review by four students, the Chemistry librarian at the University of Michigan, and course instructors. The edits published on 12/01/2022 reflect improvements guided by the original editing team and the peer review feedback. See the article's History page for information about what changes were made from the previous version.

References to junk DNA were removed by the students but quickly added back by Paul Gardner who is currently fixing other errors that the students have made.

I checked out the webpage for the course at CHEM 455_505 Special Topics in Biochemistry - Nucleic Acids Biochemistry. The course description is quite revealing.

We now realize that the human genome contains at least 80,000 non-redundant non-coding RNA genes, outnumbering protein-coding genes by at least 4-fold, a revolutionary insight that has led some researchers to dub the eukaryotic cell an “RNA machine”. How exactly these ncRNAs guide every cellular function – from the maintenance and processing to the regulated expression of all genetic information – lies at the leading edge of the modern biosciences, from stem cell to cancer research. This course will provide an equally broad as deep overview of the structure, function and biology of DNA and particularly RNA. We will explore important examples from the current literature and the course content will evolve accordingly.

The class will be taught from a chemical/molecular perspective and will bring modern interdisciplinary concepts from biochemistry, biophysics and molecular biology to the fore.

Most of you will recognize right away that there are factually incorrect statements (i.e. misinformation) in that description. It is not true that there are at least 80,000 noncoding genes in the human genome. At some point in the future that may turn out to be true but it's highly unlikely. Right now, there are at most 5,000 proven noncoding genes. There are many scientists who claim that the mere existence of a noncoding transcript is proof that a corresponding gene must exist but that's not how science works. Before declaring that a gene exists you must present solid evidence that it produces a biologically relevant product [Most lncRNAs are junk] [Wikipedia blocks any mention of junk DNA in the "Human genome" article] [Editing the Wikipedia article on non-coding DNA] [On the misrepresentation of facts about lncRNAs] [The "standard" view of junk DNA is completely wrong] [What's In Your Genome? - The Pie Chart] [How many lncRNAs are functional?].

I'm going to email a link to this post to the course instructors and some of the students. Let's see if we can get them to discuss junk DNA.


Monday, November 21, 2022

How not to write a Nature abstract

A friend recently posted a figure on Facebook that instructs authors in the correct way to prepare a summary paragraph (abstract) for publication in Nature. It uses a specific example and the advice is excellent [How to construct a Nature summary paragraph].

I thought it might be fun to annotate a different example so I randomly selected a paper on genomics to see how it compared. The one that popped up was An integrated encyclopedia of DNA elements in the human genome.