Saturday, September 10, 2022

Wikipedia articles: Quality and importance rankings

Wikipedia has a way of assessing the quality of articles that have been posted and edited. The rankings are somewhat confusing and it’s hard to find the complete list of quality categories so I’m putting a link to Wikipedia: Content assessment here.

There are six categories ranging from FA (featured article) to C.

Monday, September 05, 2022

The 10th anniversary of the ENCODE publicity campaign fiasco

On Sept. 5, 2012 ENCODE researchers, in collaboration with the science journal Nature, launched a massive publicity campaign to convince the world that junk DNA was dead. We are still dealing with the fallout from that disaster.

The Encyclopedia of DNA Elements (ENCODE) was originally set up to discover all of the functional elements in the human genome. They carried out a massive number of experiments involving a huge group of researchers from many different countries. The results of this work were published in a series of papers in the September 6th, 2012 issue of Nature. (The papers appeared on Sept. 5th.)

Sunday, September 04, 2022

Wikipedia: the ENCODE article

The ENCODE article on Wikipedia is a pretty good example of how to write a science article. Unfortunately, there are a few issues that will be very difficult to fix.

When Wikipedia was formed twenty years ago, there were many people who were skeptical about the concept of a free crowdsourced encyclopedia. Most people understood that a reliable source of information was needed for the internet because the traditional encyclopedias were too expensive, but could it be done by relying on volunteers to write articles that could be trusted?

The answer is mostly “yes” although that comes with some qualifications. Many science articles are not good; they contain inaccurate and misleading information and often don’t represent the scientific consensus. They also tend to be disjointed and unreadable. On the other hand, many non-science articles are at least as good as, and often better than, anything in the traditional encyclopedias (e.g., Battle of Waterloo; Toronto, Ontario; The Beach Boys).

By 2008, Wikipedia had expanded enormously and the quality of articles was being compared favorably to those of Encyclopedia Britannica, which had been forced to go online to compete. However, this comparison is a bit unfair since it downplays science articles.

Monday, August 29, 2022

The creationist view of junk DNA

Here's a recent video podcast (Aug. 23, 2022) from the Institute for Creation Research (sic). It features an interview with Dr. Jeff Tomkins of the ICR where he explains the history of junk DNA and why scientists no longer believe in junk DNA.

Most Sandwalk readers will recognize all the lies and distortions but here's the problem: I suspect that the majority of biologists would pretty much agree with the creationist interpretation. They also believe that junk DNA has been refuted and most of our genome is functional.

That's very sad.


Friday, August 26, 2022

ENCODE and their current definition of "function"

ENCODE has mostly abandoned its definition of function based on biochemical activity and replaced it with "candidate" function or "likely" function, but the message isn't getting out.

Back in 2012, the ENCODE Consortium announced that 80% of the human genome was functional and junk DNA was dead [What did the ENCODE Consortium say in 2012?]. This claim was widely disputed, causing the ENCODE Consortium leaders to back down in 2014 and restate their goal (Kellis et al. 2014). The new goal is merely to map all the potential functional elements.

... the Encyclopedia of DNA Elements Project [ENCODE] was launched to contribute maps of RNA transcripts, transcriptional regulator binding sites, and chromatin states in many cell types.

The new goal was repeated when the ENCODE III results were published in 2020, although you had to read carefully to recognize that they were no longer claiming to identify functional elements in the genome and they were raising no objections to junk DNA [ENCODE 3: A lesson in obfuscation and opaqueness].

Wednesday, August 24, 2022

Junk DNA vs noncoding DNA

The Wikipedia article on the Human genome contained a reference that I had not seen before.

"Finally DNA that is deleterious to the organism and is under negative selective pressure is called garbage DNA.[43]"

Reference 43 is a chapter in a book.

Pena S.D. (2021) "An Overview of the Human Genome: Coding DNA and Non-Coding DNA". In Haddad LA (ed.). Human Genome Structure, Function and Clinical Considerations. Cham: Springer Nature. pp. 5–7. ISBN 978-3-03-073151-9.

Sérgio Danilo Junho Pena is a human geneticist and professor in the Dept. of Biochemistry and Immunology at the Federal University of Minas Gerais in Belo Horizonte, Brazil. He is a member of the Human Genome Organization council. If you click on the Wikipedia link, it takes you to an excerpt from the book where S.D.J. Pena discusses "Coding and Non-coding DNA."

There are two quotations from that chapter that caught my eye. The first one is,

"Less than 2% of the human genome corresponds to protein-coding genes. The functional role of the remaining 98%, apart from repetitive sequences (constitutive heterochromatin) that appear to have a structural role in the chromosome, is a matter of controversy. Evolutionary evidence suggests that this noncoding DNA has no function—hence the common name of 'junk DNA.'"

Professor Pena then goes on to discuss the ENCODE results pointing out that there are many scientists who disagree with the conclusion that 80% of our genome is functional. He then says,

"Many evolutionary biologists have stuck to their guns in defense of the traditional and evolutionary view that non-coding DNA is 'junk DNA.'"

This is immediately followed by a quote from Dan Graur, implying that he (Graur) is one of the evolutionary biologists who defend the evolutionary view that noncoding DNA is junk.

I'm very interested in tracking down the reason for equating noncoding DNA and junk DNA, especially in contexts where the claim is obviously wrong. So I wrote to Professor Pena—he got his Ph.D. in Canada—and asked him for a primary source that supports the claim that "evolutionary science suggests that this noncoding DNA has no function."

He was kind enough to reply saying that there are multiple sources and he sent me links to two of them. Here's the first one.

I explained that this was somewhat ironic since I had written most of the Wikipedia article on Non-coding DNA and my goal was to refute the idea that noncoding DNA and junk DNA were synonyms. I explained that under the section on 'junk DNA' he would see the following statement that I inserted after writing sections on all those functional noncoding DNA elements.

"Junk DNA is often confused with non-coding DNA[48] but, as documented above, there are substantial fractions of non-coding DNA that have well-defined functions such as regulation, non-coding genes, origins of replication, telomeres, centromeres, and chromatin organizing sites (SARs)."

That's intended to dispel the notion that proponents of junk DNA ever equated noncoding DNA and junk DNA. I suggested that he couldn't use that source as support for his statement.

Here's my response to his second source.

The second reference is to a 2007 article by Wojciech Makalowski,1 a prominent opponent of junk DNA. He says, "In 1972 the late geneticist Susumu Ohno coined the term 'junk DNA' to describe all noncoding sections of a genome" but that is a demonstrably false statement in two respects.

First, Ohno did not coin the term "junk DNA" - it was commonly used in discussions about genomes and even appeared in print many years before Ohno's paper. Second, Ohno specifically addresses regulatory sequences in his paper so it's clear that he knew about functional noncoding DNA that was not junk. He also mentions centromeres and I think it's safe to assume that he knew about ribosomal RNA genes and tRNA genes.

The only possible conclusion is that Makalowski is wrong on two counts.

I then asked about the second statement in Professor Pena's article and suggested that it might have been much better to say, "Many evolutionary biologists have stuck to their guns and defend the view that most of the human genome is junk." He agreed.

So, what have we learned? Professor Pena is a well-respected scientist and an expert on the human genome. He is on the council of the Human Genome Organization. Yet, he propagated the common myth that noncoding DNA is junk and saw nothing wrong with Makalowski's false reference to Susumu Ohno. Professor Pena himself must be well aware of functional noncoding elements such as regulatory sequences and noncoding genes so it's difficult to explain why he would imagine that prominent defenders of junk DNA don't know this.

I think the explanation is that this connection between noncoding DNA and junk DNA is so entrenched in the popular and scientific literature that it is just repeated as a meme without ever considering whether it makes sense.


1. The pdf appears to be a response to a query in Scientific American on February 12, 2007. It may be connected to a Scientific American paper by Khajavinia and Makalowski (2007).

Khajavinia, A., and Makalowski, W. (2007) What is "junk" DNA, and what is it worth? Scientific American, 296:104. [PubMed]

Tuesday, August 23, 2022

Are synonymous mutations mostly neutral or are they deleterious?

A recent paper in Nature claims that 75% of synonymous mutations reduce fitness in yeast. The results were challenged (refuted?) ten weeks later in a manuscript posted on a preprint server.

The first paper was published in the June 23, 2022 issue of Nature (Shen et al., 2022). The authors looked at mutations in 21 nonessential genes in yeast where mutations are known to lower fitness. They created mutations in the coding regions of these genes using the CRISPR-Cas9 editing technique. A total of 1,866 synonymous mutations were created as well as 6,306 non-synonymous mutations and 169 nonsense mutations.

Monday, August 22, 2022

NPR vs CDC on the new COVID-19 guidelines

NPR tweeted out a summary of the new CDC (United States) guidelines on COVID-19. The figure was posted under the name of Dr. Marcus Plescia, chief medical officer for the Association of State and Territorial Health Officials. I've posted a screenshot of the figure on the right.

Before discussing the four bullet points, I want to emphasize that Marcus Plescia issued a press release on August 11, 2022 when the new guidelines came out and it did not mention the points in the NPR figure. In fact, it seems to me that he would not agree with the NPR summary.

Is every gene associated with cancer?

There has been an enormous expansion of papers on cancer and many of them make a connection with a particular human gene. A recent note in Trends in Genetics revealed that 15,233 human genes have already been mentioned in a cancer paper (de Magalhães, 2022). (I'm pretty sure the author is only referring to protein-coding genes.)

The author notes that this association doesn't necessarily mean that there's a cause-and-effect relationship and also notes that justifying a connection between your favorite gene and a cancer grant application is a factor. However, he concludes that,

In genetics and genomics, literally everything is associated with cancer. If a gene has not been associated with cancer yet, it probably means it has not been studied enough and will most likely be associated with cancer in the future. In a scientific world where everything and every gene can be associated with cancer, the challenge is determining which are the key drivers of cancer and more promising therapeutic targets.

I think he's on to something. I predict that all noncoding genes will eventually be associated with cancer as well. Not only that, I predict that several thousand fake genes will also be associated with cancer. It won't be long before there are 100,000 human genes associated with cancer and then the remaining parts of the genome will also be mentioned in cancer papers.

This will mean the end of junk DNA because anything that causes cancer must be part of a functional genome.

I hope my book comes out before this becomes widely known.


de Magalhães, J.P. (2022) Every gene can (and possibly will) be associated with cancer. TRENDS in Genetics 38:216-217 [doi: 10.1016/j.tig.2021.09.005]

Sunday, August 21, 2022

Splicing errors or alternative splicing?

The most important issue in alternative splicing, in my opinion, is whether splice variants are due to splicing errors (= junk RNA) or whether they reflect real biologically relevant alternative splicing.

Unfortunately, this view is not shared by the majority of scientists who work in this field. They are convinced that the vast majority of splice variant transcripts represent real examples of regulation and the main task is to document the extent of alternative splicing and characterize the various mechanisms.

I've written a lot about this topic over the years (see the list of posts at the bottom of this page). The two most important issues are: (1) the frequency of splicing errors and whether it can account for the splice variants and (2) the number of well-established, genuine, examples of biologically relevant alternative splicing and whether that's consistent with the claims.

I managed to post a summary of the data on the accuracy of splicing on the Intron article on Wikipedia and I urge you to take a look at it before it disappears. The bottom line is that splicing is not terribly accurate so we expect to detect a fairly high level of incorrectly spliced transcripts whenever we look at a collection of RNAs from a particular cell line. The expected number of misspliced transcripts is well within the concentrations of 'alternatively spliced' transcripts reported in most studies.

Saturday, August 20, 2022

Editing the 'Intergenic region' article on Wikipedia

Just before getting banned from Wikipedia, I was about to deal with a claim on the Intergenic region article. I had already fixed most of the other problems but there is still this statement in the subsection labeled "Properties."

According to the ENCODE project's study of the human genome, due to "both the expansion of genic regions by the discovery of new isoforms and the identification of novel intergenic transcripts, there has been a marked increase in the number of intergenic regions (from 32,481 to 60,250) due to their fragmentation and a decrease in their lengths (from 14,170 bp to 3,949 bp median length)"[2]

The source is one of the ENCODE papers published in the September 6 edition of Nature (Djebali et al., 2012). The quotation is accurate. Here's the full quotation.

As a consequence of both the expansion of genic regions by the discovery of new isoforms and the identification of novel intergenic transcripts, there has been a marked increase in the number of intergenic regions (from 32,481 to 60,250) due to their fragmentation and a decrease in their lengths (from 14,170 bp to 3,949 bp median length).

What's interesting about that data is what it reveals about the percentage of the genome devoted to intergenic DNA and the percentage devoted to genes. The authors claim that there are 60,250 intergenic regions, which means that there must be more than 60,000 genes.1 The median length of these intergenic regions is 3,949 bp and that means that roughly 204.5 x 10^6 bp are found in intergenic DNA. That's roughly 7% of the genome depending on which genome size you use. It doesn't mean that all the rest is genes but it sounds like they're saying that about 90% of the genome is occupied by genes.
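The arithmetic behind that estimate can be sketched in a few lines of Python. This is only a rough check under stated assumptions: it multiplies the number of intergenic regions by their median length, which stands in for the mean and is therefore an approximation, and the 3.1 Gb genome size is an assumed round value, not a figure from the paper. The direct product of the two quoted numbers comes to about 238 million bp, somewhat higher than the figure quoted above but in the same range, and the share of the genome lands between 7% and 8%.

```python
# Back-of-the-envelope check of the intergenic DNA estimate.
# The region count and median length are the figures quoted from Djebali et al. (2012).
# GENOME_SIZE_BP is an assumed round value for illustration, not a number from the paper.
INTERGENIC_REGIONS = 60_250
MEDIAN_LENGTH_BP = 3_949
GENOME_SIZE_BP = 3_100_000_000  # assumption: roughly 3.1 Gb human genome


def intergenic_estimate(n_regions, typical_length_bp, genome_size_bp):
    """Return (total intergenic bp, fraction of genome) for a count-times-length estimate."""
    total_bp = n_regions * typical_length_bp
    return total_bp, total_bp / genome_size_bp


total_bp, fraction = intergenic_estimate(
    INTERGENIC_REGIONS, MEDIAN_LENGTH_BP, GENOME_SIZE_BP
)
print(f"intergenic DNA: about {total_bp / 1e6:.0f} million bp")
print(f"share of genome: about {fraction:.1%}")
```

Changing `GENOME_SIZE_BP` shows how little the conclusion depends on which genome size you pick: the intergenic share stays well under 10% for any reasonable value.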

In case you doubt that's what they're saying, read the rest of the paragraph in the paper.

Concordantly, we observed an increased overlap of genic regions. As the determination of genic regions is currently defined by the cumulative lengths of the isoforms and their genetic association to phenotypic characteristics, the likely continued reduction in the lengths of intergenic regions will steadily lead to the overlap of most genes previously assumed to be distinct genetic loci. This supports and is consistent with earlier observations of a highly interleaved transcribed genome, but more importantly, prompts the reconsideration of the definition of a gene.

It sounds like they are anticipating a time when the discovery of more noncoding genes will eventually lead to a situation where the intergenic regions disappear and all genes will overlap.

Now, as most of you know, the ENCODE papers have been discredited and hardly any knowledgeable scientist thinks there are 60,000 genes that occupy 90% of the genome. But here's the problem. I probably couldn't delete that sentence from Wikipedia because it meets all the criteria of a reliable source (published in Nature by scientists from reputable universities). Recent experience tells me that the Wikipolice Wikipedia editors would have blocked me from deleting it.

The best I could do would be to balance the claim with one from another "reliable source" such as Piovesan et al. (2019) who list the total number of exons and introns and their average sizes allowing you to calculate that protein-coding genes occupy about 35% of the genome. Other papers give slightly higher values for protein-coding genes.

It's hard to get a reliable source on the real number of noncoding genes and their average size but I estimate that there are about 5,000 of them and, on a generous estimate, they could take up a few percent of the genome. I assume in my upcoming book that genes probably occupy about 45% of the genome because I'm trying to err on the side of function.

An article on Intergenic regions is not really the place to get into a discussion about the number of noncoding genes but in the absence of such a well-sourced explanation the audience will be left with the statement from Djebali et al. and that's extremely misleading. Thus, my preference would be to replace it with a link to some other article where the controversy can be explained, preferably a new article on junk DNA.2

I was going to say,

The total amount of intergenic DNA depends on the size of the genome, the number of genes, and the length of each gene. That can vary widely from species to species. The value for the human genome is controversial because there is no widespread agreement on the number of genes but it's almost certain that intergenic DNA takes up at least 40% of the genome.

I can't supply a specific reference for this statement so it would never have gotten past the Wikipolice Wikipedia editors. This is a problem that can't be solved because any serious attempt to fix it will probably lead to getting blocked on Wikipedia.

There is one other statement in that section in the article on Intergenic region.

Scientists have now artificially synthesized proteins from intergenic regions.[3]

I would have removed that statement because it's irrelevant. It does not contribute to understanding intergenic regions. It's undoubtedly one of those little factoids that someone has stumbled across and thinks it needs to be on Wikipedia.

Deletion of a statement like that would have met with fierce resistance from the Wikipedia editors because it is properly sourced. The reference is to a 2009 paper in the Journal of Biological Engineering: "Synthesizing non-natural parts from natural genomic template."


1. There are no intergenic regions between the last genes on the end of a chromosome and the telomeres.

2. The Wikipedia editors deleted the Junk DNA article about ten years ago on the grounds that junk DNA had been disproven.

Djebali, S., Davis, C. A., Merkel, A., Dobin, A., Lassmann, T., Mortazavi, A. et al. (2012) Landscape of transcription in human cells. Nature 489:101-108. [doi: 10.1038/nature11233]

Piovesan, A., Antonaros, F., Vitale, L., Strippoli, P., Pelleri, M. C., and Caracausi, M. (2019) Human protein-coding genes and gene feature statistics in 2019. BMC Research Notes 12:315. [doi: 10.1186/s13104-019-4343-8]

Blocked by Wikipedia!

My account on Wikipedia has been blocked by some editor named Bbb23 after receiving a complaint from another editor named Praxidicae. Praxidicae has been blocking my attempts to edit articles on Intergenic region, Allele, and Non-coding DNA on the grounds that I am not obeying the Wikipedia rules. She has no expertise in science but she claims to be an expert on proper sources.

I have been removing unsourced statements and correcting incorrect ones. I have also attempted to make the articles more relevant by removing extraneous material. I have added material that reflects the scientific consensus on these topics.

Here's the complaint against me (Genome42) as stated by Praxidicae.

Persistent edit warring and refusal to provide sources, this user refuses to acknowledge that we require sources, not just an assessment by a self proclaimed SME. Discussions across multiple pages with said user have failed, including here where there has been a slow burning edit war, as well as personal attacks against other editors (which you can see in the discussions and his own talk page.) Instead of providing sources, he is just removing them because they are "outdated", though TNT has provided more up to date sources, which they've now removed as well. They've also expressed a desire to get other editors including myself to purposely engage in edit warring to get other editors blocked.

The complaint was posted this morning. Bbb23 apparently believed every word of this complaint and blocked me indefinitely 38 minutes later because ...

Disruptive editing, including edit-warring, refusal to collaborate with other editors, claiming that scientific articles can only be edited by experts, e.g., the user

The immediate cause of being blocked was my attempt to re-edit the Intergenic region article after an extensive discussion that you can see on Intergenic region: Talk. If you want to see a good example of the irresponsible behavior of Wikipedia editors that's a good place to look. There's an even better example on Non-coding: Talk where some other scientists have also attempted, unsuccessfully, to convince Praxidicae.

I'm really frustrated by this behavior and I don't know what to do. I could fight the blockage but I think the cult of Wikipedia editors is pretty tight and my chances are slim. What's really interesting is that I can't even comment on my own 'trial' at (User:Genome42 reported by User:Praxidicae) because I've been blocked!

UPDATE: I appealed the block by saying ...

I have been unfairly accused. I have attempted to debate and discuss the reasons for my edits but the Wikipedia editors refuse to discuss the scientific issues and, instead, make false accusations about a lack of sources and unjustified reasons for removing false and misleading statements from the Wikipedia articles. Check out the Talk section on Non-coding DNA for a good example of other scientists trying to convince Praxidicae to back off.

Another Wikipedia administrator reviewed my appeal and declined it saying, "As you see nothing wrong with your edits, there are no grounds to consider lifting the block."

Any advice? Is Wikipedia worth fighting for?

FURTHER UPDATE: I appealed again ...

I'm confused about the process. Is there no way to have a reasonable discussion about this? It seems like the only way to get unblocked is to admit guilt and apologize. Is that correct?

A new editor named Daniel Case responded ...

Declining since this isn't making an argument for being unblocked.

I think the best thing you could do for yourself right is back off and cool down. I do see where you might have had a point, but you insisted on edit warring when you should have been discussing, and your blog isn't a reliable source unless, say, enough other scientists accept it as one. I admit that it seems Praxdicidae was getting a little too dogmatic, but I haven't had the time to look at the whole argument.

This is still frustrating. It was clearly the editor PRAXIDICAE who started and continued the edit war and who refused to engage in a discussion about the scientific merits of my edits. I discussed, she warred. The only acceptable resolution to this war appears to be that I admit to being wrong and PRAXIDICAE is assumed to be correct. That's what cooperation and consensus means to this group of editors/administrators.

Also, I never suggested using my blog as a reliable source in a Wikipedia article although I did mention a blog post in the discussion (Talk) as a more detailed explanation of my scientific reasons for making an edit.

And isn't it strange that the judge in a "trial" admits to not having the time to look at all the evidence before rendering a verdict? I think that what's going on here is that these Wikipedia administrators tend to stick together and defend each other's actions but that's really not in line with what Wikipedia is supposed to be about.


Thursday, August 18, 2022

The trouble with Wikipedia

I used to think that Wikipedia was a pretty good source of information even for scientific subjects. It wasn't perfect, but most of the articles could be fixed.

I was wrong. It took me more than two months to make the article on Non-coding DNA acceptable and my changes met with considerable resistance. Along the way, I learned that the old article on Junk DNA had been deleted ten years ago because the general scientific consensus was that junk DNA doesn't exist. So I started to work on a new "Junk DNA" article only to discover that it was going to be very difficult to get it approved. The powerful cult of experienced Wikipedia editors were clearly going to withhold approval of a new article on that subject.

I tried editing some other articles in order to correct misinformation but I ran into the same kind of resistance [see Allele, Gene, Human genome, Evolution, Alternative splicing, Intron]. Frequently, strange editors pop out of the woodwork to restore (revert) my attempts on the grounds that I was refuting well-sourced information. I even had one editor named tgeorgescu tell me that, "Friend, Wikipedians aren’t interested in what you know. They are interested in what you can cite, i.e. reliable sources."

How can you tell which sources are reliable unless you know something about the subject?

Much of this bad behavior is covered in a Wikipedia article on Why Wikipedia is not so great. Here's the part that concerns me the most.

People revert edits without explaining themselves (Example: an edit on Economics) (a proper explanation usually works better on the talk page than in an edit summary). Then, when somebody reverts, also without an explanation, an edit war often results. There's not enough grounding in Wikiquette to explain that reverts without comments are inconsiderate and almost never justified except for spam and simple vandalism, and even in those cases comments need to be made for tracking purposes.

There's a culture of hostility and conflict rather than of good will and cooperation. Even experienced Wikipedians fail to assume good faith in their collaborators. It seems fighting off perceived intruders and making egotistical reversions are a higher priority than incorporating helpful collaborators into Wikipedia's community. Glaring errors and omissions are completely ignored by veteran Wikiholics (many of whom pose as scientists, for example, but have no verifiable credentials) who have nothing to contribute but egotistical reverts.

In another article on Criticism of Wikipedia the contributors raise a number of issues including the bad behavior of the cult of long-time Wikipedia editors. It also points out that having anonymous editors who refuse to reveal their identity and areas of expertise leads to a lack of accountability.

This sort of behavior is frustrating and it has an effect. Well-meaning scientists are quickly discouraged from fixing articles because of all the hassle they have to go through.

I now see that the problem can't be easily fixed and Wikipedia science articles are not reliable.


Friday, August 12, 2022

The surprising (?) conservation of noncoding DNA

We've known for more than half a century that a lot of noncoding DNA is functional. Why are some people still surprised? It's a puzzlement.

A paper in Trends in Genetics caught my eye as I was looking for something else. The authors review the various functions of noncoding DNA such as regulatory sequences and noncoding genes. There's nothing wrong with that but the context is a bit shocking for a paper that was published in 2021 in a highly respected journal.

Leypold, N.A. and Speicher, M.R. (2021) Evolutionary conservation in noncoding genomic regions. TRENDS in Genetics 37:903-918. [doi: 10.1016/j.tig.2021.06.007]

Humans may share more genomic commonalities with other species than previously thought. According to current estimates, ~5% of the human genome is functionally constrained, which is a much larger fraction than the ~1.5% occupied by annotated protein-coding genes. Hence, ~3.5% of the human genome comprises likely functional conserved noncoding elements (CNEs) preserved among organisms, whose common ancestors existed throughout hundreds of millions of years of evolution. As whole-genome sequencing emerges as a standard procedure in genetic analyses, interpretation of variations in CNEs, including the elucidation of mechanistic and functional roles, becomes a necessity. Here, we discuss the phenomenon of noncoding conservation via four dimensions (sequence, regulatory conservation, spatiotemporal expression, and structure) and the potential significance of CNEs in phenotype variation and disease.

Thursday, August 04, 2022

Identifying functional DNA (and junk) by purifying selection

Functional DNA is best defined as DNA that is currently under purifying selection. In other words, it can't be deleted without affecting the fitness of the individual. This is the "maintenance function" definition and it differs from the "causal role" and "selected effect" definitions [The Function Wars Part IX: Stefan Linquist on Causal Role vs Selected Effect].

It has always been difficult to determine whether a given sequence is under purifying selection so sequence conservation is often used as a proxy. This is perfectly justifiable since the two criteria are strongly correlated. As a general rule, sequences that are currently being maintained by selection are ancient enough to show evidence of conservation. The only exceptions are de novo sequences and sequences that have recently become expendable and these are rare.