
Sunday, February 12, 2017

ENCODE workshop discusses function in 2015

A reader directed me to a 2015 ENCODE workshop with online videos of all the presentations [From Genome Function to Biomedical Insight: ENCODE and Beyond]. The workshop was sponsored by the National Human Genome Research Institute in Bethesda, Md (USA). The purpose of the workshop was ...

  1. Discuss the scientific questions and opportunities for better understanding genome function and applying that knowledge to basic biological questions and disease studies through large-scale genomics studies.
  2. Consider options for future NHGRI projects that would address these questions and opportunities.
The main controversy concerning the human genome is how much of it is junk DNA with no function. Since the purpose of ENCODE is to understand genome function, I expected a lively discussion about how to distinguish between functional elements and spurious nonfunctional elements.

I also expected a debate over the significance of associations between various molecular markers and disease. Are these associations reproducible and relevant? Do the molecular markers have anything to do with the disease?

I looked at most of the videos but I saw nothing to suggest the workshop participants cared one hoot about either of these debates. Perhaps I missed something? If anyone can find such a discussion please alert me.

There was no mention of junk DNA and no mention of the failed publicity hype surrounding publication of the 2012 papers. It was as though that episode never existed. The overwhelming impression you get from looking at the presentations is that all the researchers believe all their data is real and reflects biological function in some way or another.

The planning stage was all about collecting more and more data. Nothing about validating the data they already have. This workshop really needed to invite some of their critics to give presentations. These PIs needed to hear some "alternative truths"!

The closest thing I could find to the thinking of the participants was a slide from a talk by Michael Snyder. I assume it reflects the thinking of ENCODE leaders.

It's true that the number of protein-coding genes hasn't changed very much in the past 50 years or so. If anything, it's gone down a bit so that today we think there are fewer than 20,000 protein-coding genes. ENCODE did very little to change our view of protein-coding genes.

Prior to ENCODE there were dozens and dozens of known genes for functional noncoding RNAs. The number of proven genes in this category has crept up little by little as proven functions are found for some conserved transcripts. Today, it's conceivable there might be as many as 5,000 genes for functional noncoding RNAs. I don't think that's what Michael Snyder meant. I think he meant 100,000 or more genes but I can't be sure. In any case, even the most optimistic estimate, 100,000 genes, would only occupy a few percent of the genome.

The original 2012 ENCODE papers talked about millions of regulatory sequences. What Michael Snyder is saying here is that there are more "potential" regulatory sequences than coding DNA. That would be more than 1.5% of the genome, or 48 million bp. Assuming 48 bp per regulatory site, that's one million regulatory sequences. It's enough for 40 regulatory sites for every known gene in our genome.
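For readers who want to check the arithmetic, here's a minimal sketch of the calculation. The ~3.2 Gb genome size, the 1.5% coding fraction, the 48 bp site size, and a count of ~25,000 genes are the assumptions stated or implied above, not ENCODE's own figures.

```python
# Back-of-the-envelope check of the numbers above (all inputs are assumptions
# stated or implied in the post).

GENOME_BP = 3_200_000_000   # approximate haploid human genome size
CODING_FRACTION = 0.015     # ~1.5% of the genome is protein-coding
SITE_BP = 48                # assumed size of one regulatory site
GENES = 25_000              # protein-coding genes plus proven noncoding-RNA genes

regulatory_bp = GENOME_BP * CODING_FRACTION   # "more regulatory DNA than coding DNA"
n_sites = regulatory_bp / SITE_BP
sites_per_gene = n_sites / GENES

print(f"{regulatory_bp / 1e6:.0f} Mb of regulatory DNA")    # ~48 Mb
print(f"{n_sites / 1e6:.1f} million regulatory sites")      # ~1.0 million
print(f"~{sites_per_gene:.0f} regulatory sites per gene")   # ~40
```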

That doesn't make a lot of sense to me. Does anyone know of a single example of a gene whose expression is regulated by factors binding to DNA at 40 different sites?

The slide leaves out the most important thing about function: namely, how much of the genome is functional. I'd love to know if the view of ENCODE researchers on junk DNA has changed between 2003 and 2015.


41 comments :

Jass said...

Larry,

I'm not going to pretend to be an expert on the issue of jDNA, but I once came across an article claiming that the majority of the human genome is no longer functional because it was needed only during the developmental stage of the organism.

I've read a little about knockout experiments but they are inconclusive in some mammals.

Larry Moran said...

Five Things You Should Know if You Want to Participate in the Junk DNA Debate

Anonymous said...

"Does anyone know of a single example of a gene whose expression is regulated by factors binding to DNA at 40 different sites?"


Factors? Do Polycomb group (PcG) binding sites count? Technically they aren't factors, but they are important and they need places to park.

The number of tandem repeats may indicate the number of bound regulatory factors in a few cases. A good example is the Dystrophin gene. The tandem D4Z4 repeat has a separate Polycomb group on every repeat! In the case of muscular dystrophy, when the tandem repeat number is below 11, it is associated with disease. So there are at least 11 binding sites involved in regulation, probably more since healthy individuals have up to 100 repeats, so with a Polycomb Group binding to each of the repeats, that would easily be 100 binding sites for that gene alone.

http://www.cell.com/cell/abstract/S0092-8674(12)00463-1

One ENCODE meeting (open to the public) that I was a part of had a researcher protesting that tandem repeats weren't actively annotated and tracked by ENCODE! I guess even with all the data that ENCODE gathers, there's still stuff that slips through the cracks.


Beyond that, binding doesn't necessarily have to happen at the same time. With different transcription factories and topologically associated domains for each cell type (213 canonical, maybe thousands with a looser definition), it is clearly possible that 48 sites could be involved. In one cell type a gene may be regulated by sites in one chromatin conformation and one set of chromosomes, and in another cell type regulated by a different chromatin conformation and different chromosomes. That's driving John Rinn's Cat's cradle hypothesis, and he was motivated by his own discovery of the FIRRE lincRNA doing something that would support a Cat's cradle. That also, no doubt, motivated the creation of ENCODE's sister project, the 4D Nucleome.


So a lot of the DNA acts as a parking lot (binding site) for molecular machines servicing a gene or transcription factory. These machines do a lot of histone modification and DNA methylation.


Since many genes are processed in transcription factories, and the factories are different for the 213 canonical cell types (and maybe thousands of cell types depending on one's definition of cell type), it's possible there will be many different bindings depending on cellular context. Robert Tjian (who publishes with the sister project of ENCODE known as 4D Nucleome) mentioned it in his video on gene transcription. He called it combinatorial gene regulation.


"A reader directed me to a 2015 ENCODE workshop "

You mean me? :-)

Marcoli said...

If I follow correctly, those seem to be examples that would answer Larry's question. But as always we are faced with the known reality that the bulk of the putative junk DNA is derived from what we know, by their origin, to be vagabond DNA elements like transposons, viral DNA inserts, and tandem repeats. Any finding of regulatory DNA or a functioning stretch of expressed RNA among this stuff is more likely an instance of secondary recruitment. By quantity, these findings barely put a dent into the large amount of what still seems to be junk DNA that is tolerated in eukaryote genomes. If this view is wrong, then we will have to rethink what we understand about transposons and viruses and slippage of DNA during replication.

I get the impression that the 'ENCODians' are like a bunch of prospectors on the side of an enormous mountain of slag, into which they dig to find an occasional small lump of gold. In this analogy, when the ENCODians find something, the prospectors greatly inflate the significance of their occasional lump of gold by declaring that this shows the whole mountain could be gold. Then comes a gushy press release saying as much: "Prospectors discover that an enormous mountain of slag could be mostly gold!!!"
To paraphrase a well-known saying: the ENCODians have been trying to fool us. But before they did that they took great pains to fool themselves.

Larry Moran said...

Sal Cordova tried to answer my question. I asked whether anyone knew of a gene with at least 40 proven regulatory sites. (Recall that this is supposed to be the AVERAGE for human genes.)

Sal rambled on a bit but the bottom line is that he doesn't know of an example.

Does anyone else?

Larry Moran said...

Sal asked,

You mean me? :-)

Yes, I meant you. Do you give me permission to edit my post by saying it was Sal Cordova who gave me the link?

Anonymous said...

"Sal rambled on a bit but the bottom line is that he doesn't know of an example."

Don't those 100 binding sites for the Polycomb groups on the Dystrophin gene qualify as important for regulation? That would give us up to 100 binding sites for 100 separate Polycomb groups. So why does my example not qualify?

Why wouldn't those 100 sites be considered as participating in regulation? Obviously if enough of those repetitive sites are missing, it results in a disease state.

Or is your objection that it isn't proven?

I'm not trying to be combative on this point, but if a lot of gene regulation involves histone readers, writers, and erasers binding to histones on the gene or wherever, then doesn't this qualify as some sort of relevant regulatory binding site?

I know you used the word "factor", but don't polycomb groups count as some sort of regulatory complex?

Thanks anyway for reading my comment.

Anonymous said...

SHH

Anonymous said...

A million regulatory sequences isn't at all unreasonable based on what is known about enhancers. The relatively few that have been experimentally characterized consist of multiple short protein binding sites arranged in clusters. There seems to be some redundancy in function, in that an enhancer with, say, six binding sites for protein X might retain its function as long as at least one of the six sites was intact.

Moreover, a single binding site isn't going to be 48 bp in length (that's the resolution of the assay used to detect the presence of a binding site). Binding sites themselves are more like 10 bp in length, and some of that sequence is degenerate.
In vitro binding requires maybe 20 bp, but that's in order for the DNA to assume a conformation that approximates its conformation in vivo. Of those 20 bp, only 2-4 at most are going to make base-specific contacts with the protein, with additional contacts being made with the phosphates of the backbone. Because the sites themselves are short and DNA conformation isn't nearly as dependent on sequence as protein conformation is, regulatory sequences aren't nearly as sensitive to mutation (transitions, transversions, insertions, deletions, inversions, etc.) as coding sequence is. A million transcription factor binding sites might occupy 1% or more of the genome, but the amount of literal sequence (as opposed to indifferent sequence) required to make them biologically functional is only going to be a fraction of that.
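To put rough numbers on that last point, here is a minimal sketch comparing how much of the genome a million sites would occupy depending on what you count as the "site". The 48 bp footprint, ~10 bp core, and 2-4 bp of base-specific contacts are the figures from the comment above; the 3.2 Gb genome size is an added assumption.

```python
# Rough illustration of how much sequence a million binding sites "use up"
# depending on how much of each site is actually sequence-constrained.

GENOME_BP = 3_200_000_000   # assumed haploid genome size
N_SITES = 1_000_000

for label, bp_per_site in [("assay footprint (48 bp)", 48),
                           ("core binding site (~10 bp)", 10),
                           ("base-specific contacts (~3 bp)", 3)]:
    fraction = N_SITES * bp_per_site / GENOME_BP
    print(f"{label}: {fraction:.2%} of the genome")

# assay footprint (48 bp): 1.50% of the genome
# core binding site (~10 bp): 0.31% of the genome
# base-specific contacts (~3 bp): 0.09% of the genome
```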

Larry Moran said...

One million regulatory sequences means an average of 40 per gene. I don't know of a single well-studied gene that has 40 regulatory sites, do you? It is not reasonable.

There are about 100 genes for ribosomal proteins. Why would they need 40 regulatory sites? Why would most of the thousands of housekeeping genes need so many regulatory sites?

Anonymous said...

It seems to me that if a gene has 40 regulatory sites, the difference between, say, number 38 having a regulatory function and its having no function at all would be too small to measure.

Anonymous said...

Larry,

An average of 40 per gene is just that, an average. Genes showing complex patterns of expression (spatially and temporally restricted) are likely going to have more regulatory sites and housekeeping genes are likely going to have fewer.

An example of a gene with at least 40 regulatory sites is SHH (sonic hedgehog). It's expressed in a number of different sites during development and has a number of different enhancers (each consisting of multiple transcription factor binding sites and sequences whose importance can be demonstrated via mutation even if a specific binding factor hasn't been identified).

The SHH enhancer I'm most familiar with is ~800-1000 bp in length and is situated ~1 Mb away from the transcriptional start site for the SHH gene. (It's ~1 Mb in humans, more like 800 kb in mice. In both cases it is located within the intron of another gene.) It is strictly required for SHH expression in developing limb buds. In mice, if you delete the region you end up with a mouse in which the gene is expressed normally everywhere except the limb, where expression is abolished. (Expression of the other gene, in whose intron the sequence is located, isn't affected.) Mice with the deletion are born with severely truncated limbs. Within the same region in humans (and mice, chickens, cats and likely other species as well) there are a number of much smaller mutations that cause polydactyly. The gene is normally expressed in a very small region of a developing limb bud. These small mutations (sometimes as small as a single base change) result in the gene being expressed in regions of the limb bud where it isn't normally expressed, and that results in extra fingers and/or toes.

The small mutations have been identified pretty much by chance because extra fingers and/or toes are pretty obvious and they don't affect viability.

Anonymous said...

I should add that suggesting that housekeeping genes are likely to have fewer regulatory sites is really just my bias. You could argue that housekeeping genes, because of their importance, might have more because they'll have more redundancy. But most of the studies that I'm aware of focus on genes with complex patterns of expression, so those are the enhancers that have also been most studied.

Mikkel Rumraket Rasmussen said...

It seems to me most of those PcG binding sites are for ensuring effective silencing of so large a gene, not because they're really involved in some baroquely complex regulatory function of expression levels.

It's a huge gene, so it takes a lot of material to prevent spurious transcription of so much DNA. I could easily imagine noisy transcription of such a large gene interfering with normal cellular processes. As far as I can gather, that is indeed what happens if mutations occur in the binding sites: the result is disease. So the gene needs to be transcribed properly and in its entirety, and you need lots of binding spots for silencers to keep it inactive when it isn't needed.

It becomes a matter of semantics then, because I would agree silencing is a form of regulation. So technically I could agree dystrophin has 100 regulatory binding sites.

Larry Moran said...

Can you point me to examples of those genes with complex patterns of expression where the biological function of 40 or more transcription factor binding sites have been demonstrated?

Please be careful about terminology. An "enhancer" is a site where there's solid evidence of biological function. In the absence of evidence it is not an enhancer. It is a "putative enhancer" or just a binding site.

The scientific literature is full of claims about enhancers where the only "evidence" is the presence of a binding site. That's called begging the question.

Larry Moran said...

An average of 40 per gene is just that, an average.

Exactly. For every ten genes that have only 10 regulatory sites there have to be ten genes with 70 binding sites.

An example of a gene with at least 40 regulatory sites is SHH (sonic hedgehog)

Could you give me some references?

Anonymous said...

Larry writes:
"I saw nothing to suggest the workshop participants cared one hoot about either of these debates.

Agree, and I said as much in the thread that got all this started:

"I think they don't really care how much of the genome is functional."

http://sandwalk.blogspot.com/2017/02/what-did-encode-researchers-say-on.html?showComment=1486846378044#c1661586536983818740

"The planning stage was all about collecting more and more data."

Is there a problem with that? Even supposing these regions are not causally functional, they could still be diagnostic (symptomatic), and that is pretty important too. So the data collection will continue to be funded. The medical community wants the data, plain and simple.

What drives ENCODE data collection isn't "the genome is 80% functional", it's data collection. Even supposing the genome is only 10% functional, unless we know exactly in advance which parts are and are not functional, we have to keep collecting data. Look at the few lncRNAs we found to be functional (like XIST). Unless we had looked and collected the data, we probably wouldn't know it had a function. We still have to do this even if LOLAT lncRNAs aren't functional.

"It's enough for 40 regulatory sites for every known gene in our genome."

Because histones are potential regulatory targets and there is 1 histone set for about every 200 bp of DNA, that is about 825 regulatory regions of DNA per protein coding gene.
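A minimal sketch of that nucleosome arithmetic, assuming a ~3.3 Gb genome and ~20,000 protein-coding genes (the ~200 bp spacing is the figure quoted above):

```python
# Back-of-the-envelope nucleosome count behind the "~825 per gene" figure.
# Genome size and gene count are assumptions added for illustration.

GENOME_BP = 3_300_000_000
BP_PER_NUCLEOSOME = 200        # one histone octamer (plus linker) per ~200 bp
PROTEIN_CODING_GENES = 20_000

nucleosomes = GENOME_BP / BP_PER_NUCLEOSOME
per_gene = nucleosomes / PROTEIN_CODING_GENES
print(f"~{nucleosomes / 1e6:.1f} million nucleosomes, ~{per_gene:.0f} per protein-coding gene")
# ~16.5 million nucleosomes, ~825 per protein-coding gene
```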

X-inactivation is a good example of putting regulatory marks on a large fraction of the female inactive X-chromosome for dosage compensation. And we know it is highly targeted because X-inactivation leaves some genes alone (like those genes with Y chromosome homologs).

70% of the genome's histones carry H3K27me3 (PRC2) marks, many of them on repetitive regions. That has regulatory significance. It looks more and more like repetitive regions have a lot of regulatory robustness.

Why put multiple regulatory marks on every histone on each silenced gene on the X-chromosome, when in principle only one histone might be needed to be marked (like at the promoter region)?

One reason is that the intronic regions of one gene are used to regulate and transcribe other genes (possibly even on other chromosomes). How the heck can this be coordinated in a cell-type-specific manner unless some sort of chromatin modification is taking place on these non-coding regions, like histone modifications, or RNA binding to DNA to work with it (like FIRRE, XIST, HOTAIR)?

Ok, so assume the default is non-function. Even if ENCODE adopted that creed, it's not going to change the way they do business because unless we know in advance what is and is not junk, collecting this sort of data is basic research.

Larry Moran said...

Sal Cordova responds to a comment I made.

"The planning stage was all about collecting more and more data."

Is there a problem with that?

The PI begins the meeting, "I'd like to call this group meeting to order."

"As you know," she says, "we are interested in how much of the genome is functional. Most of you have collected a huge amount of data but we still don't know the answer to the question. Right now we can't tell whether the features you have identified have anything to do with biological function or whether they are just noise."

Looking around the room, the PI asks, "What should we do?"

"Let's just collect more data," says one of the post-docs.

"Excellent idea!" exclaims the PI with a relieved look on her face. I'll write the grant.

Anonymous said...

Larry writes:
"The overwhelming impression you get from looking at the presentation is that all the researchers believe all their data is real and reflects biological function in some way or another. "

Right on! You called the attitude the way it really is. For once we agree.

I asked an NIH researcher last year in passing about LINE-1s. He said he didn't know, but "it's there for a reason." That's just the natural sentiment you'll get from these guys. Where it originates, we can only speculate, but it's not like that attitude is rare; it's pretty commonplace at the NIH and probably for most medical researchers.

Mikkel Rumraket Rasmussen said...

It's there for a reason. Which is just another way of saying that something caused it to be there. Which would be true to say of the actual trash in a landfill.

txpiper said...

Mikkel,

“It's there for a reason. Which is just another way of saying that something caused it to be there. Which would be true to say of the actual trash in a landfill.”

To square up your analogy, you’d need to notice a lot of amusement parks in the landfill interacting with the garbage. But how come the allergy to “something caused it”? Why is that offensive?

I always enjoy reading what you have to say. The guard is always on duty, and I reckon that’s a good thing. Nobody wants to have anyone slip them a Mickey. But, how do you know that someone already didn’t?

Joe Felsenstein said...

Of the 400 members of the original ENCODE consortium, several were in my own department. One, John Stamatoyannopoulos, is convinced that there is little or no junk DNA. Several others were less convinced. One told me that from now on he was going to make sure to point out that there really was junk DNA when he presents his ENCODE work. Another, Max Libbrecht, was so clear and public on this that his dissent from the ENCODE announcement was featured in Ryan Gregory's Genomicron blog (here).

But an alarming number of molecular biologists do think that almost all of the genome is not junk DNA. They have no answer for Graur's points -- they are not molecular evolutionists and do not understand the mutation load objection, the genome size variation objection, the lack-of-conservation objection, or the transposable element objection. They just assume that the genome is a finely-tuned machine, all of whose features are "there for a reason". I am embarrassed for them (they lack the requisite embarrassment).

One of the underlying motives for their view is probably this: think of all the grants we can apply for to work out what these parts of the genome do!

But what do I know? I'm only a human, and therefore far inferior to our more highly-evolved relatives, the onion and the lungfish.

Mikkel Rumraket Rasmussen said...

Why do you think anyone is taking offense? That doesn't even make sense.

I'm just here to help you avoid making leaps the data does not bear out. "It's there for a reason" doesn't get you to where you so desperately want to go.

Larry Moran said...

Many of those molecular biologists graduated from university in the 1990s or even later. They did not receive a proper education in evolution or in biochemistry. A proper education would have taught them basic population genetics and basic properties of DNA binding proteins (among other things).

Their teachers were scientists of our generation. Where did we go wrong?

My colleagues who teach undergraduates are perpetuating some of the same mistakes my generation made. This is hardly surprising. We are graduating another generation of students who don't understand the fundamental concepts of our disciplines.

How can we fix this?

We've been discussing genomes and junk DNA in my class on molecular evolution. The students are about to graduate in just a few months but it's the first time they've been told there's even a controversy. There's something seriously wrong here. Isn't critical thinking supposed to be our goal?

Georgi Marinov said...

Many people in genomics, including a number of leading figures, aren't even molecular biologists -- they come from physics, statistics, engineering or computer science backgrounds, as the field needs serious computational expertise for method development, and method development is one thing it heavily revolves around in general.

They mostly haven't even passed through those courses. I don't know how exactly they got their understanding of the subject, and it probably varies a lot from person to person, but I would venture a wild guess that, given that they have mostly learned it on the fly while doing research, the kind of hype you get from the likes of Nature and Science has had a major influence on many.

Unknown said...

I think that somebody with a background in physics or statistics might be better prepared for this subject than biologists. When I got to college I enrolled for physics and switched to paleontology after a while. When I got to the second half of my studies and had to pick a secondary subject I picked mathematics, with a focus on probability theory, mainly because what drove me to paleontology in the first place was reading David Raup's "The Nemesis Affair" as a kid, and I figured that if my goal was to do research by performing statistical analyses on fossil invertebrate data I should try to take in as much of the maths as possible.

The main hurdle for a lot of biology students is that population genetics requires maths. When I got around to actually looking at population genetics, I had enough of the mathematical background to actually read it. When I read Kimura's paper on the probability of fixation I already knew what the Kolmogorov equations were and what type of problems you could solve with them; I knew that you could approximate discrete stochastic processes with a Wiener process and how to get the scaling needed. I could read the paper and it made sense. Without that prior knowledge it must seem like magic.

Statisticians and physicists should have that background. The percentage of in-text citations that reference Fisher, Wright and Haldane is higher in statistics textbooks than in biology textbooks (the history of statistics basically has three parts: the early days, where mathematicians would do a bit of statistics on the side and also developed the relevant combinatorics; the modern synthesis; and finally the bit that started with the Kolmogorov axioms, where things were tidied up mathematically). And physicists can't really escape statistical mechanics either.

I think we put too little mathematics into biology education. I had to seek it out (my choice of mathematics as a secondary was unprecedented - no one had ever chosen it in the 60 years or so in which you had to pick a secondary), and with recent changes in the programs it wouldn't even be possible to go a route similar to mine in an organized fashion - a current student wouldn't be able to enroll in the courses I did, much less receive credit.

Georgi Marinov said...

I fully agree with your points about math and biology.

But the problem is that the people coming into biology from a math-heavy background are not reading Kimura. They might be much better positioned to understand his work, but they're just not reading it.

They're coming into the field of genomics, not the field of evolutionary theory. So their job is to build computational tools for assembling genomes, processing various types of sequencing data, building statistical models for medically relevant genetics, etc. And they get influenced by whatever hype is being pushed at the moment. Also keep in mind that a lot of this work is being done in medical schools and other biomedically oriented institutions that often don't even have evolutionary biologists in the ranks of their faculty.

Obviously this doesn't apply to every single such individual, very far from it, but on average there probably is such a trend.

Anonymous said...

Dr. Moran writes: "A proper education would have taught them basic population genetics and basic properties of DNA binding proteins (among other things)."

Regarding DNA binding proteins, does this mean DNA binding proteins have at least some random affinity for a random section of DNA? That is, when we throw in the formaldehyde or some other agent to "freeze" the state of what is bound to the DNA, will we be getting random binding that really has no utility to the cell?

Thanks in advance.

Btw, this is a pretty good discussion so far. Thanks for hosting it.

I was part of a separate 3-day ENCODE meeting in 2015 that was somewhat open to the public. A molecular biologist there was lamenting that ENCODE didn't provide more facility for tracking the "function" of repetitive elements. Over lunch he told me he had been studying a particular gene for over 20 years, and his lab work gave him good evidence that the repetitive elements were of regulatory significance. He was in the middle of applying for a grant to investigate it. He struck me as sincere and seemed typical of a lot of the lab researchers there. I think the sentiment you lament is going to be hard to undo, whatever its causes.

Jmac said...

But the problem is that the people coming into biology from a math-heavy background are not reading Kimura. They might be much better positioned to understand his work, but they're just not reading it.

They're coming into the field of genomics, not the field of evolutionary theory.


Can anybody interpret Georgi's and obviously many others' anxiety? Because that's what ENCODE did to most if not all evolutionists.

I'm really glad that Georgi told us "where the bodies are buried" because I suspected there were some...

Here it is:

"They're (ENCODE scientist who are not evolutionists) coming into the field of genomics, not the field of evolutionary theory.
Interpretation: If scientists interpret data that is not aligned with current and accepted evolutionary theory (whichever that is now, who knows?) they should do what Georgi? Disregard it? For the sake of what? Your ambitions or the better good of Darwinian bullies who don't like what ENCODE has found and possibly will find in the future?


Joe Felsenstein said...

A good discussion. In computational molecular biology courses here, we try to educate students not only in algorithmics and computation, but also in statistics and evolutionary biology. But most genomicists training even in our department (Genome Sciences) don't go through those courses, only those interested in CMB or "bioinformatics". And elsewhere, even the "bioinformatics" courses concentrate on teaching Python and some algorithmics. Bioinformatics textbooks show the same biases, with much discussion of BLAST, sequence assembly, and alignment. But typically the phylogenies material is towards the back of the book and only 10-15 pages long. And they may have no population genetics material at all.

It may take some sort of humiliation of molecular biologists and genomicists, such as people seeing them waste a lot of money and then fail to find any function in much of the genomes. Then they might start asking how they went wrong.

Jmac said...

Joe,
Blah, blah, blah...

Let's face it: what are you going to do if most of the human genome is proven to be functional?

I will answer that for you:

You are not going to throw away your life's work, which is the opposite of those findings, for a couple of "morons" who have the evidence to prove you wrong...

Mikkel Rumraket Rasmussen said...

"Regarding DNA binding proteins, does this mean DNA binding proteins have at least some random affinity for a random section of DNA. That is, when we throw in the formaldehyde or some other agent to "freeze" the state of what is bound to the DNA, we'll be getting a random binding that really has no utility by the cell?"

There isn't a yes/no answer to that question. Of course you're going to get a much higher association of DNA binding proteins with areas they are "supposed" to bind to, but yes, there will be some noise there too. Someone actually did that experiment: they assembled a lot of nonsense DNA (DNA deliberately made to be nonfunctional, with random sequence), then tested biological regulatory proteins on this random nonsense DNA. What they found is that they get binding profiles that look like they were testing genomic DNA, with areas of greater and lesser binding affinity.

Random DNA Sequence Mimics #ENCODE !!
"The claim that ENCODE results disprove junk DNA is wrong because, as I argued back in the fall, something crucial is missing: a null hypothesis. Without a null hypothesis, how do you know whether to be surprised that ENCODE found biochemical activities over most of the genome? What do you really expect non- functional DNA to look like?

In our paper in this weeks PNAS, we take a stab at answering this question with one of the largest sets of randomly generated DNA sequences ever included in an experimental test of function. We tested 1,300 randomly generated DNAs (more than 100 kb total) for regulatory activity. It turns out that most of those random DNA sequences are active. Conclusion: distinguishing function from non-function is very difficult.

Mike showed the most unexpected thing. Random DNA not only binds to transcription factor, but also -

It turns out that most of the 1,300 random DNA sequences cause reproducible regulatory effects on the reporter gene. You can see this in these results from 620 random DNA sequences below, in what I call a Tie Fighter plot:
(check his blog for the plot)"

Anonymous said...

Thanks for the PNAS paper Rumraket. That was very informative.

A Salty Scientist said...

The PNAS paper is very interesting, and I think it nicely highlights the difficulty in predicting whether a given motif will be functional or not. Ultimately, I think we need biochemistry to get a handle on the number of truly functional cis elements for any given transcription factor. What is the binding affinity for each potential site? And also, how many protein copies of the transcription factor exist within the cell? It defies logic that there would be 100,000 functional DNA-binding sites for a transcription factor present at only 50 copies per cell.

Larry Moran said...

@A Salty Scientist

According to the White et al. paper, Crx binds the consensus sequence CTAATCCC. There are 6.6 million of these sequences in the mouse genome. About 14,000 of these sites are occupied by Crx in vivo.

Only a small percentage of those sites are likely to be involved in biologically meaningful regulation.
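To put those numbers in proportion, here's a minimal sketch using only the Crx figures quoted above from White et al.; the arithmetic, not the interpretation, is what's added.

```python
# Proportions implied by the Crx numbers quoted above (White et al.).

consensus_matches = 6_600_000   # CTAATCCC matches in the mouse genome
bound_in_vivo = 14_000          # sites occupied by Crx in vivo

print(f"{bound_in_vivo / consensus_matches:.2%} of consensus matches are bound")
# ~0.21% of consensus matches are bound, and only a fraction of those bound
# sites are expected to be involved in biologically meaningful regulation.
```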

Based on what we know about DNA binding proteins, it was no surprise to discover that these proteins are present at tens of thousands of copies per cell. As is the case in bacteria and 'phage, you need to have enough copies to overcome the spurious binding that's characteristic of large genomes.

The data and the theory have been around for more than 40 years.

Yamamoto, K.R., and Alberts, B. (1975). The interaction of estradiol-receptor protein with the genome: an argument for the existence of undetected specific sites. Cell 4:301-310. doi: 10.1016/0092-8674(75)90150-6

As an alternative to these models, we propose that a relatively small number of high affinity receptor binding sites exist, in addition to a much larger number of low affinity sites. In this view, the interaction of receptors [transcription factors - LAM] with the high affinity sites leads to the biological response, while binding to low affinity sites is without effect. ... If the high affinity sites represent a small fraction of the total binding sites, all of the empirically observable binding (both in vitro and in vivo) would appear to be nonspecific and nonsaturable.

We propose that this situation arises because, from a physical-chemical point of view, any protein which recognizes a specific DNA (or DNA-protein) region will also bind to general DNA structures with some reasonable (albeit reduced) affinity.

Anonymous said...

For the limb-specific enhancer in SHH I'd start with a 2014 paper by Laura Lettice in Development titled "Development of five digits is controlled by a bipartite long-range cis-regulator." This paper is a detailed analysis of the limb-specific enhancer.

A less detailed analysis of some of the other SHH enhancers is found in : Development 136, 1665-1674 (2009) doi:10.1242/dev.032714.

Sox2, one of the so-called Yamanaka factors (required for embryonic stem cell pluripotency), has a super-enhancer consisting of >10 kb of non-coding sequence identified by protein binding and functionally confirmed by deletion in:
CRISPR Reveals a Distal Super-Enhancer Required for Sox2 Expression in Mouse Embryonic Stem Cells http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0114485

Neither of these papers specifically demonstrates 40 or more transcription factor binding sites, but they do show that kilobases of DNA are required to regulate the respective genes.

Graham Jones said...

There are several "lots of human genomes" projects around. I was wondering if they could be used to assess the value of ENCODE. For example the NHS in the UK has a project to sequence 100,000 genomes from people with cancer or a rare disease (https://www.genomicsengland.co.uk/). My first thought was: I wish they'd sequence random people from all over the world, not focus on ill people in the UK, because it would make the statistics easier!

But on second thoughts, I think it could be a great data set. It seems to me that for each type of cancer and disease, we will be able to distinguish three cases:
1. there is no genetic association
2. there is an association in a conserved region
3. there is an association in an unconserved region
The ratio of cases 2 and 3 will tell us how valuable it is, from a medical point of view, to study the biochemical activity of unconserved regions.

Larry Moran said...

More than 90% of the genetic variation seen in humans is in junk DNA. That's because mutations there are not deleterious, so there's no negative selection to remove them. The variable sites are neutral.

Groups of SNPs (single nucleotide polymorphisms) tend to segregate together if they are close together. That's because recombination between them is rare. Thus, there are various haplotypes that are characteristic of certain populations. It's how DNA testing companies are able to tell you where your ancestors came from.

Most disease-causing mutations arise initially in a certain haplotype context. As the disease locus spreads throughout the population it usually remains associated with the original haplotype.

Scientists look for associations between the occurrence of disease and certain haplotypes. That tells you which part of the chromosome is contributing to the disease phenotype. It's a mistake to assume that the SNPs defining the haplotype are necessarily the ones causing the genetic defect.

The actual disease-causing mutation may not have been detected. It may be just one of several that segregate together, in which case more work has to be done to find out which allele is causing the phenotype. Finally, the association may be spurious and not reproducible—that's a serious problem when looking at large amounts of data because there will always be something at the tail end of a normal distribution.

The bottom line is that case #3 does not tell you that there's something functional in the junk DNA region, because the mutation may have nothing to do with the disease.

The other problem is that the disease-causing mutations could, in fact, occur in junk DNA by creating a gain-of-function mutation such as a new splice site or a new binding site or a new origin of replication etc. The DNA is still junk in this case.

BTW, in order to work effectively these studies MUST look at people within a restricted population. You won't see the disease associations if you take the same number of people from different populations all over the world.

Anonymous said...

When you say the DNA is still junk in the case where a disease causing mutation creates a gain of function mutation, it may be junk from the perspective of an evolutionary biologist but it certainly isn't junk from the perspective of medical science. I'm not even sure why it's junk from the evolutionary perspective. If mutation at that site is maladaptive, isn't it reasonable to assume that it might have been selected against?

A Salty Scientist said...

If you are interested in comprehensively understanding the relative contribution of coding vs. non-coding variation to organismal phenotypes, these studies are better performed in genetically tractable models where you can actually test whether the associated polymorphisms are causal (via allele swaps for instance).

A Salty Scientist said...

Thanks Larry. My experience is with much smaller genomes and I did not appreciate that TF levels scale with DNA content (which should have been obvious to me). One of my favorite TFs exists at ~100-200 copies per cell; there are >10,000 consensus binding sites in the genome, and ~1,000 are bound as assessed by ChIP-seq. However, there are ~200 sites with much higher ChIP-seq signal, which are interpreted to be bona fide regulatory interactions (GO enrichment supports this notion). The other 80% of sites have detectable binding across a population of cells (and would be called biochemically functional by certain scientists, I assume), but the bound genes are not really functionally related. I think those are spurious and non-functional binding sites.

My point (I guess) is to partially double down on the notion that across a large enough population of cells we can detect low levels of spurious TF binding at more sites than the number of TF protein molecules in a given cell, and that binding strength likely differentiates truly functional sites from noise.