Comments on Sandwalk: ENCODE workshop discusses function in 2015

Thanks Larry. My experience is with much smaller g...

2017-02-21T11:19:04.919-05:00

Thanks Larry. My experience is with much smaller genomes and I did not appreciate that TF levels scale with DNA content (which should have been obvious to me). One of my favorite TFs exists at ~100-200 copies per cell, there are >10,000 consensus binding sites in the genome, and ~1000 are bound as assessed by ChIP-seq. However, there are ~200 sites with much higher ChIP-seq signal, which are interpreted to be bona fide regulatory interactions (GO enrichment supports this notion). The other 80% of sites have detectable binding across a population of cells (and would be called biochemically functional by certain scientists, I assume), but the bound genes are not really functionally related. I think those are spurious and non-functional binding sites.

My point (I guess) is to partially double down on the notion that across a large enough population of cells we can detect low levels of spurious TF binding at more sites than the number of TF protein molecules in a given cell, and that binding strength likely differentiates truly functional sites from noise.
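The population argument above can be illustrated with a toy calculation. This is only a sketch under stated assumptions: the copy number and site counts are taken loosely from the comment, while the affinity weights and the number of pooled cells are made-up values, and the model ignores the rest of the genome and any cooperative effects.

```python
# Toy occupancy model. All numbers below are illustrative assumptions.
COPIES_PER_CELL = 150   # TF molecules per cell (the comment gives ~100-200)
STRONG_SITES = 200      # sites with high ChIP-seq signal
WEAK_SITES = 800        # remaining bound sites (the "other 80%")
STRONG_WEIGHT = 50.0    # assumed relative affinity of a strong site
WEAK_WEIGHT = 1.0       # assumed relative affinity of a weak site
CELLS = 1_000_000       # assumed number of cells pooled in the experiment


def expected_occupancy(weight: float, total_weight: float, copies: int) -> float:
    """Expected fraction of cells in which one site is bound.

    Molecules are shared among sites in proportion to affinity weight;
    the value is capped at 1 because a site cannot be more than fully bound.
    """
    return min(1.0, copies * weight / total_weight)


total_weight = STRONG_SITES * STRONG_WEIGHT + WEAK_SITES * WEAK_WEIGHT
strong_occ = expected_occupancy(STRONG_WEIGHT, total_weight, COPIES_PER_CELL)
weak_occ = expected_occupancy(WEAK_WEIGHT, total_weight, COPIES_PER_CELL)

# Expected number of cells in the pool that contribute signal at one weak site.
cells_per_weak_site = weak_occ * CELLS

print(f"strong-site occupancy: {strong_occ:.3f}")
print(f"weak-site occupancy:   {weak_occ:.4f}")
print(f"cells bound at one weak site: {cells_per_weak_site:.0f}")
print(f"sites with signal: {STRONG_SITES + WEAK_SITES} "
      f"vs {COPIES_PER_CELL} molecules per cell")
```

Under these assumptions each weak site is bound in only a small fraction of cells, yet in a large pool that is still thousands of cells per site, so the number of sites with detectable signal can exceed the number of molecules in any single cell.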

If you are interested in comprehensively understan...

2017-02-21T10:55:46.308-05:00

If you are interested in comprehensively understanding the relative contribution of coding vs. non-coding variation to organismal phenotypes, these studies are better performed in genetically tractable models where you can actually test whether the associated polymorphisms are causal (via allele swaps for instance).

When you say the DNA is still junk in the case whe...

2017-02-21T10:27:31.787-05:00

When you say the DNA is still junk in the case where a disease causing mutation creates a gain of function mutation, it may be junk from the perspective of an evolutionary biologist but it certainly isn't junk from the perspective of medical science. I'm not even sure why it's junk from the evolutionary perspective. If mutation at that site is maladaptive, isn't it reasonable to assume that it might have been selected against?

More than 90% of the genetic variation seen in hum...

2017-02-21T09:56:32.271-05:00

More than 90% of the genetic variation seen in humans is in junk DNA. That's because there's no negative selection to remove deleterious mutations. The variable sites are neutral.

Groups of SNPs (single nucleotide polymorphisms) tend to segregate together if they are close together. That's because recombination between them is rare. Thus, there are various haplotypes that are characteristic of certain populations. It's how DNA testing companies are able to tell you where your ancestors came from.

Most diseases with a genetic basis occur infrequently in a certain haplotype context. As the disease locus spreads throughout the population it usually remains associated with the original haplotype.

Scientists look for associations between the occurrence of disease and certain haplotypes. That tells you which part of the chromosome is contributing to the disease phenotype. It's a mistake to assume that the SNPs defining the haplotype are necessarily the ones causing the genetic defect.

The actual disease-causing mutation may not have been detected. It may be just one of several that segregate together, in which case more work has to be done to find out which allele is causing the phenotype. Finally, the association may be spurious and not reproducible—that's a serious problem when looking at large amounts of data because there will always be something at the tail end of a normal distribution.
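The point about spurious associations in large data sets can be made concrete with a short calculation. This is a minimal sketch, not a description of any particular study: the number of tested SNPs is an assumed round figure, and the Bonferroni correction is used here only as the simplest standard adjustment.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of tests passing the threshold when no real effect exists."""
    return n_tests * alpha


def bonferroni_threshold(n_tests: int, alpha: float) -> float:
    """Per-test significance threshold that keeps the family-wise error rate near alpha."""
    return alpha / n_tests


N_SNPS = 1_000_000  # assumed number of independent SNPs tested
ALPHA = 0.05        # conventional single-test threshold

print(expected_false_positives(N_SNPS, ALPHA))  # about 50,000 chance "hits"
print(bonferroni_threshold(N_SNPS, ALPHA))      # about 5e-8 per test
```

With a million tests at the usual 0.05 cutoff, tens of thousands of associations are expected by chance alone, which is why a much stricter per-test threshold and independent replication are needed.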

The bottom line is that case #3 does not tell you that there's something functional in a junk DNA region because the mutation may have nothing to do with the disease.

The other problem is that the disease-causing mutations could, in fact, occur in junk DNA by creating a gain-of-function mutation such as a new splice site or a new binding site or a new origin of replication etc. The DNA is still junk in this case.

BTW, in order to work effectively these studies MUST look at people within a restricted population. You won't see the disease associations if you take the same number of people from different populations all over the world.

There are several "lots of human genomes"...

2017-02-21T08:21:43.506-05:00

There are several "lots of human genomes" projects around. I was wondering if they could be used to assess the value of ENCODE. For example the NHS in the UK has a project to sequence 100,000 genomes from people with cancer or a rare disease (https://www.genomicsengland.co.uk/). My first thought was: I wish they'd sequence random people from all over the world, not focus on ill people in the UK, because it would make the statistics easier!

But on second thoughts, I think it could be a great data set. It seems to me that for each type of cancer and disease, we will be able to distinguish three cases:
1. there is no genetic association
2. there is an association in a conserved region
3. there is an association in an unconserved region
The ratio of cases 2 and 3 will tell us how valuable it is, from a medical point of view, to study the biochemical activity of unconserved regions.

For the limb specific enhancer in SHH I'd star...

2017-02-21T00:39:54.320-05:00

For the limb specific enhancer in SHH I'd start with a 2014 paper by Laura Lettice in Development titled "Development of five digits is controlled by a bipartite long-range cis-regulator". This paper is a detailed analysis of the limb specific enhancer.

A less detailed analysis of some of the other SHH enhancers is found in: Development 136, 1665-1674 (2009) doi:10.1242/dev.032714.

Sox2, one of the so-called Yamanaka factors (required for embryonic stem cell pluripotency), has a super-enhancer consisting of >10kb of non-coding sequence identified by protein binding and functionally confirmed by deletion in:
CRISPR Reveals a Distal Super-Enhancer Required for Sox2 Expression in Mouse Embryonic Stem Cells http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0114485

Neither of these papers specifically shows that 40 or more transcription factor binding sites have been demonstrated but they do show that kilobases of DNA are required to regulate the respective genes.

@A Salty Scientist According to the White et al. ...

2017-02-20T13:58:13.226-05:00

@A Salty Scientist

According to the White et al. paper, Crx binds the consensus sequence CTAATCCC. There are 6.6 million of these sequences in the mouse genome. About 14,000 of these sites are occupied by Crx in vivo.

Only a small percentage of those sites are likely to be involved in biologically meaningful regulation.

Based on what we know about DNA binding proteins, it was no surprise to discover that they are present at tens of thousands of copies per cell. As is the case in bacteria and 'phage, you need to have enough copies to overcome the spurious binding that's characteristic of large genomes.

The data and the theory have been around for more than 40 years.

Yamamoto, K.R., and Alberts, B. (1975). The interaction of estradiol-receptor protein with the genome: an argument for the existence of undetected specific sites. Cell 4:301-310. doi: 10.1016/0092-8674(75)90150-6

As an alternative to these models, we propose that a relatively small number of high affinity receptor binding sites exist, in addition to a much larger number of low affinity sites. In this view, the interaction of receptors [transcription factors - LAM] with the high affinity sites leads to the biological response, while binding to low affinity sites is without effect. ... If the high affinity sites represent a small fraction of the total binding sites, all of the empirically observable binding (both in vitro and in vivo) would appear to be nonspecific and nonsaturable.

We propose that this situation arises because, from a physical-chemical point of view, any protein which recognizes a specific DNA (or DNA-protein) region will also bind to general DNA structures with some reasonable (albeit reduced) affinity.

The PNAS paper is very interesting, and I think ni...

2017-02-20T12:51:05.734-05:00

The PNAS paper is very interesting, and I think nicely highlights the difficulty in predicting whether a given motif will be functional or not. Ultimately, I think we need biochemistry to get a handle on the number of truly functional cis elements for any given transcription factor. What is the binding affinity for each potential site? And also, how many protein copies of the transcription factor exist within the cell? It defies logic that there would be 100,000 functional DNA-binding sites for a transcription factor present at only 50 copies per cell.
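The copy-number argument in the last sentence amounts to a simple upper bound, sketched below. It assumes one molecule occupies at most one site at a time; the function name is made up for illustration, and the two numbers are the ones quoted in the comment.

```python
def max_fraction_occupied(copies_per_cell: int, n_sites: int) -> float:
    """Upper bound on the fraction of sites bound at one instant in a single cell,
    assuming each molecule can occupy at most one site at a time."""
    return min(1.0, copies_per_cell / n_sites)


# Figures quoted in the comment: 50 copies per cell, 100,000 candidate sites.
fraction = max_fraction_occupied(50, 100_000)
print(f"At most {fraction:.2%} of sites can be occupied at once")
```

Even with every molecule bound to a different site, at most one site in two thousand could be occupied in a given cell at a given moment.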

Thanks for the PNAS paper Rumraket. That was very...

2017-02-20T09:40:59.830-05:00

Thanks for the PNAS paper Rumraket. That was very informative.

"Regarding DNA binding proteins, does this me...

2017-02-20T02:25:18.683-05:00

"Regarding DNA binding proteins, does this mean DNA binding proteins have at least some random affinity for a random section of DNA? That is, when we throw in the formaldehyde or some other agent to "freeze" the state of what is bound to the DNA, we'll be getting a random binding that really has no utility by the cell?"

There isn't a yes/no answer to that question. Of course you're going to get a much higher association of DNA binding proteins with areas they are "supposed" to bind to, but yes, there will be some noise there too. Someone actually did that experiment: they assembled a lot of nonsense DNA (DNA deliberately made to be nonfunctional, with random sequence), then tested biological regulatory proteins on this random nonsense DNA. What they found is that they get binding profiles that look like they were testing genomic DNA, with areas of greater and lesser binding affinity.

Random DNA Sequence Mimics #ENCODE !!
"The claim that ENCODE results disprove junk DNA is wrong because, as I argued back in the fall, something crucial is missing: a null hypothesis. Without a null hypothesis, how do you know whether to be surprised that ENCODE found biochemical activities over most of the genome? What do you really expect non-functional DNA to look like?

In our paper in this week's PNAS, we take a stab at answering this question with one of the largest sets of randomly generated DNA sequences ever included in an experimental test of function. We tested 1,300 randomly generated DNAs (more than 100 kb total) for regulatory activity. It turns out that most of those random DNA sequences are active. Conclusion: distinguishing function from non-function is very difficult.

Mike showed the most unexpected thing. Random DNA not only binds to transcription factors, but also -

It turns out that most of the 1,300 random DNA sequences cause reproducible regulatory effects on the reporter gene. You can see this in these results from 620 random DNA sequences below, in what I call a Tie Fighter plot:
(check his blog for the plot)"

Joe, Blah, blah, blah... Let's face it; What ...

2017-02-19T22:30:13.007-05:00

Joe,
Blah, blah, blah...

Let's face it; What are you going to do if most of the human genome is proven to be functional?

I will answer that for you:

You are not going to throw away your life's work that is the opposite to the findings for a couple of "morons" who have the evidence to prove you wrong...

A good discussion. In computational molecular bio...

2017-02-19T21:53:03.980-05:00

A good discussion. In computational molecular biology courses here, we try to educate students not only in algorithmics and computation, but also in statistics and evolutionary biology. But most genomicists training even in our department (Genome Sciences) don't go through those courses, only those interested in CMB or "bioinformatics". And elsewhere, even the "bioinformatics" courses concentrate on teaching Python and some algorithmics. Bioinformatics textbooks show the same biases, with much discussion of BLAST, sequence assembly, and alignment. But typically the phylogenies material is towards the back of the book and only 10-15 pages long. And they may have no population genetics material at all.

It may take some sort of humiliation of molecular biologists and genomicists, such as people seeing them waste a lot of money and then fail to find any function in much of the genome. Then they might start asking how they went wrong.

But the problem is that the people coming into bio...

2017-02-19T21:46:32.968-05:00

But the problem is that the people coming into biology from a math-heavy background are not reading Kimura. They might be much better positioned to understand his work, but they're just not reading it.

They're coming into the field of genomics, not the field of evolutionary theory.

Can anybody interpret Georgi's and obviously many others' anxiety? Because that's what ENCODE did to most if not all evolutionists.

I'm really glad that Georgi told us "where the bodies are buried" because I suspected there were some...

Here it is:

"They're (ENCODE scientists who are not evolutionists) coming into the field of genomics, not the field of evolutionary theory."
Interpretation: If scientists interpret data that is not aligned with current and accepted evolutionary theory (whichever that is now, who knows?) they should do what Georgi? Disregard it? For the sake of what? Your ambitions or the better good of Darwinian bullies who don't like what ENCODE has found and possibly will find in the future?

Dr. Moran writes: "A proper education woul...

2017-02-19T19:26:24.881-05:00

Dr. Moran writes: "A proper education would have taught them basic population genetics and basic properties of DNA binding proteins (among other things)."

Regarding DNA binding proteins, does this mean DNA binding proteins have at least some random affinity for a random section of DNA? That is, when we throw in the formaldehyde or some other agent to "freeze" the state of what is bound to the DNA, we'll be getting a random binding that really has no utility by the cell?

Thanks in advance.

Btw, this is a pretty good discussion so far. Thanks for hosting it.

I was part of a separate 3-day ENCODE meeting in 2015 open somewhat to the public. A molecular biologist there was lamenting that ENCODE didn't provide more facility for tracking "function" of repetitive elements. Over lunch he told me he had been studying a particular gene for over 20 years, and his lab work gave him good evidence the repetitive elements were of regulatory significance. He was in the middle of applying for a grant to investigate it. He struck me as sincere and seemed typical of a lot of the lab researchers there. I think the sentiment you lament is going to be hard to undo, whatever its causes.

I fully agree with your points about math and biol...

2017-02-19T19:15:27.464-05:00

I fully agree with your points about math and biology.

But the problem is that the people coming into biology from a math-heavy background are not reading Kimura. They might be much better positioned to understand his work, but they're just not reading it.

They're coming into the field of genomics, not the field of evolutionary theory. So their job is to build computational tools for assembling genomes, processing various types of sequencing data, building statistical models for medically relevant genetics, etc. And they get influenced by whatever hype is being pushed at the moment. Also keep in mind that a lot of this work is being done in medical schools and other biomedically oriented institutions that often don't even have evolutionary biologists in the ranks of their faculty.

Obviously this doesn't apply to every single such individual, very far from it, but on average, there probably is such a trend.

I think that somebody with a background from physi...

2017-02-19T19:09:31.048-05:00

I think that somebody with a background from physics or statistics might be better prepared for this subject than biologists. When I got to college I enrolled in physics and switched to paleontology after a while. And when I got to the second half of my studies and had to pick a secondary subject I picked mathematics, with a focus on probability theory, mainly because what drove me to paleontology in the first place was reading David Raup's "The Nemesis Affair" as a kid, and I figured that since my goal was to do research by performing statistical analyses on fossil invertebrate data I should try to take in as much of the maths as possible. The main hurdle for a lot of biology students is that population genetics requires maths. When I got around to actually looking at population genetics, I had enough of the mathematical background to actually read it. When I read Kimura's paper on the probability of fixation I already knew what the Kolmogorov equations were and what type of problems you could solve with them, and I knew that you could approximate discrete stochastic processes with a Wiener process and how to get to the scaling needed. I could read the paper and it made sense. Without that prior knowledge it must seem like magic. Statisticians and physicists should have that background: the percentage of in-text citations that reference Fisher, Wright and Haldane is higher in statistics textbooks than in biology textbooks (the history of statistics basically has 3 parts: the early days where mathematicians would do a bit of statistics on the side and also developed relevant combinatorics, the modern synthesis, and finally the bit that started with the Kolmogorov axioms where things were tidied up mathematically). And physicists can't really escape statistical mechanics either.
I think we put too little mathematics into biology education. I had to seek it out (my choice of mathematics as a secondary was unprecedented - no one had ever chosen it in the 60 years or so in which you had to pick a secondary) and with recent changes in the programs it wouldn't even be possible to go a route similar to mine in an organized fashion - a current student would not be able to enroll in the courses I did, much less receive credit.
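For readers who have not seen it, the result from Kimura's fixation-probability work mentioned above can be written down in a few lines. This is a minimal sketch of the standard diffusion approximation for genic selection, u(p) = (1 - exp(-4*Ne*s*p)) / (1 - exp(-4*Ne*s)); the function name and example parameters are assumptions chosen for illustration, and the effective population size is taken equal to the census size.

```python
import math


def fixation_probability(s: float, n_e: float, p: float) -> float:
    """Diffusion approximation to the probability that an allele at
    frequency p eventually fixes, given selection coefficient s and
    effective population size n_e. Reduces to p in the neutral limit."""
    x = 4.0 * n_e * s
    if abs(x) < 1e-12:
        return p  # neutral case: fixation probability equals initial frequency
    # expm1 keeps precision when the exponent is close to zero.
    # Note: very strongly deleterious cases (x below about -700) would overflow.
    return math.expm1(-x * p) / math.expm1(-x)


N = 1000                 # assumed diploid population size, with Ne = N
p0 = 1.0 / (2 * N)       # a single new mutation

neutral = fixation_probability(0.0, N, p0)
beneficial = fixation_probability(0.01, N, p0)
deleterious = fixation_probability(-0.01, N, p0)

print(neutral, beneficial, deleterious)
```

For a new beneficial mutation in a large population the value approaches the familiar 2s, while a neutral mutation fixes with probability 1/(2N) and a deleterious one almost never does.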

Many people in genomics, including a number of lea...

2017-02-19T16:17:23.208-05:00

Many people in genomics, including a number of leading figures, aren't even molecular biologists -- they come from physics, statistics, engineering or computer science backgrounds, as the field needs serious computational expertise for method development, and method development is one thing it heavily revolves around in general.

They mostly haven't even passed through those courses. I don't know how exactly they got their understanding of the subject, and it probably varies a lot from person to person, but I would venture a wild guess that given that they have mostly learned it on the fly while doing research, the kind of hype you get from the likes of Nature and Science has had a major influence on many

Many of those molecular biologists graduated from ...

2017-02-19T16:05:58.879-05:00

Many of those molecular biologists graduated from university in the 1990s or even later. They did not receive a proper education in evolution or in biochemistry. A proper education would have taught them basic population genetics and basic properties of DNA binding proteins (among other things).

Their teachers were scientists of our generation. Where did we go wrong?

My colleagues who teach undergraduates are perpetuating some of the same mistakes my generation made. This is hardly surprising. We are graduating another generation of students who don't understand the fundamental concepts of our disciplines.

How can we fix this?

We've been discussing genomes and junk DNA in my class on molecular evolution. The students are about to graduate in just a few months but it's the first time they've been told there's even a controversy. There's something seriously wrong here. Isn't critical thinking supposed to be our goal?

Why do you think anyone is taking offense? That do...

2017-02-19T09:53:30.758-05:00

Why do you think anyone is taking offense? That doesn't even make sense.

I'm just here to help you avoid making leaps the data does not bear out. "It's there for a reason" doesn't get you to where you so desperately want to go.

Of the 400 members of the original ENCODE consorti...

2017-02-19T08:47:05.624-05:00

Of the 400 members of the original ENCODE consortium, several were in my own department. One, John Stamatoyannopoulos, is convinced that there is little or no junk DNA. Several others were less convinced. One told me that from now on he was going to make sure to point out that there really was junk DNA when he presents his ENCODE work. Another, Max Libbrecht, was so clear and public on this that his dissent from the ENCODE announcement was featured in Ryan Gregory's Genomicron blog (here).

But an alarming number of molecular biologists do think that almost all of the genome is not junk DNA. They have no answer for Graur's points -- they are not molecular evolutionists and do not understand the mutation load objection, the genome size variation objection, the lack-of-conservation objection, or the transposable element objection. They just assume that the genome is a finely-tuned machine, all of whose features are "there for a reason". I am embarrassed for them (they lack the requisite embarrassment).

One of the underlying motives for their view is probably this: think of all the grants we can apply for to work out what these parts of the genome do!

But what do I know? I'm only a human, and therefore far inferior to our more highly-evolved relatives, the onion and the lungfish.

Mikkel, “It's there for a reason. Which is ju...

2017-02-19T03:15:38.273-05:00

Mikkel,

“It's there for a reason. Which is just another way of saying that something caused it to be there. Which would be true to say of the actual trash in a landfill.”

To square up your analogy, you’d need to notice a lot of amusement parks in the landfill interacting with the garbage. But how come the allergy to “something caused it”? Why is that offensive?

I always enjoy reading what you have to say. The guard is always on duty, and I reckon that’s a good thing. Nobody wants to have anyone slip them a Mickey. But, how do you know that someone already didn’t?

It's there for a reason. Which is just another...

2017-02-19T02:18:07.290-05:00

It's there for a reason. Which is just another way of saying that something caused it to be there. Which would be true to say of the actual trash in a landfill.

Larry writes: "The overwhelming impression yo...

2017-02-18T22:26:46.015-05:00

Larry writes:
"The overwhelming impression you get from looking at the presentation is that all the researchers believe all their data is real and reflects biological function in some way or another. "

Right on! You called the attitude the way it really is. For once we agree.

I asked an NIH researcher last year in passing about LINE-1s. He said he didn't know, but "it's there for a reason." That's just the natural sentiment you'll get from these guys. Where it originates, we can only speculate, but it's not like that attitude is rare; it's pretty commonplace at the NIH and probably for most medical researchers.

Sal Cordova responds to a comment I made. "T...

2017-02-18T13:45:56.032-05:00

Sal Cordova responds to a comment I made.

"The planning stage was all about collecting more and more data."

Is there a problem with that?

The PI begins the meeting, "I'd like to call this group meeting to order."

"As you know," she says, "we are interested in how much of the genome is functional. Most of you have collected a huge amount of data but we still don't know the answer to the question. Right now we can't tell whether the features you have identified have anything to do with biological function or whether they are just noise."

Looking around the room, the PI asks, "What should we do?"

"Let's just collect more data," says one of the post-docs.

"Excellent idea!" exclaims the PI with a relieved look on her face. "I'll write the grant."

Larry writes: "I saw nothing to suggest the w...

2017-02-18T11:11:10.384-05:00

Larry writes:
"I saw nothing to suggest the workshop participants cared one hoot about either of these debates."

Agree, and I said as much in the thread that got all this started:

"I think they don't really care how much of the genome is functional."

http://sandwalk.blogspot.com/2017/02/what-did-encode-researchers-say-on.html?showComment=1486846378044#c1661586536983818740

"The planning stage was all about collecting more and more data."

Is there a problem with that? Even supposing these regions are not causally functional, they could be still diagnostic (symptomatic), and that is pretty important too. So the data collection will continue to be funded. The medical community wants the data, plain and simple.

What drives ENCODE data collection isn't "the genome is 80% functional", it's data collection. Even supposing the genome is only 10% functional, unless we know exactly in advance which is and which is not functional, we have to keep doing data collection. Look at the few lncRNAs we found to be functional (like XIST). Unless we looked and did data collection, we probably wouldn't know it had function. We still have to do this even if LOLAT lncRNAs aren't functional.

"It's enough for 40 regulatory sites for every known gene in our genome."

Because histones are potential regulatory targets and there is 1 histone set for about every 200 bp of DNA, that is about 825 regulatory regions of DNA per protein coding gene.
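The arithmetic behind the 825 figure appears to be the following. This is a sketch under assumptions: the genome size and gene count are not given in the comment and are chosen here because they are common round values that reproduce the stated number; only the 200 bp spacing comes from the comment itself.

```python
GENOME_BP = 3_300_000_000      # assumed haploid human genome size in base pairs
BP_PER_HISTONE_SET = 200       # spacing quoted in the comment
PROTEIN_CODING_GENES = 20_000  # assumed number of protein-coding genes

histone_sets = GENOME_BP // BP_PER_HISTONE_SET
per_gene = histone_sets / PROTEIN_CODING_GENES

print(f"{histone_sets:,} histone sets, {per_gene:.0f} per protein-coding gene")
```

The result is a count of histone sets per gene; whether each one acts as a regulatory region is the claim under discussion, not something the calculation shows.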

X-inactivation is a good example of putting regulatory marks on a large fraction of the female inactive X-chromosome for dosage compensation. And we know it is highly targeted because X-inactivation leaves some genes alone (like those genes with Y chromosome homologs).

70% of the genome's histones have H3K27me3 PRC2 markings on them. Many are on repetitive regions. That has regulatory significance. It looks more and more like repetitive regions have a lot of regulatory robustness.

Why put multiple regulatory marks on every histone on each silenced gene on the X-chromosome, when in principle only one histone might be needed to be marked (like at the promoter region)?

One reason is the intronic regions of one gene are used to regulate and transcribe other genes (possibly even on other chromosomes). How the heck can this be coordinated in cell-type specific manners unless some sort of chromatin modification is taking place on these non-coding regions, like histone modifications, or RNA binding to DNA to work with it (like FIRRE, XIST, HOTAIR)?

Ok, so assume the default is non-function. Even if ENCODE adopted that creed, it's not going to change the way they do business because unless we know in advance what is and is not junk, collecting this sort of data is basic research.