Thursday, July 30, 2015

The next step in genomics

The draft sequence of the human genome was published in 2001. The "finished" version was published a few years later but annotation continues.

A massive amount of data on complex genomes has been published, especially on the human genome. The next step is to decide what this data means. Here are the most important questions from my perspective.
  1. We have a pretty good idea of the number of protein-coding genes (~20,000) but we don't know how many genes specify functional RNAs. Is it 5,000, 50,000, or more? What we do know from big data science (like ENCODE) is that almost all parts of the human genome are transcribed in some cells at some time. What we don't need is more experiments documenting pervasive transcription. It's time to do the hard work and figure out just how many of these transcripts have a biological function.
  2. Big data science has demonstrated that various transcription factors bind all over the genome. This is pretty much what you'd expect for spurious binding given the small size of the binding sites. Now we need to find out which of these binding sites are spurious and nonfunctional and which ones actually play a role in gene expression. That means getting into the lab and looking carefully at specific examples. What we don't need are more big data experiments covering every known transcription factor binding site in every known tissue at various stages of development. We know for a fact that most of this data will be useless in terms of recognizing biological function. It's time to find out how much of the data we already have is going to be useful.
  3. Same for alternative splicing. How much of it represents real biologically functional alternative transcripts and how much is due to splicing accidents? That's the important question. We don't need more data until that issue is resolved.
  4. There are many different markers that identify open and closed chromatin regions in the genome. They include DNA methylation sites and various histone modifications. Lots of these have been mapped in different tissues. What does it mean? There are millions of them. Do they all represent regions of the genome that have a biological function or are most of them just spurious sites? We have enough already. We don't need more data, especially if it's not telling us anything useful.
These are the four questions that I think we need to work on over the next few years.

PloS Biology also wondered about the future of genetics and genomics [Where Next for Genetics and Genomics?].
The last few decades have utterly transformed genetics and genomics, but what might the next ten years bring? PLOS Biology asked eight leaders spanning a range of related areas to give us their predictions. Without exception, the predictions are for more data on a massive scale and of more diverse types. All are optimistic and predict enormous positive impact on scientific understanding, while a recurring theme is the benefit of such data for the transformation and personalization of medicine. Several also point out that the biggest changes will very likely be those that we don’t foresee, even now.
That's a remarkable contrast with my view of what needs to be done. I suspect that these eight leaders are going to be better predictors of the future than I am. I suspect that we're just going to see more of the same for the next few years so that by 2020 we'll be no further ahead than we are now and none of my questions will be answered.

The phenomenon is familiar. When you have a big hammer, everything looks like a nail. Most of the genomics workers are unfamiliar with the nitty gritty of ordinary biochemistry and molecular biology but that's exactly what we need right now.


108 comments :

  1. I don't think any of the responders simply said "let's do more of the same sequencing we've been doing". They all stress the need to interpret the data. But in the 21st century a lot of this analysis will be through computational means. "When you have a big hammer, everything looks like a nail" is certainly true, but this is equally true of traditional biochemists and molecular biologists who think computer science is a foreign culture as it is for genomicists who have graduated from the lab bench.

    Replies
    1. Yes there is "need to interpret the data." But to declare that this is "the next step" does a great disservice to people like Richard Grantham who in the 1970s and 1980s bridged the traditional cultures. Simply stated, they recognized that you do not need an entire dictionary to work out how a dictionary works. Likewise, you do not need an entire genome. As pointed out by Sydney Brenner and others, big "genome projects" made funds scarce for those who, in those early days, set out to "decide what this data means" on day one!

  2. I of course would consider the primary use of genomic data to be further elucidating phylogenetic relationships. And if we're going to do that, we need lots more genomes. There will also be peripheral effects: we'll get more understanding of how genomes evolve and of why they evolve that way. And we might get a few clues connecting genomes to morphology.

    Replies
    1. The question here is "How much of a genome do you need to get adequate phylogenetic data?" I think that approaches that pick multiple genes are more cost effective if all you want is solid phylogenies, with much of the genome contributing little, however:
      Synteny is a big one. We get to a point in assembly where contigs and scaffolds are big enough to look at synteny. And unlike sequence data, which of course has a lot of homoplasy, synteny produces actual apomorphies.

      IMO the most pressing issue is that automatic annotation pipelines are feeding themselves and a key problem is that there is only a relatively small number of genomes where there are even some well founded functional annotations. This results in mediocre BLAST hits getting annotated with a putative function, which then makes it more likely that related organisms will get the same putative annotation, etc. You get cases where a particular gene is actually quite well understood in humans, then it's the best hit for some ORF in Drosophila with say, 60% sequence identity. The automatic pipeline assumes identical function and annotates it. Now any insect sequence that fits has a far better match in Drosophila and because it's now 90% sequence identity the annotation is made with more confidence. And then another one is annotated and now there are two really good hits that both have been annotated with that function. And in the end it is based on one very shaky annotation.

      I'm aware of at least one gene family where this has happened and where knock out experiments have shown that the original transfer from humans to Drosophila was wrong. However, since that results in the change of only one annotation, the automatic pipelines still find more hits with the wrong function and this leads to new draft genomes being published with the same incorrect annotation, which in turn generates more hits in the genomes to follow. We need to work out a way where such errors can be adequately taken care of.

      Automated annotation will stay around, because there's just too much data to do everything by hand. But if errors have a tendency to spiral out of control that will become a larger and larger issue. Related to that, we will need more experimental work to figure out which genes are functional and what their function is if they are. And we need to do this with a lot more taxa than just a few model organisms, because we likely misannotate things that have novel functions right now.
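
A toy sketch of the feedback loop described in this comment, in Python. The gene names, identity scores, threshold, and the "kinase" label are all made up for illustration, and the best-hit rule is only an assumption about how a naive pipeline might behave, not a description of any real annotation tool. It shows how each later annotation can trace back to one weak transfer, and why correcting that single record does not repair the copies.

```python
# Hypothetical pairwise identities: (query, subject) -> fraction identical.
IDENTITY = {
    ("fly_orf", "human_gene"): 0.60,
    ("beetle_orf", "fly_orf"): 0.90,
    ("beetle_orf", "human_gene"): 0.58,
    ("moth_orf", "fly_orf"): 0.88,
    ("moth_orf", "beetle_orf"): 0.91,
    ("moth_orf", "human_gene"): 0.57,
}


def annotate(query, annotations, threshold=0.5):
    """Copy the function of the best already-annotated hit (assumed pipeline rule)."""
    hits = [
        (identity, subject)
        for (q, subject), identity in IDENTITY.items()
        if q == query and subject in annotations and identity >= threshold
    ]
    if not hits:
        return
    identity, subject = max(hits)
    annotations[query] = {
        "function": annotations[subject]["function"],
        "source": subject,      # where the label was copied from
        "identity": identity,
    }


def provenance(gene, annotations):
    """Follow the chain of copied annotations back to its origin."""
    chain = [gene]
    while annotations[chain[-1]]["source"] is not None:
        chain.append(annotations[chain[-1]]["source"])
    return chain


# Only the human gene has an experimentally supported function.
annotations = {"human_gene": {"function": "kinase", "source": None, "identity": 1.0}}
for orf in ["fly_orf", "beetle_orf", "moth_orf"]:
    annotate(orf, annotations)

moth_chain = provenance("moth_orf", annotations)

# A knock-out shows the fly transfer was wrong; only that one record is fixed.
annotations["fly_orf"]["function"] = "unknown"
stale = [
    gene
    for gene in annotations
    if "fly_orf" in provenance(gene, annotations)[1:]
    and annotations[gene]["function"] != annotations["fly_orf"]["function"]
]

print("moth provenance:", " <- ".join(moth_chain))
print("still carrying the old label:", stale)
```

In this toy case the moth annotation rests on a 91% hit, but its provenance chain still ends at the single 60% human-to-fly transfer. Storing the source of every copied label, as the `source` field does here, is one possible way to make such chains auditable.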

    2. Yes, you don't need all that much of a genome for phylogenetic analysis, as long as you can pick the right parts. There are two reasons for sequencing whole genomes in phylogenetics: first, to help you pick the parts; second, because whole-genome sequencing is becoming cheaper than amplifying and sequencing just the parts you want. Easier to pick a few million bases out of the pile; and hey, you don't even have to produce full-genome alignments or even assemblies to do that.

      I'm glad you know that synteny is completely without homoplasy. It seems to be the only remaining such character, since homoplasious SINE insertions were found. I would say that rare events like translocations, duplications, inversions, insertions, and such are merely rather nice characters, to be analyzed with attention to the possibility of homoplasy, just like any other characters. And synteny, like sequence characters, can be scrambled into uninterpretability by time.

    3. I'm glad you know that synteny is completely without homoplasy.
      Not what I wrote. I stated that synteny produces actual apomorphies, that is that for a reasonably large number of taxa you will actually get some that do not have any homoplasies and where you can show that with a pretty high certainty. You don't get that with sequence data, where most positions have homoplasies somewhere in the tree.
      I'm not that sure about the cost, but I guess that would be taxon dependent. One advantage of sequencing only parts is that you need to acquire a smaller tissue sample. Getting the amount of material to do whole genome sequencing has been tough for us for some taxa and this has resulted in at least additional man hours...

    4. I haven't actually done any Illumina sequencing, but I understand it takes very little material. Am I wrong about that? Now, I'm always thinking exclusively about birds, which have genomes of quite uniform size, and somewhat smaller than the average eukaryote. So sequencing whole genomes makes more sense than if you work on lungfish. The cost also depends on how much sequence you need. I figure a million bases should be enough for almost anyone, but I'm pretty sure it would be cheaper -- even now -- to sequence a whole genome than to pick out enough pieces from conventional PCR & sequencing to make up a million good bases.

      How concerned are you about the complete absence of homoplasy? Why do you find "actual apomorphies" to be superior to "pretty good apomorphies"?

    5. Depends on how much you mean by very little and it also depends on how much coverage you need. Now, I'm pretty sure that for birds you can get there easily, because even small birds are relatively big. I'm currently sorting fleas by sex and now we are talking about 100s of individuals, which we'd prefer to be an inbred lab population (but of course it isn't) and therefore we'll end up with a lot of SNPs, which could also be sequencing errors... I haven't been involved in projects where you do conventional sequencing, but AFAIK 1GBase in Sanger is about half what you need for a genome with a decent depth.

      I don't think there are sequence apomorphies at all. There's a statistical phylogenetic signal in sequences, but if you break that down per base, you get a fraction of an apomorphy. That doesn't mean I reject molecular phylogenetics, but... I'm a German entomologist. So was of course Willi Hennig. That means something. I start hyperventilating when somebody has a discretized continuous character in their matrix. I cringe when there's a reduction trait. Somebody has to stay hardcore, as in: "UPGMA? Do I look like a pheneticist to you, do I?"

    6. I haven't actually done any Illumina sequencing, but I understand it takes very little material. Am I wrong about that?

      Yes and no - it depends on what kind of libraries you're building.

      Now, I'm always thinking exclusively about birds, which have genomes of quite uniform size, and somewhat smaller than the average eukaryote.

      Average eukaryote? You probably meant the average tetrapod?

    7. You probably meant the average tetrapod?

      No, I meant eukaryote. I just could be wrong. We could settle on "multicellular eukaryote" if you wanted. Or even "vertebrate". I do have an unconscious vertebrate bias, certainly.

    8. "Tetrapod" is the only correct term in this case (or maybe even "reptile"). There is quite a bit of variation in fishes, and the average invertebrate genome is smaller than those of vertebrates (which is not to say there aren't many lineages with bloated genomes, see below) and the average unicellular genome is even smaller. Think of human vs Drosophila vs Monosiga/Salpingoeca/Fungi.

      mean/mode C-values (from www.genomesize.com):

      mammals: 3.36/3.26
      birds: 1.39/1.39
      reptiles: 2.28/2.19
      amphibians: 18.9/8.41
      bony fishes: 1.27/1.04
      cartilaginous fishes: 5.56/4.81

      Invertebrates:

      Annelida: 1.23/0.87
      Chelicerata: 2.43/2.29
      Cnidaria: 1.01/1.14
      Crustacea: 4.19/2.28
      Echinodermata: 1.31/0.91
      Insects: 1.32/0.58
      Mollusca: 2.08/1.79
      Myriapoda: 0.74/0.49
      Nematoda: 0.15/0.08
      Platyhelminthes: 2.15/1.18
      Porifera: 0.66/0.37
      Tardigrada: 0.37/0.31
      other invertebrates: 0.78/0.53

      Fungi: 0.038/0.029
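
For what it's worth, the bird comparison can be checked directly from the means quoted in this list. A minimal Python sketch, assuming nothing beyond those numbers; the dictionary name is mine, and comparing unweighted group means is a simplification, since the groups contain very different numbers of measured species.

```python
# Mean C-values exactly as quoted in the comment above (first number of each pair).
MEAN_C_VALUE = {
    "mammals": 3.36,
    "birds": 1.39,
    "reptiles": 2.28,
    "amphibians": 18.9,
    "bony fishes": 1.27,
    "cartilaginous fishes": 5.56,
    "Annelida": 1.23,
    "Chelicerata": 2.43,
    "Cnidaria": 1.01,
    "Crustacea": 4.19,
    "Echinodermata": 1.31,
    "Insects": 1.32,
    "Mollusca": 2.08,
    "Myriapoda": 0.74,
    "Nematoda": 0.15,
    "Platyhelminthes": 2.15,
    "Porifera": 0.66,
    "Tardigrada": 0.37,
    "other invertebrates": 0.78,
    "Fungi": 0.038,
}

birds = MEAN_C_VALUE["birds"]

# Split every other group by whether its mean is below or above the bird mean.
smaller = sorted(g for g, c in MEAN_C_VALUE.items() if g != "birds" and c < birds)
larger = sorted(g for g, c in MEAN_C_VALUE.items() if g != "birds" and c > birds)

print(f"groups with a smaller mean than birds ({len(smaller)}):", ", ".join(smaller))
print(f"groups with a larger mean than birds ({len(larger)}):", ", ".join(larger))
```

On these figures birds sit below every other tetrapod group listed but above most of the invertebrate groups and the fungi.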

    9. I'm thinking the average vertebrate genome might be quite a bit bigger than the average avian genome. And what about plants?

    10. Took a little work, but the mean for angiosperms appears to be around 11gb (diploid), with variation running all the way from 0.1gb to 29gb. So birds are pretty small from an angiosperm perspective.

      Gymnosperms run from 4gb to 70gb. Pteridophytes from 0.2gb to 142gb. Bryophytes from 0.3gb to 16gb. And algae from 0.02gb to 38gb. Huge variation, but the means are all gigantic except for the algae. Of course these are all paraphyletic or polyphyletic groups except for the angiosperms, which might explain some of the variation.

  3. That the next step should be characterization of individual elements is well known.

    The problem is that nobody gets glam mag publications/grants/jobs for doing that, unless they happen to be lucky and stumble upon something very important. But most people won't be so lucky, so the incentives point in a different direction.

  4. I would think a great deal of the work Larry suggests here could be done on computer without setting foot in a lab.
    For example, one could pull up one of the many studies that determined what flanking sequences were driving the expression of a particular gene and reanalyze it in terms of the ENCODE data. If one of the ENCODE RNAs or TE sites can be deleted without effect (and no sign of a footprint for the TE site) then they'd probably be nonfunctional.

    If I was more adept at bioinformatics I'd offer this as an undergrad project.

    Replies
    1. "then they'd probably be nonfunctional"

      Not true if the TE's are part of implementing a function that is incongruently redundant, which is likely much of the genome. You can knock out a piece, and a spare tire takes over and there is no apparent loss since an alternate means of achieving the function gets triggered. And in a few cases apparent increases in functionality might deceptively appear because one is testing in the wrong environmental context.

      And in the case of mouse LINEs you'd hardly notice reduction in function until you knocked out quite a bit because the LINEs are used as a part of the mouse's optical devices. So the functionality isn't just about RNA expression or lack thereof, there are biophysical uses, not just biochemical ones.

    2. One problem with your "redundancy" theory is that the redundant sequences would not be maintained by selection and so over evolutionary time would be degraded and eventually disappear. How do you deal with that?

    3. My guess is that Sal would "deal" with that by asserting that there is no such thing as evolutionary time. Honestly, trying to discuss junk DNA with a YEC seems like an exercise in futility to me.

  5. "We don't need more data"

    The researchers of cancer and auto-immune diseases and diabetes don't feel that way. We absolutely need more data!

    Some of the Genome Wide Association Studies (GWAS) conducted by ENCODE proponents involve thousands of patients today, and millions tomorrow, along with their medical records. This data combined with ENCODE data is revealing associations of specific regions of the genome with disease and health in ways that would not have otherwise been detectable.

    Patients with diseases often have some amount of DNA genetic mutations but also different epigenetic markings (methylation, histone modifications). The GWAS studies are sort of a ready-made field laboratory to study the effects of genetic and epigenetic variation on non-coding DNA. Many of the diseases have associations with intronic and intergenic regions. And because the genome has incongruent redundancy, many times multiple mutations in certain regions are necessary to reveal disease association with intronic and intergenic regions. And sometimes the correlation is only nominal because cytoplasmic and environmental factors play a role.

    ENCODE right now tracks around 150 cell types but most of them aren't healthy, they are immortalized cancer cells. We need to be tracking all sorts of tissue types in various stages of development across thousands if not millions of individuals and we need the computational power to analyze it.

    A few lab workers in an isolated biochem lab aren't going to be up to the task. There is hardly a bigger lab than the real world of the millions of GWAS records that will be used in conjunction with ENCODE, ROADMAP and other data.

    Replies
    1. Yes, but did you have anything to say that is relevant to the point Larry was actually making?

    2. judmarc,

      Yes, apparently you missed it. Larry said we don't need more data, I pointed out we do.

      When 10 million records of GWAS data are collected for a specific disease, coupled with sequencing those genomes, there will be more live experiments than can possibly be done by a biochemical laboratory. And it will involve relevant environmental conditions to boot. That is exactly how various auto-immune diseases, diabetes, and cancers are and will be studied, and that is why ENCODE, ROADMAP and GWAS studies continue to be funded: because they are doing exactly the hard work Larry is calling for. Maybe not the way he prefers to go about it....

      And it's not all about biochemistry but also biophysics. The use of LINE1 transposons in optical applications by nocturnal mammals isn't exactly an obvious biochemical function as much as a biophysics optical function. A broken optical device might not show too much difference in biochemistry, but it will show up in biophysics of optical devices. So even though transposons seem inactive in the germline, they are critical for nocturnal mammals to see in low lighting environments. That wouldn't have been discovered if we weren't collecting data!

    3. @LiarForJesus:
      Nobody is saying there was not a time for collecting data, but you need to do more than merely collect data. Any idiot can list something interesting discovered with data collection in the past, but we are now at a point where so much data has been collected it's not being used, that's the point. The science isn't going to improve just by collecting even more, now that data has to be put to use in experiments to elucidate functional relationships.
      The fact that some obscure transcript is found in some obscure tissue doesn't tell you anything about what that transcript is doing, if anything at all. For that you need direct experimentation with the tissues, gene knock-outs and so on. You can't do that with just more data.

      Honestly, by arguing the opposite you are showing you have no fucking clue what you're talking about. There is no other "way to go about it". You have to do the experiments to find out what those millions and millions of transcripts you have recorded, are actually doing, mere mindless data collection can't tell you that.

    4. "The use of LINE1 transposons in optical applications by nocturnal mammals isn't exactly an obvious biochemical function as much as a biophysics optical function."

      And how was it worked out that this transposon had a biophysical function? With data collection? Please give a citation.

    5. Biophysical function of LINE:

      http://www.cell.com/abstract/S0092-8674(09)00137-8

    6. liars,

      The golden moles, family Chrysochloridae, have degenerate eyes which the skin grows over and they are blind. What would you predict about the heterochromatin and LINE1 structure in the cells of their eyes, and why?

    7. "Biophysical function of LINE:

      http://www.cell.com/abstract/S0092-8674(09)00137-8"


      Excellent, this proves my point. The function was discovered with direct biochemical and cell experimentation, not mindless data collection.

  6. “It's time to do the hard work and figure out just how many of these transcripts have a biological function.”
    Exactly. It is time for a new generation of Dobzhanskys to gear up their Drosophila (or whatever) labs to get whole-animal fitness measures for putative regulatory alleles. Enough of this biochemical nonsense.

  7. Larry,

    Read this quote from Ohta, you will understand what is needed:

    "As this short history demonstrates, population genetics has made
    remarkable strides in understanding both the phenomenology and the
    theoretical models of molecular evolution. However, it also demonstrates
    that we have yet to find a mechanistic theory of molecular evolution that
    can readily account for all of the phenomenology. Thus, while the 1990s
    will most likely be a decade dominated by the gathering of data, we would
    like to call attention to a looming crisis as theoretical investigations lag
    behind the phenomenology."

    Ohta and Gillespie, Development of Neutral and Nearly Neutral Theories. Theoretical Population Biology 49, 128-142 (1996)

    Replies
    1. It reads to me like it sort of supports what Larry wrote. What am I missing?

    2. What Larry wants to see is more data on the functions of each gene part. But our mainstream theory only forecasts junks for most of these parts. So what if that forecast is proven wrong? And do we really need more data to prove that? No, not to me at least. At least I know already that the existing theory is at least incomplete and will forecast many false things, because it has not even accounted for all the phenomenology known at the time of 1996 as acknowledged by the above quote from Ohta and Gillespie. Maybe the difference between me and some people in the field is that to me a correct theory means accounting for all relevant data without a single contradiction. If anyone does not think that is possible, in biology at least, just remember that all seemingly impossible things are viewed as quite simple after they have been accomplished. Also keep in mind, a single contradiction to a theory is equivalent to an infinite number of contradictions. When a theory allows a single contradiction or refuses to be falsified by it, it no longer qualifies as testable (it would be meaningless to use the word test).

      So, what one really needs is a more complete or correct theory, not more data, which was what the above quote means.

  8. for all I care, the present theory for reading the DNA code has not even explained the first molecular phenomenology, the genetic equidistance result!!

    Replies
    1. There's no "theory for reading the DNA code" gnomon.

      Also, there's something called "saturation." Common problem when calculating evolutionary distances. One effect would be that distances cannot be calculated. Look it up.
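
As an illustration of what "saturation" means here, the following Python sketch uses the Jukes-Cantor correction, which is a choice made for this example because it is one of the simplest substitution models; it is not necessarily what any program mentioned in this thread uses. Under that model the corrected distance is d = -(3/4) ln(1 - 4p/3), where p is the observed fraction of differing sites, and the formula has no finite value once p reaches 0.75. The function name is hypothetical.

```python
import math


def jc69_distance(p):
    """Jukes-Cantor corrected distance for an observed fraction p of differing sites.

    Returns None when the sequences are saturated (p >= 0.75), because the
    correction is undefined there.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion between 0 and 1")
    if p >= 0.75:
        return None
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# The corrected distance grows faster and faster as p approaches 0.75.
for p in (0.05, 0.30, 0.60, 0.74, 0.75):
    d = jc69_distance(p)
    shown = "undefined (saturated)" if d is None else f"{d:.3f}"
    print(f"observed p = {p:.2f} -> corrected distance = {shown}")
```

The point of the sketch is only that, past a certain level of observed difference, multiple substitutions at the same sites have erased the signal and no distance can be estimated.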

  9. Regarding the onion test, I would like to say it is asking the wrong question. What one should be asking is: why the simple organism protozoa kind has a genome size variation range from small to large of ~20000 fold, or why flowering plants ~2000 fold, whereas mammals only less than 10 fold? Don't even try to invoke time of evolution as the reason as mammals and flowering plants appeared about the same time. The actual numbers cited here came directly from a paper by the author who invented the onion test (see Figure 1 in the paper).

    Palazzo AF, Gregory TR (2014) The Case for Junk DNA. PLoS Genet 10(5): e1004351. doi:10.1371/journal.pgen.1004351

    Replies
    1. This is a stupid question gnomon. Variation in junk DNA is to be expected. Much more if the junk has little to no fitness effect. It wouldn't matter if flowering plants and mammals "appeared at the same time." Variation can differ between the groups depending on how the groups can tolerate the junk, and on how the junk propagates. But variation is expected. There's no expectation that variation should be identical across groups.

      This should be obvious if you understood what junk DNA is, what selection coefficients are, and that different life forms are subject to different conditions. Also that junk can propagate in different ways, and will be tolerated differently.

      I'm surprised that someone would be naïve enough to think that mere evolutionary time would "account" for variations in amount of junk DNA. It's like expecting that humans and roses should be identical organisms because their main groups (wherever you put the threshold), separated at the same time. Simply ridiculous. You can put that threshold anywhere you want. So you might as well claim that eukaryotes and bacteria should have the same variation in junk because their main groups separated "at the same time." Do you see how ridiculous that is at all?

      Learn some very basic logic, population genetics, evolutionary theory, neutral theory, evolutionary analyses (for example, study how programs based on maximum likelihood work and decide on saturation), etc. But truly understand them. Your shallow treatment of these subjects leaves a lot to be desired.

    2. Photosynthesis says: “Variation can differ between the groups depending on how the groups can tolerate the junk, and on how the junk propagates. Also that junk can propagate in different ways, and will be tolerated differently.”

      Well, I could not agree more. Along this line, I would like to further ask more specifically: which group can tolerate more junks? Is it mammals or plants or protozoa? and why?

    3. The answer is obvious, isn't it? As Einstein said, the key is to ask the right question. Once I asked it, you see, Photosynthesis, you could and did easily answer it for me. Your saying "junks will be tolerated differently" is all there is to the onion test.

    4. "is all there is to the onion test"

      I think you don't understand what the test is about. It's not about whether variation and limits (when there's limits) can be explained. It's about whether proposed functions can explain the presence and variation in what we see as junk. It's a test that makes the person proposing a function realize that the presence of junk is an unavoidable conclusion.

    5. From the onion test author: “the onion test simply asks: if most eukaryotic DNA is functional at the organism level, be it for gene regulation, protection against mutations, maintenance of chromosome structure, or any other such role, then why does an onion require five times more of it than a human?”

      Answer: 99.9% of human genome is functional for internal construction purposes while 0.1% is for normal variations among humans and for adaptation to environments. HIV is 20% vs 80%. Onions have a large fraction of their genomes for adaptive purposes, which can be freely changed without much effect on its internal integrity (in this sense, junks). So, to say most human DNA is functional for its construction does not necessarily preclude one from saying that most onion DNA is not or most HIV DNA is not for construction. So the question why does an onion require more (functional genomes in terms of construction purpose) is not a valid question. No one is saying so and it does not. No one, at least I am not, is saying that every species has the same proportion of functional genomes (in terms of internal construction not much related to adaptation). So, the onion test has a straw man premise. What an onion has more of than a human does, or an HIV virus has more of than a human does, is a larger fraction of its genome as the so called junks (that in fact play adaptive roles in response to environments). Once you accept as you did that some species can tolerate more junks, the size of the junks, whether 1x or 5x of human genome size, is irrelevant.

    6. You're making the Onion test's point. You're declaring that not all the DNA in every organism has a function. That's what this is about.

      There's no straw-man in the test. You made its point exactly. Whether you want to accept that our own genomes have junk or not doesn't change that you ended accepting that at least some organisms have junk, or at least that your favourite explanation for the human junk doesn't explain the onion's.

    7. "Once you accept as you did that some species can tolerate more junks, the size of the junks, whether 1x or 5x of human genome size, is irrelevant."

      This is true for anybody who understands that there's such thing as junk DNA. The test is a reality check for those who don't.

      Man you're slow.

    8. the simple organism protozoa

      Protozoa is an extremely diverse paraphyletic collection of organisms, not a natural taxon. In fact, some protozoans are more closely related to humans (and other animals) than to other protozoans. Would you expect them to be similar in terms of genome length? Why?

    9. "Once you accept as you did that some species can tolerate more junks, the size of the junks, whether 1x or 5x of human genome size, is irrelevant."

      "This is true for anybody who understands that there's such thing as junk DNA."

      It is true for you and me but not to the workers in the molecular evolution and popgen field. all of them base their papers on the infinite sites model, which says there are infinite number of neutral or junk sites for any genome regardless whether it is human or onion. They do not take what you said into account in their studies that different species tolerate different amounts of junks. They don't think that monkey or mouse or onion can tolerate more junks than humans do. Or maybe they do so in their heart but at least they disregard that in their work.

    10. Protozoa are all unicellular, which is the key. They are hence all simple relative to multicellular organisms. Simple systems can tolerate more random error type of variations in their building parts, including the dimensions or amounts.

    11. So if a microsporidian has a tiny genome, and a dinoflagellate a huge one, which of them has "more random error type of variations in their building parts" (whatever this gibberish is supposed to mean), and why? And how do you know?

    12. "more random error type of variations in their building parts" just means a large stdev from the ideal form. So, if a part for a toy car is specified to have a dimension of 10+/-6, then both 16 or 6 will be allowed random errors. So, the answer to your question would be that both have similar amounts of random errors. Something in between would likely be close to the ideal size.

    13. Don't even try to invoke time of evolution as the reason as mammals and flowering plants appeared about the same time.

      Really? Even a casual glance at the literature would tell you that we know mammals from the upper Triassic, but the first flowering plants are known from the lower Cretaceous.

    14. gnomon,

      "not to the workers in the molecular evolution and popgen field. all of them base their papers on the infinite sites model, which says there are infinite number of neutral or junk sites for any genome regardless whether it is human or onion."

      This is false. Nobody assumes infinite sites. Nobody assumes infinite junk either. Almost any program they might use will tell them if there are saturated sites and in what proportion, often giving no results for phyla distances when too many sites are saturated. Often they will not work if the sequences are too short.

      I don't know what kinds of works on evolution you have checked, but my experience working and reviewing is very different from what you assert. People are conscious that there's such a thing as saturation, and people are conscious that different organisms have different effective populations, different metabolisms, different generation times, and different levels of junk.

      Sure, some people assume that there's no such thing as junk, but those rarely try and publish work on evolution and phylogenetics.

    15. Just giving you one example should be sufficient. That you don't know this is probably because you are like most people who got into the field without a careful look at the paradigms or assumptions. Those paradigms are so taken for granted that most papers do not even mention them. Some do but only in the supplementary materials.


      "Under the infinite sites model, mutations accumulate in a Poisson process of rate μl, the locus-wide mutation rate."

      This sentence is on page 6 of the supplementary materials of the paper by G. David Poznik et al., "Sequencing Y Chromosomes Resolves Discrepancy in Time to Common Ancestor of Males Versus Females," Science 341, 562 (2013); DOI: 10.1126/science.1237619.

      You mention saturation. However, is saturation the same as maximum distance? Do you think that distance will stop at the maximum distance after a very long evolutionary time, especially for fast-mutating genes (once it reaches, say, a 42% difference in sequence identity, it will never increase any more, no matter how long evolution continues)? Saturation does not imply maximum distance, right? It is a fact that the field has no concept of maximum genetic distance or maximum genetic diversity (MGD). Why? Well, does the infinite sites model predict a maximum distance?

    16. The first result in the field, the genetic equidistance result, is in fact all about maximum genetic distance. After 100 million years, why not? And yet, the interpretation offered by the field was linear distance still increasing with time. Hence you have the molecular clock, and in turn the neutral theory and the infinite sites assumption. Everything went astray right from the beginning.

    17. gnomon,

      You're mischaracterizing the fields of population genetics and phylogenetics. Here's what you said:

      "which says there are infinite number of neutral or junk sites for any genome regardless whether it is human or onion"

      Which is false. The model is used when mutations are low enough to assume that few if any mutations fall in the same place. But it is not used for just any analyses. This is why I did not understand what you were talking about. No studies I have checked or reviewed dealt with lowly divergent sequences, and thus none used a linear model, let alone one assuming infinite sites. As I told you, most software will give up estimating distances if there are too many mutations. Researchers will use different models, and often renounce comparing DNA sequences, because of saturation. Researchers will be asked for the models used to infer saturation. Reviewers often demand that appropriate models be used to account for potential mistakes with highly divergent sequences.

      There's no such thing as "equidistance." Organisms don't stop diverging. Sequences don't stop diverging. It's just that saturation makes it from hard to impossible to actually measure such divergence. You're mistaking the fact that a linear model would stop working for a maximum distance.

      The "interpretation of the field" is that once there's saturation, we start losing the power to measure distances, and that the molecular clock's ticking can't be measured either. This is what I was taught. I wasn't taught that a molecular clock could be used infinitely. I wasn't taught that mutations never occur in the same site.

      Maybe you had bad professors. Maybe you had careless teachers who assumed infinite sites no matter what. Maybe you think you're the one who discovered saturation. But you're way off base. Sorry to break the news, but researchers in evolution have known about saturation and problematic molecular clocks for eons.
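
The saturation point made in this comment can be illustrated with a minimal sketch. It assumes the Jukes-Cantor (JC69) substitution model for DNA, which is a choice made here for illustration and not a model named anywhere in the thread; the function names are hypothetical. Under that model the observed fraction of differing sites levels off near 0.75 even though the true number of substitutions per site keeps growing, so the plateau reflects a loss of measuring power and not a halt in divergence.

```python
import math


def expected_p_distance(d):
    """Expected fraction of differing sites under JC69 for a true
    distance of d substitutions per site (assumed model)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def jc_distance(p):
    """JC69-corrected distance from an observed fraction of differing
    sites p. Returns None when p >= 0.75, where the correction is
    undefined because the sequences are saturated."""
    if p >= 0.75:
        return None
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    # True divergence keeps increasing; the observed difference does not.
    for d in (0.05, 0.5, 1.0, 2.0, 5.0):
        p = expected_p_distance(d)
        print(f"true d = {d:4.2f}   observed p = {p:.3f}")
```

In this sketch the correction recovers the true distance at low divergence and returns nothing usable once the observed difference reaches the plateau, which is the behaviour described above for software that gives up on saturated sequences.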

    18. "There's no such thing as "equidistance." Organisms don't stop diverging. Sequences don't stop diverging."

      Just take cytochrome C. The protein from yeast is approximately equidistant to all multicellular animals, regardless of whether they are worms or humans (~36% identity). This is just a fact whether one likes it or not. Now do you think it could diverge more given more time? Could the difference between yeast and human in cyto c be ~15% identity some time in the future, which would be equivalent to the identity between two non-related randomly picked proteins?

    19. gnomon,

      I thought that you were mistaking concepts, but it's worse than I suspected. I thought you were talking about saturation and how it makes it difficult to calculate distances, but you're talking about something that should be obvious. Since yeast's cytochrome C diverged from those of "multicellular animals" when the two lineages separated (something of a "single" point, provided no HGT and/or gene conversion happens), then its proteins should have somewhat the same distance to their orthologs in multicellular animals. However, phylogenetic analyses rarely give equivalent distances (identity is not a measure of distance), because of confounding factors (like saturation, gene conversion, randomness in fixation of neutral and semineutral changes, etc).

      Anyway, the sequences keep mutating. But selection keeps some proportion of the protein because of the conservation of function. Not every position in a protein sequence can change without affecting function, and homoplasy can also happen. But the organisms and the sequences continue diverging. Only we can't measure it because of the effects of selection (plus homoplasy and saturation). This is much more evident when we use protein and DNA sequences. So, for example, two cyt C sequences might be 36% identical at the protein level, but very hard to align at the DNA level (unless we used the protein sequence to produce the alignment).

      Your conceptual framework is in very bad shape.

    20. The more specific, the better the theory. Just answer the question: is it going to stop at 36%, or can it go to 15%? My theory says it stops at 36%. What does yours say?

    21. Who cares? The problem we have with cytC is that it should keep its function. That means that it might be hard to get below the mark you're putting because some unimportant positions might be carried along, conserved by proximity to functionally important ones. Check a multiple alignment. I think you'll find, as I have in many, many cases, that even if you have around 36% between yeast's cytC and that of any multicellular organism, it's not the very same 36%. The conserved proportion in all of them will be less. That means that, at least in theory, it can go below 36%; only homoplasy, saturation, carrying along, etc. have kept the yeast-whatever pairwise numbers around your preferred "threshold." This is too obvious. By evolutionary processes it might become harder and harder to go lower as the mutation/drift/selection divergence game goes on.
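
The multiple-alignment argument in this comment can be sketched with toy data. The three aligned strings below are invented for illustration and are not real cytochrome c sequences, and the function names are hypothetical. The point shown is only that two pairwise identities to the same reference can be equal while the positions conserved in all sequences at once make up a smaller fraction.

```python
def pairwise_identity(a, b):
    """Fraction of aligned positions at which two equal-length
    sequences carry the same residue."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def fully_conserved_fraction(seqs):
    """Fraction of alignment columns in which every sequence agrees."""
    columns = list(zip(*seqs))
    return sum(len(set(col)) == 1 for col in columns) / len(columns)


# Invented toy alignment (NOT real cytochrome c data).
toy = {
    "ref":    "ACDEFGHIKL",
    "taxon1": "ACDEFMNPQR",  # agrees with ref at positions 0-4
    "taxon2": "ACDSTGHWYV",  # agrees with ref at positions 0-2, 5, 6
}

if __name__ == "__main__":
    print(pairwise_identity(toy["ref"], toy["taxon1"]))
    print(pairwise_identity(toy["ref"], toy["taxon2"]))
    print(fully_conserved_fraction(list(toy.values())))
```

Both toy taxa are 50% identical to the reference, yet only 30% of the columns are shared by all three, because each pair matches at a partly different set of positions.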

    22. Yes, I agree mostly. functional constraint!! So, it appears that you are saying that the 36% difference between yeast and higher organisms is the maximum for cyt C. Any more difference than that may destroy the functional residues.

      Now, here is another fact: bacteria are ~66% different from all eukaryotic organisms. So, why? Is this 66% the maximum difference?

    23. How many eukaryotes that aren't opisthokonts have cyt c more than 36% different from yeast or humans?

    24. "So, it appears that you are saying that the 36% difference between yeast and higher organisms is the maximum for cyt C."

      Nope. I'm saying that it could be lower, but that several factors make it hard to go lower by evolutionary processes. That as divergence goes on, it becomes harder and harder to go lower, but that this does not mean that proteins with lower identities would not work. Read much more carefully.

      Bacteria are not 66% different to eukaryotic organisms. Lots of genes are not even present in one or the other (and/or are so divergent that the homology is no longer detectable). Here you're way off man.

      This is fun gnomon, but I'm not sure I'll have more time. You're in serious trouble. Your concepts are all over the map. Only semi-right, which makes it hard for you to notice why your conclusions are far off the map. I can see how you go astray. Hopefully you'll notice that one good day. Just try and be much more humble.

      Huítóu jiàn (see you later)

  10. Once you have the answer to my above question, the onion question gets answered automatically.

    1. So, why do onion genome sizes vary so much?

    2. Because onions have a lot of the molecular phenomenology of genetic equidistance!

  11. Why go that far? Why don't you ask them how cellular differentiation evolved and by what mechanism? That will make them happy.

    Happy Friday!!!

    1. An invisible but omnipresent mind, with no spatial dimensions, with human emotions like love, hate and jealousy, somehow persisting eternally outside of time and space, living in the absence of a physical brain, which still has divine omnipotent powers over all of material reality, wished it into existence at the expenditure of no energy or effort, in an instant, for reasons we cannot understand, at some moment in time we can't be bothered speculating on.

      That's the right answer, right?

    2. Mikkel,
      This is 'lies' one and only argument; it's straight up god of the gaps. If science can't explain every aspect of the evolution of life on Earth (right here in the comments section of a blog) then evolution is all falsified and god did it. "Cellular differentiation" is just where he's hiding his omnipotent benevolent magical creator today.

      Ask him why his god created malaria. That will make him happy.

    3. Looks like Mikkel and Chris B have no idea how cellular differentiation evolved and by what mechanism. This is not surprising as I spent the last few days trying to find ANYTHING that would make sense.
      Interestingly, I stumbled on a paper that suggests that cell differentiation is "influenced" by epigenetic (no mention of evolution). That is even more bad news for the followers of modern evolutionary theory, whatever that means.

      Since the boys have no evidence or a hypothesis on how cell differentiation evolved or is controlled, they had no choice but to attack some god.
      This is what it has come to in the world of science.

    4. "Interestingly, I stumbled on a paper that suggests that cell differentiation is "influenced" by epigenetic (no mention of evolution). That is even more bad news for the followers of modern evolutionary theory, whatever that means."

      Why would that be bad news?

    5. Looks like lies/Quest/KevNik/whatever has no idea what he's talking about again. And runs out of ideas after god of the gaps falls flat for the umpteenth time.

      You still have nothing. You don't even know what epigenetics could potentially mean for evolutionary theory. You just think it's something you can throw out without understanding it and think you are making a point.

      If you spent the last few days (?!) trying to find information on how cell differentiation is controlled and only "stumbled" on something about epigenetics, you are either lying or without a clue on conducting a literature search.
      If the latter, consult your librarian. If the former, you may as well resort to flaming video games still in development (oh, wait...)

    6. A clue for whoever that was: cell differentiation is epigenetic by definition, given that all cells in the typical organism are genetically identical.

    7. This is very interesting: liars for the devil accused me, and many other people on this blog, of being kevnik, quest, etc.

      Why? Is it reasonable?

      So, I researched this blog and found some interesting connections.

      People who clearly identified themselves by the name: Sal Cordova, Peer Tarborg were accused of having Quest/Kevin sock puppets.

      Now me.

      Why? The moment one of them demanded EVIDENCE for any of the mysteries of evolution, such as the origin of life, the origin of information needed for life, the origin of first self-replicating molecule, the origin of the first cell, the origin of eukaryotic cell, the origin of complex life forms and the mechanism of their evolution and so on, he was branded quest/kevnik.

      Here is the kicker: if anyone asks any of those question and uses ellipses, as Peer did, he was almost automatically branded as Quest.

      Peer and Sal are pretty smart people but both of them can't be Quest.

      At one point John Harshman made a comment that "the blog was dominated by Quest" or something like that. The only solution was for the host to remove Quest's comments.

      If you call me or any other people Quest, I'm pretty sure most of them take it as a compliment. I do.


    8. So John, by what evolutionary mechanism (s) did the epigenetics evolve?

    9. At one point John Harshman made a comment that "the blog was dominated by Quest" or something like that.

      John was being polite. The factually accurate verb would have been "infested". I've never mistaken Sal or Peer for any of the anonymous trolls that visit this place regularly, but I also have the impression that quite a few of you are Quest's sockpuppets. Either this, or there are a number of logical, cognitive and linguistic defects that creationist trolls commonly share through some queer sort of convergence. You don't use ellipses, but it might be deliberate avoidance. Then, after all, who cares if you are or are not Quest? You are mutually swappable.

    10. I know you have no answer, so I will ask you this question:

      Do you believe that some "traits" can be inherited by an organism without altering the sequence of DNA?

    11. Yes. Apart from epigenetic modifications affecting germline cells (which, however, remain stable for only a small number of generations) we have socially learnt (and inherited) skills and knowledge. If you regard human cognitive structures as part of "the organism", they represent traits acquired thanks to cultural, not biological transfer of information.

    12. Piotr

      Why don't you answer some of the questions that those "trolls" have been asking rather than just foaming your saliva and spitting it at them with your baseless accusations?

      What should you be called for continually not answering questions and just spreading accusations? Do you want a name or will you figure it out yourself?

      Your "best kicker" is when you accuse scientists who perform real laboratory experiments of being incompetent. Do you know what that makes you? Do you want a name? Can you see your incompetence?

      Unless you answer those questions, don't waste your time writing comments to me. I will be ignoring your arrogant ass.

    13. I sometimes let myself be drawn into discussions with trolls. Those discussions are invariably futile, and so indeed I feel it's a waste of my time.

      What serious scientists have I accused of incompetence? Care to provide a link?

    14. Re Piotr

      lies for the devil probably thinks that Michael Behe is a serious scientist.

    15. "Your "best kicker" is when you accuse scientists who perform real laboratory experiments of being incompetent."

      Like who, lies? Care to back up that statement?

      And once again, the fact that science has not yet traced the "origins of life" or whatever moveable goalpost you are throwing out today does not in any way invalidate the decades of data supporting evolution.

    16. So John, by what evolutionary mechanism (s) did the epigenetics evolve?

      The usual ones: mutation, selection, drift. Why is this a special problem? What is confusing you about it?

    17. The usual ones: mutation, selection, drift. Why is this a special problem? What is confusing you about it?

      I guess this is a normal pattern for creationists. Every time a new discovery is made (or, as in the case of epigenetics, is new to creationists) it is immediately viewed as a problem for whatever scientific ideas aren't allowed in a god-controlled universe.

    18. John Harshman

      You didn't answer my second question

      Do you believe that some "traits" can be inherited by an organism without altering the sequence of DNA?

      but I'm very encouraged by your first one already.

    19. Piotr

      "I sometimes let myself be drawn into discussions with trolls. Those discussions are invariably futile, and so indeed I feel it's a waste of my time."

      Now you know how I feel.

    20. John

      I think you and PZ Myers should officially complain to Microsoft. It is impossible to spell your last names correctly as the self-corrector almost always corrects your last name to "Horsham" or "Hirschman".

      I think the computer "junk DNA" is not letting it evolve. You may have a serious case. Larry too.

    21. Now you know how I feel.

      Nobody forces you to visit a scientist's blog, if it's such a frustrating experience. I understand it's no place for the likes of you. Perhaps your church requires you to go out and "reach" the unfaithful, but in that case I'd recommend traditional door-to-door preaching in your town's streets. With your manners and eloquence, you won't score many conversions, but at least you'll have some outdoor exercise walking from house to house.

    22. Do you believe that some "traits" can be inherited by an organism without altering the sequence of DNA?

      Yes, over the short term (a few generations), but not over a long enough term to be relevant to evolution.

      Hirschman

      Add an "n" to the end of that and you have the original German spelling. But my autocorrect has no problem. Harshman Harshman Harshman. Easy. Perhaps you shouldn't be using a Microsoft product.

    23. liesforthedevil, did you bring up epigenetics because you believe that epigenetic effects/results are evidence that an 'outside force' (i.e. your chosen, so-called 'God') influences (or designs, creates, assembles, guides) some or all "traits"?

    24. "So John, by what evolutionary mechanism (s) did the epigenetics evolve?"

      This question is supernaturally dumb. Do you even fucking know what epigenetics is?

    25. John & Piotr

      It looks like both of you rely on Coyne's opinion rather than any scientific evidence for epigenetics not having any relevance to evolution. Neither of you have linked any studies that prove your claims. What if there was scientific data suggesting otherwise?

    26. If epigenetic can alter the phenotype in heritable ways and remain stable for several generations, who says that epigenetics can't have any relevance to evolution at all?

    27. If epigenetic can alter the phenotype in heritable ways and remain stable for several generations, who says that epigenetics can't have any relevance to evolution at all?

      People who have the faintest fucking clue about evolution. That's who.

    28. Sorry, but "several generations" isn't good enough. It needs to be stable over hundreds of generations, not just three. Nor do I rely on Coyne's opinion. I see you haven't linked to any studies. Show me a case in which epigenetic inheritance is effective over evolutionary time. This is merely your latest attempt to grasp at anything that seems to disagree with the standard model, therefore Jesus.

    29. Several generations (8-10) is enough to make some people investigate further and others mad (look above). Studies are being done so who knows. I was waiting for your links with definite evidence that epigenetics has been discarded as an evolutionary mechanism, so I guess there is some interesting evidence out there that can change the way people view evolution. What's evolutionary time in YOUR VIEW? It seems that your standard model is evolution did it but the disagreement remains as to HOW EVOLUTION DID IT. How's your view better than mine?

    30. I was waiting for your links with definite evidence that epigenetics has been discarded as an evolutionary mechanism, so I guess there is some interesting evidence out there that can change the way people view evolution.

      So could you perhaps share some examples of this "interesting evidence" with the rest of us? Thanks in advance.

    31. Sceptical Mind,

      No one said epigenetics has been discarded as an evolutionary mechanism. There is just no evidence that epigenetic changes can last more than a few generations. It has not been observed yet. That doesn't mean it's impossible, only that there is currently no evidence to support that hypothesis. This is still an active area of study. It was thought that epigenetic changes (methylation of DNA for example) were reset when DNA replicated, effectively 'wiping the slate' of these changes. In fact, most of these changes are reset. Some have been observed to persist for several generations, but not longer than that.

      To be effective over evolutionary time, an epigenetic change would have to persist like a point mutation, for example. Most point mutations are more or less permanent, spreading through the population or dropping out of the population by genetic drift, modified by a positive or negative selection coefficient, if applicable. They become part of the genetic variation in the population. Point mutations don't just revert back to their former state after a few generations.

      If some epigenetic changes do turn out to be stably inherited over time like point mutations, this would pose no problem for evolutionary theory. It would be another recognized source of heritable variation.

    32. The great thing about actual mutations is that they persist forever, so there can be a population response to selection and that response can result in fixation. The shorter the persistence of the epigenetic change, on the other hand, the less time there is for a population response. And even if it gets fixed, in another few generations it won't be again. How can that possibly affect evolution?
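
The persistence argument in the last two comments can be made concrete with a toy calculation. Everything in it is an assumption chosen for illustration: a deterministic, infinite-population, haploid model in which a heritable variant has fitness 1 + s relative to 1, and a fraction r of its carriers revert to the original state every generation. The function names and parameter values are hypothetical. With r = 0 the variant behaves like a stable point mutation; with a reversion rate larger than the selective advantage it cannot spread.

```python
def next_frequency(p, s, r):
    """Advance the variant frequency p by one generation: selection
    first (variant fitness 1 + s versus 1), then a fraction r of
    variant carriers revert to the original state."""
    after_selection = p * (1.0 + s) / (1.0 + s * p)
    return after_selection * (1.0 - r)


def trajectory(p0, s, r, generations):
    """Return the list of frequencies from generation 0 onward."""
    freqs = [p0]
    for _ in range(generations):
        freqs.append(next_frequency(freqs[-1], s, r))
    return freqs


if __name__ == "__main__":
    stable = trajectory(p0=0.01, s=0.05, r=0.0, generations=500)
    unstable = trajectory(p0=0.01, s=0.05, r=0.1, generations=500)
    print(f"no reversion, final frequency:  {stable[-1]:.6f}")
    print(f"10% reversion, final frequency: {unstable[-1]:.3e}")
```

In this sketch, a rare variant grows by roughly a factor of (1 + s)(1 - r) per generation, so it increases only when that product exceeds 1. Real populations add drift and other complications that this toy model ignores.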

    33. Mikkel,

      A more intelligent question would be: why epigenetics evolved since it is mostly a deterrent of evolution by decreasing the fitness of an organism with cancer and diseases?

      Evolutionary suicide? Natural selection, super-selectively and randomly impotent?

      I'm sure you have a hypothesis just as good as the one that didn't explain the enormous junk DNA differences in similar species.

    34. What exactly do you think epigenetics is? Your word salad just above leads me to wonder if you have any notion what you're talking about. If epigenetics had not evolved, there would be no multicellular organisms.

    35. lies,

      "A more intelligent question would be: why epigenetics evolved since it is mostly a deterrent of evolution by decreasing the fitness of an organism with cancer and diseases? "

      Even if this were true, ID/creationists have a lot more explaining to do here. What kind of intelligent design is that?

  12. ENCODE claims another victim, this time in the literary world:

    "Like most children of her era, she'd been taught to believe that the genome — the sequence of base pair expressed in the chromosomes in every nucleus of the body — said everything there was to say about the genetic density of an organism. A small minority of those DNA sequences had clearly defined functions. The remainder seemed to do nothing, and so were dismissed as 'junk DNA.' but that picture had changed during the first part of the twenty-first century, as more sophisticated analysis had revealed that much of that so-called junk actually performed important roles in the functioning of cells by regulating the expression of genes. Even simple organisms, it turned out, possessed many genes that were suppressed, or silenced altogether, by such mechanisms." And so on.

    —— Seveneves, by Neal Stephenson (page 599)

    1. Hah, I noted that as well! I thought of writing and inviting him to have a look at this blog and then consider editing. Unfortunately, later developments depend on "advanced epigenetics," so it's not easily removed from the story.

      Though I bought the book, I started going off Stephenson at about volume two of the Baroque Cycle, and thought his peak was prior to that.

    2. Snow Crash will always be a classic, though.

    3. Argh! Spoiler alert (I'm on page 150 or so and have a long train ride tomorrow).
      I think The System of the World (Baroque 3) was good, Anathem was very good and Reamde was kinda meh (it basically screams "Make me a movie starring Nicolas Cage. And not the Wild at Heart/Lord of War/Adaptation kind - make me the other kind of Nicolas Cage movie"). I'm enjoying Seveneves thus far.

      I'd pick Cryptonomicon over Snow Crash and Snow Crash over Anathem and Anathem over AYLIP in terms of Stephenson. The Baroque Cycle suffers mostly from the fact that every character has about 20 pseudonyms and unless you keep track of them by taking notes you get them all confused.

    4. What? No mention of The Diamond Age? Now that was a great book.

    5. AYLIP is A Young Lady's Illustrated Primer, so it did get a mention. It's (judgement pending on Seveneves) my 4th favorite Stephenson. Talking nerd, I keep wondering if you are the John Harshman that did work for GDW.

    6. Talking nerd, I keep wondering if you are the John Harshman that did work for GDW.

      The proper term is "geek". And yes.

    7. https://xkcd.com/747/
      On behalf of my simulationist friends I would like to express thanks for your work. On behalf of narrativist/gamist hybrid me, who has had to play countless hours of Triplanetary (before it was replaced by Eklund's High Frontier for its more realistic treatment of gravity) to satisfy them (and for which I was rewarded through sessions of InSpectre, Fiasco and various WP-games) I would like to express a somewhat more ambivalent point of view. But since every minute they spent on designing satellites was a minute not spent on coming up with house rules to make ASL more realistic (because seriously what ASL needed was more rules), I guess you have done good *g*

  13. All are optimistic and predict enormous positive impact on scientific understanding

    Not so sure about it! Does anyone remember the massive "interactomics" that was a consequence of good mass-spec becoming available to the masses? Hello, that was more than 10 years ago. What did we get in return? Very little - particularly in the realm of understanding. It still pretty much stands where it was before: life, and protein interactions, are very complicated.

    One of the big things that hopefully will get resolved in the next 10 years is the relative significance of epistasis. Right now, for complex traits that depend on many genes, there is a stark contrast between genomics people and standard geneticists. The former really like their linear equations and claim that almost everything is additive, whereas practically all experiments of the latter show pervasive epistasis. One of the two must be wrong.
