Thursday, March 05, 2015

Don't misuse the word "homology"

Here's the latest science news from The Allium: Evolutionist Loses It As Colleague Conflates Homology and Similarity Yet Again.
Evolutionary biologist Dr. Constance Noring shot and killed her microbiology colleague and formerly good friend, Dr. Dan Deline when, for the umpteenth time he used the word homology when he really should have said similarity.
Read the rest. I sympathize with Professor Noring. This could have been me if Canadians were allowed to buy handguns.


37 comments :

  1. When a speaker says homology when they meant similarity or identity, my colleagues already know why I'm so insistently raising my hand. Sigh!

  2. How about if they used a corrected similarity value and called it homology? Say, 75% raw similarity corrected (let's use Jukes-Cantor); hmmm, I get 30 changes per 100 sites, or 5% multiple hits, of which around a third are reversals. Would it be OK to say, in that case, 73% homology, i.e. about 73% of sites are identical by descent?

    Just asking.
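The arithmetic in the comment above can be checked with a short sketch. It assumes the standard Jukes-Cantor correction and, as an added assumption not stated in the comment, treats "identical by descent" as "no substitution occurred at the site on the path between the two sequences" (substitutions per site modelled as Poisson with mean equal to the corrected distance). The function names are hypothetical and chosen only for illustration.

```python
import math


def jukes_cantor_distance(p_diff):
    """Expected substitutions per site under Jukes-Cantor,
    given the observed proportion of differing sites."""
    if not 0 <= p_diff < 0.75:
        raise ValueError("Jukes-Cantor is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p_diff / 3)


def fraction_never_substituted(distance):
    """Assumption: substitutions per site are Poisson with mean `distance`,
    so the chance that a site was never hit is exp(-distance)."""
    return math.exp(-distance)


observed_difference = 0.25  # 75% raw similarity
distance = jukes_cantor_distance(observed_difference)
multiple_hits = distance - observed_difference
unchanged = fraction_never_substituted(distance)

print(f"corrected distance:      {distance:.3f} substitutions per site")
print(f"hidden multiple hits:    {multiple_hits:.3f} per site")
print(f"sites never substituted: {unchanged:.3f}")
```

Under these assumptions the corrected distance comes out at about 30 changes per 100 sites, and the share of sites never substituted lands between 73% and 74%, in line with the rough figure quoted in the comment.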

    1. The 30% changed sites would still be homologous. Only variant homologs. The sites would play the role of "characters," and their conservation/change the role of "character states." The "characters" are homologous.

      Of course, the history can get muddled by indwells, which makes it hard to tell precisely which positions can still be considered homologous characters, so then it's better to refer to the sequences as the "homologous characters" under consideration.

      Percent homology might work, though, when referring to domains. As in a domain making up 20% of a protein would be homologous to the same domain that makes up 50% of another. But then, which percent would you use? I'd rather say that both proteins share a homologous domain and avoid confusion.

    2. "indels" not "indwells." No idea how that word got there.

    3. Homology is a conclusion based on evidence. For example, you may conclude that two genes descend from a common ancestor (i.e. they are homologous) based on the evidence that their aligned sequences are 30% identical.

      Homology is a word like "pregnant"—you either are, or you aren't, based on evidence. You can't be 30% pregnant.

    4. "indwells" got there because either gods or gremlins infect this blog. They attack me almost every day. (Those spelling mistakes and typos are not my fault.)

    5. photosynthesis: I'd say there are two sorts of homology working here: site homology and base homology. You're talking about the first, and I'm talking about the second. The same applies to morphology. Different states of a character are homologous in one way, but the same state in different taxa can be homologous (or homoplasious) in the other way.

      Larry: I would consider "percent homology" to be a summing of individual homologies. If 80% of the bases in a sequence are identical by descent, it makes sense to call the sequence 80% homologous. This is not the same thing you were complaining about before.

    6. I was talking about each position being considered the character under investigation, so whether we use bases, amino-acid residues, or complete sequences makes no difference. My point applies: each base, conserved or not, would be a character. Each base can be considered homologous whether it is conserved or not. But we rarely go to the base level for characters because things can easily get muddled with semantic/philosophical problems.

      In your example you're implying that you know that the non-conserved positions are not homologous, and that only the conserved ones are. Again, character states here would mean that the homologous character is conserved or not conserved, not that the characters are not homologous. To make matters more confusing, apparently conserved positions can also have gone through homoplasy without having lost their homologous status. The positions/bases would be homologous by descent because, despite having changed states, they are related by common descent.

      But it's easy to get all muddled and confused. So I prefer to stay at the sequence level, rather than at the base/residue/position level. And I'd rather not allow the use of the word homology when identity is the word that best avoids confusion. When you say identity you don't have to explain that much more. If you want to say percent homology you have lots of assumptions and explaining to do.

    7. It isn't clear to me what you're trying to say, but I don't think you understand what I'm saying. Let me try again. There are two sorts of homology in DNA or protein sequences. The first is site or positional homology, i.e. sites that we align as the same site even if their bases/residues are different. The second is base/residue homology, e.g. two glycines at position 34 that are glycine because they were glycine in the common ancestor and were never replaced. Positions and their contents are not the same thing.

      Again, we do exactly the same thing with morphological characters. Characters are homologous, and character states are homologous too. If two taxa have the same character state, that's homologous if they got it by inheritance of that state from their common ancestor.

      Identity doesn't have to be homologous, given that homoplasy happens. The two concepts are different and need different words. You just have to be aware which one you're really talking about.

    8. John,

      I understand what you're saying (or I think I do). My attempt was at showing that going your way only adds to confusion.

      Why not leave the two glycines at position 34, which are both glycine because they were so in the common ancestor, as being identical character states of a homologous character?

      Why would we want to define (by contrast) the tryptophan/phenylalanine pair at position 64 as being non-homologous because one (or both) of them changed from their state in the common ancestor? Why not think of them also as character states of a homologous character, only states that did change?

      You're making character states into characters, which is confusing and does not help us understand each other. Of course, you could justify it, but it would still be confusing. Just see how much explanation has gone between us, and I might still be unable to explain my point, while you keep thinking that I don't understand yours?

    9. If two taxa have the same character state, that's homologous if they got it by inheritance of that state from their common ancestor.

      Agreed. But if you define percent homology from that, then you're ignoring the homologous characters that did not remain in the same state.

    10. You may not like it, but the commonly understood meaning of homology applies both to characters and to character states, separately. Homology is defined as similarity due to common ancestry. Different states of one character are homologous characters if that character was found in the ancestor; but identical states are homologous states if that state was found in the ancestor. You may want to apply a special, molecular meaning to the term, but I don't see why. There are levels of homology; always have been.

      Tryptophan at position 64 is not homologous to phenylalanine at position 64, but position 64 is (or may be) a homologous site in two species even if occupied by non-homologous residues.

    11. If 80% of the bases in a sequence are identical by descent, it makes sense to call the sequence 80% homologous.

      No, that makes no sense at all.

      If you have decided that the stretches of nucleotides share a common ancestor then they are homologous. You describe their relatedness by saying that the sequences are 80% identical. In most cases, that's the evidence that you used to reach the conclusion in the first place.

      When you align any two DNA sequences you'll find that roughly 25% of the base pairs are identical. In that case, it makes no sense to say that each of those "characters" is homologous and the sequences are 25% homologous.

      Once you've decided that the two sequences are homologous that's the end of the story. The sequences are usually genes but if they're not then it has to be a significant stretch of DNA. It makes no sense to examine small regions of that stretch and say that this 10 bp stretch is 90% homologous while that 10 bp stretch is only 60% homologous.

    12. It makes no sense to examine small regions of that stretch and say that this 10 bp stretch is 90% homologous while that 10 bp stretch is only 60% homologous.

      Why not? Would it make sense to say that 90% of the bases in a 10bp stretch are homologous? (Of course, mere identity doesn't equal homology, given that there is homoplasy too.)

    13. Why not?

      Because we have perfectly good ways of saying the same thing without abusing the word "homology." We can say that the genes in two species are homologous and certain segments are more highly conserved than others. We can even say that there's a short segment in the two genes where the sequences are 90% identical in divergent species.

      Why do you think we have to use the word "homology" in this context?

    14. Dumb question. If "random" DNA sequences are 25% identical, are they all likely to share a common ancestor?

    15. Not on those grounds alone, no - it's just that there are 4 bases, so any two drawn at random will be the same 25% of the time.

    16. Why do you think we have to use the word "homology" in this context?

      Well, of course we don't have to do anything. A better question is this: Why are we forbidden to use it? At any rate, the homologies of individual characters and character states are not invalid questions in morphological studies. Why are molecules to be considered different?

    17. Allen Miller: I misunderstood. I didn't realize they were being matched one at a time. But I'm glad I asked the question.

    18. When you align any two DNA sequences you'll find that roughly 25% of the base pairs are identical.

      Are you sure of that? If you have two random sequences of length l, sure we expect 25% identity, with an SD of 43%/l^.5. But even the most simple alignment algorithm will tend to produce greater sequence identity. How much would depend on the precise method used and the length of sequences involved, but aligning a 100BP sequence to a 10k BP sequence assuming no indels gives me about 43% sequence identity for instance. That's quite a bit higher than 25% and would go further up if indels were allowed.
      25% identity is not a very useful baseline. We've been over this when you were arguing that micro RNAs were not generally highly conserved. But for a sequence of 30BP aligning it to a sequence of 100kBP we get about 60% as a baseline, rising to 70% for 1MBP. Detecting homologs requires a high degree of sequence conservation for these short sequences.
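The baseline argument above can be explored with a small simulation. This is only a sketch under stated assumptions: uniformly random sequences, a brute-force ungapped scan (no indels), and a hypothetical seed and helper names chosen for illustration. It is not the method the commenter used, and the exact numbers will vary with the random draw.

```python
import math
import random


def expected_identity_sd(length, p_match=0.25):
    """Standard deviation of fractional identity between two random
    sequences compared at fixed positions (binomial sampling)."""
    return math.sqrt(p_match * (1 - p_match) / length)


def random_sequence(length, rng):
    """Uniformly random DNA string (assumption: equal base frequencies)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def best_ungapped_identity(query, target):
    """Slide the query along the target with no gaps allowed and
    return the highest fraction of matching positions found."""
    best_matches = 0
    for start in range(len(target) - len(query) + 1):
        window = target[start:start + len(query)]
        matches = sum(q == t for q, t in zip(query, window))
        best_matches = max(best_matches, matches)
    return best_matches / len(query)


rng = random.Random(0)  # arbitrary seed, for repeatability only
query = random_sequence(100, rng)
target = random_sequence(10_000, rng)

fixed = best_ungapped_identity(query, target[:100])  # a single fixed placement
best = best_ungapped_identity(query, target)         # best of all placements

print(f"SD of identity at length 100: {expected_identity_sd(100):.4f}")
print(f"identity at one fixed placement: {fixed:.2f}")
print(f"best identity over all placements: {best:.2f}")
```

Because the scan keeps the best of thousands of placements, the reported identity sits well above the 25% expected for a single fixed comparison, which is the point being made about choosing a baseline.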

    19. Are you sure of that?

      No, of course not. Everything you say is true and the problems even apply to amino acid sequences. (Although I would argue that you need to correct identity calculations by subtracting gap penalties.)

      I didn't think it was important to quibble in order to make the point.

    20. At any rate, the homologies of individual characters and character states are not invalid questions in morphological studies. Why are molecules to be considered different?

      It would be pretty silly to say that the wings of a bird and the flippers of a seal are 42% homologous.

      Why are molecules to be considered different?

    21. The reason it's silly to say that wings and flippers are 42% homologous is that we have no objective measure of percent homology, since character scoring is a subjective process. Our judgments would depend on what particular characters we had abstracted from the anatomy. For molecular sequences, on the other hand, scoring (once you've aligned them, that is) is simple and objective.

      Would you consider it odd to say that 42% of the bases in a given sequence are homologous between two taxa?

    22. I'm confused about the non-homologous parts. I get sequence identity. I get reversals. But for a SNP, it still shares an ancestor, your 'site homology'.
      Now indels are different. If a gene or protein were to be called 70% homologous, I would want that to mean that you can align 70% of the sites with the rest being recent indels (since LCA) but I'm still thinking about bits that were in the LCA and deleted in one.
      Do you really use homologous to describe character identity? Should you? I'm scratching my head.

    23. Do you really use homologous to describe character identity?

      Of course you do. To take a gross example, let's consider a character we might call "tetrapod forelimb". Now of course that character is homologous throughout tetrapods. Now consider a few character states, and let's naively code it as "leg" or "wing". "Leg" is of course the ancestral state, and both birds and bats have the derived state "wing". But those states are not homologous.

      It's the same with the bases at any given site, except that we have no hope of telling whether two A's are homologous or homoplasious just by examining them.

    24. Would you consider it odd to say that 42% of the bases in a given sequence are homologous between two taxa?

      Yes, because we have a far better word for it. We can say that the sequences are 42% identical. That's the raw data that leads us to the conclusion that the genes/sequences are homologous.

    25. That doesn't address my question. I get "homologous as limbs" "Not homologous as wings". Wingness was not shared.
      But does anything consider the Aness or Tness of a specific site? It seems severely contrived outside of anything other than artificial algorithmic accountancy.

    26. John Harshman, normally I either agree with you or wish I had agreed with you because I was either wrong or ignorant when I didn't. Here, however, I disagree with you and I think you're wrong. (Yes, I do understand your distinction between characters and character states -- I just don't think it is useful for communication.)

      In a writing class long ago, I learned that if a writer wants to be understood, he has a responsibility to write clearly, not to bitch about readers who fail to understand. There is a sense in which an entire DNA sequence can reasonably be treated as homologous even if some of the bases have mutated and are no longer identical. When discussing sequences in that sense, saying that two non-identical bases are not homologous would just be wrong. Of course, if you switch to a base-centered frame of reference and somewhat redefine "homologous" to mean identical by descent rather than similar because they are descended from a common ancestor, you can reasonably say that non-identical bases are not homologous. However, you can't expect your readers to come along with you on this little mental side track, unless you explain a lot.

      We'd be stuck with these multiple definitions of homologous if that were all we have (think of the chromosome / chromatid mess in meiosis that is guaranteed to confuse students), but we have a way to express what you mean much more clearly (for your audience) if you use percent identity, rather than percent homology.

      Of course, percent homology is a common phrase, but it's a confusing phrase and should simply be discouraged. Forget arguments on fine shades of meaning; percent homology simply does not communicate well.

    27. bwilson: These are not multiple definitions of "homologous"; it's all the same definition, applied to different features. And this is how the term is used by systematists; I didn't invent it.

      Larry: Identity and homology are not the same thing. Some of that identity is homoplasy, as in my example that started this little argument.

      Roger: The analogy was intended to be crude just to make it understandable. "Wingness" is indeed shared, it just isn't homologous. Now in the example we can easily tell that bats and birds do not have homologous wings. But in thousands of real cases in morphology (and always in molecular sequences) the non-homologous states look similar enough -- or in fact identical -- that the only reason they're known as homoplasious is after the fact, i.e. because that character doesn't match the tree. This is a routine statement in morphological systematics, and the molecular case differs in no significant way.

  3. Gun held sideways?! Oh that's a kill shot right there! (RIP Dr. Dan Deline)

  4. OMG, how did I not know about this web site? Well, looks like I have another reason to get less stuff done!

  5. Oh, those silly semantics warriors - they sure get tiring after a while.

    And the funny thing is, in most cases they are dead wrong when they object to the use of "homology" because high similarity almost invariably means homology.

  6. So, when RecA (or one of its ... er ... homologs such as RAD51) does its ... er ... homology search, we should say something different? This usage has become pretty much embedded in certain areas. Which may grate, but language evolves, even scientific language. Personally, I deplore beginning sentences with 'so', and pointless use of ellipses ...

    1. You do a search for possible HOMOLOGY based on sequence similarity or structural similarity.

    2. You do, yes, but RecA itself is frequently described as doing a 'homology search'!
