Science addressed the problem of "How to (seriously) read a scientific paper" by asking a group of Ph.D. students, post-docs, and scientists how they read the scientific literature. None of the answers will surprise you. The general theme is that you read the abstract to see if the work is relevant, then skim the figures and the conclusions before buckling down to slog through the entire paper.
None of the respondents address the most serious problems such as trying to figure out what the researchers actually did while not having a clue how they did it. Nor do they address the serious issue of misleading conclusions and faulty logic.
I asked on Facebook whether we could teach undergraduates to read the primary scientific literature. I'm skeptical since I believe it takes a great deal of experience to be able to profitably read recent scientific papers and it takes a great deal of knowledge of fundamental concepts and principles. We know from experience that many professional scientists can be taken in by papers that are published in the scientific literature. Arseniclife is one example and the ENCODE papers published in September 2012 are another. If professional scientists can be fooled, how are we going to teach undergraduates to be skeptical?
I've addressed this issue before. Back in 2013 I wrote about a Nature paper that looked at promoter sites in the human genome. The authors concluded that there may be 500,000 active promoters that are probably "functional and specific" [Transcription Initiation Sites: Do You Think This Is Reasonable?].
This conclusion is almost certainly wrong but there are probably only a handful of scientists in the entire world who can understand the science in this paper and figure out what went wrong. This is a problem. I know something about this subject but I have no idea what the scientists did. The work is completely opaque to most scientists. This paper was subsequently retracted! [Transcription Initiation Sites: Do You Think This Is Reasonable? (revisited)]
A few months later I looked at another Nature paper on transcription. This one describes an attempt to identify the relationship between variation in gene expression levels and genetic differences in mouse strains. It was extremely difficult to understand the paper and, as it turns out, I didn't succeed. Several Sandwalk readers pointed out differing interpretations of the data and the experiments. There was general agreement that the paper was badly written [see: Do you understand this Nature paper on transcription factor binding in different mouse strains?].
So, I decided to re-visit this problem by opening up the latest issue of Nature to see if I could learn anything from reading the primary literature.
Just by chance, there happened to be a paper on the effect of genetic variation in mouse strains on gene expression. This is the same topic I addressed in late 2013. Here's the recent paper with its abstract ...
Chick, J. M., Munger, S. C., Simecek, P., Huttlin, E. L., Choi, K., Gatti, D. M., Raghupathy, N., Svenson, K. L., Churchill, G. A., and Gygi, S. P. (2016) Defining the consequences of genetic variation on a proteome-wide scale. Nature 534:500-505. [doi: 10.1038/nature18270]
Genetic variation modulates protein expression through both transcriptional and post-transcriptional mechanisms. To characterize the consequences of natural genetic diversity on the proteome, here we combine a multiplexed, mass spectrometry-based method for protein quantification with an emerging outbred mouse model containing extensive genetic variation from eight inbred founder strains. By measuring genome-wide transcript and protein expression in livers from 192 Diversity outbred mice, we identify 2,866 protein quantitative trait loci (pQTL) with twice as many local as distant genetic variants. These data support distinct transcriptional and post-transcriptional models underlying the observed pQTL effects. Using a sensitive approach to mediation analysis, we often identified a second protein or transcript as the causal mediator of distant pQTL. Our analysis reveals an extensive network of direct protein–protein interactions. Finally, we show that local genotype can provide accurate predictions of protein abundance in an independent cohort of collaborative cross mice.
The first thing I noticed was that the 2013 paper wasn't in the reference list! That's because it was retracted.
The second thing I noticed was that the abstract didn't really tell me anything about the conclusions. How many genes are regulated differently in the various mouse strains because of genetic differences between the strains? How many of those genetic differences are directly due to changes in promoters and enhancers? How do the authors tell the difference between stochastic variation and variation due to sequence differences in the genomes?
The third thing I noticed was the opening paragraph of the introduction.
Regulation of protein abundance is vital to cellular functions and environmental response. According to the central dogma, [ref: Crick, 1970] the coding sequence of DNA is transcribed into mRNA (transcript), which in turn is translated into protein. Although rates of transcription, translation and degradation of both transcript and protein vary, under this simplest model of regulation, the cellular pool of a protein is determined by the abundance of its corresponding transcript. Genetic or environmental perturbations that alter transcription would directly affect protein abundance. In reality, many layers of regulation intervene in this process, and numerous studies have been carried out to determine whether and to what extent transcript abundance is a predictor of protein abundance [2, 3, 4, 5, 6]. Several studies have reported that there is generally a low correlation between the two. An emerging consensus is that much of the protein constituent of the cell is buffered against transcriptional variation [4, 7], but a global perspective of protein buffering has not been put forward.
Once I realized that the authors had not read the very first paper they referenced, I figured this was not going to be a good paper. Nevertheless, I persevered because I'm very interested in the problem.
I started to lose interest on the second page when I read ...
We identified 2,866 pQTL for 2,552 distinct proteins at a genome-wide significance level of P < 0.1 (Fig. 2a). This is the largest set of pQTL identified so far, with tenfold greater numbers than other mass spectrometry (MS)-based approaches. Significant local pQTL were more common than distant pQTL (1,736 local and 1,130 distant pQTL) (Extended Data Fig. 3g). In addition, we identified 4,188 significant eQTL among 3,706 genes, with threefold more local than distant associations at the transcript level (3,211 local and 977 distant eQTL; Fig. 2a, Extended Data Fig. 3h, i). Finally, to examine the replication rate, we analysed a replication set of 192 separate DO mice treated under identical conditions for eQTL (see Methods and Extended Data Fig. 4). To determine whether the same genetic loci acted on transcript and protein abundance, we first compared the QTL maps. We observed a significant overlap of proteins with pQTL and eQTL (n = 1,400; hypergeometric P < 1 × 10^−16; Fig. 2a). As expected, genes with concordant QTL had generally higher correlations between protein and transcript abundance compared to those having only pQTL, only eQTL or neither (Fig. 2b). Among local QTL only, we observed a high degree of overlap with 80% of local pQTL having a corresponding local eQTL. The small number of local pQTL that lack corresponding eQTL (n = 344) could result from genetic variation that regulated protein abundance via post-transcriptional mechanisms such as coding variation that affected protein stability without altering transcript levels. In contrast, distant genetic variants that affected both transcript and protein levels seem to be nearly mutually exclusive (Fig. 2a). This observation leads to the intriguing hypothesis that most distant pQTL affected the abundance of a target protein via post-transcriptional mechanism(s).
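Even without decoding the jargon, the counts in that passage can be checked for internal consistency. Here's a minimal sketch in Python that uses only the numbers quoted above. The variable and function names are my own, and it says nothing about what "local" and "distant" actually mean.

```python
# Counts copied from the quoted passage; all names here are my own labels.
pqtl_total, pqtl_local, pqtl_distant = 2866, 1736, 1130
eqtl_total, eqtl_local, eqtl_distant = 4188, 3211, 977
local_pqtl_without_eqtl = 344


def qtl_summary():
    """Return simple consistency checks and ratios for the quoted counts."""
    return {
        # Do the local and distant counts add up to the stated totals?
        "pqtl_sum_ok": pqtl_local + pqtl_distant == pqtl_total,
        "eqtl_sum_ok": eqtl_local + eqtl_distant == eqtl_total,
        # How many local QTL are there per distant QTL?
        "pqtl_local_to_distant": pqtl_local / pqtl_distant,
        "eqtl_local_to_distant": eqtl_local / eqtl_distant,
        # Fraction of local pQTL that have a corresponding local eQTL.
        "local_pqtl_with_eqtl_fraction": (
            (pqtl_local - local_pqtl_without_eqtl) / pqtl_local
        ),
    }


if __name__ == "__main__":
    for name, value in qtl_summary().items():
        shown = round(value, 2) if isinstance(value, float) else value
        print(f"{name}: {shown}")
```

The totals add up, the "80%" follows from the 344, and "threefold" is a fair description of the eQTL ratio. The local-to-distant ratio for pQTL comes out at roughly 1.5, which is worth comparing with the "twice as many" in the abstract, unless the abstract is counting something different.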
I suppose I could figure out what they mean if I were willing to download the supplemental information and spend a good deal of time trying to learn the jargon. (What's the difference between "local" and "distant" eQTL?)
It's not worth the effort.
Here's the conclusion.
This study quantified both protein and transcript abundance in a genetically diverse population of mice, mapping their genetic architecture. We identified the largest catalogue of pQTL so far, which can be attributed to two variables in our experimental design. First, we have improved the accuracy and sensitivity of quantification for both protein and transcript abundance. Second, our experimental population captured genetic diversity far in excess of the human population and standard laboratory mouse strains. Earlier studies reported a disconnect between transcript and protein abundance [2, 3, 6], which has also been a conclusion drawn from several recent eQTL–pQTL analyses [4, 7, 17, 35]. Data here show that local QTL tend to abide by the central dogma as demonstrated by concordant effects on transcripts and proteins, whereas distant pQTL are conferred by post-transcriptional mechanisms. Our mediation analysis provided the ability to identify causal protein intermediates underlying distant pQTL and led to the identification of hundreds of protein–protein associations. Our experimental design provides an advantage over protein interaction maps because genetic mapping is not dependent on physical interactions. This conclusion is further exemplified by the co-regulation of protein complexes or biochemical pathways in this study. Stoichiometric buffering provides one explanation for co-regulation of protein complexes and may account for earlier observations that protein abundances (but not transcript abundances) of orthologues are well-conserved across large evolutionary distances [36, 37].
These findings suggest a new predictive genomics framework in which quantitative proteomics and transcriptomics are combined in the analysis of a discovery population like the DO to identify genetic interactions. Next, pathways relevant to the tissue/physiological phenotype of interest are intersected with the list of significant pQTL. Pathways enriched for proteins with significant pQTL should be amenable to manipulation in the founder and CC strains. That is, the founder allele effects inferred at the pQTL can be combined in such a way via crosses of CC strains to tune pathway output. Moreover, as we better understand the types of mutation that can affect protein abundance, we can introduce specific mutations with gene editing into sensitized or robust genetic backgrounds. We foresee this strategy being used to design reproducible rodent models that span a range of human-relevant phenotypes, for example, in drug metabolism or toxicology studies.
In my opinion, the scientific literature is becoming unreachable for most scientists. How many people interested in science can read this paper and understand it, let alone evaluate it? If you are one, then please let me know in the comments.
I can't imagine how any undergraduate could profit by reading this paper without a great deal of help. If the teacher really understands what was done, wouldn't it be far easier to just explain the result to the students?
I blame the journals for this situation. Maybe it's only a problem in genomics and proteomics but even if it's confined to those disciplines, something has to be done.
I can't read the primary scientific literature any more because a lot of it is incomprehensible. Most of the rest is just wrong or misleading.
Image Credits: The first photo is from: Improve Your Reading of Scientific Papers. The second is from: Scientific papers, civil disobedience and personal networks.