A massive amount of data on complex genomes has been published, especially on the human genome. The next step is to decide what this data means. Here are the most important questions from my perspective.
- We have a pretty good idea of the number of protein-coding genes (~20,000) but we don't know how many genes specify functional RNAs. Is it 5,000, 50,000, or more? What we do know from big data science (like ENCODE) is that almost all parts of the human genome are transcribed in some cells at some time. What we don't need is more experiments documenting pervasive transcription. It's time to do the hard work and figure out just how many of these transcripts have a biological function.
- Big data science has demonstrated that various transcription factors bind all over the genome. This is pretty much what you'd expect for spurious binding given the small size of the binding sites. Now we need to find out which of these binding sites are spurious and nonfunctional and which ones actually play a role in gene expression. That means getting into the lab and looking carefully at specific examples. What we don't need are more big data experiments covering every known transcription factor binding site in every known tissue at various stages of development. We know for a fact that most of this data will be useless in terms of recognizing biological function. It's time to find out how much of the data we already have is going to be useful.
- Same for alternative splicing. How much of it represents real biologically functional alternative transcripts and how much is due to splicing accidents? That's the important question. We don't need more data until that issue is resolved.
- There are many different markers that identify open and closed chromatin regions in the genome. They include DNA methylation sites and various histone modifications. Lots of these have been mapped in different tissues. What does it mean? There are millions of them. Do they all represent regions of the genome that have a biological function or are most of them just spurious sites? We have enough already. We don't need more data, especially if it's not telling us anything useful.
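The point above about spurious transcription factor binding can be made concrete with a back-of-the-envelope calculation. The sketch below is only an illustration under loud assumptions: a haploid genome of about 3.2 billion base pairs, all four bases equally likely and independent, and exact matches to a single fixed motif. Real genomes and real binding sites are messier (biased base composition, degenerate motifs, chromatin accessibility), and counting both strands overstates the total for palindromic motifs. The function name and its parameters are hypothetical, chosen just for this example.

```python
def expected_chance_matches(motif_length, genome_size=3_200_000_000, both_strands=True):
    """Expected number of exact matches to one fixed motif in a random sequence.

    Assumes uniform, independent bases (probability 1/4 each), which is a
    simplification and not a model of any real genome.
    """
    positions = genome_size - motif_length + 1   # possible start positions
    per_position = 0.25 ** motif_length          # chance of a match at one position
    strands = 2 if both_strands else 1           # a motif can match either strand
    return strands * positions * per_position


if __name__ == "__main__":
    # Short motifs are expected to occur by chance a very large number of times.
    for k in (6, 8, 10, 12):
        print(f"{k:>2} bp motif: ~{expected_chance_matches(k):,.0f} chance matches")
```

Under these assumptions a 6 bp motif is expected to turn up more than a million times by chance alone, and the count drops by a factor of four for each extra base pair. That is the sense in which binding "all over the genome" is what you'd expect even if most of the sites do nothing.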
PLOS Biology also wondered about the future of genetics and genomics [Where Next for Genetics and Genomics?].
"The last few decades have utterly transformed genetics and genomics, but what might the next ten years bring? PLOS Biology asked eight leaders spanning a range of related areas to give us their predictions. Without exception, the predictions are for more data on a massive scale and of more diverse types. All are optimistic and predict enormous positive impact on scientific understanding, while a recurring theme is the benefit of such data for the transformation and personalization of medicine. Several also point out that the biggest changes will very likely be those that we don’t foresee, even now."

That's a remarkable contrast with my view of what needs to be done. I suspect that these eight leaders are going to be better predictors of the future than I am. I suspect that we're just going to see more of the same for the next few years so that by 2020 we'll be no further ahead than we are now and none of my questions will be answered.
The phenomenon is familiar. When you have a big hammer, everything looks like a nail. Most of the genomics workers are unfamiliar with the nitty-gritty of ordinary biochemistry and molecular biology, but that's exactly what we need right now.