Hello, and welcome to Paper of the Day (Po'D): On the epistemological crisis in genomics edition. Today's paper is E. R. Dougherty, "On the epistemological crisis in genomics", Current Genomics, vol. 9, pp. 69-79, 2008. (I have discussed a previous paper by Dougherty and Dalton here.)
From its beginning, Dougherty's article is on the attack, and minces no words:
There is an epistemological crisis in genomics. The rules of the scientific game are not being followed. ... High-throughput technologies such as gene-expression microarrays have [led] to the accumulation of massive amounts of data, orders of magnitude in excess to what has heretofore been conceivable. But the accumulation of data does not constitute science, nor does the [a posteriori] rational analysis of data.
Dougherty moves from ancient to more modern philosophy, highlighting the essential roles that science gives to experiments performed under controlled conditions, to the formulation of knowledge through mathematical models, and to the verification of those models through their prediction of data, not their explanation of data. The following paragraph makes this last point clearer:
Science is not about data fitting. Consider designing a linear classifier .... The result might be good relative to the assembled data; indeed, [it] might even classify the data perfectly. But this linear-classifier model does not constitute a scientific theory unless there is an error rate associated with the line, predicting the error rate on future observations. ... In practice, the error rate of a classifier is estimated via some error-estimation procedure, so that the validity of the model depends upon this procedure. Specifically, the degree to which one knows the classifier error, which quantifies the predictive capacity of the classifier, depends upon the mathematical properties of the estimation procedure. Absent an understanding of those properties, the results are meaningless.
Dougherty provides a nice illustration of how unreliable such error rates can be. Using real microarray data of genes (independent variables) and tumor types (dependent variable), Dougherty builds and tests several classifiers on subsets of the data, and compares their estimated error rates with their "true error rates" (which are themselves estimated, using all of the data). The two appear quite uncorrelated. (A similar example is on Dalton's research webpage.) Dougherty is led to the conclusion that many publications in genomics are "lacking scientific content", and refers to Kant when he remarks, "A good deal of the crisis in genomics turns on a return to 'groping in the dark'."
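To get a feel for this kind of comparison, here is a minimal sketch in Python. It is not Dougherty's experiment: the data are synthetic Gaussians rather than microarray measurements, the classifier is a simple nearest-mean line, the error estimator is leave-one-out cross-validation, and the "true" error is approximated on a large independent sample. All of the names and parameter values (`run_experiment`, `n=30`, `d=5`, `shift=1.0`, and so on) are my own assumptions, chosen only for illustration.

```python
import numpy as np


def sample(rng, n, d, shift):
    """Draw n balanced, labelled points in d dimensions (synthetic data).

    Class 0 ~ N(0, I); class 1 is the same but shifted by `shift`
    along the first coordinate only (the other features are noise).
    """
    y = np.arange(n) % 2
    x = rng.standard_normal((n, d))
    x[:, 0] += shift * y
    return x, y


def fit(x, y):
    """Nearest-mean linear classifier: predict class 1 when x @ w + b > 0."""
    m0 = x[y == 0].mean(axis=0)
    m1 = x[y == 1].mean(axis=0)
    w = m1 - m0
    b = -0.5 * (w @ (m0 + m1))
    return w, b


def error_rate(w, b, x, y):
    """Fraction of points in (x, y) that the line misclassifies."""
    pred = (x @ w + b > 0).astype(int)
    return float(np.mean(pred != y))


def loo_error(x, y):
    """Leave-one-out cross-validation estimate of the error rate."""
    n = len(y)
    mistakes = 0
    for i in range(n):
        keep = np.arange(n) != i          # train on everything except point i
        w, b = fit(x[keep], y[keep])
        mistakes += int((x[i] @ w + b > 0) != y[i])
    return mistakes / n


def run_experiment(trials=200, n=30, d=5, shift=1.0, n_test=5000, seed=0):
    """Repeat: draw a small sample, design a classifier, record both errors."""
    rng = np.random.default_rng(seed)
    # A large independent sample stands in for the unknown true error rate.
    x_test, y_test = sample(rng, n_test, d, shift)
    est = np.empty(trials)
    true = np.empty(trials)
    for t in range(trials):
        x, y = sample(rng, n, d, shift)
        w, b = fit(x, y)
        est[t] = loo_error(x, y)                      # what we would report
        true[t] = error_rate(w, b, x_test, y_test)    # what we would get
    return est, true


if __name__ == "__main__":
    est, true = run_experiment()
    print("correlation(estimated, true):", np.corrcoef(est, true)[0, 1])
    print("spread of estimated error:   ", est.std())
    print("spread of true error:        ", true.std())
```

Plotting `est` against `true` for small `n` gives a rough analogue of Dougherty's scatter plot. How weak the relationship looks will depend on the sample size, the dimension, and the estimator, so the numbers this prints should be read as an illustration of the procedure, not as a reproduction of the paper's result.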
Since publication, this article appears to have been referenced only 31 times, 19 of which are not from Dougherty and/or Dalton. I look forward to seeing how it has been received in those papers, and whether its lessons have been taken into practice. It looks like I will be reading a lot more bioinformatics research.