October 2014 Archives

Beginning from my research in music machine listening, I have become more and more aware of applications of machine learning to cultural products, and the pitfalls that accompany such work. I previously critiqued a study applying clustering of image features to photographs of paintings by different artists. Here is a new one: clustering of Shakespeare's plays into genres by word frequencies. (This work is published in: S. Allison, R. Heuser, M. Jockers, F. Moretti and M. Witmore, "Quantitative Formalism: an Experiment", Pamphlets of the Stanford Literary Lab, Jan. 2011.)

On its face, this seems reasonable. As Allison et al. comment, certain words are closely associated with genres, like "castle" with "gothic". However, they discover they are able to automatically and correctly cluster Shakespeare's plays by using frequencies of only 37 words:

"a", "and", "as", "be", "but", "for", "have", "he", "him", "his", "i", "in", "is", "it", "me", "my", "not", "of", "p_apos", "p_colon", "p_comma", "p_exlam", "p_hyphen", "p_period", "p_ques", "p_semi", "so", "that", "the", "this", "thou", "to", "what", "will", "with", "you", "your"

At this point, it is reasonable to pause before claiming that the clustering, correct though it may be, results from recognizing genre. Accepting that conclusion means accepting the words above and their frequencies as the mysterious ingredients that separate "tragedy" from "comedy". Unfortunately, it appears Allison et al. accept just that, calling these word frequency features the observable tips of the "icebergs" that are genres.
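To make concrete what kind of feature is doing the work here, the following is a minimal sketch of turning a text into relative frequencies of a few of the listed features and comparing two texts by distance. It is not the procedure of Allison et al.: the tokenizer, the mapping from the "p_*" names to punctuation marks, the use of Euclidean distance, and the two toy strings are all my assumptions for illustration.

```python
import math
import re
from collections import Counter

# Assumed mapping from the "p_*" feature names to punctuation marks.
# The names come from the list above; the mapping itself is a guess.
PUNCT = {"p_apos": "'", "p_colon": ":", "p_comma": ",", "p_exlam": "!",
         "p_hyphen": "-", "p_period": ".", "p_ques": "?", "p_semi": ";"}

def feature_vector(text, features):
    """Relative frequency of each feature among all tokens of the text."""
    # Crude tokenizer (an assumption): runs of letters, or single
    # non-letter, non-space characters such as punctuation marks.
    tokens = re.findall(r"[a-z]+|[^\sa-z]", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    return [counts[PUNCT.get(f, f)] / total for f in features]

# A handful of the 37 features, chosen arbitrarily for the sketch.
features = ["thou", "the", "and", "p_comma", "p_exlam", "p_ques"]

# Toy strings invented for illustration, not lines from any play.
text_a = "Thou art the one, and thou art gone!"
text_b = "What is the hour, and where is the king?"

vec_a = feature_vector(text_a, features)
vec_b = feature_vector(text_b, features)

# Clustering methods typically operate on distances like this one.
print(math.dist(vec_a, vec_b))
```

Nothing in such a vector refers to plot, character, or ending, which is why attributing the resulting clusters to "genre" needs more justification than the clusters themselves.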

Hello, and welcome to Paper of the Day (Po'D): On the epistemological crisis in genomics edition. Today's paper is E. R. Dougherty, "On the epistemological crisis in genomics", Current Genomics, vol. 9, pp. 69-79, 2008. (I have discussed a previous paper by Dougherty and Dalton here.)

From its beginning, Dougherty's article is on the attack, and minces no words:


There is an epistemological crisis in genomics. The rules of the scientific game are not being followed. ... High-throughput technologies such as gene-expression microarrays have [led] to the accumulation of massive amounts of data, orders of magnitude in excess to what has heretofore been conceivable. But the accumulation of data does not constitute science, nor does the [a posteriori] rational analysis of data.

Dougherty moves from the ancient to more modern philosophy, highlighting the essential roles in Science played by experiments performed with controlled conditions, the formulation of knowledge through mathematics (models), and the necessity of verification of models through their prediction of data, not their explanation of data. The following paragraph makes this latter quality clearer:

Science is not about data fitting. Consider designing a linear classifier .... The result might be good relative to the assembled data; indeed, [it] might even classify the data perfectly. But this linear-classifier model does not constitute a scientific theory unless there is an error rate associated with the line, predicting the error rate on future observations. ... In practice, the error rate of a classifier is estimated via some error-estimation procedure, so that the validity of the model depends upon this procedure. Specifically, the degree to which one knows the classifier error, which quantifies the predictive capacity of the classifier, depends upon the mathematical properties of the estimation procedure. Absent an understanding of those properties, the results are meaningless.

Dougherty provides a nice illustration of how unreliable such error rates can be. Using real microarray data of genes (independent variables) and tumor types (dependent variable), Dougherty builds and tests several classifiers on subsets of the data, and compares their estimated error rates with their "true error rates" (which are estimated using all of the data). The two appear quite uncorrelated. (A similar example is on Dalton's research webpage.) Dougherty is led to the conclusion that many publications in genomics are "lacking scientific content", and refers to Kant when he remarks, "A good deal of the crisis in genomics turns on a return to 'groping in the dark'."
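The gap between an estimated error rate and the true one can be explored with a small simulation. The sketch below is not Dougherty's experiment: it assumes synthetic one-dimensional data from two unit-variance Gaussians with equal priors, a threshold classifier placed halfway between the class means, and the resubstitution estimate (error measured on the training data itself). Under that assumed model the true error of each trained classifier can be computed exactly, so the two quantities can be compared over many small training sets. All function names and parameter values are mine.

```python
import random
import statistics
from math import erf, sqrt

def normal_cdf(x, mu=0.0):
    """CDF of a unit-variance Gaussian with mean mu."""
    return 0.5 * (1.0 + erf((x - mu) / sqrt(2.0)))

def train_threshold(xs0, xs1):
    """1-D linear classifier: threshold halfway between the class means."""
    return (statistics.mean(xs0) + statistics.mean(xs1)) / 2.0

def empirical_error(t, xs0, xs1):
    """Fraction misclassified when predicting class 1 for x > t."""
    wrong = sum(x > t for x in xs0) + sum(x <= t for x in xs1)
    return wrong / (len(xs0) + len(xs1))

def true_error(t, mu0, mu1):
    """Exact error of threshold t under the assumed Gaussian model."""
    return 0.5 * (1.0 - normal_cdf(t, mu0)) + 0.5 * normal_cdf(t, mu1)

def pearson(a, b):
    """Pearson correlation; NaN if either sequence is constant."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb) if va > 0 and vb > 0 else float("nan")

def experiment(n_per_class=10, trials=200, mu0=0.0, mu1=1.0, seed=0):
    """Return (estimated error, true error) for many small training sets."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(trials):
        xs0 = [rng.gauss(mu0, 1.0) for _ in range(n_per_class)]
        xs1 = [rng.gauss(mu1, 1.0) for _ in range(n_per_class)]
        t = train_threshold(xs0, xs1)
        pairs.append((empirical_error(t, xs0, xs1), true_error(t, mu0, mu1)))
    return pairs

pairs = experiment()
est_errs = [e for e, _ in pairs]
true_errs = [t for _, t in pairs]
print("mean estimated error:", statistics.mean(est_errs))
print("mean true error:     ", statistics.mean(true_errs))
print("correlation:         ", pearson(est_errs, true_errs))
```

Varying `n_per_class` shows how the relationship between the two columns changes with sample size. What this toy setting shows need not carry over to microarray data, where samples are few and features number in the thousands.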

Since publication, this article appears to have been cited only 31 times, 19 of which are not by Dougherty and/or Dalton. I look forward to seeing how those papers have received it, and how its lessons have been put into practice. Looks like I will be reading a lot more bioinformatics research.
