August 2013 Archives

D. J. Hand, "Deconstructing statistical questions," J. Royal Statistical Society A (Statistics in Society), vol. 157, no. 3, pp. 317-356, 1994.

This is a remarkable paper, addressing "errors of the third kind": applying a statistical tool to correctly answer the wrong question. This type of error can occur when a research question is not defined in sufficient detail, or worse, when a tool is used simply because it is convenient, and/or gives the result desired. Hand gives many illustrative examples of how things can go very wrong from the beginning, and argues that before proceeding to apply the numerous statistical tools available in software packages today, we all must "deconstruct" with care the scientific and relevant statistical questions that we actually seek to answer.

At the end of the article, there are 24 (mostly) laudatory responses to it --- including one by John Tukey. These read like well-thought-out comments on reddit, and provide revealing looks at the actual practice of statistics in science, and the practice of science with statistics. One in particular strikes me, because it is about the practice of science with statistics in academia. Donald Preece begins:

Professor Hand speaks of the questions that the researcher wishes to consider. These are often three in number:
  1. How do I obtain a statistically significant result?
  2. How do I get my paper published?
  3. When will I get promoted?
So Professor Hand's suggestions must be supplemented by a recognition of the corruptibility and corruption of the scientific research process. Nor can we overlook the constraints imposed by inevitable limitation of the resources. Needing further financial support, many researchers ask merely 'How do I get results?', meaning by 'results', not answers to questions, but things that are publishable in glossy reports.
This, in particular, hit home, especially after I accidentally read E. R. Dougherty and L. A. Dalton, "Scientific knowledge is possible with small-sample classification," EURASIP J. Bioinformatics and Systems Biology, vol. 10, 2013. In their recent article, Dougherty and Dalton pull no punches:

Since scientific validity depends on the predictive capacity of a model, while an appropriate classification rule is certainly beneficial to classifier design, epistemologically, the error rate is paramount. ... [A]ny paper that applies an error estimation rule without providing a performance characterization relevant to the data at hand is scientifically vacuous. Given the near universality of vacuous small-sample classification papers in the literature [where error is not estimated], one could easily reach the conclusion that scientific knowledge is impossible in small-sample settings. Of course, this would beg the question of why people are writing vacuous papers and why journals are publishing them.
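To make the point about performance characterization concrete, here is a minimal sketch (my own illustration, not from either paper) of how much a cross-validated error estimate can swing across small samples drawn from one and the same two-class problem; the dimension, sample size, and class separation below are arbitrary assumptions for illustration.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(0)
    estimates = []
    for trial in range(200):
        # hypothetical small-sample problem: two Gaussian classes,
        # 10 samples per class, 10 features
        n, d = 10, 10
        X = np.vstack([rng.normal(0.0, 1.0, (n, d)),
                       rng.normal(0.5, 1.0, (n, d))])
        y = np.repeat([0, 1], n)
        # 5-fold cross-validated error estimate for a simple classifier
        err = 1.0 - cross_val_score(GaussianNB(), X, y, cv=5).mean()
        estimates.append(err)

    print("error estimates over 200 draws: mean %.2f, std %.2f, range [%.2f, %.2f]"
          % (np.mean(estimates), np.std(estimates),
             np.min(estimates), np.max(estimates)))

An error estimate reported for a single small sample, without any characterization of how such estimates behave, tells the reader very little.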

A mean lesson about the mean

Taking the mean of a sample seems uncontroversial; however, it can be the wrong thing to do. Consider this neat example from D. J. Hand, "Deconstructing statistical questions," J. Royal Statistical Society A (Statistics in Society), vol. 157, no. 3, pp. 317-356, 1994.

An English researcher and a French researcher both test two cars of each of two types to determine which type is more fuel efficient. One researcher measures miles per gallon, and the other gallons per mile. The following data are collected:

[Table: the fuel consumption measured by each researcher for each car.]

The English researcher finds the average miles per gallon of type 1 cars is greater than that of type 2, so they conclude type 1 is more fuel efficient. However, the French researcher finds the average gallons per mile of type 2 cars is less than that of type 1, so they conclude type 2 is more fuel efficient.
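Hand's actual numbers are not reproduced above, but the effect is easy to recreate. Here is a minimal sketch with hypothetical figures (my own, not those from the paper) showing how the average of miles per gallon and the average of gallons per mile can rank the two types differently.

    # Hypothetical fuel data (not the figures from Hand's table):
    # two cars of each type, measured in miles per gallon.
    mpg = {"type 1": [10.0, 50.0], "type 2": [25.0, 25.0]}

    def mean(xs):
        return sum(xs) / len(xs)

    for car_type, values in mpg.items():
        gpm = [1.0 / v for v in values]   # the French researcher's measurement
        print(car_type,
              "mean mpg = %.2f," % mean(values),
              "mean gpm = %.4f" % mean(gpm))

    # type 1 mean mpg = 30.00, mean gpm = 0.0600
    # type 2 mean mpg = 25.00, mean gpm = 0.0400
    # The mean of mpg favours type 1; the mean of gpm favours type 2.
    # The arithmetic mean does not commute with taking reciprocals.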

Who is right??
Hello, and welcome to the Paper of the Day (Po'D): Revisiting Inter-Genre Similarity Edition. Some work from my visits to Portugal earlier this year has finally been given the green light: B. L. Sturm and F. Gouyon, "Revisiting Inter-Genre Similarity", IEEE Signal Processing Letters, 2013 (accepted). My one-line description of this work is:

Be wary of an idea that sounds good and intuitive until analysis shows it to be good.
This paper addresses a former Po'D: Automatic classification of musical genres using inter-genre similarity edition. Our attempts at reproducing the results in that work are here, here, and here. After finding that our results were nowhere near those published, we sought answers through analysis. That is where this paper begins.

In short, we show that while the idea proposed in the original publication sounds good and intuitive, it is plainly not a good idea. (This is, I think, a great example of how intuition can seriously lead one astray.) Once we put the inter-genre similarity approach in the context of naive Bayesian classification, it becomes clear why it cannot be superior to that much simpler approach. We add some empirical experiments to drive home this point. We make available the code to reproduce all figures in our paper exactly.
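For readers who want a concrete picture of the "much simpler approach", here is a minimal sketch of plain Gaussian naive Bayes classification; the synthetic features below are a stand-in assumption, not the genre data or the code released with the paper.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(1)

    # Synthetic stand-in for genre features: three classes, Gaussian feature clouds.
    class_means = [0.0, 1.0, 2.0]
    X_train = np.vstack([rng.normal(m, 1.0, (100, 5)) for m in class_means])
    y_train = np.repeat([0, 1, 2], 100)
    X_test = np.vstack([rng.normal(m, 1.0, (50, 5)) for m in class_means])
    y_test = np.repeat([0, 1, 2], 50)

    # Plain naive Bayes: model each class with independent Gaussian features,
    # then assign each test point to the class with the highest posterior.
    clf = GaussianNB().fit(X_train, y_train)
    print("naive Bayes accuracy on the synthetic data: %.2f" % clf.score(X_test, y_test))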

In fact, it appears that the reviewers put a lot of weight on the pains we took to make our paper reproducible. A few of the reviewers actually dug into some of it to experiment with different parameters. Here are a few of their comments relevant to this point.

Some of the previous reviews [the first version was rejected, and the comment is about our revision and comments to the previous reviews] expressed surprise at the big discrepancy between the results obtained in this paper and the original results and believe that there might be some issue with the implementation. In a case like that I think the reproducible implementation is the one that should be taken seriously.
Overall I think there is a big emphasis on novelty and new results in engineering but reproducibility and repetition of experiment are a central foundation of good science and engineering and papers like this should be encouraged rather than discouraged.
A big pro of the paper at hand is that the authors foster reproducibility and even make available their source code. This has already been mentioned by the reviewers, but I would like to highlight and appreciate it again. This is really great practice of good science, but unfortunately not always seen in the signal processing and music-IR domains, unlike in other domains.
