# January 2014 Archives

## One solution to the wine puzzle

I posed this puzzle a few days ago, partly as an exercise in experimental design.

What is different between the two test conditions?

One thing that is different is that the bottle shape is visible in the first condition, but not visible in the second. Even though we carefully covered the label to conceal its identity, we were naive to the fact that the bottle shape can carry information about the region from which the wine comes. The following picture shows the variety of shapes.

From the left, bottles of this shape usually come from Bordeaux; the next bottle shape from Burgundy; the next from Rhône; the next from Champagne; and the next from Côtes de Provence. In our dataset then, the bottle shape is confounded with its region. We thus inadvertently trained our experts not to recognize the region from which the wine comes based on the contents of the bottles, but instead on the shape of its container. Hence, when Team Småg can no longer see the bottle, they must guess.

I came up with this example to show how easy it can be to believe that one's learning experiment is valid, and that the figure of merit reflects real-world performance, but that is actually invalid due to an independent variable of which the experimenter is unaware. This fact is directly discussed as limitations by Ava Chase in her article, "Music descriminations by carp (Cyprinus carpio)":

Even a convincing demonstration of categorization can fail to identify the stimulus features that exert control at any given time, especially if the stimuli are complex. In particular, there can be uncertainty as to whether classification behavior had been under the stimulus control of the features in terms of which the experimenter had defined the categories or whether the subjects had discovered an effective discriminant of which the experimenter was unaware. The diversity of S+/S- pairings presumably rules out the possibility that the fish could have been relying on only a single discriminant, such as timbre, but a constant concern is the possible existence of a simple attribute that would have allowed the subjects merely to discriminate instead of categorizing. (emphasis mine)

## Faults in the Ballroom dataset

The Ballroom dataset (BD) was created around 2004 by F. Gouyon et al. for use in a comparative evaluation of particular features for music genre classification. BD is described in the paper, F. Gouyon, S. Dixon, E. Pampalk, and G. Widmer, "Evaluating rhythmic descriptors for musical genre classification", in Proc. Int. Conf. Audio Eng. Society, June 2004.

[BD] contains excerpts from 698 pieces of music, around 30 seconds long. The audio quality of this data is quite low, it was originally fetched in real audio format [from http://www.ballroomdancers.com], with a compression factor of almost 22 with respect to the common 44.1 kHz 16 bits mono WAV format. ... The data covers eight musical sub-genres of ballroom dance music: [Jive, Quickstep, Tango, Waltz, Viennese Waltz, Samba, Cha Cha Cha and Rumba].

## Paper of the Day (Po'D): Cross-validation and bootstrap edition

Hello, and welcome to Paper of the Day (Po'D): Cross-validation and bootstrap edition. Today's paper is: R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection," in Proc. Int. Joint Conf. Artificial Intell., 1995.

Let's first get some things straight:
• We have a space of instances $$\mathcal{V}$$, many of which are unlabeled.
• We have a set of possible labels, $$\mathcal{Y}$$.
• We have a dataset of $$n$$ labeled instances from $$\mathcal{V}$$: $$\mathcal{D} = \{x_1, \ldots, x_n\}$$, where $$x_i = \{v_i \in \mathcal{V}, y_i \in \mathcal{Y}\}$$.
• We have an "inducer" $$\mathcal{I}$$, which takes $$\mathcal{D}$$ and creates a classifier $$\mathcal{C}$$.
• Hence, $$\mathcal{I}(\mathcal{D},v)$$ is the label predicted by the classifier for a $$v\in\mathcal{V}$$,
As Kohavi says, "Given a finite dataset, we would like to estimate the future performance of a classifier induced by the given inducer and dataset." Or, more specifically and with less redundancy: Given a finite $$\mathcal{D}$$, we want to estimate the accuracy of a classifier induced by $$\mathcal{I}$$. Kohavi defines the accuracy of $$\mathcal{C}$$ as "the probability of correctly classifying a randomly selected instance."

Kohavi considers several options here for estimating accuracy. First, there is the "holdout method." Take some number of elements of $$\mathcal{D}$$ for testing, and use the rest --- call this $$\mathcal{D}_t$$ --- for training. Then, accuracy is estimated $$\frac{1}{|\mathcal{D}\backslash\mathcal{D}_t|} \sum_{x_i \in \mathcal{D}\backslash\mathcal{D}_t} \delta(\mathcal{I}(\mathcal{D}_t,v_i),y_i)$$ where $$\delta(i,j) = 1$$ if $$i = j$$, and zero else. Then there is the "random subsampling method", which repeats "holdout" several times and averages the resulting estimates.

Then, there is the "cross-validation method". This is done in the same way as the "holdout method", except $$\mathcal{D}$$ is partitioned into a number of $$k$$ non-overlapping folds, and each fold is used as a testing dataset while the remainder is used for training. The accuracy is then estimated as the mean of the $$k$$ test accuracies.

Then, there is the "bootstrap method". Define a training dataset $$\mathcal{D}_t$$ as $$n$$ labeled instances picked with replacement from $$\mathcal{D}$$. The remaining instances are then used for testing. The 0.632 bootstrap estimate of accuracy is $$0.368 \textrm{acc}_s + 0.632 \frac{1}{|\mathcal{D}\backslash\mathcal{D}_t|} \sum_{x_i \in \mathcal{D}\backslash\mathcal{D}_t} \delta(\mathcal{I}(\mathcal{D}_t,v_i),y_i).$$ Kohavi seeks to determine how these approaches fare. His methodology is as follows. Take two inducers: C4.5 and naive Bayes. Take six US Irvine datasets, and create a random dataset with two classes. The datasets are selected based on having at least 500 instances, and for which "the learning curve for both algorithms did not flatten out ... before 100 instances." (I assume this means that the ordered set of classifiers induced from either algorithm using a incrementally growing training dataset show accuracies on the same training datasets that increase.) Then, estimate the "true accuracies of a real dataset" using the "random subsampling method". For instance, for the dataset "Breast cancer", 50 of the 699 instances are selected at random as the test set. The remainder is used for training. Then, a classifier is induced by one of the inducers and that training dataset, and its accuracy is computed. This is repeated 500 times, and the mean accuracy is computed. This is called the true accuracy.

Now, verbatim:
"To see how well an accuracy estimation method performs, we sampled instances from the dataset (uniformly without replacement), and created a training set of the desired size. We then ran the induction algorithm on the training set and tested the classifier on the rest of the instances of the dataset. This was repeated 50 times at points where the learning curve was sloping up. The same folds in cross-validation and the same samples in bootstrap were used for both algorithms compared."
It seems then that the training datasets are built by adding instances one by one to induce a classifier, looking at its accuracy on that training dataset, and either adding a new random instance if there is not enough gain, or stopping. I don't know what this means for cross-validation and bootstrap.

I will stop there because the missing details in the paper and its inconsistencies have thoroughly confused me, and I doubt its experiments are of much use. First, Kohavi claims that the main goal is to estimate the accuracy of a classifier induced by $$\mathcal{I}$$ and a finite $$\mathcal{D}$$; but then he conflates this with estimating the accuracies of several classifiers built from the same $$\mathcal{I}$$ but several partitions of $$\mathcal{D}$$. When is estimating the true accuracy of a $$\mathcal{C}$$ the same as computing the accuracies of many classifiers built using a $$\mathcal{I}$$ but several partitions of $$\mathcal{D}$$? The "holdout" and "bootstrap methods" are the only ones that are estimating the accuracy of a single $$\mathcal{C}$$. The "random subsampling method" is estimating the accuracies of several different $$\mathcal{C}$$; the "$$k$$-fold cross validation method" is estimating the accuracies of $$k$$ different $$\mathcal{C}$$; and the "leave one out method" is estimating the accuracies of $$n$$ different $$\mathcal{C}$$. These last three methods are instead saying something general about $$\mathcal{I}$$ using $$\mathcal{D}$$, and not anything specific about a particular $$\mathcal{C}$$. (Could this be what he means in the title by "Model Selection"?) Second, Kohavi admits, "the true accuracies of a real dataset cannot be computed because we don't know the target concept". (I assume here he means the accuracy of a classifier on the set of all instances in $$\mathcal{V}$$ from which a finite dataset is sampled.) Still, Kohavi claims, "... we can estimate the true accuracies using the holdout method" (which is actually the "random subsampling method" since he repeats holdout 500 times). Why? If that is true, then why not always do it that way?

In short, I do not think this paper provides any meaningful comparison of these methods for estimating the true accuracy of a classifier. Two methods say something about a single classifier and a dataset, and the other three say something about an inducer and a dataset by looking at several classifiers it induces. Comparing the numbers from all methods to those produced by the "random subsampling method" might not be saying anything at all about their usefulness in the real world if none of them actually pertain to the real world. What is needed, at least, is a dataset that has a known target concept. Then "true accuracies" can be accurately estimated, and compared to the results from the methods above. That is what I intend to do next (and which I am sure someone has already done).

## A Wine Puzzle

A contest comes to town to find the best person at identifying wines. Not the kind of wines, but the region from which each comes, featuring wines from: Bordeaux, Burgundy, Alsace, and Côtes de Provence. A large cash prize, not to mention a lot of wine tasting, makes the opportunity hard to miss. Hence, I assemble a team (Team Smag), among which the prize will be divided equally if one of us wins.

To prepare Team Smag, I go to several stores to buy several bottles of wine from each region, and then cover the labels of the bottles to remove any identifying information. (I place a random ID on the bottom of the bottle to record its region, but to keep it secret from the others.) I randomly (but not haphazardly) put aside one third of the wines from each region for testing, and use the rest for training. My assistant will help train the other teammates. We train each person by repeating trials. In a trial, the participant sits in a room with several empty glasses. Outside the room, my assistant randomly selects a bottle from the training set, takes it into the room, pours a few sips of wine in a clean glass, and instructs the participant to drink and say from which region it comes. I then say whether they are correct or not. When a participant has 20 correct consecutive trials, they are considered trained. After all teammates have been trained, my assistant tests each one (a few days later) by having them identify the regions of the other 1/3 of the wines in the bottles I saved for testing. The testing is done in the same way as the training. We find that everyone scores nearly perfectly.

The day of the contest arrives. Along with all others, Team Smag pays the entry fees, and each member takes a seat in different areas of the arena. In front of each participant sits 16 glasses of wine (just a few sips in each), each numbered 1-16. There is a No. 2 pencil and a bubble sheet as well, to note the region of each wine. Great fun is had; and when all over, the bubble sheets are tallied and a big board displays the results. Not a single member of Team Smag scores much higher than random. What happened?

## DFF FTP Independent Postdoc Project Results 2012-2013

Dear taxpayers of Denmark,

You have been very gracious to support my research the past two years by Independent postdoc grant No. 11-105218 from Det Frie Forskningsråd. With the aid of this grant: I have produced 11 journal article submissions, 3 of which are now published, 1 that has been accepted, and 2 that are currently in review; I have produced 15 conference paper submissions, 12 of which are now published; I have given 20 public seminars on my research around Europe and the USA; and I have made 13 research visits of a week or more each.

Funny story: my DFF FTP postdoc grant proposal, titled, "Greedy Sparse Approximation and the Automatic Description of Audio and Music Data", is concerned with applying particular signal decomposition approaches to extracting content from audio data. However, very little of my resulting research directly addresses this goal. What I soon learned during the first quarter of my project is that the standard experimental framework practiced in a significant amount of research in this area is not as useful or illuminating as thought by its research community: invalid conclusions are being derived from poor evaluation using faulty data, which in turn can lead research widely astray, and waste valuable resources. The second quarter of my project led me to produce a thorough survey of the extent to which this is happening, and reveals a surprising lack of scientific practice in more than 500 published works. If the standard approach to evaluating a solution is faulty, then why continue to produce solutions and not change the evaluation? Hence, by the end of the first half of my project, it made no sense to produce new approaches to extracting information from audio signals and use the same poor evaluation to "measure" progress. Clearly, the most significant problem, and the one that must be addressed before all others, is how to scientifically evaluate such solutions. To this end, I devoted the last half of my project to answering why this has happened, and what can be done about it. As a result, many new directions have emerged that I will explore over the next several years with the various collaborators I have made during my DFF FTP Independent Postdoc.

Below the fold, I provide a detailed record of the results of my DFF FTP Independent Postdoc, both failures and successes.

Bob L. Sturm, Associate Professor
Audio Analysis Lab
Aalborg University Copenhagen
A.C. Meyers Vænge 15
DK-2450 Copenahgen SV, Denmark
Email: bst_at_create.aau.dk