# June 2013 Archives

## 2013 Reproducible Research Prizes

The SoundSoftware.ac.uk Prizes for Reproducibility in Audio and Music Research have been announced. I am happy to report that some of my works (with collaborators F. Gouyon and P. Noorzad) are among them. In particular, our prize (supporting the open-access publication of a journal article) was awarded for the following three items:

1. B. L. Sturm and F. Gouyon, "Revisiting Inter-Genre Similarity", submitted Mar. 2013, resubmitted June 2013. MATLAB code is here. This paper resolves some contradictions posed by the results published in: U. Bagci and E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, vol. 14, no. 8, pp. 521-524, Aug. 2007.
2. B. L. Sturm, "On music genre classification via compressive sampling", Proc. IEEE Int. Conf. Multimedia & Expo, July 2013. MATLAB code is here. This paper resolves some contradictions posed by the results published in: K. Chang, J.-S. R. Jang, and C. S. Iliopoulos, "Music genre classification via compressive sampling," in Proc. ISMIR, (Amsterdam, The Netherlands), pp. 387-392, Aug. 2010.
3. B. L. Sturm and P. Noorzad, "On Automatic Music Genre Recognition by Sparse Representation Classification using Auditory Temporal Modulations", Proc. Computer Music Modeling and Retrieval, QMUL, London, UK, June 2012. MATLAB code is here. This paper attempts to reproduce the results published in: Y. Panagakis, C. Kotropoulos, and G. R. Arce, "Music genre classification via sparse representations of auditory temporal modulations," in Proc. EUSIPCO, Aug. 2009.

There are other awarded works there that appear quite interesting to me. I look forward to reading:
• Giannoulis, D., Stowell, D., Benetos, E., Rossignol, M., Lagrange, M., and Plumbley, M. D., "A Database and Challenge for Acoustic Scene Classification and Event Detection", conference submission.
• Raffel, C., and Ellis, D., Reproducing Pitch Experiments in "Measuring the Evolution of Contemporary Western Popular Music", technical report.

Thank you to the organizers of this prize! It is a great way to motivate reproducibility in disciplines that suffer from a lack of transparency.

## NSA proof font? Just retrain.

Someone has developed a font to confuse existing text scanners. I initially thought it would be generating captchas of documents, or at least generate a new typeface for each word, but it is instead using set type faces. This makes its value completely limited.

## Exact odds?

Is there a confidence interval for these exact odds?

## I found some formalized experimental design

Thanks to Dan Stowell, I have found an interesting book:

R. A. Bailey, Design of comparative experiments. Cambridge University Press, 2008.
The pre-print is here, but I am definitely buying a copy.

Let's have a first look at some of the formalization, as I consider whether it is equipped for describing evaluation in machine learning. Here are some basic definitions, and notations

• experimental unit: "the smallest unit to which a treatment is applied"
• observational unit (plot): "the smallest unit on which a response will be measured"
• The entire set of plots is notated $$\Omega$$. The number of plots $$N := |\Omega|$$.
• treatment: "the entire description of what can be applied to an experimental unit"
• The entire set of treatments is notated $$\mathcal{T}$$. The number of treatments $$t := |\mathcal{T}|$$
• plot structure: "meaningful ways of dividing up the set of plots"
• treatment structure: "meaningful ways of dividing up the set of treatments"
• design: "allocation of treatments to plots".
• The design is specified by a map of units to treatments, $$T : \Omega \to \mathcal{T}$$
• plan or layout: "the design translated into actual plots"
• response on a plot: realization of a random variable.
• The response on plot $$\omega$$ is the rv $$Y_\omega = Z_\omega + \tau_{T(\omega)}$$ where $$\tau_{T(\omega)}$$ is a constant, and $$Z_\omega$$ is a rv.''
In the last bit, the linear model, we want to recover $$\tau_{T(\omega)}$$ from our observation $$Y_\omega$$. This is the response of the unit to the treatment. The $$Z_\omega$$ includes measurement noise, stuff that has to do with the plot, and so on.

Now, let's try to apply these to a real virtual example, e.g., pattern recognition. We may wish to answer the following question: "how well does system $$i$$ detect the presence of human voice in my collection of $$N$$ digital audio recordings?"

• experimental unit: collection
• observational unit: digital audio recording
• treatments: system $$i$$, random system
• treatment and plot structure: N/A since we have digital samples
• design: each $$\omega \in \Omega$$ mapped to both $$i$$ and random system
• response on a plot: say $$i$$ gives a number in $$[0,1]$$ denoting its confidence in its detection of human voice. If it is 1, then it is absolutely sure. The random system gives a number in $$\{0,1\}$$ with some probability.
This reveals one major shortcoming: since we have assigned each unit to both treatments, we no longer have a function. Bailey writes,
Although we speak of allocating treatments to plots, mathematically the design is a function ... Thus plot $$\omega$$ is allocated treatment $$T(\omega)$$. The function has to be this way round, because each plot can receive only one treatment.
Unlike in agriculture, digital signals can be replicated exactly. Treating one copy does not affect another copy. We think.

So, big question: Does this mean the entire apparatus in Bailey's book is inapplicable to evaluation in machine learning?

## Formalized evaluation in machine learning?

I am wondering if experimental design, or evaluation in machine learning, has ever been formalized? I show below an example of what I wish to have. In short, one thing I want is a nice way to define an evaluation that addresses some claim or hypothesis, which shows where weak points in validity arise. Also, it would be a nice way to explore the possibilities of other evaluation.

An evaluation of some machine learning systems is completely specified by three things: an experimental design ($$\mathcal{E}$$), test data ($$\mathcal{U}$$), and a relevant figure of merit ($$\mathcal{F}$$). In more formal terms, define a unit $$u \in \Omega$$ a member of the universal set of units, and thus the test data $$\mathcal{U} := \{u_n : n \in \mathcal{N}\}$$ as an indexed set of units. Define the figure of merit $$\mathcal{F}$$ as a set of functions $$f \in \mathcal{F}$$ where the range of $$f$$ is $$\mathcal{R}(f)$$. For instance, if $$f$$ is a function that produces a confusion table, then $$\mathcal{R}(f) = {N}_0^{T\times T}$$. Now, we can see $$\mathcal{E}$$ as a map of $$\mathcal{U}$$ by $$\mathcal{X}$$ into the set of ranges of each member of $$\mathcal{F}$$: $$\mathcal{E}: \mathcal{U} \times \mathcal{X} \to \{\mathcal{R}(f) : f \in \mathcal{F}\}.$$

As an example, consider the classification system to be the map $$\mathcal{X} : \Omega \to [0,1]^{T}$$, and define the function $$\mathcal{L} : \Omega \to [0,1]^{T}$$, which produces the ground truth'' of a unit. The experimental design Classify is thus defined as $$\mathcal{E}_{\textrm{Cl}}(\mathcal{U},\mathcal{X}) := \bigl \{f\{(\mathcal{L}(u_n),\mathcal{X}(u_n)) : u_n \in \mathcal{U}\} : f \in \mathcal{F} \bigr \}.$$ A relevant $$f$$ produces a confusion table. Now consider a system that retrieves a set of $$M$$ units from $$\mathcal{U}$$ based on observing a $$u \in \Omega$$, i.e., $$\mathcal{X} : \Omega \to \mathcal{U}^M$$. The experimental design Retrieve is defined by $$\mathcal{E}_{\textrm{Re}} := \bigl \{f\{(\mathcal{L}(u),\mathcal{L}(u')) : u' \in \mathcal{X}(u) \} : f \in \mathcal{F} \bigr \}.$$ A relevant $$f$$ is precision at $$M$$ for each class. Another experimental design is Generalize, which is defined by crossing the experimental design Classify with several datasets, i.e., \begin{align} \mathcal{E}_{\textrm{Ge}} & := \bigl \{\mathcal{E}_{\textrm{Cl}} \times \{\mathcal{U}_1,\mathcal{U}_2, \ldots\} \bigr\} \\ & := \Bigl \{f \bigl \{ \{(\mathcal{L}(u_n),\mathcal{X}(u_n)) : u_n \in \mathcal{U}\} : \mathcal{U} \in \{\mathcal{U}_1, \mathcal{U}_2, \ldots \} \bigr \} : f \in \mathcal{F} \Bigr \}. \end{align} A relevant $$f$$ is classification accuracy. Now, consider a system $$\mathcal{X} : [0,1]^T \to \Omega$$, i.e., it maps a label to a $$u \in \Omega$$. The experimental design Compose is defined by $$\mathcal{E}_{\textrm{Co}} := \bigl \{f\{(l,\mathcal{L}(\mathcal{X}(l))) : l \in \mathcal{L} \} : f \in \mathcal{F} \bigr \}$$ where $$\mathcal{L}$$ is a set of labels. In this case, $$\mathcal{L}$$ is an expert or other system labeling the output of $$\mathcal{X}$$. A relevant $$f$$ is a confusion table.

## Final OPF results

An SQL error has kept me locked out of my blog for the past month. Now, I can finally post the final results of my experiments with OPF. Previously, I discussed how my reproduction of the optimum path forest approach to music genre recognition does not generate results near those reported, until I train and test with the same dataset. I have now run the same experiment, but used a partitioning of GTZAN that considers the duplication of artists, and its faults. (My work on GTZAN is now available at arxiv.) I predicted before the classification accuracy to drop "from 74 to at least 55". Let's see how I did!

First we look at the classification of all 23 ms segments. Quite poor in all degrees. (True classes are columns. In percentages, precision is last column on right, F-score is last row, recalls along diagonal, and accuracy is bottom right corner.)

Now, we apply a majority vote to classify excerpts, which produces much better results than before, but 7 points worse than 55 I predicted.

Finally, when we take into consideration the mislabelings in GTZAN, the accuracy drops a few more points.

So, the performance of OPF in GTZAN has gone from the 99.8% published in ISMIR2011 (which comes from testing on the training data), to 76% without using artist and fault filtering, to 47% with artist and fault filtering, finally to 45% taking into consideration the mislabelings in GTZAN.

Bob L. Sturm, Associate Professor
Audio Analysis Lab
Aalborg University Copenhagen
A.C. Meyers Vænge 15
DK-2450 Copenahgen SV, Denmark
Email: bst_at_create.aau.dk