April 2013 Archives

Hello, and welcome to the Paper of the Day (Po'D): Evaluating music emotion recognition: Lessons from music genre recognition? edition. Today's paper is my third accepted for presentation at the 2013 IEEE Int. Conf. on Multimedia and Expo: B. L. Sturm, "Evaluating music emotion recognition: Lessons from music genre recognition?"

The one-line summary of this paper, for those in a hurry: Meaningful conclusions about music genre/emotion recognition systems do not follow from standard approaches to evaluation. Here is why, and what to do about it.

In this paper, we finally identify a major and fundamental problem with most research in music genre/emotion recognition. Starting with my paper "Two Systems for Automatic Music Genre Recognition: What Are They Really Recognizing?", I knew something was not right: why do state of the art and high-performing music genre recognition systems behave so strangely? Surely, someone else has remarked on this behavior, and taken different approaches to evaluate systems designed to address this extremely complex problem.

So, I looked at how genre recognition systems have been evaluated by reading a few papers and cataloguing their approaches to evaluation: "A Survey of Evaluation in Music Genre Recognition." This revealed that hardly anyone has thought much about evaluation, and that the vast majority use the standard approach in machine learning for evaluating supervised learning, i.e., compare predicted labels with those of a ground truth and present classification accuracy as a figure of merit.
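
To make that standard design concrete, here is a minimal sketch of it (my illustration, using scikit-learn and synthetic features standing in for audio features, not code from the survey): split a labelled dataset, train a classifier, compare its predicted labels with the ground truth, and report accuracy.

# A minimal sketch of the "classify and report accuracy" evaluation design.
# Synthetic features stand in for audio features; any classifier would do.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))        # 1000 "excerpts", 20-dimensional features
y = rng.integers(0, 10, size=1000)     # ten "genre" labels: the ground truth

# Split, train, predict, and report accuracy as the single figure of merit.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
clf = SVC().fit(X_train, y_train)
print("Classification accuracy:", accuracy_score(y_test, clf.predict(X_test)))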

So, in "Classification Accuracy Is Not Enough : On the Evaluation of Music Genre Recognition Systems", we show why this evaluation approach --- used in 91% of published work we review in our survey --- is incapable of measuring the depth to which a music genre recognition system recognizes genre. In short, a richer kind of evaluation is necessary for determining which proposed systems are promising for solving the problem.

Then, in "The GTZAN dataset: Its contents, its faults, their affect on evaluation, and its future use" (about to be resubmitted), we show that the results of 96 works that evaluate classification accuracy even in the same dataset cannot be meaningfully compared in any useful sense. (We also show that when taking into account all the faults of GTZAN, classification accuracies of systems that were estimated to be around 80%, decay to 50% or lower.)

Now, in today's Po'D, we identify the principal goals of music genre/emotion recognition, and show why the most widely used approach to evaluating these systems provides nothing relevant to those goals. (We do not argue whether genre/emotion recognition is a good idea, or whether it is well-posed, and so on. We only address the fundamental problem of evaluating whether a system can recognize music genre or emotion.) In the words of Richard Hamming: "There is a confusion between what is reliably measured, and what is relevant. ... Just because a form of measurement is popular has nothing to do with its relevance." When a genre recognition system is tested by comparing labels in test data having many uncontrolled independent variables (e.g., dynamic compression, dynamic range, loudness), one cannot logically conclude that its performance is due to a capacity to recognize genre/emotion in music --- even when one sees 100% classification accuracy! Classification accuracy in this case, while easy to compute, is irrelevant for reliably measuring whether a system is recognizing genre/emotion. The conclusion does not validly follow unless every independent variable except the one of interest is controlled. This is basic experimental design, and it appears to have been rarely considered in music genre/emotion recognition.
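
To see how an uncontrolled confound can produce high accuracy without any genre recognition, consider this toy sketch (my illustration with scikit-learn, not an experiment from the paper): the "genre" label is tied to a loudness offset in the recordings, and a classifier fed only loudness scores nearly perfectly while having learned nothing about genre.

# Toy confound: each "genre" is recorded at a slightly different loudness.
# A classifier that sees only loudness scores highly without knowing genre.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
n_per_class, n_classes = 100, 5
labels = np.repeat(np.arange(n_classes), n_per_class)
# Loudness = class-dependent offset + small variation (the confound).
loudness = labels * 3.0 + rng.normal(scale=0.5, size=labels.size)

X = loudness.reshape(-1, 1)              # the only "feature" is loudness
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.5, random_state=1)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("Accuracy from loudness alone:", accuracy_score(y_te, clf.predict(X_te)))
# High accuracy here says nothing about recognizing genre, only loudness.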

In short: When Clever Hans trots into town, do not insist on asking more questions of the same kind.

Now, watch this and tell me, why does Maggie look to her handler when it was Oprah who asked the question?

Appearing at ISMIR 2011 was the following intriguing paper: C. Marques, I. R. Guilherme, R. Y. M. Nakamura, and J. P. Papa, "New Trends in Musical Genre Classification Using Optimum-Path Forest", Proc. ISMIR, 2011. As it reports classification accuracies in GTZAN above 98.8%, it certainly caught my attention. The image below shows where the accuracy of the optimum-path forest, marked as reference [55], sits among the classification accuracies in GTZAN reported in 94 other works:

[Figure: classification accuracies in GTZAN reported across published works, with the optimum-path forest result marked as reference [55].]

So, with the great help of the fourth author Joao Papa, and their excellent Optimum-Path Forest library, I was quickly on my way to reproducing the results.

Joao has filled in a critical detail missing from the paper: their results come from classifying every feature vector (computed from a 23 ms window) rather than the 30 s excerpts. This is even more curious to me, since experience shows such frame-level classification should be very poor ... unless the partitioning of the dataset into training and test sets distributes feature vectors from the same excerpt across both sets instead of keeping them separated. Looking at the code behind the "opf_split" program confirms that it takes no care to avoid such a biased partition. Another curious detail in the paper is that they write they have 33,618 MFCC vectors for the 1000 excerpts in GTZAN; I get 1,291,628, which is about what 1000 excerpts of roughly 30 s divided into 23 ms windows should give (1000 x 30/0.023 ≈ 1.3 million).
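
Why the partition matters: feature vectors from the same excerpt are nearly identical, so letting them leak across the train/test split rewards memorizing excerpts rather than recognizing genre. A minimal sketch of the effect (my illustration with scikit-learn and synthetic data, not the OPF code):

# Frame-level vs excerpt-level partitioning of frame features.
# Frames from the same excerpt are nearly identical here, so letting them
# leak across the train/test split inflates accuracy.
import numpy as np
from sklearn.model_selection import train_test_split, GroupShuffleSplit
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(2)
n_excerpts, frames_per_excerpt, n_genres = 200, 50, 10
genres = rng.integers(0, n_genres, size=n_excerpts)
centers = rng.normal(size=(n_excerpts, 20))           # one cluster per excerpt

X = np.repeat(centers, frames_per_excerpt, axis=0) \
    + rng.normal(scale=0.1, size=(n_excerpts * frames_per_excerpt, 20))
y = np.repeat(genres, frames_per_excerpt)             # frame labels = excerpt genre
groups = np.repeat(np.arange(n_excerpts), frames_per_excerpt)

clf = KNeighborsClassifier(n_neighbors=1)

# Biased: frames of one excerpt end up in both training and test sets.
Xa, Xb, ya, yb = train_test_split(X, y, test_size=0.5, random_state=2)
print("Frame-level split:", accuracy_score(yb, clf.fit(Xa, ya).predict(Xb)))

# Unbiased: all frames of an excerpt stay on one side of the split.
tr, te = next(GroupShuffleSplit(test_size=0.5, random_state=2).split(X, y, groups))
print("Excerpt-level split:", accuracy_score(y[te], clf.fit(X[tr], y[tr]).predict(X[te])))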

So, I decided to run this evaluation as I think they did:

./runOPF.sh alldata.bin 0.5 0.5 1 1
where "alldata.bin" is an OPF-formatted file of the features I compute in MATLAB, the first two numbers specify the train/test split, the last two numbers denote whether feature normalization is used, and how many independent trials to run. Here is some of the output:

Training time: 23248.525391 seconds
Testing time: 30824.958984 seconds
Supervised OPF mean accuracy 74.323967
We see that after nearly 15 hours of computation, we don't get anywhere near the 98.8% accuracy. And without feature normalization, the accuracy rises only to about 76.3%. The paper reports that the training and testing times for OPF in GTZAN are 9 and 4 seconds, respectively. Respectfully, my computer is not so slow as to cause a 7000-fold increase in computation time. I tried several other things to increase the accuracy, but nothing worked.

Then I tried testing and training on the same fold, and got an accuracy of 99.97%. Joao confirms that this appears to be at least part of what happened.
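
No surprise there: when the test set is the training set, a classifier that keeps its training samples around (as OPF and nearest-neighbour classifiers do) can match nearly every test point to itself. A minimal nearest-neighbour sketch of the effect (my analogy, not the OPF code):

# Testing on the training set: a 1-NN classifier matches each test point
# to itself, so accuracy is (nearly) perfect regardless of the task.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(3)
X = rng.normal(size=(5000, 13))              # arbitrary features
y = rng.integers(0, 10, size=5000)           # labels with no relation to X

clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print("Accuracy on the training set:", accuracy_score(y, clf.predict(X)))  # ~1.0
print("Accuracy on fresh data:", accuracy_score(
    rng.integers(0, 10, size=1000),
    clf.predict(rng.normal(size=(1000, 13)))))  # ~0.1 (chance)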

Now, I am going to run the same experiment, but using a proper partitioning, and the fault filtering necessary for evaluating systems with GTZAN. I predict that we should see the classification accuracy drop from about 74% to 55% or lower.

Hello, and welcome to the Paper of the Day (Po'D): Music genre classification risk and rejection edition. Today's paper is my second accepted for presentation at the 2013 IEEE Int. Conf. on Multimedia and Expo: B. L. Sturm, "Music genre recognition with risk and rejection", Proc. ICME, 2013.

The one-line summary of my paper, for those in a hurry: Some misclassifications are much worse than others, so we show how to make an MGR system take that into account.

When it came time in my multivariate statistics course to come up with fun examples of considering risk in classification, I said, "Consider a music genre recognition system that labels a classical piece 'metal' --- the horror! Hence, we can specify for the system that it must be quite sure something is metal before calling it 'metal'." Then I said, "I will show you the results of this easy example in the next class period."

It took a bit longer than that to get the system working, as it was not as trivial as I thought. And while some researchers in music genre recognition over the past ten years have hinted at such a possibility, we find that no one has actually done it. Before I knew it, I had given birth to a paper!
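
The idea behind it is standard Bayes decision theory: instead of picking the label with the highest posterior probability, pick the label with the lowest expected loss under a loss matrix that makes some confusions (classical called "metal") costlier than others, and reject when even the best choice is too risky. A minimal sketch of that decision rule (my illustration, not the paper's exact formulation):

# Risk-sensitive classification with a reject option: choose the label with
# minimum expected loss, or abstain when even that is too costly.
import numpy as np

genres = ["classical", "metal", "jazz"]
# loss[i, j]: cost of predicting genre j when the truth is genre i.
loss = np.array([
    [0.0, 10.0, 1.0],   # calling classical "metal": the horror
    [1.0,  0.0, 1.0],
    [1.0,  1.0, 0.0],
])
reject_cost = 0.5       # cost of refusing to answer

def decide(posterior):
    """posterior[i] = estimated P(true genre is i | features)."""
    expected_loss = posterior @ loss           # expected loss of each label
    best = int(np.argmin(expected_loss))
    if expected_loss[best] > reject_cost:
        return "reject"
    return genres[best]

print(decide(np.array([0.30, 0.60, 0.10])))   # probably metal, but not sure enough -> reject
print(decide(np.array([0.03, 0.95, 0.02])))   # quite sure -> metal
print(decide(np.array([0.95, 0.02, 0.03])))   # -> classical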

Hello, and welcome to the Paper of the Day (Po'D): Music genre classification via compressive sampling edition. Today's paper is: B. L. Sturm, "On music genre classification via compressive sampling", Proc. ICME, 2013. This paper is the closing chapter on the findings reported in K. Chang, J.-S. R. Jang, and C. S. Iliopoulos, "Music genre classification via compressive sampling," in Proc. Int. Soc. Music Information Retrieval, (Utrecht, the Netherlands), pp. 387-392, Aug. 2010.

The one-line summary of my paper, for those in a hurry: Results contradicting two well-supported findings of machine learning and music information research? We show the contradictions are not real.

I first discussed the work of Chang et al. here; and then two years later discussed several issues with the work, and finally reproduced it and submitted a paper with my code. My paper is now accepted and revised with many changes suggested by the helpful reviews. This is my third negative results paper (the first is here, the second here). I must take care to not become too negative!

Anyhow, it is quite satisfying to receive the following reviewer comment on my paper:

The paper provides extremely reproducible results that help to clear the confusion caused by previous works. The result is consistent with other works which show that compressive sampling / random projection reduce classification accuracy. Classification research is heavily directed by the top performers in the field. In this case, the authors address the failings of previous authors to sufficiently explain their methods. Without papers such as this one, the field continues to be muddied by works that claim inflated results without providing sufficient data to reproduce their work, and researchers waste time chasing phantom results. I applaud the rigor with which the research was performed and explained.
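
For context, the well-supported finding the reviewer refers to is that classifying random linear projections of feature vectors (the essence of "compressive sampling" in this setting) generally lowers accuracy relative to classifying the original features, and lowers it more as the projected dimension shrinks. A minimal sketch of that effect (my illustration with scikit-learn and synthetic data, not the code accompanying my paper):

# Classifying features vs. random projections of them (compressive sampling).
# One generally expects accuracy to fall as the projected dimension shrinks.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=100, n_informative=30,
                           n_classes=5, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=4)

def accuracy(A_tr, A_te):
    return accuracy_score(y_te, SVC().fit(A_tr, y_tr).predict(A_te))

print("Original features:", accuracy(X_tr, X_te))
rng = np.random.default_rng(4)
for m in (50, 20, 5):                                   # projected dimensions
    P = rng.normal(size=(X.shape[1], m)) / np.sqrt(m)   # random projection matrix
    print(f"Projected to {m} dims:", accuracy(X_tr @ P, X_te @ P))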
