January 2012 Archives

In a previous post, I spoke of some classification outcomes using the Tzanetakis music genre dataset. My observations, or unsupported justifications, should be taken with a grain of salt because they assume the classifier is looking at and comparing the same things I am comparing. Then, in the last post, I noted that there exist several problems in the training and testing dataset. I have finally completed a thorough study of this dataset, and present a detailed list of its faults here.

This is not good news for the many new studies, and those over the past decade, that rely only on the Tzanetakis dataset for testing and comparing results. Confirming results with other datasets is always a good idea; but I don't have enough experience with other datasets yet --- and I don't know whether their integrity has been validated.

However, in this paper I argue that the many faults in the Tzanetakis dataset present new and interesting challenges. Since our datasets have grown past the point where human validation is feasible, we need tools that can automatically find problems like distortions, versions, and possible mislabelings. Furthermore, when we only have access to features and not to the audio data itself, we have to build tools that do the same in the feature space. In these directions, my large catalog of faults provides a ground truth for testing such tools. Given my limited memory, I am sure I missed some versions; but I am confident all replicas have been found (using a simplified version of the Shazam fingerprint method).

What happens when ...

over 10 years of research in automatic music genre recognition is built upon tests and comparisons with a dataset rife with errors, such as replicas and mislabelings?

[Figure: matches_Jazz.png]

In the similarity matrix above, which I created with a simplified version of the Shazam fingerprint method (http://imi.aau.dk/~bst/software/index.html), we see several red squares showing that, of the 100 excerpts in the Jazz category, 13 are identical replicas. No cross-validation can cure that. Why hasn't anyone listened to this dataset? (Or maybe someone has, but I cannot find any such discussion.)
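
For those curious about how the replicas show up, here is a minimal sketch in Python of the kind of simplified Shazam-style landmark fingerprinting I mean. It is not the code linked above; the peak-picking neighborhood, fan-out, and threshold are assumptions for illustration only.

```python
import numpy as np
from scipy.signal import spectrogram
from scipy.ndimage import maximum_filter

def peak_constellation(x, fs, nfft=1024, hop=512, neighborhood=(15, 15)):
    """Local peaks of the log spectrogram, returned as (time_bin, freq_bin) pairs."""
    f, t, S = spectrogram(x, fs=fs, nperseg=nfft, noverlap=nfft - hop)
    S = np.log1p(S)
    # a peak equals the maximum of its neighborhood and sits above the median level
    is_peak = (S == maximum_filter(S, size=neighborhood)) & (S > np.median(S))
    fbins, tbins = np.nonzero(is_peak)
    return sorted(zip(tbins, fbins))

def landmark_hashes(constellation, fan_out=5):
    """Pair each anchor peak with a few later peaks; the hash ignores absolute time."""
    hashes = set()
    for i, (t1, f1) in enumerate(constellation):
        for (t2, f2) in constellation[i + 1:i + 1 + fan_out]:
            hashes.add((f1, f2, t2 - t1))
    return hashes

def fingerprint_similarity(x1, x2, fs):
    """Fraction of shared landmark hashes; close to 1 for identical excerpts."""
    h1 = landmark_hashes(peak_constellation(x1, fs))
    h2 = landmark_hashes(peak_constellation(x2, fs))
    return len(h1 & h2) / max(1, min(len(h1), len(h2)))
```

Computing this similarity for every pair of excerpts in a category gives a matrix like the one above; exact replicas light up as values near one.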

Many more details to come...

Over the past year I have been working with my student Pardis on automatic music genre recognition, a problem that I feel has too often been approached in a way that is at odds with reality. The aim of recognizing the genre thought to be embodied by an excerpt of a recording of a piece of music is by nature not well defined --- a fact reflected by the observation that humans often disagree on genre labels for music. There is not much argument over whether an '8' is an '8' in a dataset of hand-written digits; but whether a musical excerpt in a music genre dataset is Jazz or Classical often depends on the viewpoint of whoever created that dataset (which I show below).

Since about 2001, many studies in this area have used the 1.2 GB dataset assembled by George Tzanetakis, who was one of the first to study this area. The typical approach has been to design and test a set of acoustic features with a classifier, then report the mean accuracies from cross-validation, and perhaps a confusion table. Below is a confusion table I just created from this dataset using one classification method and one set of acoustic features.

[Figure: confusion.png]
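
To make the pipeline behind such a table concrete, here is a rough sketch of k-fold cross-validation with a simple nearest-neighbor classifier in Python/NumPy. The 1-NN classifier, and the assumption that each excerpt has already been reduced to a single feature vector, are placeholders for illustration, not necessarily what produced the figure above.

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, X_test):
    """1-NN by Euclidean distance: label each test point with its closest training point."""
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[np.argmin(d2, axis=1)]

def cross_validated_confusion(X, y, n_folds=10, seed=0):
    """X: one feature vector per excerpt (NumPy array); y: genre labels (NumPy array).
    Shuffle, split into folds, and accumulate a confusion table C[true, predicted]."""
    rng = np.random.RandomState(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    labels = np.unique(y)                      # sorted unique genre labels
    C = np.zeros((len(labels), len(labels)), dtype=int)
    for k in range(n_folds):
        test = folds[k]
        train = np.hstack([folds[j] for j in range(n_folds) if j != k])
        y_pred = nearest_neighbor_predict(X[train], y[train], X[test])
        for true, pred in zip(y[test], y_pred):
            C[np.searchsorted(labels, true), np.searchsorted(labels, pred)] += 1
    return labels, C
```

The mean accuracy is then just the trace of C divided by its sum; the confusion table shows where the errors concentrate.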

Play me a tiny violin

Here is a popular-press description of an excellent experiment (done by a colleague of mine at LAM, Paris 6). The attendant paper is C. Fritz, J. Curtin, J. Poitevineau, P. Morrel-Samuels, and F.-C. Tao, "Player preferences among new and old violins", PNAS, http://dx.doi.org/10.1073/pnas.1114999109. Surprisingly, this paper already has a Wikipedia page!

Of course, with conclusions like, "We found that (i) the most-preferred violin was new; (ii) the least-preferred was by Stradivari ...", there is certain to be controversy. Here is a NY Times article containing criticism by a few professional musicians: that the tests were conducted in a hotel room and not a concert hall, that it takes time for a musician to get to know an instrument, and that quality varies among the instruments by the great luthiers of history. I think the last two criticisms have more merit; as for the first, we can just say this study shows one should play million-dollar violins in large reverberant spaces, because in smaller rooms --- the kind of space where the violin will spend most of its time resonating --- they don't sound as good (or play as well) as new, more modestly priced instruments.

This article will provide an excellent example of experimental design for my course on the analysis and design of experiments this semester.

While putting the finishing touches on our paper for CMMR 2012 about music genre classification with sparse representation classification, we noticed something funny going on with the classifier. In our experiments, we are measuring classifier performance using features that have undergone some dimensionality reduction. One standard way to do this is to project the dataset onto its most significant principal directions, thereby forming linear combinations of the original features that maintain the variability of the data in a lower-dimensional space. Another way is to project the dataset onto the span of a set of positive features found through non-negative matrix factorization. A non-adaptive approach is simply to project the dataset onto a random subspace. We can also downsample the features by lowpass filtering and decimation; a rough sketch of all four appears below.

So we coded these up and, after much debugging, are quite sure things are working as expected. I made a mistake, though, when specifying the downsampling factors, and ended up running lots of experiments with features that were ideally interpolated, higher-dimensional versions of their original lower-dimensional selves. This interpolation appears to give the accuracy a modest boost.
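
Here is the promised sketch of the four reduction schemes in Python/NumPy. These are illustrations under my own assumptions (the toy multiplicative-update NMF, the filter defaults of scipy's decimate), not the code we actually ran.

```python
import numpy as np
from scipy.signal import decimate

# X: one feature vector per row; each function returns a reduced version of X.

def pca_reduce(X, d):
    """Project the centered data onto its d most significant principal directions."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T

def nmf_reduce(X, d, n_iter=200, eps=1e-9):
    """Factor nonnegative X ~ W H by multiplicative updates; use W as the reduced features."""
    rng = np.random.RandomState(0)
    W, H = rng.rand(X.shape[0], d), rng.rand(d, X.shape[1])
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W

def random_project(X, d, seed=0):
    """Non-adaptive reduction: project onto a random d-dimensional subspace."""
    rng = np.random.RandomState(seed)
    return X @ (rng.randn(X.shape[1], d) / np.sqrt(d))

def downsample_features(X, factor):
    """Lowpass filter and decimate each feature vector by an integer factor."""
    return np.vstack([decimate(x, factor) for x in X])
```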

In the figure below, we compare the classification accuracy for four reduction methods and several reduction factors. (You can see my mistake has shifted the "Downsample" line a bit.) At a factor of 4, the feature dimension is 1/4 that of the original. At 1, we are just using the original features. And at a factor of 0.5, the dimension is twice that of the original, created by putting a zero between each feature element and then lowpass filtering to remove the spectral images. I expect there to be a dimensionality that is just right for maximizing the accuracy, and for there to be some benefit in reducing the dimensionality given that the amount of training data we have does not change. So the dip at no reduction (1) makes sense. But why the boost of nearly 8% in mean accuracy with an ideal interpolation of the features? (We have seen this happen repeatedly with other features as well.)
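
To be clear about what the factor-0.5 point means, here is a minimal sketch of interpolating a feature vector by two; the FIR filter length is an assumption, and scipy's FFT-based resample gives the "ideal" version.

```python
import numpy as np
from scipy.signal import firwin, lfilter, resample

def interpolate_by_two(x, numtaps=31):
    """Put a zero between each feature element, then lowpass at the original Nyquist
    rate to suppress the spectral images left by the zero-stuffing."""
    up = np.zeros(2 * len(x))
    up[::2] = x
    h = firwin(numtaps, cutoff=0.5)   # cutoff as a fraction of the new Nyquist rate
    return 2.0 * lfilter(h, 1.0, up)  # factor of 2 restores the original amplitude

def interpolate_ideal(x, factor=2):
    """Ideal (bandlimited) interpolation via the FFT."""
    return resample(x, factor * len(x))
```

Either way, the dimension doubles without adding any information beyond the lowpass assumption, which is what makes the accuracy boost puzzling.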

[Figure: errors_ours.pdf.png]

Is the lowpass filtering of the ideal interpolation making things more discriminable for the sparse representation classifier? This is something we will have to explore in order to isolate its cause.

PS: Sorry for the long delay in posts! Happy new year too! Much more will come in a few weeks after submission, exams, and semester start.

