January 2013 Archives

Hello, and welcome to Paper of the Day (Po'D): Three current issues in music autotagging edition. Today's paper is a provocative one: G. Marques, M. Domingues, T. Langlois, and F. Gouyon, "Three current issues in music autotagging," in Proc. ISMIR, 2011.

What is "music autotagging"? Let's not get ahead of ourselves. What is a "tag"? In general, it is a term applied by someone to music to make it more useful to them (thanks to Mark Levy for that great definition). So, on last.fm, if we view Roger Whittaker's "The Last Farewell", we see people have applied several tags to the song: "easy listening", "roger whittaker", "schlager", and "oldies". Some of those tags are quite useless and/or meaningless to me. One of these tags notes a use of the music (easy listening, maybe), one redundantly names the artist/performer/singer in the music, and two are completely unrelated to the musical content (schlager, which is "hit" in German apparently). (Here are many other tags people have applied to this song.) In T. Bertin-Mahieux, D. Eck, and M. Mandel, "Automatic tagging of audio: The state-of-the-art," in Machine Audition: Principles, Algorithms and Systems, IGI Publishing, 2010, we learn that in 2007, 68% of the tags on last.fm described genre ("rock"); 12% described the location ("Brooklyn" for Beastie Boys); 5% described mood ("chill"); 4% described opinion ("favorite"); 4% described instrumentation ("contrabassoon"); 3% described 'style' ("political"); and the rest was a mixed bag.

So, what is "music autotagging" (MA)? That is just the act of a machine programmed to apply tags to a piece of music like we see above. And the Po'D provides illumination on three concerning issues in this line of research:

  1. Current approaches to MA evaluation are too sensitive with respect to imbalances in the data, and end up painting too rosy of a picture of performance and progress.
  2. Current approaches to MA do not generalize across datasets, and measuring performance in one dataset ends up painting too rosy of a picture of performance and progress.
  3. Tag post-processing using tag co-occurrences does not work well because, you know, pulling one's self up out of mire by the bootstraps, i.e., achieving good tags to begin with is what is necessary.

With regard to the first concern, Marques et al. discuss how high mean F-scores (over all tags) can be achieved easily when a large amount of the data is tagged with "foo", and a system applies "foo" to all data. Hence, they recommend also reporting mean F-scores per tag. Furthermore, one must consider that some of the data has many tags, and a lot of the data has few tags. There is also the problem of data integrity: what to do with misspellings? duplicated excerpts? excerpts without tags? mutually exclusive tags? and so on.
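
To make the "foo" problem concrete, here is a minimal sketch that compares an F-score pooled over all tag decisions with the mean of per-tag F-scores. The toy data, the function names, and my reading of "over all tags" as pooling the counts are all assumptions for illustration, not the evaluation code of Marques et al.

```python
def f_score(tp, fp, fn):
    """F1 from counts; defined as 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def counts(truth, predicted, tag):
    """True positives, false positives, false negatives for one tag."""
    tp = fp = fn = 0
    for true_tags, pred_tags in zip(truth, predicted):
        if tag in pred_tags and tag in true_tags:
            tp += 1
        elif tag in pred_tags:
            fp += 1
        elif tag in true_tags:
            fn += 1
    return tp, fp, fn


def pooled_f_score(truth, predicted, tags):
    """Pool the counts of every tag, then compute a single F-score."""
    totals = [sum(c) for c in zip(*(counts(truth, predicted, t) for t in tags))]
    return f_score(*totals)


def mean_per_tag_f_score(truth, predicted, tags):
    """Compute one F-score per tag, then average them."""
    return sum(f_score(*counts(truth, predicted, t)) for t in tags) / len(tags)


# Hypothetical toy data: 10 excerpts, 9 tagged "foo" and 1 tagged "bar".
truth = [{"foo"}] * 9 + [{"bar"}]
# A lazy system that applies "foo" to everything.
predicted = [{"foo"}] * 10
tags = ["foo", "bar"]

pooled = pooled_f_score(truth, predicted, tags)
per_tag = mean_per_tag_f_score(truth, predicted, tags)
print(f"pooled: {pooled:.3f}, mean per tag: {per_tag:.3f}")
```

On this toy data the lazy system looks strong when the counts are pooled, while the per-tag mean exposes that it never finds "bar".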

With regard to the second concern, Marques et al. show extremely interesting results of training two MA systems on one dataset, and testing them on another. This shows the systems to be quite different in terms of generalizing across the two. Furthermore, when they apply tags to two different and un-tagged datasets and look at tag frequencies, they find nearly the same behaviors! That is, to excerpts in both datasets, the MA applies "man singing" with the same frequency; and "acoustic"; and "duet". And this occurs in the face of the systems having "good" F-scores when evaluated in their own dataset (with 2-fold CV).

With regard to the last concern, Marques et al. look at the quantitative differences in F-scores (for individual tags) before and after a post-processing stage. The idea of the stage is that if the tags "rock," "Springsteen" and "oboe" are applied to a song, then "oboe" should be removed or substituted by "guitar". Here the experiment reveals that the benefits achieved by this approach are quite limited by the success of the application of the tags in the first place. Hence, work must be focused on improving the initial application of tags.

Bottom line: We have a ways to go.

Yesterday, I got everything working for a toy problem tuned such that the approach proposed by Bagci and Erzin works nicely. Now, we return to apply it to the real-world problem of music genre recognition. We use a 3-s decision window (each observation has 300 features), with 13 MFCCs in each feature (including the zeroth coefficient), and model each class with a mixture of 8 Gaussians having diagonal covariance matrices. (I have to specify a small regularization term to avoid ill-conditioned covariance matrices. I also specify a maximum number of EM iterations of 200.) We observe the following classification errors for each fold.

The genre recognition approach proposed by Bagci and Erzin is motivated by several key assumptions, which I boil down to the following:

  1. there are common properties among music genres, for instance, "similar types of instruments with similar rhythmic patterns";
  2. the nature of some of these common properties is such that bags of frames of features (BFFs) encapsulate them;
  3. this means we should expect poor automatic genre recognition using BFFs if these common properties are quite prevalent among the classes; (or, that music genres have common properties is to some extent responsible for their ambiguousness;)
  4. we can model the common properties of a set of music genres by combining the features of frames that are misclassified in each genre;
  5. we can model the non-common properties of each of a set of music genres by combining the features of frames that are correctly classified in each genre;
  6. with these models, we can thereby ascribe a confidence that any given frame encapsulates common properties, and thus whether to treat it as indicative of genre or not.
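
Items 4 to 6 can be sketched in a few lines. This is only an illustration under loud assumptions: a nearest-centroid classifier stands in for the Gaussian mixture models, the 2-D "frames" are made up, a hard frame assignment replaces any confidence score, and every name (`train_with_common_class`, `classify_excerpt`, the "common" label) is hypothetical rather than taken from Bagci and Erzin.

```python
def centroid(points):
    """Mean vector of a non-empty list of equal-length tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(models, frame):
    """Label of the model whose centroid is closest to the frame."""
    return min(models, key=lambda label: sq_dist(models[label], frame))


def train_with_common_class(frames_by_genre, common_label="common"):
    # Stage 1: one model per genre, built from all of its frames.
    first = {g: centroid(f) for g, f in frames_by_genre.items()}
    # Split training frames by whether stage 1 classifies them correctly.
    kept = {g: [] for g in frames_by_genre}
    pooled = []
    for genre, frames in frames_by_genre.items():
        for frame in frames:
            if nearest(first, frame) == genre:
                kept[genre].append(frame)
            else:
                pooled.append(frame)
    # Stage 2: genre models from correctly classified frames (item 5),
    # plus one pooled model from the misclassified frames (item 4).
    second = {g: centroid(f) for g, f in kept.items() if f}
    if pooled:
        second[common_label] = centroid(pooled)
    return second


def classify_excerpt(models, frames, common_label="common"):
    """Majority vote over frames not claimed by the common model (item 6)."""
    votes = {}
    for frame in frames:
        label = nearest(models, frame)
        if label != common_label:
            votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get) if votes else None


# Hypothetical training frames; the last frame of each genre is ambiguous.
frames_by_genre = {
    "blues": [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (1.2, 1.2)],
    "metal": [(2.0, 2.0), (1.9, 2.2), (2.1, 1.8), (0.8, 0.8)],
}
models = train_with_common_class(frames_by_genre)

# An excerpt whose frames are mostly ambiguous; only two vote.
excerpt = [(0.1, 0.2), (0.0, 0.1), (1.0, 1.1), (0.9, 1.0), (1.1, 1.0)]
decision = classify_excerpt(models, excerpt)
```

In this toy case the three ambiguous frames land on the pooled model and are ignored, so the decision rests on the two frames that look distinctly like one genre.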

Yesterday, I posted some initial results with the music genre recognition system proposed by Bagci and Erzin. Since I am not too confident that I understand what PRTools is doing, I have decided to implement the process with the stats toolbox of MATLAB, and get it working on a standard machine learning dataset: the handwritten digits of the US Postal Service.

Having reviewed yesterday the work on music genre classification by boosting with inter-genre similarity, I now have some results. Adapting my MFCC code to fit the description in the text, and using the excellent PRTools toolbox, it only took me a matter of minutes to create the code. Running it, however, takes hours.

Hello, and welcome to the Paper of the Day (Po'D): Automatic classification of musical genres using inter-genre similarity edition. Today's paper is: U. Bağcı and E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Proc. Letters, vol. 14, pp. 521-524, Aug. 2007. Related to this work are the following three:
  1. U. Bağcı and E. Erzin, "Boosting classifiers for music genre classification," in Proc. 20th Int. Symp. Comput. Inform. Sci. (ISCIS'05), Istanbul, Turkey, Oct. 2005, pp. 575-584.
  2. U. Bagci and E. Erzin, "Inter genre similarity modeling for automatic music genre classification," in Proc. IEEE Signal Process. Comm. Apps., pp. 1-4, Apr. 2006.
  3. U. Bagci and E. Erzin, "Inter genre similarity modeling for automatic music genre classification," in Proc. DAFx 2006.
(Is the DAFx paper an English translation of the IEEE conference paper written in Turkish?)

This work is next on my docket for reproduction, not least because it reports a classification accuracy of over 92% in the GTZAN dataset.

Hello, and welcome to the Paper of the Day (Po'D): Multi-tasking with joint semantic spaces edition. Today's paper is: J. Weston, S. Bengio and P. Hamel, "Multi-tasking with joint semantic spaces for large-scale music annotation and retrieval," J. New Music Research, vol. 40, no. 4, pp. 337-348, 2011.

This article proposes and tests a novel approach (pronounced MUSCLES but written MUSLSE) for describing a music signal along multiple directions, including semantically meaningful ones. This work is especially relevant since it applies to problems that remain unsolved, such as artist identification and music recommendation (in fact the first two authors are employees of Google). The method proposed in this article models a song (or a short excerpt of a song) as a triple in three vector spaces learned from a training dataset: one vector space is created from artists, one created from tags, and the last created from features of the audio. The benefit of using vector spaces is that they bring quantitative and well-defined machinery, e.g., projections and distances.

MUSCLES attempts to learn each vector space together so as to preserve (dis)similarity. For instance, vectors mapped from artists that are similar (e.g., Britney Spears and Christina Aguilera) should point in nearly the same direction; while those that are not similar (e.g., Engelbert Humperdinck and The Rubberbandits), should be nearly orthogonal. Similarly, vectors mapped from tags that are semantically close (e.g., "dark" and "moody") should point in nearly the same direction, and those from tags that are semantically disjoint (e.g., "teenage death song" and "NYC") should be nearly orthogonal. For features extracted from the audio, one hopes the features themselves are comparable, and are able to reflect some notion of similarity at least at the surface level of the audio. MUSCLES takes this a step further to learn the vector spaces so that one can take inner products between vectors from different spaces, which is definitely a novel concept in music information retrieval.

As an example, suppose we have some unidentified and untagged song for which we would like to identify the artist, propose some tags, and/or find similar songs. By designing vector spaces using MUSCLES, we can retrieve a list of possible artists by taking the inner product between the feature vector of the song and our learned artist vectors. Following the definition of Weston et al., we judge as similar those giving the highest magnitudes. Similarly, we can retrieve a list of possible tags by taking the inner product between the feature vector of the song and our learned tag vectors. And we can retrieve a list of similar songs by taking the inner product between the feature vector of the song and those of our training set (which is a typical approach, but by no means a good approach, to judging music similarity).
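
The retrieval step described above amounts to ranking candidates by an inner product. Here is a minimal sketch, assuming the song's features have already been mapped into the same space as the artist (or tag) vectors. The vectors, the names, and the `by_magnitude` flag are hypothetical; the flag is there only so the magnitude-based ranking described in this post can be compared with a signed one.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k(query, candidates, k, by_magnitude=False):
    """Names of the k candidates scoring highest against the query vector."""
    def score(name):
        value = dot(query, candidates[name])
        return abs(value) if by_magnitude else value
    return sorted(candidates, key=score, reverse=True)[:k]


# Hypothetical 2-D vectors; learned spaces would have many more dimensions.
song = (1.0, 0.0)
artists = {
    "artist_a": (0.9, 0.1),
    "artist_b": (0.0, 1.0),
    "artist_c": (-1.0, 0.0),
}

signed = top_k(song, artists, k=2)
unsigned = top_k(song, artists, k=2, by_magnitude=True)
print(signed, unsigned)
```

The same function would serve for tags or for training songs by swapping in a different dictionary of candidate vectors.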

To learn these vector spaces, MUSCLES uses what appears to be constrained optimization (the norms of the vectors in each space are limited) using a simple gradient descent of a cost function within a randomly selected vector space (stochastic gradient descent). Since they are interested in increasing precision, the authors optimize with respect to either margin ranking loss or weighted approximate-rank pairwise loss (both relevant for the figure of merit precision @ k). Furthermore, when one wishes to design a system that can do all three tasks (artist prediction, tag prediction, and music similarity), MUSCLES considers all three vector spaces, and optimizes each one in turn by holding the others constant.
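
As a rough picture of what one such update might look like, here is a sketch of a single stochastic gradient step on a margin ranking loss, followed by rescaling to respect a norm limit. It is not the procedure of Weston et al.: the step size, margin, norm limit, and the choice to hold the song vector fixed while moving one correct and one incorrect label vector are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project(v, max_norm):
    """Rescale v onto the ball of radius max_norm if it lies outside."""
    norm = math.sqrt(dot(v, v))
    if norm <= max_norm:
        return list(v)
    return [max_norm * a / norm for a in v]


def margin_ranking_step(x, pos, neg, lr=0.1, margin=1.0, max_norm=1.0):
    """One SGD step on max(0, margin - <pos, x> + <neg, x>), with x fixed."""
    loss = max(0.0, margin - dot(pos, x) + dot(neg, x))
    if loss > 0.0:
        # Gradient of the loss is -x with respect to pos and +x with respect to neg.
        pos = [p + lr * a for p, a in zip(pos, x)]
        neg = [n - lr * a for n, a in zip(neg, x)]
    return project(pos, max_norm), project(neg, max_norm), loss


# Hypothetical song vector, one correct label vector, one incorrect one.
x = [1.0, 0.0]
pos = [0.0, 0.0]
neg = [0.5, 0.0]

losses = []
for _ in range(10):
    pos, neg, loss = margin_ranking_step(x, pos, neg)
    losses.append(loss)
print(losses)
```

Each step pushes the correct label toward the song and the incorrect one away until the margin is satisfied, after which the vectors stop moving.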

Weston et al. test MUSCLES using the TagATune dataset (16,289 song clips in training, 6,499 clips for testing, 160 unique tags applied to all data), as well as a private one they assemble to test artist identification and similarity (275,930 training clips, 66,072 test clips, from 26,972 artists). Their audio features are essentially histograms of codebook vectors (learned using K-means from the training set). They compare the performance of MUSCLES against a one-vs-rest SVM, and show MUSCLES has better mean p@k for several k (and with statistical significance at alpha = 0.05) when predicting tags for the TagATune dataset. For artist identification in the bigger dataset, we see MUSCLES performs better than the SVM in artist prediction, song prediction, and song similarity (judged by whether the artist is the same). Finally, Weston et al. show MUSCLES not only requires much less overhead than an SVM, but also that a nice by-product of the MUSCLES learning approach is sanity checking of the models. In other words, we can check the appropriateness of vectors learned for tags by simply taking inner products and inspecting whether those close together contain similar concepts.

Overall, this is a very nice paper with many interesting ideas. I am curious about a few things. First, how does one deal with near duplicates of tags? For instance, "hip hop" and "hip-hop" are treated separately by many tagging services, but they are essentially the same. So, does one need to preprocess the tag data before MUSCLES, or can MUSCLES automatically weed out such spurious duplicates? Second, in order to use MUSCLES, the feature dimensions of each space must be the same. This restriction is necessary, but how does one set that dimensionality? What happens when we have only a few artists, but many years of their recordings? For instance, many musicians change their style over the years. Will MUSCLES automatically find a vector for young Beethoven and old Beethoven? I like the machinery that comes with vector spaces; however, I don't think it makes sense to think of artists or timbre spaces in the same way as tag spaces. For instance, I mention above that Britney Spears and Christina Aguilera should point in the same direction, but Engelbert Humperdinck and The Rubberbandits should be nearly orthogonal. However, should the tags "sad" and "happy" be orthogonal, or point in exact opposite directions? What does it mean for two artists to point in opposite directions? Or two song textures? A further problem is that MUSCLES judges similarity by the magnitude of the inner product. In such a case, if "sad" and "happy" point in exact opposite directions, then MUSCLES will say they are highly similar.
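
The last point can be written out in one line. The symbols below are my own notation, assuming unit-norm tag vectors, with u for "sad", v for "happy" pointing exactly opposite, and w for any tag orthogonal to u.

```latex
\[
v = -u,\ \|u\| = 1
\quad\Longrightarrow\quad
\langle u, v \rangle = -\|u\|^{2} = -1,
\qquad
|\langle u, v \rangle| = 1 = |\langle u, u \rangle|,
\qquad
|\langle u, w \rangle| = 0 \ \text{ for } w \perp u .
\]
```

So when ranking by magnitude, the exact opposite of a tag ties with the tag itself, and both outrank an unrelated (orthogonal) tag.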

Work in-between

Most of the past year I have unexpectedly devoted to research in music genre recognition. In major part, this comes from a "discovery" that the most widely-used publicly-available dataset for work in this area has repetitions, artist duplication, and mislabelings. I am now putting some finishing touches on my thorough analysis of this dataset and the effects of its faults on the results produced with it. Here is a little graphic from my article. [Figure: ex01-1.png] This shows the highest accuracies reported in all the papers I find testing genre recognition systems using all 10 classes of the GTZAN dataset. The numbers cross-reference the citations in my article (which took me a day to figure out how to do automatically :). The legend shows symbols for works that use two-fold cross validation (2fCV), and so on. The four red "x" are results that are incorrect, e.g., this. The top gray line shows the maximum accuracy I estimate (optimistically) when considering the mislabelings in the dataset. If a system scores above it, then it might not be as good as a system that scores below it, with respect to recognizing genre (and we all should know now that classification accuracy is not enough :). The dashed gray line is the maximum accuracy I get using high-performing systems (two get above 83% accuracy on GTZAN) tested on a version of GTZAN missing replicas, and using artist filtering (the same artist does not appear in the training and test sets). Nearly all of the work we find lies between the two lines. And none of the work shown uses an artist filter. Hence, we have over 90 papers that contain a decade of clearly optimistic (and quite possibly wrong) results.
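
For readers unfamiliar with artist filtering, here is a minimal sketch of such a split. The track list, artist names, and function name are hypothetical, and note that the split fraction here is taken over artists, not over tracks; this is not the partitioning used in my article.

```python
import random


def artist_filtered_split(tracks, test_fraction=0.3, seed=0):
    """Split (track_id, artist) pairs so no artist appears on both sides."""
    artists = sorted({artist for _, artist in tracks})
    rng = random.Random(seed)
    rng.shuffle(artists)
    # Hold out whole artists; at least one artist goes to the test set.
    n_test = max(1, round(test_fraction * len(artists)))
    test_artists = set(artists[:n_test])
    train = [t for t in tracks if t[1] not in test_artists]
    test = [t for t in tracks if t[1] in test_artists]
    return train, test


# Hypothetical track list: (track id, artist).
tracks = [
    ("t1", "A"), ("t2", "A"), ("t3", "B"),
    ("t4", "C"), ("t5", "C"), ("t6", "D"),
]
train, test = artist_filtered_split(tracks)

train_artists = {artist for _, artist in train}
test_artists = {artist for _, artist in test}
print(sorted(train_artists), sorted(test_artists))
```

Splitting at the level of artists rather than tracks is what keeps a system from scoring well merely by recognizing a performer it has already heard.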

About this Archive

This page is an archive of entries from January 2013 listed from newest to oldest.
