Paper of the Day (Po'D): Multi-tasking with joint semantic spaces

Hello, and welcome to the Paper of the Day (Po'D): Multi-tasking with joint semantic spaces edition. Today's paper is: J. Weston, S. Bengio and P. Hamel, "Multi-tasking with joint semantic spaces for large-scale music annotation and retrieval," J. New Music Research, vol. 40, no. 4, pp. 337-348, 2011.

This article proposes and tests a novel approach (written MUSLSE but pronounced "muscles") for describing a music signal along multiple directions, including semantically meaningful ones. This work is especially relevant since it applies to problems that remain unsolved, such as artist identification and music recommendation (indeed, the first two authors are employees of Google). The proposed method models a song (or a short excerpt of a song) as a triple in three vector spaces learned from a training dataset: one vector space is created from artists, one from tags, and the last from features of the audio. The benefit of using vector spaces is that they bring quantitative and well-defined machinery, e.g., projections and distances.

MUSCLES attempts to learn the vector spaces jointly so as to preserve (dis)similarity. For instance, vectors mapped from artists that are similar (e.g., Britney Spears and Christina Aguilera) should point in nearly the same direction, while those that are not similar (e.g., Engelbert Humperdinck and The Rubberbandits) should be nearly orthogonal. Similarly, vectors mapped from tags that are semantically close (e.g., "dark" and "moody") should point in nearly the same direction, and those from tags that are semantically disjoint (e.g., "teenage death song" and "NYC") should be nearly orthogonal. For features extracted from the audio, one hopes the features themselves are comparable, and able to reflect some notion of similarity at least at the surface level of the audio. MUSCLES takes this a step further by learning the vector spaces so that one can take inner products between vectors from different spaces --- which is definitely a novel concept in music information retrieval.
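As a toy illustration of this geometry (with made-up three-dimensional vectors, not the paper's learned embeddings), similar artists should give a cosine near one, and dissimilar artists a cosine near zero:

```python
import numpy as np

# Hypothetical artist embeddings in a shared space; the coordinates are
# invented purely to illustrate "nearly same direction" vs. "nearly orthogonal".
britney = np.array([0.9, 0.1, 0.0])
christina = np.array([0.85, 0.2, 0.05])   # a similar pop artist
humperdinck = np.array([0.0, 0.05, 0.95])  # a dissimilar artist

def cosine(u, v):
    # Cosine of the angle between two embedding vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(britney, christina))    # close to 1: nearly the same direction
print(cosine(britney, humperdinck))  # close to 0: nearly orthogonal
```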

As an example, suppose we have some unidentified and untagged song for which we would like to identify the artist, propose some tags, and/or find similar songs. With vector spaces designed using MUSCLES, we can retrieve a list of possible artists by taking the inner product between the feature vector of the song and each of our learned artist vectors. Following Weston et al., we judge as most likely those artists giving the highest magnitudes. Similarly, we can retrieve a list of possible tags by taking the inner product between the feature vector of the song and our learned tag vectors. And we can retrieve a list of similar songs by taking the inner product between the feature vector of the song and those of our training set (a typical approach, but by no means a good one, to judging music similarity).
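A minimal sketch of that retrieval step, with random stand-in vectors in place of learned embeddings (`rank_by_inner_product` is a hypothetical helper, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(42)
d = 8  # shared embedding dimensionality (an assumption for this sketch)

# Stand-ins for learned artist embeddings and a song's feature embedding
artist_vecs = {f"artist_{i}": rng.normal(size=d) for i in range(5)}
song_vec = rng.normal(size=d)

def rank_by_inner_product(query, candidates):
    # Score each candidate by its inner product with the query, and
    # return candidate names ordered by largest magnitude first
    scores = {name: float(np.dot(query, v)) for name, v in candidates.items()}
    return sorted(scores, key=lambda name: abs(scores[name]), reverse=True)

print(rank_by_inner_product(song_vec, artist_vecs))
```

The same routine serves tag prediction and song similarity; only the candidate set changes.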

To learn these vector spaces, MUSCLES uses what appears to be constrained optimization (the norms of the vectors in each space are bounded), using simple gradient descent of a cost function within a randomly selected vector space (stochastic gradient descent). Since they are interested in increasing precision, the authors optimize with respect to either the margin ranking loss or the weighted approximate-rank pairwise (WARP) loss --- both relevant for the figure of merit precision at k. Furthermore, when one wishes to design a system that can do all three tasks (artist prediction, tag prediction, and music similarity), MUSCLES considers all three vector spaces, and optimizes each one in turn while holding the others constant.
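A rough sketch of one such stochastic update under the margin ranking loss, with the norm constraint enforced by projection; the learning rate, margin, and norm bound here are assumptions for illustration, not the paper's settings:

```python
import numpy as np

def _project(v, max_norm):
    # Scale v back onto the ball of radius max_norm if it escapes
    n = np.linalg.norm(v)
    return v if n <= max_norm else v * (max_norm / n)

def margin_rank_step(x, a_pos, a_neg, lr=0.1, margin=1.0, max_norm=1.0):
    """One stochastic step on the hinge loss
    max(0, margin - <x, a_pos> + <x, a_neg>), where x is a song's
    feature embedding, a_pos its true artist's vector, and a_neg a
    sampled negative artist's vector."""
    if margin - np.dot(x, a_pos) + np.dot(x, a_neg) > 0:
        # Only violated triples produce a gradient; all updates use old values
        x, a_pos, a_neg = (x + lr * (a_pos - a_neg),  # d(loss)/dx = a_neg - a_pos
                           a_pos + lr * x,            # d(loss)/da_pos = -x
                           a_neg - lr * x)            # d(loss)/da_neg = +x
        # Project each vector back inside the norm constraint
        x, a_pos, a_neg = (_project(v, max_norm) for v in (x, a_pos, a_neg))
    return x, a_pos, a_neg
```

One such step reduces the hinge loss of a violated triple; iterating over randomly drawn triples gives the stochastic descent described above.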

Weston et al. test MUSCLES using the TagATune dataset (16,289 song clips for training, 6,499 clips for testing, 160 unique tags applied to all data), as well as a private one they assemble to test artist identification and similarity (275,930 training clips and 66,072 test clips from 26,972 artists). Their audio features are essentially histograms of codebook vectors (learned using K-means from the training set). They compare the performance of MUSCLES against a one-vs-rest SVM, and show MUSCLES has better mean p@k for several k (with statistical significance at alpha = 0.05) when predicting tags for the TagATune dataset. For artist identification in the bigger dataset, MUSCLES performs better than the SVM in artist prediction, song prediction, and song similarity (judged by whether the artist is the same). Finally, Weston et al. show that MUSCLES not only requires much less overhead than an SVM, but also that a nice by-product of its learning approach is a sanity check of the models: we can check the appropriateness of the vectors learned for tags simply by taking inner products and inspecting whether those close together contain similar concepts.
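For concreteness, the figure of merit precision at k can be computed as follows (the tags here are invented for the example):

```python
def precision_at_k(ranked, relevant, k):
    # Fraction of the top-k retrieved items that are in the relevant set
    return sum(1 for item in ranked[:k] if item in relevant) / k

# Hypothetical system output (best first) and ground-truth tags for one clip
ranked_tags = ["rock", "guitar", "loud", "jazz", "piano"]
true_tags = {"rock", "loud", "piano"}

print(precision_at_k(ranked_tags, true_tags, 3))  # 2 of the top 3 are relevant: 2/3
print(precision_at_k(ranked_tags, true_tags, 5))  # 3 of the top 5 are relevant: 3/5
```

Mean p@k is then this quantity averaged over the test clips.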

Overall, this is a very nice paper with many interesting ideas. I am curious about a few things. First, how does one deal with near duplicates of tags? For instance, "hip hop" and "hip-hop" are treated separately by many tagging services, but they are essentially the same. So, does one need to preprocess the tag data before MUSCLES, or can MUSCLES automatically weed out such spurious duplicates? Second, in order to use MUSCLES, the dimensions of the three vector spaces must be the same. This restriction is necessary, but how does one set that dimensionality? And what happens when we have only a few artists but many years of their recordings? Many musicians change their style over the years: will MUSCLES automatically find a vector for young Beethoven and another for old Beethoven? I like the machinery that comes with vector spaces; however, I don't think it makes sense to treat artist or timbre spaces in the same way as tag spaces. For instance, I mention above that Britney Spears and Christina Aguilera should point in the same direction, but Engelbert Humperdinck and The Rubberbandits should be nearly orthogonal. However, should the tags "sad" and "happy" be orthogonal, or point in exactly opposite directions? What does it mean for two artists to point in opposite directions? Or two song textures? A further problem is that MUSCLES judges similarity by the magnitude of the inner product: if "sad" and "happy" point in exactly opposite directions, then MUSCLES will say they are highly similar.
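That last worry is easy to make concrete with hypothetical vectors:

```python
import numpy as np

# If "sad" and "happy" were learned to point in exactly opposite directions,
# a magnitude-based inner-product score would call them maximally similar.
sad = np.array([1.0, 0.0])
happy = -sad                        # exactly opposite direction
orthogonal = np.array([0.0, 1.0])   # unrelated tag

print(abs(np.dot(sad, happy)))       # 1.0 --- scored as highly similar
print(abs(np.dot(sad, orthogonal)))  # 0.0 --- scored as dissimilar
```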

About this Entry

This page contains a single entry by Bob L. Sturm published on January 18, 2013 11:41 AM.
