# September 2013 Archives

## Paper of the Day (Po'D): Towards a universal representation for audio information retrieval and analysis Edition

Hello, and welcome to Paper of the Day (Po'D): Towards a universal representation for audio information retrieval and analysis Edition. Today's paper is, B. S. Jensen, R. Troelsgård, J. Larsen, and L. K. Hansen, "Towards a universal representation for audio information retrieval and analysis", Proc. ICASSP, 2013. My one line summary of this article:

A generative multi-modal topic model of music is built from low-level audio features, lyrics, and/or tags.
Essentially, the paper proposes modeling a piece of music by the generative model depicted in a figure from the paper.

For the music signals considered in this paper, there are $$S$$ music data (in the training dataset), $$M$$ modalities (e.g., 3 if using lyrics, tags and audio features), and $$T$$ "topics" (abstractions of the content or stuff in the modalities). For song $$s$$ and modality $$m$$, there are $$N_{sm}$$ "tokens", each of which generates a "word" i.e., the features extracted from that modality of that music. The goal is to model the words of some music data as a random process involving the parameters $$\alpha, \{\beta_m\}$$, and latent variables $$\{\phi_t\}, \{z_{sm}\}$$, and "tokens." This model "generates" each word of song $$s$$ in modality $$m$$ of music by drawing $$\alpha$$ and creating a distribution over the topics, then drawing from this distribution a topic, then drawing a $$\beta_m$$ and creating for the drawn topic a distribution over a vocabulary of "words" in modality $$m$$, and finally drawing a "word" from that distribution. The lead author Bjørn Jensen has given me a quick tutorial in this starting from latent semantic analysis (LSA), moving to probabilistic LSA (pLSA), and ending with latent Dirichlet allocation.

First, LSA. We observe a set of documents $$\{d_i\}$$, and each document is a set of words $$\{w_j\}$$. We might want to discover in our set of documents what topics there are and what words compose the topics. We might want to find relevant documents in our set given a query. Or we might want to be able to predict the topics of an unseen document. So, we build a word co-occurrence matrix $$\MD$$, where each column is a document and each row is a word. Each element is the frequency of a word in a document. We posit that each of our documents is explained by a collection of words (observables) associated with several topics (latent variable). This is then a matrix factorization problem. We can perform PCA, or SVD, or non-negative matrix factorization, to obtain: $$\MD \approx \Phi\Theta$$. Each column of $$\Phi$$ is a topic, and each row denotes a word frequency characteristic of the topic. Each column of $$\Theta$$ describe how a document in our collection is composed by these topics.

Now, pLSA. For our set of documents, what we are really interested in is discovering the joint probability of document-word co-occurrences: $$p(d,\vw)$$, where $$\vw$$ is a vector of word co-occurrences. Assuming that a document is created from topics, and words spring from these topics, and that the document and its words are conditionally independent given a topic, we can express this joint probability as $$p(d,\vw) = \sum_{z\in Z} p(d,\vw|z) p(z) = \sum_{z\in Z} p(d|z) p(\vw|z) p(z) = p(d) \sum_{z\in Z} p(\vw|z) p(z|d)$$ where $$Z$$ is the set of topics. Now, we have to learn from our set of documents the conditional probabilities $$\{p(\vw|z)\}_Z$$ describing the underlying set of topics in terms of the word frequencies, and we have to learn the topical composition of our documents $$\{p(z|d)\}$$. This can be achieved using Markov Chain Monte Carlo (MCMC) methods to discover the distributions that maximize $$p(d,\vw)$$ over our set of documents. (Note to self: review MCMC.) With this model then, we can do some of what we set out to do with LSA: discover in our set of documents what topics there are, what words compose the topics, and what topics are in a given document; or to find relevant documents in our set given a query. However, we cannot compute $$p(d^*,\vw)$$ for a new document $$d^*$$ because we do not know what generates $$p(z|d^*)$$ for this document. By specifying a model of $$p(z|d)$$, we move to LDA.

Now, in LDA we assume the topic distribution $$p(z|d)$$, and perhaps the word distribution $$p(\vw|z)$$, arise from probabilistic models with unknown parameters. The resulting model is a true generative model, in that each word of a document comes from sampling from the sampled topic distribution, and then sampling from a sampled word distribution of that topic. (Note to self: learn what that even means.) With such a model, we can now estimate for a new document, $$p(z|d^*)$$ by a fold-in procedure (Note to self: see previous Note to self.), and thus $$p(d^*,\vw)$$. We can now answer such questions as: how likely is it that this new document was produced by the topics of our model? What are the topics of this new document?

Now, this Po'D considers modeling document co-occurrences with multiple modalities. So, it aims to solve $$p(d,\vw_1, \vw_2, \ldots, \vw_M) = p(d) \sum_{z\in Z} p(z|d) \prod_M p(\vw_m|z)$$ where $$\{\vw_m\}_M$$ is the set of document $$m$$-modality co-occurrences, and the assumption here is that a document is conditionally independent of all modalities given the topics, and that all modalities are independent. This is exactly the model in the figure above. Given a trained model and a new song, one can estimate $$p(z|d^*)$$ by holding all other quantities constant, using a portion of $$d^*$$ ("fold-in"), and sampling using MCMC.

Before I proceed, it is now time to address those notes to myself.

## Formalizing Evaluation in Music Information Retrieval: A Look at the MIREX Automatic Mood Classification Task

One thing I have come to appreciate during the past two years is the necessity to employ formalism. Formalism is a way to see and work with things without ambiguity, to circumvent semantics, to find flaws and avoid them, and to make assumptions clear and their qualification of conclusions. I might have looked at such a sentence two years ago and thought it a senseless piece of self-serving gibberish irrelevant to the way I was working --- which was quite formal, if I may say so! I was using standardized datasets and accepted approaches to systematically evaluate algorithms for music genre recognition, not to mention recovery from compressive sampling. I was even testing for statistical significance!

And things were good; but, I began to see the necessity for something deeper than the standardized and systematic ways in which I was working. I then developed a deep appreciation for analysis; and learned first hand how bad ideas and wasted efforts can be avoided with a little analysis. And then the edifice of my standardized and systematic ways of working was cracked to its foundation.

And things were bad; but then I realized this summer the core problem.

## Paper of the Day (Po'D): Multi-label sparse coding for automatic image annotation Edition

Hello, and welcome to the Paper of the Day (Po'D): Multi-label sparse coding for automatic image annotation Edition. Today's paper is C. Wang, S. Yan, L. Zhang, and H.-J. Zhang, "Multi-label sparse coding for automatic image annotation," in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, pp. 1643-1650, 2009. My one line description of this work is:

Multilabel sparse coding is reported to produce good results.
I will go in reverse from how the paper presents the approach. Finally, we decode the sparse code of a query data. Given a "label matrix" $$\MC$$ of size $$N_c \times N$$ --- where the $$i$$th column denotes which of $$N_c$$ tags are relevant to the $$i$$th training data --- and a solution vector $$\alpha_t$$ from a query data, we find the rows of $$\MC\alpha_t$$ with the largest values. (This is never defined in the paper.) Since each row of $$\MC$$ is associated with one tag, we thereby select those tags relevant to the query.

Penultimately, we produce a sparse code for a query $$\MP\vx_t$$. To do this we find the solution vector $$\alpha_t$$ by solving $$\alpha_l = \arg\min_\alpha \lambda\|\alpha\|_1 + \frac{1}{2}\|\MP\vx_t - [\MP\MX | \MI]\alpha\|_2^2$$ where $$\MX = [\vx_1, \vx_2, \ldots, \vx_N]$$ is the $$N$$ training data, and $$\MP$$ is a projection. ($$\lambda$$ is not defined in the paper.) (Note that $$\alpha_l$$ is long, so we assume we chop off the end so only $$N_c$$ rows remain.)

Antepenultimately, we set $$\MP = \MI$$, or form the projection $$\MP$$ (Wang et al. refer to this as "multilabel linear embedding") by selecting from the eigenvectors of $$\MX\left[\MD - \MW_1 + \frac{\beta}{2}(\MI-\MW_2)^T(\MI-\MW_2)\right]\MX^T$$ where $$[\MD]_{ii} := \ve_i^T\MW_1\mathbf{1} - [\MW_1]_{ii}$$, $$\MW_1$$ and $$\MW_2$$ are "semantic graphs", and $$\beta = 0.1$$ in the paper. Each column of $$\MP^T$$ is an eigenvector of the above matrix, and we keep however many we want to keep. (For the data in the paper, this goes from a space of 40,960, to 1000-2000.)

Preantepenultimately, we create the semantic graphs of the training data in the following way. First, we create the label matrix $$\MC$$ from the train data. The $$i$$th column is non-zero only in the rows associated with the tags of the $$i$$th training vector. Then, all columns of $$\MC$$ are made unit norm. Define $$[\MW_1]_{ij} = 1$$ if $$\vc_i = \vc_j$$, and zero otherwise. Thus, $$\MW_1$$ specify which training data share the same set of tags. Second, the $$i$$th column of matrix $$\MW_2$$ is created in the following way. Remove the $$i$$th column of $$\MC$$ to create $$\MC'$$. Find a sparse representation of the $$i$$th column of $$\MC$$ by solving $$\beta_i = \arg\min_\beta \lambda\|\beta\|_1 + \frac{1}{2}\|\vc_i - [\MC' | \MI]\beta\|_2^2.$$ For $$1 \le j \le i-1$$, set $$[\MW_2]_{ij} = \beta_j$$; and for $$i+1 \le j \le N$$, set $$[\MW_2]_{ij} = \beta_{j-1}$$. ($$\lambda$$ here is not defined in the paper.\) $$\MW_1$$ and $$\MW_2$$ thus attempt to embody how the training data are related in a tag space, or semantically, rather than in the feature space. And so begins the procedure for multi-label sparse coding.

A variant of this approach was adopted for automatic tagging of music signals in, Y. Panagakis, C. Kotropoulos, and G. R. Arce, "Sparse multi-label linear embedding nonnegative tensor factorization for automatic music tagging," in Proc. EUSIPCO, (Aalborg, Denmark), pp. 492-496, Aug. 2010. Instead of posing the sparse representation problems above as a Lagrangian, they are posed as minimization subject to equality constraints. Furthermore, tensors are used rather than supervectors of features.

The empirical results of Wang et al. show that for two image datasets, performance is about the same when $$\MP$$ is designed by the above procedure, or when no "embedding" is done, i.e., $$\MP = \MI$$. Panagakis et al. use the embedding procedure, and report high performance in a music tagging dataset. Without the tensor approach, but still using embedding, the results are still seen to be competitive.

Bob L. Sturm, Associate Professor
Audio Analysis Lab
Aalborg University Copenhagen
A.C. Meyers Vænge 15
DK-2450 Copenahgen SV, Denmark
Email: bst_at_create.aau.dk