Hello, and welcome to Paper of the Day (Po'D): Towards a universal representation for audio information retrieval and analysis Edition. Today's paper is, B. S. Jensen, R. Troelsgård, J. Larsen, and L. K. Hansen, "Towards a universal representation for audio information retrieval and analysis", Proc. ICASSP, 2013. My one line summary of this article:

A generative multi-modal topic model of music is built from low-level audio features, lyrics, and/or tags. Essentially, the paper proposes modeling a piece of music by the generative model depicted in a figure from the paper.

For the music signals considered in this paper, there are \(S\) songs (in the training dataset), \(M\) modalities (e.g., 3 if using lyrics, tags and audio features), and \(T\) "topics" (abstractions of the content or stuff in the modalities). For song \(s\) and modality \(m\), there are \(N_{sm}\) "tokens", each of which generates a "word", i.e., a feature extracted from that modality of that song. The goal is to model the words of a song as a random process involving the parameters \(\alpha, \{\beta_m\}\), and the latent variables \( \{\phi_t\}, \{z_{sm}\}\). This model "generates" each word of song \(s\) in modality \(m\) by using \(\alpha\) to draw a distribution over the topics, then drawing a topic from this distribution, then using \(\beta_m\) to draw for that topic a distribution over the vocabulary of "words" in modality \(m\), and finally drawing a "word" from that distribution. The lead author Bjørn Jensen has given me a quick tutorial on this, starting from latent semantic analysis (LSA), moving to probabilistic LSA (pLSA), and ending with latent Dirichlet allocation (LDA).
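To make the generative story concrete for myself, here is a minimal sketch of it in Python using only the standard library. It assumes the usual LDA reading of the figure: the topic distribution of a song is drawn from a Dirichlet with parameter \(\alpha\), and each topic's word distribution in modality \(m\) is drawn from a Dirichlet with parameter \(\beta_m\). The function names, toy sizes, and parameter values are my own and are not from the paper.

```python
import random

def sample_dirichlet(concentration, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [x / total for x in draws]

def sample_categorical(probs, rng):
    """Draw an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_song(alpha, phi, n_tokens, rng):
    """Generate the words of one song in every modality.

    alpha    : Dirichlet concentration over the T topics (list of length T)
    phi      : phi[m][t] is the word distribution of topic t in modality m
    n_tokens : n_tokens[m] is the number of tokens N_sm for this song
    """
    theta = sample_dirichlet(alpha, rng)  # this song's distribution over topics
    song = []
    for m, n in enumerate(n_tokens):
        words = []
        for _ in range(n):
            z = sample_categorical(theta, rng)      # topic assigned to this token
            w = sample_categorical(phi[m][z], rng)  # word drawn from that topic
            words.append(w)
        song.append(words)
    return theta, song

rng = random.Random(0)
T = 3                  # number of topics (toy value)
vocab_sizes = [5, 4]   # toy vocabulary sizes for M = 2 modalities
alpha = [0.5] * T
beta = [[0.5] * v for v in vocab_sizes]  # one symmetric beta_m per modality (assumed)
# One word distribution per modality and topic, each drawn from Dirichlet(beta_m).
phi = [[sample_dirichlet(beta[m], rng) for _ in range(T)]
       for m in range(len(vocab_sizes))]
theta, song = generate_song(alpha, phi, n_tokens=[8, 6], rng=rng)
```

Here `song[m]` holds the word indices generated for modality `m`, and `theta` is the topic mixture they were generated from. Learning the model amounts to running this story backwards: given only the words, infer `theta` and `phi`.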

First, LSA. We observe a set of documents \(\{d_i\}\), and each document is a set of words \(\{w_j\}\). We might want to discover in our set of documents what topics there are and what words compose the topics. We might want to find relevant documents in our set given a query. Or we might want to be able to predict the topics of an unseen document. So, we build a word co-occurrence matrix \(\MD\), where each column is a document and each row is a word. Each element is the frequency of a word in a document. We posit that each of our documents is explained by a collection of words (observables) associated with several topics (latent variables). This is then a matrix factorization problem. We can perform PCA, or SVD, or non-negative matrix factorization, to obtain: \(\MD \approx \Phi\Theta\). Each column of \(\Phi\) is a topic, and each row of that column gives the frequency of a word characteristic of the topic. Each column of \(\Theta\) describes how a document in our collection is composed of these topics.
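As a sanity check on the factorization \(\MD \approx \Phi\Theta\), here is a small sketch of the non-negative matrix factorization option, using the standard multiplicative updates for squared error. It is pure Python to stay self-contained; the toy count matrix, the number of topics, and all function names are assumptions of mine for illustration.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def frobenius_error(D, Phi, Theta):
    """Squared Frobenius norm of D - Phi Theta."""
    R = matmul(Phi, Theta)
    return sum((d - r) ** 2 for drow, rrow in zip(D, R) for d, r in zip(drow, rrow))

def nmf(D, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor D (words x documents) as Phi (words x topics) times Theta (topics x documents)."""
    rng = random.Random(seed)
    n_words, n_docs = len(D), len(D[0])
    Phi = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n_words)]
    Theta = [[rng.random() + 0.1 for _ in range(n_docs)] for _ in range(n_topics)]
    initial_error = frobenius_error(D, Phi, Theta)
    for _ in range(n_iter):
        # Theta <- Theta * (Phi^T D) / (Phi^T Phi Theta), elementwise
        PhiT = transpose(Phi)
        num = matmul(PhiT, D)
        den = matmul(matmul(PhiT, Phi), Theta)
        Theta = [[t * n / (d + eps) for t, n, d in zip(tr, nr, dr)]
                 for tr, nr, dr in zip(Theta, num, den)]
        # Phi <- Phi * (D Theta^T) / (Phi Theta Theta^T), elementwise
        ThetaT = transpose(Theta)
        num = matmul(D, ThetaT)
        den = matmul(Phi, matmul(Theta, ThetaT))
        Phi = [[p * n / (d + eps) for p, n, d in zip(pr, nr, dr)]
               for pr, nr, dr in zip(Phi, num, den)]
    return Phi, Theta, initial_error

# Toy counts: rows are words, columns are documents.
D = [
    [4, 3, 0, 0],
    [3, 5, 0, 1],
    [0, 0, 5, 4],
    [0, 1, 4, 6],
]
Phi, Theta, initial_error = nmf(D, n_topics=2)
final_error = frobenius_error(D, Phi, Theta)
```

The toy matrix has two obvious blocks (words 0 and 1 in documents 0 and 1, words 2 and 3 in documents 2 and 3), so with two topics the columns of `Phi` should roughly pick out those two word groups. Unlike SVD or PCA, this variant keeps every entry of `Phi` and `Theta` non-negative, which makes them easier to read as frequencies.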

Now, pLSA. For our set of documents, what we are really interested in is discovering the joint probability of document-word co-occurrences: \(p(d,\vw)\), where \(\vw\) is a vector of word co-occurrences. Assuming that a document is created from topics, and words spring from these topics, and that the document and its words are conditionally independent given a topic, we can express this joint probability as $$ p(d,\vw) = \sum_{z\in Z} p(d,\vw|z) p(z) = \sum_{z\in Z} p(d|z) p(\vw|z) p(z) = p(d) \sum_{z\in Z} p(\vw|z) p(z|d) $$ where \(Z\) is the set of topics. Now, we have to learn from our set of documents the conditional probabilities \(\{p(\vw|z)\}_Z\) describing the underlying set of topics in terms of the word frequencies, and we have to learn the topical composition of our documents \(\{p(z|d)\}\). This can be achieved using Markov Chain Monte Carlo (MCMC) methods to discover the distributions that maximize \(p(d,\vw)\) over our set of documents. (Note to self: review MCMC.) With this model then, we can do some of what we set out to do with LSA: discover in our set of documents what topics there are, what words compose the topics, and what topics are in a given document; or to find relevant documents in our set given a query. However, we cannot compute \(p(d^*,\vw)\) for a new document \(d^*\) because we do not know what generates \(p(z|d^*)\) for this document. By specifying a model of \(p(z|d)\), we move to LDA.
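To convince myself that the last two forms of the pLSA joint are the same thing, here is a tiny numerical check. For simplicity it uses a single word index \(w\) rather than the vector \(\vw\), and every probability in it is a made-up toy value.

```python
# Toy pLSA parameters for 2 documents, 2 topics and 3 words (all values made up).
p_d = [0.6, 0.4]                                   # p(d)
p_z_given_d = [[0.9, 0.1], [0.2, 0.8]]             # p(z|d), one row per document
p_w_given_z = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]   # p(w|z), one row per topic

n_docs, n_topics, n_words = 2, 2, 3

# Marginal p(z) = sum_d p(z|d) p(d), then Bayes' rule gives p(d|z).
p_z = [sum(p_z_given_d[d][z] * p_d[d] for d in range(n_docs))
       for z in range(n_topics)]
p_d_given_z = [[p_z_given_d[d][z] * p_d[d] / p_z[z] for d in range(n_docs)]
               for z in range(n_topics)]

def joint_symmetric(d, w):
    """p(d,w) = sum_z p(d|z) p(w|z) p(z)."""
    return sum(p_d_given_z[z][d] * p_w_given_z[z][w] * p_z[z]
               for z in range(n_topics))

def joint_asymmetric(d, w):
    """p(d,w) = p(d) sum_z p(w|z) p(z|d)."""
    return p_d[d] * sum(p_w_given_z[z][w] * p_z_given_d[d][z]
                        for z in range(n_topics))

# Sum of the joint over all (document, word) pairs; should be 1.
total = sum(joint_asymmetric(d, w) for d in range(n_docs) for w in range(n_words))
```

The two functions agree for every pair \((d, w)\) because \(p(d|z)p(z) = p(z|d)p(d)\). The asymmetric form is the one that exposes the problem with new documents: it needs a row of `p_z_given_d` for each document, and nothing in the model says where that row comes from for a document outside the training set.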

Now, in LDA we assume the topic distribution \(p(z|d)\), and perhaps the word distribution \(p(\vw|z)\), arise from probabilistic models with unknown parameters. The resulting model is a true generative model, in that each word of a document comes from sampling a topic from the sampled topic distribution, and then sampling a word from the sampled word distribution of that topic. (Note to self: learn what that even means.) With such a model, we can now estimate \(p(z|d^*)\) for a new document by a fold-in procedure (Note to self: see previous Note to self.), and thus \(p(d^*,\vw)\). We can now answer such questions as: how likely is it that this new document was produced by the topics of our model? What are the topics of this new document?
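The "how likely is this new document" question can be sketched as follows, assuming we already have an estimate of \(p(z|d^*)\) and the trained word distributions. The function name and all numbers are hypothetical, and words are treated as independent draws given the topic mixture.

```python
import math

def log_likelihood(words, theta, phi):
    """Log-probability of the words given a topic mixture.

    words : list of word indices observed in the new document
    theta : theta[z] is the estimated p(z|d*)
    phi   : phi[z][w] is the trained p(w|z)
    """
    return sum(
        math.log(sum(t * topic[w] for t, topic in zip(theta, phi)))
        for w in words
    )

phi = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]  # toy p(w|z) for 2 topics and 3 words
doc = [0, 0, 1, 0]                        # a new document dominated by word 0

ll_topic0 = log_likelihood(doc, [0.9, 0.1], phi)  # mixture leaning on topic 0
ll_topic1 = log_likelihood(doc, [0.1, 0.9], phi)  # mixture leaning on topic 1
```

Since word 0 is characteristic of topic 0 in this toy `phi`, the mixture that leans on topic 0 explains the document better, so `ll_topic0` is larger than `ll_topic1`.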

Now, this Po'D considers modeling document co-occurrences with multiple modalities. So, it aims to solve $$ p(d,\vw_1, \vw_2, \ldots, \vw_M) = p(d) \sum_{z\in Z} p(z|d) \prod_{m=1}^{M} p(\vw_m|z) $$ where \(\{\vw_m\}_{m=1}^{M}\) is the set of co-occurrences of the document with each modality \(m\). The assumption here is that a document is conditionally independent of all modalities given the topic, and that the modalities are conditionally independent of each other given the topic. This is exactly the model in the figure above. Given a trained model and a new song, one can estimate \(p(z|d^*)\) by holding all other quantities constant, using a portion of \(d^*\) ("fold-in"), and sampling using MCMC.
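Here is my attempt at sketching what fold-in with MCMC could look like: a Gibbs sampler over the topic assignments of the new song's tokens, with the trained word distributions held fixed. This is a simplified illustration under my own assumptions (a symmetric \(\alpha\), all tokens of the new song used rather than a portion, toy distributions), not the procedure from the paper.

```python
import random

def fold_in(tokens, phi, alpha, n_sweeps=200, burn_in=50, seed=0):
    """Estimate p(z|d*) for a new song by Gibbs sampling with phi held fixed.

    tokens : list of (m, w) pairs, meaning word w was observed in modality m
    phi    : phi[m][t][w] is the trained p(w|z=t) in modality m
    alpha  : symmetric Dirichlet concentration over topics (a float)
    """
    rng = random.Random(seed)
    n_topics = len(phi[0])
    # Random initial topic assignment for every token, plus per-topic counts.
    z = [rng.randrange(n_topics) for _ in tokens]
    counts = [0] * n_topics
    for t in z:
        counts[t] += 1
    theta_sum = [0.0] * n_topics
    n_kept = 0
    denom = len(tokens) + n_topics * alpha
    for sweep in range(n_sweeps):
        for i, (m, w) in enumerate(tokens):
            counts[z[i]] -= 1  # take token i out of the counts
            # Conditional of z_i given all other assignments and the fixed phi.
            weights = [(counts[t] + alpha) * phi[m][t][w] for t in range(n_topics)]
            z[i] = rng.choices(range(n_topics), weights=weights, k=1)[0]
            counts[z[i]] += 1
        if sweep >= burn_in:
            # Accumulate the smoothed topic proportions after burn-in.
            for t in range(n_topics):
                theta_sum[t] += (counts[t] + alpha) / denom
            n_kept += 1
    return [s / n_kept for s in theta_sum]

# Toy trained model: M = 2 modalities, 2 topics.
phi = [
    [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]],  # modality 0, vocabulary of 3 words
    [[0.9, 0.1], [0.1, 0.9]],            # modality 1, vocabulary of 2 words
]
# New song whose tokens mostly match topic 0 in both modalities.
new_song = [(0, 0), (0, 0), (0, 1), (1, 0), (1, 0), (0, 0)]
theta_new = fold_in(new_song, phi, alpha=0.5)
```

Both modalities share the same topic counts, which is how the conditional independence given the topic shows up in the sampler: a token from the tags pulls the topic mixture in the same way a token from the audio features does. For this toy song, `theta_new` should put most of its mass on topic 0.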

Before I proceed, it is now time to address those notes to myself.