The Latin Music Database (LMD) was created around 2007 by Silla et al. for use in a comparative evaluation of particular approaches for music genre classification. It has been used in the MIREX Latin music genre recognition task since 2009.

LMD is described in, C. N. Silla, A. L. Koerich, and C. A. A. Kaestner, "The Latin music database," in Proc. ISMIR, 2008. That paper describes the LMD as 3,227 song recordings, each labeled in one of ten different classes: Axé, Batchata, Bolero, Forró, Gaúcha, Merengue, Pagode, Salsa, Sertaneja, and Tango. This dataset is notable among those created for music genre recognition because it contains music outside the realm of Western popular music. Like the Ballroom dataset, each music recording is assigned a single label by "experts in Brazilian dance" according to the appropriate dance. However, unlike GTZAN and Ballroom, the audio data is not freely available; only pre-computed features are available for download.

Searching through the references of my music genre recognition survey, I find this dataset (or portions of it) has been used in the evaluations of music genre recognition systems in at least 16 conference papers and journal articles:

  1. Y. M. G. Costa, L. S. Oliveira, A. L. Koerich, and F. Gouyon. Music genre recognition using spectrograms. In Proc. Int. Conf. Systems, Signals and Image Process., 2011.
  2. Y. M. G. Costa, L. S. Oliveira, A. L. Koerich, and F. Gouyon. Comparing textural features for music genre classification. In Proc. IEEE World Cong. Comp. Intell., June 2012.
  3. Y.M.G. Costa, L.S. Oliveira, A.L. Koerich, F. Gouyon, and J.G. Martins. Music genre classification using LBP textural features. Signal Process., 92(11):2723-2737, Nov. 2012.
  4. S. Doraisamy and S. Golzari. Automatic musical genre classification and artificial immune recognition system. In Z. W. Ras and A. A. Wieczorkowska, editors, Advances in Music Information Retrieval, pages 390-402. Springer, 2010.
  5. N. A. Draman, C. Wilson, and S. Ling. Modified AIS-based classifier for music genre classification. In Proc. ISMIR, pages 369-374, 2010.
  6. T. Lidy, C. Silla, O. Cornelis, F. Gouyon, A. Rauber, C. A. A. Kaestner, and A. L. Koerich. On the suitability of state-of-the-art music information retrieval methods for analyzing, categorizing and accessing non-western and ethnic music collections. Signal Process., 90(4):1032-1048, 2010.
  7. M. Lopes, F. Gouyon, A. Koerich, and L. E. S. Oliveira. Selection of training instances for music genre classification. In Proc. ICPR, Istanbul, Turkey, 2010.
  8. G. Marques, T. Langlois, F. Gouyon, M. Lopes, and M. Sordo. Short-term feature space and music genre classification. J. New Music Research, 40(2):127-137, 2011.
  9. G. Marques, M. Lopes, M. Sordo, T. Langlois, and F. Gouyon. Additional evidence that common low-level features of individual audio frames are not representative of music genres. In Proc. SMC, Barcelona, Spain, July 2010.
  10. A. Schindler and A. Rauber. Capturing the temporal domain in echonest features for improved classification effectiveness. In Proc. Adaptive Multimedia Retrieval, Oct. 2012.
  11. C. Silla, A. Koerich, and C. Kaestner. Improving automatic music genre classification with hybrid content-based feature vectors. In Proc. Symp. Applied Comp., Sierre, Switzerland, Mar. 2010.
  12. C. N. Silla, A. Koerich, and C. Kaestner. Automatic music genre classification using ensembles of classifiers. In Proc. IEEE Int. Conf. Systems, Man, Cybernetics, pages 1687.1692, 2007.
  13. C. N. Silla, A. L. Koerich, and C. A. A. Kaestner. Feature selection in automatic music genre classification. In Proc. IEEE Int. Symp. Mulitmedia, pages 39-44, 2008.
  14. C. N. Silla, A. L. Koerich, and C. A. A. Kaestner. A feature selection approach for automatic music genre classification. Int. J. Semantic Computing, 3(2):183-208, 2009.
  15. C. Silla, C. Kaestner, and A. Koerich. Time-space ensemble strategies for automatic music genre classification. In Jaime Sichman, Helder Coelho, and Solange Rezende, editors, Advances in Artificial Intelligence, pages 339-348. Springer Berlin / Heidelberg, 2006.
  16. C.N. Silla and A. A. Freitas. Novel top-down approaches for hierarchical classification and their application to automatic music genre classification. In IEEE Int. Conf. Systems, Man, and Cybernetics, San Antonio, USA, Oct. 2009.
However, as for GTZAN, and as for Ballroom, it appears that researchers have taken for granted the integrity of LMD. I have acquired the audio for LMD, which has 3,229 song files (two more than stated by Silla et al.). Through my fingerprinting method (just a little Shazam-like implementation), I compare all songs in each class, and find 213 replicas. This is in spite of the cautions Silla et al. (2008) describe taking in creating LMD.

So far, I have only looked for replicas within each class, and not across classes; but we now know at least 6.5% of the dataset is replicated (which is greater than the 5% in GTZAN, and which we already know cannot be ignored). Below, I list the replicas I find.

One solution to the wine puzzle

| No Comments
I posed this puzzle a few days ago, partly as an exercise in experimental design.

What is different between the two test conditions?

One thing that is different is that the bottle shape is visible in the first condition, but not visible in the second. Even though we carefully covered the label to conceal its identity, we were naive to the fact that the bottle shape can carry information about the region from which the wine comes. The following picture shows the variety of shapes.

bottleshapes.jpg From the left, bottles of this shape usually come from Bordeaux; the next bottle shape from Burgundy; the next from Rhône; the next from Champagne; and the next from Côtes de Provence. In our dataset then, the bottle shape is confounded with its region. We thus inadvertently trained our experts not to recognize the region from which the wine comes based on the contents of the bottles, but instead on the shape of its container. Hence, when Team Småg can no longer see the bottle, they must guess.

I came up with this example to show how easy it can be to believe that one's learning experiment is valid, and that the figure of merit reflects real-world performance, but that is actually invalid due to an independent variable of which the experimenter is unaware. This fact is directly discussed as limitations by Ava Chase in her article, "Music descriminations by carp (Cyprinus carpio)":

Even a convincing demonstration of categorization can fail to identify the stimulus features that exert control at any given time, especially if the stimuli are complex. In particular, there can be uncertainty as to whether classification behavior had been under the stimulus control of the features in terms of which the experimenter had defined the categories or whether the subjects had discovered an effective discriminant of which the experimenter was unaware. The diversity of S+/S- pairings presumably rules out the possibility that the fish could have been relying on only a single discriminant, such as timbre, but a constant concern is the possible existence of a simple attribute that would have allowed the subjects merely to discriminate instead of categorizing. (emphasis mine)

Faults in the Ballroom dataset

The Ballroom dataset (BD) was created around 2004 by F. Gouyon et al. for use in a comparative evaluation of particular features for music genre classification. BD is described in the paper, F. Gouyon, S. Dixon, E. Pampalk, and G. Widmer, "Evaluating rhythmic descriptors for musical genre classification", in Proc. Int. Conf. Audio Eng. Society, June 2004.

[BD] contains excerpts from 698 pieces of music, around 30 seconds long. The audio quality of this data is quite low, it was originally fetched in real audio format [from], with a compression factor of almost 22 with respect to the common 44.1 kHz 16 bits mono WAV format. ... The data covers eight musical sub-genres of ballroom dance music: [Jive, Quickstep, Tango, Waltz, Viennese Waltz, Samba, Cha Cha Cha and Rumba].
Hello, and welcome to Paper of the Day (Po'D): Cross-validation and bootstrap edition. Today's paper is: R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection," in Proc. Int. Joint Conf. Artificial Intell., 1995.

Let's first get some things straight:
  • We have a space of instances \(\mathcal{V}\), many of which are unlabeled.
  • We have a set of possible labels, \(\mathcal{Y}\).
  • We have a dataset of \(n\) labeled instances from \(\mathcal{V}\): \(\mathcal{D} = \{x_1, \ldots, x_n\}\), where \(x_i = \{v_i \in \mathcal{V}, y_i \in \mathcal{Y}\}\).
  • We have an "inducer" \(\mathcal{I}\), which takes \(\mathcal{D}\) and creates a classifier \(\mathcal{C}\).
  • Hence, \(\mathcal{I}(\mathcal{D},v)\) is the label predicted by the classifier for a \(v\in\mathcal{V}\),
As Kohavi says, "Given a finite dataset, we would like to estimate the future performance of a classifier induced by the given inducer and dataset." Or, more specifically and with less redundancy: Given a finite \(\mathcal{D}\), we want to estimate the accuracy of a classifier induced by \(\mathcal{I}\). Kohavi defines the accuracy of \(\mathcal{C}\) as "the probability of correctly classifying a randomly selected instance."

Kohavi considers several options here for estimating accuracy. First, there is the "holdout method." Take some number of elements of \(\mathcal{D}\) for testing, and use the rest --- call this \(\mathcal{D}_t\) --- for training. Then, accuracy is estimated $$ \frac{1}{|\mathcal{D}\backslash\mathcal{D}_t|} \sum_{x_i \in \mathcal{D}\backslash\mathcal{D}_t} \delta(\mathcal{I}(\mathcal{D}_t,v_i),y_i) $$ where \(\delta(i,j) = 1\) if \(i = j\), and zero else. Then there is the "random subsampling method", which repeats "holdout" several times and averages the resulting estimates.

Then, there is the "cross-validation method". This is done in the same way as the "holdout method", except \(\mathcal{D}\) is partitioned into a number of \(k\) non-overlapping folds, and each fold is used as a testing dataset while the remainder is used for training. The accuracy is then estimated as the mean of the \(k\) test accuracies.

Then, there is the "bootstrap method". Define a training dataset \(\mathcal{D}_t\) as \(n\) labeled instances picked with replacement from \(\mathcal{D}\). The remaining instances are then used for testing. The 0.632 bootstrap estimate of accuracy is $$ 0.368 \textrm{acc}_s + 0.632 \frac{1}{|\mathcal{D}\backslash\mathcal{D}_t|} \sum_{x_i \in \mathcal{D}\backslash\mathcal{D}_t} \delta(\mathcal{I}(\mathcal{D}_t,v_i),y_i). $$ Kohavi seeks to determine how these approaches fare. His methodology is as follows. Take two inducers: C4.5 and naive Bayes. Take six US Irvine datasets, and create a random dataset with two classes. The datasets are selected based on having at least 500 instances, and for which "the learning curve for both algorithms did not flatten out ... before 100 instances." (I assume this means that the ordered set of classifiers induced from either algorithm using a incrementally growing training dataset show accuracies on the same training datasets that increase.) Then, estimate the "true accuracies of a real dataset" using the "random subsampling method". For instance, for the dataset "Breast cancer", 50 of the 699 instances are selected at random as the test set. The remainder is used for training. Then, a classifier is induced by one of the inducers and that training dataset, and its accuracy is computed. This is repeated 500 times, and the mean accuracy is computed. This is called the true accuracy.

Now, verbatim:
"To see how well an accuracy estimation method performs, we sampled instances from the dataset (uniformly without replacement), and created a training set of the desired size. We then ran the induction algorithm on the training set and tested the classifier on the rest of the instances of the dataset. This was repeated 50 times at points where the learning curve was sloping up. The same folds in cross-validation and the same samples in bootstrap were used for both algorithms compared."
It seems then that the training datasets are built by adding instances one by one to induce a classifier, looking at its accuracy on that training dataset, and either adding a new random instance if there is not enough gain, or stopping. I don't know what this means for cross-validation and bootstrap.

I will stop there because the missing details in the paper and its inconsistencies have thoroughly confused me, and I doubt its experiments are of much use. First, Kohavi claims that the main goal is to estimate the accuracy of a classifier induced by \(\mathcal{I}\) and a finite \(\mathcal{D}\); but then he conflates this with estimating the accuracies of several classifiers built from the same \(\mathcal{I}\) but several partitions of \(\mathcal{D}\). When is estimating the true accuracy of a \(\mathcal{C}\) the same as computing the accuracies of many classifiers built using a \(\mathcal{I}\) but several partitions of \(\mathcal{D}\)? The "holdout" and "bootstrap methods" are the only ones that are estimating the accuracy of a single \(\mathcal{C}\). The "random subsampling method" is estimating the accuracies of several different \(\mathcal{C}\); the "\(k\)-fold cross validation method" is estimating the accuracies of \(k\) different \(\mathcal{C}\); and the "leave one out method" is estimating the accuracies of \(n\) different \(\mathcal{C}\). These last three methods are instead saying something general about \(\mathcal{I}\) using \(\mathcal{D}\), and not anything specific about a particular \(\mathcal{C}\). (Could this be what he means in the title by "Model Selection"?) Second, Kohavi admits, "the true accuracies of a real dataset cannot be computed because we don't know the target concept". (I assume here he means the accuracy of a classifier on the set of all instances in \(\mathcal{V}\) from which a finite dataset is sampled.) Still, Kohavi claims, "... we can estimate the true accuracies using the holdout method" (which is actually the "random subsampling method" since he repeats holdout 500 times). Why? If that is true, then why not always do it that way?

In short, I do not think this paper provides any meaningful comparison of these methods for estimating the true accuracy of a classifier. Two methods say something about a single classifier and a dataset, and the other three say something about an inducer and a dataset by looking at several classifiers it induces. Comparing the numbers from all methods to those produced by the "random subsampling method" might not be saying anything at all about their usefulness in the real world if none of them actually pertain to the real world. What is needed, at least, is a dataset that has a known target concept. Then "true accuracies" can be accurately estimated, and compared to the results from the methods above. That is what I intend to do next (and which I am sure someone has already done).

A Wine Puzzle

| No Comments
A contest comes to town to find the best person at identifying wines. Not the kind of wines, but the region from which each comes, featuring wines from: Bordeaux, Burgundy, Alsace, and Côtes de Provence. A large cash prize, not to mention a lot of wine tasting, makes the opportunity hard to miss. Hence, I assemble a team (Team Smag), among which the prize will be divided equally if one of us wins.

To prepare Team Smag, I go to several stores to buy several bottles of wine from each region, and then cover the labels of the bottles to remove any identifying information. (I place a random ID on the bottom of the bottle to record its region, but to keep it secret from the others.) I randomly (but not haphazardly) put aside one third of the wines from each region for testing, and use the rest for training. My assistant will help train the other teammates. We train each person by repeating trials. In a trial, the participant sits in a room with several empty glasses. Outside the room, my assistant randomly selects a bottle from the training set, takes it into the room, pours a few sips of wine in a clean glass, and instructs the participant to drink and say from which region it comes. I then say whether they are correct or not. When a participant has 20 correct consecutive trials, they are considered trained. After all teammates have been trained, my assistant tests each one (a few days later) by having them identify the regions of the other 1/3 of the wines in the bottles I saved for testing. The testing is done in the same way as the training. We find that everyone scores nearly perfectly.

The day of the contest arrives. Along with all others, Team Smag pays the entry fees, and each member takes a seat in different areas of the arena. In front of each participant sits 16 glasses of wine (just a few sips in each), each numbered 1-16. There is a No. 2 pencil and a bubble sheet as well, to note the region of each wine. Great fun is had; and when all over, the bubble sheets are tallied and a big board displays the results. Not a single member of Team Smag scores much higher than random. What happened?
Dear taxpayers of Denmark,

You have been very gracious to support my research the past two years by Independent postdoc grant No. 11-105218 from Det Frie Forskningsråd. With the aid of this grant: I have produced 11 journal article submissions, 3 of which are now published, 1 that has been accepted, and 2 that are currently in review; I have produced 15 conference paper submissions, 12 of which are now published; I have given 20 public seminars on my research around Europe and the USA; and I have made 13 research visits of a week or more each.

Funny story: my DFF FTP postdoc grant proposal, titled, "Greedy Sparse Approximation and the Automatic Description of Audio and Music Data", is concerned with applying particular signal decomposition approaches to extracting content from audio data. However, very little of my resulting research directly addresses this goal. What I soon learned during the first quarter of my project is that the standard experimental framework practiced in a significant amount of research in this area is not as useful or illuminating as thought by its research community: invalid conclusions are being derived from poor evaluation using faulty data, which in turn can lead research widely astray, and waste valuable resources. The second quarter of my project led me to produce a thorough survey of the extent to which this is happening, and reveals a surprising lack of scientific practice in more than 500 published works. If the standard approach to evaluating a solution is faulty, then why continue to produce solutions and not change the evaluation? Hence, by the end of the first half of my project, it made no sense to produce new approaches to extracting information from audio signals and use the same poor evaluation to "measure" progress. Clearly, the most significant problem, and the one that must be addressed before all others, is how to scientifically evaluate such solutions. To this end, I devoted the last half of my project to answering why this has happened, and what can be done about it. As a result, many new directions have emerged that I will explore over the next several years with the various collaborators I have made during my DFF FTP Independent Postdoc.

Below the fold, I provide a detailed record of the results of my DFF FTP Independent Postdoc, both failures and successes.
Hello, and welcome to the Paper of the Day (Po'D): Music descriminations by carp (Cyprinus carpio) edition. Today's paper is an interesting one: A. R. Chase, "Music descriminations by carp (Cyprinus carpio)", Animal Learning & Behavior, vol. 29, no. 4, pp. 336-353, 2001. This paper details a series of experiments aiming to show that fish are capable of categorizing musical sound by genre. I am interested in this paper primarily because its core is about evaluating a "black box" in a complex and human-centered task; and it comes from a scientific discipline in which it is of paramount importance experiments have validity with respect to the scientific questions of interest. Hence, this paper might be relevant for evaluating music information systems (other black boxes) because of its scientific methodology. After a preliminary phase involving training three koi fish to discriminate between musical sound and silence, Chase performs four different experiments to train and test the fish in discriminating musical sound stimuli. The four experiments are designed to answer two specific questions:

  1. Can fish learn to discriminate between complex auditory stimuli that humans can also learn to discriminate?
  2. If so, does this behavior generalize to novel auditory stimuli satisfying the same discriminability?
While the results of the four experiments are clearly affirmative to both questions, questions remain as to what features the fish are using to perform this task. Two experiments rule out some local features, such as timbre.

As the complex auditory stimuli, Chase uses musical audio. The stimuli used in three experiments are differentiable by humans along ``genre'' or ``style'', i.e., ``blues'' music (recordings of music by John Lee Hooker, Muddy Waters, and Blues compilations) and ``classical'' music (recordings of music by Bach, Handel, Vivaldi, etc.). The stimuli used in the fourth experiment are differentiable by humans along melody. The results of the first three experiments show three fish capable in learning to discriminate between "blues" and "classical" music. Furthermore, probes with novel stimuli, and an iterative reversal test, argue that the fish were "exhibiting open-ended categorization" of the two kinds of stimuli.

The first two experiments do not show what features of the stimulus are being used by the fish, or whether they are performing the task "on the basis of deeper generic attributes", rather than an unknown discriminant. The third experiment thus attempts to answer whether the fish can learn to discriminate between musical sounds that are labeled either "blues" or "classical" without the cue of timbre. Music by John Lee Hooker, Bach, and Vivaldi were transcribed to MIDI, and rendered with the same and/or different instruments. The results show the fish learn to discriminate between the "styles", even when the signals are produced with the same MIDI instrument; but that performance decreases when the same instruments are used then when they are different. The final experiment attempts to answer whether the fish can learn to discriminate between two different melodies though they have the same rhythm and timbre. The stimuli are a melody of Paganini, and one where the notes are reversed, but the timing is preserved. The one fish completing the experiment shows a capacity to perform this task. Several trials controlled for local features, such as starting and ending notes. Further work is necessary to determine what features the fish are learning to use to perform this task; but the experiments definitely prove fish are able to work with complex auditory stimuli.

I really like how this paper shows the time and effort necessary to scientifically and efficiently answer real questions --- even though it treats musical genre in an artificial way (i.e., Aristotelian categorization). (To criticize this point any further will miss the message of the work.) The experiments it discusses take place over what appears to be at least two years, and which can't really take place over any shorter time-span because living subjects can only be rewarded with so much food at a time. Such a time commitment absolutely requires a solid experimental plan to minimize waste and maximize results. Chase performs four experiments (and a preliminary one), but answers several questions with each experiment, and also uses elements of one experiment to prepare the subjects for the next experiment. The work in this paper exemplifies the kinds of considerations taken for granted in much experimental work in my own disciplines (signal processing and machine learning applied to audio and music signals). Machines need no reward or reinforcement, but why should they be evaluated any differently than Chase evaluates Beauty, Oro and Pepi (the three fish)?

The final lecture schedule

| No Comments
As my grant is now winding down, I am in the final push to push out a few more articles, and go on an evangelism spree. Watch for when I am in your town!

  1. Wednesday Oct. 30, 18h30, OFAI, Vienna
  2. Wednesday Nov. 13, 15h30, MTG, Barcelona
  3. Monday Nov. 25, 14h00, Télécom ParisTech, Paris
  4. Tuesday Nov. 26, all day, AAU, Copenhagen
  5. Wednesday Dec. 4, 11h00 Fraunhofer IAIS, Bonn
My current show is called, "The crisis of evaluation in MIR".

Abstract: I critically address the "crisis of evaluation" in music information retrieval (MIR), with particular emphasis paid to music genre recognition, music mood recognition, and autotagging. I demonstrate four things: 1) many published results unknowingly use datasets with faults that render them meaningless; 2) state-of-the-art ("high classification accuracy") systems are fooled by irrelevant factors; 3) most published results are based upon an invalid evaluation design; and 4) a lot of work has unknowingly built, tuned, tested, compared and advertised "horses" instead of solutions. (The example of the horse Clever Hans provides an appropriate illustration.) I argue these problems occur because: 1) many researchers assume a dataset is a good dataset because many others use it; 2) many researchers assume evaluation that is standard in machine learning or information retrieval are useful and relevant for MIR; 3) many researchers mistake systematic, rigorous, and standardized evaluation for being scientific evaluation; and 4) problems and success criteria remain ill-defined, and thus evaluation poor, because researchers do not define appropriate use cases. I show how this "crisis of evaluation" can be addressed by formalizing evaluation in MIR to make clear its aims, parts, design, execution, interpretation, and assumptions. I also present several alternative evaluation approaches that can separate horses from solutions.

PhD course Nov. 25-29, 2013

| No Comments
Topics in Music Analysis, Cognition and Synthesis,

Doctoral School of Engineering and Science at Aalborg University Copenhagen

The course consists of three parts covering the topics of music analysis, cognition, and synthesis with emphasis on recent advances. The first part covers methods and models for music analysis. This includes statistical methods for parameterization of music signals and methods for music information retrieval. In the second part, models of music perception and cognition are covered, including analysis of symbolic representations of music. Finally, physical models of music instruments are covered in the last part along with parametric and interactive methods for synthesis and spatialization of sounds.

Prerequisites: Basic knowledge of sound and music computing
Hello, and welcome to Paper of the Day (Po'D): Towards a universal representation for audio information retrieval and analysis Edition. Today's paper is, B. S. Jensen, R. Troelsgård, J. Larsen, and L. K. Hansen, "Towards a universal representation for audio information retrieval and analysis", Proc. ICASSP, 2013. My one line summary of this article:

A generative multi-modal topic model of music is built from low-level audio features, lyrics, and/or tags.
Essentially, the paper proposes modeling a piece of music by the generative model depicted in a figure from the paper.

graphicalmodel.jpg For the music signals considered in this paper, there are \(S\) music data (in the training dataset), \(M\) modalities (e.g., 3 if using lyrics, tags and audio features), and \(T\) "topics" (abstractions of the content or stuff in the modalities). For song \(s\) and modality \(m\), there are \(N_{sm}\) "tokens", each of which generates a "word" i.e., the features extracted from that modality of that music. The goal is to model the words of some music data as a random process involving the parameters \(\alpha, \{\beta_m\}\), and latent variables \( \{\phi_t\}, \{z_{sm}\}\), and "tokens." This model "generates" each word of song \(s\) in modality \(m\) of music by drawing \(\alpha\) and creating a distribution over the topics, then drawing from this distribution a topic, then drawing a \(\beta_m\) and creating for the drawn topic a distribution over a vocabulary of "words" in modality \(m\), and finally drawing a "word" from that distribution. The lead author Bjørn Jensen has given me a quick tutorial in this starting from latent semantic analysis (LSA), moving to probabilistic LSA (pLSA), and ending with latent Dirichlet allocation.

First, LSA. We observe a set of documents \(\{d_i\}\), and each document is a set of words \(\{w_j\}\). We might want to discover in our set of documents what topics there are and what words compose the topics. We might want to find relevant documents in our set given a query. Or we might want to be able to predict the topics of an unseen document. So, we build a word co-occurrence matrix \(\MD\), where each column is a document and each row is a word. Each element is the frequency of a word in a document. We posit that each of our documents is explained by a collection of words (observables) associated with several topics (latent variable). This is then a matrix factorization problem. We can perform PCA, or SVD, or non-negative matrix factorization, to obtain: \(\MD \approx \Phi\Theta\). Each column of \(\Phi\) is a topic, and each row denotes a word frequency characteristic of the topic. Each column of \(\Theta\) describe how a document in our collection is composed by these topics.

Now, pLSA. For our set of documents, what we are really interested in is discovering the joint probability of document-word co-occurrences: \(p(d,\vw)\), where \(\vw\) is a vector of word co-occurrences. Assuming that a document is created from topics, and words spring from these topics, and that the document and its words are conditionally independent given a topic, we can express this joint probability as $$ p(d,\vw) = \sum_{z\in Z} p(d,\vw|z) p(z) = \sum_{z\in Z} p(d|z) p(\vw|z) p(z) = p(d) \sum_{z\in Z} p(\vw|z) p(z|d) $$ where \(Z\) is the set of topics. Now, we have to learn from our set of documents the conditional probabilities \(\{p(\vw|z)\}_Z\) describing the underlying set of topics in terms of the word frequencies, and we have to learn the topical composition of our documents \(\{p(z|d)\}\). This can be achieved using Markov Chain Monte Carlo (MCMC) methods to discover the distributions that maximize \(p(d,\vw)\) over our set of documents. (Note to self: review MCMC.) With this model then, we can do some of what we set out to do with LSA: discover in our set of documents what topics there are, what words compose the topics, and what topics are in a given document; or to find relevant documents in our set given a query. However, we cannot compute \(p(d^*,\vw)\) for a new document \(d^*\) because we do not know what generates \(p(z|d^*)\) for this document. By specifying a model of \(p(z|d)\), we move to LDA.

Now, in LDA we assume the topic distribution \(p(z|d)\), and perhaps the word distribution \(p(\vw|z)\), arise from probabilistic models with unknown parameters. The resulting model is a true generative model, in that each word of a document comes from sampling from the sampled topic distribution, and then sampling from a sampled word distribution of that topic. (Note to self: learn what that even means.) With such a model, we can now estimate for a new document, \(p(z|d^*)\) by a fold-in procedure (Note to self: see previous Note to self.), and thus \(p(d^*,\vw)\). We can now answer such questions as: how likely is it that this new document was produced by the topics of our model? What are the topics of this new document?

Now, this Po'D considers modeling document co-occurrences with multiple modalities. So, it aims to solve $$ p(d,\vw_1, \vw_2, \ldots, \vw_M) = p(d) \sum_{z\in Z} p(z|d) \prod_M p(\vw_m|z) $$ where \(\{\vw_m\}_M\) is the set of document \(m\)-modality co-occurrences, and the assumption here is that a document is conditionally independent of all modalities given the topics, and that all modalities are independent. This is exactly the model in the figure above. Given a trained model and a new song, one can estimate \(p(z|d^*)\) by holding all other quantities constant, using a portion of \(d^*\) ("fold-in"), and sampling using MCMC.

Before I proceed, it is now time to address those notes to myself.