Hello, and welcome to Paper of the Day (Po'D): The problem of accuracy as an evaluation criterion edition. Today's paper is one that I found after I published my tirade against classification accuracy: E. Law, "The problem of accuracy as an evaluation criterion," in Proc. ICML, 2008. I certainly should have included it.

My one-line precis of this (position) paper is: To evaluate solutions proposed to address problems centered on humans, humans must be directly involved in the mix.
Law takes a brief look at a key problem in each of three different research domains in which machine learning is being applied:
  1. Delimiting regions of interest in an image.
  2. Translation between written languages.
  3. Recorded music autotagging.
In each, she raises concerns about the accepted evaluation approaches. For region of interest detection, one accepted measure of algorithm performance is based on the amount of area overlap between an algorithm's output rectangles and those in the ground truth. For machine translation, current metrics (e.g., BLEU, which is precision-like) do not take into consideration that many acceptable translations can exist. For music autotagging, a metric based only on the number of matching tags (precision and recall), while disregarding the meaning of "incorrect" tags, might not reveal significant differences between algorithms producing the same score. She puts it very nicely:
"The problem in using accuracy to compare learned and ground truth data is that we are comparing sets of things without explicitly stating which subset is more desirable than another."
For each problem, she argues that the metric used loses sight of the motivation behind solving the problem. For region of interest detection, it is object recognition. For machine translation, it is preservation of meaning. For music autotagging, it is facilitating information retrieval. Hence, humans must be involved in the evaluation. Including humans, of course, increases the cost of evaluation; but Law argues the evaluation process can be gamified, and made fun to do.
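To make the first example concrete, the area-overlap criterion in region of interest detection is typically computed as intersection over union (IoU) of rectangles. A minimal sketch (my own illustration, not code from Law's paper):

```python
# Intersection-over-union of two axis-aligned rectangles, each given as
# (x, y, width, height). This is the kind of overlap score Law critiques:
# it says nothing about whether the overlap captures the object of interest.

def iou(a, b):
    """Return the area of overlap divided by the area of the union."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Overlap extent along each axis (zero if the rectangles are disjoint).
    ox = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    oy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ox * oy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# Two unit squares offset by half a side overlap in a 0.5 x 1 strip:
print(iou((0, 0, 1, 1), (0.5, 0, 1, 1)))  # 0.5 / 1.5 = 0.333...
```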

I think Law's paper is very nice, has good clear examples, and provides an interesting alternative. However, I would broaden her thesis beyond metrics, because she is really taking aim at more than that (as am I). A discussion of which metric is more meaningful than another is unproductive without considering at the same time the design and the dataset used in an experiment (as well as the measurement model); and, before that, the explicitly specified hypotheses upon which the evaluation rests; and, before that, a well-defined (formal) description of the research problem. It is, I would argue, the whole enterprise of research problem solving that must be reconsidered.
Wordpress.com does not offer the LaTeX support I need (macros, and sensible delimitation; see this mess), and is apparently blocked in particular countries around the world. So, I am going to try to move this blog to QMUL.

This blog is moving!


This blog and all of its contents are moving to: High Noon GMT. (At least I think I moved most of its contents ...)

Sorry for the hassle!

Some ISMIR 2014 Papers


ISMIR 2014 was a fantastic event! I really enjoyed the venue, the food, the organization, and most of all the variety of work and the interesting discussions that resulted. Now that I am back, I want to review about 50 papers from it more closely. I include some of my notes below.



The authors hypothesize that contributing to human mood assignments to music are factors that are cultural, experiential, and dependent upon language proficiency. They conduct an experiment crossing three factors: participant origin ("Chinese in Canada", "Canadians of Chinese origin", "Canadians of non-Chinese origin"), "songs" stimuli ("the first 90 seconds" of 50 "very popular English-language songs of the 2000's"), stimuli presentation (lyrics only, music audio, lyrics and music audio). They use 100 participants (students at Waterloo, "33 Chinese living in Canada for less than 3 years", "33 Canadians, not of Chinese origin ... with English as their mother tongue", "34 Canadians of Chinese origin, born and brought up in Canada"). Each participant is instructed to label the mood of each stimulus in a presentation as one of the 5 clusters of the MIREX emotion model. Each participant labels 10 songs (first 3 only lyrics, next 3 only audio, last 4 audio+lyrics), contributing 1000 total responses covering all 50 songs. (The experimental design (mapping) is specified no further.)

In their analysis, the authors compute for each group a "distribution of responses", which I assume means an estimate of the joint probability P_origin(mood, song, presentation). This is what they wish to compare across groups. However, note that each song stimulus then receives only about 20 responses from all groups across all presentations. In each presentation, only 6 or 7 responses are given from all groups for one song. Each group then contributes around 1 or 2 responses for each song in each presentation. The estimate of the above joint probability should then be very poor.
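The counting argument can be made explicit; here is a back-of-envelope sketch using the numbers reported above (the per-presentation split is approximate, since each participant saw 3/3/4 stimuli across the three presentations):

```python
# Back-of-envelope cell counts for the design described above.
participants = 100
songs_per_participant = 10
total_responses = participants * songs_per_participant   # 1000

songs = 50
presentations = 3   # lyrics only, audio only, audio + lyrics
groups = 3          # the three participant-origin groups

per_song = total_responses / songs                  # 20 responses per song
per_song_presentation = per_song / presentations    # ~6.7 per song x presentation
per_cell = per_song_presentation / groups           # ~2.2 per song x presentation x group

print(per_song, per_song_presentation, per_cell)
```

Roughly two responses per cell is far too few to estimate a distribution over five mood clusters.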

I agree with the complaint that mood labeling of music is quite poorly defined, and highly ambiguous with regard to extrinsic influences. To me it seems obvious that labeling music with "emotion" is specific to an individual working within some cultural context that requires such labeling (a Halloween playlist, for instance). But this experiment as designed does not really address that hypothesis. For one, there are too few responses in the cross-factor design. Also, as a music listener who does not listen to the lyrics in music, I am skeptical of the relevance of a "lyrics only" presentation of "music". How is "lyrics", music?

Now, how to design the experiment to make a valid conclusion about the dependence of mood assignment on participant origin? I say ask a bunch of Western ears to label the moods of some classical Indian music using a variety of ragas and talas. Absurdity will result.


Transfer learning is the adaptation of models learned for some task (source) for some other task (target). In this paper, models are learned for music audio signals using one dataset (Million Song) for the source tasks "user listening preference prediction" and "tag prediction", and then adapted for the target tasks "genre classification" and "tag prediction". Essentially, the authors extract low-level features from audio spectrograms, perform dimensionality reduction, and then train multilayer perceptrons on the source task. These trained systems are then used to produce "high-level" features of a new dataset, which are then used to train an SVM for a different target task.
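A minimal sketch of such a pipeline, using scikit-learn with invented shapes and random placeholder data (not the authors' configuration, architecture, or datasets):

```python
# Sketch of a transfer-learning pipeline: train an MLP on a source task,
# reuse its hidden layer as a feature extractor, then train an SVM on a
# target task. All data here is random placeholder data.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Source task: low-level (e.g. spectrogram-derived) features and labels.
X_src = rng.normal(size=(500, 40))
y_src = rng.integers(0, 5, size=500)

# 1) Train an MLP on the source task.
mlp = MLPClassifier(hidden_layer_sizes=(30,), max_iter=500, random_state=0)
mlp.fit(X_src, y_src)

# 2) The trained hidden layer becomes a "high-level" feature extractor:
#    relu(X @ W + b), matching MLPClassifier's default activation.
def high_level(X):
    return np.maximum(0, X @ mlp.coefs_[0] + mlp.intercepts_[0])

# Target task: a different dataset with a different label set.
X_tgt = rng.normal(size=(200, 40))
y_tgt = rng.integers(0, 3, size=200)

# 3) Train an SVM on the transferred features for the target task.
svm = SVC().fit(high_level(X_tgt), y_tgt)
print(svm.score(high_level(X_tgt), y_tgt))
```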

The authors test the low-level features in "genre classification" and "tag prediction" using 5 different datasets. For instance, they use 10fCV in GTZAN and find an increase of accuracy from about 85% using the low-level features to about 88% using transfer learning. Experiments on other datasets show similar trends. They conclude, "We have shown that features learned in this fashion work well for other audio classification tasks on different datasets, consistently outperforming a purely unsupervised feature learning approach." This is not a valid conclusion since: 1) they do not control for all independent variables in the measurement models of the experiments (e.g. the faults in GTZAN make a significant contribution to the outcome), 2) they do not define the problems being solved (classification by any means? by relevant means?), and 3) they do not specify "work well" and "consistently outperforming". This approach appears to reproduce a lot of "ground truth" in some datasets, but the reproduction of ground truth does not imply that something relevant for content-based music classification has been learned and is being used.

Are these "high-level" features really closer to the "musical surface", i.e., music content? It would be interesting to redo the experiment using GTZAN but taking into consideration its faults. Also, of course, to subject it to the method of irrelevant transformations to see if it is relying on confounds in the dataset.


Association analysis is a data mining technique that finds relationships between sets of unique objects in order to build logical implications, i.e., if A then (probably) B. In this work, quantized features extracted from labeled acoustic signals are used to produce such rules. Those quantized features that appear frequently enough in signals with a particular label are then taken to imply that label. For instance, if many signals with label i have large (or small) values in feature dimension j at times {t_1, t_2}, then that is taken to imply label i.
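A toy version of this kind of rule mining, with invented quantized-feature "items" and an invented support threshold (none of this is from the paper):

```python
# Toy association mining: quantized feature "items" that co-occur often
# enough with a label are taken to imply that label. Data and the support
# threshold are invented for illustration.
from collections import Counter

# Each signal is (set of quantized feature items, label).
signals = [
    ({"f3_high@t1", "f7_low@t2"}, "forro"),
    ({"f3_high@t1", "f2_low@t2"}, "forro"),
    ({"f3_high@t1", "f7_low@t2"}, "forro"),
    ({"f5_high@t1", "f7_low@t2"}, "axe"),
    ({"f5_high@t1", "f2_high@t2"}, "axe"),
]

def rules(signals, min_support):
    """Items appearing in >= min_support of a label's signals imply it."""
    by_label = {}
    for items, label in signals:
        by_label.setdefault(label, []).append(items)
    out = {}
    for label, sets in by_label.items():
        counts = Counter(item for s in sets for item in s)
        out[label] = {i for i, c in counts.items() if c / len(sets) >= 0.6}
    return out

print(rules(signals, 0.6))
# "f3_high@t1" appears in all three "forro" signals, so it implies "forro".
```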

This paper reports experiments with the Latin music dataset (LMD) and a portion of the Million Song Dataset. In the LMD, MFCC features are extracted from the first 30 seconds of each song. (This means the features can include the applause and speaking that begin many of the "live" songs. Also, no artist filtering is used, and there is no consideration of all the replicas.) Results show that the proposed systems reproduce "ground truth" labels more often than random selection.

Regardless of the results, the evaluation design used in this work (Classify) is invalid with respect to genre recognition. Reproducing "ground truth" labels here does not provide any evidence that the rules learned have anything to do with the meaning of those labels in the dataset and in the real world. Taken to absurdity, that the first 30 seconds of audio recordings labeled "Forro" have a large first MFCC at 13.1 seconds and a small 8th MFCC at 25.2 seconds is not a particularly useful rule, or one that is at all relevant to the task. Furthermore, this work approaches the problem of music genre recognition as an Aristotelian one, and presupposes the low-level features are "content-based" features relevant to the undefined task of music genre classification. It would be nice if the problem of music genre recognition was like that, but it just isn't.


Transductive learning sidesteps the inductive step of building models of classes, and instead performs classification via similarity with exemplars. This is useful when there is not enough training data to build suitable models, or when only approximately good labels are desired. It is essentially a semi-supervised form of clustering.
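A minimal sketch of the transductive idea via label propagation over a similarity graph (my own toy construction, not the paper's bipartite heterogeneous network):

```python
# Toy transductive label propagation: labels spread from two labeled
# exemplars to unlabeled points through a similarity (affinity) matrix.
import numpy as np

rng = np.random.default_rng(1)

# Two well-separated 1-D clusters; only the first point of each is labeled.
X = np.concatenate([rng.normal(0.0, 0.1, 5), rng.normal(3.0, 0.1, 5)])
labeled = {0: 0, 5: 1}   # index -> class

# Gaussian affinity, row-normalized into a transition matrix.
W = np.exp(-(X[:, None] - X[None, :]) ** 2)
P = W / W.sum(axis=1, keepdims=True)

# One-hot label scores; propagate, clamping the labeled rows each step.
Y = np.zeros((10, 2))
for i, c in labeled.items():
    Y[i, c] = 1.0
for _ in range(50):
    Y = P @ Y
    for i, c in labeled.items():
        Y[i] = 0.0
        Y[i, c] = 1.0

print(Y.argmax(axis=1))  # first five points get class 0, last five class 1
```

The point is that everything hinges on the affinity matrix W: if similarity is gauged in an irrelevant space, the propagated labels are irrelevant too.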

This paper encodes low-level features (MFCCs) into bags of frames of features, and then builds a bipartite heterogeneous network to propagate labels through the network to unlabeled data. Experiments on labeling music (GTZAN and Homburg) show the approach reproduces some "ground truth", but no fault filtering is used in GTZAN. Unfortunately, the experiments in this work do not show whether the results come from considerations of the music, or from something else unrelated.

I like the idea of transductive learning because it appears based more on notions of similarity (or proximity in some metric space) than on building general models that may be unachievable or unrealistic. However, the sanity of this approach for genre recognition (or music description in general) is highly dependent on the space in which the similarity is gauged (of course). A space built from BFFs of MFCCs will likely have little to do with the high-level content used to judge the similarity of music. However, I can imagine several spaces for the same collection of music that emphasize specific high-level aspects of music, such as rhythm, key, instrumentation, and so on. Now, how to measure similarity in these spaces in a meaningful way?

Hello, and welcome to Paper of the Day (Po'D): Kiki-Bouba edition. Today's paper is my own: B. L. Sturm and N. Collins, "The kiki-bouba challenge: Algorithmic composition for content-based MIR research & development", in Proc. Int. Symp. Music Info. Retrieval, Oct. 2014. Below is the video of my presentation from a few days ago (PowerPoint slides here).

The one-line precis of our paper is:
The Kiki-Bouba Challenge (KBC) attempts to change the incentive in content-based MIR research from reproducing ground truth in a dataset to solving problems.

Beginning from my research in music machine listening, I have become more and more aware of applications of machine learning to cultural products, and the pitfalls that accompany such work. I previously critiqued a study applying clustering of image features to photographs of paintings by different artists. Here is a new one: clustering of Shakespeare's plays into genres by word frequencies. (This work is published in: S. Allison, R. Heuser, M. Jockers, F. Moretti and M. Witmore, "Quantitative Formalism: an Experiment", Pamphlets of the Stanford Literary Lab, Jan. 2011.)

On its face, this seems reasonable. As Allison et al. comment, certain words are closely associated with genres, like "castle" with "gothic". However, they discover they are able to automatically and correctly cluster Shakespeare's plays by using frequencies of only 37 words:

"a", "and", "as", "be", "but", "for", "have", "he", "him", "his", "i", "in", "is", "it", "me", "my", "not", "of", "p_apos", "p_colon", "p_comma", "p_exlam", "p_hyphen", "p_period", "p_ques", "p_semi", "so", "that", "the", "this", "thou", "to", "what", "will", "with", "you", "your"

At this point, it is reasonable to pause before making any claim that the clustering -- though correct it may be -- is a result of or caused by genre recognition. To accept such a conclusion entails accepting the words above and their frequencies as the mysterious ingredients that separate "tragedy" from "comedy". Unfortunately, it appears Allison et al. accept just that, calling these word frequency features the observable tips of the "icebergs" that are genres.

Hello, and welcome to Paper of the Day (Po'D): On the epistemological crisis in genomics edition. Today's paper is E. R. Dougherty, "On the epistemological crisis in genomics", Current Genomics, vol. 9, pp. 69-79, 2008. (I have discussed a previous paper by Dougherty and Dalton here.)

From its beginning, Dougherty's article is on the attack, and minces no words:

There is an epistemological crisis in genomics. The rules of the scientific game are not being followed. ... High-throughput technologies such as gene-expression microarrays have [led] to the accumulation of massive amounts of data, orders of magnitude in excess to what has heretofore been conceivable. But the accumulation of data does not constitute science, nor does the [a posteriori] rational analysis of data.

Dougherty moves from the ancient to more modern philosophy, highlighting the essential roles in Science played by experiments performed with controlled conditions, the formulation of knowledge through mathematics (models), and the necessity of verification of models through their prediction of data, not their explanation of data. The following paragraph makes this latter quality clearer:

Science is not about data fitting. Consider designing a linear classifier .... The result might be good relative to the assembled data; indeed, [it] might even classify the data perfectly. But this linear-classifier model does not constitute a scientific theory unless there is an error rate associated with the line, predicting the error rate on future observations. ... In practice, the error rate of a classifier is estimated via some error-estimation procedure, so that the validity of the model depends upon this procedure. Specifically, the degree to which one knows the classifier error, which quantifies the predictive capacity of the classifier, depends upon the mathematical properties of the estimation procedure. Absent an understanding of those properties, the results are meaningless.

Dougherty provides a nice illustration of how unreliable such error rates can be. Using real microarray data of genes (independent variables) and tumor types (dependent variable), Dougherty builds and tests several classifiers on subsets of the data, and compares their estimated error rates with their "true error rates" (which are estimated using all of the data). The two appear quite uncorrelated. (A similar example is on Dalton's research webpage.) Dougherty is led to the conclusion that many publications in genomics are "lacking scientific content", and refers to Kant when he remarks, "A good deal of the crisis in genomics turns on a return to 'groping in the dark'."
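Dougherty's demonstration can be mimicked in miniature with synthetic data (not his microarray data): train classifiers on small subsets, estimate their error by cross-validation within the subset, and compare with the error measured on the full set:

```python
# Miniature version of the demonstration: with few samples, the error
# estimated within a training subset tends to correlate poorly with the
# error measured on the full dataset. Synthetic data only.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, d = 1000, 20
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)

est_errors, true_errors = [], []
for _ in range(30):
    idx = rng.choice(n, size=30, replace=False)        # a small subset
    clf = LinearDiscriminantAnalysis().fit(X[idx], y[idx])
    est = 1 - cross_val_score(clf, X[idx], y[idx], cv=5).mean()
    full = 1 - clf.score(X, y)     # error on all data, a proxy for true error
    est_errors.append(est)
    true_errors.append(full)

r = np.corrcoef(est_errors, true_errors)[0, 1]
print(f"correlation between estimated and 'true' error: {r:.2f}")
```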

Since publication, this article appears to have been referenced only 31 times, 19 of which are not from Dougherty and/or Dalton. I look forward to seeing how it has been received in those papers, and its lessons taken into practice. Looks like I will be reading a lot more bioinformatics research.

QMUL, there I go!

I am extremely pleased to report that in December I will be moving to the School of Electronic Engineering and Computer Science at Queen Mary University of London! I am really looking forward to joining and contributing to such a leading light in my field.

Now, how to migrate this blog?

This summer I have the opportunity to read more closely R. A. Bailey, Design of Comparative Experiments. Cambridge University Press, 2008. One thing I really like about her approach is its incorporation of linear algebra and probability theory, which makes it essentially estimation theory. This provides an unambiguous picture of what is going on in an experiment, the assumptions in play, and the relevance and meaning of particular statistical tests. Below, I explicate some of the fundamental subspaces of an experiment.
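As a taste of that picture (my own compressed summary of the setup, not a quotation from Bailey): the data vector lives in a Euclidean space with one coordinate per experimental unit, and the analysis rests on an orthogonal decomposition of that space:

```latex
% y is the data vector, one coordinate per experimental unit (plot).
% The space decomposes into orthogonal subspaces: V_0 spanned by the
% all-ones vector (grand mean), a treatment subspace W_T orthogonal to
% V_0, and the residual W_perp.
\[
\mathbb{R}^{\Omega} = V_0 \oplus W_T \oplus W_{\perp},
\qquad
y = P_{V_0}\,y + P_{W_T}\,y + P_{W_{\perp}}\,y .
\]
% Each line of an ANOVA table is the squared norm of one such projection,
% \lVert P_W y \rVert^2, with degrees of freedom \dim W.
```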

The program for EUSIPCO 2014 has been announced. Papers of interest for me include:

Comparison of Different Representations Based on Nonlinear Features for Music Genre Classification Athanasia Zlatintsi (National Technical University of Athens, Greece); Petros Maragos (National Technical University of Athens, Greece)

Fast Music Information Retrieval with Indirect Matching Takahiro Hayashi (Niigata University & Department of Information Engineering, Faculty of Engineering, Japan); Nobuaki Ishii (Niigata University, Japan); Masato Yamaguchi (Niigata University, Japan)

Audio Concept Classification with Hierarchical Deep Neural Networks Mirco Ravanelli (Fondazione Bruno Kessler (FBK), Italy); Benjamin Elizalde (ICSI Berkeley, USA); Karl Ni (Lawrence Livermore National Laboratory, USA); Gerald Friedland (International Computer Science Institute, USA)

Unsupervised Learning and Refinement of Rhythmic Patterns for Beat and Downbeat Tracking Florian Krebs (Johannes Kepler University, Linz, Austria); Filip Korzeniowski (Johannes Kepler University, Linz, Austria); Maarten Grachten (Austrian Research Institute for Artificial Intelligence, Austria); Gerhard Widmer (Johannes Kepler University Linz, Austria)

Speech-Music Discrimination: a Deep Learning Perspective Aggelos Pikrakis (University of Piraeus, Greece); Sergios Theodoridis (University of Athens, Greece)

Exploring Superframe Co-occurrence for Acoustic Event Recognition Huy Phan (University of Lübeck, Germany); Alfred Mertins (Institute for Signal and Image Processing, University of Luebeck, Germany)

Detecting Sound Objects in Audio Recordings Anurag Kumar (Carnegie Mellon University, USA); Rita Singh (Carnegie Mellon University, USA); Bhiksha Raj (Carnegie Mellon University, USA)

A Montage Approach to Sound Texture Synthesis Sean O'Leary (IRCAM, France); Axel Roebel (IRCAM, France)

A Compressible Template Protection Scheme for Face Recognition Based on Sparse Representation Yuichi Muraki (Tokyo Metropolitan University, Japan); Masakazu Furukawa (Tokyo Metropolitan University, Japan); Masaaki Fujiyoshi (Tokyo Metropolitan University, Japan); Yoshihide Tonomura (NTT, Japan); Hitoshi Kiya (Tokyo Metropolitan University, Japan)

Sparse Reconstruction of Facial Expressions with Localized Gabor Moments André Mourão (Universidade Nova Lisbon, Portugal); Pedro Borges (Universidade Nova de Lisboa, Portugal); Nuno Correia (Computer Science, Portugal); Joao Magalhaes (Universidade Nova Lisboa, Portugal)

Pornography Detection Using BossaNova Video Descriptor Carlos Caetano (Federal University of Minas Gerais, Brazil); Sandra Avila (University of Campinas, Brazil); Silvio Guimarães (PUC Minas, Brazil); Arnaldo Araújo (Federal University of Minas Gerais, Brazil)

Feature Level Combination for Object Recognition Abdollah Amirkhani-Shahraki (IUST & IranUniversity of Science and Technology, Iran)

Sparse Representation and Least Squares-based Classification in Face Recognition Michael Iliadis (Northwestern University, USA); Leonidas Spinoulas (Northwestern University, USA); Albert S. Berahas (Northwestern University, USA); Haohong Wang (TCL Research America, USA); Aggelos K Katsaggelos (Northwestern University, USA)

Greedy Methods for Simultaneous Sparse Approximation Leila Belmerhnia (CRAN, Université de Lorraine, CNRS, France); El-Hadi Djermoune (CRAN, Nancy-Universite, CNRS, France); David Brie (CRAN, Nancy Université, CNRS, France)

Sparse Matrix Decompositions for Clustering Thomas Blumensath (University of Southampton, United Kingdom)

Evaluation of Non-Linear Combinations of Rescaled Reassigned Spectrograms Maria Sandsten (Lund University, Sweden)