Through my research in machine listening for music, I have become more and more aware of applications of machine learning to cultural products, and of the pitfalls that accompany such work. I previously critiqued a study applying clustering of image features to photographs of paintings by different artists. Here is a new one: clustering of Shakespeare's plays into genres by word frequencies. (This work is published in: S. Allison, R. Heuser, M. Jockers, F. Moretti and M. Witmore, "Quantitative Formalism: an Experiment", Pamphlets of the Stanford Literary Lab, Jan. 2011.)

On its face, this seems reasonable. As Allison et al. comment, certain words are closely associated with particular genres, like "castle" with "gothic". However, they find they are able to automatically and correctly cluster Shakespeare's plays using the frequencies of only 37 features (function words and punctuation tokens):

"a", "and", "as", "be", "but", "for", "have", "he", "him", "his", "i", "in", "is", "it", "me", "my", "not", "of", "p_apos", "p_colon", "p_comma", "p_exlam", "p_hyphen", "p_period", "p_ques", "p_semi", "so", "that", "the", "this", "thou", "to", "what", "will", "with", "you", "your"

At this point, it is reasonable to pause before making any claim that the clustering -- correct though it may be -- is a result of or caused by genre recognition. To accept such a conclusion entails accepting the words above and their frequencies as the mysterious ingredients that separate "tragedy" from "comedy". Unfortunately, it appears Allison et al. accept just that, calling these word-frequency features the observable tips of the "icebergs" that are genres.

Hello, and welcome to Paper of the Day (Po'D): On the epistemological crisis in genomics edition. Today's paper is E. R. Dougherty, "On the epistemological crisis in genomics", Current Genomics, vol. 9, pp. 69-79, 2008. (I have discussed a previous paper by Dougherty and Dalton here.)

From its beginning, Dougherty's article is on the attack, and does not mince words:

There is an epistemological crisis in genomics. The rules of the scientific game are not being followed. ... High-throughput technologies such as gene-expression microarrays have [led] to the accumulation of massive amounts of data, orders of magnitude in excess to what has heretofore been conceivable. But the accumulation of data does not constitute science, nor does the [a posteriori] rational analysis of data.

Dougherty moves from ancient to modern philosophy, highlighting the essential roles played in science by experiments performed under controlled conditions, the formulation of knowledge through mathematics (models), and the necessity of verifying models through their prediction of data, not their explanation of it. The following paragraph makes this last point clearer:

Science is not about data fitting. Consider designing a linear classifier .... The result might be good relative to the assembled data; indeed, [it] might even classify the data perfectly. But this linear-classifier model does not constitute a scientific theory unless there is an error rate associated with the line, predicting the error rate on future observations. ... In practice, the error rate of a classifier is estimated via some error-estimation procedure, so that the validity of the model depends upon this procedure. Specifically, the degree to which one knows the classifier error, which quantifies the predictive capacity of the classifier, depends upon the mathematical properties of the estimation procedure. Absent an understanding of those properties, the results are meaningless.

Dougherty provides a nice illustration of how unreliable such error rates can be. Using real microarray data of genes (independent variables) and tumor types (dependent variable), Dougherty builds and tests several classifiers on subsets of the data, and compares their estimated error rates with their "true" error rates (which are estimated using all of the data). The two appear quite uncorrelated. (A similar example is on Dalton's research webpage.) Dougherty is led to the conclusion that many publications in genomics are "lacking scientific content", and refers to Kant when he remarks, "A good deal of the crisis in genomics turns on a return to 'groping in the dark'."
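
Dougherty's warning is easy to reproduce with synthetic data. The following sketch (my own construction, not his microarray experiment) trains a linear classifier on pure-noise features, where the true error of any classifier is 50% by construction. With far more features than training samples, the resubstitution error nonetheless looks excellent.

```python
# Sketch: optimistic error estimates in the small-sample,
# high-dimensional regime (synthetic stand-in for the microarray case).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_train, n_test, n_features = 20, 10000, 100

# Pure-noise features: labels are unpredictable by construction,
# so the true error of any classifier is 0.5.
X_train = rng.normal(size=(n_train, n_features))
y_train = rng.integers(0, 2, size=n_train)
X_test = rng.normal(size=(n_test, n_features))
y_test = rng.integers(0, 2, size=n_test)

# Essentially unregularized linear classifier: with 100 features and
# 20 samples, it separates the training data almost surely.
clf = LogisticRegression(C=1e6, max_iter=1000).fit(X_train, y_train)
resub_error = 1 - clf.score(X_train, y_train)  # optimistic estimate
true_error = 1 - clf.score(X_test, y_test)     # near 0.5 by construction

print(f"resubstitution error: {resub_error:.2f}, true error: {true_error:.2f}")
```

The resubstitution estimate says the classifier is excellent; a large independent test set says it is worthless. That gap is precisely what Dougherty means by results being meaningless absent an understanding of the estimation procedure.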

Since publication, this article appears to have been referenced only 31 times, 19 of which are by authors other than Dougherty and/or Dalton. I look forward to seeing how it has been received in those papers, and whether its lessons have been taken into practice. It looks like I will be reading a lot more bioinformatics research.

QMUL, there I go!

I am extremely pleased to report that in December I will be moving to the School of Electronic Engineering and Computer Science at Queen Mary University of London! I am really looking forward to joining and contributing to such a leading light in my field.

Now, how to migrate this blog?

This summer I have the opportunity to read more closely R. A. Bailey, Design of Comparative Experiments, Cambridge University Press, 2008. One thing I really like about her approach is its grounding in linear algebra and probability theory (essentially, estimation theory). This provides an unambiguous picture of what is going on in an experiment, the assumptions in play, and the relevance and meaning of particular statistical tests. Below, I explicate some of the fundamental subspaces of an experiment.
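
As a small taste of that linear-algebraic picture, here is a sketch (in my own notation, not Bailey's) of the orthogonal projectors onto the subspaces defined by a blocking factor and a treatment factor. Each factor defines a subspace of vectors that are constant within its levels, and the projector onto it is built from the factor's design matrix.

```python
# Sketch: orthogonal projectors onto the subspaces of an experiment.
import numpy as np

def projector(factor):
    """Orthogonal projector onto the span of the indicator vectors of
    the factor's levels: P = X (X^T X)^{-1} X^T for design matrix X."""
    levels = sorted(set(factor))
    X = np.array([[1.0 if f == l else 0.0 for l in levels] for f in factor])
    return X @ np.linalg.inv(X.T @ X) @ X.T

# A design with 6 plots: each treatment appears once in each block.
blocks = [0, 0, 0, 1, 1, 1]
treatments = [0, 1, 2, 0, 1, 2]

P_B = projector(blocks)      # replaces each entry by its block mean
P_T = projector(treatments)  # replaces each entry by its treatment mean
P_0 = projector([0] * 6)     # replaces each entry by the grand mean

# In this orthogonal design, the block and treatment subspaces meet
# only in the grand-mean subspace, so P_B P_T = P_0.
print(np.allclose(P_B @ P_T, P_0))
```

The differences P_B - P_0 and P_T - P_0 then project onto the block and treatment subspaces with the grand mean removed, which is where the familiar sums of squares in an analysis of variance come from.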

The program for EUSIPCO 2014 has been announced. Papers of interest for me include:

Comparison of Different Representations Based on Nonlinear Features for Music Genre Classification Athanasia Zlatintsi (National Technical University of Athens, Greece); Petros Maragos (National Technical University of Athens, Greece)

Fast Music Information Retrieval with Indirect Matching Takahiro Hayashi (Niigata University & Department of Information Engineering, Faculty of Engineering, Japan); Nobuaki Ishii (Niigata University, Japan); Masato Yamaguchi (Niigata University, Japan)

Audio Concept Classification with Hierarchical Deep Neural Networks Mirco Ravanelli (Fondazione Bruno Kessler (FBK), Italy); Benjamin Elizalde (ICSI Berkeley, USA); Karl Ni (Lawrence Livermore National Laboratory, USA); Gerald Friedland (International Computer Science Institute, USA)

Unsupervised Learning and Refinement of Rhythmic Patterns for Beat and Downbeat Tracking Florian Krebs (Johannes Kepler University, Linz, Austria); Filip Korzeniowski (Johannes Kepler University, Linz, Austria); Maarten Grachten (Austrian Research Institute for Artificial Intelligence, Austria); Gerhard Widmer (Johannes Kepler University Linz, Austria)

Speech-Music Discrimination: a Deep Learning Perspective Aggelos Pikrakis (University of Piraeus, Greece); Sergios Theodoridis (University of Athens, Greece)

Exploring Superframe Co-occurrence for Acoustic Event Recognition Huy Phan (University of Lübeck, Germany); Alfred Mertins (Institute for Signal and Image Processing, University of Luebeck, Germany)

Detecting Sound Objects in Audio Recordings Anurag Kumar (Carnegie Mellon University, USA); Rita Singh (Carnegie Mellon University, USA); Bhiksha Raj (Carnegie Mellon University, USA)

A Montage Approach to Sound Texture Synthesis Sean O'Leary (IRCAM, France); Axel Roebel (IRCAM, France)

A Compressible Template Protection Scheme for Face Recognition Based on Sparse Representation Yuichi Muraki (Tokyo Metropolitan University, Japan); Masakazu Furukawa (Tokyo Metropolitan University, Japan); Masaaki Fujiyoshi (Tokyo Metropolitan University, Japan); Yoshihide Tonomura (NTT, Japan); Hitoshi Kiya (Tokyo Metropolitan University, Japan)

Sparse Reconstruction of Facial Expressions with Localized Gabor Moments André Mourão (Universidade Nova Lisbon, Portugal); Pedro Borges (Universidade Nova de Lisboa, Portugal); Nuno Correia (Computer Science, Portugal); Joao Magalhaes (Universidade Nova Lisboa, Portugal)

Pornography Detection Using BossaNova Video Descriptor Carlos Caetano (Federal University of Minas Gerais, Brazil); Sandra Avila (University of Campinas, Brazil); Silvio Guimarães (PUC Minas, Brazil); Arnaldo Araújo (Federal University of Minas Gerais, Brazil)

Feature Level Combination for Object Recognition Abdollah Amirkhani-Shahraki (IUST & Iran University of Science and Technology, Iran)

Sparse Representation and Least Squares-based Classification in Face Recognition Michael Iliadis (Northwestern University, USA); Leonidas Spinoulas (Northwestern University, USA); Albert S. Berahas (Northwestern University, USA); Haohong Wang (TCL Research America, USA); Aggelos K Katsaggelos (Northwestern University, USA)

Greedy Methods for Simultaneous Sparse Approximation Leila Belmerhnia (CRAN, Université de Lorraine, CNRS, France); El-Hadi Djermoune (CRAN, Nancy-Universite, CNRS, France); David Brie (CRAN, Nancy Université, CNRS, France)

Sparse Matrix Decompositions for Clustering Thomas Blumensath (University of Southampton, United Kingdom)

Evaluation of Non-Linear Combinations of Rescaled Reassigned Spectrograms Maria Sandsten (Lund University, Sweden)

Today, I present a talk at the SoundSoftware 2014 Third Workshop on Software and Data for Audio and Music Research: "How reproducibility tipped the scale toward article acceptance".

I discuss a recent episode in which our submission of a negative-result article -- contradicting previously published work -- was favorably reviewed, and eventually published (here). The review process, and the persuasion of the reviewers, were greatly aided by our efforts at reproducibility. We won a reproducibility prize last year for this work.

There appears to be a bevy of good-looking papers. I am particularly looking forward to learning more about these:

A COMPOSITIONAL HIERARCHICAL MODEL FOR MUSIC INFORMATION RETRIEVAL

AN ANALYSIS AND EVALUATION OF AUDIO FEATURES FOR MULTITRACK MUSIC MIXTURES

AN ASSOCIATION-BASED APPROACH TO GENRE CLASSIFICATION IN MUSIC

AUTOMATIC INSTRUMENT CLASSIFICATION OF ETHNOMUSICOLOGICAL AUDIO RECORDINGS

CLASSIFYING EEG RECORDINGS OF RHYTHM PERCEPTION

CODEBOOK BASED SCALABLE MUSIC TAGGING WITH POISSON MATRIX FACTORIZATION

DETECTING DROPS IN EDM: CONTENT-BASED APPROACHES TO A SOCIALLY SIGNIFICANT MUSIC EVENT

EVALUATING THE EVALUATION MEASURES FOR BEAT TRACKING

IMPROVING MUSIC RECOMMENDER SYSTEMS: WHAT CAN WE LEARN FROM RESEARCH ON MUSIC TASTES?

INFORMATION-THEORETIC MEASURES OF MUSIC LISTENING BEHAVIOUR

JAMS: A JSON ANNOTATED MUSIC SPECIFICATION FOR REPRODUCIBLE MIR RESEARCH

MIR_EVAL

MODELING TEMPORAL STRUCTURE IN MUSIC FOR EMOTION PREDICTION USING PAIRWISE COMPARISONS

MUSIC CLASSIFICATION BY TRANSDUCTIVE LEARNING USING BIPARTITE HETEROGENEOUS NETWORKS

ON COMPARATIVE STATISTICS FOR LABELLING TASKS: WHAT CAN WE LEARN FROM MIREX ACE 2013?

ON CULTURAL AND EXPERIENTIAL ASPECTS OF MUSIC MOOD

ON INTER-RATER AGREEMENT IN AUDIO MUSIC SIMILARITY

TEN YEARS OF MIREX (MUSIC INFORMATION RETRIEVAL EVALUATION EXCHANGE): REFLECTIONS, CHALLENGES AND OPPORTUNITIES

THEORETICAL FRAMEWORK OF A COMPUTATIONAL MODEL OF AUDITORY MEMORY FOR MUSIC EMOTION RECOGNITION

TRANSFER LEARNING BY SUPERVISED PRE-TRAINING FOR AUDIO-BASED MUSIC CLASSIFICATION

WHAT IS THE EFFECT OF AUDIO QUALITY ON THE ROBUSTNESS OF MFCCS AND CHROMA FEATURES?


Hello, and welcome to Paper of the Day (Po'D): Intriguing properties of neural networks edition. Today's paper is: C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow and R. Fergus, "Intriguing properties of neural networks", in Proc. Int. Conf. Learning Representations, 2014. Today's paper is very exciting for me because I see "horses" nearly being called "horses" in a machine learning research domain outside music information retrieval. Furthermore, the arguments this work is apparently causing resemble those I have received in peer review of my own work. For instance, see the comments on this post, or the reviews here. Some press is also resulting, e.g., ZDNet, Slashdot; and the results of the paper are being used to bolster the argument that the hottest topic in machine learning is over-hyped.

The one-line precis of this paper is: The deep neural network: as uninterpretable as it ever was, and now acting in ways that contradict notions of generalization.

Hello, and welcome to Paper of the Day (Po'D): Horses and more horses edition. Today's paper is: B. L. Sturm, "A Simple Method to Determine if a Music Information Retrieval System is a 'Horse'", IEEE Trans. Multimedia, 2014 (in press). This double-header of a Po'D also includes: B. L. Sturm, C. Kereliuk and A. Pikrakis, "A Closer Look at Deep Learning Neural Networks with Low-level Spectral Periodicity Features", in Proc. 4th Int. Workshop on Cognitive Information Processing, June 2014.

The one-line precis of these papers is:
For some use cases, it is important to ensure Music Information Retrieval (MIR) systems are reproducing "ground truth" for the right reasons: Here's how.
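
To give a flavour of the kind of test these papers motivate (a hypothetical sketch, not the exact procedure of either paper): apply to the test inputs a transformation that should be irrelevant to the "ground truth", and measure how often the system changes its answer. A system that is right for the right reasons should be largely unmoved.

```python
# Sketch: probing a classifier with a transformation that should not
# change the true label. A high flip rate suggests a "horse".
import numpy as np

def horse_check(predict, X, transform):
    """Fraction of inputs whose predicted label changes under a
    transformation that should not change the true label."""
    before = np.array([predict(x) for x in X])
    after = np.array([predict(transform(x)) for x in X])
    return float(np.mean(before != after))

# Toy "horse": a classifier that latches onto overall level
# (a confound), not anything meaningful about the signal.
def predict(x):
    return int(np.mean(np.abs(x)) > 1.0)

X = [np.random.default_rng(i).normal(size=50) for i in range(100)]
flip_rate = horse_check(predict, X, lambda x: 1.5 * x)  # mild gain change
print(f"flip rate under irrelevant transformation: {flip_rate:.2f}")
```

Here a mild gain change, which should be irrelevant to any sensible notion of ground truth, flips most of the toy classifier's predictions, exposing its reliance on the confound.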

The State of the Art

Finally published is my article, The State of the Art Ten Years After A State of the Art: Future Research in Music Information Retrieval. This article, one culmination of my recently finished postdoc grant, examines the past ten years of research since Aucouturier and Pachet's "Representing Musical Genre: A State of the Art" in 2003. The one-line summary:

The state of the art now is nearly the same as it was then.

Preprint is available here.

This work leads to five "prescriptions" for motivating progress in research addressing real problems, not only in music genre recognition (whatever that is, see section 2), but more broadly in music information retrieval (see section 4), and wider still, any application of machine learning (grant proposal pending):

  1. Define problems with use cases and formalism
  2. Design valid and relevant experiments
  3. Perform system analysis deeper than just evaluation
  4. Acknowledge limitations and proceed with skepticism
  5. Make reproducible work reproducible
These are quite obvious, of course; but I have found that they are practiced only rarely.

Since my work on the GTZAN dataset (which is now formally incorporated in this JNMR article), I have received several emails from people wondering whether it is ok to use GTZAN, or what else to use, for testing their genre recognition systems. For instance:
"I just went through your paper and you have criticized datasets like GTZAN. I knew that GTZAN is a very popular dataset that is used in evaluation. But now there are flaws with it, what are the alternatives? And generally researchers don't stop at testing with just one dataset, maybe they take 3-4 datasets. In your opinion, what are the best music datasets which you can work with and avoid the problems of datasets like GTZAN?"
Some reviewers of my analysis of GTZAN have suggested the dataset should be banished, or that much better datasets are now available. My position is this: GTZAN should not be banished, but used properly. That means using it with full consideration of its faults, and drawing conclusions from results derived from it with full acknowledgement of the limitations of the experiment. Used that way, GTZAN can still be useful for MIR. And no other dataset is going to be free of faults: for instance, I have recently found faults in two other well-used datasets, BALLROOM and the Latin Music Dataset.

Whether or not a dataset, or set of datasets, is "good" depends on its relevance to the scientific question that is being asked. In a very real sense, the least of one's worries should be datasets. Much more effort must be made in the design, implementation and analysis of valid and relevant experiments.