This summer I have the opportunity to read more closely R. A. Bailey, Design of Comparative Experiments (Cambridge University Press, 2008). One thing I really like about her approach is its incorporation of linear algebra and probability theory, which together amount to estimation theory. This provides an unambiguous picture of what is going on in an experiment, the assumptions in play, and the relevance and meaning of particular statistical tests. Below, I explicate some of the fundamental subspaces of an experiment.
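As a minimal numerical sketch of my own (not taken from Bailey's text), consider a toy experiment with three treatments and two replicates each. The grand-mean subspace V_0 is spanned by the all-ones vector, the treatment subspace V_T is the column space of the treatment indicator matrix, and projecting a response vector onto V_0, V_T minus V_0, and the orthogonal complement of V_T decomposes it into the familiar ANOVA pieces:

```python
import numpy as np

# Toy experiment: 6 plots, 3 treatments, 2 replicates each.
treatments = np.array([0, 0, 1, 1, 2, 2])
n = len(treatments)
X = np.eye(3)[treatments]            # n x 3 treatment indicator matrix

# Projector onto V_0 = span of the all-ones vector (grand mean).
one = np.ones((n, 1))
P0 = one @ one.T / n

# Projector onto V_T = column space of X (treatment subspace).
PT = X @ np.linalg.pinv(X)

# V_0 is a subspace of V_T, so the projectors satisfy P0 @ PT == P0.
assert np.allclose(P0 @ PT, P0)

# Orthogonal decomposition of a response vector y:
y = np.array([3.1, 2.9, 5.0, 5.2, 4.1, 3.9])
y_mean  = P0 @ y                     # grand-mean component
y_treat = (PT - P0) @ y              # treatment deviations
y_resid = (np.eye(n) - PT) @ y       # residual component
assert np.allclose(y_mean + y_treat + y_resid, y)

# Pythagoras: total SS splits into the familiar ANOVA sums of squares.
print(y @ y, y_mean @ y_mean + y_treat @ y_treat + y_resid @ y_resid)
```

The point of the subspace picture is exactly this orthogonal decomposition: the sums of squares in an ANOVA table are just squared norms of projections onto mutually orthogonal subspaces.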
The program for EUSIPCO 2014 has been announced. Papers of interest for me include:
Comparison of Different Representations Based on Nonlinear Features for Music Genre Classification Athanasia Zlatintsi (National Technical University of Athens, Greece); Petros Maragos (National Technical University of Athens, Greece)
Fast Music Information Retrieval with Indirect Matching Takahiro Hayashi (Niigata University & Department of Information Engineering, Faculty of Engineering, Japan); Nobuaki Ishii (Niigata University, Japan); Masato Yamaguchi (Niigata University, Japan)
Audio Concept Classification with Hierarchical Deep Neural Networks Mirco Ravanelli (Fondazione Bruno Kessler (FBK), Italy); Benjamin Elizalde (ICSI Berkeley, USA); Karl Ni (Lawrence Livermore National Laboratory, USA); Gerald Friedland (International Computer Science Institute, USA)
Unsupervised Learning and Refinement of Rhythmic Patterns for Beat and Downbeat Tracking Florian Krebs (Johannes Kepler University, Linz, Austria); Filip Korzeniowski (Johannes Kepler University, Linz, Austria); Maarten Grachten (Austrian Research Institute for Artificial Intelligence, Austria); Gerhard Widmer (Johannes Kepler University Linz, Austria)
Speech-Music Discrimination: a Deep Learning Perspective Aggelos Pikrakis (University of Piraeus, Greece); Sergios Theodoridis (University of Athens, Greece)
Exploring Superframe Co-occurrence for Acoustic Event Recognition Huy Phan (University of Lübeck, Germany); Alfred Mertins (Institute for Signal and Image Processing, University of Luebeck, Germany)
Detecting Sound Objects in Audio Recordings Anurag Kumar (Carnegie Mellon University, USA); Rita Singh (Carnegie Mellon University, USA); Bhiksha Raj (Carnegie Mellon University, USA)
A Montage Approach to Sound Texture Synthesis Sean O'Leary (IRCAM, France); Axel Roebel (IRCAM, France)
A Compressible Template Protection Scheme for Face Recognition Based on Sparse Representation Yuichi Muraki (Tokyo Metropolitan University, Japan); Masakazu Furukawa (Tokyo Metropolitan University, Japan); Masaaki Fujiyoshi (Tokyo Metropolitan University, Japan); Yoshihide Tonomura (NTT, Japan); Hitoshi Kiya (Tokyo Metropolitan University, Japan)
Sparse Reconstruction of Facial Expressions with Localized Gabor Moments André Mourão (Universidade Nova Lisbon, Portugal); Pedro Borges (Universidade Nova de Lisboa, Portugal); Nuno Correia (Computer Science, Portugal); Joao Magalhaes (Universidade Nova Lisboa, Portugal)
Pornography Detection Using BossaNova Video Descriptor Carlos Caetano (Federal University of Minas Gerais, Brazil); Sandra Avila (University of Campinas, Brazil); Silvio Guimarães (PUC Minas, Brazil); Arnaldo Araújo (Federal University of Minas Gerais, Brazil)
Feature Level Combination for Object Recognition Abdollah Amirkhani-Shahraki (IUST & Iran University of Science and Technology, Iran)
Sparse Representation and Least Squares-based Classification in Face Recognition Michael Iliadis (Northwestern University, USA); Leonidas Spinoulas (Northwestern University, USA); Albert S. Berahas (Northwestern University, USA); Haohong Wang (TCL Research America, USA); Aggelos K Katsaggelos (Northwestern University, USA)
Greedy Methods for Simultaneous Sparse Approximation Leila Belmerhnia (CRAN, Université de Lorraine, CNRS, France); El-Hadi Djermoune (CRAN, Nancy-Universite, CNRS, France); David Brie (CRAN, Nancy Université, CNRS, France)
Sparse Matrix Decompositions for Clustering Thomas Blumensath (University of Southampton, United Kingdom)
Evaluation of Non-Linear Combinations of Rescaled Reassigned Spectrograms Maria Sandsten (Lund University, Sweden)
I discuss a recent episode in which our submission of a negative result article -- contradicting previously published work -- was favorably reviewed, and eventually published (here). The review process, and the persuasion of the reviewers, were greatly aided by our efforts at reproducibility. We won a reproducibility prize last year for this work.
There appears to be a bevy of good-looking papers. I am particularly looking forward to learning more about these:
A COMPOSITIONAL HIERARCHICAL MODEL FOR MUSIC INFORMATION RETRIEVAL
AN ANALYSIS AND EVALUATION OF AUDIO FEATURES FOR MULTITRACK MUSIC MIXTURES
AN ASSOCIATION-BASED APPROACH TO GENRE CLASSIFICATION IN MUSIC
AUTOMATIC INSTRUMENT CLASSIFICATION OF ETHNOMUSICOLOGICAL AUDIO RECORDINGS
CLASSIFYING EEG RECORDINGS OF RHYTHM PERCEPTION
CODEBOOK BASED SCALABLE MUSIC TAGGING WITH POISSON MATRIX FACTORIZATION
DETECTING DROPS IN EDM: CONTENT-BASED APPROACHES TO A SOCIALLY SIGNIFICANT MUSIC EVENT
EVALUATING THE EVALUATION MEASURES FOR BEAT TRACKING
IMPROVING MUSIC RECOMMENDER SYSTEMS: WHAT CAN WE LEARN FROM RESEARCH ON MUSIC TASTES?
INFORMATION-THEORETIC MEASURES OF MUSIC LISTENING BEHAVIOUR
JAMS: A JSON ANNOTATED MUSIC SPECIFICATION FOR REPRODUCIBLE MIR RESEARCH
MODELING TEMPORAL STRUCTURE IN MUSIC FOR EMOTION PREDICTION USING PAIRWISE COMPARISONS
MUSIC CLASSIFICATION BY TRANSDUCTIVE LEARNING USING BIPARTITE HETEROGENEOUS NETWORKS
ON COMPARATIVE STATISTICS FOR LABELLING TASKS: WHAT CAN WE LEARN FROM MIREX ACE 2013?
ON CULTURAL AND EXPERIENTIAL ASPECTS OF MUSIC MOOD
ON INTER-RATER AGREEMENT IN AUDIO MUSIC SIMILARITY
TEN YEARS OF MIREX (MUSIC INFORMATION RETRIEVAL EVALUATION EXCHANGE): REFLECTIONS, CHALLENGES AND OPPORTUNITIES
THEORETICAL FRAMEWORK OF A COMPUTATIONAL MODEL OF AUDITORY MEMORY FOR MUSIC EMOTION RECOGNITION
TRANSFER LEARNING BY SUPERVISED PRE-TRAINING FOR AUDIO-BASED MUSIC CLASSIFICATION
WHAT IS THE EFFECT OF AUDIO QUALITY ON THE ROBUSTNESS OF MFCCS AND CHROMA FEATURES?
A few months ago, I submitted to ISMIR 2014 a paper essentially casting my five prescriptions for motivating scientific research in music information retrieval into the MIR conference community, along with a summary of the story of Clever Hans. I provocatively titled the paper ``The future of scientific research in music information retrieval'', the meaning of which comes from the first sentence of the abstract: "We make five prescriptions that can help ensure future research in music information retrieval (MIR) contributes valid (scientific) knowledge." I meant my submission not as a research paper presenting a new MIR system/problem/dataset, but as a position paper, summarizing in one place the major findings from my work and collaborations of the past two years in MIR, and proposing "a way forward" when it comes to what I term "a crisis in MIR evaluation: a large number of published works related to machine music listening (> 500) report results using evaluations that lack the validity for making any meaningful comparisons or conclusions with regards to machine music listening."
The reviews are in (rejection), but the machinery to respond to the comments does not exist. Undoubtedly, the four reviewers spent a good amount of time reviewing and discussing my paper from their perspectives, and with clear competence in the topics. The quality of the reviews (as well as of those on my other four submissions) is by and large exceptional among the conferences to which I submit, and I very much appreciate the reviewers' efforts. Their comments reveal where my text has fallen short of its goal, which helps me significantly in refining the delivery of the ideas I am advocating. Below, I try to address these discrepancies in line with the reviewer comments. (I understand this is unconventional; however, I think the discussion is useful for illuminating my five prescriptions just published in JNMR.)
The one-line precis of this paper is: The deep neural network: as uninterpretable as it ever was, and now acting in ways that contradict notions of generalization.
The one-line precis of these papers is:
For some use cases, it is important to ensure Music Information Retrieval (MIR) systems are reproducing "ground truth" for the right reasons: Here's how.
The state of the art now is nearly the same as it was then. A preprint is available here.
This work leads to five "prescriptions" for motivating progress in research addressing real problems, not only in music genre recognition (whatever that is, see section 2), but more broadly in music information retrieval (see section 4), and wider still, any application of machine learning (grant proposal pending):
- Define problems with use cases and formalism
- Design valid and relevant experiments
- Perform system analysis deeper than just evaluation
- Acknowledge limitations and proceed with skepticism
- Make work reproducible.
Since my work on the GTZAN dataset (which is now formally incorporated in this JNMR article), I have received several emails from people wondering whether it is ok to use GTZAN, or what else to use, for testing their genre recognition systems. For instance:
"I just went through your paper and you have criticized datasets like GTZAN. I knew that GTZAN is a very popular dataset that is used in evaluation. But now there are flaws with it, what are the alternatives? And generally researchers don't stop at testing with just one dataset, maybe they take 3-4 datasets. In your opinion, what are the best music datasets which you can work with and avoid the problems of datasets like GTZAN?"

Some reviewers of my analysis of GTZAN have suggested the dataset should be banished, or that much better datasets are now available. My position on GTZAN is the following: GTZAN should not be banished, but used properly. This means one must use it with full consideration of its faults, and draw conclusions from results derived from it with full acknowledgement of the limitations of the experiment. Used in this way, GTZAN can still be useful for MIR. No other dataset is going to be free of faults. For instance, I have recently found faults in other well-used datasets: BALLROOM and the Latin Music Dataset.
Whether or not a dataset, or set of datasets, is "good" depends on its relevance to the scientific question that is being asked. In a very real sense, the least of one's worries should be datasets. Much more effort must be made in the design, implementation and analysis of valid and relevant experiments.
LMD is described in C. N. Silla, A. L. Koerich, and C. A. A. Kaestner, "The Latin music database," in Proc. ISMIR, 2008. That paper describes the LMD as 3,227 song recordings, each labeled with one of ten classes: Axé, Bachata, Bolero, Forró, Gaúcha, Merengue, Pagode, Salsa, Sertaneja, and Tango. This dataset is notable among those created for music genre recognition because it contains music outside the realm of Western popular music. As in the Ballroom dataset, each recording is assigned a single label by "experts in Brazilian dance" according to the appropriate dance. However, unlike GTZAN and Ballroom, the audio data is not freely available; only pre-computed features are available for download.
Searching through the references of my music genre recognition survey, I find this dataset (or portions of it) has been used in the evaluations of music genre recognition systems in at least 16 conference papers and journal articles:
- Y. M. G. Costa, L. S. Oliveira, A. L. Koerich, and F. Gouyon. Music genre recognition using spectrograms. In Proc. Int. Conf. Systems, Signals and Image Process., 2011.
- Y. M. G. Costa, L. S. Oliveira, A. L. Koerich, and F. Gouyon. Comparing textural features for music genre classification. In Proc. IEEE World Cong. Comp. Intell., June 2012.
- Y.M.G. Costa, L.S. Oliveira, A.L. Koerich, F. Gouyon, and J.G. Martins. Music genre classification using LBP textural features. Signal Process., 92(11):2723-2737, Nov. 2012.
- S. Doraisamy and S. Golzari. Automatic musical genre classification and artificial immune recognition system. In Z. W. Ras and A. A. Wieczorkowska, editors, Advances in Music Information Retrieval, pages 390-402. Springer, 2010.
- N. A. Draman, C. Wilson, and S. Ling. Modified AIS-based classifier for music genre classification. In Proc. ISMIR, pages 369-374, 2010.
- T. Lidy, C. Silla, O. Cornelis, F. Gouyon, A. Rauber, C. A. A. Kaestner, and A. L. Koerich. On the suitability of state-of-the-art music information retrieval methods for analyzing, categorizing and accessing non-western and ethnic music collections. Signal Process., 90(4):1032-1048, 2010.
- M. Lopes, F. Gouyon, A. Koerich, and L. E. S. Oliveira. Selection of training instances for music genre classification. In Proc. ICPR, Istanbul, Turkey, 2010.
- G. Marques, T. Langlois, F. Gouyon, M. Lopes, and M. Sordo. Short-term feature space and music genre classification. J. New Music Research, 40(2):127-137, 2011.
- G. Marques, M. Lopes, M. Sordo, T. Langlois, and F. Gouyon. Additional evidence that common low-level features of individual audio frames are not representative of music genres. In Proc. SMC, Barcelona, Spain, July 2010.
- A. Schindler and A. Rauber. Capturing the temporal domain in echonest features for improved classification effectiveness. In Proc. Adaptive Multimedia Retrieval, Oct. 2012.
- C. Silla, A. Koerich, and C. Kaestner. Improving automatic music genre classification with hybrid content-based feature vectors. In Proc. Symp. Applied Comp., Sierre, Switzerland, Mar. 2010.
- C. N. Silla, A. Koerich, and C. Kaestner. Automatic music genre classification using ensembles of classifiers. In Proc. IEEE Int. Conf. Systems, Man, Cybernetics, pages 1687-1692, 2007.
- C. N. Silla, A. L. Koerich, and C. A. A. Kaestner. Feature selection in automatic music genre classification. In Proc. IEEE Int. Symp. Multimedia, pages 39-44, 2008.
- C. N. Silla, A. L. Koerich, and C. A. A. Kaestner. A feature selection approach for automatic music genre classification. Int. J. Semantic Computing, 3(2):183-208, 2009.
- C. Silla, C. Kaestner, and A. Koerich. Time-space ensemble strategies for automatic music genre classification. In Jaime Sichman, Helder Coelho, and Solange Rezende, editors, Advances in Artificial Intelligence, pages 339-348. Springer Berlin / Heidelberg, 2006.
- C.N. Silla and A. A. Freitas. Novel top-down approaches for hierarchical classification and their application to automatic music genre classification. In IEEE Int. Conf. Systems, Man, and Cybernetics, San Antonio, USA, Oct. 2009.
So far, I have only looked for replicas within each class, and not across classes; but we now know at least 6.5% of the dataset is replicated (greater than the 5% in GTZAN, which we already know cannot be ignored). Below, I list the replicas I find.
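As a rough illustration of the bookkeeping involved (a hypothetical sketch, not the fingerprinting procedure actually used for this analysis), one can bucket recordings by a compact fingerprint of their features so that identical excerpts collide:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(1)

# Hypothetical fingerprint: a coarsely quantised feature vector, so that
# identical (or nearly identical) excerpts hash to the same bucket.
def fingerprint(features, decimals=1):
    return tuple(np.round(features, decimals))

# Fake "dataset": 10 feature vectors; items 7 and 8 are copies of 0 and 3.
feats = [rng.normal(size=4) for _ in range(7)]
feats += [feats[0].copy(), feats[3].copy(), rng.normal(size=4)]

buckets = defaultdict(list)
for idx, f in enumerate(feats):
    buckets[fingerprint(f)].append(idx)

# Any bucket holding more than one index is a group of replicas.
replicas = [ids for ids in buckets.values() if len(ids) > 1]
print(replicas)
```

Real audio replicas are rarely bit-identical (different encodings, trims, gains), which is why proper audio fingerprinting is needed in practice; the quantisation step above only gestures at that robustness.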
What is different between the two test conditions?
One thing that is different is that the bottle shape is visible in the first condition, but not visible in the second. Even though we carefully covered the label to conceal its identity, we were naive to the fact that the bottle shape can carry information about the region from which the wine comes. The following picture shows the variety of shapes.
From the left: bottles of the first shape usually come from Bordeaux; the next from Burgundy; the next from the Rhône; the next from Champagne; and the last from Côtes de Provence. In our dataset, then, the bottle shape is confounded with the region. We thus inadvertently trained our experts to recognize the region from which a wine comes not from the contents of the bottle, but from the shape of its container. Hence, when Team Småg can no longer see the bottle, they must guess.
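A tiny simulation (entirely hypothetical, with made-up features) shows how this plays out: a nearest-neighbour "expert" trained on data where bottle shape is perfectly confounded with region scores near-perfectly while the confound holds, and collapses toward chance once it is broken.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each "wine" has a content feature that is pure noise (no information
# about region) and a bottle-shape feature that, in the training data,
# is perfectly confounded with the region label.
def make_data(n, confounded):
    region = rng.integers(0, 5, size=n)      # 5 regions
    content = rng.normal(size=n)             # uninformative "taste"
    if confounded:
        shape = region.astype(float)         # shape mirrors the region
    else:
        shape = rng.integers(0, 5, size=n).astype(float)  # confound broken
    return np.column_stack([content, shape]), region

# A 1-nearest-neighbour "expert" trained on confounded data.
Xtr, ytr = make_data(200, confounded=True)

def predict(X):
    return np.array([ytr[np.argmin(np.abs(Xtr - x).sum(axis=1))] for x in X])

# Test 1: bottle shape still visible and confounded -> near-perfect score.
X1, y1 = make_data(100, confounded=True)
acc1 = (predict(X1) == y1).mean()

# Test 2: shape no longer informative -> accuracy collapses toward 1/5.
X2, y2 = make_data(100, confounded=False)
acc2 = (predict(X2) == y2).mean()
print(acc1, acc2)
```

The model never learned anything about the "contents"; the first figure of merit is entirely an artifact of the confound, which is exactly the trap in the wine experiment.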
I came up with this example to show how easy it can be to believe that one's learning experiment is valid, and that the figure of merit reflects real-world performance, when the experiment is actually invalid due to an independent variable of which the experimenter is unaware. This issue is directly discussed as a limitation by Ava Chase in her article, "Music discriminations by carp (Cyprinus carpio)":
Even a convincing demonstration of categorization can fail to identify the stimulus features that exert control at any given time, especially if the stimuli are complex. In particular, there can be uncertainty as to whether classification behavior had been under the stimulus control of the features in terms of which the experimenter had defined the categories or whether the subjects had discovered an effective discriminant of which the experimenter was unaware. The diversity of S+/S- pairings presumably rules out the possibility that the fish could have been relying on only a single discriminant, such as timbre, but a constant concern is the possible existence of a simple attribute that would have allowed the subjects merely to discriminate instead of categorizing. (emphasis mine)
Bob L. Sturm, Associate Professor
Audio Analysis Lab
Aalborg University Copenhagen
A.C. Meyers Vænge 15
DK-2450 Copenhagen SV, Denmark