The program for EUSIPCO 2014 has been announced. Papers of interest to me include:

Comparison of Different Representations Based on Nonlinear Features for Music Genre Classification -- Athanasia Zlatintsi (National Technical University of Athens, Greece); Petros Maragos (National Technical University of Athens, Greece)

Fast Music Information Retrieval with Indirect Matching -- Takahiro Hayashi (Niigata University & Department of Information Engineering, Faculty of Engineering, Japan); Nobuaki Ishii (Niigata University, Japan); Masato Yamaguchi (Niigata University, Japan)

Audio Concept Classification with Hierarchical Deep Neural Networks -- Mirco Ravanelli (Fondazione Bruno Kessler (FBK), Italy); Benjamin Elizalde (ICSI Berkeley, USA); Karl Ni (Lawrence Livermore National Laboratory, USA); Gerald Friedland (International Computer Science Institute, USA)

Unsupervised Learning and Refinement of Rhythmic Patterns for Beat and Downbeat Tracking -- Florian Krebs (Johannes Kepler University, Linz, Austria); Filip Korzeniowski (Johannes Kepler University, Linz, Austria); Maarten Grachten (Austrian Research Institute for Artificial Intelligence, Austria); Gerhard Widmer (Johannes Kepler University Linz, Austria)

Speech-Music Discrimination: a Deep Learning Perspective -- Aggelos Pikrakis (University of Piraeus, Greece); Sergios Theodoridis (University of Athens, Greece)

Exploring Superframe Co-occurrence for Acoustic Event Recognition -- Huy Phan (University of Lübeck, Germany); Alfred Mertins (Institute for Signal and Image Processing, University of Luebeck, Germany)

Detecting Sound Objects in Audio Recordings -- Anurag Kumar (Carnegie Mellon University, USA); Rita Singh (Carnegie Mellon University, USA); Bhiksha Raj (Carnegie Mellon University, USA)

A Montage Approach to Sound Texture Synthesis -- Sean O'Leary (IRCAM, France); Axel Roebel (IRCAM, France)

A Compressible Template Protection Scheme for Face Recognition Based on Sparse Representation -- Yuichi Muraki (Tokyo Metropolitan University, Japan); Masakazu Furukawa (Tokyo Metropolitan University, Japan); Masaaki Fujiyoshi (Tokyo Metropolitan University, Japan); Yoshihide Tonomura (NTT, Japan); Hitoshi Kiya (Tokyo Metropolitan University, Japan)

Sparse Reconstruction of Facial Expressions with Localized Gabor Moments -- André Mourão (Universidade Nova Lisbon, Portugal); Pedro Borges (Universidade Nova de Lisboa, Portugal); Nuno Correia (Computer Science, Portugal); Joao Magalhaes (Universidade Nova Lisbon, Portugal)

Pornography Detection Using BossaNova Video Descriptor -- Carlos Caetano (Federal University of Minas Gerais, Brazil); Sandra Avila (University of Campinas, Brazil); Silvio Guimarães (PUC Minas, Brazil); Arnaldo Araújo (Federal University of Minas Gerais, Brazil)

Feature Level Combination for Object Recognition -- Abdollah Amirkhani-Shahraki (IUST & Iran University of Science and Technology, Iran)

Sparse Representation and Least Squares-based Classification in Face Recognition -- Michael Iliadis (Northwestern University, USA); Leonidas Spinoulas (Northwestern University, USA); Albert S. Berahas (Northwestern University, USA); Haohong Wang (TCL Research America, USA); Aggelos K Katsaggelos (Northwestern University, USA)

Greedy Methods for Simultaneous Sparse Approximation -- Leila Belmerhnia (CRAN, Université de Lorraine, CNRS, France); El-Hadi Djermoune (CRAN, Nancy-Université, CNRS, France); David Brie (CRAN, Nancy Université, CNRS, France)

Sparse Matrix Decompositions for Clustering -- Thomas Blumensath (University of Southampton, United Kingdom)

Evaluation of Non-Linear Combinations of Rescaled Reassigned Spectrograms -- Maria Sandsten (Lund University, Sweden)

Today, I present a talk at the SoundSoftware 2014 Third Workshop on Software and Data for Audio and Music Research: "How reproducibility tipped the scale toward article acceptance".

I discuss a recent episode in which our submission of a negative-result article -- contradicting previously published work -- was favorably reviewed and eventually published (here). The review process, and the persuasion of the reviewers, were greatly aided by our efforts at reproducibility. We won a reproducibility prize last year for this work.

The program holds a bevy of good-looking papers. I am particularly looking forward to learning more about these:

A COMPOSITIONAL HIERARCHICAL MODEL FOR MUSIC INFORMATION RETRIEVAL

AN ANALYSIS AND EVALUATION OF AUDIO FEATURES FOR MULTITRACK MUSIC MIXTURES

AN ASSOCIATION-BASED APPROACH TO GENRE CLASSIFICATION IN MUSIC

AUTOMATIC INSTRUMENT CLASSIFICATION OF ETHNOMUSICOLOGICAL AUDIO RECORDINGS

CLASSIFYING EEG RECORDINGS OF RHYTHM PERCEPTION

CODEBOOK BASED SCALABLE MUSIC TAGGING WITH POISSON MATRIX FACTORIZATION

DETECTING DROPS IN EDM: CONTENT-BASED APPROACHES TO A SOCIALLY SIGNIFICANT MUSIC EVENT

EVALUATING THE EVALUATION MEASURES FOR BEAT TRACKING

IMPROVING MUSIC RECOMMENDER SYSTEMS: WHAT CAN WE LEARN FROM RESEARCH ON MUSIC TASTES?

INFORMATION-THEORETIC MEASURES OF MUSIC LISTENING BEHAVIOUR

JAMS: A JSON ANNOTATED MUSIC SPECIFICATION FOR REPRODUCIBLE MIR RESEARCH

MIR_EVAL

MODELING TEMPORAL STRUCTURE IN MUSIC FOR EMOTION PREDICTION USING PAIRWISE COMPARISONS

MUSIC CLASSIFICATION BY TRANSDUCTIVE LEARNING USING BIPARTITE HETEROGENEOUS NETWORKS

ON COMPARATIVE STATISTICS FOR LABELLING TASKS: WHAT CAN WE LEARN FROM MIREX ACE 2013?

ON CULTURAL AND EXPERIENTIAL ASPECTS OF MUSIC MOOD

ON INTER-RATER AGREEMENT IN AUDIO MUSIC SIMILARITY

TEN YEARS OF MIREX (MUSIC INFORMATION RETRIEVAL EVALUATION EXCHANGE): REFLECTIONS, CHALLENGES AND OPPORTUNITIES

THEORETICAL FRAMEWORK OF A COMPUTATIONAL MODEL OF AUDITORY MEMORY FOR MUSIC EMOTION RECOGNITION

TRANSFER LEARNING BY SUPERVISED PRE-TRAINING FOR AUDIO-BASED MUSIC CLASSIFICATION

WHAT IS THE EFFECT OF AUDIO QUALITY ON THE ROBUSTNESS OF MFCCS AND CHROMA FEATURES?


A few months ago, I submitted to ISMIR 2014 a paper that essentially casts into the MIR conference community my five prescriptions for motivating scientific research in music information retrieval, together with a summary of the story of Clever Hans. I provocatively titled the paper "The future of scientific research in music information retrieval", the meaning of which comes from the first sentence of the abstract: "We make five prescriptions that can help ensure future research in music information retrieval (MIR) contributes valid (scientific) knowledge." I meant my submission not as a research paper presenting a new MIR system/problem/dataset, but as a position paper: a summary in one place of the major findings from my work and collaborations of the past two years in MIR, and a proposal of "a way forward" when it comes to what I term "a crisis in MIR evaluation: a large number of published works related to machine music listening (> 500) report results using evaluations that lack the validity for making any meaningful comparisons or conclusions with regards to machine music listening."

The reviews are in (rejection), but the machinery to respond to the comments does not exist. The four reviewers undoubtedly spent a good amount of time reviewing and discussing my paper from their perspectives, and with clear competence in the topics. The quality of the reviews (as of those on my other four submissions) is by and large exceptional among the conferences to which I submit, and I very much appreciate the reviewers' efforts. Their comments reveal where my text has fallen short of its goal, which helps me significantly in refining the delivery of the ideas I am advocating. Below, I try to address these shortcomings in line with the reviewer comments. (I understand this is unconventional; however, I think the discussion is useful for illuminating my five prescriptions, just published in JNMR.)

Hello, and welcome to Paper of the Day (Po'D): Intriguing properties of neural networks edition. Today's paper is: C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow and R. Fergus, "Intriguing properties of neural networks", in Proc. Int. Conf. Learning Representations, 2014. Today's paper is very exciting for me because I see "horses" nearly being called "horses" in a machine learning research domain outside music information retrieval. Furthermore, the arguments this work is apparently provoking resemble those I have received in peer review of my own work. For instance, see the comments on this post, or the reviews here. The paper is also garnering some press, e.g., ZDNet and Slashdot; and its results are being used to bolster the argument that the hottest topic in machine learning is over-hyped.

The one-line precis of this paper is: The deep neural network: as uninterpretable as it ever was, and now acting in ways that contradict notions of generalization.
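
The core demonstration of the paper is that one can find a tiny, targeted perturbation that flips a network's decision while leaving the input essentially unchanged to a human. Here is a minimal sketch of that idea in Python; to stay self-contained it uses a toy logistic-regression model, for which the minimal perturbation has a closed form, whereas the paper finds perturbations for deep networks numerically with box-constrained L-BFGS. Everything in the sketch is illustrative:

```python
# Toy illustration (not the paper's method): for a linear model, the
# smallest L2 perturbation that flips the decision lies along the
# weight vector w.
import numpy as np

rng = np.random.default_rng(0)

# Two overlapping Gaussian classes in 50 dimensions.
d = 50
X = np.vstack([rng.normal(-0.2, 1, (200, d)), rng.normal(0.2, 1, (200, d))])
y = np.hstack([np.zeros(200), np.ones(200)])

# Fit logistic regression by gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(1000):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

# Take a correctly classified class-0 example (score < 0) ...
scores = X[:200] @ w + b
x = X[np.flatnonzero(scores < 0)[0]]

# ... and push it just across the boundary with the minimal L2 step.
s = x @ w + b
r = (-s + 1e-3) * w / (w @ w)
x_adv = x + r

print("score before: %+.3f  after: %+.3f" % (s, x_adv @ w + b))
print("||r|| = %.3f vs ||x|| = %.3f" % (np.linalg.norm(r), np.linalg.norm(x)))
```

For a deep network there is no closed form, but the punchline carries over: the decision flips under a change that is small relative to the input.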

Hello, and welcome to Paper of the Day (Po'D): Horses and more horses edition. Today's paper is: B. L. Sturm, "A Simple Method to Determine if a Music Information Retrieval System is a 'Horse'", IEEE Trans. Multimedia, 2014 (in press). This double-header of a Po'D also includes this paper: B. L. Sturm, C. Kereliuk, and A. Pikrakis, "A Closer Look at Deep Learning Neural Networks with Low-level Spectral Periodicity Features", Proc. 4th International Workshop on Cognitive Information Processing, June 2014.

The one-line precis of these papers is:
For some use cases, it is important to ensure Music Information Retrieval (MIR) systems are reproducing "ground truth" for the right reasons. Here's how.
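
Though the details are in the paper, the flavor of the check is easy to sketch: intervene on the test input with a transformation that is irrelevant to the labels, and see whether the system's performance survives. Below is a minimal, hypothetical sketch in Python of that kind of check; the filter, function names, and data are illustrative placeholders, not the procedure from the paper:

```python
# Hedged sketch of an "irrelevant transformation" test: perturb test
# audio in a way no music listener would consider label-relevant
# (here, a gentle first-order FIR filter) and compare the figure of
# merit before and after. A large gap suggests a "horse".
import numpy as np
from scipy.signal import lfilter

def irrelevant_transform(x, strength=0.05):
    """Mild, label-preserving linear time-invariant filtering."""
    return lfilter([1.0, strength], [1.0], x)

def horse_test(classify, test_signals, labels):
    """`classify` maps an audio array to a predicted label; it and
    the test data are placeholders for whatever system is under study."""
    acc = np.mean([classify(x) == y for x, y in zip(test_signals, labels)])
    acc_t = np.mean([classify(irrelevant_transform(x)) == y
                     for x, y in zip(test_signals, labels)])
    return acc, acc_t
```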

The State of the Art

Finally published is my article, The State of the Art Ten Years After A State of the Art: Future Research in Music Information Retrieval. This article, one culmination of my recently finished postdoc grant, examines the past ten years of research since Aucouturier and Pachet's "Representing Musical Genre: A State of the Art" in 2003. The one-line summary:

The state of the art now is nearly the same as it was then.
A preprint is available here.

This work leads to five "prescriptions" for motivating progress in research addressing real problems, not only in music genre recognition (whatever that is, see section 2), but more broadly in music information retrieval (see section 4), and, wider still, in any application of machine learning (grant proposal pending):

  1. Define problems with use cases and formalism.
  2. Design valid and relevant experiments.
  3. Perform system analysis deeper than just evaluation.
  4. Acknowledge limitations and proceed with skepticism.
  5. Make reproducible work reproducible.
These are quite obvious, of course; but I have found that they are practiced only rarely.

Since my work on the GTZAN dataset (now formally incorporated in this JNMR article), I have received several emails from people wondering whether it is OK to use GTZAN, or what else to use, for testing their genre recognition systems. For instance:
"I just went through your paper and you have criticized datasets like GTZAN. I knew that GTZAN is a very popular dataset that is used in evaluation. But now there are flaws with it, what are the alternatives? And generally researchers don't stop at testing with just one dataset, maybe they take 3-4 datasets. In your opinion, what are the best music datasets which you can work with and avoid the problems of datasets like GTZAN?"
Some reviewers of my analysis of GTZAN have suggested the dataset should be banished, or that much better datasets are now available. My position is the following: GTZAN should not be banished, but used properly. That means using it with full consideration of its faults, and drawing conclusions from results derived from it with full acknowledgement of the limitations of the experiment. Used this way, GTZAN can still be useful for MIR. No other dataset is going to be free of faults: for instance, I have recently found faults in two other well-used datasets, BALLROOM and the Latin Music Dataset.

Whether or not a dataset, or set of datasets, is "good" depends on its relevance to the scientific question that is being asked. In a very real sense, the least of one's worries should be datasets. Much more effort must be made in the design, implementation and analysis of valid and relevant experiments.

Faults in the Latin Music Database

The Latin Music Database (LMD) was created around 2007 by Silla et al. for use in a comparative evaluation of particular approaches for music genre classification. It has been used in the MIREX Latin music genre recognition task since 2009.

LMD is described in C. N. Silla, A. L. Koerich, and C. A. A. Kaestner, "The Latin music database," in Proc. ISMIR, 2008. That paper describes the LMD as 3,227 song recordings, each labeled in one of ten classes: Axé, Bachata, Bolero, Forró, Gaúcha, Merengue, Pagode, Salsa, Sertaneja, and Tango. This dataset is notable among those created for music genre recognition because it contains music outside the realm of Western popular music. As in the Ballroom dataset, each music recording is assigned a single label, here by "experts in Brazilian dance" according to the appropriate dance. However, unlike GTZAN and Ballroom, the audio data is not freely available; only pre-computed features are available for download.

Searching through the references of my music genre recognition survey, I find this dataset (or portions of it) has been used in the evaluations of music genre recognition systems in at least 16 conference papers and journal articles:

  1. Y. M. G. Costa, L. S. Oliveira, A. L. Koerich, and F. Gouyon. Music genre recognition using spectrograms. In Proc. Int. Conf. Systems, Signals and Image Process., 2011.
  2. Y. M. G. Costa, L. S. Oliveira, A. L. Koerich, and F. Gouyon. Comparing textural features for music genre classification. In Proc. IEEE World Cong. Comp. Intell., June 2012.
  3. Y.M.G. Costa, L.S. Oliveira, A.L. Koerich, F. Gouyon, and J.G. Martins. Music genre classification using LBP textural features. Signal Process., 92(11):2723-2737, Nov. 2012.
  4. S. Doraisamy and S. Golzari. Automatic musical genre classification and artificial immune recognition system. In Z. W. Ras and A. A. Wieczorkowska, editors, Advances in Music Information Retrieval, pages 390-402. Springer, 2010.
  5. N. A. Draman, C. Wilson, and S. Ling. Modified AIS-based classifier for music genre classification. In Proc. ISMIR, pages 369-374, 2010.
  6. T. Lidy, C. Silla, O. Cornelis, F. Gouyon, A. Rauber, C. A. A. Kaestner, and A. L. Koerich. On the suitability of state-of-the-art music information retrieval methods for analyzing, categorizing and accessing non-western and ethnic music collections. Signal Process., 90(4):1032-1048, 2010.
  7. M. Lopes, F. Gouyon, A. Koerich, and L. E. S. Oliveira. Selection of training instances for music genre classification. In Proc. ICPR, Istanbul, Turkey, 2010.
  8. G. Marques, T. Langlois, F. Gouyon, M. Lopes, and M. Sordo. Short-term feature space and music genre classification. J. New Music Research, 40(2):127-137, 2011.
  9. G. Marques, M. Lopes, M. Sordo, T. Langlois, and F. Gouyon. Additional evidence that common low-level features of individual audio frames are not representative of music genres. In Proc. SMC, Barcelona, Spain, July 2010.
  10. A. Schindler and A. Rauber. Capturing the temporal domain in echonest features for improved classification effectiveness. In Proc. Adaptive Multimedia Retrieval, Oct. 2012.
  11. C. Silla, A. Koerich, and C. Kaestner. Improving automatic music genre classification with hybrid content-based feature vectors. In Proc. Symp. Applied Comp., Sierre, Switzerland, Mar. 2010.
  12. C. N. Silla, A. Koerich, and C. Kaestner. Automatic music genre classification using ensembles of classifiers. In Proc. IEEE Int. Conf. Systems, Man, Cybernetics, pages 1687-1692, 2007.
  13. C. N. Silla, A. L. Koerich, and C. A. A. Kaestner. Feature selection in automatic music genre classification. In Proc. IEEE Int. Symp. Multimedia, pages 39-44, 2008.
  14. C. N. Silla, A. L. Koerich, and C. A. A. Kaestner. A feature selection approach for automatic music genre classification. Int. J. Semantic Computing, 3(2):183-208, 2009.
  15. C. Silla, C. Kaestner, and A. Koerich. Time-space ensemble strategies for automatic music genre classification. In Jaime Sichman, Helder Coelho, and Solange Rezende, editors, Advances in Artificial Intelligence, pages 339-348. Springer Berlin / Heidelberg, 2006.
  16. C.N. Silla and A. A. Freitas. Novel top-down approaches for hierarchical classification and their application to automatic music genre classification. In IEEE Int. Conf. Systems, Man, and Cybernetics, San Antonio, USA, Oct. 2009.
However, as with GTZAN and Ballroom, it appears that researchers have taken the integrity of LMD for granted. I have acquired the audio for LMD, which comprises 3,229 song files (two more than stated by Silla et al.). Through my fingerprinting method (just a little Shazam-like implementation; a toy sketch of the idea appears below), I compare all songs within each class, and find 213 replicas. This is in spite of the cautions Silla et al. (2008) describe taking in creating LMD.
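
For the curious, here is a toy sketch of such a Shazam-like scheme -- not the implementation used for the numbers above, and with all parameter values invented for illustration. Local peaks of the magnitude spectrogram serve as landmarks; pairs of nearby peaks are hashed; and two recordings are flagged as replicas when they share many hashes at one consistent time offset:

```python
# Toy sketch of a Shazam-like duplicate detector (not the exact
# implementation used for the numbers above).
from collections import Counter

import numpy as np
from scipy.ndimage import maximum_filter
from scipy.signal import stft

def landmarks(x, fs=22050):
    """Return (frame, bin) locations of magnitude-spectrogram peaks."""
    _, _, Z = stft(x, fs=fs, nperseg=1024, noverlap=512)
    S = np.abs(Z)
    peaks = (S == maximum_filter(S, size=(15, 15))) & (S > np.median(S))
    f_bins, frames = np.nonzero(peaks)
    order = np.argsort(frames)
    return list(zip(frames[order], f_bins[order]))

def fingerprints(x, fan_out=5, max_dt=64):
    """Hash pairs of nearby peaks into (hash, anchor_frame) entries."""
    lm = landmarks(x)
    prints = []
    for i, (t1, f1) in enumerate(lm):
        for t2, f2 in lm[i + 1:i + 1 + fan_out]:
            if 0 < t2 - t1 <= max_dt:
                prints.append(((f1, f2, t2 - t1), t1))
    return prints

def match_score(prints_a, prints_b):
    """Count hash matches that agree on a single time offset."""
    index = {}
    for h, t in prints_a:
        index.setdefault(h, []).append(t)
    offsets = Counter(tb - ta for h, tb in prints_b for ta in index.get(h, []))
    return max(offsets.values()) if offsets else 0

# Two files are declared replicas when match_score exceeds a threshold
# tuned on known replica/non-replica pairs.
```

The time-offset histogram is what gives such schemes their robustness: unrelated recordings share a few stray hashes, but only replicas share many hashes at a single offset.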

So far, I have looked for replicas only within each class, and not across classes; but we now know at least 6.5% of the dataset is replicated (a proportion greater than the 5% in GTZAN, which we already know cannot be ignored). Below, I list the replicas I find.

One solution to the wine puzzle

I posed this puzzle a few days ago, partly as an exercise in experimental design.

What is different between the two test conditions?

One thing that is different is that the bottle shape is visible in the first condition, but not in the second. Even though we carefully covered each label to conceal its identity, we overlooked the fact that the bottle shape can carry information about the region from which the wine comes. The following picture shows the variety of shapes.

[Image: bottleshapes.jpg]

From the left: bottles of this shape usually come from Bordeaux; the next bottle shape from Burgundy; the next from Rhône; the next from Champagne; and the next from Côtes de Provence. In our dataset, then, the bottle shape is confounded with region. We thus inadvertently trained our experts to recognize the region from which a wine comes not from the contents of the bottle, but from the shape of its container. Hence, when Team Småg can no longer see the bottle, they must guess.
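
To see how decisive such a confound can be, consider a small hypothetical simulation (all numbers invented): "bottle shape" perfectly predicts region in the training data, while "taste" carries only a weak cue. A simple classifier looks excellent until the shapes are made uninformative:

```python
# Hypothetical simulation of the confound: shape features perfectly
# predict region; taste features carry only a weak signal. Accuracy
# is near perfect with bottles visible, and drops toward chance (0.2)
# once the shape feature is shuffled ("decanted").
import numpy as np

rng = np.random.default_rng(1)
n, n_regions = 600, 5
region = rng.integers(0, n_regions, n)

taste = rng.normal(0, 1, (n, 10)) + 0.1 * region[:, None]  # weak cue
shape = np.eye(n_regions)[region]                           # perfect confound
X = np.hstack([taste, shape])

# Nearest-centroid classifier trained on half the data.
train, test = np.arange(0, n, 2), np.arange(1, n, 2)
centroids = np.vstack([X[train][region[train] == k].mean(axis=0)
                       for k in range(n_regions)])

def accuracy(Xt):
    d = ((Xt[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return np.mean(d.argmin(axis=1) == region[test])

print("bottle visible: %.2f" % accuracy(X[test]))

# "Decant" the test wines: shapes no longer carry region information.
X_hidden = X[test].copy()
X_hidden[:, 10:] = X_hidden[rng.permutation(len(test)), 10:]
print("bottle hidden:  %.2f" % accuracy(X_hidden))
```

On this toy data, the classifier scores nearly perfectly with the bottles visible and falls most of the way back toward chance once they are hidden -- just like Team Småg.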

I came up with this example to show how easy it can be to believe that one's learning experiment is valid, and that the figure of merit reflects real-world performance, when the experiment is actually invalid due to an independent variable of which the experimenter is unaware. This issue is directly discussed as a limitation by Ava Chase in her article, "Music discriminations by carp (Cyprinus carpio)":

Even a convincing demonstration of categorization can fail to identify the stimulus features that exert control at any given time, especially if the stimuli are complex. In particular, there can be uncertainty as to whether classification behavior had been under the stimulus control of the features in terms of which the experimenter had defined the categories or whether the subjects had discovered an effective discriminant of which the experimenter was unaware. The diversity of S+/S- pairings presumably rules out the possibility that the fish could have been relying on only a single discriminant, such as timbre, but a constant concern is the possible existence of a simple attribute that would have allowed the subjects merely to discriminate instead of categorizing. (emphasis mine)

Faults in the Ballroom dataset

The Ballroom dataset (BD) was created around 2004 by F. Gouyon et al. for use in a comparative evaluation of particular features for music genre classification. BD is described in the paper, F. Gouyon, S. Dixon, E. Pampalk, and G. Widmer, "Evaluating rhythmic descriptors for musical genre classification", in Proc. Int. Conf. Audio Eng. Society, June 2004.

[BD] contains excerpts from 698 pieces of music, around 30 seconds long. The audio quality of this data is quite low, it was originally fetched in real audio format [from http://www.ballroomdancers.com], with a compression factor of almost 22 with respect to the common 44.1 kHz 16 bits mono WAV format. ... The data covers eight musical sub-genres of ballroom dance music: [Jive, Quickstep, Tango, Waltz, Viennese Waltz, Samba, Cha Cha Cha and Rumba].