Recently in Audio Signals Category

Hello, and welcome to the Paper of the Day (Po'D): A Survey of Evaluation in Music Genre Recognition. Today's paper is B. L. Sturm, "A Survey of Evaluation in Music Genre Recognition", Proc. Adaptive Multimedia Retrieval, Copenhagen, Denmark, Oct. 2012.

This paper is best summarized by a particularly riveting line of section 2.2:

The most-used publicly available dataset in music genre recognition work is that produced in [378,379], often called "GTZAN." This audio dataset appears in more than 23% (96) of the references [5,11,14,16,18,27,33,35-40,53,57,58, 84,91,106,107, 109, 114, 130, 131, 136, 138, 142, 143, 163, 164, 177, 182, 191, 199, 201, 202, 204-206, 208, 209, 212-215, 217, 218, 223, 236, 237, 240, 241, 246, 270, 272, 285-290, 314, 318, 319, 322, 323, 325, 331, 336, 337, 339-341, 344, 345, 362-366, 368, 371-374, 377-379, 398,399,402,404,405, 407,411,416].
The numbers just sort of roll off the tongue. I think I might approach the presentation of this paper like they do at humanities conferences: I read it. Aloud. With no slides. It is really only 7 pages of text and 14 pages of references. I can skip the references.




And in the style of Harvard author-date referencing, here is the first line of my paper:

Despite much work [Abeßer et al., 2008, 2009, 2010, 2012, Ahonen, 2010, Ahrendt et al., 2004, 2005, Ahrendt, 2006, Almoosa et al., 2010, Anan et al., 2011, Andén and Mallat, 2011, Anglade et al., 2009a,b, 2010, Annesi et al., 2007, Arabi and Lu, 2009, Arenas-Garcia et al., 2006, Ariyaratne and Zhang, 2012, Aryafar and Shokoufandeh, 2011, Aryafar et al., 2012, Aucouturier and Pachet, 2002, 2003, Aucouturier and Pampalk, 2008, Aucouturier, 2009, Avcu et al., 2007, Bagci and Erzin, 2006, Bağci and Erzin, 2007, Balkema, 2007, Balkema and van der Heijden, 2010, Barbedo and Lopes, 2007, Barbedo, 2008, Barbieri et al., 2010, Barreira et al., 2011, Basili et al., 2004, Behun, 2012, Benetos and Kotropoulos, 2008, 2010, Bergstra et al., 2006, Bergstra, 2006, Bergstra et al., 2010, Bickerstaffe and Makalic, 2003, Bigerelle and Iost, 2000, Blume et al., 2008, Brecheisen et al., 2006, Burred and Lerch, 2003, Burred, 2004, 2005, Burred and Peeters, 2009, Casey et al., 2008, Cataltepe et al., 2007, Chai and Vercoe, 2001, Chang et al., 2008, 2010, Charami et al., 2007, Charbuillet et al., 2011, Chase, 2001, Chen et al., 2006, 2008, 2009, Chen and Chen, 2009, Chen et al., 2010, Chew et al., 2005, Cilibrasi et al., 2004, Cilibrasi and Vitanyi, 2005, Cornelis et al., 2010, Correa et al., 2010, Costa et al., 2004, 2011, 2012b,a, Craft et al., 2007, Craft, 2007, Cruz-Alcázar and Vidal, 2008, Dannenberg et al., 2001, Dannenberg, 2010, DeCoro et al., 2007, Dehghani and Lovett, 2006, Dellandrea et al., 2005, Deshpande et al., 2001, Dieleman et al., 2011, Diodati and Piazza, 2000, Dixon et al., 2003, 2004, 2010, Doraisamy et al., 2008, Doraisamy and Golzari, 2010, Downie et al., 2005, Downie, 2008, Downie et al., 2010, Draman et al., 2010, 2011, Esmaili et al., 2004, Ezzaidi and Rouat, 2007, Ezzaidi et al., 2009, Fadeev et al., 2009, Fernandez et al., 2011, Fernández and Chávez, 2012, Fiebrink and Fujinaga, 2006, Flexer et al., 2005, 2006, Flexer, 2006, 2007, Flexer and Schnitzer, 2009, 2010, Frederico, 2004, Fu et al., 2010a,b, 2011a,b, García et al., 2007, Garcia-Garcia et al., 2010, García et al., 2012, Gedik and Alpkocak, 2006, Genussov and Cohen, 2010, Gjerdingen and Perrott, 2008, Golub, 2000, Golzari et al., 2008a,c,b, González et al., 2010, Goto et al., 2003, Goulart et al., 2011, 2012, Gouyon et al., 2004, Gouyon and Dixon, 2004, Gouyon, 2005, Grimaldi et al., 2003, 2006, Grosse et al., 2007, Guaus, 2009, Hamel and Eck, 2010, Han et al., 1998, Hansen et al., 2005, Harb et al., 2004, Harb and Chen, 2007, Hartmann, 2011, Heittola, 2003, Henaff et al., 2011, Herkiloglu et al., 2006, de la Higuera et al., 2005, Hillewaere et al., 2012, Holzapfel and Stylianou, 2007, 2008a,b, 2009, Homburg et al., 2005, Honingh and Bod, 2011, Hsieh et al., 2012, Hu and Ogihara, 2012, Iñesta et al., 2009, ISMIR, 2004, ISMIS, 2011, Izmirli, 2009, Jang et al., 2008, Jennings et al., 2004, Jensen et al., 2006, Jiang et al., 2002, Jin and Bie, 2006, Lu et al., 2009, Jothilakshmi and Kathiresan, 2012, Ju et al., 2010, Kaminskas and Ricci, 2012, Karkavitsas and Tsihrintzis, 2011, 2012, Karydis, 2006, Karydis et al., 2006, Kiernan, 2000, Kim and Cho, 2011, Kini et al., 2011, Kirss, 2007, Kitahara et al., 2008, Kobayakawa and Hoshi, 2011, Koerich and Poitevin, 2005, Kofod and Ortiz-Arroyo, 2008, Kosina, 2002, Kostek et al., 2011, Kotropoulos et al., 2010, Krumhansl, 2010, Kuo and Shan, 2004, Lambrou et al., 1998, Lampropoulos et al., 2005, 2010, 2012, Langlois and Marques, 2009a,b, Lee and Downie, 2004, Lee et al., 2006, 2007, 2008, 2009b,a,c, 2011, Lehn-Schioler et al., 2006, de Leon and Inesta, 2002, de León and Iñesta, 2003, 2004, de Leon and Inesta, 2007, de Leon and Martinez, 2012, Levy and Sandler, 2006, Li et al., 2003, Li and Tzanetakis, 2003, Li and Ogihara, 2004, Li and Sleep, 2005, Li and Ogihara, 2005, 2006, Li et al., 2009, 2010, Li and Chan, 2011, Lidy and Rauber, 2003, Lidy, 2003, Lidy and Rauber, 2005, Lidy, 2006, Lidy et al., 2007, Lidy and Rauber, 2008, Lidy et al., 2010b,a, Lim et al., 2011, Lin et al., 2004, Lippens et al., 2004, Liu et al., 2007, 2008, 2009a,b, Lo and Lin, 2010, Loh and Emmanuel, 2006, Lopes et al., 2010, Lukashevich et al., 2009, Lukashevich, 2012, M. et al., 2011, Mace et al., 2011, Manaris et al., 2005, 2008, 2011, Mandel et al., 2006, Manzagol et al., 2008, Markov and Matsui, 2012, Marques and Langlois, 2009, Marques et al., 2010, 2011b,a, Matityaho and Furst, 1995, Mayer et al., 2008b, Mayer and Rauber, 2010a,b, Mayer et al., 2010, Mayer and Rauber, 2011, McKay and Fujinaga, 2004, McKay, 2004, McKay and Fujinaga, 2005, 2006, 2008, McKay, 2010, McKay and Fujinaga, 2010, McKay et al., 2010, McKinney and Breebaart, 2003, Meng et al., 2005, Meng and Shawe-Taylor, 2008, Mierswa and Morik, 2005, MIREX, 2005, 2007, 2008, 2009, 2010, 2011, 2012, Mitra and Wang, 2008, Mitri et al., 2004, Moerchen et al., 2005, 2006, Nagathil et al., 2010, 2011, Nayak and Bhutani, 2011, Neubarth et al., 2011, Neumayer and Rauber, 2007, Nie et al., 2009, Nopthaisong and Hasan, 2007, Norowi et al., 2005, Novello et al., 2006, Orio, 2006, Orio et al., 2011, Pampalk et al., 2003, 2005, Pampalk, 2006, Panagakis et al., 2008, 2009a,b, 2010a,b, Panagakis and Kotropoulos, 2010, Paradzinets et al., 2009, Park, 2009a,b, 2010, Park et al., 2011, Peeters, 2007, 2011, Iñesta and Rizo, 2009, Pérez et al., 2010, Pérez-Sancho et al., 2005, Pérez et al., 2008, Perez et al., 2008, 2009, Pérez, 2009, Pohle, 2005, Pohle et al., 2006, 2008, 2009, Porter and Neuringer, 1984, Pye, 2000, Rafailidis et al., 2009, Rauber and Frühwirth, 2001, Rauber et al., 2002, Ravelli et al., 2010, Reed and Lee, 2006, 2007, Rin et al., 2010, Ren and Jang, 2011, 2012, Ribeiro et al., 2012, Rizzi et al., 2008, Rocha, 2011, Rump et al., 2010, Ruppin and Yeshurun, 2006, Salamon et al., 2012, Sanden et al., 2008, 2010, Sanden and Zhang, 2011a,b, Sanden et al., 2012, de los Santos, 2010, Scaringella and Zoia, 2005, Scaringella et al., 2006, Schierz and Budka, 2011, Schindler et al., 2012, Schindler and Rauber, 2012, Seo and Lee, 2011, Seo, 2011, Serra et al., 2011, Seyerlehner, 2010, Seyerlehner et al., 2010, 2011, Shao et al., 2004, Shen et al., 2005, 2006, 2010, Silla et al., 2006, 2007, 2008a,b, Silla and Freitas, 2009, Silla et al., 2009, 2010, Silla and Freitas, 2011, Simsekli, 2010, Soltau, 1997, Soltau et al., 1998, Song et al., 2007, Song and Zhang, 2008, Sonmez, 2005, Sordo et al., 2008, Sotiropoulos et al., 2008, Srinivasan and Kankanhalli, 2004, Sturm and Noorzad, 2012, Sturm, 2012a,b, Sundaram and Narayanan, 2007, Happi Tietche et al., 2012, Tsai and Bao, 2010, Tsatsishvili, 2011, Tsunoo et al., 2009a,b, 2011, Turnbull and Elkan, 2005, Typke et al., 2005, Tzagkarakis et al., 2006, Tzanetakis et al., 2001, Tzanetakis and Cook, 2002, Tzanetakis, 2002, Tzanetakis et al., 2003, Umapathy et al., 2005, Valdez and Guevara, 2011, Vatolkin et al., 2010, 2011, Vatolkin, 2012, Völkel et al., 2010, Wang et al., 2008, 2009, 2010, Weihs et al., 2007, Welsh et al., 1999, West and Cox, 2004, 2005, West and Lamere, 2007, West, 2008, Whitman and Smaragdis, 2002, Wiggins, 2009, Wu et al., 2011, Wülfing and Riedmiller, 2012, Xu et al., 2003, Yang et al., 2011a,b, Yao et al., 2010, Yaslan and Cataltepe, 2006a,b, 2009, Yeh and Yang, 2012, Ying et al., 2012, Yoon et al., 2005, Zanoni et al., 2012, Zeng et al., 2009, Zhang and Zhou, 2003, Zhang et al., 2008, Zhen and Xu, 2010a,b, Zhou et al., 2012, Zhu et al., 2004], music genre recognition (MGR) remains a compelling problem to solve by a machine.

Music genre flowchart

[Figure: flow.png] From: T. Zhang, "Semi-automatic approach for music classification," in Proc. SPIE Conf. on Internet Multimedia Management Systems, 2003.

The authors put together a flowchart for automatic classification. I was curious about "detect features of symphony", especially when one only has a 30 second clip: "Since a symphony is composed of multiple movements and repetitions, there is an alternation between relatively high volume audio signal (e.g. performance of the whole orchestra) and low volume audio signal (e.g. performance of single instrument or a few instruments of the orchestra) along the music piece. ... Thus, by checking the existence of alternation between high volume and low volume intervals (with each interval longer than a certain threshold) and/or repetition(s) in the whole music piece, symphonies will be distinguished [from other genres]."
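Out of curiosity, here is a minimal sketch of that volume-alternation check in Python with NumPy. The frame length, threshold, and minimum interval duration are my own guesses, not values taken from the paper.

```python
import numpy as np

def alternates_loud_quiet(x, sr, frame_s=0.1, thresh_db=-25.0, min_interval_s=2.0):
    """Crude check for alternation between sustained high- and low-volume
    intervals, in the spirit of the quoted heuristic. Parameter values are
    illustrative guesses, not those of Zhang (2003)."""
    hop = int(frame_s * sr)
    frames = [x[i:i + hop] for i in range(0, len(x) - hop + 1, hop)]
    rms = np.array([np.sqrt(np.mean(f ** 2)) + 1e-12 for f in frames])
    level_db = 20 * np.log10(rms / rms.max())  # level relative to loudest frame

    # Label each frame loud/quiet, then collapse consecutive frames into runs.
    loud = level_db > thresh_db
    runs = []  # list of [is_loud, run_length_in_frames]
    for flag in loud:
        if runs and runs[-1][0] == bool(flag):
            runs[-1][1] += 1
        else:
            runs.append([bool(flag), 1])

    # Keep only runs long enough to count as an "interval".
    min_frames = int(min_interval_s / frame_s)
    sustained = [flag for flag, n in runs if n >= min_frames]

    # "Alternation": at least two sustained loud intervals and a quiet one.
    return sustained.count(True) >= 2 and sustained.count(False) >= 1
```

Plenty of non-symphonic music would pass such a test, which is part of the problem discussed next.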

Props to the authors for attempting the impossible, but any flowchart for assigning music genre must be broken from the very first decision. Genres are not uniquely specified by characteristics that mutually exclude others.

Music genre taxonomy

[Figure: genretax.png] From: J. G. A. Barbedo and A. Lopes, "Automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, 2007.

The authors specify the meaning of each of these labels. For instance, "Dance" music has "strong percussive elements and very marked beating." Stemming from "Dance" there is "Jazz", "characterized by the predominance of instruments like piano and saxophone. Electric guitars and drums can also be present; vocals, when present, are very characteristic." And stemming from "Dance," stemming from "Jazz," there is "Cool", a "jazz style [that is] light and introspective, with a very slow rhythm." The genres "Techno" and "Disco" --- which both emphasize the importance of listening with your body and feet --- do not stem from "Dance," but instead from "Pop/Rock," "the largest class, including a wide variety of songs."

Props to the authors for attempting the impossible, but any taxonomy of music genre must be broken from the very first stem. Genres are not like species, and cannot be arranged like so. (On the plus side, it appears that to differentiate introspective music from non-introspective music requires only four spectral features computed over 21.3 ms windows.)
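As an aside, here is roughly what "spectral features computed over 21.3 ms windows" amounts to in code. The four descriptors below (centroid, roll-off, flatness, flux) are stand-ins of my choosing, not necessarily the four features Barbedo and Lopes use.

```python
import numpy as np

def frame_spectral_features(x, sr, win_s=0.0213):
    """Per-frame spectral descriptors over ~21.3 ms windows (about 1024
    samples at 48 kHz). The four features are illustrative stand-ins."""
    n = int(round(win_s * sr))
    freqs = np.fft.rfftfreq(n, 1.0 / sr)
    feats, prev_mag = [], None
    for i in range(0, len(x) - n, n):
        frame = x[i:i + n] * np.hanning(n)
        mag = np.abs(np.fft.rfft(frame)) + 1e-12
        power = mag ** 2

        centroid = np.sum(freqs * power) / np.sum(power)
        cum = np.cumsum(power)
        rolloff = freqs[np.searchsorted(cum, 0.95 * cum[-1])]
        flatness = np.exp(np.mean(np.log(mag))) / np.mean(mag)
        flux = 0.0 if prev_mag is None else np.sum((mag - prev_mag) ** 2)
        prev_mag = mag

        feats.append((centroid, rolloff, flatness, flux))
    return np.array(feats)
```

Whether four such numbers per frame can separate "introspective" from "non-introspective" music is, of course, exactly what is in question.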

Plagiarism

It started when I read the first sentence of the introduction of D. P. L. and K. Suresh, "An optimized feature set for music genre classification based on Support Vector Machine", in Proc. Recent Advances in Intelligent Computational Systems, Sep. 2011. They write:

Music is now so readily accessible in digital form that personal collections can easily exceed the practical limits on the time we have to listen to them: ten thousand music tracks on a personal music device have a total duration of approximately 30 days of continuous audio.
Then I googled "Music is now so readily accessible in digital form", and look at this! The first hit is an article in press: Angelina Tzacheva, Dirk Schlingmann, Keith Bell, "Automatic Detection of Emotions with Music Files", Int. J. Social Network Mining, in press, 2012. I can't read the entire article, but the first two sentences of the abstract are:

The amount of music files available on the Internet is constantly growing, as well as the access to recordings. Music is now so readily accessible in digital form that personal collections can easily exceed the practical limits of the time we have to listen to them.
The source of this text, however, is in the third search result: M. Casey, R. Veltkamp, M. Goto, M. Leman, C. Rhodes and M. Slaney, "Content-based Music Information Retrieval: Current Directions and Future Challenges", Proc. IEEE, vol. 96, no. 4, pp. 668-696, Apr. 2008. The first sentence of their introduction is an exact match to the text in L. and Suresh:

Music is now so readily accessible in digital form that personal collections can easily exceed the practical limits on the time we have to listen to them: ten thousand music tracks on a personal music device have a total duration of approximately 30 days of continuous audio.
I don't care to search for other examples of plagiarism in this publication, or that of Tzacheva et al. Even finding one lifted sentence in a work tells me how much time I should spend with it. Better for me to just write a blog post about it, and then send a complaint to IEEE.
This looks like a really rewarding thing to solve, but after listening to the sounds myself while viewing the labels, I am not sure it is so solvable with audio features alone. Still, I might try a little something to see what happens.
Over the weekend, I experimented with the scattering coefficient features for music genre recognition. At first I was using AdaBoost with 1000 decision stumps, giving me just above 80% accuracy. Since these features have 469 dimensions, training is very slow, so I decided to test a much quicker approach: Bayesian classification under Gaussianity assumptions. I learned class-dependent means and covariances from the training data, as well as the covariance matrix of all the training data pooled together. I then implemented the Mahalanobis distance classifier (MDC) and the full quadratic classifier (FDC), the benefits of which include simple implementation and quick training and testing. Furthermore, within a Bayesian framework we can naturally introduce notions of confidence, risk, and rejection. But I started simple: equal priors and uniform risk. Below we see the mean classification results from 10 independent trials of 10-fold stratified cross-validation.
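To make the setup concrete, here is a minimal sketch of the two classifiers, assuming the features sit in NumPy arrays; the small regularization term is my own addition to keep 469-dimensional covariance estimates invertible, not something described above.

```python
import numpy as np

def fit_gaussian_classifiers(X, y, reg=1e-6):
    """Learn per-class means/covariances and a pooled covariance from
    training data X (N x D) with integer labels y (N,), equal priors."""
    classes = np.unique(y)
    means = {c: X[y == c].mean(axis=0) for c in classes}
    covs = {c: np.cov(X[y == c], rowvar=False) + reg * np.eye(X.shape[1])
            for c in classes}
    pooled = np.cov(X, rowvar=False) + reg * np.eye(X.shape[1])
    return classes, means, covs, pooled

def predict_mdc(X, classes, means, pooled):
    """Mahalanobis distance classifier: one shared (pooled) covariance."""
    P = np.linalg.inv(pooled)
    d = np.stack([np.einsum('nd,dk,nk->n', X - means[c], P, X - means[c])
                  for c in classes], axis=1)
    return classes[np.argmin(d, axis=1)]

def predict_quadratic(X, classes, means, covs):
    """Full quadratic classifier: per-class covariance and log-determinant,
    i.e., maximum likelihood under class-conditional Gaussians."""
    scores = []
    for c in classes:
        P = np.linalg.inv(covs[c])
        diff = X - means[c]
        maha = np.einsum('nd,dk,nk->n', diff, P, diff)
        _, logdet = np.linalg.slogdet(covs[c])
        scores.append(-0.5 * (maha + logdet))
    return classes[np.argmax(np.stack(scores, axis=1), axis=1)]
```

With equal priors and uniform risk the quadratic rule is just maximum likelihood over the class-conditional Gaussians; class priors and a rejection threshold can be bolted onto the scores later.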
I have been experimenting with the approach to feature extraction proposed in J. Andén and S. Mallat, "Multiscale scattering for audio classification," Proc. Int. Soc. Music Info. Retrieval, 2011. Specifically, I have substituted these "scattering coefficients" for the features used by Bergstra et al. (2006) in AdaBoost for music genre recognition.
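For anyone wanting to reproduce the flavour of this experiment, here is how one might train boosted decision stumps on precomputed scattering features with scikit-learn. The file names are hypothetical, and scikit-learn's multiclass AdaBoost is not necessarily the same variant used by Bergstra et al.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical precomputed inputs: scattering coefficients (n_excerpts x 469)
# and the corresponding genre labels.
X = np.load('scattering_features.npy')
y = np.load('genre_labels.npy')

# 1000 boosted decision stumps (depth-1 trees). In older scikit-learn
# versions the keyword is `base_estimator` rather than `estimator`.
clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=1000)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)
print('mean fold accuracy: %.3f' % scores.mean())
```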

The idea behind the features reminds me of the temporal modulation analysis of Panagakis et al. 2009, which itself comes from S. A. Shamma, "Encoding sound timbre in the auditory system", IETE J. Research, vol. 49, no. 2, pp. 145-156, Mar.-Apr. 2003. One difference is that these scattering coefficients are not psychoacoustically derived, yet they appear just as powerful as those that are.
I just came across two interesting issued patents on concatenative synthesis. There is just one word of difference in their titles, and the latter patent was filed on the same day as the former, one year later; yet both were granted. That is strange.

Scatter is on its way back!

I am all moved into London now, and my first task was to get Scatter up and running again. After several dozen hours of hacking on code that had been dropped over three years ago, "Build succeeded" was music to my ears. Here is its first picture as it awoke from its slumber:

[Screenshot: SCATTER01.png] It is buggy, and crashes every now and then. But here is what I hope to have recreated very soon:

Music genre recognition results

I have finally completed my paper "Three revealing experiments in music genre recognition", submitted to ISMIR 2012, which formalizes my results here, here, here, here, here, and here. I make available here the attendant code for reproducing all experiments and figures.

My one-line summary of my work is:
Two of the most accurate systems for automatically recognizing music genre are not recognizing music genre, but something else.
That something else is the subject of further work, not to mention replicating and testing other systems --- starting with this curious work.

Thank you to all commentators and test participants!
