November 2012 Archives

Next up for reproduction

C. Lee, J. Shih, K. Yu, and H. Lin, "Automatic music genre classification based on modulation spectral analysis of spectral and cepstral features," IEEE Trans. Multimedia, vol. 11, pp. 670-682, June 2009.

They report classification accuracies of 90.6% on GTZAN and 86.83% on ISMIR2004.

C.-H. Lee, C.-H. Chou, C.-C. Lien, and J.-C. Fang, "Music genre classification using modulation spectral features and multiple prototype vectors representation," in Int. Cong. Image and Signal Process., 2011.

This one appears to improve on the ISMIR2004 accuracy: 89.99%.
A few days ago, I wrote a bit about some work integrating compressive sampling with sparse representation classification for the problem of music genre recognition. The high accuracy that work reports first tipped me off that something wasn't right. Several other observations followed. Now that I have finished reproducing the system as described by Chang et al., and finished running my experiments, I am convinced all their results are false, and to a degree that I find rather disturbing for published research. Yes, I am claiming their results are fabricated.

I have written up my results, and have submitted a paper to ICME 2013. Here is my reproducible research package so that others can try my code: code.zip.
I am now on a two-week research visit at the Centre for Mathematical Sciences at Lund University, Sweden. I let slip at lunch that ``sparsity'' is now out and ``co-sparsity'' is in --- yet I couldn't exactly state why. (My mind is all up in that music genre.) So, to teach myself, I volunteered to present to the research group on Thursday: ``Co-sparsity: what's it good for?'' And what better way to start than to read M. Elad, P. Milanfar, and R. Rubinstein, "Analysis versus synthesis in signal priors," Inverse Problems, vol. 23, pp. 947-968, 2007. I start from the beginning: what is meant by "analysis" and "synthesis"?

We wish to recover \(\vx \in \mathcal{R}^N\) given observation \(\vy \in \mathcal{R}^M\). Assume the model \(\vy = \MT\vx + \vv\), where \(\MT : \mathcal{R}^N \to \mathcal{R}^M\) is known, and \(\vv\) is noise. Assuming \(\vv\) is distributed \(\mathcal{N}(\zerob,\sigma_\vv^2\MI)\), we can estimate \(\vx\) by maximum a posteriori (MAP): $$ \begin{align} \hat \vx_{MAP}(\vy|\MT,\vv) & = \arg \max_{\vx \in \mathcal{R}^N} P[\vy|\vx]P[\vx] \\ & = \arg \max_{\vx \in \mathcal{R}^N} \exp\left ( -\frac{\|\vy - \MT\vx\|_2^2}{2\sigma_\vv^2}\right )P[\vx] \\ & = \arg \min_{\vx \in \mathcal{R}^N} \frac{\|\vy - \MT\vx\|_2^2}{2\sigma_\vv^2} - \log P[\vx]. \end{align} $$
  • If we assume \(P[\vx]\) uniform, then this becomes the maximum likelihood estimate.
  • Assume for \(\alpha > 0\) and \(p \ge 0\) the prior is defined $$ P[\vx] \propto \exp (-\alpha \|\Omega\vx\|_p^p), \vx \in \mathcal{R}^N $$ where the ``analysis operator'' \(\Omega : \mathcal{R}^N \to \mathcal{R}^L\). With this we see the most probable \(\vx\) lies in the null space of \(\Omega\), if there is one. If \(p \le 1\), the most probable \(\vx\) are the ones most ``sparsified'' by \(\Omega\). If \(p = 2\), the most probable \(\vx\) are those that point along the eigenvector of \(\Omega^T\Omega\) having the smallest eigenvalue. (Is that right?) The ``analysis MAP'' estimate is thus $$ \hat\vx_{MAP,A}(\vy|\MT,\vv) = \arg \min_{\vx \in \mathcal{R}^N} \|\vy - \MT\vx\|_2^2 + 2\sigma_\vv^2\alpha \|\Omega\vx\|_p^p. $$
  • Assume for \(\alpha > 0\) and \(p \ge 0\) the prior is defined $$ P[\vx] \propto \begin{cases} 0 & \lnot \exists \vs \in \mathcal{R}^K(\vx = \MD\vs) \\ \exp (-\alpha \|\vs\|_p^p) & \text{otherwise} \end{cases} $$ where the ``dictionary'' \(\MD \in \mathcal{R}^{N\times K}\). We see that only \(\vx\) in the column space of \(\MD\) have non-zero probability density. If \(p \le 1\), then the \(\vx\) with sparse representations in \(\MD\) are the most probable. If \(p = 2\), then the \(\vx\) having small ``energies'' in \(\MD\) are the most probable. The ``synthesis MAP'' estimate is thus $$ \hat \vx_{MAP,S}(\vy|\MT,\MD,\vv) = \MD \hat \vs $$ where $$ \hat \vs = \arg \min_{\vs \in \mathcal{R}^K} \|\vy - \MT\MD\vs\|_2^2 + 2\sigma_\vv^2\alpha \|\vs\|_p^p. $$
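  • If \(\MD\) is square and invertible (so \(K = L = N\)) and \(\Omega = \MD^{-1}\), then \(\hat \vx_{MAP,S}(\vy|\MT,\MD,\vv) = \hat\vx_{MAP,A}(\vy|\MT,\vv)\), since substituting \(\vs = \Omega\vx\) turns one minimization into the other.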
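As a sanity check on the last bullet, here is a minimal NumPy sketch for the one case with a closed form, \(p = 2\), where both estimates reduce to regularized least squares. Everything concrete in it is my own assumption for illustration: the sizes, the values of \(\sigma_\vv\) and \(\alpha\), the random \(\MT\), and a square, well-conditioned \(\Omega\) so that \(\MD = \Omega^{-1}\) exists. It also checks the \(p = 2\) remark about the analysis prior in the restricted sense that, among unit-norm \(\vx\), the eigenvector of \(\Omega^T\Omega\) with the smallest eigenvalue gives the smallest \(\|\Omega\vx\|_2^2\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and constants (all assumed, not taken from the paper).
N, M = 8, 6
sigma_v = 0.1
alpha = 2.0
lam = 2 * sigma_v**2 * alpha  # weight 2*sigma_v^2*alpha from the MAP objectives

T = rng.standard_normal((M, N))                        # known operator T
Omega = np.eye(N) + 0.1 * rng.standard_normal((N, N))  # square analysis operator
D = np.linalg.inv(Omega)                               # matching dictionary D = Omega^{-1}

y = T @ rng.standard_normal(N) + sigma_v * rng.standard_normal(M)


def analysis_map_l2(y, T, Omega, lam):
    """Minimize ||y - T x||_2^2 + lam * ||Omega x||_2^2 via the normal equations."""
    A = T.T @ T + lam * (Omega.T @ Omega)
    return np.linalg.solve(A, T.T @ y)


def synthesis_map_l2(y, T, D, lam):
    """Minimize ||y - T D s||_2^2 + lam * ||s||_2^2 over s, then return D s."""
    TD = T @ D
    A = TD.T @ TD + lam * np.eye(D.shape[1])
    s_hat = np.linalg.solve(A, TD.T @ y)
    return D @ s_hat


x_a = analysis_map_l2(y, T, Omega, lam)
x_s = synthesis_map_l2(y, T, D, lam)
print("max |x_a - x_s| =", np.max(np.abs(x_a - x_s)))

# The p = 2 analysis prior on its own: compare unit-norm directions.
evals, evecs = np.linalg.eigh(Omega.T @ Omega)  # eigenvalues in ascending order
v_min = evecs[:, 0]                             # unit eigenvector, smallest eigenvalue
u = rng.standard_normal(N)
u /= np.linalg.norm(u)                          # a random unit-norm direction
print(np.linalg.norm(Omega @ v_min) ** 2, "<=", np.linalg.norm(Omega @ u) ** 2)
```

For \(p \le 1\) there is no closed form, so this says nothing about the sparse case; it only confirms that the two formulations coincide when \(\MD\) and \(\Omega\) are inverses of each other.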
Now we are set to explore these two approaches for sparse approximation, and thus unveil the relationships between sparsity and cosparsity.
More than two years ago, I blogged about the paper, K. Chang, J.-S. R. Jang, and C. S. Iliopoulos, "Music genre classification via compressive sampling," in Proc. Int. Soc. Music Information Retrieval, (Amsterdam, the Netherlands), pp. 387-392, Aug. 2010. This paper reports extremely high classification accuracies on GTZAN --- a problematic dataset. At least three papers since then make direct comparisons to its 92.7% classification accuracy:

  1. M. Henaff, K. Jarrett, K. Kavukcuoglu, and Y. LeCun, "Unsupervised learning of sparse features for scalable audio classification," in Proc. ISMIR, (Miami, FL), Oct. 2011.
  2. J. Wülfing and M. Riedmiller, "Unsupervised learning of local features for music classification," in Proc. ISMIR, 2012.
  3. C.-C. M. Yeh and Y.-H. Yang, "Supervised dictionary learning for music genre classification," in Proc. ACM Int. Conf. Multimedia Retrieval, (Hong Kong, China), Jun. 2012.
To my knowledge, no one has attempted to reproduce these results. I emailed the authors on October 4, 2012, asking whether they had discovered any errors in their tests that could have produced such a high classification accuracy. Since I have yet to hear back, it is time to reproduce the results.

Today, I began looking more deeply at the paper. I was struck by six things.
In July 2002, I participated in my second international research conference: ICAD 2002 in Kyoto, Japan. During my visit to Todai-ji in Nara, I tried to pass through the Buddha-nostril-sized hole in a wooden column, but I failed. This, it is said, gives a person bad luck for a decade. So ten years later, with more white in my beard and hair, I return to the very same spot where children have no trouble. This time, I made it! (But not without the help of two Japanese ladies who pulled with all their might to free this previously-unenlightened gaijin.) After passing through, I did feel enlightened --- until I realized that this means I am just a husk of my former fit self. Either way, it goes on my resume: "Nov. 5, 2012: successfully passed through a hole the size of the great Buddha's nostril."

With that, the video of my first presentation at MIRUM (An Analysis of the GTZAN Music Genre Dataset) is now online thanks to Steve Tjoa --- fellow researcher in applying sparse approximation to audio and music signal processing. Thanks Steve!
After a very long series of flights from Copenhagen, I am happy to be in Nara at the 2012 ACM Workshop on Music Information Retrieval with User-Centered and Multimodal Strategies. Today I present two of my papers, hopefully as awake as I can be.

The first is "An Analysis of the GTZAN Music Genre Dataset". My one-line summary is: This dataset, used in more than 20% of work on music genre recognition, has the following problems: replicas, mislabelings, and distortions. The index I have created of its contents is here. (If you have information leading to the identification of those missing, please mail me! :)

The second paper is "Two Systems for Automatic Music Genre Recognition: What Are They Really Recognizing?" My one-line summary is: High accuracy genre recognition systems behave strangely enough to warrant revisiting the idea that any of them can recognize genre. To reproduce my experiments, I make available all my code here.

About this Archive

This page is an archive of entries from November 2012 listed from newest to oldest.

October 2012 is the previous archive.

December 2012 is the next archive.