# May 2012 Archives

## Paper of the Day (Po'D): Semantic gap?? Schemantic Schmap!! Edition

Hello, and welcome to Paper of the Day (Po'D): Semantic gap?? Schemantic Schmap!! Edition. Today's paper provides an interesting argument for what is necessary to push forward the field of "Music Information Retrieval": G. A. Wiggins, "Semantic gap?? Schemantic Schmap!! Methodological Considerations in the Scientific Study of Music," Proc. IEEE Int. Symp. Mulitmedia, pp. 477-482, San Diego, CA, Dec. 2009.

My one line summary of this work is:
The sampled audio signal is only half of half of half of the story.

## Wow, Mbalax

This is now my favorite uncontrolled-chaos-to-Western-ears-but-theoretically-highly-ordered music in the world. At first listen, it reminded me immediately of the composed results of this experiment. By "Kor Leer" the impression had become permanent; and by "Deuram" I knew that I must find and listen to all mbalax music. Aside from Nancarrow, e.g., here, mbalax will strain the best algorithms for automatic rhythm description.

## Music Genre Recognition with Bayesian Classification of Scattering Coefficients

During the weekend, I experimented with the scattering coefficient features for music genre recognition. At first, I was using AdaBoost with 1000 decision stumps, giving me just above 80% accuracy. These features being 469 dimensions makes the training process very slow, so I decided why not test a much quicker approach given by Bayesian classification with Gaussianity assumptions. So, I learned class-dependent means and covariances for each class from the training data, the covariance matrix of all the training data. I then implemented the Mahalanobis distance (MDC) and full quadratic classifiers (FDC), the benefits of which include their simple implementation, and quick training and testing. Furthermore, within a Bayesian framework, we can naturally introduce concepts of confidence, risk, and rejection. But I started simple: equal priors and uniform risk. Below we see the mean classification results from 10 independent trials of 10-fold stratified cross validation.

## Proper respect for the norms of atoms in dictionaries

Until recently, I have always thought about sparse approximation and recovery in terms of algorithms with dictionaries having atoms $$\{\phi_i\}$$ with the same $$\ell_2$$ norm. I find it quite typical in the literature to discuss and treat algorithms with such dictionaries. The assumption of unit normness is often explicitly stated in the very beginning of work in this area, such as in Tropp's "Greed is Good," and Mallat and Zhang's "Matching Pursuit." While it makes the mathematics cleaner, its ubiquity led to my lack of respect for the role of the norm of atoms.

To see what I mean, consider the exact sparse recovery problem: $$\min_\vs \|\vs\|_0 \; \textrm{subject to} \; \vu = \MPhi \vs$$ where the columns of $$\MPhi$$ are made from $$\{\phi_i\}$$. In many papers from the past 15 years, I see that authors (including me :) often say something like the following: the above problem is computationally difficult, and so we can replace the $$\ell_0$$-pseudonorm with its closest convex cousin, the $$\ell_1$$-norm: $$\min_\vs \|\vs\|_1 \; \textrm{subject to} \; \vu = \MPhi \vs$$ which can be solved by numerous methods far less complex than that required to solve the other problem. What is more, we know that there are many cases in which the solution to this solvable problem is identical to that of the harder problem.

However, the justification for posing the first problem as the second implicitly assumes the atoms $$\{\phi_i\}$$ have the same $$\ell_2$$-norm --- otherwise, it is not correct! When the atoms do not have the same norm, the geometry of the problem changes, and bad things happen if we do not take this into account. In short, when the atoms in the dictionary have different norms, the $$\ell_1$$-norm of the solution does not act like its $$\ell_0$$-pseudonorm. As posed above, the minimal $$\ell_1$$-norm solution will likely use atoms that are much longer than all the others, because their weights will be smaller than those for the shorter atoms.

Instead, the correct way to pose the convexification of the exact sparse problem with the $$\ell_1$$-norm is $$\min_\vs \|\MN\vs\|_1 \; \textrm{subject to} \; \vu = \MPhi \vs$$ where $$\MN$$ is a diagonal matrix with $$[\MN]_{ii} := \|\phi_i\|_2$$. Now, the weights of all the atoms are treated equally, no matter their norms in the dictionary. It may seem pedantic, but I have seen this implicit condition lead to some confusion about how to apply particular principles and conditions. I have yet to find such a discussion, however; and so hope to fill this gap in a brief article I am writing.

## Experiments with Multiscale Scattering for Music Genre Classification

I have been experimenting with the approach to feature extraction posed in J. Andén and S. Mallat, "Multiscale scattering for audio classification," Proc. Int. Soc. Music Info. Retrieval, 2011. Specifically, I have substituted these "scattering coefficients" for the features used by Bergstra et al. 2006 into AdaBoost for music genre recognition.

The idea behind the features reminds me of the temporal modulation analysis of Panagakis et al. 2009, which itself comes from S. A. Shamma, "Encoding sound timbre in the auditory system", IETE J. Research, vol. 49, no. 2, pp. 145-156, Mar.-Apr. 2003. One difference is that these scattering coefficients are not psychoacoustically derived, yet they appear just as powerful as those that are.

## More Experiments with MOD

Continuing from my experiments last week, I have decided to test to what degree the problems of MOD are inherited from the problems of OMP as the choice in its approximation step. So as before, I run MOD for 1000 iterations to learn dictionaries of cardinality 128 from 256 measurements generated from 8 length-64 atoms (each sampled from the uniform spherical ensemble). The weights of the linear combination are iid Normal. In these experiments, however, I substitute the oracle support and find the weights by least squares. So, given the $$l$$th measurement is composed of atoms indexed by $$I_l$$, after each dictionary update, I find the new weights by an orthogonal projection onto the span of the atoms indexed by $$I_l$$. In this way, we remove OMP to see the behavior of MOD in the best case scenario: known support. Finding the support is the hardest part anyway.

Below are mean coding errors over all 1000 iterations for 100 independent trials. We see MOD found the correct dictionary less than 10% of the time.

A close-up of where the majority end shows a mean error of about -20 dB. Using instead OMP, and picking 3 times the number of atoms with debiasing, we see a mean error of about -9 dB in the same experimental setup. So, clearly, knowing the support does help.

These results appear better than those reported in B. Mailhé and M. D. Plumbley, "Dictionary Learning with Large Step Gradient Descent for Sparse Representations", Proc 10th International Conference on Latent Variable Analysis and Source Separation (LVA/ICA 2012), Tel-Aviv, Israel, LNCS 7191, pp. 231-238, March 12-15, 2012.

My further experimentation reveals that OMP often finds the support of all measurements when using the real dictionary; and when it is run to twice or three times the expected sparsity, it always finds it (at least in the 100 independent trials I am running). This is not so surprising since this dictionary, sampled from the uniform spherical ensemble, likely has very low coherence. So, this says to me that the failures of MOD in this case are due mostly to the dictionary update, which obviously hinders good coding by OMP, which obviously hinders the dictionary update, which clearly hinders good coding, and so on.

Next up is looking at MOD with shift-invariant dictionaries.

Bob L. Sturm, Associate Professor
Audio Analysis Lab
Aalborg University Copenhagen
A.C. Meyers Vænge 15
DK-2450 Copenahgen SV, Denmark
Email: bst_at_create.aau.dk