# April 2012 Archives

## Experiments with MOD

Having last week successfully understood MOD, and gained proper respect for the importance of initialization and dealing with evaporating atoms, I ran a set of experiments over the weekend testing MOD with a more difficult problem size. Before, I was having MOD run for 1000 iterations to learn dictionaries of cardinality 5 of length-3 atoms from 10 measurements of 1-sparse signals with OMP as the coding step --- which makes it MP since it only selects the one best atom. We often see after some iterations, the coding error of the dictionary produced by MOD reaches numerical precision. In my new experiments, I am having MOD run for 1000 iterations learning dictionaries of cardinality 128 of length-64 atoms (each sampled from the uniform spherical ensemble) from 256 measurements of 8-sparse signals (non-zero elements distributed Normally), again with OMP as the coding step. (I make sure the collection of 256 sparse signals spans $$\mathcal{R}^{128}$$.) But here, I test different stopping conditions for OMP. First, I have OMP select only 8 atoms for coding a measurements. Second, I have OMP select 16 atoms, and from this I select the 8 with the largest magnitude weights, and then orthogonally project onto the span of these. Finally, I have OMP select 24 atoms, from which I select 8 and project in the same. I perform each test for 100 realizations. Below are the results for each realization.

On the left plot, we show the mean coding error as a function of MOD iteration for each set of 100 realizations. It is clear some learning is going on in all three cases. On the right, we plot the distribution (bars) of the final coding error at iteration 1000, and its cumulative distribution (lines). It appears that we aid MOD by increasing the number of atoms selected by OMP, and then choose from among them. There are clearly three significantly different distributions. However, we do not see MOD reaches anything close to numerical precision.

Now, I wonder, what happens when the sparse vectors are distributed Bernoulli-Rademacher? Are there not any atoms produced by MOD that match any in the dictionary? What happens with a highly correlated, e.g., shift-invariant, dictionary? And how does K-SVD perform? And how does the Olhausen and Field approach perform?

All of these questions are motivated by the interesting work in B. Mailhé and M. D. Plumbley, "Dictionary Learning with Large Step Gradient Descent for Sparse Representations", Proc 10th International Conference on Latent Variable Analysis and Source Separation (LVA/ICA 2012), Tel-Aviv, Israel, LNCS 7191, pp. 231-238, March 12-15, 2012.

## A Concatenative Synthesis Patent Challenge?

I just came across these two interesting issued patents on concatenative synthesis:
Just one word difference in their titles, and the latter patent was filed on the same day as the former one for one year; yet both are granted. That is strange.

## Einstein on the Beach

Two tickets please. I just bought mine for the show next week!

Really, if it wasn't for accidentally 13-year-old-me checking out a Philip Glass CD from my small-town library, I probably wouldn't be doing signal processing.

## A missing step from the Method of Directions?

In theory, things can look really good; but then when it comes to implementation, wrinkles arise. Only after some hair pulling did I realize that there may be a step missing in MOD, which I cannot find in the original paper: K. Engan, S. O. Aase, and J. H. Husøy, "Method of optimal directions for frame design", Proc. ICASSP, Vol. 5, pp. 2443 - 2446, 1999.

## Dictionary learning with the Method of Optimal Directions

In K. Engan, S. O. Aase, and J. H. Husøy, "Method of optimal directions for frame design", Proc. ICASSP, Vol. 5, pp. 2443 - 2446, 1999, the authors propose what is essentially a dictionary learning algorithm inspired by the generalized Lloyd algorithm. Given a set of $$L$$ measurements $$\mathcal{U} := \{\vu_i \in \mathcal{R}^m\}$$, expressed as the columns in the matrix $$\MU$$, we want to learn a set of vectors in $$\mathcal{R}^m$$ that minimize some distortion. So the generalized Lloyd algorithm proposes the following steps.

1. Define the number of atoms $$N$$ desired, and the order $$s$$ to use for each approximation of the elements of $$\mathcal{U}$$, and $$i \leftarrow 0$$.
2. Create an initial dictionary of $$N$$ atoms, expressed as a matrix $$\MPhi^{(0)}$$.
3. With $$\MPhi^{(i)}$$ and any algorithm of your choosing, find an $$s$$-order approximation of each element in $$\mathcal{U}$$ such that $$\tilde\vu_i := \MPhi^{(i)}\vx_i$$ where $$\vx_i \in \mathcal{R}^N$$ contains the $$s$$ non-zero approximation weights and zeros in all other elements; define the set of residuals $$\mathcal{E} := \{\vr_i := \vu_i - \tilde\vu_i \}$$
4. Adjust the dictionary $$\MPhi^{(i+1)} \leftarrow f(\MU,\MPhi^{(i)},\MX)$$
5. $$i \leftarrow i+1$$
6. Go to step 3.
While step 4 can be done in many ways, the method of optimal directions (MOD) prescribes a specific approach.

## Theorem 10 of On polar polytopes and the recovery of sparse representations

In an attempt to figure out why orthogonal matching pursuit (OMP) performs better than the guarantees that have been found, I am working my way through this illuminating paper: M. D. Plumbley, "On polar polytopes and the recovery of sparse representations", IEEE Trans. Info. Theory, vol. 53, no. 9, pp. 3188-3195, Sep. 2007. In particular, I was struck by Theorem 10. But first, some background.

Recall Tropp's exact recovery condition (ERC) here or here, or alternatively here. Suppose we have some matrix $$\MD$$ with unit norm columns (for notational convenience), and measurements $$\vx$$, and we know $$\vx$$ is a linear combination of $$s > 0$$ columns of $$\MD$$, specifically the ones indexed by $$\Phi$$. In other words, $$\vx$$ has an $$s$$-term representation. Orthogonal matching pursuit (or basis pursuit) is guaranteed to find these $$s$$ columns composing $$\vx$$ if $$\max_{i \in \Psi} ||\MD_\Phi^\dagger \vd_i ||_1 < 1$$ where $$\MD_\Phi$$ is a matrix of the $$s$$ columns of $$\MD$$ indexed by $$\Phi$$, and $$\Psi$$ indexes the columns of $$\MD$$ that are not participating in $$\vx$$, ($$\vd_i$$ is the $$i$$th column of $$\MD$$). Note that this means OMP (or BP) can recover any $$s$$-term representation from $$\vx$$ only when the above is true for every $$s$$-size subset of columns from $$\MD$$. When this is true, OMP (or BP) is a "correct algorithm" for $$s$$-term representations in $$\MD$$.

Now Theorem 10 in this paper states that the fact a dictionary satisfies ERC for all $$s$$-term representations is not sufficient to guarantee it satisfies ERC for all representations involving $$k < s$$ terms. The proof is constructive. Define some $$2\times 2$$ dictionary. Without looking, and without computing, we can say exactly which columns compose the measurements when $$s = 2$$. Thus, we have exact recovery; and indeed the ERC above says so since $$\Psi$$ is empty. Now define the dictionary $$\MD = [\vd_1 | \vd_2] := \left [ \begin{array}{cc} 1 & \sqrt{2} \\ 0 & \sqrt{2} \end{array} \right ]$$ and let $$\vs := [1, 0]^T$$. Then the ERC becomes $$| \langle [\sqrt{2}, \; \sqrt{2}]^T, [1, \; 0]^T \rangle | = \sqrt{2} > 1$$ and thus the ERC fails. However, with OMP (or BP), or good eyeballing, we can always recover a $$\vs$$ if $$\vx$$ is composed of one column of this $$\MD$$. The take-home point of this example is: just because the ERC is satisfied for all measurements involving $$s$$ columns of $$\MD$$ doesn't mean it is also satisfied for all measurements involving fewer columns.

That is a strange result, and one deserving to be inspected.

## Computer Music Judges

Just today on BBC News, a nice story about some of the first computerized music judges. More specific details of this interesting event are here.

Here are the winners.

## Scatter is on its way back!

I am all moved into London now, and my first task was to get Scatter up and running again. After several dozen hours of hacking code that was dropped over 3 years ago, "Build succeeded" was music to my ears. Here is its first picture as it awoke from its slumber:

It is buggy, and crashing every now and then. But here is what I hope to have recreated very soon:

## Music genre recognition results

I have finally completed my paper "Three revealing experiments in music genre recognition", submitted to ISMIR 2012, which formalizes my results here, here, here, here, here, and here. I make available here the attendant code for reproducing all experiments and figures.

My one line summary of my work is:
Two of the most accurate systems for automatically recognizing music genre are not recognizing music genre, but something else.
That something else is the subject of further work, not to mention replicating and testing other systems --- starting with this curious work.

Thank you to all commentators and test participants!

## Calling listening test participants

I am running some listening tests now, and thought that, maybe, you know, you might like to participate, or something. Your task is to assign the best genre to each of 30 short music excerpts (12 seconds). The ten genres are: Blues, Classical, Country, Disco, Hip hop, Jazz, Metal, Pop, Reggae, and Rock. Your first 10 answers must be correct, otherwise you will not continue to the remaining 20 excerpts. You can listen to each excerpt as many times as you like, but you cannot go back and change your answers. You also must choose one of the 10 genres for each example. Some genres may appear more than others.

If you are a human:
2. unzip it, but do not look at the contents!
3. open MATLAB (I am looking for a way to compile it as a stand-alone program)
4. in MATLAB change to the "humantest" directory
5. type "testgui" in the MATLAB command window
6. put on headphones and take the test
7. send me the text file placed in the directory, e.g., test_08793.txt, and tell me a little about your experience, e.g., was it easy? Did you enjoy it? Were some excerpts easier to categorize than others? Or maybe the software spit out errors. In that case, send me the errors.
8. Finally, drag the directory to the trash, and do not repeat test.
Thanks!

## These algorithms embody next to nothing related to music genre, part 4

Considering that I have a trained music genre classifier --- the same one as I used yesterday --- it would be great to ask of it, "For what are you listening to make a decision?" In my experiment yesterday, I showed that we may simply remaster a single piece of music in one genre to be classified as one of nine others, thus showing the system's decision making strategy to be very fragile, and ultimately illogical. In today's experiment, I am asking the classifier for help in composing musical excerpts that it finds is "definitely" one genre and not the others. To do this, I have built a system that randomly selects and mixes together 3 loops from the 1,198 loops included with Garage Band. These loops include all sorts of drum patterns, piano and guitar comps, brass hits, synthesizer pads, sound effects, etc. After composing an excerpt, the classifier then listens to it and provides a confidence of it belonging to each of the ten genres. If the difference in confidences between the two highest-rated genres, e.g., Metal and Rock, is larger than what it has seen before, I keep the piece, record the difference in the Metal genre, and continue to search for something even more, e.g., Metal, than my previous iterations. In this way, I hope to build the ideal excerpts of each genre, according to the trained classifier, and thereby uncover its idealized models.

## These algorithms embody next to nothing related to music genre, part 3

Bob Wills & His Texas Playboys is one of the great American Western Swing bands of last century, and have a classic sound that helped define the genre, with two fiddles, steel and electric guitars, and completed by hollering courtesy of Wills. Here is a short excerpt of their dance hit "Big Balls in Cowtown":

## These algorithms do not embody anything related to music genre, part 2

Previously, I discussed the dismal performance of two state-of-the-art music genre recognition system when it comes to being trained on hi-fi musical audio, but classifying lo-fi musical audio --- a transformation that unarguably preserves the genre of the underlying music, and to which any system actually using features related to genre should be robust. Then, I discussed specific instances where these systems, even when trained and tested using hi-fi musical audio, make classifications that are "unforgivable", in the sense that ABBA's "Dancing Queen" is Metal, "A whole new world" from Aladdin and "The Lion Sleeps Tonight" are Blues, and "Disco Duck" is Rock. Regardless of their 80% mean accuracies, such behavior tells me that these algorithms act like a child memorizing irrelevant things ("Metal is loud; Classical music is soft"), instead of the stylistic indicators humans actually use to classify and discuss music ("Hip hop often makes use of samples and emphasizes speech more than singing; Blues typically has strophic form built upon 12 bars in common time; Jazz typically emphasizes variations upon themes passed around as solos in small ensembles; Disco is often danceable with a strong regular beat, and emphasizes bass lines, hi-hats, and sexual subjects"). The mean and variances of MFCCs, or modulations in time-frequency power spectra, do not embody characteristics relevant to classifying music genre; and thus these algorithms do not embody anything related to music genre, part 2.

Bob L. Sturm, Associate Professor
Audio Analysis Lab
Aalborg University Copenhagen
A.C. Meyers Vænge 15
DK-2450 Copenahgen SV, Denmark
Email: bst_at_create.aau.dk