# June 2011 Archives

## SPARS 2011, day 3

Big things today, with plenaries given by David Donoho and Martin Vetterli. Donoho answered all the questions I have regarding the variability of recovery algorithms on distributions underlying sparse vectors. I just need a few years to understand them. I also need to look more closely at approximate message passing. And Vetterli gave a great talk, discussing the tendency in algorithm development to jump to a solution before solving the outstanding problem, e.g., sampling the real continuous world on a discretized grid.

Now I need to eat dinner, and run some experiments.

## SPARS 2011, day 2

Though the SPARS2011 twitter feed appears miserable, this workshop is jam packed by excellent presentations and discussions. I think too many people are having too much good discussion to have too much time to twitter.

Today at SPARS 2011: Heavy hitters Francis Bach and Rémi Gribonval delivered the two plenary talks. This morning Bach talked on a new subject for me: submodular functions. In particular, he is exploring these ideas for creating sparsity-inducing norms. A motivation for this work is that while the l1 norm promotes sparsity within groups, it does not promote sparsity among groups... or vice versa (it is new to me). But I liked how he described his formalization as "the norm-design business." Someone asked him a question about analyzing greedy methods vs. convex optimization. Bach's answer made me realize that we can more completely understand the behavior of convex optimization methods than greedy methods because convex methods are decoupled from the dictionary. For greedy methods, the dictionary is involved from the get go.

This afternoon, Gribonval talked on "cosparsity", or when a signal is sparsely represented by the dual of a frame instead of the frame itself. His entire talk revolved around looking more closely at the assumption that atomic decomposition and a transform are somehow similar. Or that when we say a signal is sparse, we mean it is sparse in some dictionary; but we can also mean its projection on a frame is sparse. This is then "cosparsity", which brings with it l1-analysis. To be a little more formal, we can considering solving the "synthesis" problem $$\min_\vz || \vz ||_1 \; \textrm{subject to} \; \vy = \MA \MD \vz$$ where we assume $$\vz$$ is sparse; or the "analysis" problem $$\min_\vx || \MG \vx ||_1 \; \textrm{subject to} \; \vy = \MA \vx$$ where we assume the analysis (or transformation) of $$\vx$$ by $$\MG$$, i.e., $$\MG\vx$$, is sparse. Gribonval et al. have done an excellent job interpreting what is really going on with l1-analysis. Instead of wanting to minimize the number of non-zeros in the signal domain, l1-analysis wants to maximize the number of zero in the transform domain. Later on, his student Sangnam Nam presented extraordinary results of this work with their Greedy Analysis Pursuit, which attempts to null non-zeros in the solution. This reminded me a bit about the complementary matching pursuit, but this is quantitatively different. Gribonval joked that "sparsity" may now be "over." The new hot topic is "cosparsity."

There were many other exciting talks too, showing extraordinary results; but now I must go and work on some interesting ideas that may or may not require my computer to run through the night.

## SPARS 2011

And so it begins! A whole week of nothing but sparsity in various forms and guises. My summer has officially started!

The proceedings collect all the accepted one-page submissions, which I find provide very tantalizing details. And for a cool down, I am reading Michael Elad's excellent book Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. It does for sparse signal processing what Hamming's book does for digital filters: completely accessible, drawing together numerous disciplines, and giving a good big picture perspective.

Today began with a bang, featuring Yi Ma. Just as Andrew Ng's Google Talk, Ma amazed me (and I am sure many others) with his examples of the incredible power of Robust PCA for everything from face and text alignment, to extracting the geometry of buildings from 2D pictures without any use of edge or corner detection. All one needs are the pixels, and the rest is done by the assumption that the image can be decomposed into a low rank texture matrix, and a sparse matrix with non-textural items, like a person moving in front of a background. One of my favorite examples was where he took 30 images of Bill Gate's face. Robust PCA aligned them all, corrected for transformations like shearing, and produced a mean image of Bill Gates. Now, I wonder, can we do the same for a piece of classical music, where we create a mean version of a particular Bach Partita from a dozen Glenn Gould recordings?

There were many other fantastic talks and conversations to be had. Because my internet access at this expensive hotel is free only for 30 minutes every 24 hours at a severely limited bandwidth, I must limit my description to that. Tomorrow will be another exciting day in "Sparseland", as Elad calls it.

## CMP in MPTK: Third Results

In a previous entry, I compared our results with those produced by my own implementation of CMP in MATLAB --- which did not suffer from the bug because it computes the optimal amplitude and phases in a slow way with matrix inverses. Now, with the new corrected code, I have produced the following results. Just for comparison, here are the residual energy decays of my previous experiments, detailed in my paper on CMP with time-frequency dictionaries.

Now, with the corrections, I observe the decays. The "MPold" decay is that produced by the uncorrected MPTK. "MP" shows that of the new code. Only in Attack and Sine do we see much difference; and at times in Sine the previous version of MPTK beats the corrected version. (Such is the behavior of greedy algorithms. I will write a Po'D about this soon.) Anyhow, the decays of CMP-$$\ell$$ (where the number denotes the largest number of possible cycles of refinement, but I suspend refinement cycles when energyAfter/energyBefore > 0.999), comports with the decays I see in my MATLAB implementation (see above). So, now I am comfortable moving on.

Below we see the decays and cycle refinements for three different CMPs for these four signals. (Note the change in the y axes.) Bimodal appears to benefit the most in the short term from the refinement cycles, after which improvement is sporadic. The modeling of Sine has a flurry of improvements. It is interesting to note that as $$\ell$$ increases, we do not necessarily see better models with respect to the residual energy. For instance, for Attack, the residual energy for CMP-1 beats the others.

And briefly back to the glockenspiel signal, below we see the decays and improvements using a multiscale Gabor dictionary (up to atoms with scale 512 samples).

## Grab your things, I've come to take you home!

I have solved the mystery that has pushed me for the past week into excruciatingly fun debugging sessions. Yes, I know I mentioned on June 9 that CMP was extremely easy to implement in MPTK. Then came second thoughts as to the behavior of the implementation. And there followed more observations, and rambling observations, and then the videos appeared. And then the music video appeared. Well, now here's another: ex1_MP_atoms_solved.mov

## Don't Give Up

This is my life the past few days. And yet again, I think I have it cornered. The same thing happens for atoms at the Nyquist frequency. Now, how to fix it?

## People, normalize your signals before you write wavs

MPTK works!

ex1_MP_atoms.mov

In my experiments before, the MP reconstruction algorithm was hard clipping all values with magnitude greater than 1. So that is from where the spikes come. Oh, for F's sakes.

## Debugging repository

We have decided to get to the bottom of the unusual behavior of MPTK, since the next steps of our work on CMP depend on it. This entails comparing the results from my MATLAB implementation with those of MPTK (and CMPTK) on the same dictionary. I have decomposed one signal of dimension 1024 samples, with a dictionary of modulated Gaussian windows of scale 64 samples, with a hop size of 8 samples. The MPTK dictionary is defined as follows (couldn't use pre html tags for some reason, so I attach it as a png):

## Paper of the Day (Po'D): Encoding vs. Training with Sparse Coding Edition

Hello, and welcome to Paper of the Day (Po'D): Encoding vs. Training with Sparse Coding Edition. Today's paper is A. Coates and A. Y. Ng, "The Importance of Encoding Versus Training with Sparse Coding and Vector Quantization", In Proc. Int. Conf. on Machine Learning, Bellevue, Washington, USA, Jun. 2011. In this work, which is a follow-up to their AISTATS'11 paper, the authors explain why the dictionary used in encoding the data is not as important in providing a good representation, as the encoding algorithm itself. The authors report competitive results even when the dictionary is populated with random samples from the data.

## Some Experiments with Glockenspeil

Today I have been experimenting with CMPTK and a real audio signal. With this larger signal, the energy errors by which I have been plagued this last week seem to be much more rare.

Below we see the residual energy decay of this example with MP and CMPTK using a dictionary of Gabor atoms (Gaussian window) of only two scales: 128/32 and 4906/64/8192 (scale/hop/FFTsize if different from scale). I run 200 iterations. CMP-$$l$$ is implemented such that all representations at each order undergo at least one cycle. When $$l = 5$$, more refinement cycles can be performed until the ratio of residual energies before and after a cycle is less than 1.002, or less than about 0.009 dB. I also plot in this graph, the "cycle energy decrease," which is the ratio of the residual energy before and after the entire refinement at the iteration. We find a few large spikes of improvement. At the end of 200 iterations, the models produced by CMP have an error 2.2 dB better than that produced by MP.

## A bug or a feature?

I have been thinking about why CMPTK using a MDCT dictionary with Kaiser-Bessel windows produced no increases in energy, unlike all the other dictionaries I have been trying. Maybe it is a confluence of window shape, and real signals being approximated by complex atoms, as well as various approximations being made inside MPTK. So I tried something. (Warning: the following is a rather rambling record of observations, probably what happens inside Dr. House's head.)

## Debugging all day

Today was incredibly concentrated on the strange behaviors of CMP in MPTK. We found some errors are due to quantization, and others to energy approximation (to avoid lots of high dimension inner products). But none of these had that much of an effect. We tried real MDCT atoms instead of complex Gabor atoms; and that seemed to make things worse. We tried to update all the inner products at each step, but that didn't help. We pulled out the big guns and ran step by step through the code with Eclipse and then XCode, trying to access shards of memory containing the frequencies of atoms impenetrably type cast. No matter what we did, it seemed that CMP was making poor atom choices during some of its refinement cycles. This is theoretically unacceptable. It should never ever ever increase the residual energy by an atom replacement.

## CMP in MPTK: Second Results and Second Thoughts

Below are our test signals.

## SPARS 2011

The program for SPARS 2011 has just been announced, and it really looks like a dense one: all sparsity all day for four days straight. The programmed whiskey tasting will serve as an oasis.

## CMP in MPTK: First Results

From June 7 to 17, I am visiting Rémi Gribonval and other colleagues at L'IRISA (Institut de recherche en informatique et systèmes aléatoires) in Rennes, France, where I am on a scientific mission supported by the French Ambassador to Denmark. My goals this first visit are twofold:
1. implement cyclic MP (CMP) within the MPTK framework;
2. empirically study its application to real audio and music signals.
Before this visit, CMP only existed in the MATLAB backwaters of few esoterics; and there, it is only usable for small dictionaries, and only applied to short signals. We have now finished implementing CMP in MPTK, which required only simple modifications --- no doubt due to the well-thought architecture of the MPTK library. And now begins the more difficult process of determining the cost-benefit trade-offs for CMP and other variants. The plot below shows the first result.

Stay tuned for many more results!

## Consider a dictionary of two atoms

Say we have some signal $$\vx \in \mathcal{C}^N$$, and a dictionary of two atoms $$\mathcal{D} = \{\vd_1, \vd_2 \in \mathcal{C}^N : || \vd_i ||_2 = 1 \}$$. And assume that the signal is not in the span of the dictionary, so there is some residual with non-zero norm. We can minimize this norm by an orthogonal projection: $$\vx_\mathcal{D} := a_1^o \vd_1 + a_2^o \vd_2$$, where \begin{align} a_1^o & = \frac{\langle \vx, \vd_1\rangle - \langle \vx, \vd_2\rangle \langle \vd_1, \vd_2 \rangle }{1 - |\langle \vd_1, \vd_2 \rangle |^2} \\ a_2^o & = \frac{\langle \vx, \vd_2\rangle - \langle \vx, \vd_1\rangle \langle \vd_2, \vd_1 \rangle }{1 - |\langle \vd_1, \vd_2 \rangle |^2} \end{align} Now, consider that we run matching pursuit on this signal with this dictionary. Assume that $$\vd_1$$ is the first atom selected; and then necessarily $$\vd_2$$. The resulting approximation of this signal gives the weights \begin{align} a_1 & = \langle \vx, \vd_1\rangle \\ a_2 & = \langle \vx - a_1\vd_1, \vd_2\rangle \end{align} and the residual $$\vr = \vx - a_1 \vd_1 - a_2\vd_2$$. What if we add to the residual $$a_1 \vd_1$$ and recompute the weight of the first atom? And then after that, what if we add to the new residual $$a_2 \vd_2$$, and recompute the weight of the second atom? And what if we do this an infinite number of times? Do the resulting weights, $$\{a_1^{(\infty)}, a_2^{(\infty)}\}$$ converge to those of the orthogonal projection above? And if so, how fast do they converge?

## Tropp's More Useful Exact Recovery Condition

I have thus far reviewed two theorems from Tropp's pièce de résistance "Greed is Good" (GiG) one showing a sufficient condition for exact recovery from OMP, and the other showing the same sufficient condition for exact recovery from BP. We cannot really use these theorems, however, to determine whether or not a given signal $$\vx$$ is recoverable using OMP or BP in some dictionary $$\mathcal{D} = \{ \vd_i : i \in \Omega\}$$. To do so requires we already know the best set of functions in the dictionary for the signal. However, Tropp finds the maximum signal sparsity $$s$$ to which we can push a dictionary to ensure the "exact recovery condition" (ERC) is met --- that is, OMP will recover the correct $$s$$ atoms in its first $$s$$ iterations. Theorem 3.5 of GiG states,

OMP and BP will recover any signal $$s$$ sparse in the dictionary $$\mathcal{D} = \{ \vd_i : i \in \Omega \}$$ if $$\mu_\Sigma(s-1) + \mu_\Sigma(s) < 1$$ where $$\mu_\Sigma(s) := \max_{| \Gamma | = s} \max_{\lambda \in \Omega \backslash \Gamma} \sum_{\gamma \in \Gamma} | \langle \vd_\lambda, \vd_\gamma \rangle |$$ is the cumulative coherence of the dictionary.

## Recovery of Compressively Sensed Sparse Signals in Noise, pt. 2

Continuing from my experiments with recovery of compressively sensed vectors from measurements with noise distributed normally, I now have results for SNR's at 40, 50, 60, and Inf dB. In the image directly below, we see the phase transitions for these algorithms for sparse vectors distributed Bernoulli, i.e., constant amplitude random sign. Not much changes down to 50 dB SNR (BP takes a little dive at 50 dB); but somewhere between there and 40 dB SNR, all the algorithms begin to fail. IRl1 seems much more resilient than the others. It is also strange that adding more measurements makes makes TST, IHT, and SL0 (I am using the robust version) freak out. The greedy methods and BP appear to care less about the extra measurements. And we see IST finally grow above the other algorithms for the first time. No doubt, there are some tuning issues with SL0, as well as those created by Maleki and Donoho. Also, the selection from the ensemble approach by the criterion stated here, seems to really like the solutions by SL0 because they have nearly zero error. I find that if I change 10 to 1, things perform much better. (So it is time to stop this ad hoc business and find the best selection criterion given expected SNR. :)

Below we see the phase transitions for the same experiments, but this time with sparse vectors distributed normally. Very little changes between 50 to Inf dB SNR; but again between there and 40 dB something dramatic happens. The greedy methods and BP care little about the extra measurements, but the performance of all the others appear hindered. The ensemble selection approach is still magnetized to solutions found by SL0.

## Paper of the Day (Po'D): Neural Networks and Principal Component Analysis Edition

Hello, and welcome to Paper of the Day (Po'D): Neural Networks and Principal Component Analysis Edition. Today's paper is P. Baldi and K. Hornik, "Neural Networks and Principal Component Analysis: Learning from Examples Without Local Minima," Neural Networks, vol. 2, pp. 53-58, 1989. In the spirit of unsupervised feature learning with single-layer networks, I decided to study one of the earliest of such networks.

## Improved solution selection from an ensemble of compressed sensing recovery algorithms

The other day, I discussed selecting the best solution from a set of solutions produced by several sparse recovery algorithms. I think this idea is a quite natural one since it is unreasonable to assume that one recovery algorithm can be better than all the rest in all conditions. We have seen several times that different algorithms behave quite differently when, e.g., we change the distribution underlying the sensed sparse signals. When we add noise to the measurements, then the performance of these algorithms change as well.

## Tropp's Exact Recovery Condition for Basis Pursuit

Yesterday, I looked at Tropp's exact recovery condition (ERC) for orthogonal matching pursuit (OMP). Today, I continue with his extrapolation to basis pursuit (BP), otherwise known as the principle of relaxing the definition of strict sparsity by the $$\ell_1$$-norm. Assume, as yesterday, we have a $$m$$-dimensional signal $$\vx$$ with a unique and $$s$$-sparse representation in a dictionary $$\mathcal{D} = \{ \vd_i : i \in \Omega\}$$ of $$N$$ atoms, notated in matrix form as $$\MD$$. Also, all $$s$$ atoms participating in this representation are the columns of the matrix $$\mathbf{\Phi}_\vx$$; and I notate the indices of the atoms not participating as $$\Psi$$. Assume for this signal $$s$$ is as small as possible.

Theorem 3.3 in J. Tropp, "Greed is good: Algorithmic results for sparse approximation," IEEE Trans. Info. Theory, vol. 50, pp. 2231-2242, Oct. 2004, states
The ERC $$\max_{i \in \Psi} ||\mathbf{\Phi}_\vx^\dagger \vd_i ||_1 < 1$$ is a sufficient condition for recovering the sparsest representation of $$\vx$$ from a dictionary $$\mathcal{D}$$ by solving the problem $$\min_{\vs \in \mathcal{C}^N} || \vs ||_1 \; \textrm{subject to} \; \vx = \MD \vs.$$
It is amazing the same condition that applies to OMP also applies to BP --- but I bet one is sharper than the other. Now, let's look at the proof.

It's about what you have displayed on your bookshelf. A recent and informal survey has shown that music listeners actually only listen to a small percentage of their music collection. VLDCMCaR (pronounced vldcmcar) can help with that. I don't have a hard time believing this to be the case. Back in the US, I have waiting for me a wide and varied collection of about 1,000 music CDs. Most of these are contemporary music, with a lot of experimental and electronic. Of all these, I would say there is less than 50 I would listen to regularly, ten of which are the proverbial take-to-a-desert-island. All the others I think of references; but I feel still compelled to burn them all onto my digital music listening device (which I swear gets heavier the more data I have loaded on it). I know it is an irrational compulsion; but I have always been inclined to collect (to the annoyance of my wife. BTW, yesterday I found on the street a deed to a car signed 1922 in Copenhagen).

For most of the CDs I own, I can recall a specific experience that makes the piece on it special. For instance, when listening the first time to Stockhausen's Mantra, I wasn't digging it. Twenty minutes later I was blown away by his use of repetition in making clear a delicate phrasing that I so wanted to replicate in my own work. I was a Stockhausen convert. Then I saw the CD player had been repeating the first 1-minute track. Seventeen years later, that first minute sticks with me today, and is one of my favorites. Along with the first 8 seconds of Brahms 1st symphony. And then there is that 10 seconds in Bruckner's 9th; and those few seconds in Mahler's 3rd that sounds exactly like Star Trek. Maybe I should create a compilation, like The Greatest Hits of the Second Viennese School.

## Tropp's Exact Recovery Condition for Orthogonal Matching Pursuit

Today, I am looking at Theorem 3.1 in J. Tropp, "Greed is good: Algorithmic results for sparse approximation," IEEE Trans. Info. Theory, vol. 50, pp. 2231-2242, Oct. 2004. (I see he did this when he was a student member of IEEE!) Consider we have some signal $$\vx \in \mathcal{C}^N$$, and a dictionary of functions $$\mathcal{D} = \{ \vd_i \in \mathcal{C}^N : i \in \Omega \}$$. Assume that $$\mathcal{D}$$ contains a subset of functions indexed by $$\Lambda_\vx$$ such that $$\vx$$ has an exact $$m$$-sparse solution $$\vx = \sum_{\lambda \in \Lambda_\vx} s_\lambda \vd_\lambda = \mathbf{\Phi}_\vx \vs$$ where all $$m$$ coefficients of $$\vs$$ are non-zero, and the set of functions $$\{\vd_\lambda : \lambda \in \Lambda_\vx\}$$ are linearly independent, i.e., $$\mathbf{\Phi}_\vx$$ has full rank. Furthermore, assume $$m$$ is as small as it can go for this pair of signal and dictionary. From this, define the indices of the unused atoms as $$\Psi := \Omega \backslash \Lambda_\vx$$, and make those atoms columns in the matrix $$\mathbf{\Phi}_\Psi$$.

## Recovery of Compressively Sensed Sparse Signals in Noise

The other day I presented some preliminary results. Now I have rerun the experiments using the same algorithms, and a correction to iterative re-weighted $$\ell_1$$ minimization (IRl1). I am still using up to four reweightings, and the initialization $$\epsilon = 0.1$$ and $$\MW = \MI$$; but for the noisy case it is now solving the following convex optimization problem: $$\min || \MW \vx ||_1 \; \textrm{subject to} \; || \vy - \MA\vx ||_2 \le ||\MA\vx||_2 10^{-\textrm{SNR}/20}.$$ (The noise signal has a standard deviation $$||\MA\vx||_2 10^{-\textrm{SNR}/20}$$. This made me a bit uncomfortable thinking it will result in no solution, but we will always have a solution as long as $$\MA$$ is full-rank.)

Bob L. Sturm, Associate Professor
Audio Analysis Lab
Aalborg University Copenhagen
A.C. Meyers Vænge 15
DK-2450 Copenahgen SV, Denmark
Email: bst_at_create.aau.dk