Hello, and welcome to the Paper of the Olden Days (Po'OD): Hierarchical and Parallel Matching Pursuit Edition. H. Feichtinger, A. Türk, and T. Strohmer, "Hierarchical parallel matching pursuit," in Proc. SPIE Int. Soc. Opt. Eng., vol. 2302, pp. 222-232, 1994. <footnote>This paper begins in an interesting way: "The by now well-known matching pursuit method of S. Mallat as well as the recently proposed orthogonal matching pursuit ..." I think it is always dangerous to begin a research exposition by stating the subject of study is popular, or well-known ... and thus an attempt to indirectly imply one's relevance. And with the context of the Po'OD, I am assuming that "well-known" does not mean matching pursuit was front page news in the NY Times; or that it hosted Saturday Night Live. But this sentence gives me historical pause to think, "At that time I was just a stupid kid 18 years old and incapable of matching my socks, let alone thinking about decomposing signals into atoms, which aren't really atoms." </footnote>

# May 2010 Archives

*Added May 29: The below is a satirical record of my research day, which is meant to be humorous. In no way does it reflect how my research day actually proceeds (except for the visits to the Wittenborg 7100). I originally wrote this a few years ago when I was writing my dissertation, but I updated it here, and included some recent results of a simulation I am running for my submissions to Asilomar 2010.*

13:15:01 -- Ok, I am going to live-blog me writing some research papers. First, for Asilomar 2010, due June 1!!!

13:15:02 -- quick check of CNN headlines, oh! And some YouFace photos from Halloween. Crazy people I don't even know.


Here is an interesting presentation of some "compressed sensing" applied to extremely downsampled (either uniformly or randomly) uniformly oversampled audio signals. (**WARNING:** This website design is dreadful with its white text on a black background --- which I can only look at, let alone read, for a few moments at a time until I pass that critical time where the persistence of vision causes my peripheral world to collapse in on itself like half my head is straddling the Schwarzschild radius of a star once held in eloquent stasis by an unbiased hydrostatic equilibrium, thus ruining my day, relativity speaking. I recommend copying and pasting the text into a LaTeX template, and compiling into two-column format with 12 pt font.)

Hello, and welcome to the Paper of the Day (Po'D): Environmental Sound Recognition with Time-frequency Features Edition. Today's papers address the general problem of detecting the activities within some environment using acoustic data:

- S. Chu, S. Narayanan, and C.-C. J. Kuo, "Environmental sound recognition with time-frequency audio features," IEEE Trans. Audio, Speech, Lang. Process., vol. 17, pp. 1142-1158, Aug. 2009.
- S. Chu, S. Narayanan, and C.-C. J. Kuo, "Environmental sound recognition using MP-based features," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Process., (Las Vegas, NV), pp. 1-4, Mar. 2008.

Hello, and welcome to Paper of the Day (Po'D): Speech Recognition by Sparse Approximation Edition. Today's paper describes using sparse approximation to aid in the automatic recognition of spoken connected digits: J. Gemmeke, L. ten Bosch, L. Boves, and B. Cranen, "Using sparse representations for exemplar based continuous digit recognition," in Proc. EUSIPCO, pp. 1755-1759, Glasgow, Scotland, Aug. 2009 (pdf). I see that this method is related to that in: P. Leveau, E. Vincent, G. Richard, and L. Daudet, "Instrument-specific harmonic atoms for mid-level music representation," IEEE Trans. Audio, Speech, Lang. Process., vol. 16, pp. 116-128, Jan. 2008. Instead of recognizing words, those authors are recognizing musical instruments from essentially atomic "exemplars" learned from sparse approximations of a training dataset. (And they even use the Viterbi algorithm!) Also closely related is non-negative matrix factorization of musical spectra, for example: C. Fevotte, N. Bertin, and J.-L. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence. With application to music analysis," Neural Computation, vol. 21, no. 3, Mar. 2009.

Hello and welcome to Paper of the Day (Po'D): Compressed Sensing and Coherent Dictionaries Edition. Today's paper was touted by Igor Carron and his excellent, tell-mom-to-wait-while-I-just-read-this-new-entry blog, Nuit Blanche: E. J. Candès, Y. C. Eldar, and D. Needell, "Compressed sensing with coherent and redundant dictionaries," arXiv:1005.2613v1, May 2010. This paper is so new, it is in the process of being distributed to peers for review as we speak.


Consider that we model some real signal \(\vx \in \mathcal{R}^K\) as a linear combination of atoms from some overcomplete dictionary \(\MD \in \mathcal{R}^{K\times N}\). Let us assume our dictionary is "well-selected," which means that we assume there exists for our signal a non-empty set of solutions $$\mathcal{S}^* := \{\vs^* \in \mathcal{R}^N: || \vx - \MD\vs^* ||_2 \le \epsilon, \vs^* \;\text{is compressible}\}$$ where \(\epsilon > 0\), and the compressibility part means, e.g., that \(||\vs^*||_0 \ll K\), or less strictly, that we can order the magnitudes of each \(\vs^*\) such that their decay is bounded from above by a decaying exponential function. This generally means that for the dictionary, \(N \gg K\), which implies that the dictionary coherence is not zero (the maximum off-diagonal entry of \(\MD^T\MD\) is greater than zero), and that there is a linear dependence among its columns. Now, consider that we do not observe \(\vx\) directly, but instead measure it using a fat random (zero mean, unit row variance) matrix \(\mathbf{\Phi} \in \mathcal{R}^{M\times K}\) where \(M < K\). Thus, our model becomes $$\vy = \mathbf{\Phi}\vx + \vn = \mathbf{\Phi}\MD\vs + \vn$$ where \(\vn \in \mathcal{R}^M\) is some zero-mean noise with power \(\epsilon^2\). Notice that our observation \(\vy\) is in a smaller dimensional space than the column space of the dictionary. In this sense \(\vx\) is "compressively sensed."
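To make the dimensions concrete, here is a toy numerical sketch of this measurement model in Python (the sizes, the random dictionary, and the 2-sparse \(\vs\) are my own invented example, not taken from the paper):

```python
import random

random.seed(0)
K, N, M = 8, 16, 4  # toy dimensions: signal, dictionary, measurements

# Overcomplete dictionary D (K x N) with random unit-norm columns
D = [[random.gauss(0, 1) for _ in range(N)] for _ in range(K)]
for j in range(N):
    nrm = sum(D[i][j] ** 2 for i in range(K)) ** 0.5
    for i in range(K):
        D[i][j] /= nrm

# A 2-sparse coefficient vector s, so x = D s is compressible in D
s = [0.0] * N
s[3], s[11] = 1.5, -0.7
x = [sum(D[i][j] * s[j] for j in range(N)) for i in range(K)]

# Fat random (zero-mean) sensing matrix Phi (M x K), M < K
Phi = [[random.gauss(0, 1) for _ in range(K)] for _ in range(M)]

# Noisy measurements y = Phi x + n, with y living in R^M
n = [random.gauss(0, 0.01) for _ in range(M)]
y = [sum(Phi[m][i] * x[i] for i in range(K)) + n[m] for m in range(M)]

print(len(x), len(y))  # 8 4: x in R^8 is observed only through y in R^4
```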

Now some important questions are the following:

- When and how well can we recover \(\vx\) from \(\vy\), using the knowledge that \(\mathcal{S}^*\) is not empty?
- When and how can we find any element from \(\mathcal{S}^*\)?

The good news here is that as long as we know our signal is compressible in some dictionary, and the magnitudes in \(\MD^T\vx\) decay quickly (e.g., \(\MD^T\vx\) is itself sparse), we can recover the possibly non-sparse signal \(\vx\) using \(\ell_1\) analysis to within some maximum error, no matter how large the coherence of the dictionary is (even \(1\)).

The authors do provide one instance where the magnitudes in \(\MD^T\vx\) do not decay quickly, but the original signal is sparse and still recoverable from projections onto random vectors. Unfortunately this example is artificial. Consider \(\vx\) to be a periodic Dirac comb with impulses spaced every \(\sqrt{n}\) samples (assuming \(n\) is a perfect square). Thus, \(\vx\) is equally sparse in the standard basis (i.e., as itself) and in the Fourier series basis. (In this case we are assuming \(\epsilon = 0\).) Now, if we concatenate these two bases into the dictionary \(\MD\) of \(2n\) atoms, we see that \(\MD^H\vx\) will not have quickly decaying magnitudes: it will have \(2\sqrt{n}\) equal nonzero magnitudes (considering all atoms are unit norm). However, by the theorem above, if we take \(\Upsilon > 2\sqrt{n}\), we can *recover exactly* \(\vx\) using \(\ell_1\) analysis with a number of measurements on the order of \(\Upsilon \log (2n/\Upsilon)\). The second that \(\vx\) displays a difference in the amplitude of any one of its impulses, or equivalently that \(\vx\) takes on a non-zero mean, this guarantee will vanish, since the tail of \(\MD^H\vx\) will no longer be zero. So, in brief, this "little surprise" is indeed "little," but conceptually interesting. Note too that there exist an infinite number of ways to represent our comb signal in this dictionary, the two sparsest ways being \(\sqrt{n}\) impulses *or* \(\sqrt{n}\) sines. The sparsity of the next (infinite number of) sparsest solutions to \(\vx = \MD\vs\) jumps to \(2\sqrt{n}\). Then after that, all other solutions have a sparsity in the range \([\sqrt{n} + 1 + n, 2n]\). So there are large differences between the two sparsest solutions and all others.

Though we have good news above, the bad news is that this approach says nothing about the second question: how can we find some of the efficient solutions in \(\mathcal{S}^*\)? This is the problem of sparse representation, and as a sparse approximologist specializing in the domain of high-dimensional audio signals, this is the question I am most interested in solving.
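The equal-magnitude claim in the comb example is easy to check numerically. In this sketch of my own (with \(n = 16\)), the comb has \(\sqrt{n}\) nonzeros in time, and its DFT is again a comb with \(\sqrt{n}\) nonzero coefficients, so the analysis against the concatenated dictionary has \(2\sqrt{n}\) equal nonzero magnitudes:

```python
import cmath
import math

n = 16                    # a perfect square
r = int(n ** 0.5)         # sqrt(n) = 4
# Dirac comb with impulses every sqrt(n) samples
x = [1.0 if t % r == 0 else 0.0 for t in range(n)]

# DFT of x: the analysis coefficients against the Fourier basis
X = [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
     for k in range(n)]

time_sparsity = sum(1 for v in x if abs(v) > 1e-9)
freq_sparsity = sum(1 for v in X if abs(v) > 1e-9)
print(time_sparsity, freq_sparsity)  # 4 4: sqrt(n) nonzeros in each basis
```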

Instead of \(\ell_1\) analysis though, we can consider \(\ell_1\) synthesis, e.g., Basis Pursuit by a convex optimization with \(K\) constraints: $$\hat \vs = \arg \min_{\vs' \in \mathcal{R}^N} || \vs' ||_1 \; \text{subject to} \; \vx = \MD \vs'$$ and hopefully here, \(\hat \vs \in \mathcal{S}^*\). But consider what happens when we multiply our model from the left by a random sensing matrix \(\mathbf{\Phi} \in \mathcal{R}^{M\times K}\). The problem above becomes equivalently $$\hat \vs = \arg \min_{\vs' \in \mathcal{R}^N} || \vs' ||_1 \; \text{subject to} \; \mathbf{\Phi}\vx = \mathbf{\Phi}\MD \vs'$$ where now we have \(M < K\) constraints. This is the same problem as before as long as \(M\) is large enough (which depends on the expected sparsity of \(\vs\)), but we have reduced the dimensionality of the problem. Thereby we may more efficiently find a good solution to our model. This approach is exactly the one empirically considered in: M. Christensen, J. Østergaard, and S. H. Jensen, "On compressed sensing and its application to speech and audio signals," in Proc. Asilomar Conf. Signals, Systems, and Computers, (Pacific Grove, CA), Nov. 2009. (This paper was my first Po'D.)

After writing about probabilistic orthogonal matching pursuit (PrOMP) a few days ago, I thought it might be a good idea to look at a previous approach that actually uses MP: S. E. Ferrando, E. J. Doolittle, A. J. Bernal, and L. J. Bernal, "Probabilistic matching pursuit with Gabor dictionaries", Signal Processing, vol. 80, no. 10, pp. 2099-2120, Oct. 2000. Boy, was I wrong.

After writing about probabilistic orthogonal matching pursuit (PrOMP) a few days ago, I decided to implement and test it starting from my old dissertation code. The MATLAB code I used to generate the plots below is here, for those interested in reproducing my research.
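For readers without MATLAB, the core selection idea can be sketched in a few lines of Python. This is my own illustrative toy (using a plain, non-orthogonal MP update for brevity), not the authors' algorithm or the code linked above:

```python
import random

def prob_matching_pursuit(x, D, iters=20, seed=1):
    """Toy probabilistic matching pursuit: at each step an atom is chosen
    at random with probability proportional to |<residual, atom>|, and its
    projection is subtracted (a plain MP update, not the orthogonal one)."""
    rng = random.Random(seed)
    K = len(x)
    residual = list(x)
    coeffs = [0.0] * len(D)
    for _ in range(iters):
        corrs = [sum(residual[i] * d[i] for i in range(K)) for d in D]
        weights = [abs(c) for c in corrs]
        if sum(weights) < 1e-12:
            break
        j = rng.choices(range(len(D)), weights=weights)[0]
        coeffs[j] += corrs[j]
        residual = [residual[i] - corrs[j] * D[j][i] for i in range(K)]
    return coeffs, residual

# A tiny dictionary of unit-norm atoms (stored as rows): e1, e2, and a mix
s2 = 2 ** -0.5
D = [[1.0, 0.0], [0.0, 1.0], [s2, s2]]
coeffs, residual = prob_matching_pursuit([3.0, 1.0], D)
res_energy = sum(v * v for v in residual)
print(res_energy)  # strictly less than the initial energy of 10
```

Because each update subtracts the full projection onto a unit-norm atom, the residual energy decreases monotonically; the randomness only changes which descent path is taken.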


Like Christmas morning when I was young, I awoke to two presents in my inbox stocking: both of my submissions to EUSIPCO 2010 have been accepted, which I make available in draft form below.

- B. L. Sturm and G. Defrance, "Detection and Estimation of Arrivals in Room Impulse Responses by Greedy Sparse Approximation" (draft), Proc. European Signal Process. Conf., Aalborg, Denmark, Aug. 2010.
- B. L. Sturm, M. Morvidone, and L. Daudet, "Musical Instrument Identification using Multiscale Mel-frequency Cepstral Coefficients" (draft), Proc. European Signal Process. Conf., Aalborg, Denmark, Aug. 2010.

My world was shaken this morning when I read a proof that (1) "absence of evidence **is** evidence of absence." I had always thought this maxim was: (2a) "absence of evidence **is not** evidence of absence;" or less strongly, (2b) "absence of evidence is not necessarily evidence of absence." Here are some examples of statements I think are logically correct, and which provide examples of (2a,b):

- The fact that we have not seen a tea pot orbiting the Earth (have we been looking?) does not falsify the claim that a tea pot is orbiting the Earth. (Surely there must be one on the Mir space station!)
- The fact that we have not received and decoded interstellar messages from extraterrestrials does not support the notion that extraterrestrials do not exist.
- The fact that we have yet to directly observe a Higgs boson does not imply that there are no Higgs bosons.
- The fact that we have not found weapons of mass destruction in Iraq does not necessarily mean there were no weapons of mass destruction in Iraq. (Have we searched all sand dunes? Maybe there is a futuristic bomb that we cannot detect. Have we searched in the past hour? Maybe some just crossed the border.)

The author's proof begins with the following definitions:

- Define \(H\) as some hypothesis claimed to be true.
- Thus, we call \(\sim H\) "absence", i.e., the hypothesis is false.
- Let \(D\) be evidence of \(H\), which means \(P(H | D) > P(H | \sim D)\).
- Finally, we call \(\sim D\) "absence of evidence".

*Before we even start with the proof, I feel uneasy with four things. First is the interpretation of \(\sim D\) as being "absence of evidence". If \(D\) is evidence in support of \(H\), is \(\sim D\) evidence supporting \(\sim H\) (e.g., we have recorded many white swans and no black swans to support our claim "all swans are white")? Or does \(\sim D\) mean we have yet to collect data that supports either \(H\) or \(\sim H\) (e.g., we have not even gone to the field yet, but we have a list of historical housing prices)? Second, we have not properly defined the probability measure for \(H\) and \(D\), and their joint probability measure, so that we can justifiably use conditional probability. For instance, I have no idea of the sample space from which \(D\) comes, which is related to my first problem: if \(P(D) + P(\sim D) = 1\), then I would interpret \(\sim D\) as the collection of all the other data in the world, including baseball statistics from 1910. Third, I am concerned by the lack of definition of a prior for \(D\) and \(\sim D\). The fact that we are to believe in \(H\) given the maximum likelihood rule is not so good if \(P(D)\) is extremely small (as I will show below). And if \(P(D) = 0\), then any conditioning on \(D\) will be meaningless. And finally, it seems a more proper definition of "evidence" is that if \(P(H|D) > P(\sim H|D)\), then \(D\) is evidence of \(H\).*

Anyhow, starting from the author's definition of evidence, the author proceeds $$\begin{align} P(H | D) & > P(H | \sim D) \\ 1 - P(\sim H | D) & > 1 - P(\sim H | \sim D) \end{align} $$ by the fact that, e.g., \(P(H | D) = 1 - P(\sim H | D)\). This means that $$ \begin{align} P(\sim H | D) & < P(\sim H | \sim D) \\ P(\sim H | \sim \sim D) & < P(\sim H | \sim D) \end{align} $$ since \(D \equiv \sim \sim D\). The author stops here, believing he or she is finished (and has even made it available for sale printed on a shirt. Oh the hubris!).
However, the author has only shown that the probability of \(\sim H\) given the absence of evidence \(\sim D\) is greater than the probability of \(\sim H\) given the absence of the absence of evidence \(\sim \sim D\) supporting \(H\), i.e., evidence \(D\) supporting \(H\). This of course does not match the maxim (1). So I do not agree that the author has proven that "absence of evidence is evidence of absence," even though there are several other claims on the Internets to the contrary: see here, and here, and here.
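The algebra itself is easy to confirm with a concrete joint distribution (the numbers below are invented for illustration): whenever \(P(H|D) > P(H|\sim D)\), it indeed follows that \(P(\sim H|\sim D) > P(\sim H|D)\), which is all the derivation establishes:

```python
# Joint distribution over (H, D); the numbers are invented for illustration
p = {("H", "D"): 0.30, ("H", "~D"): 0.10,
     ("~H", "D"): 0.15, ("~H", "~D"): 0.45}

def cond(h, d):
    """P(h | d) computed from the joint table."""
    return p[(h, d)] / (p[("H", d)] + p[("~H", d)])

# D is "evidence of H" in the author's sense:
assert cond("H", "D") > cond("H", "~D")

# ... and the derived inequality indeed holds:
print(cond("~H", "~D") > cond("~H", "D"))  # True
```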

The author also has the following simplified explanation, which shows some errors in his or her thinking.

- Women and men wear skirts, and things other than skirts.
- More women than men wear skirts.

From this, we claim that observing a person wearing a skirt provides evidence that the wearer is a woman. Now, we observe someone not wearing a skirt. Based on the preceding, we decide that that person is more likely a man than a woman.

First of all, I don't see the "absence of evidence" here. Is it the fact that we did not observe a skirt, but pants? That is still evidence, because it provides support against the hypothesis that we see a woman, and for the hypothesis that we see a man. We would have an absence of evidence, however, if we observed instead a naked person. This shows the necessity of defining the entire sample space (in the first definition above), which makes no mention that it is possible for people to be seen naked. The second problem with this scenario is, as said above, the lack of priors. What if there were 5 billion females, and 1 male in our sample space? What does that do to our confidence that, since we have not observed a skirt but pants, we have here a man?
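The point about priors is easy to make concrete with Bayes' rule (the skirt-wearing rates below are invented for illustration): even if women wear skirts far more often than men do, a lopsided enough prior swamps the likelihoods:

```python
# Invented likelihoods: P(skirt | woman) = 0.4, P(skirt | man) = 0.001,
# and a lopsided population: 5 billion women, 1 man.
p_woman = 5e9 / (5e9 + 1)
p_man = 1 - p_woman
p_pants_given_woman = 1 - 0.4
p_pants_given_man = 1 - 0.001

# Bayes' rule: P(man | pants)
num = p_pants_given_man * p_man
den = num + p_pants_given_woman * p_woman
p_man_given_pants = num / den
print(p_man_given_pants < 1e-8)  # True: still almost surely a woman
```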

My world remains unshaken, but still I wonder, maybe I missed something obvious? It is the inductivist in me.

BLS edit (02122010): added description of \(n\) parameter in algorithm.

Hello and welcome to Paper of the Day (Po'D): Probabilistic Matching Pursuit Edition. I learned of today's paper from Igor Carron and his excellent, check-more-often-than-the-New-York-Times blog, Nuit Blanche: A. Divekar and O. K. Ersoy, "Probabilistic Matching Pursuit for Compressive Sensing," Electrical and Computer Engineering Technical Report, Purdue University, May 2010. They have made available some attendant code as well.


Welcome to Paper of the Day (Po'D): Enhancing Sparsity by Reweighted \(\ell_1\)-norm Minimization Edition: E. J. Candès, M. B. Wakin, and S. P. Boyd, "Enhancing sparsity by reweighted \(\ell_1\) minimization," J. Fourier Analysis Applications, vol. 14, no. 5-6, pp. 877-905, 2008. This paper presents an intriguing way to make \(\ell_1\)-norm minimization favor sparsity more than it already does, at the cost of more computational complexity. However, with fast convex optimization algorithms at our disposal, this may not be so serious for some applications.
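The key mechanism is the weight update between successive weighted \(\ell_1\) solves: \(w_i = 1/(|s_i| + \epsilon)\), so coefficients that are currently large are penalized less on the next pass. Here is a sketch of just that update (the inner convex solver is omitted, and the example coefficients are my own invention):

```python
def reweight(s, eps=0.1):
    """Candes-Wakin-Boyd weight update: w_i = 1 / (|s_i| + eps).
    Small coefficients get large weights (pushed toward zero), while
    large coefficients get weights near 1/|s_i| (barely penalized)."""
    return [1.0 / (abs(si) + eps) for si in s]

# Suppose a first l1 solve returned one big coefficient plus noise:
s = [2.0, 0.01, 0.0, -0.03]
w = reweight(s)
print(w[0] < min(w[1:]))  # True: the dominant coefficient is penalized least
```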

I just received the following response from an anonymous reader:

> I just read your "Solved Problems in Audio Speech Processing?" blog entry and I have some comments. Compressed sensing is not the answer --- it does not offer efficient compression anywhere near what we already have, as the associated bitrate is far from the rate-distortion bound. It can only serve as an intermediate step between raw samples and reconstruction using a sparse representation. The sparse representation itself, however, offers a much more compact description, and the people who think compressed sensing is the answer to anything like this haven't understood compression at all. ... As for people complaining about compression artifacts, they are forgetting that any reconstruction of their work is subject to loss, including CDs and analog ones. What mp3 and AAC offer is more efficient use of the available bandwidth, and they can be scaled to yield the same as CD quality --- if you are willing to pay the price. This is not a compression technology problem.

I agree that compressed sensing is much more a method of acquisition than it is a method for compression. Furthermore, many audio signals have non-compressible elements, e.g., whispering speech, breathiness in the flute, the snare drum, etc. Thus, I don't see compressed sensing as providing any sort of perceptually and bitrate competitive compression over even \(\mu\)-law quantized Huffman-encoded audio signals --- not to mention the embarrassingly high computational complexity at the decoder involved in solving the convex optimization problems at a cost of \(\mathcal{O}(1000 n\log n)\), and assuming the signal is sparse.
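For reference, the \(\mu\)-law companding mentioned above (with \(\mu = 255\), the North American standard value) maps a sample \(x \in [-1, 1]\) through \(F(x) = \mathrm{sgn}(x)\ln(1 + \mu|x|)/\ln(1 + \mu)\) before uniform quantization; a minimal sketch:

```python
import math

MU = 255.0  # standard North American / Japanese value

def mulaw_compress(x):
    """mu-law companding of a sample x in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mulaw_expand(y):
    """Inverse of mulaw_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

x = 0.25
y = mulaw_compress(x)
assert abs(mulaw_expand(y) - x) < 1e-12  # companding is invertible
print(y > x)  # True: small amplitudes are boosted toward full scale
```

Uniformly quantizing \(y\) instead of \(x\) is what gives small amplitudes finer effective resolution.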

We received several excellent responses to my response to Igor's question.

Reader Cuss makes an interesting point about lossy audio compression. He says,



The deadline for posters and demos to HAID 2010 is approaching: **June 1st, 2010**.

HAID 2010 will take place on September 16-17 2010 in Copenhagen, Denmark.

The aim of the poster session is to showcase work in progress or late breaking results relevant to any of the themes of the workshop. Submissions should be 2 page papers (instructions on submission). Accepted submissions will be published by Aalborg University Press with an ISBN number, and authors are expected to bring a poster describing their work to HAID10.

This year HAID 2010 will feature Søren Bech from Bang and Olufsen, Vincent Hayward from UPMC, France and Davide Rocchesso from IUAV, Italy as keynote speakers.

Technologies to enable multimodal interaction are now sufficiently mature that research is turning away from pure hardware development and looking towards interaction and design issues to improve usability and the user experience.

Robust solutions exist to display audio and haptic feedback in many forms - for instance as speech and non speech sounds and through tactile and force feedback sensations. Furthermore, it has been demonstrated that the novel interactions supported by these modalities can confer benefits for all users. However, many questions remain concerning the appropriate use of haptics and audio in interaction design: how can we design effective haptic, audio and multimodal interfaces? In what new application areas can we apply these techniques? Are there design methods that are useful? Or evaluation techniques that are particularly appropriate?

HAID 2010 is a direct successor to the successful workshop series inaugurated in Glasgow in 2006, in Seoul in 2007, in Jyväskylä in 2008 and in Dresden in 2009. The aim of HAID 2010 is to bring together researchers and practitioners who share an interest in finding out how the haptic and audio modalities can be used together in human computer interaction. The research challenges in the area are best approached through user-centred design, empirical studies or the development of novel theoretical frameworks.

We invite your posters and demonstrations on these topics, and look forward to seeing you in Copenhagen in September 2010!


Igor from the excellent compressed sensing research repository blog Nuit Blanche has asked the following question:

Audio signals are deceptively simple. For each channel of audio we have a one-dimensional signal. When I contrast this with the two-dimensional, three-color signals that are images, I often feel that I have chosen the easier subject. In image signal processing, I see a lot of excellent work in image segmentation, object recognition, and 3-dimensional renderings from many images, just to name a few. The use of computers in rendering realistic scenes at the resolution required by movie theaters is amazing, and instead of watching a computer-generated movie as a movie, I am often watching the movement of hair, the gaits of the characters, the motions of waves, or fires, and the effects used to make it appear that a real camera was used. The great strides in image signal processing are readily apparent, and in no small part due to how visually focused Western culture is.

I said above that audio signals are deceptively simple. Even though they are one-dimensional, real acoustic signals are superpositions of complex phenomena. An acoustic scene contains a mixture of different components, each of which is treated differently by our organic audio signal processing. The properties of this organic signal processing have been fully taken advantage of to enable the transparent compression of audio to very small bitrates. For wideband music signals, this means we are able to reduce the datarate about 10 times without creating artifacts noticeable to the non-expert. Does the future of audio compression mean we will arrive at 100 times compression for such signals? Not so, says information theory. We have nearly reached the minimum datarate prescribed by the universe. Thus, audio compression and speech compression, like denoising Lena, are solved problems.

So if compression is dead, what is alive in audio? There are several areas of exciting and active work. One area is the reproduction of sound fields using wavefront synthesis. This technique attempts to recreate the acoustic sound field using, usually, several dozen (maybe even hundreds of) speakers. With such a system, the "sweet spot" associated with N.1 sound systems is a thing of the past. Everyone inside the speakers is at a sweet spot, because they are all receiving acoustic waves as if they were listening to a real music band. Such a system does not take advantage of human perception, in the way that we can by using head-related transfer functions and headphones. It is purely based on acoustics. Perhaps someday we will all have wavefront synthesis sound systems in our houses.

Another area of work is in source separation, and blind source separation. Since a music signal is a complex mixture of multiple sources, can we separate and isolate the sources using \(m\) channels? Blind source separation is an even harder problem because we start with no information on what is playing, where and when, and how it is mixed. However, since humans are capable of tuning into a single instrument or human voice, we know it can be done by a computer. It is just a matter of training on the right features, and perhaps a fusion of several sensing, e.g., hearing and vision.

In line with source separation is the problem is the automatic transcription of music. Speech transcription may be a solved problem, as is music recognition (identifying a piece of music from a segment), but not polyphonic music transcription. This is one of the ultimate problems in music signal processing: going from the sampled waveform to the abstract high-level of common practice notation. In line with this is the more abstract problem of making sense of the music by segmentation, and labeling as chorus, verse, solo, etc. And then there is the associated problem of determining those parts of a piece of music that best describe it, e.g., the chorus, or the main melody. More and more we move from the solutions of the basic problems at the low-level, to solving high-level problems that are human-centered, and involve massive amounts of data, the duration of which exceeds our first-world lifespans.

I was talking to a friend recently, and we were wondering about what is going on in audio that is new. He asked:

"In effect, while all our research goals are important and contribute to the betterment of specific technologies, from the outside it looks like the world of audio is pretty much dead, as the sensors are all available and, given some money, can be of very high quality. Take imagery, for instance: while many people are still looking at ways to denoise Lena, which I would consider a dead end, there is a community around the SIGGRAPH conferences that is pretty dynamic, in that you really see some inventive things going on there every year, and you get a sense that they will be new to the general public once they are adopted. Is there something similar in the world of audio, or has everything been tried and diffused into the public sphere? I am not quite sure the question makes sense, but I would really appreciate it if you could clue me in :)"

This question makes excellent sense, and it is one that I consider on a weekly basis, at 10h13 each Friday: "What is new or upcoming in audio research?"

Audio signals are deceptively simple. For each channel of audio we have a one-dimensional signal. When I contrast this with the two-dimensional three color signals that are images, I often feel that I have chosen the easier subject. In image signal processing, I see a lot of excellent work in image segmentation, object recognition, and 3-dimensional renderings from many images, just to name a few. The use of computers in rendering realistic scenes at the resolution required by movie theaters is amazing, and instead of watching a computer-generated movie as a movie, I am often watching the movement of hair, the gaits of the characters, the motions of waves, or fires, and the effects used to make it appear that a real camera was used. The great strides in image signal processing are readily apparent, and in no small part due to how visually focused Western culture is.

I said above that audio signals are deceptively simple. Even though they are one-dimensional, real acoustic signals are superpositions of complex phenomena. An acoustic scene contains a mixture of different components, each of which is treated differently by our organic audio signal processing. The properties of this organic signal processing have been fully exploited to enable the transparent compression of audio to very small bitrates. For wideband music signals, this means we are able to reduce the datarate by about 10 times without creating artifacts noticeable to the non-expert. Does the future of audio compression mean we will arrive at 100 times compression for such signals? Not so, says information theory: we have nearly reached the minimum datarate prescribed by the universe. Thus, audio compression and speech compression, like denoising Lena, are solved problems.

So if compression is dead, what is alive in audio? There are several areas of exciting and active work. One is the reproduction of sound fields using wavefront synthesis. This technique attempts to recreate the acoustic sound field using several dozen (maybe even hundreds of) loudspeakers. With such a system, the "sweet spot" associated with N.1 sound systems is a thing of the past: everyone inside the speaker array is at a sweet spot, because they are all receiving acoustic waves as if they were listening to a real band. Such a system does not take advantage of human perception in the way that we can with head-related transfer functions and headphones; it is purely based on acoustics. Perhaps someday we will all have wavefront synthesis sound systems in our homes.

Another area of work is source separation, and blind source separation. Since a music signal is a complex mixture of multiple sources, can we separate and isolate the sources using \(m\) channels? Blind source separation is an even harder problem, because we start with no information on what is playing, where and when, and how it is mixed. However, since humans are capable of tuning into a single instrument or voice, we know it can be done by a computer. It is just a matter of training on the right features, and perhaps a fusion of several sensing modalities, e.g., hearing and vision.

In line with source separation is the problem of the automatic transcription of music. Speech transcription may be a solved problem, as is music recognition (identifying a piece of music from a segment), but not polyphonic music transcription. This is one of the ultimate problems in music signal processing: going from the sampled waveform to the abstract high level of common-practice notation. In line with this is the more abstract problem of making sense of the music by segmentation and labeling: chorus, verse, solo, etc. And then there is the associated problem of determining those parts of a piece of music that best describe it, e.g., the chorus, or the main melody. More and more we move from solutions to the basic low-level problems toward high-level problems that are human-centered and involve massive amounts of data, the duration of which exceeds our first-world lifespans.

While reading the paper, M. Babaie-Zadeh, V. Vigneron, and C. Jutten, "Sparse decomposition over non-full-rank dictionaries," Proc. ICASSP, 2009, I came across the following phrase:

"For a non-full rank overcomplete dictionary ..."

For some moments this shook my confidence in my understanding of what it means for a dictionary to be overcomplete, and for the dictionary matrix to have full rank. How can one call a dictionary overcomplete when there exist vectors outside its column space? That sounds to me like an incomplete dictionary. I have always thought that for a dictionary \(\MD\) of \(n\) atoms of dimension \(m\), \(\text{rank}(\MD) = m\) implies completeness, and overcompleteness when additionally \(n > m\). And if a fat dictionary is not full rank, i.e., \(\text{rank}(\MD) < m\), then it cannot be complete, not to mention overcomplete. In other words, one cannot have an overcomplete dictionary that is not full rank.

Perhaps, though, the designation "overcomplete" is like the term "optimized": meaningless without a reference. Optimized with respect to what? Overcomplete with respect to what? If we have a space described by $$\mathcal{A} = \text{span}\{\va_i \in \mathcal{R}^m : ||\va_i|| = 1\}_{i=1, \ldots, k}, \; k < m$$ and we construct the dictionary matrix $$\MA = [\va_1, \ldots, \va_k, (\va_1+\va_2)/||\va_1+\va_2||]$$ then \(\MA\) is not full rank in \(\mathcal{R}^m\) (its rank is \(k < m\)), but it is overcomplete for \(\mathcal{A}\): every \(\va \in \mathcal{A}\) has infinitely many representations. It doesn't appear to me that this is what the authors mean, though, especially considering that in the introduction they state, "If \(m > n\), the dictionary is overcomplete."

Just because a dictionary is fat does not make it overcomplete. Fatness means that \(\MA\vs = \vx\) is an underdetermined system, but not that a solution always exists. Completeness means a solution to \(\MA\vs = \vx\) always exists for every \(\vx\), and overcompleteness means an infinite number of solutions exist. In other words, a fat matrix is not necessarily an overcomplete dictionary, but overcompleteness implies a fat and full-rank dictionary. Thus, "overcomplete" (and thus full rank) is a stronger condition on a dictionary than "fat."
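To make the fat-versus-overcomplete distinction concrete, here is a minimal numerical sketch (my own toy matrices in NumPy, not anything from the paper): both matrices below are fat (2 x 3), but only the full-rank one is an overcomplete dictionary for \(\mathcal{R}^2\).

```python
import numpy as np

# A fat, full-rank dictionary: 2 x 3, rank 2.
# Every x in R^2 has a solution to A s = x, in fact infinitely many.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])

# A fat but rank-deficient dictionary: 2 x 3, rank 1.
# Its columns all lie on one line, so it cannot represent all of R^2.
B = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])

print(np.linalg.matrix_rank(A))  # 2: fat and full rank, hence overcomplete
print(np.linalg.matrix_rank(B))  # 1: fat but not even complete
```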

Am I missing something with regards to "non-full rank overcomplete dictionaries," or is this truly an oxymoron?

Here is a story I rather like to tell in front of a cozy fireplace, which shows that when the disciplines of computer science and signal processing meet, the interaction is not always complete.

In the paper, R. Agrawal, C. Faloutsos, and A. Swami, "Efficient similarity search in sequence databases," in Proc. Int. Conf. Foundations Data Org. Algo., (Chicago, IL), pp. 69-84, Oct. 1993, the authors propose a method of similarity search in time series of the same size ("whole matching" instead of "subsequence matching"). They use the discrete Fourier transform (DFT) to transform each time series, and approximate its shape by keeping the first 3-5 complex DFT coefficients, which they justify by the observation that many real-world signals have power spectral densities that drop off like \(f^{-\alpha}\), with \(\alpha > 1\). They define the similarity between two time series as the Euclidean distance between their frequency-domain features. And Parseval's equality guarantees that by comparing a subset of the DFT coefficients instead of the entire transform, we can only make two time series appear more similar, never less. (Thus, no false dismissals.) This approach reduces the amount of computation in comparing signals, and turns "exact" search into "similarity" search by looking at shapes instead of details.
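Here is a minimal sketch of this lower-bounding property (my own toy example in NumPy, not the authors' code): with the DFT scaled so that Parseval's equality holds, the distance between truncated feature vectors can never exceed the true distance, so a range query on the features cannot dismiss a true match.

```python
import numpy as np

rng = np.random.default_rng(0)

def dft_feature(x, n_coef=4):
    # Unitary scaling so that ||X||_2 == ||x||_2 (Parseval's equality);
    # keep only the first few coefficients as the "shape" feature.
    return np.fft.fft(x)[:n_coef] / np.sqrt(len(x))

x = rng.standard_normal(128)
y = rng.standard_normal(128)

d_true = np.linalg.norm(x - y)                             # exact distance
d_feat = np.linalg.norm(dft_feature(x) - dft_feature(y))   # feature distance

# Dropping coefficients can only shrink the distance: no false dismissals.
assert d_feat <= d_true
```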

Then, five years later, in the paper, D. Rafiei and A. Mendelzon, "Efficient retrieval of similar time sequences using DFT," in Proc. Int. Conf. Found. Data Org., (Kobe, Japan), pp. 249-257, Nov. 1998, the authors show:

"... an improvement of the known DFT-based indexing technique for fast retrieval of similar time sequences. We use the last few Fourier coefficients in the distance computation without storing them in the index since every coefficient at the end is the complex conjugate of a coefficient at the beginning and as strong as its counterpart. ... We show analytically that this observation can accelerate the search time of the index by more than a factor of two."

Well, that bit about the conjugate-symmetric transform is nearly right. If the time series are all real-valued, then we don't have to search over the larger domain of analytic signals. It is amazing that it took about five years to come to that discovery; but hey, maybe in 1993 the DFT was still very exotic and not generally well understood.
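The symmetry the authors exploit is easy to check numerically (a toy sketch of mine, not their implementation): for a real signal, \(X[N-k] = X^*[k]\), so each retained non-DC coefficient can be counted twice in the distance without storing its mirror, and the result still lower-bounds the true distance.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 64
x = rng.standard_normal(N)
y = rng.standard_normal(N)

X = np.fft.fft(x)
# Conjugate symmetry of the DFT of a real signal: X[N-k] == conj(X[k]).
assert np.allclose(X[1:], np.conj(X[1:][::-1]))

# Distance from K low-frequency coefficients (unitary scaling), counting
# each non-DC coefficient twice to stand in for its unstored mirror.
K = 5
D = (np.fft.fft(x) - np.fft.fft(y)) / np.sqrt(N)
d_feat_sq = abs(D[0])**2 + 2.0 * np.sum(np.abs(D[1:K+1])**2)
d_true_sq = np.linalg.norm(x - y)**2

# Still a lower bound (valid while 2K + 1 <= N), but twice as tight
# per stored coefficient as ignoring the mirrored half.
assert d_feat_sq <= d_true_sq + 1e-9
```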

Continuing on with my experiments (see here, and here) with LoCOMP, I now have some model performance comparisons between LoCOMP and OMP using interference adaptation.
The plots below show the model order (the number of decomposition iterations) required to reach a given signal-to-residual-energy ratio (SRR), \(20\log_{10} ||\vx||/||\vr||\), as a function of the interference weighting \(\gamma\) used in the atom selection. (I hold the weighting constant, which I don't think is ideal.) The three signals used are shown here.
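For reference, the SRR above is simple to compute from a signal and its residual; a minimal sketch in NumPy (the helper name is mine, not from the LoCOMP code):

```python
import numpy as np

def srr_db(x, r):
    # Signal-to-residual-energy ratio: 20 log10(||x|| / ||r||), in dB.
    return 20.0 * np.log10(np.linalg.norm(x) / np.linalg.norm(r))

x = np.ones(100)        # toy "signal"
r = 0.1 * np.ones(100)  # toy "residual" at 10% of the signal amplitude
print(srr_db(x, r))     # approximately 20 dB
```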

Hello, and welcome to the Paper of the Day (Po'D): Enhancing Sparsity in Linear Prediction by Iteratively Reweighted \(\ell_1\)-norm Minimization Edition. Today's paper comes from ICASSP 2010: Daniele Giacobello, Mads Christensen, Manohar N. Murthi, Søren Holdt Jensen, and Marc Moonen, *Enhancing Sparsity in Linear Prediction of Speech by Iteratively Reweighted \(\ell_1\)-norm Minimization,* IEEE Int. Conf. Acoustics, Speech, Signal Process., Dallas, TX, USA, March 2010. A previous paper of theirs on a related subject was the Po'D for March 12, 2010.

Bob L. Sturm, Associate Professor

Audio Analysis Lab

Aalborg University Copenhagen

A.C. Meyers Vænge 15

DK-2450 Copenhagen SV, Denmark

Email: bst_at_create.aau.dk