# March 2010 Archives

## Paper of the Day (Po'D): Demixing Pursuit Edition

Hello and welcome to Paper of the Day (Po'D): Demixing Pursuit Edition. Today's interesting paper comes from a conference in 2006 devoted to blind source separation (BSS): S. Lesage, S. Krstulovic, and R. Gribonval, Underdetermined source separation: Comparison of two approaches based on sparse decompositions,'' in Proc. Int. Conf. Independent Component Analysis Blind Source Separation, (Charleston, South Carolina), pp. 633-640, Mar. 2006.

The authors compare two approaches to BSS using sparse approximation, both of which know the mixing matrix. This is in contrast to Gribonval's stereo matching pursuit (reviewed Monday) where the mixing matrix (and number of sources) is estimated from the decomposition parameters, and each source is reconstructed from the relevant atoms. The authors propose two methods that assume near perfect knowledge of the mixing matrix (and thus the number of sources). In the first case, multichannel matching pursuit,'' the authors decompose all mixtures in parallel with MP and a monochannel dictionary. After decomposition each atom is assigned to a source based on maximizing the magnitude projection of its weights (across mixtures) onto each row of the mixing matrix (where $$\MX = \MA\MS$$ are the mixtures, and $$\MA$$ is the mixing matrix). They also propose a variant of this by relaxing the assumption of one atom one source,'' so that an atom can belong to several sources. In the second case, demixing pursuit,'' a monochannel dictionary is turned into a dictionary of directional'' multichannel atoms created with the mixing matrix in mind. In contrast to the first approach, the mixing matrix is incorporated into the dictionary. MP is then run on the matrix of mixtures, and the sources are reconstructed as before.

The authors perform several tests on mixtures of three signals: piano,'' drums,'' and cello'' (I still have no idea if the instruments are playing together, as in a wedding band, or if they are from separate pieces, as in different cars in the Parisian metro). They consider two different dictionaries: a complete short-term Fourier transform dictionary (monoresolution), and an overcomplete multiresolution time-frequency dictionary. The authors also perform smoothing'' in the sparse approximation process, which essentially performs several decompositions with the signal slightly shifted relative to the dictionary, and then an averaging of the decomposition parameters (which sounds a lot like the method proposed in P. J. Durka, D. Ircha, and K. J. Blinowska, Stochastic time-frequency dictionaries for matching pursuit,'' IEEE Trans. Signal Process., vol. 49, pp. 507-510, Mar. 2001). For three distortion measures (source-artifact, source-interference, and source-distortion), the authors find that the demixing pursuit performs more competitively with the larger dictionary as the approximation model order increases. The authors also explored the effect of imprecise knowledge of the mixing matrix by adjusting all directions by a small angle $$[-\pi/16,\pi/16]$$. The source-interference distortion measure shows the greatest variability in this case.

## Paper of the Day (Po'D): Dual Matching Pursuit Edition

Hello, and welcome to the Paper of the Day (Po'D): Dual Matching Pursuit Edition. Today's interesting paper comes from ICASSP 2004: P. Sugden and N. Canagarajah, "Underdetermined noisy blind separation using dual matching pursuits," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Process., (Montreal, Quebec, Canada), pp. 557-560, May 2004.

The authors propose an alternative approach to stereo matching pursuit (reviewed yesterday) for separating sources at the atomic level. Instead of performing MP on both mixtures using a dictionary of two-channel time-frequency atoms, the authors here perform MP with an overcomplete (single channel) time-frequency dictionary $$\mathcal{D} = \{ \vg \in \MR^K : ||\vg||_2 = 1\}$$ on each mixture separately. So here, the $$i$$th channel of the mixture $$\MY$$ can be modeled as $$\lim_{L \to |\mathcal{D}|} \Norm \vy_i - \sum_{l=1}^{L} \beta_{l}^{(i)} \vg_l^{(i)} \Norm = 0$$ and thus $$\MY \approx \left [ \sum_{l=1}^{L_1} \beta_{l}^{(1)} \vg_{l}^{(1)} \Biggl | \sum_{l=1}^{L_2} \beta_{l}^{(2)} \vg_{l}^{(2)} \Biggl |\cdots \right ]$$ bordering on notation overload.

Since uncorrelated sources have a low probability of sharing elements localized in both time and frequency,'' the authors assume that there will exist atoms repetitions across channels that belong to each source., i.e., two atoms one source,'' instead of Gribonval's one (stereo) atom one source'' (made justifiable by the atom selection across the mixtures). Thus, if an atom is repeated across the mixtures (which I assume means the atoms will have the same support and modulation frequency, but not necessarily the same shift) then an estimate of the panpot parameter'' for the source associated with the atom, which pertains to a $$\theta$$ in the mixing matrix $$\mathbf{\Theta}$$, will be given by the arc tangent of the ratio of the weights of the two channels in a stereo case. As done by Gribonval, finding the distribution of these individual estimates can reveal the number of sources, and their (static) locations in the stereo field, and then generate the separated sources up to a scale factor. The authors also show how using complete dictionaries learned for the different classes of source signals expected can greatly enhance the separability of the sources from their mixtures within the underdetermined and sparse context.

While it is true that uncorrelated sources can be considered non-interacting (and thus well-separable) in the time-frequency domain, this begins to break down with multiple musical instruments playing together. And certainly so for short-time wideband phenomena like transients. However, I predict that this atomic approach (using MP and a time-frequency dictionary as in this paper of Gribonval's) will fail miserably for a mixture of constant amplitude sinusoids that are completely non-overlapping in the time-frequency domain. It is dangerous to assume an atom (and its attendant parameters) of a time-frequency model of a signal built by MP is a real aspect of the signal, and is not an artifact of the decomposition process. Atoms will be given amplitudes that are overestimated, and this will degrade the estimation of the mixing matrix in the method proposed in this paper (and Gribonval's). Even though here an atom in the model of one channel is not considered a real aspect of a source unless there is a similar atom in the other channels, it is not taking into consideration the problem of greed in MP. But who cares about separating stationary signals composed of sinusoids? Perhaps the breakdown of greedy iterative descent approximation methods looks bad on paper and contrived examples (except for the well-known pre-echo artifacts), but in the end, above the level of the atom, the artifacts may average themselves into irrelevance.

## Paper of the Day (Po'D): Stereo Matching Pursuit Edition

Hello, and welcome to the Paper of the Day (Po'D): Stereo Matching Pursuit Edition. Today's interesting paper comes from almost eight years ago: R. Gribonval, Sparse decomposition of stereo signals with matching pursuit and application to blind separation of more than two sources from a stereo mixture,'' in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., vol. 3, (Orlando, FL), pp. 3057-3060, May 2002.

Consider a matrix of $$I$$ real length-$$K$$ sampled audio signals $$\MX := [\vx_1| \vx_2 | \cdots | \vx_I ]$$ mixed into a two-channel stereo'' signal $$\MY = [\vy_l | \vy_r ]$$, modeled in the z-domain by $$\MY(z) = \MX(z) \Lambda \mathbf{\Theta}(z) + \MN(z)$$ where $$\Lambda := \textrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_I)$$ is a diagonal square matrix of real scalars that amplify each of the sources, and $$\mathbf{\Theta}(z) := \left [ \begin{array}{ccc} \cos \theta_1 & \sin \theta_1z^{-n_1} \\ \cos \theta_2 & \sin \theta_2z^{-n_2} \\ \vdots & \vdots \\ \cos \theta_I & \sin \theta_Iz^{-n_I} \\ \end{array} \right ]$$ denotes the stereo placements ($$\theta_i \in [0,\pi/2]$$) and integer time-delays ($$n_i \in \MZ$$) of the sources. The matrix $$\MN = [\vn_l | \vn_r ]$$ is noise independent of the sources and mixing. In this model we are assuming that the only filtering of the sources is an integer time-delay and a constant gain for each channel. The challenge is to discover all of the $$(3+K)I + 2K$$ unknown components of this model, i.e., $$\MX, \Lambda$$, and $$\mathbf{\Theta}(z)$$, using just the $$2K$$ measurements $$\MY$$. This is clearly an underdetermined problem no matter what $$I \ge 1$$ is. In this paper, Gribonval proposes an estimation method  that uses greedy iterative descent sparse approximation with a parameterized dictionary of stereo atoms.

Define a dictionary $$\mathcal{D} = \{ \vg \in \MR^K : ||\vg||_2 = 1\}$$ of time-localized atoms such that $$\text{span} \; \mathcal{D} = \MR^K$$, and consider that we can model each column of $$\MX$$ in terms of these atoms, which means $$\lim_{H \to |\mathcal{D}|} \Norm \vx_i - \sum_{k=1}^H \alpha_{i,k} \vg_k \Norm = 0.$$ With this, our model of $$\MY$$ becomes $$\MY(z) = \sum_{i=1}^I \lambda_i \sum_k \alpha_{i,k} [ \cos \theta_i \vg_k(z) | \sin \theta_i z^{-n_i} \vg_k(z) ] + \MN(z).$$ Now consider defining a dictionary of stereo'' atoms from those in $$\mathcal{D}$$: $$\mathcal{D}_s(z) := \{ \vs(z) := [\cos \phi\vg(z) | \sin \phi z^{-m} \vg(z) ] : \phi \in [0,\pi/2], |m| \le m_{\max}, \vg \in \mathcal{D}\}$$ which trivially spans the space in which $$\MY \in \MR^{K\times 2}$$ exists. This means we can model $$\MY$$ in terms of these atoms as we did with each $$\vx_i$$, i.e., $$\lim_{L \to |\mathcal{D}_s|} \Norm \MY - \sum_{l=1}^L \beta_{l} \vs_l \Norm_F = 0.$$ If we assume that
$$\sum_{l=1}^L \beta_{l} \vs_l(z) = \sum_{l=1}^L \beta_{l} [\cos \phi_l \vg_l(z) | \sin \phi_l z^{-m_l} \vg_l(z) ]$$ $$\approx \sum_{i=1}^I \lambda_i \sum_k \alpha_{i,k} [ \cos \theta_i \vg_k(z) | \sin \theta_i z^{-n_i} \vg_k(z) ]$$ then we can begin to tease out the various unknowns in our model of the stereo signal. Notice too that the noise signal is no longer being considered as we will assume it is not represented in the $$L$$th-order model of $$\MY$$. If we assume that each atom in the approximation of $$\MY$$ corresponds to one atom in the $$k$$th source in the model of $$\MY$$,
then we can estimate the number of sources, $$\hat I$$, by looking for clusters in the sets $$\{\phi_l\}$$ and $$\{m_l\}$$, and then find a set of disjoint partitions to estimate each source
$$\widehat{\lambda_i \vx_i} = \sum_{l \in \mathcal{L}_i} \beta_l \vg_l$$ where $$\mathcal{L}_i$$ contains the indices of the $$i$$th cluster. The problem now is to generate the decomposition of $$\MY$$.

To solve this, Gribonval proposes stereo matching pursuit, which goes as follows using the $$(M-1)$$th-order residual $$\MR_\MY^{(M-1)}$$, where $$\MR_\MY^{(0)} = \MY$$.

1. Find the atom and its delay having the largest magnitude projection onto the residual: $$(\vg_M, m_M) = \arg \max_{\vg \in \mathcal{D}, |m| \le m_{\max}} || \MP_{\vg,m} \MR_\MY^{(M-1)} ||$$ where $$\MP_{\vg,m}\vy = [p_l, p_r]$$ is the orthogonal projection of $$\vy$$ onto the space spanned by $$\vg$$ in the left channel, and its $$m$$-delayed version in the right channel.
2. Set $$\beta_M = || \MP_{\vg_M,m_M} \MR_\MY^{(M-1)} ||$$ and $$e^{j\phi_M} = (p_l + jp_r)/\sqrt{p_l^2 + p_r^2}$$.
3. Compute the new residual: $$\MR_\MY^{(M)}(z) = \MR_\MY^{(M-1)}(z) - \beta_M [ \cos \phi_M \vg_M | \sin \phi_M z^{-m_M} \vg_M].$$
4. Repeat.
Gribonval's experiments used the much-loved LastWave implementation of matching pursuit, with a discretized dictionary of complex Gabor atoms. (Of historical note, he says that 2000 iterations of stereo MP on a sampled audio signal of length 19,200 samples (8 kHz) took about 30 minutes on a Pentium III 750 MHz laptop. I remember the long waits accompanying the decomposition process; but the interface where one can drag around the atoms and play them back was great! While sparse approximation methods are still far from real-time, there have been some incredible advances since 2002.) His experiment with a mixture of cello,'' drums,'' and piano,'' shows that it can be demixed very well with respect to the best linear demixing.

I wonder though whether the mixture of these sources is completely artificial. If they are from the same music composition, e.g., where they are playing at the same tempo, and are in the same key, etc., then their correlation might break the assumption that each atom corresponds to one source. Also, since the clustering is sensitive to the number of atoms used to represent each source, I wonder how this method would fare if piano'' was replaced by a single sinusoid, which presumably would require far fewer atoms to represent than the other wideband sources. Regardless, with reference to the problem I stated before, we see here that under the condition of sparsity in the time-frequency domain, as well as correlation between channels, it is not unreasonable to assume one atom one source'' in a sparse approximation of a mixture.

## If you happen to be in Germany...

...and have nothing planned by the end of the month, perhaps you could try to pop by the

International symposium for Evolution of Emotional Communication
in Reisenburg April 28 - May 1.

I attended the previous one, held in Hanover 2007, where I enjoyed lots of great talks on communication of both humans and animals.

Unfortunately, I won't  make it this year as I am struggling to get on top of things needing to be done. A to-do list that I fear may match that of Mike Gayle (who ended up with 1277 items in case you wonder...)

## A Puzzle of No Return? (part two)

What I mean more generally is this: To what extent can I expect a specific sparse approximation process and dictionary to hinder the separation of several sources from a mixture decomposed into atoms? Will a sparse approximation of a mixture make impossible an agglomeration of atoms into the separated sources? More specifically, can I expect to take a matching pursuit decomposition of a speech and music mixture using an overcomplete scale-time-frequency dictionary, and reconstruct the two sources using disjoint subsets of the book? This is like taking a jigsaw puzzle of a nature scene and building up just the trees, or just the clouds, from the pieces.

It is a naive assumption that each atom belongs to either the speech or music signal, especially for greedy pursuits that are prone to represent errors made in the model rather than anything real in the signal. But I wonder how bad of an assumption it is. Looks like I have got a new reading list for Paper of the Day (Po'D) next week!

N. Cho, Y. Shiu, and C.-C. J. Kuo, "Audio source separation with matching pursuit and content-adaptive dictionaries (MP-CAD)," in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust., (New Paltz, NY), pp. 287-290, Oct. 2007.

B. V. Gowreesunker and A. H. Tewfik, "Blind source separation using monochannel overcomplete dictionaries," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., (Las Vegas, NV), pp. 33-36, Mar. 2008.

R. Gribonval, "Sparse decomposition of stereo signals with matching pursuit and application to blind separation of more than two sources from a stereo mixture," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., vol. 3, (Orlando, FL), pp. 3057-3060, May 2002.

S. Lesage, S. Krstulovic, and R. Gribonval, "Underdetermined source separation: Comparison of two approaches based on sparse decompositions," in Proc. Int. Conf. Independent Component Analysis Blind Source Separation, (Charleston, South Carolina), pp. 633-640, Mar. 2006.

P. Sugden and N. Canagarajah, "Underdetermined noisy blind separation using dual matching pursuits," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Process., (Montreal, Quebec, Canada), pp. 557-560, May 2004.

## A Puzzle of No Return?

Here is an interesting problem. Consider the sum of two real signals $$x(t)$$ and $$y(t)$$ of the same length, $$z(t) := x(t) + \lambda y(t)$$, where $$\lambda$$ is real and non-zero. Let us decompose $$z(t)$$ by some pursuit (e.g., OMP) using an overcomplete dictionary of unit norm functions $$\mathcal{D} = \{d_\gamma\}_{\gamma \in \Gamma}$$, thus producing the model $$z(t) - \sum_{k \in\Gamma_z} \alpha_k d_{\gamma_k}(t) = R^{|\Gamma_z|}z(t).$$ What are the conditions on the dictionary, and the signals $$x(t)$$ and $$y(t)$$, such that we can "recover" the separate models of $$x(t)$$ and $$y(t)$$ from that of $$z(t)$$, i.e., $$x(t) - \sum_{k\in \Gamma_z^x \subset \Gamma_z} \alpha_k d_{\gamma_k}(t) = R^{|\Gamma_z^x|}x(t)$$ $$y(t) - \sum_{k\in \Gamma_z^y\subset \Gamma_z} \alpha_k d_{\gamma_k}(t) = R^{|\Gamma_z^y|}y(t)$$ where $$\Gamma_z^x \cup \Gamma_z^y = \Gamma_z$$, $$\Gamma_z^x \cap \Gamma_z^y = \oslash$$, and $$|| R^{|\Gamma_z|}z(t) ||^2 \ge || R^{|\Gamma_z^x|}x(t) ||^2 + \lambda^2|| R^{|\Gamma_z^y|}y(t)||^2$$?

Is that too much to ask?

## Welcome Compressed(ive) Sensing(ampling)

Thank you Igor for sending us a shout-out on Nuit Blanche. I truly admire the energy and effort you put into your blog to keep it a fresh and informative resource! I don't know of many good research blogs in signal processing, except for yours, and Terence Tao's in mathematics. (Suggestions are welcome for research blogs that are updated at least twice a week!)

I am trying to contribute to this (very new) blog in a way that mixes these two approaches, while contributing to and helping organize my own research endeavors. However, we will not be just focusing on dictionary design, nor even sparse approximation; but on many things related to sound, technical and fun. No promises yet; but we will see how it goes.

## Paper of the Day (Po'D): From Signal Processing to Cognition Edition

Hello, and welcome to the Paper of the Day (Po'D): From Signal Processing to Cognition Edition. Today's interesting paper comes from 2005: L. K. Hansen, P. Ahrendt, and J. Larsen, "Towards cognitive component analysis," in Proc. Int. Interdisciplinary Conf. Adaptive Knowledge Representation Reasoning, (Espoo, Finland), pp. 148-153, June 2005.

The discipline of linear algebra, one of the great accomplishments in mathematics of the last century, offers exceptionally powerful tools for working with and understanding data. Consider that we have any real matrix $$\MX \in \MR^{N\times K}$$ of $$K$$ independent observations (each with zero mean) of $$N$$ data types (e.g., sample values in time). It could be, for instance, $$N$$ features derived from $$K$$ people saying the word "hello." One of the principal results of linear algebra, singular value decomposition (SVD), provides a way to transform these "hellos" such that we can describe them as a linear combination of orthonormal (uncorrelated)  sources ordered by "importance", i.e., $$\MX = \MU(\Sigma\MA)$$ where $$\Sigma \in \MR^{N\times K}$$ is a positive diagonal matrix of decreasing weights, $$\MA \in \MR^{K\times K}$$, and $$\MU \in \MR^{N\times N}$$ is a dictionary of "hello" features satisfying $$\MU^T\MU = \MU\MU^T = \MI$$. Each row in the $$k$$th column of $$\Sigma\MA$$ shows what columns of $$\MU$$ are used to create the data of the $$k$$th "hello," starting with the first column in $$\MU$$, i.e., the feature that has the most variability in all the data, i.e., the feature that varies the most in all the "hello" data of $$\MX$$.

Principal component analysis (PCA) uses SVD to find the orthonormal matrix $$\MP \in \MR^{K\times N}$$ such that the transformation $$\MP\MX$$ will have a diagonal covariance, i.e.,
$$\MP\MX ( \MP\MX )^T = \MP\MX \MX^T\MP^T = \textrm{diagonal}.$$ This means that we have a linear transformation that decorrelates the data types of all of our observations. The matrix we are looking for is exactly $$\MP = \MU^T$$, given by the SVD of $$\MX$$. The rows of $$\MP$$ are known as the principal components of $$\MX$$. Again, they describe the directions of variability of the data from greatest to least. This decorrelation in essence strips away the second-order dependencies between the data types in each observation, and assumes that there is no higher-order statistics between the data types. The implication here is that the data is distributed as a zero-mean multivariate distribution with some specific covariance matrix, and this is all the information that is needed to completely specify the statistics of our data, i.e., each random vector comes from a multivariate Gaussian distribution.

While PCA works well (in a least-squares sense) as long as the underlying data can be modeled by a multivariate Gaussian random vector, any data with higher-order statistical dependencies will not be well-represented (in a least-squares sense) by the principal components found through PCA. In such cases one turns to Independent Component Analysis (ICA). Much in the same way as PCA, ICA finds for our set of "hello" measurements  $$\MX$$ a set of "independent components" $$\MS \in \MR^{N\times K}$$, as well as a mixing matrix $$\MM \in \MR^{N\times N}$$, such that $$\MX = \MM\MS$$ without making assumptions about the types of distributions of the components, suffice it to say that they are independent and not distributed as a multivariate Gaussian. If our "hello" data is not distributed as a multivariate Gaussian, then the principal components of the data will not signify much. In this case then, we would attempt to learn the independent components of the data, to find those distinguishing features of "hello."

Together, PCA and ICA provide excellent methods for digging into data and discovering its significant features, which can then be used for data analysis. However, an interesting notion is that somehow our brains could be performing the same tricks in order to make sense of the massive amount of information reaching our senses while only having a limited throughput in the nervous system. When coupled with the idea of sparsity, or minimal energy, the results appear to have implications for how our brains work. This article provides a vision offering an interesting perspective about data analysis and the processes in the black box that is the trained brain. The authors discuss the observation that when data, e.g., text, social networks, music, is clustered in an unsupervised way using an analysis based on independent components, the clusters resemble those assembled by a trained human. This is what the authors define "cognitive component analysis," which here is not defined in a mathematical sense. Do these principal components and independent components signify something of higher-order, say at a cognitive level?

## Paper of the Day (Po'D): Music Genre Classification Edition

Hello, and welcome to the Paper of the Day (Po'D): Music Genre Classification Edition. Today's paper comes from ICASSP 2005: A. Meng, P. Ahrendt, and J. Larsen, Improving Music Genre Classification by Short-time Feature Integration,'' in Proc. ICASSP, vol. 5, pp. 497-500, Philadelphia, PA, 2005.

Automatic classification of music by genre has applications in organizing music databases, and music recommendation systems, both of which are attractive commercially. From a signal processing perspective, this problem is interesting because one must move from a low-level sampled audio signal to the perceptual and cognitive high-level-hard-to-define-but-know-it-when-I-hear-it domain of genre. To create a good music genre classifier then requires a means to tease out of a music signal those features that provide the best tell-tale signs of "Rock and Roll," "Pop," "Jazz," "Classical," and "Grindcore." Taking a page from how humans perceive and understand music, we expect that the best classifier will make its decision using  features from time-scales long (e.g., rhythm) and short (e.g., spectral centroid).

The authors propose a music genre feature model that includes descriptors from time scales between 30 ms (first 6 MFCCs - not sure if 0th coefficient is included or not) to descriptors computed over 740 ms (mean and variance of MFCCs; power spectral density of MFCCs in 4 bands; autoregression of the PSD of each MFCC with model order selected as to minimize classification error (on a test set I presume); % of frames with high number of zero-crossings; % of frames with low short-time energy) to descriptors computed over almost 10 seconds (mean, variance, and autoregressive models of previous descriptors - except ratios; beat spectrum computed from MFCCs; beat histogram computed from audio samples).

The authors consider early and late information fusion schemes. Early fusion of descriptors occurs in the construction of autoregressive models of low-level descriptors computed over several frames; as well as the estimation of the means, variances, and power spectral densities of each descriptor. The authors also reduced the dimensionality (to about 20) of some of the features by PCA. The late fusion comes from considering the votes of several trained classifiers. The authors trained two different classifiers: 1) single-layer neural network; 2) Gaussian classifier (Bayes' rule with Gaussian posterior?).

To test the performance of their features and the two types of feature fusion, the authors used two data sets, one consisting of 100 pieces distributed equally in four genres: hard rock, jazz, pop, and techno; and one consisting of 354 30-second music samples acquired from Amazon.com, distributed equally in six genres: classical, country, jazz, rap, rock, and techno. Training and test data were created by using 30-second segments. For the first dataset, the authors confirmed its labels by a human genre classification experiment.

The authors performed several tests using combinations of all the features presented. Classification error for dataset 1 was the best (5%) using autoregressive models of MFCCs features computed over 10 seconds, which represents one way to fuse low-level features into long time-scale features. The performance of these same features for the second database was much lower, around 30% classification error.

Although I am not convinced music genre is something that can and should be defined, I can see its benefits for organizing music data. I believe a better approach is letting the data define itself by clustering, instead of imposing labels like "country" and "jazz". However, it is clear that any clusters found by considering only features from short time-scales without considering longer time scales will have limited usefulness.

## Paper of the Day (Po'D): Sparse Domain (Dis)similarity Edition

Hello, and welcome to the Paper of the Day (Po'D): Sparse Domain (Dis)similarity Edition. Today's paper comes from Fall of 2009: R. Mazhar, P. D. Gader, and J. N. Wilson, "Matching-Pursuits Dissimilarity Measure for Shape-Based Comparison and Classification of High-Dimensional Data", IEEE Trans. Fuzzy Systems, vol. 17, no. 5, pp. 1175 - 1188, Oct. 2009.

The ability to compare signals and their contents is fundamental to every system using pattern recognition, e.g., voice recognition, musical source classification, fingerprint search, plagiarism detection, etc. In many situations, there is no lack of features with which to compare signals (data); but the difficult part is finding that small set of features that are in some way better than all others. The smaller this set of features gets, while still providing, e.g., good separation between classes and little separation within classes, then the less training data we need to provide in order to generate a good and generalized classifier. And so, if we can reduce a very high-dimensional vector, like sampled audio, to a small number of features that provides good description, then we can improve classifiers not only in performance (e.g., accuracy), but also efficiency (e.g., computational overhead, speed of classification). And if we can kill two birds with one stone, by having the features be uniquely invertible, that is even better. This could be the case using methods of sparse approximation.

The first portion of this article proposes a measure for comparing two signals using sparse approximations found by matching pursuit (MP). Given two signals in the same space, $$\vx_1, \vx_2 \in \mathbf{R}^K$$, and the $$n_1$$-order representation of $$\vx_1$$ found by MP using some real dictionary $$\MD$$, i.e., $$\{\MH_1(n_1), \va_1(n_1), \vr_1(n_1)\} \leftarrow \mathcal{P}_\MD\{\vx_1\}$$ where $$\MH_1(n_1)$$ is a matrix of $$n_1$$ atoms selected from the dictionary, $$\va_1(n_1)$$ is a vector of $$n_1$$ weights, and $$\vr_1(n_1) = \vx_1 - \MH_1(n_1) \va_1(n_1)$$ is the $$n_1$$th-order residual, the matching pursuit dissimilarity measure (MPDM) between $$\vx_1, \vx_2$$ is defined $$D_\alpha( \vx_1, \vx_2) := \left [ \alpha \Norm \vr_1(n_1) - (\vx_2 - \MH_1(n_1)\vb) \Norm_2^2 + (1-\alpha) \Norm \va_1(n_1) - \vb \Norm_2^2\right ]^{1/2}$$ where $$0 \le \alpha \le 1$$, and $$\vb$$ are the $$n_1$$weights found recursively by $$[\vb]_j = \bigl \langle \vx_2, [\MH_1(n_1)]_{*j} \bigr \rangle - \sum_{i=1}^{j-1} [\vb]_i \bigl \langle [\MH_1(n_1)]_{*i}, [\MH_1(n_1)]_{*j} \bigr \rangle, \;j = 1, \ldots, n_1.$$ All this is saying is let's take $$\vx_2$$ and decompose it with the atoms found by MP in decomposing $$\vx_1$$ preserving their order, and then compare the two models and their errors. The weight $$\alpha$$ determines which of the two terms is more important: the distance between the residuals ($$\alpha > 0.5$$) --- which the authors claim corresponds to the comparison of shapes --- or the distance between the model weights ($$\alpha < 0.5$$) --- which the authors claim corresponds to a comparison of intensities. Whenever $$D_\alpha( \vx_1, \vx_2)$$ is large, then the two signals are judged dissimilar with respect to $$\alpha$$. This measure is not symmetric since $$D_\alpha( \vx_1, \vx_2) \ne D_\alpha( \vx_2, \vx_1)$$ in general; but this can be satisfied if the distance measure is defined $$\overline{\delta_\alpha}(\vx_1,\vx_2) := \frac{1}{2} \left [ D_\alpha( \vx_1, \vx_2) + D_\alpha( \vx_2, \vx_1) \right ].$$ (If symmetry is required --- usually a good idea when comparing signals, but not necessary (see the Itakura-Saito distance) --- then this measure requires that both signals be decomposed by MP, contrary to the author's claim that only one signal needs to be decomposed. The authors apparently did not use this symmetric version.)

The second part of this article presents a classification method using the MPDM, and the compares its results to other classifiers on a dataset of landmine sensor measurements. To create their classifier, the authors use an unsupervised competitive agglomeration algorithm with the MPDM to find the optimal number clusters. From each of the clusters they generate prototypes (e.g., 60 mine and 50 non-mine prototypes). Classification is then done via fuzzy clustering (nearest neighbors) with class weights given by estimated conditional probabilities. Somewhere the prototypes are assigned a label of mine and not-mine, but I can't tell where, and if that is automatic or not. For the MPDM they set $$\alpha = 0.7$$ for some reason. The authors also use an existing state-of-the-art detection method, a system based on support vector machines, and the same fuzzy clustering approach they use but with the Euclidean distance instead of the MPDM. The authors test all four classifiers on a set of real (798 samples) and simulated (10,000 samples) landmine sensor data, and compare their performance. Their results appear to demonstrate their classifier using the MPDM provides a healthy (fitting word choice) advantage over the other classifiers.

Though the authors appear to show improvement in using the MPDM to compare signals, I am not convinced by this approach. First of all, the authors make several unsubstantiated claims, and misstatements. For instance, The MP approximation of a signal contains accurate shape-based information about the signal.'' I am not sure what they mean by accurate,'' but it has been repeatedly shown, during the past 16 years, that MP can create signal decompositions that are rife with errors from its short-sighted greedy approach. (Also see S.~Jaggi, W.~C. Karl, S.~Mallat, and A.~S. Willsky, High resolution pursuit for feature extraction,'' Tech. Rep., Laboratory for Information and Decision Systems, MIT, Cambridge, MA, Nov. 1995.) The authors also claim that ... the maximum information [in a MP decomposition] is contained in its first few MP coefficients.'' I don't know what they mean by information,'' and how they can claim a certain subset of model elements contain the maximum of that quantity. They also claim that their classification method using the MPDM does not require prior knowledge about the problem domain.'' However, their dictionary selection represents a use of prior knowledge since the dictionary depends on the expected characteristics of the data for a given problem'' (their words). Even in their experiments they created dictionaries from the labeled training data, which is obviously using prior knowledge.

On top of these, MP is non-linear, and slight changes in the signal input can create very different decompositions. MP is not shift-invariant unless the dictionary is made so, and remains sufficiently incoherent to any noise in the signal. Because of this, I cannot see how their measure takes into consideration any shifts in the data. If I only circularly shift $$\vx_1$$ to produce an $$\vx_2$$ --- which I assume amounts to a displacement of the mine location with reference to where the sweep begins --- then the MPDM will not be small because it is not comparing shapes, but only shapes in a particular location with a particular orientation.

On the whole, I believe much more should be fleshed from this paper in a formal mathematical setting. For instance, why not use the least-squares projection of each signal onto the column space of $$\MH_1(n_1)$$? There is nothing special about MP that OMP does not have, except for its egregious mistakes at the price of computational simplicity. Also, the selection of the parameter $$\alpha$$ needs a better approach than just arguing from prior knowledge about the problem domain.''

## Paper of the Day (Po'D): Time-frequency Distributions Edition

Hello, and welcome to the Paper of the Day (Po'D): Time-frequency Distributions Edition. Today's paper comes from Fall of 2009: S. Ghofrani and D. C. McLernon, "Auto-Wigner-Ville distribution via non-adaptive and adaptive signal decomposition", Signal Proc., vol. 89, no. 8, pp. 1540-1549, Aug. 2009.

It is typical when working with signals populating the world of non-stationary processes, like sampled audio, that one is interested in its time-frequency content.'' In other words, we can ask of a signal its frequency, or more generally its bandwidth, at some exact moment in time. Practically, this is useful for designing a parametric model of some signal. The authors of this paper propose estimating the instantaneous (and mean) frequency and bandwidth of a signal from what is called the auto-Wigner-Ville distribution,'' which they compute in two ways: 1) from the Gabor transform; and 2) from a matching pursuit (MP) decomposition.

We can look at the Wigner-Ville distribution (WVD) of an analytic function $$x(t)$$ in two ways: 1) $$WV_x(t,\omega) := \int_{-\infty}^\infty \bigl [ x(t + \tau/2) x^*(t - \tau/2) \bigr ] e^{-j\omega \tau} d\tau$$ which is the Fourier transform of the product of $$x(t)$$ with its time-reversed version; and 2) $$WV_x(t,\omega) := \langle x(t+\tau/2), e^{j\omega\tau} x(t - \tau/2) \rangle$$ which is just the projection of $$x(t+\tau/2)$$ onto a modulated and time-reversed version of itself. (The $$\tau/2$$ part keeps the function symmetric, and consequently $$WV_x(t,\omega)$$ always real --- but not always positive.) It has been shown that the WVD of a signl provides the most compact time-frequency description of its content, in that it has the least amount of smearing of "energy" in the time-frequency plane (I think from L. Cohen, "Time-frequency distributions -- A review," Proc. IEEE, vol. 77, pp. 941-981, July 1989). This is in contrast to, for instance, the short-term Fourier transform, where smearing in time is a by-product of windowing the data. The WVD does not use a window.

As the WVD involves the product of a signal with itself (it is quadratic), there exist what are called "cross-terms" that diminish its usefulness for time-frequency analysis. In other words, it becomes hard to interpret the results. Considering that we have a signal composed of several components: $$x(t) = \sum_{n=0}^\infty g_n(t)$$ its WVD will be $$WVD_x(t,\omega) = \sum_{n=0}^\infty WVD_{g_n}(t,\omega) + \mathop{\sum \sum}_{m=0, n=0, m\ne n}^\infty WVD_{g_n,g_m}(t,\omega)$$ where we see the latter double sum contains the WVDs of the interactions between components. Keeping only the first sum results in the "auto-WVD" (AWVD). It is with this function that the authors derive closed-form expressions estimating the instantaneous (and mean) frequency and bandwidth of a signal.

The first step given a signal is to find its unknown components with which to calculate the AWVD. The authors use two methods: 1) the Gabor transform; and 2) MP with a dictionary of Gabor atoms. The first method produces an extremely redundant signal expansion which we can sample and then add together the WVD of the modulated and shifted synthesis windows weighted by the expansion coefficients. With the second method, we get an extremely sparse signal expansion with which we can do the same thing (which we call wivigrams''). In the former case, we are restricted to one analysis window size; but in the latter case, we can use a dictionary of several resolutions. However, we must make a choice with MP when to stop the decomposition --- i.e., what the order of the model is. The authors appear to limit the order to around 32 atoms.

The figure above shows some of their results. At top is the WVD of a signal with two components that are modulated in frequency. To the left is the AWVD constructed from the Gabor transform, and below that are three estimates of the instantaneous (and mean) frequency. Finallly, on the right, is the AWVD constructed from MP, and the estimates found from this.

I do not quite understand the use of finding the instantaneous or mean frequency of a signal with more than one component; but regardless, the authors do not question the relevance of the atoms of a MP decomposition to providing good, or at least useful, information about the signal and its short-term statistics. This may not be apparent in the first few dozen iterations of the algorithm, but this negative behavior of MP and other greedy approaches to building signal models is well-documented.

## Paper of the Day (Po'D): Bags of Frames of Features Edition

Hello, and welcome to the Paper of the Day (Po'D): Bags of Frames of Features Edition. Today's paper comes from Fall of 2007: J-.J. Aucouturier, B. Defreville, and F. Pachet, "The bag of frames approach to audio pattern recognition: A sufficient model for urban soundscapes but not for polyphonic music", Journal of the Acoustical Society of America, vol. 122, no. 2, pp. 881-891, Aug. 2007.

This is interesting work because complex polyphonic acoustic textures are one of the final domains of audio signal processing. It will be excellent when we can take a monophonic recording of polyphonic music and separate the instruments for transcription, analysis, processing, and other transformations. Something like this is demonstrated by Melodyne, but for very a limited palette of sounds.

This article looks critically at why automatic polyphonic music recognition by bags of frames of features (BFF) performs so dismally compared with automatic recognition of acoustic environments by the same approach. A BFF, aside from being an acronym for the opposite of frienemy, is an accumulation of observations that does not preserve any time-domain relational information between frames. It is akin to looking at a text and counting up the occurrences of a word without considering its position relative to other words. Comparing features in this way with no consideration of time-domain relationships appear to be enough to give a computer listener better-than-human recognition (91% correct recognition with less than 3 seconds of sound compared with 35% correct recognition for humans) of environments, like "street," "factory," "footbal game," and "South Central Los Angeles." (Just kidding about the last one, but the first three reminded me of my time living in South Central Los Angeles.) However, even though this same approach is common in the automatic analysis of polyphonic music --- e.g., to recognize source, classify genre, mood, etc. --- the authors note the apparent existence of a performance "glass ceiling," which is around 70%. Furthermore, they observe that preserving temporal relationships through, e.g., training Markov models built from Gaussian mixture models (GMMs), do not significantly enhance these results, even though perceptual research emphasizes the importance of dynamics. The authors investigate possible reasons behind this performance difference by looking at measures of audio similarity.

They approach this by examining the "temporal homogeneity" and "statistical homogeneity" of sounds from each class. The temporal homogeneity attempts to gauge the self-similarity of a signal by folding it over (presumably adding) onto itself a number of times, and then adding repetitions so that the signal length does not change. When a signal is temporally homogeneous, then the BFFs computed from its folded variants (in this paper they use Mel-frequency Cepstral Coefficients (MFCCs)) should be as discriminative as those computed from the original signal. The statistical homogeneity attempts to find the generalizability of statistical models built from BFFs as they are have use smaller orders of GMMs. If a signal is statistically homogeneous, then the features from a reduced number of GMMs of the BFFs should be just as discriminative as those of the entire set of GMMs of the BFFs. (Or at least, I think that is what the authors mean.)

The authors find polyphonic music audio is not as temporally homogeneous as soundscapes. (Though I wonder about works by LaMonte Young, which arguably could be called minimalistically polyphonic.) Perceptually this temporal inhomogeneity makes complete sense. A chattering crowd folded over on itself will sound like a larger chattering crowd, and won't lose its identity as coming from a crowd. But taking a Chopin mazurka for piano and folding it over on itself will immediately make it unRomantic (talk about killing the mood), and maybe then its features will be mistaken for those from a work by Conlon Nancarrow. (MFCCs are also non-linear, and so the MFCCs of a sum of signals will not be the sum of the MFCCs.) Though the source may be unrecognizable as a piece of music, its folded version will probably not lose its pianoninoness. The authors also find that soundscapes are much more statistically homogeneous than polyphonic music (which is also not surprising, and may be a a necessary result of it being temporally inhomogeneous).

The authors have clearly made the case that all frames of features are not equally helpful for automatically discriminating between sources --- i.e., one must choose carefully which frames of features to put in each bag; but for soundscapes, all frames of features appear more equal than for polyphonic music. Non-relevant frames (estimated here to be above 60% of the total) can diminish the performance of recognizers for polyphonic music, and this motivates the need for features that describe higher-levels of musical audio signals, such as at the note level.

## Paper of the Day (Po'D): Matching Pursuit in Parallel

Hello, and welcome to the Paper of the Day (Po'D): Sparse Approximation in Parallel Edition. Today's paper comes from the March 2010 issue of IEEE Signal Processing Magazine: Laurent Daudet, "Audio Sparse Decompositions in Parallel: Let the Greed be Shared!", vol. 27, no. 2, pp. 90-96, Mar. 2010. This issue of the magazine deals with the use of multicore computing platforms for parallelized applications.

This particular article is of interest because, even though MPTK offers amazing decomposition speed for certain dictionaries, the audio sparse approximation community is in need of generalizable and near-real-time decomposition algorithms. For instance, Ravelli's audio codec using matching pursuit (MP) with an overcomplete dictionary of MDCT bases of eight scales requires about 100X the the audio signal duration -- and that is using bases with fast transforms. It is difficult to create fast algorithms because of the size of the problems, not to mention the rich complexity of audio signals. Sampled audio signals typically have dimensions larger than millions. And thus dictionaries must grow in size in order to accommodate such signals, depending of course on one's application.

In this article, Daudet demonstrates the obvious. Even though an audio signal exists in a high dimensional space, there is nothing stopping us from segmenting it when we are more interested in describing it locally than globally. Daudet presents three methods: 1) each core performs MP on disjoint segments; 2) each core performs a modified MP on overlapping segments with message passing between adjacent segments; 3) each core performs a modified MP in a stationary frame through which the signal moves. This latter approach can be likened to an assembly line, but for atomic decomposition. As the signal rolls on by, each person at his station is responsible for removing a certain number of atoms from the signal; and the likelihood of a person removing an atom is smaller the more he has to stretch his arms. Daudet compares simulations of all of these algorithms with different configurations to the standard sequential MP (presumably using the same atoms). Daudet finds that PLoMP performs closely to sequential MP in terms of signal-residual energy ratio.

Unfortunately, several important details are missing from this paper, e.g., frame size, frame shift, weighting window shape, not to mention computation time for the different approaches. However, this paper does well as a proof-of-concept in showing that we can take the serial decomposition algorithm of MP (which is not a "divide and conquer" algorithm) and make it parallel, the ease of which is limited by the support of the largest atom in the dictionary. Now, if we can just as easily make orthogonal MP parallel ...

## Paper of the Day (Po'D): Performance of Greedy Algorithms

Hello, and welcome to the Paper of the Day (Po'D): Sparse Approximation Edition. Today's paper comes from Vladimir N. Temlyakov and Pavel Zheltov, "On performance of greedy algorithms," submitted Jan. 27, 2010 to Journal of Approximation Theory. (Technical report available here.)

This paper looks at the performance of "orthogonal greedy algorithms" (e.g., orthogonal matching pursuit) and "pure greedy algorithms" (e.g., matching pursuit) in recovering sparse signals from linear combinations of elements from low coherence dictionaries. In other words, let us assume some discrete signal $$\mathbf{x}$$ of finite support $$N$$ is a linear combination of $$m \le N$$ atoms (of same dimension as $$\mathbf{x}$$) selected from the dictionary $$\mathbf{D}$$ such that $$\mathbf{x} = \mathbf{D} \mathbf{a}$$ where $$\mathbf{a}$$ has only $$m$$ non-zero entries. Now, what conditions must this system meet so that we can recover $$\mathbf{a}$$ from $$\mathbf{x}$$ using an orthogonal greedy algorithm? Clearly, if the dictionary is an orthonormal basis, then there will be no problem no matter what $$m$$ is. This is because there is no interaction between atoms in building $$\mathbf{x}$$, i.e., every atom pair has a zero inner product. However, when the dictionary becomes a frame, e.g., a union of two bases, then we have ourselves a more difficult problem.

One of the figures of merits of a dictionary is its coherency, defined: $$M = \sup_{\mathbf{d}_i,\mathbf{d}_j} \frac{| \langle \mathbf{d}_i,\mathbf{d}_j \rangle|}{||\mathbf{d}_i|| ||\mathbf{d}_j||}, \mathbf{d}_i,\mathbf{d}_j \in \mathbf{D}, \mathbf{d}_i \ne \mathbf{d}_j.$$ In words, this is the largest cosine distance between all pairs of atoms from the dictionary, or likewise the smallest angle between all atom pairs. As $$M \to 1$$, then we are approaching higher and higher redundancy in the dictionary, and there will begin to exist signals $$\mathbf{x}$$ in some subspace for which orthogonal greedy algorithms cannot recover their sparse representations $$\mathbf{a}$$. This paper puts a limit on the approximation error of a signal with sparsity $$m$$ in terms of the coherency $$M$$ of the dictionary $$\mathbf{D}$$ using an orthogonal greedy algorithm.

Defining the $$k$$th-order approximation error of $$\mathbf{x}$$ as $$\mathbf{r}(k)$$, and $$\sigma_k(\mathbf{x})$$ as the best $$k$$-term approximation error, the authors show that the $$k+s$$th-order approximation error using an orthogonal greedy algorithm using a dictionary of coherency $$M$$ has the upper bound
(Theorem 5): $$||\mathbf{r}(k+s) ||_2 \le 7 M s || \mathbf{r}(k) ||_2 + \sigma_s^2(\mathbf{r}(k)), \; \textrm{if} \; k+s \le \frac{1}{2M}, k \le s.$$
The difficult matter is evaluating the term $$\sigma_s^2(\mathbf{r}(k))$$; but regardless, we can see from this expression how much larger than the minimum error the orthogonal greedy algorithm can take us.

Another interesting aspect of this paper is its demonstration of a particular dictionary with low coherency that cannot be used by an orthogonal greedy algorithm to recover any sparse signal at the limit $$m \le \frac{1}{2} \left ( \frac{1}{M} + 1\right ).$$ In fact, they show that an orthogonal greedy algorithm does not find a single one of the original atoms used to build $$\mathbf{x}$$.

From my dissertation work, I am particularly enticed by papers offering new glimpses of greedy algorithms for sparse approximation. However, I have yet to see how I can apply such work to audio signal decomposition. We cannot know a priori the sparsity of a given audio signal; but I do know from experience that what works best to produce signal models that are useful for my purposes (low order, small error, and a good reflection of the short-time signal statistics) are dictionaries with high coherency. I don't ever start with the assumption that there exists a "true" or "correct" model; but I do know that some atoms are less correct than others.

## Paper of the Day (Po'D): Recovering Sparse Speech Excitations

Hello, and welcome to the Paper of the Day (Po'D): Signal Processing Edition. Today's paper comes from more of our colleagues: Daniele Giacobello, Mads Christensen, Manohar N. Murthi,  Søren Holdt Jensen, and Marc Moonen: Retrieving Sparse Patterns Using a Compressed Sensing Framework: Applications to Speech Coding Based on Sparse Linear Prediction, IEEE Sig. Proc. Letts., vol. 17, no. 1, pp. 103-106, Jan. 2010.

The usual approach to linear prediction in speech coding is to minimize the squared norm of the error signal given a predictor order $$P$$, i.e., $$\mathbf{a}_\textrm{2opt} = \arg \min_{\mathbf{a} \in \mathbf{R}^P} || \mathbf{x} - \mathbf{X}\mathbf{a} ||_2$$ where $$\mathbf{X}$$ is a matrix of rows composed of elements of the signal $$\mathbf{x}$$. Though this does guarantee that the error will be distributed more uniformly than the original signal, it does not guarantee that the error will be sparse, i.e., that the number of zero-valued samples in $$\mathbf{r} = \mathbf{x} - \mathbf{X}\mathbf{a}_\textrm{2opt}$$ will be small. The authors propose instead to find a predictor that will recover the "sparse patterns" of speech, i.e., the glottal impulses in voiced speech. Thus, instead of minimizing the $$\ell_2$$-norm of the residual, they minimize the $$\ell_1$$-norm, which has been shown capable of recovering sparse solutions for underdetermined systems: $$\mathbf{a}_\textrm{1opt} = \arg \min_{\mathbf{a} \in \MR^P} || \mathbf{x} - \mathbf{X}\mathbf{a} ||_1.$$ We can return to an approximation of the original signal with either of these weights by filtering the error signal, i.e., $$\hat{\mathbf{x}} = \mathbf{H} (\mathbf{x} - \mathbf{X}\mathbf{a}_\textrm{1opt}) = \mathbf{H} \mathbf{r}$$ where $$\mathbf{H}$$ is the $$N\times N$$ synthesis matrix with rows constructed of the impulse response of the all-pole filter given by $$\mathbf{a}_\textrm{1opt}$$ (and truncated to length-$$N$$). We can see in this form that if we consider $$\mathbf{r}$$ to be sparse, and $$\mathbf{H}$$ to be a dictionary, then we have succeeded in representing the non-sparse signal $$\mathbf{x}$$ in a sparse way! However, we do not just want sparsity in the solution, but we want to guarantee the error will be less than some maximum $$\epsilon > 0$$, i.e., we want the error signal to satisfy $$\mathbf{r}_\textrm{best} = \arg \min_{\mathbf{r}\in\MR^N} || \mathbf{r} ||_1 \; \textrm{subject to} \; || \mathbf{x} - \mathbf{H} \mathbf{r} ||_2^2 \le \epsilon.$$ Using an $$\ell_1$$-norm minimization of the residual for speech coding has been proposed before and revisited numerous times. However, what the authors do here is bring it to a compressed sensing framework. Given that we assume our error signal $$\mathbf{r}$$ has some sparsity, we can construct a $$M \times N$$ measurement matrix $$\mathbf{\Phi}$$ with $$M \ll N$$ such that the equation above is solvable with fewer constraints. The best solution is then given with high probability by $$\mathbf{r}_\textrm{best} = \arg \min_{\mathbf{r}\in\MR^N} || \mathbf{r} ||_1 \; \textrm{subject to} \; || \mathbf{\Phi}\mathbf{x} - \mathbf{\Phi}\mathbf{H} \mathbf{r} ||_2^2 \le \epsilon.$$ Now the problem is reduced to one with $$M \ll N$$ constraints. There are, of course, caveats. We must assume the sparsity $$K$$ of the error signal; however, due to the limitations of the human voice, we can reasonably gauge this given $$N$$ samples. Second, based on this sparsity, or measurement matrix should have $$M = O(K \log N)$$ to give a high probability the best solution $$\mathbf{r}_\textrm{best}$$ will be found. Their experimental results indicate that for voiced speech they are not only recovering sparse solutions with signal errors close to an optimal method, but they are doing so 1000 times quicker! Good work fellows.

## Equations!

$$\int {1\over x}\,dx = \ln(x)+C$$ and $\sum_{i=1}^n i = {n(n+1)\over 2}.$ We now have equation ability, complements of jsMath. See here for more information. You just need to make sure you are not in richtext edit mode.

## Paper of the Day (Po'D): Sound Edition

Hello, and welcome to the Paper of the Day (Po'D): Sound Edition. Today's paper comes from ICASSP 2009. Since ICASSP 2010 is next week, I figure I better get started looking at the papers I printed from last year:

H. D. Tran and H Li, Sound event classification based on feature integration, recursive feature elimination and structured classification, in Proc. ICASSP, pp. 177-180, Taipei, Taiwan, Apr. 2009.

The automatic classification of data is not just useful for surveillance, but also the automatic description of any large amount of sound data, such as the collection of sound data that exists on the Internet. In this paper, the authors explore the effectiveness of some feature selection algorithms to increase the classification of several different examples of environmental audio signals, such as cry, scream, door slam, clapping, speech, etc. An oft-used approach to the problem of sound classification is the use of feature vectors that contain large numbers of different features encompassing all sorts of characteristics of signals in the time- and frequency-domain, as well as higher-level descriptors such as tempo, pitch, voiced/unvoiced, etc. With these ever-expanding feature vectors, however, comes a price paid in terms of the number of training samples needed to build a classifier that is general enough to be useful. This is colorfully known as the curse of dimensionality.'' Besides throwing in every descriptor one can think of, a much better approach is to find those descriptors among the many that give a large separation between groups and a small separation within groups. The price paid however, is finding those relevant subsets of features. The authors show in this paper that a feature selection algorithm designed for analyzing genetic material can be used to reduce their 226-dimensional feature vectors to ones containing dozens of features, while increasing performance of a SVM classifier. They further augment this technique with a multi-level classifier that becomes more specific with each level. It is unclear from this paper, however, how the authors labeled their dataset, and the criteria for doing so.

## Beetlemania

| 1 Comment
I got this link to an article about how researchers use bark beetles' own sounds to protect trees

' "I thought, 'What would be the nastiest, most offensive sound?' To me, that would be Rush Limbaugh or heavy metal," McGuire said.'

Hmmm....

## Paper of the Day (Po'D): Signal Processing Edition

Hello, and welcome to the Paper of the Day (Po'D): Signal Processing Edition. Today's paper comes from our colleagues Mads Christensen, Jan Østergaard, and Søren Holdt Jensen: On Compressed Sensing and Its Application to Speech and Audio Signals, in Proc. Asilomar Conf. Signals, Systems, and Computers, 2009.

In this paper, the authors look at the contribution of compressed sensing to finding sparse approximations of audio and speech signals using redundant dictionaries. The main reason this work is exciting is because it demonstrates how we can significantly reduce the dimensionality of the problem of finding sparse and efficient representations for high-dimensional data, such as sampled audio. This means that I can keep working with large and redundant dictionaries, use an optimization method to build a representation based on minimizing the \ell_1-norm of the solution, and still be home in time for dinner. I do not need to rely only on greedy short-sighted methods of sparse approximation. The magic here is in how one can use compressed sensing to reduce the number of constraints involved in solving the problem -- but only as long as the signal being decomposed is sparse in the given dictionary. And there is the rub: since the sparsity of the signal is unknown a priori, we must guess it to ensure our measurement matrix will permit recovery of the solution. However, it appears that in the world of sparse approximation, even if we misjudge the sparsity, the solution will not be significantly worse than otherwise. That is good news!

## PhD Announcement: EBRAMUS

The following is an interesting announcement I just received.
=====

EBRAMUS is a consortium of European research centres based in France, Germany, Belgium, Great Britain, and Poland (for further information: http://leadserv.u-bourgogne.fr/ebramus).

EBRAMUS offers a unique interdisciplinary graduate programme to study the behavioural, functional, structural, and plastic effects of music on cognitive functions such us language, memory, learning, and motor behaviour through an integrative and interdisciplinary approach. Its goal is to train PhD students in the multidisciplinary aspects of music in rehabilitation, learning, and facilitation of cognitive processes with behavioural and neuroscience (EEG, fMRI, etc.) methods.

The programme invites applications for ten 3-year PhD scholarships and one 2-year Post-doc Fellowship (see website for detailed presentations). Candidates are accepted to realize one of the projects that are part of the following topics of the programme:
1. Music and prosody in the rehabilitation of auditory processing and language disorders.
2. Music in the rehabilitation of memory and learning
3. Music and motor rehabilitation
The PhD will work with one main advisor and spend 1 to 6 months in at least one other lab of the consortium.

See attached document for further details (eligibility criteria).
To view full application details, go to http://leadserv.u-bourgogne.fr/ebramus
For contact EBRAMUS at: EBRAMUS@gmail.com

| 1 Comment
Here is an interesting article on the subject of music perception and different musical styles, i.e., atonality and everything else. Barring the ludicrous statements that often come with science popularizations (for example, it is stated in this article that something can be more random than something random), it has always seemed to me that invoking the concepts of predictability and expectation to explain the "popularity" -- or level of appreciation -- of certain musical styles is, at best, naive.

First of all, does the popularity of a musical piece, not to mention an entire musical style, really say something concrete about how the human brain works, or rather something concrete about the zeitgeist? Just because a piece is "popular" does not mean it is because it fits the natural workings of the brain. History is an important component to how music is received and judged, and often this aspect of music is neglected. Second of all, just because atonal music was panned by critics and audiences "who have found it difficult to follow" -- and to this day is remains criticized by many (Who cares if you listen?) -- does not mean that such music is unpopular because it doesn't fit with some fundamental process of the human brain. I am sure some said critics cannot "follow" classical Indian music either, but that does not mean Indian classical music is contrary to the processes of the brain. Third, this idea of expectation and prediction, and the idea that an audience member is sitting and predicting, and when predictions go awry then the "boos" start to fly, this concept feels to me a bit Western-centric. Not all musical traditions are as focused on modulation and the goal of resolution as Western classical music. Fourth, but certainly not finally, the author claims that composers of popular music owe their success to strict formulas: "... many traditional composers such as Mozart, Bach and Beethoven subconsciously followed strict musical formula to produce music that was easy on the ear by ensuring it contained patterns that could be picked out by the brain." Whatever "strict musical formulae" are, I don't think Mozart, Bach, and Beethoven followed them. Each of these composers are not remembered for any formulaic music, and definitely not for producing music that was "easy on the ear."

Bob L. Sturm, Associate Professor
Audio Analysis Lab
Aalborg University Copenhagen
A.C. Meyers Vænge 15
DK-2450 Copenahgen SV, Denmark
Email: bst_at_create.aau.dk