April 2010 Archives

Continuing my experiments with LoCOMP yesterday, here are my comparisons of the time-domain distributions of atoms in the \(n=25\) order models of each signal constructed by five different greedy methods. (Here is the code I used to produce the plots below.)
After writing about low-complexity orthogonal matching pursuit (LoCOMP) a few days ago, I decided to implement and test it starting from my old dissertation code, and furthermore to incorporate some interference adaptation. This task was not as trivial as I first thought, but the end results are interesting. The MATLAB code I used to generate the plots below is here, for those interested in reproducing my research.
Hello, and welcome to Paper of the Day (Po'D): Low-Complexity Orthogonal Matching Pursuit Edition. Today's papers are exciting:
  1. B. Mailhé, R. Gribonval, F. Bimbot, and P. Vandergheynst, "Local orthogonal greedy pursuits for scalable sparse approximation of large signals with shift-invariant dictionaries," in Proc. Signal Process. Adaptive Sparse Structured Representations, (St. Malo, France), Apr. 2009.
  2. B. Mailhé, R. Gribonval, F. Bimbot, and P. Vandergheynst, "A low complexity orthogonal matching pursuit for sparse signal approximation with shift-invariant dictionaries," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Process., (Taipei, Taiwan), pp. 3445-3448, Apr. 2009.
Prepare yourself for a full palm to face application, because I didn't, and the bruise, it hurts. Here is a recent Popular Science article describing a "New Software Processor [that] Can Transcribe Music From Any Performance."

I am going to describe my first few moments with this article. I see the title and think, "what is a Software Processor?" I guess the software, "it does processes," like LOL cats does speek. As with so many science popularizations, there are words here with precise meanings but put together in strange ways.

Then I get to "Can Transcribe Music from Any Performance." That is revolutionary since this "software processor" can transcribe a performance in the most reverberant of places, and even in outer space. At this moment my skepticism bells are tolling like Montmartre on a dimanche morning. How come I haven't heard of this revolutionary work? Is this going to be another plug for Melodyne?

Then I reach the by-line: "If only Mozart had this." With this I know I am exactly in the placidly naive waters of the media, where emphasis is put on overreaching claims, and absurd scenarios. What an insult to Mozart. Had he to wait for the computer to boot up every time, the world would be at least 41 symphonies and a Mass short: "Hey Tony! How do I turn off the swing quantization? And how can I see the rest icons in this Finale?"

Then almost simultaneously, I reach the photograph of an obviously staged conductor with cellists unnaturally obedient, and smokey purple lights, and my god, green lasers. At this point my respect, relative to its initial magnitude, was hovering at -100 dB.

[music1.jpg: the article's staged photo, green lasers and all.]
Adding lasers to any image maximally increases the respect I give to the set of all other images.


And then I read further to find that the paper that proposes and tests this "new method to generate sheet music based on the sounds of individual notes" was my Paper of the Day (Po'D) on April 15, 2010. That this novel work received such a revolutionary mention is completely face-palmable. Even in the Popular Science article, it is said, "As of now, it only works for one instrument at a time, but the researchers think the method can be scaled to include many instruments playing at once," which contradicts its very headline. Furthermore, if Mozart had to wait for dictionary learning and sparse approximation, the world would be short one Mozart, and many of the composers that followed.

Perhaps the funniest part of this entire thing is that the authors of the research article show that their approach is not as accurate as that of P. Leveau, E. Vincent, G. Richard, and L. Daudet, "Instrument-specific harmonic atoms for mid-level music representation," IEEE Trans. Audio, Speech, Lang. Process., vol. 16, pp. 116-128, Jan. 2008. This clearly means that the approach of Leveau et al. "Can Transcribe Music from Any Performance NEVER Recorded!" I will contact Popular Science immediately. Using Electronic Email. And my software mailer.

Video games can never be art

| 4 Comments
I know several of the students here are inspired and motivated by "video games," a subject that is oftentimes taken less seriously than it should be. I know this personally: when I was trying to convince my parents to buy me a Nintendo 64 "because it will increase my hand-eye coordination," I was laughed at. Nonetheless, the following argument is of interest for many reasons: http://blogs.suntimes.com/ebert/2010/04/video_games_can_never_be_art.html

Chief among them is that this presents a "teachable moment" for anyone professionally engaged in practicing thorough critical thinking, for instance Medialogy students who are right now deep in analyzing a problem, exploring solutions, designing and performing tests, recording observations and conclusions, and writing a coherent and convincing report.

Taking Ebert's arguments about why video games can never be art, can you identify the weakest points, and the strongest? Does he succeed or fail in convincing you of his position, and why? How can you strengthen his weakest points, if at all? How can you defeat his strongest points, if at all? Are his arguments watertight, or are they flooded by semantics?

Here is something more advanced: can you find his fallacious reasoning? There is a lot here. For instance, Ebert says, "One obvious difference between art and games is that you can win a game." First of all, stating something is obvious does not mean it is obvious. Second of all, is a game essentially defined by its being "winnable"? What is a game when it is in the process of being played? Is Tetris then not a game? This particular sentence above is an example of a fallacy called "denying the antecedent":
  1. If X is winnable, X is a game.
  2. Art is not winnable.
  3. Therefore, Art is not a game.
For the students who are enraged that someone calls into question the artfulness of your passion, practice separating your emotions from your arguments. The truth of a claim does not depend on how passionate you and others are about it (another fallacy: appealing to emotions). If you are mad as hell, try to defend his position and see the debate field from his perspective. Is his position purely indefensible? Are you taking his claim as a personal attack? And if so, why?

Another interesting thing, dare I say "fun," is to perform the same exercise with some of the 3,305 comments left. There are plenty of unsupported blanket claims like, "What I see here is blind ignorance. I see statements from a man who is hopelessly attached to his own values and interests. Anything can be art. Anything." There is a lot of foaming and frothing vitriol too, where people are really taking this as a personal blow to their self-worth. For instance, "I don't know why I'm so intent that you consider the media I hold dear as art, but I do."

Here is one sentence from a comment that is 100% correct: "Just because your old pretentious brain doesn't understand games, doesn't mean they aren't art." Even though this statement is completely true (a quality of some thing does not depend on the structure and function of a single man's brain), what help is it?

Identify what is wrong with this response: "Mr. Ebert, I respect you as a movie critic. I have your books and love your passion for movies. But regarding video games, I believe you should first play games like ..." How much experience must one have in order to make a claim? This argument is called "The Courtier's Reply" fallacy. I cannot defend an argument by claiming a challenge comes from someone who just doesn't know, or understand, or who hasn't read x, or been to y, etc.

These skills are essential to develop and hone, and are applicable to your everyday life --- from reading a newspaper and watching the news, to arguing and getting along with friends and family, to preparing excellent problem reports, to job interviews, to starting and running a business, to not being taken advantage of by con artists, etc., etc., etc. And on the plus side, you may be able to say to your professor, "I do not buy your argument because it is fallacious." Sure you can always say that, but you will be better prepared to say why.

So what do you think? What are Ebert's weakest arguments, and strongest arguments? How would you counter him? Why should you counter him? What does it mean for something to be called art or not art?

Peer Review

| No Comments
I am always excited to receive an invitation to peer review, as I see it not only as a way to separate the wheat from the chaff, or to help transform chaff into wheat, but also as an opportunity for me to mature in critical and comprehensive thinking, and to practice constructive criticism. It can be a lot of work when I have to delve into unfamiliar subject matter, review references, and the previous work of the authors; but typically it is still not nearly as much work as went into the submission in the first place. An additional benefit is when I see how my observations match up with those of the other reviewers when the summary decision is sent.

During my PhD, one of my advisors asked me to submit some of our work to a journal even though I was not completely happy with the results, or convinced of its application. "Why submit this when I know I would reject it?" He answered sagely, "Without submitting it for criticism, the rejection will not come from people who have distance from our problem." And sure enough, both of us were right. The paper was rejected, and the review comments were extremely useful. That was the first time I realized rejection is not always counterproductive. That is why I really liked the idea behind the on-line journal Rejecta Mathematica, which unfortunately has only one issue in a seemingly rejected subspace of the world wide cobwebs.

Sure, sometimes peer review can fail due to grudges, or politics, or whatever other nonscientific reason. I have heard of conference papers being rejected outright, and then submitted unmodified to another conference to win a best paper award. There are also papers with significant results sitting and waiting for a long time, only to be rejected for no apparent reason. But there are also plenty of crazy people believing that the Institution of Science™ is conspiring against them to marginalize their revolutionary ideas, e.g., The Journal of Theoretics. I remember when I looked at their inaugural issue in 1999. I was stunned such things could be put on the Internets and taken for Real Science©. I am happy to see they have closed shop.

Peer review can be outwardly thankless of course. My name will only appear as, "the anonymous reviewers who have substantially improved the quality of this manuscript." I may get to list on my CV the journals and conferences for which I have served as a reviewer; but I can't say anything more specific. Regardless, I feel in a cost-benefit analysis that I reap just as much as I sow. Frustrated though I may get with overzealous notation, incomplete experimental descriptions, inadequate references and review of past work, poor grammar, illegible microscopic won't-print-in-gray-scale figures, a lack of a coherent explanation of relevance, and the Word document format, peer review is as good for the authors as it is for me.

The author would like to thank the anonymous reviewers who have helped improve this post.
Consider these two problem formulations:
for some \(\vx \in \mathcal{R}^K\) and some fat full rank \(\MD \in \mathcal{R}^{K\times N}\), and some positive error measure \(J\), and some \(\delta \ge 0\), and some integer \(0 \le C \le N\), solve either:
  1. $$\min_\vs || \vs ||_0 \; \text{such that} \; J(\vx, \MD\vs) \le \delta$$
  2. $$\min_\vs J(\vx, \MD\vs) \; \text{such that} \; || \vs ||_0 \le C$$
These two problems are the same in that they both look for a model, or representation, or decomposition, of a vector from a set of other vectors, \(\vx = \MD\vs\); but these two problems are different in that the first focuses on reducing the model order within a maximum error, and the second focuses on reducing error with a maximum model order. With the zero-norm, we don't care about the particular values in the solution, just that they are zero or not zero.
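To make the difference concrete, here is a minimal matching pursuit sketch in MATLAB (my own illustration, not taken from any paper): the same greedy loop can serve either formulation, depending on which stopping rule fires first.

% Minimal MP sketch. D has unit-norm columns, and J is taken to be the
% squared l2 error. The loop stops when the error drops below delta
% (problem 1) or when the number of selected atoms reaches C (problem 2).
K = 64; N = 256;
D = randn(K, N); D = D./repmat(sqrt(sum(D.^2)), K, 1);  % random unit-norm atoms
x = randn(K, 1);
delta = 1e-3*(x'*x); C = 20;
s = zeros(N, 1); r = x;
while (r'*r > delta) && (nnz(s) < C)
    ips = D'*r;                  % inner products with every atom
    [~, k] = max(abs(ips));      % best-matching atom
    s(k) = s(k) + ips(k);        % update its weight
    r = r - ips(k)*D(:,k);       % update the residual
end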

Hello, and welcome to Paper of the Day (Po'D): Cyclic Matching Pursuit Edition. Today's paper is: M. G. Christensen and S. H. Jensen, ``The cyclic matching pursuit and its application to audio modeling and coding," in Proc. Asilomar Conf. Signals, Syst., Comput., (Pacific Grove, CA), Nov. 2007. Note that cyclic matching pursuit (CMP) is not related to complementary matching pursuit (CMP); but in this post I will refer to the former by CMP, and to the latter as TOCMP (the other CMP). Good short acronyms are hard to come by.
Hello, and welcome to Paper of the Day (Po'D): Complementary Matching Pursuit Edition. Today's papers are: G. Rath and C. Guillemot, "A complementary matching pursuit algorithm for sparse approximation," in Proc. European Signal Process. Conf., (Lausanne, Switzerland), Aug. 2008; and G. Rath and C. Guillemot, "On a simple derivation of the complementary matching pursuit," Signal Processing, vol. 90, pp. 702-706, Feb. 2010. There also appears to be a paper submitted: "Complementary Matching Pursuit Algorithms for Sparse Approximation."

On Friday we reviewed the problem of finding the sparsest solution to an underdetermined system of equations, and saw that it can be cast in a subtly different form: instead of looking for the sparsest solution starting from nothing, we can look for it starting from the solution with the smallest \(\ell_2\)-norm of all possible solutions, i.e., $$\vs^* := \MD^T(\MD\MD^T)^{-1}\vx$$ i.i.e., the solution that points in no way into the nullspace of the dictionary, i.i.i.e, the Jean-Claude Van Damme of solutions to $$\vx = \MD\vs$$ where the dictionary is fat yet full rank. As described earlier, \(\vs^*\) will probably not be sparse. Thus, we will attempt to sparsify it using a greedy approach. The complementary matching pursuit (CMP) proposes to sparsify it using MP and a special dictionary. Thus, we move from performing MP in a \(K\)-dimensional space, to an \(N\)-dimensional space.
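(For those following along in MATLAB, here is a quick numerical check of this starting point; the fat random dictionary is just my own toy example.)

% The minimum l2-norm solution and its defining properties (sketch only).
K = 4; N = 8;
D = randn(K, N);              % fat and, with probability 1, full rank
x = randn(K, 1);
s_star = D'*((D*D')\x);       % s* = D^T (D D^T)^{-1} x
norm(D*s_star - x)            % ~0: it is an exact solution
norm(null(D)'*s_star)         % ~0: it points in no way into the nullspace of D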

Consider the usual showbiz story of a short vector wanting to be much, much taller. Our characters are: 1) a real \(K\)-dimensional signal \(\vx \in \mathcal{R}^K\); 2) a full rank fat dictionary of \(N > K\) real \(K\)-dimensional atoms \(\MD \in \mathcal{R}^{K\times N}\); and 3) and some \(\vs \in \mathcal{R}^N\) such that $$\vx = \MD\vs.$$ To produce this show, there are an infinite number of possibilities to cast a solution \(\vs\) since \(N > K\); but among these is ``the one,'' \(\vs_0\), the sparsest solution, i.e., the one with the largest number of elements that are zero, i.i.e., the solution that points into the fewest dimensions, i.i.i.e., the Jack Nicholson of solutions. We want to hire Jack Nicholson of course, but he is prohibitively expensive. We cannot find ``the one'' directly because the computational complexity is combinatorial. This is due to having to look at all \(n\)-tuples of the dictionary of \(N\) atoms. An interesting question is:

Can we hire Jean-Claude Van Damme and transform him into Jack Nicholson?
Hello, and welcome to Paper of the Day (Po'D): Harmonic Dictionaries for Note-Event Detection Edition. Today's paper comes from the March 2010 issue of IEEE Transactions on Audio, Speech, and Language Processing, a special issue devoted to signal models and representations of environmental sounds: J.J. Carabias-Orti, P. Vera-Candeas, F. J. Cañadas-Quesada, and N. Ruiz-Reyes, "Music Scene-Adaptive Harmonic Dictionary for Unsupervised Note-Event Detection," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 3, pp. 473-486, Mar. 2010.

Copenhagen Image and Signal Processing Graduate School

| No Comments
It seems quite hard for my Ph.D. students to find courses in the Copenhagen area related to audio signal processing and applications.
Trying to help them, I stumbled upon the Copenhagen Image and Signal Processing Graduate School.
The school is led by Professor Lars Kai Hansen, who is certainly one of the professors in this area I really admire, for his research achievements, motivation, and enthusiasm.
The next step for us is to find out how to be part of this exciting initiative :-)
Perhaps no other acoustic signal has been analyzed as much as the speaking human voice, and it is obvious why, considering the enormous practical benefits to communication that come with understanding it, e.g., speech coding, speech recognition, and speech synthesis. Our cellular telephones are possible in part because the speaking voice is a very well-behaved and extremely predictable source, with a mystifyingly simple principle in its production: there is a wideband source, and there is a separate filter. Nothing could be more linear!
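As a toy illustration of that source-filter principle (my own sketch, with made-up formant frequencies), a pulse train pushed through an all-pole filter already sounds vaguely vowel-like:

% Toy source-filter model: wideband impulse-train source, separate all-pole filter.
fs = 8000;                                    % sample rate (Hz)
f0 = 120;                                     % pitch of the source (Hz)
src = zeros(fs, 1);                           % one second of excitation
src(1:round(fs/f0):end) = 1;                  % impulse train (the "wideband source")
p = 0.97*exp(1j*2*pi*[500 1500 2500]/fs);     % resonances near made-up formant frequencies
a = real(poly([p conj(p)]));                  % all-pole "vocal tract" filter
voiced = filter(1, a, src);                   % the separate, linear filter
soundsc(voiced, fs);                          % have a listen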

While speech coding, speech recognition, and speech synthesis are solved problems (i.e., many successful commercial products exist), there are aspects of human voice processing that continue to warrant research. For instance, one area of research is emotion and distress detection, and other high-level aspects of language and speech understanding. There are also important reasons for detecting in the voice the presence of some medical pathology, e.g., throat cancer, instead of using invasive procedures. There is also sung lyrics transcription. But one area of voice research I have yet to see thoroughly explored is the "extended voice."

There are no excuses for poor figures in a published scientific article. When I say "poor" I mean any of the following:

  1. The figure is so small my eyes strain to read it.
  2. The axes are unlabeled, or are given some label that does not make sense with the units, e.g., "Time" with a right-hand abscissa label of x10^5.
  3. The text in the figure is smaller than the text in the caption, requiring me to find a magnifying glass to read it.
  4. The figure caption is vague and uninformative.
  5. The figure becomes unclear in grayscale because to me the colored lines all look the same.
An article should function as an advertisement of research, as reflected in the working methodology of reproducible research. And as with any advertisement, it should not be annoying. But I have learned that many people are not aware of how easy it is to automatically create readable figures in MATLAB. As an example, I created the following code to show how. Note that there is no need to invoke the figure editor of MATLAB and manually adjust and change all the details. Creating such code will make it much easier to share with other scientists (another tenet of reproducible research), and make quick adjustments when necessary. And I am in a happier mood when I read a paper with figures that advertise rather than antagonize.

% should always do this first to get rid of possibly interfering data
clear all;

% create a new figure and store handle
screenposx = 300;   % in units of pixels
screenposy = 300;
screenwidth = 600;
screenheight = 400;
% The normalized units setting means we do not need to worry about pixels or inches when creating an axis
handle_figure = figure('Position',[screenposx screenposy screenwidth screenheight], 'Units','Normalized');

% set the figure printing properties.
% NOTE: if printing to EPS, only the last 2 values matter as they define the figure size
leftmargin = 0.1;  % these are in inches
bottommargin = 0.1;
figurewidth = 6;
figureheight = 4;
set(handle_figure,'PaperOrientation','portrait', 'PaperPosition',[leftmargin bottommargin figurewidth figureheight]);

% create axes and store handle
axeswidth = 0.86; % normalized units; 1 spans entire width
axesheight = 0.86;
axesmarginleft = 0.12;
axesmarginbottom = 0.13;
% always make the fontsize larger than the extremely small MATLAB default
handle_axes = axes('FontSize',14, 'FontName','Helvetica', 'position', [axesmarginleft axesmarginbottom axeswidth axesheight]);

t = [-pi:0.01:pi];
x = sin(t);
y = cos(t);

% plot data and store handles
handle_sin = plot(t,x);
hold on;
handle_cos = plot(t,y);
% add the xy axis
line_x = line([-pi pi],[0 0]);
line_y = line([0 0],[-2 2]);

% now we set plot properties so they look nice. Note: [0 0 0] = black, [1 1 1] = white
set(handle_sin,'LineWidth',1,'Color',0*ones(3,1));
set(handle_cos,'LineWidth',2,'Color',0.6*ones(3,1));
set(line_x,'Color','k','LineStyle','--');
set(line_y,'Color','k','LineStyle','--');

% Note how I set the black line to a small width, and the gray line to a thicker width.

% set axis properties by adding nice labels and units
set(handle_axes,'XTick',[-pi:pi/4:pi], 'YTick', -1:0.25:1, 'XTickLabel',{'-pi', '-3pi/4', '-pi/2', '-pi/4', '0','pi/4','pi/2','3pi/4','pi'},'YTickLabel', {'-1','-3/4','-1/2','-1/4','0', '1/4','1/2','3/4','1'});
% note that we can use Latex-like symbols
xlabel('angle \omega');
ylabel('f(\omega)');
axis([-pi pi -1.05 1.05]);
grid on;

% create a legend
handle_legend = legend([handle_sin handle_cos],'Sine','Cosine');
% reposition legend so that it does not interfere
% first we get the position property of the legend to retain its size
handle_pos = get(handle_legend,'Position');
% position the legend at a different location, keeping its original width and height
set(handle_legend,'Position',[0.6 0.2 handle_pos(3) handle_pos(4)]);

% finally, print a nice EPS
print(handle_figure,'-depsc2','sine.eps');

Et voilà! (Depending on your version of MATLAB, you might have to adjust the fontsize and the margins.) [The resulting figure: sine.png] Now what would be extremely nice is a way to use LaTeX to turn those "pi/2" labels into \(\pi/2\) without having to use the "text" command.

How do you recognize vowels?

| 1 Comment
One of my groups of undergraduate students here is creating a very interesting application involving vocal training for not-necessarily good singers. Pitch is of course something of direct importance for singing (and enjoying listening to said singing), but so is the quality of the tone. I suggested to them that, with the tools they already have at hand (Max/MSP), they look at making their system discriminate between good and bad tones. To make the problem more concrete I suggested they first recognize which vowel is being sung. I was thinking that the easiest way to do this (without invoking any signal processing that they do not know) would be to use Max to find the first \(n\) peak frequencies and their powers relative to the fundamental of windowed segments to try to capture the shape of the formant, and then use nearest neighbor classification, or codebook vector quantization. This would not be very gender neutral of course, but might give adequate results.
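(The nearest neighbor part is only a few lines of MATLAB; here is a minimal sketch with synthetic stand-in features and labels --- the peak-picking itself is not shown, and all the variable names are mine.)

% Minimal 1-NN classification sketch with synthetic stand-in data.
vowels = {'a','i','o','re','u'};
Ntrain = 50; Ntest = 20; dim = 14;           % e.g., 7 partial frequencies + 7 relative powers
trainF = randn(Ntrain, dim); trainL = vowels(randi(5, Ntrain, 1));
testF  = randn(Ntest, dim);  testL  = vowels(randi(5, Ntest, 1));
ncorrect = 0;
for i = 1:Ntest
    d = sum((trainF - repmat(testF(i,:), Ntrain, 1)).^2, 2);  % squared Euclidean distances
    [~, nn] = min(d);                                          % nearest training example
    ncorrect = ncorrect + strcmp(trainL{nn}, testL{i});
end
accuracy = ncorrect/Ntest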

I used MATLAB to see what kind of performance I could achieve for their small database (3 females, 3 males, five different vowels each). By using signal segments 100 ms in duration (with a Hamming window and some zero padding in the discrete Fourier transform), and taking parameters of the first seven partials, I achieved the following mean classification performance for their five vowel types:
        'a'      'i'      'o'      're'     'u'
'a'     0.4119   0.0116   0.2268   0.2650   0.0847
'i'     0        0.6006   0.0388   0.1958   0.1648
'o'     0        0.1844   0.4553   0.1394   0.2209
're'    0        0.2693   0.1507   0.2951   0.2850
'u'     0        0.1006   0.1817   0.1670   0.5507
(Rows: the sung vowel presented to the classifier; columns: the assigned label.)
I generated this "confusion table" from nearly 4,000 independent random trials, using a single nearest neighbor classifier with respect to the Euclidean distance between each pair of fourteen-element feature vectors. I kept the training data and testing data separate. What we see here is that when the classifier is presented a labeled 'a' sound, 41% of the time it is classified correctly, 23% of the time it is classified as 'o', and 27% of the time it is classified as a 're' sound. The best performance is seen with 'i' sounds, where it is labeled correctly 60% of the time. In other words, this classifier is not very good! Clearly these features are not capturing the formant shapes.

Just to compare to an industry standard, I looked at using the first 13 Mel-frequency cepstral coefficients with the same classifier and testing approach. I obtained the following performance:
        'a'      'i'      'o'      're'     'u'
'a'     0.7314   0.0081   0.0373   0.1041   0.1190
'i'     0        0.9501   0.0002   0.0452   0.0044
'o'     0        0.0085   0.7474   0.1688   0.0752
're'    0        0.0398   0.1253   0.6346   0.2004
'u'     0        0        0.1930   0.0278   0.7792
(Rows: the sung vowel presented to the classifier; columns: the assigned label.)
We can see that this classifier has the hardest time classifying 're' with a low correct rate of 63%, but correctly labels 'i' 95% of the time! The others have accuracies greater than 73%. Considering that the data are extremely noisy, with people talking in the background, blowing in the microphone, and other strange paranormal disturbances, not to mention that to my own ears it is sometimes very difficult to determine which vowel is being sung, I would consider this performance quite acceptable and satisfying. I am just at a loss for how to effectively explain MFCCs (and formants and autoregressive processes and linear prediction, and so on) without the attendant math; and I do not think Max/MSP has an object that will spit out MFCCs.
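For the curious, here is roughly how those coefficients can be computed for one frame. This is only a simplified sketch (26 triangular mel filters, then the DCT of the log filterbank energies), not necessarily the exact recipe I used.

% Simplified MFCC sketch for a single frame (illustration only).
fs = 44100;                              % assumed sample rate
x = randn(round(0.1*fs), 1);             % stand-in for a 100 ms segment
frame = x .* hamming(length(x));         % Hamming window
Nfft = 2^nextpow2(4*length(frame));      % zero padding
P = abs(fft(frame, Nfft)).^2;
P = P(1:Nfft/2+1);                       % one-sided power spectrum
f = (0:Nfft/2)'*fs/Nfft;                 % frequency axis (Hz)
hz2mel = @(f) 2595*log10(1 + f/700);
mel2hz = @(m) 700*(10.^(m/2595) - 1);
nfilt = 26;
edges = mel2hz(linspace(hz2mel(0), hz2mel(fs/2), nfilt+2));  % filter edge frequencies
E = zeros(nfilt, 1);
for k = 1:nfilt
    lo = edges(k); ce = edges(k+1); hi = edges(k+2);
    w = max(0, min((f-lo)/(ce-lo), (hi-f)/(hi-ce)));  % triangular mel filter
    E(k) = w'*P;                         % mel-band energy
end
mfcc = dct(log(E + eps));                % DCT of the log energies
mfcc = mfcc(1:13)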

My MATLAB code is available upon request.
This year's Interactive Sonification Workshop was a very nice one-day event with several interesting talks, spanning from sonification of RNA structures and rowers' movements to more artistic installations, such as sounds illustrating how many people are present when you arrive home. Tony Stockman's keynote gave a good overview of the abilities that sighted as well as sight-impaired people have to navigate and interpret their surroundings from sound alone. Tony also noted the many missed opportunities for sound design in everyday life, mentioning environmental sounds and my (favorite?) example of how microwave ovens alert without giving any useful information on the current state of the food.

The opportunities for sound design in urban environments were also a topic the next day, as the COST Sonic Interaction Design meeting was visited by the chair of another COST action: COST TD0804 Soundscape of European Cities and Landscapes. According to Jian Kang, investigations indicate that below ca. 60 dB, people's perception of the sound environment is not related to the actual sound level. That is, reducing the SPL does not necessarily lead to the environment being perceived as quieter or more pleasant. Instead, the action proposes to use the opportunities for really designing soundscapes, a transition from sound reduction to soundscape creation. For those interested, there will be a conference on soundscape and urban planning in Stockholm Sep 30-Oct 1, 2010.

Currently, the urban soundscape is a mix of road noise of the European motorway and the energetic singing from spring-giddy birds. (Although how they got that spring feeling I dunno. The weather could definitely have been better....)

Over and out.

In the classic paper cited often as ``Mallat and Zhang'' (S. Mallat and Z. Zhang, "Matching pursuits with time-frequency dictionaries," IEEE Trans. Signal Process., vol. 41, pp. 3397-3415, Dec. 1993), the authors present a mathematically simple way to proceed in calculating the inner products of the new intermediate residual signal \(\vr(n+1)\) with all the dictionary elements given the previous residual signal \(\vr(n)\) and the selected atom \(\vh_n\). We can thus find the next atom simply by $$\vh_{n+1} := \arg \max_{\vd \in \mathcal{D}} | \langle \vr(n+1), \vd \rangle | = \arg \max_{\vd \in \mathcal{D}} |\langle \vr(n), \vd \rangle - \langle \vr(n), \vh_n \rangle \langle \vh_n, \vd \rangle|.$$ This comes about because we define the residual signal recursively as $$\vr(n+1) := \vr(n) - \langle \vr(n), \vh_n \rangle \vh_n.$$ The implication of this is that all one needs to do in an implementation of matching pursuit (MP) to find each atom is to first compute and store the set of inner products \(\{ \langle \vr(0), \vd \rangle : \vd \in \mathcal{D}\}\), and perhaps the dictionary Gramian \(\MG := \MD^H\MD\); then after each atom selection, update the set of inner products of the new residual and each dictionary element by 1 multiply, 1 addition, and 1 table-lookup into \(\MG\). Thus, we have significantly reduced the computational complexity of the already computationally burdensome MP by avoiding the direct computation of the set \(\{ \langle \vr(n), \vd \rangle : \vd \in \mathcal{D}\}\) at each iteration.

On paper at least; and until you attempt to implement MP yourself using large (i.e., useful) dictionaries for high-dimensional signals with the complexity of audio data.
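For a small dictionary, though, the whole trick fits in a few lines of MATLAB. Here is a toy sketch of the bookkeeping (mine, not MPTK's), where the full recomputation of the inner products is replaced by one column of the Gramian per iteration:

% Toy MP with the Mallat-Zhang inner-product update (small dictionary only).
K = 64; N = 256; order = 10;
D = randn(K, N); D = D./repmat(sqrt(sum(D.^2)), K, 1);   % random unit-norm atoms
x = randn(K, 1);
G = D'*D;                      % dictionary Gramian, computed once
ips = D'*x;                    % <r(0), d> for every atom, computed once
s = zeros(N, 1);
for n = 1:order
    [~, k] = max(abs(ips));    % select the best atom
    w = ips(k);
    s(k) = s(k) + w;
    ips = ips - w*G(:,k);      % <r(n+1), d> = <r(n), d> - <r(n), h_n><h_n, d>
end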

With regard to research in audio signal processing, this optimization (specifically, precomputing and storing the dictionary Gramian) has been referenced numerous times. For instance, in 1997
  1. M. M. Goodwin and M. Vetterli, "Atomic decompositions of audio signals," in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust., (New Paltz, New York), pp. 4-8, Oct. 1997
and in 1999 for a dictionary of damped sinusoids
  1. M. M. Goodwin and M. Vetterli, "Matching pursuit and atomic signal models based on recursive filter banks," IEEE Trans. Signal Process., vol. 47, pp. 1890-1902, July 1999
and in 2004, 2005, and 2006
  1. P. Vera-Candeas, N. Ruiz-Reyes, M. Rosa-Zurera, D. Martinez-Muñoz, and F. Lopez-Ferreras, "Transient modeling by matching pursuits with a wavelet dictionary for parametric audio coding," IEEE Signal Process. Lett., vol. 11, pp. 349-352, Mar. 2004.
  2. P. Vera-Candeas, N. Ruiz-Reyes, M. Rosa-Zurera, D. Martinez-Muñoz, and J. Curpián-Alonso, "New matching pursuit based sinusoidal modelling method for audio coding," IEEE Proc. - Visual Image Signal Processing, vol. 151, no. 1, pp. 21-28, 2004.
  3. P. Vera-Candeas, N. Ruiz-Reyes, M. Rosa-Zurera, J. C. Cuevas-Martínez, and F. López-Ferreras, "Adaptive signal models for wide-band speech and audio compression," in Pattern Recognition and Image Analysis, vol. 3523/2005, pp. 571-578, Springer Berlin, 2005.
  4. P. Vera-Candeas, N. Ruiz-Reyes, J. C. Cuevas-Martínez, M. Rosa-Zurera, and F. López-Ferreras, "Sinusoidal modeling using perceptual matching pursuits in the bark scale for parametric audio coding," IEE Proc.-Visual Image Signal Process., vol. 153, pp. 431-435, Aug. 2006.
and most recently in 2010
  1. N. Ruiz-Reyes and P Vera-Candeas, "Adaptive signal modeling based on sparse approximations for scalable parametric audio coding," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, pp. 447-460, Mar. 2010.
But this optimization is completely disregarded in the documentation of the de facto standard implementation of MP for audio signals, MPTK (S. Krstulovic and R. Gribonval, "MPTK: Matching pursuit made tractable," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., vol. 3, (Toulouse, France), pp. 496-499, Apr. 2006). Why?

The reason is patently obvious: when creating a multiresolution and overcomplete dictionary that is useful for audio decomposition (e.g., one that provides shift-invariant decompositions), the dictionary will be huge (hundreds of millions of atoms), and thus the Gramian of the dictionary will be that size squared. At such sizes it does not make sense to precompute and store the set of inner products of all pairs of the atoms -- even when the vast majority of these inner products could be zero (which increases as the coherence decreases). The memory requirements and access time outweigh the benefit. It may be that there exists a closed-form solution for the inner product of two real discrete Gabor atoms, as shown in Mallat and Zhang, or for other special atom shapes. But there is no such formulation for dictionaries of learned atoms. Instead, the frequency-domain approach taken in MPTK is, I believe, one great way to reduce the computational complexity of MP for audio data using shift-invariant scale-time-frequency dictionaries; and it is even amenable to implementing orthogonal MP (B. Mailhé, R. Gribonval, F. Bimbot, and P. Vandergheynst, "Local orthogonal greedy pursuits for scalable sparse approximation of large signals with shift-invariant dictionaries," in Proc. Signal Process. Adaptive Sparse Structured Representations, (St. Malo, France), 2009).

But I wonder if there is a way to take advantage of the structure and sparsity of the dictionary Gramian matrix to make the ``only-on-paper'' optimization above a possibility...
Hello, and welcome to Paper of the Day (Po'D): Transients Detection Edition. Today's paper comes from the March 2010 issue of IEEE Transactions on Audio, Speech, and Language Processing, a special issue devoted to signal models and representations of environmental sounds: V. Bruni, S. Marconi, and D. Vitulano, "Time-scale Atoms Chains for Transients Detection in Audio Signals," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 3, pp. 420-433, Mar. 2010.

In this paper, the authors propose a means for locating transients in audio signals. They first propose a model of an audio signal as a superposition of ``three distinct components:'' a tonal, or locally stationary, component; a transient component; and a stochastic component that is unexplained by the tonal and transient components. The authors suggest modeling an individual transient as a piecewise linear function having at least three singular points, corresponding to onset, end of attack, and end of decay. Locating these singular points is then equivalent to detecting and delimiting transients. The authors propose to do so by following the maximum moduli of the continuous wavelet transform along ``chains'' of atoms going from the smallest scale to larger scales. For a portion of a signal that is tonal, the maximum moduli will decay faster across scales; but for a portion with a transient, they will decay much more slowly due to the presence of singularities. The authors claim that their algorithm does not require any interaction from the user in order to tune it.

Unfortunately, I cannot find any web page with sound examples, or testable software, accompanying this research, so I am not sure about the usefulness of this work. I did find this web page, but it does not help much. Reproducible research must be reproducible.

Séminaire à Paris

| No Comments
This morning I attended the séminaire "Méthodes Mathématiques du Traitement d'Images" at the Laboratoire Jacques-Louis Lions at Paris 6, where Prof. Laurent Daudet presented a talk titled "Compression et indexation échelonnables de la musique" (scalable compression and indexing of music). Prof. Daudet has produced much fine work in sparse representations of audio signals, and is an authority in this area.

Today Prof. Daudet discussed the problems inherent to producing representations and models that are rich in describing the characteristics of audio signals --- finding "the music subspace," as he called it. Some approaches, such as time-frequency transforms, come with a trade-off in resolution; sparse approximation with an overcomplete dictionary is capable of "overcoming" such a trade-off, though in an artificial way. (The "uncertainty principle" is not broken, but sparse approximation is not tied to a single time-frequency-domain resolution.) Daudet presented three examples of using sparse approximation with overcomplete dictionaries to create rich representations of audio signals that are useful in a number of ways, including source detection and separation in polyphonic music, simultaneous audio coding and description, and efficient and hierarchical similarity search in audio signals.

The first example Daudet discussed is the "mid-level" representation of polyphonic music audio signals created by sparse approximation with harmonic and chirped time-frequency dictionaries learned from the steady-states of isolated instrument sources. This work is well-documented in: P. Leveau, E. Vincent, G. Richard, and L. Daudet, "Instrument-specific harmonic atoms for mid-level music representation," IEEE Trans. Audio, Speech, Lang. Process., vol. 16, pp. 116-128, Jan. 2008. Leveau et al. show that matching pursuit with such a trained dictionary can perform a rough automatic transcription of a polyphonic musical signal. Instead of atoms, the decomposition results in a set of molecules pertaining to specific instruments, which resembles piano-roll notation.

The second example Daudet presented is the simultaneous compression and description of audio signals using matching pursuit with an overcomplete dictionary consisting of a union of MDCT bases of eight different scales. This work is well-documented in: E. Ravelli, G. Richard, and L. Daudet, "Union of MDCT bases for audio coding," IEEE Trans. Audio, Speech, Lang. Process., vol. 16, pp. 1361-1372, Nov. 2008. Ravelli et al. show that adapting a signal representation using a multiscale dictionary and greedy matching pursuit can improve the quality of a parametric signal model or a transform coder at low bit rates. At high bit rates, however, it performs similarly to a transform coder, but with a humongous computational complexity (decomposition and compression take 60-100 times the signal duration). However, what is interesting with this approach is that it creates a "content-aware representation," as Daudet called it. With the signal so described in a sparse domain, we can determine higher-level characteristics such as chords, genre, and rhythm. This is the subject of the article: E. Ravelli, G. Richard, and L. Daudet, "Audio signal representations for indexing in the transform domain," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, pp. 434-446, Mar. 2010. Daudet argued that this all leads to "semantic audio coding." Now that audio coding is a solved problem, we are moving toward better acquisition systems providing perfect or versatile descriptions of audio signals for facilitating useful search and retrieval of data.

The final example Daudet presented was our work in efficient and hierarchical similarity search in audio signals via a sparse-domain approach. In this work we show how sparse and multiresolution models of signals can be used to search for similar content at various levels of detail. This moves us from correlating waveforms, to correlating larger-scale structures in signals by comparing pairs of atoms, starting with the "most significant" as determined by a sparse approximation. This work is a part of a recent submission (which can be made available to those interested): B. L. Sturm and L. Daudet, "Audio similarity in a multiscale and sparse domain," Signal Process. (submitted), Dec. 2009.

On top of these, some recent papers I am looking forward to reviewing in a paper of the day context are:
  1. M. Davies and L. Daudet, "Fast sparse subband decomposition using FIRSP (Fast iteratively reweighted sparsifier)," in Proc. EUSIPCO, 2004.
  2. M. D. Plumbley, T. Blumensath, L. Daudet, R. Gribonval, and M. E. Davies, "Sparse representations in audio and music: From coding to source separation," Proc. IEEE, 2010. Accepted for publication.
Here is a quiz testing your ability to identify sound effects from Star Wars. I got 50%. Apparently all of these sound effects were created by Ben Burtt, who also created effects for Wall-E.

I had a brief experience designing a sound effect track for an animation in a start-up company, during which I had a lot of fun experimenting with different materials and effects. Many of my colleagues were astonished that I used the classic humming-on-wax-paper-in-a-comb trick to create the fly sound. I think the above quiz would be more fun if one had to identify the sound sources involved.

Moogerfoogers®

| No Comments
This is a great April Fool's day gag. At first I was surprised because I thought the auto-tune procedure was destructive; then I said, "I have got to see the white paper behind this." And then I saw the shipping date, and it all came together.



Hello and welcome to Paper of the Day (Po'D): Content-adaptive Dictionaries Edition. Today's paper comes from a pile of papers I have been meaning to read more closely. And it completes my 5-paper survey of papers related to the problem I posed last week: N. Cho, Y. Shiu, and C.-C. J. Kuo, ``Audio source separation with matching pursuit and content-adaptive dictionaries (MP-CAD),'' in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust., (New Paltz, NY), pp. 287-290, Oct. 2007. The authors propose a method to separate speech and music source types from a single channel mixture without reference to any model of mixing other than a simple sum. They approach this problem by assuming that there exist two signal spaces, only slightly overlapping, that describe speech and music signals. They create a ``content-adapted dictionary'' (CAD) to span one of these spaces by combining atoms of a generalized overcomplete time-frequency dictionary \(\mathcal{D}\) into units learned from musical training samples decomposed by matching pursuit. For example, given some training source signal \(\vx_i\), and the set of atoms in its sparse approximation (of some order), \(\mathcal{D}_i = \{ \vg_{\gamma_i} \in \mathcal{D} \}\), the corresponding atoms in the CAD are all those in the span of \(\mathcal{D}_i\), i.e., $$\mathcal{H}_i := \{\vh_i = \MD_i \vb : \vb \in \MR^{|\mathcal{D}_i|}, ||\vh_i|| = 1\}.$$ With several of these spaces learned, their union produces the music subspace that is hopefully minimally overlapping with the speech subspace. Performing sparse approximation (e.g., MP) of a mixture using this CAD will then extract the musical source type from the mixture.
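To make that concrete, here is how I read the selection step with such subspace ``atoms'' (my interpretation only, not the authors' code): each learned unit is compared to the residual by the energy of the residual's orthogonal projection onto the span of \(\mathcal{D}_i\).

% Sketch of one CAD selection step (stand-in data, my reading of the method).
K = 128;
Di = randn(K, 5);               % stand-in for the atoms spanning one CAD unit
r  = randn(K, 1);               % stand-in for the current residual
[Qi, ~] = qr(Di, 0);            % orthonormal basis for span(Di)
proj_energy = norm(Qi'*r)^2;    % energy of r captured by this unit
% The unit with the largest proj_energy wins, and Qi*(Qi'*r) is removed from r.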

The authors test their approach on mixtures of speech and clarinet music excerpts. They generate the CAD using 40 isolated clarinet notes taken from the RWC music database, built from a Gabor dictionary of 8000 atoms having some number of scales starting with 92.8 ms, and atoms at each scale having 800 modulation frequencies uniformly spread between 0 and the Nyquist frequency. (I assume then that the CAD contained 40 ``atoms,'' which are really subspaces onto which the intermediate residuals are projected. I still don't know the order of the approximations.) Their plot of the power spectral density of (one atom from ?) each \(\mathcal{H}_i\) shows a dictionary with nice harmonic content, like those seen in, e.g., S. A. Abdallah and M. D. Plumbley, ``Unsupervised analysis of polyphonic music by sparse coding,'' IEEE Trans. Neural Networks, vol. 17, pp. 179-196, Jan. 2006. With respect to the signal-to-distortion ratio, the proposed approach significantly outperforms other separation techniques based on independent subspace analysis, and non-negative matrix factorization. The authors provide a preliminary result using two CADs, one for music and one for speech, which appears to increase the performance of the separation. The obvious extension of this approach is that taken in P. Leveau, E. Vincent, G. Richard, and L. Daudet, ``Instrument-specific harmonic atoms for mid-level music representation,'' IEEE Trans. Audio, Speech, Lang. Process., vol. 16, pp. 116-128, Jan. 2008. In this work CADs are learned for specific musical instruments --- but more so for polyphonic music transcription than source separation and reconstruction.

Happy Easter!
Hello and welcome to Paper of the Day (Po'D): Even More Sparse Separation Edition. Today's paper comes from ICASSP 2008: B. V. Gowreesunker and A. H. Tewfik, ``Blind source separation using monochannel overcomplete dictionaries,'' in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., (Las Vegas, NV), pp. 33-36, Mar. 2008.

The authors show how sources can be separated from mixtures by performing greedy sparse approximation using a single channel dictionary of time-localized atoms on linear combinations of the mixtures. Consider the model \(\MY = \MX \MA\), where \(\MX\) are the sources and \(\MA\) is the mixing matrix. And assume that each source can be described in a sparse way over a dictionary \(\MD\), i.e., \(\MX = [ \MD\vc_1 | \MD\vc_2 | \MD\vc_3 | \ldots ]\) where most of the weights in \(\vc_i\) are zero. This makes the model of the mixtures (considering just three sources) $$\MY = [ \MD\vc_1 | \MD\vc_2 | \MD\vc_3 ] \MA = [ \MD\vc_1a_{1,1} + \MD\vc_2 a_{2,1} + \MD\vc_3 a_{3,1} | \MD\vc_1a_{1,2} + \MD\vc_2 a_{2,2} + \MD\vc_3 a_{3,2} ].$$ Now, if we then create a linear combination of the mixtures, $$\vm :=\MY \left [ \begin{array}{c} 1 \\ \lambda \end{array}\right ] = \MD \left [ \vc_1a_{1,1} + \vc_2 a_{2,1} + \vc_3 a_{3,1}+ \lambda (\vc_1a_{1,2} + \vc_2 a_{2,2} + \vc_3 a_{3,2}) \right ]$$ we see that a proper choice of \(\lambda\) can cancel one of the sources, e.g., \(\lambda = -a_{2,1}/a_{2,2}\) will remove the second source. A sparse approximation of this mixture can then be used to recover the first and third sources up to a scale factor. Without knowledge of the mixing matrix, however, we can only make guesses. The authors thus propose to decompose several mixtures of the mixtures, and to look across decompositions to find frequently occurring atoms; but here my first confusion arises because they appear to use knowledge of the mixing matrix, which isn't really ``blind source separation.''
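Here is a quick numerical check of that cancellation (my own toy example, with the mixing matrix assumed known):

% Cancelling the second source from a linear combination of two mixtures.
K = 32;
X = randn(K, 3);                        % three sources as columns
A = randn(3, 2);                        % mixing matrix: three sources, two mixtures
Y = X*A;                                % the two observed mixtures
lambda = -A(2,1)/A(2,2);                % chosen to cancel source 2
m = Y*[1; lambda];                      % the mixture of mixtures
c = A(:,1) + lambda*A(:,2);             % effective source weights in m; c(2) = 0
norm(m - X*c)                           % ~0: m contains only sources 1 and 3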

Their algorithm works as follows (in the two channel case):
  1. Choose \(\lambda\) and create a linear combination of mixtures.
  2. Perform a sparse approximation with some dictionary on this mixture of mixtures (to what order?).
  3. For the set of atoms that have non-zero weights, find the weights for the orthogonal projection of the two mixtures (thus doing well to ignore the optimism inherent to the weights of a greedy pursuit).
  4. Categorize each atom as belonging to one of the two sources by the magnitude of the inner product between its pair of weights and each row of the mixing matrix.
  5. Reconstruct sources.
The authors test their algorithm on a mixture of three artificially-sparse source signals created by a linear combination of dictionary atoms (I wonder what kind of dictionary and its coherency, and how many atoms in each source?). The authors know the mixing matrix, and choose three values of \(\lambda\) based on it to cancel one of the sources from each mixture of mixtures. They compare the number of ``correct'' (and ``incorrect'') atoms in the sparse decompositions with those in the synthesis of each source, and find that their algorithm appears to recover a higher percentage of correct atoms than does stereo MP (reviewed Monday) as a function of approximation error (which I assume is the \(\ell_2\)-norm of the residual). (It is a bit unfair to make such a comparison with MP since its error will always be greater than after performing an orthogonal projection, unless the sources are trivially sparse.)

The authors also test their algorithm on the mixtures of three speech signals, again with knowledge of the mixing matrix, and find that it produces comparable or better results than stereo MP with respect to the signal-to-interference and signal-to-artifact ratios. Here my second confusion arises. The authors mention that the dictionary they use in this case (for their algorithm, but maybe not for stereo MP) is generated by K-SVD (M. Aharon, M. Elad, and A. Bruckstein, ``K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,'' IEEE Trans. Signal Process., vol. 54, pp. 4311-4322, Nov. 2006). I remember from the talk by Vikrham at ICASSP that he said the dictionary was trained on the three sources. Here though, I think we completely leave the ``blind source separation'' problem when we use a dictionary trained on the exact sources in the mixtures. I would argue though that the results in this paper show how well we can expect source separation by sparse approximation to perform when we know the mixing matrix, and have a very good dictionary.


About this Archive

This page is an archive of entries from April 2010 listed from newest to oldest.

March 2010 is the previous archive.

May 2010 is the next archive.

Find recent content on the main index or look in the archives to find all content.