Hello, and welcome to the Paper of the Day (Po'D): Signal Processing
Edition. Today's paper comes from more of our colleagues: Daniele Giacobello, Mads Christensen, Manohar N. Murthi, Søren Holdt Jensen, and Marc Moonen:

The usual approach to linear prediction in speech coding is to minimize the squared norm of the error signal given a predictor order \(P\), i.e., $$\mathbf{a}_\textrm{2opt} = \arg \min_{\mathbf{a} \in \mathbb{R}^P} || \mathbf{x} - \mathbf{X}\mathbf{a} ||_2$$ where \(\mathbf{X}\) is a matrix with rows composed of elements of the signal \(\mathbf{x}\). Though this guarantees that the error will be distributed more uniformly than the original signal, it does not guarantee that the error will be sparse, i.e., that the number of non-zero-valued samples in \( \mathbf{r} = \mathbf{x} - \mathbf{X}\mathbf{a}_\textrm{2opt} \) will be small. The authors propose instead to find a predictor that will recover the "sparse patterns" of speech, i.e., the glottal impulses in voiced speech. Thus, instead of minimizing the \(\ell_2\)-norm of the residual, they minimize the \(\ell_1\)-norm, which has been shown capable of recovering sparse solutions to underdetermined systems: $$\mathbf{a}_\textrm{1opt} = \arg \min_{\mathbf{a} \in \mathbb{R}^P} || \mathbf{x} - \mathbf{X}\mathbf{a} ||_1.$$ With either set of coefficients we can return to an approximation of the original signal by filtering the error signal, i.e., $$\hat{\mathbf{x}} = \mathbf{H} (\mathbf{x} - \mathbf{X}\mathbf{a}_\textrm{1opt}) = \mathbf{H} \mathbf{r}$$ where \(\mathbf{H}\) is the \(N\times N\) synthesis matrix constructed from the impulse response of the all-pole filter given by \(\mathbf{a}_\textrm{1opt}\) (truncated to length \(N\)). Seen in this form, if we consider \(\mathbf{r}\) to be sparse and \(\mathbf{H}\) to be a dictionary, then we have succeeded in representing the non-sparse signal \(\mathbf{x}\) in a sparse way!
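To make the contrast concrete, here is a minimal sketch in Python comparing the classical \(\ell_2\) predictor (least squares) with an \(\ell_1\) predictor obtained through the standard linear-programming reformulation. The toy AR(2) signal, the filter coefficients, and the helper names are my own illustrations, not from the paper.

```python
import numpy as np
from scipy.signal import lfilter
from scipy.optimize import linprog

def lp_matrices(x, P):
    """Target vector x[P:] and regression matrix whose row n holds
    the P past samples x[n-1], ..., x[n-P]."""
    N = len(x)
    X = np.column_stack([x[P - 1 - j : N - 1 - j] for j in range(P)])
    return x[P:], X

def lp_l2(x, P):
    """Classical l2-optimal predictor via least squares."""
    t, X = lp_matrices(x, P)
    a, *_ = np.linalg.lstsq(X, t, rcond=None)
    return a

def lp_l1(x, P):
    """l1-optimal predictor via the LP reformulation:
    minimize sum(u) subject to -u <= t - X a <= u, u >= 0."""
    t, X = lp_matrices(x, P)
    M = len(t)
    c = np.concatenate([np.zeros(P), np.ones(M)])
    A_ub = np.block([[-X, -np.eye(M)], [X, -np.eye(M)]])
    b_ub = np.concatenate([-t, t])
    bounds = [(None, None)] * P + [(0, None)] * M
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:P]

# Toy "voiced" frame: a sparse glottal-like impulse train driving
# a stable all-pole filter 1/(1 - 1.3 z^-1 + 0.8 z^-2)
u = np.zeros(200)
u[::25] = 1.0
x = lfilter([1.0], [1.0, -1.3, 0.8], u)

P = 2
a2, a1 = lp_l2(x, P), lp_l1(x, P)
t, X = lp_matrices(x, P)
r2, r1 = t - X @ a2, t - X @ a1   # l2 and l1 residuals
```

By construction the \(\ell_1\) residual has the smallest possible \(\ell_1\)-norm for this order, and on this toy signal it concentrates the error at the impulse locations, while the least-squares residual smears it across all samples.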
However, we do not just want sparsity in the solution; we also want to guarantee the error will be less than some maximum \(\epsilon > 0\), i.e., we want the error signal to satisfy $$\mathbf{r}_\textrm{best} = \arg \min_{\mathbf{r}\in\mathbb{R}^N} || \mathbf{r} ||_1 \; \textrm{subject to} \; || \mathbf{x} - \mathbf{H} \mathbf{r} ||_2^2 \le \epsilon.$$ Using an \(\ell_1\)-norm minimization of the residual for speech coding has been proposed before and revisited numerous times. What the authors do here, however, is bring it into a compressed sensing framework. Given that we assume our error signal \(\mathbf{r}\) has some sparsity, we can construct an \(M \times N \) measurement matrix \(\mathbf{\Phi}\) with \( M \ll N \) such that the problem above is solvable with fewer constraints. The best solution is then given with high probability by $$\mathbf{r}_\textrm{best} = \arg \min_{\mathbf{r}\in\mathbb{R}^N} || \mathbf{r} ||_1 \; \textrm{subject to} \; || \mathbf{\Phi}\mathbf{x} - \mathbf{\Phi}\mathbf{H} \mathbf{r} ||_2^2 \le \epsilon.$$ Now the problem is reduced to one with \(M \ll N\) constraints. There are, of course, caveats. First, we must assume the sparsity \(K\) of the error signal; however, due to the limitations of the human voice, we can reasonably gauge this given \(N\) samples. Second, based on this sparsity, our measurement matrix should have \(M = O(K \log N)\) rows to give a high probability that the best solution \(\mathbf{r}_\textrm{best}\) will be found. Their experimental results indicate that for voiced speech they are not only recovering sparse solutions with signal errors close to those of an optimal method, but they are doing so 1000 times quicker! Good work, fellows.
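The compressed measurement step can be sketched numerically. Below is a toy version of the noiseless limit (\(\epsilon \to 0\)), where the constraint becomes the equality \(\mathbf{\Phi}\mathbf{x} = \mathbf{\Phi}\mathbf{H}\mathbf{r}\) and the \(\ell_1\) minimization becomes basis pursuit, solved as a linear program by splitting \(\mathbf{r}\) into positive and negative parts. The Gaussian \(\mathbf{\Phi}\), the all-pole filter, and all the sizes here are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from scipy.linalg import toeplitz
from scipy.signal import lfilter
from scipy.optimize import linprog

rng = np.random.default_rng(1)
N, M, K = 64, 32, 4   # frame length, measurements, assumed sparsity

# Synthesis matrix H: lower-triangular Toeplitz built from the length-N
# impulse response of an assumed all-pole filter 1/A(z)
a = np.array([1.0, -1.3, 0.8])
h = lfilter([1.0], a, np.r_[1.0, np.zeros(N - 1)])
row = np.zeros(N)
row[0] = h[0]
H = toeplitz(h, row)

# K-sparse "glottal" residual and the frame it synthesizes, x = H r
r_true = np.zeros(N)
r_true[rng.choice(N, K, replace=False)] = rng.normal(size=K)
x = H @ r_true

# M << N random Gaussian measurements of the frame
Phi = rng.normal(size=(M, N)) / np.sqrt(M)
A = Phi @ H

# Basis pursuit as an LP: r = p - q with p, q >= 0, minimize sum(p + q)
# subject to (Phi H)(p - q) = Phi x
c = np.ones(2 * N)
res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=Phi @ x,
              bounds=[(0, None)] * (2 * N), method="highs")
r_hat = res.x[:N] - res.x[N:]
rel_err = np.linalg.norm(r_hat - r_true) / np.linalg.norm(r_true)
```

With \(M\) well above \(K \log N\), as here, the recovery is typically exact up to solver tolerance; shrinking \(M\) toward \(K\) makes recovery fail with increasing probability, which is the \(M = O(K \log N)\) trade-off described above.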

*Retrieving Sparse Patterns Using a Compressed Sensing Framework: Applications to Speech Coding Based on Sparse Linear Prediction,* IEEE Signal Processing Letters, vol. 17, no. 1, pp. 103-106, Jan. 2010.

Bob,

Do you guys have an RSS feed?

Cheers,

Igor.

Hi Igor! We have just started this blog and are learning how to use Movable Type to customize it. I believe that MT has an easy way to include an RSS feed. I will also be adding your blog to our Blog Feeder when I figure that bit out! Thank you.

Bob,

In case MT does not provide an easy way, you can always use the Google/Feedburner solution.

Cheers,

Igor.

Hi Igor, I have got a feedburner going now, with a little link on the left-hand column of the blog.

I hope ICASSP is going well! I am keeping posted through Nuit Blanche.

Cheers.

-Bob.