Hello, and welcome to the Paper of the Day (Po'D): Sparse Domain (Dis)similarity Edition. Today's paper comes from Fall of 2009: R. Mazhar, P. D. Gader, and J. N. Wilson, "Matching-Pursuits Dissimilarity Measure for Shape-Based Comparison and Classification of High-Dimensional Data", IEEE Trans. Fuzzy Systems, vol. 17, no. 5, pp. 1175 - 1188, Oct. 2009.

The ability to compare signals and their contents is fundamental to every system using pattern recognition, e.g., voice recognition, musical source classification, fingerprint search, plagiarism detection, etc. In many situations, there is no lack of features with which to compare signals (data); but the difficult part is finding that small set of features that are in some way better than all others. The smaller this set of features gets, while still providing, e.g., good separation between classes and little separation within classes, then the less training data we need to provide in order to generate a good and generalized classifier. And so, if we can reduce a very high-dimensional vector, like sampled audio, to a small number of features that provides good description, then we can improve classifiers not only in performance (e.g., accuracy), but also efficiency (e.g., computational overhead, speed of classification). And if we can kill two birds with one stone, by having the features be uniquely invertible, that is even better. This could be the case using methods of sparse approximation.

The first portion of this article proposes a measure for comparing two signals using sparse approximations found by matching pursuit (MP). Given two signals in the same space, \(\vx_1, \vx_2 \in \mathbf{R}^K\), and the \(n_1\)-order representation of \(\vx_1\) found by MP using some real dictionary \(\MD\), i.e., \(\{\MH_1(n_1), \va_1(n_1), \vr_1(n_1)\} \leftarrow \mathcal{P}_\MD\{\vx_1\}\) where \(\MH_1(n_1)\) is a matrix of \(n_1\) atoms selected from the dictionary, \(\va_1(n_1)\) is a vector of \(n_1\) weights, and \(\vr_1(n_1) = \vx_1 - \MH_1(n_1) \va_1(n_1)\) is the \(n_1\)th-order residual, the matching pursuit dissimilarity measure (MPDM) between \(\vx_1, \vx_2\) is defined $$D_\alpha( \vx_1, \vx_2) := \left [ \alpha \Norm \vr_1(n_1) - (\vx_2 - \MH_1(n_1)\vb) \Norm_2^2 + (1-\alpha) \Norm \va_1(n_1) - \vb \Norm_2^2\right ]^{1/2}$$ where \(0 \le \alpha \le 1\), and \(\vb\) are the \(n_1\)weights found recursively by $$[\vb]_j = \bigl \langle \vx_2, [\MH_1(n_1)]_{*j} \bigr \rangle - \sum_{i=1}^{j-1} [\vb]_i \bigl \langle [\MH_1(n_1)]_{*i}, [\MH_1(n_1)]_{*j} \bigr \rangle, \;j = 1, \ldots, n_1.$$ All this is saying is let's take \(\vx_2\) and decompose it with the atoms found by MP in decomposing \(\vx_1\) preserving their order, and then compare the two models and their errors. The weight \(\alpha\) determines which of the two terms is more important: the distance between the residuals (\(\alpha > 0.5\)) --- which the authors claim corresponds to the comparison of shapes --- or the distance between the model weights (\(\alpha < 0.5\)) --- which the authors claim corresponds to a comparison of intensities. Whenever \(D_\alpha( \vx_1, \vx_2)\) is large, then the two signals are judged dissimilar with respect to \(\alpha\). This measure is not symmetric since \(D_\alpha( \vx_1, \vx_2) \ne D_\alpha( \vx_2, \vx_1)\) in general; but this can be satisfied if the distance measure is defined $$\overline{\delta_\alpha}(\vx_1,\vx_2) := \frac{1}{2} \left [ D_\alpha( \vx_1, \vx_2) + D_\alpha( \vx_2, \vx_1) \right ].$$ (If symmetry is required --- usually a good idea when comparing signals, but not necessary (see the Itakura-Saito distance) --- then this measure requires that both signals be decomposed by MP, contrary to the author's claim that only one signal needs to be decomposed. The authors apparently did not use this symmetric version.)

The second part of this article presents a classification method using the MPDM, and the compares its results to other classifiers on a dataset of landmine sensor measurements. To create their classifier, the authors use an unsupervised competitive agglomeration algorithm with the MPDM to find the optimal number clusters. From each of the clusters they generate prototypes (e.g., 60 mine and 50 non-mine prototypes). Classification is then done via fuzzy clustering (nearest neighbors) with class weights given by estimated conditional probabilities. Somewhere the prototypes are assigned a label of mine and not-mine, but I can't tell where, and if that is automatic or not. For the MPDM they set \(\alpha = 0.7\) for some reason. The authors also use an existing state-of-the-art detection method, a system based on support vector machines, and the same fuzzy clustering approach they use but with the Euclidean distance instead of the MPDM. The authors test all four classifiers on a set of real (798 samples) and simulated (10,000 samples) landmine sensor data, and compare their performance. Their results appear to demonstrate their classifier using the MPDM provides a healthy (fitting word choice) advantage over the other classifiers.

Though the authors appear to show improvement in using the MPDM to compare signals, I am not convinced by this approach. First of all, the authors make several unsubstantiated claims, and misstatements. For instance, ``The MP approximation of a signal contains accurate shape-based information about the signal.'' I am not sure what they mean by ``accurate,'' but it has been repeatedly shown, during the past 16 years, that MP can create signal decompositions that are rife with errors from its short-sighted greedy approach. (Also see S.~Jaggi, W.~C. Karl, S.~Mallat, and A.~S. Willsky, ``High resolution pursuit for feature extraction,'' Tech. Rep., Laboratory for Information and Decision Systems, MIT, Cambridge, MA, Nov. 1995.) The authors also claim that ``... the maximum information [in a MP decomposition] is contained in its first few MP coefficients.'' I don't know what they mean by ``information,'' and how they can claim a certain subset of model elements contain the maximum of that quantity. They also claim that their classification method using the MPDM does not require ``prior knowledge about the problem domain.'' However, their dictionary selection represents a use of prior knowledge since the dictionary ``depends on the expected characteristics of the data for a given problem'' (their words). Even in their experiments they created dictionaries from the labeled training data, which is obviously using prior knowledge.

On top of these, MP is non-linear, and slight changes in the signal input can create very different decompositions. MP is not shift-invariant unless the dictionary is made so, and remains sufficiently incoherent to any noise in the signal. Because of this, I cannot see how their measure takes into consideration any shifts in the data. If I only circularly shift \(\vx_1\) to produce an \(\vx_2\) --- which I assume amounts to a displacement of the mine location with reference to where the sweep begins --- then the MPDM will not be small because it is not comparing shapes, but only shapes in a particular location with a particular orientation.

On the whole, I believe much more should be fleshed from this paper in a formal mathematical setting. For instance, why not use the least-squares projection of each signal onto the column space of \(\MH_1(n_1)\)? There is nothing special about MP that OMP does not have, except for its egregious mistakes at the price of computational simplicity. Also, the selection of the parameter \(\alpha\) needs a better approach than just arguing from ``prior knowledge about the problem domain.''

The ability to compare signals and their contents is fundamental to every system using pattern recognition, e.g., voice recognition, musical source classification, fingerprint search, plagiarism detection, etc. In many situations, there is no lack of features with which to compare signals (data); but the difficult part is finding that small set of features that are in some way better than all others. The smaller this set of features gets, while still providing, e.g., good separation between classes and little separation within classes, then the less training data we need to provide in order to generate a good and generalized classifier. And so, if we can reduce a very high-dimensional vector, like sampled audio, to a small number of features that provides good description, then we can improve classifiers not only in performance (e.g., accuracy), but also efficiency (e.g., computational overhead, speed of classification). And if we can kill two birds with one stone, by having the features be uniquely invertible, that is even better. This could be the case using methods of sparse approximation.

The first portion of this article proposes a measure for comparing two signals using sparse approximations found by matching pursuit (MP). Given two signals in the same space, \(\vx_1, \vx_2 \in \mathbf{R}^K\), and the \(n_1\)-order representation of \(\vx_1\) found by MP using some real dictionary \(\MD\), i.e., \(\{\MH_1(n_1), \va_1(n_1), \vr_1(n_1)\} \leftarrow \mathcal{P}_\MD\{\vx_1\}\) where \(\MH_1(n_1)\) is a matrix of \(n_1\) atoms selected from the dictionary, \(\va_1(n_1)\) is a vector of \(n_1\) weights, and \(\vr_1(n_1) = \vx_1 - \MH_1(n_1) \va_1(n_1)\) is the \(n_1\)th-order residual, the matching pursuit dissimilarity measure (MPDM) between \(\vx_1, \vx_2\) is defined $$D_\alpha( \vx_1, \vx_2) := \left [ \alpha \Norm \vr_1(n_1) - (\vx_2 - \MH_1(n_1)\vb) \Norm_2^2 + (1-\alpha) \Norm \va_1(n_1) - \vb \Norm_2^2\right ]^{1/2}$$ where \(0 \le \alpha \le 1\), and \(\vb\) are the \(n_1\)weights found recursively by $$[\vb]_j = \bigl \langle \vx_2, [\MH_1(n_1)]_{*j} \bigr \rangle - \sum_{i=1}^{j-1} [\vb]_i \bigl \langle [\MH_1(n_1)]_{*i}, [\MH_1(n_1)]_{*j} \bigr \rangle, \;j = 1, \ldots, n_1.$$ All this is saying is let's take \(\vx_2\) and decompose it with the atoms found by MP in decomposing \(\vx_1\) preserving their order, and then compare the two models and their errors. The weight \(\alpha\) determines which of the two terms is more important: the distance between the residuals (\(\alpha > 0.5\)) --- which the authors claim corresponds to the comparison of shapes --- or the distance between the model weights (\(\alpha < 0.5\)) --- which the authors claim corresponds to a comparison of intensities. Whenever \(D_\alpha( \vx_1, \vx_2)\) is large, then the two signals are judged dissimilar with respect to \(\alpha\). This measure is not symmetric since \(D_\alpha( \vx_1, \vx_2) \ne D_\alpha( \vx_2, \vx_1)\) in general; but this can be satisfied if the distance measure is defined $$\overline{\delta_\alpha}(\vx_1,\vx_2) := \frac{1}{2} \left [ D_\alpha( \vx_1, \vx_2) + D_\alpha( \vx_2, \vx_1) \right ].$$ (If symmetry is required --- usually a good idea when comparing signals, but not necessary (see the Itakura-Saito distance) --- then this measure requires that both signals be decomposed by MP, contrary to the author's claim that only one signal needs to be decomposed. The authors apparently did not use this symmetric version.)

The second part of this article presents a classification method using the MPDM, and the compares its results to other classifiers on a dataset of landmine sensor measurements. To create their classifier, the authors use an unsupervised competitive agglomeration algorithm with the MPDM to find the optimal number clusters. From each of the clusters they generate prototypes (e.g., 60 mine and 50 non-mine prototypes). Classification is then done via fuzzy clustering (nearest neighbors) with class weights given by estimated conditional probabilities. Somewhere the prototypes are assigned a label of mine and not-mine, but I can't tell where, and if that is automatic or not. For the MPDM they set \(\alpha = 0.7\) for some reason. The authors also use an existing state-of-the-art detection method, a system based on support vector machines, and the same fuzzy clustering approach they use but with the Euclidean distance instead of the MPDM. The authors test all four classifiers on a set of real (798 samples) and simulated (10,000 samples) landmine sensor data, and compare their performance. Their results appear to demonstrate their classifier using the MPDM provides a healthy (fitting word choice) advantage over the other classifiers.

Though the authors appear to show improvement in using the MPDM to compare signals, I am not convinced by this approach. First of all, the authors make several unsubstantiated claims, and misstatements. For instance, ``The MP approximation of a signal contains accurate shape-based information about the signal.'' I am not sure what they mean by ``accurate,'' but it has been repeatedly shown, during the past 16 years, that MP can create signal decompositions that are rife with errors from its short-sighted greedy approach. (Also see S.~Jaggi, W.~C. Karl, S.~Mallat, and A.~S. Willsky, ``High resolution pursuit for feature extraction,'' Tech. Rep., Laboratory for Information and Decision Systems, MIT, Cambridge, MA, Nov. 1995.) The authors also claim that ``... the maximum information [in a MP decomposition] is contained in its first few MP coefficients.'' I don't know what they mean by ``information,'' and how they can claim a certain subset of model elements contain the maximum of that quantity. They also claim that their classification method using the MPDM does not require ``prior knowledge about the problem domain.'' However, their dictionary selection represents a use of prior knowledge since the dictionary ``depends on the expected characteristics of the data for a given problem'' (their words). Even in their experiments they created dictionaries from the labeled training data, which is obviously using prior knowledge.

On top of these, MP is non-linear, and slight changes in the signal input can create very different decompositions. MP is not shift-invariant unless the dictionary is made so, and remains sufficiently incoherent to any noise in the signal. Because of this, I cannot see how their measure takes into consideration any shifts in the data. If I only circularly shift \(\vx_1\) to produce an \(\vx_2\) --- which I assume amounts to a displacement of the mine location with reference to where the sweep begins --- then the MPDM will not be small because it is not comparing shapes, but only shapes in a particular location with a particular orientation.

On the whole, I believe much more should be fleshed from this paper in a formal mathematical setting. For instance, why not use the least-squares projection of each signal onto the column space of \(\MH_1(n_1)\)? There is nothing special about MP that OMP does not have, except for its egregious mistakes at the price of computational simplicity. Also, the selection of the parameter \(\alpha\) needs a better approach than just arguing from ``prior knowledge about the problem domain.''