Test results for inter-genre similarity, part 2

Yesterday, I posted some initial results with the music genre recognition system proposed by Bagci and Erzin. Since I am not too confident that I understand what PRTools is doing, I have decided to implement the process with the stats toolbox of MATLAB, and get it working on a standard machine learning dataset: the handwritten digits of the US Postal Service.
Changing the code wasn't too difficult. Essentially it comes down to a little of this:

    % get data from CV fold

    % train a GMM for each class
    for jj=1:numclasses
      idx = (trainlabels == jj);
      obj{jj} = gmdistribution.fit(traindata(idx,:),numGaussians, ...
        'Options',options,'CovType','diagonal');
      % get conditional densities
      Ptrain(:,jj) = pdf(obj{jj},traindata);
      Ptest(:,jj) = pdf(obj{jj},testdata);
    end
  
    % classify training data by maximum likelihood (equal priors)
    [~, predictedLabelsTrain] = max(Ptrain,[],2); 

    % iterate the "inter-digit similarity" (IDS) refinement a number of times
    for kk = 1:numIDStimes
      % find misclassified instances
      idx_wrong = ~(predictedLabelsTrain == trainlabels);

      % create new models of digits classes, compute conditional densities
      clear obj Ptrain Ptest;
      for jj=1:numclasses
        idx = (trainlabels == jj);
        obj{jj} = gmdistribution.fit(traindata(idx & ~idx_wrong,:),numGaussians, ...
          'Options',options,'CovType','diagonal');
        Ptrain(:,jj) = pdf(obj{jj},traindata);
        Ptest(:,jj) = pdf(obj{jj},testdata);
      end
      % create a model of IDS from the misclassified instances
      obj{numclasses+1} = gmdistribution.fit(traindata(idx_wrong,:),numGaussians, ...
          'Options',options,'CovType','diagonal');
      Ptrain(:,numclasses+1) = pdf(obj{numclasses+1},traindata);
      Ptest(:,numclasses+1) = pdf(obj{numclasses+1},testdata);

      % re-classify training data (ignoring the IDS model) so that
      % idx_wrong is updated for the next refinement iteration
      [~, predictedLabelsTrain] = max(Ptrain(:,1:end-1),[],2);

      % classify test data (ignoring the IDS model)
      [~, predictedLabelsTest] = max(Ptest(:,1:end-1),[],2);
    end
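By the way, the "get data from CV fold" step at the top is just standard bookkeeping, and cvpartition in the same toolbox handles the stratification. A minimal sketch, assuming the digit feature vectors are stored row-wise in data with class labels 1 through 10 in labels (these variable names are my own, not necessarily what is in my script):

    % minimal sketch of the cross-validation setup (assumed variable names)
    numclasses = 10;
    numGaussians = 3;
    numIDStimes = 5;
    options = statset('MaxIter',500);
    cvp = cvpartition(labels,'KFold',2);       % stratified 2-fold partition
    for ii = 1:cvp.NumTestSets
      traindata = data(training(cvp,ii),:);
      trainlabels = labels(training(cvp,ii));
      testdata = data(test(cvp,ii),:);
      testlabels = labels(test(cvp,ii));
      % ... train, refine, and evaluate the GMMs as above ...
    end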
And that is it really. What is different with respect to the music data is that each digit I am trying to classify consists of a single feature vector, whereas with the music data I am considering many windows. So if I really want to make it equivalent, I am going to have to break each digit into smaller frames, and then combine the decisions (or likelihoods) over the frames (something like the sketch below). Regardless, let's see what happens with this process, where I am not even considering the IDS model in the classification. Here, I am just rebuilding the models (mixtures of 3 Gaussians with diagonal covariance matrices) from the digits that are classified correctly at each iteration.
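If I do break the digits into frames, it might look something like the following. This is just a sketch: it assumes the 16x16 USPS images are stored row-wise in testdata, that the per-class GMMs in obj have been trained on 16-dimensional column frames, and that the frames of a digit are treated as independent.

    % minimal sketch of frame-based classification (assumptions noted above)
    framelen = 16;                              % one image column per frame
    numframes = size(testdata,2)/framelen;      % 16 frames per digit
    loglik = zeros(size(testdata,1),numclasses);
    for jj = 1:numclasses
      for ff = 1:numframes
        frames = testdata(:,(ff-1)*framelen+(1:framelen));
        % accumulate log-likelihoods over the frames of each digit
        loglik(:,jj) = loglik(:,jj) + log(pdf(obj{jj},frames));
      end
    end
    [~, predictedLabelsTest] = max(loglik,[],2);

Summing log-likelihoods over the frames amounts to assuming the frames are independent given the class.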

Running the above for 2-fold stratified CV with five refinement iterations, I get the following classification errors (percent) on the training and test sets.

 Fold  1: GMMC before IDS Error Train = 28.89, Test = 31.04 
 Fold  1: GMMC after  IDS Error Train = 22.82, Test = 25.47 
 Fold  1: GMMC after  IDS Error Train = 24.15, Test = 26.56 
 Fold  1: GMMC after  IDS Error Train = 25.16, Test = 26.73 
 Fold  1: GMMC after  IDS Error Train = 25.65, Test = 28.00 
 Fold  1: GMMC after  IDS Error Train = 22.51, Test = 24.80 
 Fold  2: GMMC before IDS Error Train = 31.16, Test = 32.89 
 Fold  2: GMMC after  IDS Error Train = 29.67, Test = 32.98 
 Fold  2: GMMC after  IDS Error Train = 30.98, Test = 34.45 
 Fold  2: GMMC after  IDS Error Train = 29.29, Test = 32.73 
 Fold  2: GMMC after  IDS Error Train = 25.93, Test = 29.27 
 Fold  2: GMMC after  IDS Error Train = 23.24, Test = 26.95 
As expected, the classification error on the training dataset decreases, but it appears that the classification error on the test set decreases as well. Let's check for statistical significance. Below are the contingency tables for three pairs of algorithms.

[digicontable.png: contingency tables for the three pairs of systems]

The elements in the first row count those digits the once-tuned system (I1) classifies correctly that the untuned system (G) also classifies correctly (first column), that G classifies incorrectly (second column), that the system tuned five times (I5) classifies correctly (third column), and that I5 classifies incorrectly (last column). By a Chi-squared test, we find with statistical significance (\(p<10^{-11}\)) that I5 performs better (with respect to accuracy) than G or I1.
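For reference, here is roughly how one of these tables and its test can be computed for a single pair of systems. A sketch only, assuming the predicted test labels of G and I5 are in predG and predI5, with the true labels in testlabels (these names are mine):

    % minimal sketch of a 2x2 contingency table and Chi-squared test
    correctG  = (predG  == testlabels);   % was G correct on each digit?
    correctI5 = (predI5 == testlabels);   % was I5 correct on each digit?
    [tbl, chi2, p] = crosstab(correctG, correctI5);  % table + test of independence
    disp(tbl);
    fprintf('chi2 = %.2f, p = %.2g\n', chi2, p);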

That was fun. Let's run it again.

 Fold  1: GMMC before IDS Error Train = 26.71, Test = 27.42 
 Fold  1: GMMC after  IDS Error Train = 26.76, Test = 27.33 
 Fold  1: GMMC after  IDS Error Train = 33.24, Test = 34.36 
 Fold  1: GMMC after  IDS Error Train = 26.00, Test = 27.05 
 Fold  1: GMMC after  IDS Error Train = 26.09, Test = 26.69 
 Fold  1: GMMC after  IDS Error Train = 24.64, Test = 25.02 
 Fold  2: GMMC before IDS Error Train = 25.27, Test = 27.18 
 Fold  2: GMMC after  IDS Error Train = 24.95, Test = 27.33 
 Fold  2: GMMC after  IDS Error Train = 26.64, Test = 28.22 
 Fold  2: GMMC after  IDS Error Train = 25.82, Test = 27.24 
 Fold  2: GMMC after  IDS Error Train = 25.93, Test = 26.60 
 Fold  2: GMMC after  IDS Error Train = 30.64, Test = 33.07
And here are the contingency tables.

[digicontable2.png: contingency tables for the second run]

Well crap! By a Chi-squared test, we find that we cannot reject the null hypothesis that G and I1 perform the same; but with statistical significance (\(p<10^{-3}\)) we can say that I5 performs worse than G and I1. From the top to the bottom, just like that --- cross-validation can be a real bitch!

Now that I am satisfied there is some learning going on, it is time to break up the digits and use the IDS mechanism.
