Say we have two trained classifiers (I and II), and we would like a statistically-rigorous way to test the hypothesis, "With respect to classification accuracy, system I performs no better than system II." So, we take \(N\) labeled observations, and have each classifier produce \(N\) output labels.
We might look at the classification accuracies. Say system I gets \(n_1/N\) correct and system II gets \(n_2/N\) correct. But these are just single samples of some unknown distributions, and we don't really have a way to statistically compare them --- they have infinite variance due to our lack of knowledge of their performance in the real world.
We might instead randomly partition the \(N\) observations into \(K\) equal-sized folds,
and then compute the mean accuracy and its variance from the performance of the \(K\) folds.
Then we can, e.g., assume the means are distributed Gaussian, and compute the probability of each mean being from the same distribution.
Is this any better, however? If we do this, we make a rather strong assumption:
that the \(N\) observations we have to begin with are an adequate random sample of the whole observation population, and that each \(N/K\) observations are also an adequate random sample of the observation population.
The larger we make \(K\), the "better" our statistics appear to become,
but the stronger this assumption becomes.

Instead, in the chapter "On comparing classifiers: Pitfalls to avoid and a recommended approach" (in U. Fayyad (ed.),

Now we run some simulations. Consider we have 10 classes, and \(N\) observations. System I assigns to each observation a random class (equiprobable). System II also assigns to each observation a random class, but we make sure it gets the correct class for r% of its predictions. This way we know when \(r > 0\) that we should reject the null hypothesis (system II is not better than the system I). Now, we run 1000 independent trials for several pairs of \((N,r)\), and find the frequency of rejecting the null hypothesis for \(\alpha = 0.05\).

The blue line marks our level of significance (0.05). We see, as a nice tip of the cap to theory, that for \(r = 0\) we reject the null hypothesis less than 5% of the trials. With increasing \(N\), i.e., more evidence, we see the frequency of obtaining significant results increases for all \(r > 0\). For \(N > 300\), and \(r > 0.05\), we reject the null hypothesis a majority of the time. When \(r = 3\), we reject the null hypothesis a majority of the time only when \(N > 750\). This is good news!

Now, all of the above is true when our \(N\) observations are

Now, two questions:

Instead, in the chapter "On comparing classifiers: Pitfalls to avoid and a recommended approach" (in U. Fayyad (ed.),

__Data mining and knowledge discovery__. Kluwer Academic Publishers, 1997), Salzberg outlines another approach starting from a contingency table. Test the systems on the same \(N\) observations, and compare their outputs when they differ. Define \(s\) to be the number of observations system I gets incorrect but that system II gets correct; and define \(f\) to be the number of observations system I gets correct, but that system II gets incorrect. Assume that our \(N\) observations are independent. When our two systems perform equally well, then we expect \(s = f = 0.5(s+f)\) --- in other words, for a given trial the probability of system I (or II) getting an observation correct while system II (or I) gets an observation incorrect is 50/50. Hence, define \(S\) a random variable distributed Binomial with parameters 0.5 and \(s+f\). The distribution of \(S\) describes the probability of system II getting a number of trials correct for which system I gets incorrect*given the two systems perform the same on this dataset*. With this in place, we can thus compute a \(p\)-value: $$ p = P[S \ge s | \mathrm{same}] = \sum_{s'=s}^{s+f}{s+f \choose s'} (0.5)^{s+f}. $$ Finally, by defining an acceptable level of significance \(\alpha\), we can can test our null hypothesis that system II performs no better than system I on this dataset.Now we run some simulations. Consider we have 10 classes, and \(N\) observations. System I assigns to each observation a random class (equiprobable). System II also assigns to each observation a random class, but we make sure it gets the correct class for r% of its predictions. This way we know when \(r > 0\) that we should reject the null hypothesis (system II is not better than the system I). Now, we run 1000 independent trials for several pairs of \((N,r)\), and find the frequency of rejecting the null hypothesis for \(\alpha = 0.05\).

The blue line marks our level of significance (0.05). We see, as a nice tip of the cap to theory, that for \(r = 0\) we reject the null hypothesis less than 5% of the trials. With increasing \(N\), i.e., more evidence, we see the frequency of obtaining significant results increases for all \(r > 0\). For \(N > 300\), and \(r > 0.05\), we reject the null hypothesis a majority of the time. When \(r = 3\), we reject the null hypothesis a majority of the time only when \(N > 750\). This is good news!

Now, all of the above is true when our \(N\) observations are

*independent*, and our classifiers are already trained (perhaps on different datasets, maybe not independent of our test data). However, consider we want to train and test the systems by using \(K\)-fold cross-validation in our dataset of \(N\) observations. We make a random disjoint partition, train on \(K-1\) folds, test on the remaining folds, and rotate.Now, two questions:

- After testing all the folds, can we still use the procedure above to test the null hypothesis?
- If so, can we re-use our observations by performing \(K\)-fold cross validation \(L\) times with different random partitions, and then test the null hypothesis using the \(LN\) classifications?