Faults in the Ballroom dataset

The Ballroom dataset (BD) was created around 2004 by F. Gouyon et al. for use in a comparative evaluation of particular features for music genre classification. BD is described in the paper, F. Gouyon, S. Dixon, E. Pampalk, and G. Widmer, "Evaluating rhythmic descriptors for musical genre classification", in Proc. Int. Conf. Audio Eng. Society, June 2004.

"[BD] contains excerpts from 698 pieces of music, around 30 seconds long. The audio quality of this data is quite low, it was originally fetched in real audio format [from http://www.ballroomdancers.com], with a compression factor of almost 22 with respect to the common 44.1 kHz 16 bits mono WAV format. ... The data covers eight musical sub-genres of ballroom dance music: [Jive, Quickstep, Tango, Waltz, Viennese Waltz, Samba, Cha Cha Cha and Rumba]."
Searching through the references of my music genre recognition survey, I find this dataset has been used in the evaluations of music genre recognition systems in at least 15 conference papers, journal articles, and PhD dissertations:

  1. A. Flexer, F. Gouyon, S. Dixon, and G. Widmer. Probabilistic combination of features for music classification. In Proc. ISMIR, pages 111-114, Oct. 2006.
  2. F. Gouyon. A computational approach to rhythm description --- Audio features for the computation of rhythm periodicity functions and their use in tempo induction and music content processing. PhD thesis, Universitat Pompeu Fabra, 2005.
  3. F. Gouyon and S. Dixon. Dance music classification: A tempo-based approach. In Proc. ISMIR, pages 501-504, 2004.
  4. F. Gouyon, S. Dixon, E. Pampalk, and G. Widmer. Evaluating rhythmic descriptors for musical genre classification. In Proc. Int. Audio Eng. Soc. Conf., pages 196-204, 2004.
  5. A. Holzapfel and Y. Stylianou. Rhythmic similarity of music based on dynamic periodicity warping. In Proc. ICASSP, pages 2217-2220, 2008.
  6. A. Holzapfel and Y. Stylianou. A scale based method for rhythmic similarity of music. In Proc. ICASSP, pages 317-320, Apr. 2009.
  7. G. Peeters. Spectral and temporal periodicity representations of rhythm for the automatic classification of music audio signal. IEEE Trans. Audio, Speech, Lang. Process., 19(5):1242-1252, July 2011.
  8. T. Pohle, D. Schnitzer, M. Schedl, P. Knees, and G. Widmer. On rhythm and general music similarity. In Proc. ISMIR, 2009.
  9. J. Schlüter and C. Osendorfer. Music similarity estimation with the mean-covariance restricted Boltzmann machine. In Proc. ICMLA, 2011.
  10. K. Seyerlehner. Content-based Music Recommender Systems: Beyond Simple Frame-level Audio Similarity. PhD thesis, Johannes Kepler University, Linz, Austria, Dec. 2010.
  11. K. Seyerlehner, M. Schedl, R. Sonnleitner, D. Hauger, and B. Ionescu. From improved auto-taggers to improved music similarity measures. In Proc. Adaptive Multimedia Retrieval, Copenhagen, Denmark, Oct. 2012.
  12. K. Seyerlehner, G. Widmer, and T. Pohle. Fusing block-level features for music similarity estimation. In Proc. DAFx, pages 1-8, 2010.
  13. E. Tsunoo, G. Tzanetakis, N. Ono, and S. Sagayama. Audio genre classification by clustering percussive patterns. In Proc. Acoustical Society of Japan, 2009.
  14. E. Tsunoo, G. Tzanetakis, N. Ono, and S. Sagayama. Audio genre classification using percussive pattern clustering combined with timbral features. In Proc. ICME, 2009.
  15. E. Tsunoo, G. Tzanetakis, N. Ono, and S. Sagayama. Beyond timbral statistics: Improving music classification using percussive patterns and bass lines. IEEE Trans. Audio, Speech, and Lang. Process., 19(4):1003-1014, May 2011.
BD is used in other papers that have to do with tempo estimation (e.g., F. Gouyon et al., "An experimental comparison of audio tempo induction algorithms," IEEE Trans. Audio, Speech, and Lang. Process., vol.14, no.5, pp.1832-1844, Sep. 2006.), and others, but I have yet to make a survey of these works.

However, as with GTZAN, it appears that researchers have taken the integrity of BD for granted. (F. Gouyon has mentioned that some have found errors in the accompanying tempo estimates.)

Through my fingerprinting method (just a little Shazam-like implementation), I compare all 698 excerpts to each other, and find the following 13 exact and recording replicas. Exact replicas are ones that sound the same. Recording replicas are ones that sound as if they come from the same song, e.g., the same music realization but without the vocals; or from the same recording but displaced in time. I confirm each detected repetition as a true positive by listening to the excerpts.

Exact replicas
  1. Quickstep/Albums-AnaBelen_Veneo-11.wav matches Quickstep/Albums-Chrisanne2-12.wav
  2. ChaChaCha/Albums-Fire-08.wav matches Samba/Albums-Fire-09.wav
  3. ChaChaCha/Albums-Latin_Jam2-05.wav matches ChaChaCha/Albums-Latin_Jam2-13.wav
  4. Waltz/Albums-Secret_Garden-01.wav matches Waltz/Media-104705.wav
Recording replicas
  1. Rumba-international/Albums-AnaBelen_Veneo-03.wav matches Rumba-international/Albums-AnaBelen_Veneo-15.wav
  2. Waltz/Albums-Ballroom_Magic-03.wav matches Waltz/Albums-Ballroom_Magic-18.wav
  3. ChaChaCha/Albums-Latin_Jam-04.wav matches ChaChaCha/Albums-Latin_Jam-13.wav
  4. Rumba-international/Albums-Latin_Jam-08.wav matches Rumba-international/Albums-Latin_Jam-14.wav
  5. Samba/Albums-Latin_Jam-06.wav matches Samba/Albums-Latin_Jam-15.wav
  6. Samba/Albums-Latin_Jam2-02.wav matches Samba/Albums-Latin_Jam2-14.wav
  7. Rumba-international/Albums-Latin_Jam2-07.wav matches Rumba-international/Albums-Latin_Jam2-15.wav
  8. ChaChaCha/Albums-Latin_Jam3-02.wav matches ChaChaCha/Media-103414.wav
  9. ChaChaCha/Media-103402.wav matches ChaChaCha/Media-103415.wav (sounds like someone is playing with the speed)
Several of the recording replicas sound like karaoke versions, where the vocals are removed from the recording.
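
The all-pairs comparison described above can be sketched roughly as follows. This is only a minimal illustration of a Shazam-like landmark scheme, not the implementation used for the results in this post: the function names, the single strongest peak per frame, the fan-out of 5, and the 0.3 threshold are all assumptions made for the example, and it expects each excerpt as a mono NumPy array.

```python
from collections import Counter, defaultdict
from itertools import combinations

import numpy as np


def peak_track(x, n_fft=1024, hop=512):
    """Return the index of the strongest frequency bin in each STFT frame."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).argmax(axis=1)


def fingerprint(x, fan_out=5):
    """Map each landmark hash (f1, f2, dt) to the frame times where it occurs."""
    peaks = peak_track(x)
    table = defaultdict(list)
    for t in range(len(peaks)):
        for dt in range(1, fan_out + 1):
            if t + dt < len(peaks):
                table[(int(peaks[t]), int(peaks[t + dt]), dt)].append(t)
    return table


def match_score(fp_a, fp_b):
    """Fraction of the smaller fingerprint's hashes that align at one time offset."""
    offsets = Counter()
    for h, times_a in fp_a.items():
        for tb in fp_b.get(h, ()):
            for ta in times_a:
                offsets[ta - tb] += 1  # replicas pile up at a single offset
    if not offsets:
        return 0.0
    n_a = sum(len(v) for v in fp_a.values())
    n_b = sum(len(v) for v in fp_b.values())
    return max(offsets.values()) / min(n_a, n_b)


def find_replicas(excerpts, threshold=0.3):
    """Compare every pair of excerpts; return the pairs scoring above threshold."""
    fps = {name: fingerprint(x) for name, x in excerpts.items()}
    return [(a, b) for a, b in combinations(fps, 2)
            if match_score(fps[a], fps[b]) >= threshold]
```

For 698 excerpts this amounts to 698 * 697 / 2 = 243,253 pairwise comparisons. A sketch like this only proposes candidates, and a version this crude will likely miss karaoke or speed-altered versions, so each detection still has to be confirmed by listening.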

Of course, 13 exact and recording replicas out of 698 excerpts is only about 1.9% of the excerpts. (I have yet to look for mislabelings.) However, listening to the dataset suggests there is a significant number of artist replicas as well. Accompanying BD are log files that give, for each excerpt, the song name and the album name. These confirm that many excerpts come from the same artist and/or collection, which has been shown time and again to significantly bias estimates of classification accuracy. This also means that all results so far derived from BD for the evaluation of music genre recognition systems are not trustworthy. That is on top of the fact that Classify, the way in which BD is used, is not a valid approach for measuring the extent to which a music genre recognition system can recognize genre.
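
One common way to reduce this kind of bias is to partition the data so that no album or artist appears on both sides of a train/test split. Below is a minimal sketch of such an album-filtered partition. It assumes a hypothetical dictionary `album_of` mapping each excerpt to its album name (I do not specify the format of the log files here, so building that dictionary is left out), and the function name and greedy assignment are illustrative choices only.

```python
from collections import defaultdict


def album_filtered_folds(album_of, n_folds=10):
    """Assign excerpts to folds so that no album is split across folds.

    album_of: hypothetical dict mapping excerpt name -> album name.
    """
    by_album = defaultdict(list)
    for excerpt, album in album_of.items():
        by_album[album].append(excerpt)

    folds = [[] for _ in range(n_folds)]
    # Greedy balancing: place the largest albums first, each into the
    # fold that currently holds the fewest excerpts.
    for album in sorted(by_album, key=lambda a: (-len(by_album[a]), a)):
        min(folds, key=len).extend(sorted(by_album[album]))
    return folds
```

This keeps fold sizes roughly equal but does not balance the eight dance classes within each fold, and it does nothing about replicas that appear under different album names; those pairs would have to be merged into one group by hand.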

Ending on a positive note, BD is a dataset that at least reflects a real use of music, and so is a dataset that is in a very small way relevant to genre. In other words, since "genre" is (as argued by J. Frow) a set of rules and criteria that, for instance, help people to use and interpret a piece of human communication, then BD is something in the right direction since each excerpt is labeled with a prescribed use. Someone asking for music to which they can dance the Samba is specifying with very few words well-defined musicological criteria, e.g., "Music in 2, but not music in 3." That is what BD appears to encompass. It just has to be used in a way that takes into consideration its faults, along with an experiment that is valid with respect to the scientific question of interest.


Sorry to sidetrack here, but: Shazam is patented, you might run into trouble if publishing implementations of it.

About this Entry

This page contains a single entry by Bob L. Sturm published on January 23, 2014 5:38 PM.
