November 2014 Archives

AcousticBrainz aims to automatically analyze the world's music, in partnership with MusicBrainz, and to "provide music technology researchers and open source hackers with a massive database of information about music." This effort is crowd-sourced neatness: people from all over the world contribute data by having their computers crunch through their MusicBrainz-IDed music libraries and automatically upload the low-level features they extract.

I construct the reviews below from low- and high-level data recently extracted from particular music tracks by AcousticBrainz. Can you guess what each track is, and what characteristics it has? ("Probabilities" are in parentheses.) Each answer is revealed below its review.
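For readers who want to poke at the data themselves, below is a minimal Python sketch of how one might fetch such an analysis. The URL pattern, the placeholder MBID, and the field names ("highlevel", "value", "probability") are my assumptions about the AcousticBrainz JSON documents; check the project's API documentation before relying on them.

```python
# Sketch: fetch the AcousticBrainz analysis for one recording, identified by
# its MusicBrainz recording ID (MBID). The URL pattern and field names are
# assumptions about the public JSON documents, not a documented contract.
import requests

MBID = "00000000-0000-0000-0000-000000000000"  # placeholder MBID
BASE = "https://acousticbrainz.org"

def fetch(level):
    """Return the parsed JSON for the 'low-level' or 'high-level' document."""
    response = requests.get(f"{BASE}/{MBID}/{level}", timeout=30)
    response.raise_for_status()
    return response.json()

high = fetch("high-level")

# Print each classifier's winning label and its "probability".
for name, result in sorted(high.get("highlevel", {}).items()):
    print(f"{name}: {result['value']} ({result['probability']:.2f})")
```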

This acoustic (0.86), certainly classical (1.0) track is an instrumental (0.87) that is definitely not danceable (1.0) yet has a Tango rhythm (1.0). It is in F major and atonal (0.98). It is not party (0.96) and not aggressive (0.98), but relaxed (0.94) and maybe sad (0.6).

It's Blind John Davis playing "How Long Blues".

This is a C-major track that is again atonal (0.92) and not danceable (0.96), but definitely with a Tango rhythm (1.0). It is a male (0.64) voice track (0.88). Its mood is not party (0.94) and not aggressive (0.96), but also not relaxed (0.93); it is happy (0.91) but maybe sad (0.63). It is likely to be classical (0.96) and/or ambient (0.8).

It is a classic Yiddish song, an excerpt of which can be heard here.

Here we have a C-minor, tonal (0.97) track with a tempo of 164 beats per minute. It is instrumental (0.62) and female (0.74), not danceable (0.98), but maybe has a Viennese Waltz rhythm (0.52). It has an acoustic mood (0.97), is not aggressive (0.96), and is not electronic (0.72). It is definitely not happy (0.98) and not party (1.0), but relaxed (0.98) and sad (0.75). It is quite likely to be electronic (1.0), ambient (0.95), jazz (0.91) and/or classical (0.76).

It is "Nicht streb', o Maid" from Richard Wagner's Die Walküre (Act 3, Scene 3)! Apparently, it comes from this release.

This track is most certainly an undanceable (1.0) instrumental (1.0) in D minor, with dark (0.99) and atonal (0.95) ChaChaCha rhythms (0.85). Its mood is electronic (0.98), and there is no doubt it is not party (1.0) and not aggressive (1.0). It is probably not acoustic (0.90) either, and not happy (0.79) and not sad (0.87), but relaxed (0.80) and possibly aggressive (0.62). Its genre, according to some, is electronic (0.98) and/or ambient (0.8), but could possibly be hip hop (0.34) or jazz (0.31).

It's Weird Al Yankovic singing "Jerry Springer".

This C#-major track is certainly atonal (1.0), and is probably voice (0.94), maybe female (0.62), but absolutely danceable (1.0) with a ChaChaCha rhythm (0.5). Its timbre is dark (0.99). Its mood is electronic (0.98) and not acoustic (0.90), and it is definitely aggressive (1.0) and not party (1.0). Curiously, it is not happy (0.99) but also not sad (0.93), and probably relaxed (0.81). It is most assuredly jazz (1.0), but could also be called electronic (0.89), rock (0.85), trance (0.35) or house (0.32). Some of its last.fm genre tags are "90S" and "20Th Century".

"Ulcer" by Swedish death metal band Comecon.
AcousticBrainz aims to automatically analyze the world's music, in partnership with MusicBrainz and "provide music technology researchers and open source hackers with a massive database of information about music." This effort is crowd sourced neatness, which means people from all over the world are contributing data by having their computer crunch through their MusicBrainz-IDed music libraries and automatically uploading all the low-level features it extracts.

I construct today's review from low- and high-level data recently extracted from a particular music track AcousticBrainz. Can you guess what it is? What characteristics it has? ("Probabilities" are in parentheses.) The answer will be revealed below tomorrow.

This instrumental (0.91) but not acoustic (0.73) track in the key of F major is certainly atonal (1.0) and likely male-gendered (0.89). It is most definitely not danceable (1.0), yet it has a Tango rhythm (0.995). Its mood is electronic (0.92) and party (0.64), and it is definitely happy (1.0), but maybe relaxed (0.81) and might be sad (0.6). It is probably ambient (0.89) and/or blues (0.80), but could also be classical (0.45) and/or hip hop (0.38).

"The Love Scene" from Dracula (the movie) by John Williams

This male-gendered voice (0.99) track is acoustic (0.92) and has a bright timbre (0.93). It is in D# major and probably atonal (0.9). It is not danceable (0.97), maybe because it is ambient (0.65), folk-country (0.49), jazz (0.4) and/or hip hop (0.39). It is not aggressive (0.95) and not party (0.85), but sad (0.77) and quite likely relaxed (0.97). The track's "original date" (from MusicBrainz) is 1998.

It's Arthur Godfrey singing "Makin' Love Ukulele Style" probably long before 1998.

This female-gendered (0.81) vocal (0.78) track is probably not danceable (0.87), but it has a high probability of being electronic (1.0) and/or ambient (0.57) and/or classical (0.45) and/or jazz (0.31). It is in C major, with a tempo of about 148 bpm and a Tango rhythm (0.91). It has a bright timbre, is probably atonal (0.83), and is labeled probably happy (0.63) but most likely not relaxed (0.96).

The track is "With God on Our Side (feat. Joan Baez)" by Bob Dylan.
Hello, and welcome to Paper of the Day (Po'D): The problem of accuracy as an evaluation criterion edition. Today's paper is one that I found after I published my tirade against classification accuracy: E. Law, "The problem of accuracy as an evaluation criterion," in Proc. ICML, 2008. I certainly should have included it.

My one-line precis of this (position) paper is: To evaluate solutions proposed to address problems centered on humans, humans must be directly involved in the mix.
Law takes a brief look at a key problem in each of three different research domains in which machine learning is being applied:
  1. Delimiting regions of interest in an image.
  2. Translation between written languages.
  3. Recorded music autotagging.
In each, she raises concerns with accepted evaluation approaches. For region-of-interest detection, one accepted measure of algorithm performance is based on the amount of area overlap between an algorithm's output rectangles and those in the ground truth. For machine translation, current metrics (e.g., BLEU, which is precision-like) do not take into consideration that many acceptable translations can exist. For music autotagging, a metric based only on the number of matching tags (precision and recall), while disregarding the meaning of "incorrect" tags, might not reveal significant differences between algorithms producing the same score. She puts it very nicely:
"The problem in using accuracy to compare learned and ground truth data is that we are comparing sets of things without explicitly stating which subset is more desirable than another."
For each problem, she argues that the metric used loses sight of the motivation behind solving the problem. For region-of-interest detection, it is object recognition. For machine translation, it is preservation of meaning. For music autotagging, it is facilitating information retrieval. Hence, humans must be involved in the evaluation. Including humans, of course, increases the cost of evaluation; but Law argues the evaluation process can be gamified, and made fun to do.
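To make the autotagging example concrete, here is a small, entirely hypothetical illustration (the tags and "taggers" are invented): two systems with identical precision and recall, even though a listener would judge their errors very differently.

```python
# Hypothetical illustration of Law's autotagging point: identical precision
# and recall, but very different kinds of "incorrect" tags.
ground_truth = {"metal", "aggressive", "guitar"}

predictions = {
    "tagger_A": {"metal", "aggressive", "hard rock"},  # wrong tag is a near miss
    "tagger_B": {"metal", "aggressive", "lullaby"},    # wrong tag is absurd
}

for name, tags in predictions.items():
    true_positives = len(tags & ground_truth)
    precision = true_positives / len(tags)
    recall = true_positives / len(ground_truth)
    print(f"{name}: precision={precision:.2f}, recall={recall:.2f}")

# Both taggers score precision = recall = 0.67, yet the metric says nothing
# about which set of errors matters to a person retrieving music.
```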

I think Law's paper is very nice, has good, clear examples, and provides an interesting alternative. However, I would broaden her thesis beyond metrics, because she is really taking aim at more (as am I). A discussion of which metric is more meaningful than another is unproductive without also considering the design and the dataset used in an experiment (as well as the measurement model); and, before that, the explicitly specified hypotheses upon which the evaluation rests; and, before that, a well-defined (formal) description of the research problem. It is, I would argue, the whole enterprise of research problem solving that must be reconsidered.
This blog is moving!

WordPress.com does not offer the LaTeX support I need (macros and sensible delimitation; see this mess), and apparently it is blocked in particular countries around the world. So, I am going to try to move this blog to QMUL.

This blog and all of its contents are moving to: High Noon GMT. (At least I think I moved most of its contents ...)

Sorry for the hassle!

Some ISMIR 2014 Papers


ISMIR 2014 was a fantastic event! I really enjoyed the venue, the food, the organization, and most of all the variety of work and interesting discussions that resulted. Now that I am back, I want to review more closely about 50 papers from it. I include some of my notes below.

---

ON CULTURAL, TEXTUAL AND EXPERIENTIAL ASPECTS OF MUSIC MOOD (Abhishek Singhi and Daniel G. Brown)

The authors hypothesize that contributing to human mood assignments to music are factors that are cultural, experiential, and dependent upon language proficiency. They conduct an experiment crossing three factors: participant origin ("Chinese in Canada", "Canadians of Chinese origin", "Canadians of non-Chinese origin"), "songs" stimuli ("the first 90 seconds" of 50 "very popular English-language songs of the 2000's"), stimuli presentation (lyrics only, music audio, lyrics and music audio). They use 100 participants (students at Waterloo, "33 Chinese living in Canada for less than 3 years", "33 Canadians, not of Chinese origin ... with English as their mother tongue", "34 Canadians of Chinese origin, born and brought up in Canada"). Each participant is instructed to label the mood of each stimulus in a presentation as one of the 5 clusters of the MIREX emotion model. Each participant labels 10 songs (first 3 only lyrics, next 3 only audio, last 4 audio+lyrics), contributing 1000 total responses covering all 50 songs. (The experimental design (mapping) is specified no further.)

In their analysis, the authors compute for each group a "distribution of responses", which I assume means an estimate of the joint probability P_origin(mood, song, presentation). This is what they wish to compare across groups. However, note that each song stimulus then receives only about 20 responses from all groups in all presentations. In each presentation, only 6 or 7 responses are given from all groups for one song. Each group then contributes around 1 or 2 responses for each song in each presentation. The estimate of the above joint probability should then be very poor.
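The back-of-the-envelope count behind that worry, assuming responses are spread roughly evenly over songs, presentations and groups (the actual lyrics/audio/both split is 3/3/4 songs per participant, so these are approximations):

```python
# Rough count of responses per cell in the described design, assuming an even
# spread over songs, presentations and origin groups (an approximation).
participants = 100
songs_per_participant = 10
total_responses = participants * songs_per_participant  # 1000

songs = 50
presentations = 3   # lyrics only, audio only, audio + lyrics
groups = 3          # the three participant-origin groups

per_song = total_responses / songs                  # about 20
per_song_presentation = per_song / presentations    # about 6 or 7
per_cell = per_song_presentation / groups           # about 2

print(per_song, per_song_presentation, per_cell)
```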

I agree with the complaint that mood labeling of music is quite poorly defined, and highly ambiguous with regards to extrinsic influences. To me it seems obvious that labeling music with "emotion" is specific to an individual working within some cultural context that requires such labeling (Halloween playlist, for instance). But this experiment as designed does not really address that hypothesis. For one, there are too few responses in the cross-factor design. Also, as a music listener who does not listen to the lyrics in music, I am skeptical of the relevance of "lyrics" only presentation of "music". How is "lyrics", music?

Now, how to design the experiment to make a valid conclusion about the dependence of mood assignment on participant origin? I say ask a bunch of Western ears to label the moods of some classical Indian music using a variety of ragas and talas. Absurdity will result.


TRANSFER LEARNING BY SUPERVISED PRE-TRAINING FOR AUDIO-BASED MUSIC CLASSIFICATION (A. van den Oord, S. Dieleman and B. Schrauwen)

Transfer learning is the adaptation of models learned for some task (source) to some other task (target). In this paper, models are learned from music audio signals using one dataset (Million Song) for the source tasks "user listening preference prediction" and "tag prediction", and then adapted for the target tasks "genre classification" and "tag prediction". Essentially, the authors extract low-level features from audio spectrograms, perform dimensionality reduction, and then train multilayer perceptrons on the source task. These trained systems are then used to produce "high-level" features of a new dataset, which are then used to train an SVM for a different target task.
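As a rough illustration of that recipe (and emphatically not the authors' architecture), here is a schematic scikit-learn sketch with random stand-ins for the audio features. It only shows the source-task/target-task split, and it uses the source model's class probabilities as the transferred "high-level" features, whereas the paper uses learned representations.

```python
# Schematic transfer-learning sketch: dimensionality reduction, a multilayer
# perceptron trained on a source task, and an SVM trained on the target task
# using the source model's outputs as features. Data are random toy stand-ins.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_source = rng.normal(size=(500, 200))      # "low-level" features, source set
y_source = rng.integers(0, 10, size=500)    # e.g., tag / preference classes
X_target = rng.normal(size=(100, 200))      # "low-level" features, target set
y_target = rng.integers(0, 4, size=100)     # e.g., genre labels

# 1) Dimensionality reduction fit on the source data.
pca = PCA(n_components=50).fit(X_source)

# 2) Train a multilayer perceptron on the source task.
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300)
mlp.fit(pca.transform(X_source), y_source)

# 3) Use the source model's outputs as "high-level" features for the target set.
Z_target = mlp.predict_proba(pca.transform(X_target))

# 4) Train a classifier for the target task on the transferred features.
svm = SVC().fit(Z_target, y_target)
print("target-task training accuracy:", svm.score(Z_target, y_target))
```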

The authors test the low-level features in "genre classification" and "tag prediction" using 5 different datasets. For instance, they use 10-fold cross-validation in GTZAN and find an increase of accuracy from about 85% using the low-level features to about 88% using transfer learning. Experiments on other datasets show similar trends. They conclude, "We have shown that features learned in this fashion work well for other audio classification tasks on different datasets, consistently outperforming a purely unsupervised feature learning approach." This is not a valid conclusion since: 1) they do not control for all independent variables in the measurement models of the experiments (e.g., the faults in GTZAN make a significant contribution to the outcome); 2) they do not define the problems being solved (classification by any means? by relevant means?); and 3) they do not specify "work well" and "consistently outperforming". This approach appears to reproduce a lot of "ground truth" in some datasets, but the reproduction of ground truth does not imply that something relevant for content-based music classification has been learned and is being used.

Are these "high-level" features really closer to the "musical surface", i.e., music content? It would be interesting to redo the experiment using GTZAN but taking into consideration its faults. Also, of course, to subject it to the method of irrelevant transformations to see if it is relying on confounds in the dataset.


AN ASSOCIATION-BASED APPROACH TO GENRE CLASSIFICATION IN MUSIC (T. Arjannikov and J. Z. Zhang)

Association analysis is a data mining technique that finds relationships between sets of unique objects in order to build logical implications, i.e., if A then (probably) B. In this work, quantized features extracted from labeled acoustic signals are used to produce such rules. Those quantized features that appear frequently enough in signals with a particular label are then taken to imply that label. For instance, if many signals of label i have large (or small) values in feature dimension j at times {t_1, t_2}, then that is taken to imply i.
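A toy sketch of that idea, with invented quantized-feature "items" and thresholds: candidate rules of the form "item implies label" are scored by their support and confidence over a handful of tracks.

```python
# Toy association-rule sketch: each track is a set of quantized feature
# "items" plus a genre label; rules item -> label are kept if their support
# and confidence exceed invented thresholds.
from collections import Counter
from itertools import product

tracks = [
    ({"mfcc1=high", "mfcc8=low"}, "forro"),
    ({"mfcc1=high", "mfcc3=mid"}, "forro"),
    ({"mfcc1=low",  "mfcc8=low"}, "tango"),
    ({"mfcc1=high", "mfcc8=low"}, "forro"),
]

item_counts = Counter()
pair_counts = Counter()
for items, label in tracks:
    for item in items:
        item_counts[item] += 1
        pair_counts[(item, label)] += 1

min_support, min_confidence = 0.5, 0.6
n = len(tracks)
labels = {label for _, label in tracks}
for item, label in product(item_counts, labels):
    support = pair_counts[(item, label)] / n
    confidence = pair_counts[(item, label)] / item_counts[item]
    if support >= min_support and confidence >= min_confidence:
        print(f"if {item} then {label} (support={support:.2f}, confidence={confidence:.2f})")
```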

This paper reports experiments with the Latin music dataset (LMD) and a portion of the Million Song Dataset. In the LMD, MFCC features are extracted from the first 30 seconds of each song. (This means the features can include the applause and speaking that begin many of the "live" songs. Also, no artist filtering is used, and there is no consideration of all the replicas.) Results show that the proposed systems reproduce "ground truth" labels more often than random selection.

Regardless of the results, the evaluation design used in this work (Classify) is invalid with respect to genre recognition. Reproducing "ground truth" labels here does not provide any evidence that the rules learned have anything to do with the meaning of those labels in the dataset and in the real world. Taken to absurdity, that the first 30 seconds of audio recordings labeled "Forro" have a large first MFCC at 13.1 seconds and a small eighth MFCC at 25.2 seconds is not a particularly useful rule, or one that is at all relevant to the task. Furthermore, this work approaches the problem of music genre recognition as an Aristotelian one, and presupposes the low-level features are "content-based" features relevant to the undefined task of music genre classification. It would be nice if the problem of music genre recognition were like that, but it just isn't.


MUSIC CLASSIFICATION BY TRANSDUCTIVE LEARNING USING BIPARTITE HETEROGENEOUS NETWORKS (D. F. Silva, R. G. Rossi, S. O. Rezende, G. Batista)

Transductive learning sidesteps the inductive step of building models of classes, and instead performs classification via similarity with exemplars. This is useful when there is not enough training data to build suitable models, or when approximately good labels are all that is desired. It is essentially a semi-supervised form of clustering.

This paper encodes low-level features (MFCCs) into bags of frames of features (BFFs), and then builds a bipartite heterogeneous network through which labels propagate to unlabeled data. Experiments on labeling music (GTZAN and Homburg) show the approach reproduces some "ground truth", but no fault filtering is used in GTZAN. Unfortunately, the experiments in this work do not show whether the results come from considerations of the music, or from something else unrelated.
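For the general flavour only (this is scikit-learn's label spreading over a plain feature space, not the paper's bipartite heterogeneous network), here is a minimal transductive sketch with random stand-ins for bag-of-frames features:

```python
# Minimal transductive sketch: labels propagate from a few labeled tracks to
# unlabeled ones via feature-space similarity. Data are random toy stand-ins
# for bag-of-frames feature histograms.
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(1)
X = rng.random(size=(60, 20))       # 60 tracks, 20-dimensional toy features
y = rng.integers(0, 3, size=60)     # "true" labels, used only for simulation
y_partial = y.copy()
y_partial[10:] = -1                 # only the first 10 tracks are labeled

model = LabelSpreading(kernel="rbf", gamma=5.0)
model.fit(X, y_partial)             # transduction fills in the -1 entries
print("propagated labels for ten unlabeled tracks:", model.transduction_[10:20])
```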

I like the idea of transductive learning because it appears to be based more on notions of similarity (or proximity in some metric space) than on building general models that may be unachievable or unrealistic. However, the sanity of this approach for genre recognition (or music description in general) is highly dependent on the space in which the similarity is gauged (of course). A space built from BFFs of MFCCs will likely have little to do with the high-level content used to judge the similarity of music. However, I can imagine several spaces for the same collection of music that emphasize specific high-level aspects of music, such as rhythm, key, instrumentation, and so on. Now, how to measure similarity in these spaces in a meaningful way?

Hello, and welcome to Paper of the Day (Po'D): Kiki-Bouba edition. Today's paper is my own: B. L. Sturm and N. Collins, "THE KIKI-BOUBA CHALLENGE: ALGORITHMIC COMPOSITION FOR CONTENT-BASED MIR RESEARCH & DEVELOPMENT", in Proc. Int. Symp. Music Info. Retrieval, Oct. 2014. Below is the video of my presentation from a few days ago (powerpoint slides here).



The one-line precis of our paper is:
The Kiki-Bouba Challenge (KBC) attempts to change the incentive in content-based MIR research from reproducing ground truth in a dataset to solving problems.


About this Archive

This page is an archive of entries from November 2014 listed from newest to oldest.
