
Workshop Machine Learning for Audio Signal Processing at NIPS 2017 (ML4Audio@NIPS17)


Audio signal processing is currently undergoing a paradigm change, where data-driven machine learning is replacing hand-crafted feature design. This has led some to ask whether audio signal processing is still useful in the “era of machine learning.” There are many challenges, new and old, including the interpretation of learned models in high dimensional spaces, problems associated with data-poor domains, adversarial examples, high computational requirements, and research driven by companies using large in-house datasets that is ultimately not reproducible.

ML4Audio (https://nips.cc/Conferences/2017/Schedule?showEvent=8790) aims to promote progress, systematization, understanding, and convergence of applying machine learning in the area of audio signal processing. Specifically, we are interested in work that demonstrates novel applications of machine learning techniques to audio data, as well as methodological considerations of merging machine learning with audio signal processing. We seek contributions in, but not limited to, the following topics:
– audio information retrieval using machine learning;
– audio synthesis with given contextual or musical constraints using machine learning;
– audio source separation using machine learning;
– audio transformations (e.g., sound morphing, style transfer) using machine learning;
– unsupervised learning, online learning, one-shot learning, reinforcement learning, and incremental learning for audio;
– applications/optimization of generative adversarial networks for audio;
– cognitively inspired machine learning models of sound cognition;
– mathematical foundations of machine learning for audio signal processing.

This workshop especially targets researchers, developers and musicians in academia and industry working in the areas of MIR, audio processing, hearing instruments, speech processing, musical HCI, musicology, music technology, music entertainment, and composition.




08:00 AM Overture (Talk). Hendrik Purwins

08:15 AM Karen Livescu. Acoustic word embeddings for speech search (Invited Talk)

08:45 AM Yu-An Chung and James Glass. Learning Word Embeddings from Speech (Talk)

In this paper, we propose a novel deep neural network architecture, Sequence-to-Sequence Audio2Vec, for unsupervised learning of fixed-length vector representations of audio segments excised from a speech corpus. The vectors contain semantic information pertaining to the segments and lie close to one another in the embedding space if their corresponding segments are semantically similar. The design of the proposed model is based on the RNN Encoder-Decoder framework and borrows the methodology of continuous skip-grams for training. The learned vector representations are evaluated on 13 widely used word similarity benchmarks and achieve results competitive with those of GloVe. The biggest advantage of the proposed model is its capability of extracting semantic information from audio segments taken directly from raw speech, without relying on other modalities such as text or images, which are challenging and expensive to collect and annotate.

09:05 AM Soumitro Chakrabarty, Emanuël Habets. Multi-Speaker Localization Using Convolutional Neural Network Trained with Noise (Talk)

The problem of multi-speaker localization is formulated as a multi-class multi-label classification problem, which is solved using a convolutional neural network (CNN) based source localization method. Utilizing the common assumption of disjoint speaker activities, we propose a novel method to train the CNN using synthesized noise signals. The proposed localization method is evaluated for two speakers and compared to a well-known steered response power method.
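As a sketch of this formulation, simultaneously active speaker directions can be discretized into DOA classes and encoded as a multi-label target vector, one sigmoid output per class. The 5-degree resolution and the function name below are illustrative assumptions, not details from the paper:

```python
import numpy as np

def doa_to_multilabel(angles_deg, resolution=5):
    """Encode simultaneously active speaker directions as a multi-label
    target over discretized DOA classes (one sigmoid output per class)."""
    n_classes = 360 // resolution
    y = np.zeros(n_classes)
    for a in angles_deg:
        y[int(a % 360) // resolution] = 1.0
    return y

y = doa_to_multilabel([30, 125])   # two simultaneously active speakers
active = np.flatnonzero(y)         # indices of the active DOA classes
```

A CNN trained against such targets with a per-class binary cross-entropy loss can then indicate several active directions at once.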

09:25 AM Shrikant Venkataramani, Paris Smaragdis. Adaptive Front-ends for End-to-end Source Separation (Talk)

Source separation and other audio applications have traditionally relied on the use of short-time Fourier transforms as a front-end frequency domain representation step. We present an auto-encoder neural network that can act as an equivalent to short-time front-end transforms. We demonstrate the ability of the network to learn optimal, real-valued basis functions directly from the raw waveform of a signal and further show how it can be used as an adaptive front-end for end-to-end supervised source separation.

09:45 AM Speech: source separation, enhancement, recognition, synthesis (Coffee break and poster session). Shuayb Zarar, Rasool Fakoor, Sri Harsha Dumpala, Minje Kim, Paris Smaragdis, Mohit Dubey, Jong Hwan Ko, Sakriani Sakti, Yuxuan Wang, Lijiang Guo, Garrett T Kenyon, Andros Tjandra, Tycho Tax, Younggun Lee


34 Lijiang Guo and Minje Kim. Bitwise Source Separation on Hashed Spectra: An Efficient Posterior Estimation Scheme Using Partial Rank Order Metrics

This paper proposes an efficient bitwise solution to the single-channel source separation task. Most dictionary-based source separation algorithms rely on iterative update rules at run time; these become computationally costly, especially with the overcomplete dictionaries and sparse encodings that tend to give better separation results. To avoid such costs, we propose a bitwise scheme on hashed spectra that leads to an efficient posterior probability calculation. For each source, the algorithm uses a partial rank order metric to extract robust features that form a binarized dictionary of hashed spectra. Then, for a mixture spectrum, its hash code is compared with each source’s hashed dictionary in one pass. This simple voting-based dictionary search allows a fast and iteration-free estimation of ratio masking at each bin of a signal spectrogram. We verify that the proposed BitWise Source Separation (BWSS) algorithm produces sensible source separation results for the single-channel speech denoising task, with 6-8 dB mean SDR. To our knowledge, this is the first dictionary-based algorithm for this task that is completely iteration-free in both training and testing.
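The partial rank order idea can be illustrated with a winner-take-all style hash: codes depend only on the relative ordering of spectral magnitudes, so they are robust to monotonic gain changes. This is a minimal sketch under assumed parameters, not the paper's exact hashing scheme:

```python
import numpy as np

def wta_hash(spectra, perms, k=4):
    """Winner-take-all hash: for each random permutation, record which of
    the first k permuted coefficients is largest. Rank-order codes are
    invariant to any monotonic distortion of the magnitudes."""
    # spectra: (n, d) magnitude spectra; perms: (m, d) permutation rows
    codes = np.empty((spectra.shape[0], perms.shape[0]), dtype=np.int64)
    for j, p in enumerate(perms):
        codes[:, j] = np.argmax(spectra[:, p[:k]], axis=1)
    return codes

rng = np.random.default_rng(0)
d, m = 64, 32
perms = np.stack([rng.permutation(d) for _ in range(m)])
clean = rng.random((10, d))
scaled = 3.0 * clean            # monotonic distortion: same rank order
codes_a = wta_hash(clean, perms)
codes_b = wta_hash(scaled, perms)
match = (codes_a == codes_b).mean()
```

Matching a mixture's code against each source's binarized dictionary then reduces to fast Hamming-style comparisons.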

33 Minje Kim and Paris Smaragdis. Bitwise Neural Networks for Efficient Single-Channel Source Separation

We present Bitwise Neural Networks (BNN) as an efficient hardware-friendly solution to single-channel source separation tasks in resource-constrained environments. In the proposed BNN system, we replace all the real-valued operations during the feedforward process of a Deep Neural Network (DNN) with bitwise arithmetic (e.g. the XNOR operation between bipolar binaries in place of multiplications). Thanks to the fully bitwise run-time operations, the BNN system can serve as an alternative solution where efficient real-time processing is critical, for example real-time speech enhancement in embedded systems. Furthermore, we also propose a binarization scheme to convert the input signals into bit strings so that the BNN parameters learn the Boolean mapping between input binarized mixture signals and their target Ideal Binary Masks (IBM). Experiments on the single-channel speech denoising tasks show that the efficient BNN-based source separation system works well with an acceptable performance loss compared to a comprehensive real-valued network, while consuming a minimal amount of resources.
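The core arithmetic trick can be sketched as follows: for bipolar ±1 vectors, a dot product reduces to counting agreeing bits, i.e. XNOR plus popcount. The numpy version below simulates that logic rather than using actual bit-level hardware instructions, and the layer sizes are illustrative assumptions:

```python
import numpy as np

def binarize(x):
    """Map real values to bipolar binaries {-1, +1}."""
    return np.where(x >= 0, 1, -1).astype(np.int8)

def bnn_layer(x_bin, w_bin):
    """One bitwise feedforward layer: for bipolar vectors of length d,
    the dot product equals 2 * (#agreeing bits) - d, so XNOR + popcount
    replaces all multiply-adds."""
    d = x_bin.shape[-1]
    agree = (x_bin[:, None, :] == w_bin[None, :, :]).sum(axis=-1)  # popcount
    return binarize(2 * agree - d)

rng = np.random.default_rng(1)
x = binarize(rng.standard_normal((4, 16)))   # binarized input frames
w = binarize(rng.standard_normal((8, 16)))   # binarized weights
out = bnn_layer(x, w)
```

On real hardware the agreement count would be computed with XNOR and popcount instructions over packed bit words, which is where the efficiency gain comes from.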

6 Mohit Dubey, Garrett Kenyon, Nils Carlson and Austin Thresher. Does Phase Matter For Monaural Source Separation?

The “cocktail party” problem of fully separating multiple sources from a single channel audio waveform remains unsolved. Current biological understanding of neural encoding suggests that phase information is preserved and utilized at every stage of the auditory pathway. However, current computational approaches primarily discard phase information in order to mask amplitude spectrograms of sound. In this paper, we seek to address whether preserving phase information in spectral representations of sound provides better results in monaural separation of vocals from a musical track by using a neurally plausible sparse generative model. Our results demonstrate that preserving phase information reduces artifacts in the separated tracks, as quantified by the signal-to-artifact ratio (GSAR). Furthermore, our proposed method achieves state-of-the-art performance for source separation, as quantified by a mean signal-to-interference ratio (GSIR) of 19.46.


31: Rasool Fakoor, Xiaodong He, Ivan Tashev and Shuayb Zarar. Reinforcement Learning To Adapt Speech Enhancement to Instantaneous Input Signal Quality

Today, the optimal performance of existing noise-suppression algorithms, both data-driven and those based on classic statistical methods, is range bound to specific levels of instantaneous input signal-to-noise ratios. In this paper, we present a new approach to improve the adaptivity of such algorithms enabling them to perform robustly across a wide range of input signal and noise types. Our methodology is based on the dynamic control of algorithmic parameters via reinforcement learning. Specifically, we model the noise-suppression module as a black box, requiring no knowledge of the algorithmic mechanics except a simple feedback from the output. We utilize this feedback as the reward signal for a reinforcement-learning agent that learns a policy to adapt the algorithmic parameters for every incoming audio frame (16 ms of data). Our preliminary results show that such a control mechanism can substantially increase the overall performance of the underlying noise-suppression algorithm; 42% and 16% improvements in output SNR and MSE, respectively, when compared to no adaptivity.
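A toy version of such black-box control can be written as a bandit-style agent that picks a suppression strength per frame from a discrete candidate set, using only a scalar reward fed back from the output. The reward function below is a synthetic stand-in for the paper's output-quality feedback, and all names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
params = np.linspace(0.0, 1.0, 5)   # candidate suppression strengths
q = np.zeros(len(params))           # action-value estimates
counts = np.zeros(len(params))
eps = 0.1                           # exploration rate

def reward(a, snr_in):
    """Stand-in for feedback from the black-box enhancer: here the best
    suppression strength depends on the instantaneous input SNR."""
    best = 1.0 - snr_in             # low SNR -> strong suppression
    return -abs(params[a] - best)

for frame in range(2000):           # one action per incoming audio frame
    snr_in = 0.8                    # fixed input regime for the sketch
    a = rng.integers(len(params)) if rng.random() < eps else int(np.argmax(q))
    r = reward(a, snr_in)
    counts[a] += 1
    q[a] += (r - q[a]) / counts[a]  # incremental mean update

best_action = int(np.argmax(q))     # converges to the best strength
```

A full RL formulation would condition the policy on features of each frame instead of a fixed regime, but the feedback loop has the same shape.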

35: Jong Hwan Ko, Josh Fromm, Matthai Phillipose, Ivan Tashev and Shuayb Zarar. Precision Scaling of Neural Networks for Efficient Audio Processing

While deep neural networks have shown powerful performance in many audio applications, their large computation and memory demand has been a challenge for real-time processing. In this paper, we study the impact of scaling the precision of neural networks on the performance of two common audio processing tasks, namely, voice-activity detection and single-channel speech enhancement. We determine the optimal pair of weight/neuron bit precision by exploring its impact on both the performance and processing time. Through experiments conducted with real user data, we demonstrate that deep neural networks that use lower bit precision significantly reduce the processing time (up to 30x). However, their performance impact is low (< 3.14%) only in the case of classification tasks such as those present in voice activity detection.
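Precision scaling can be illustrated with uniform symmetric quantization of a weight vector: mean reconstruction error shrinks as the bit width grows, which is the performance/efficiency trade-off the paper explores. This sketch assumes simple post-training quantization, which may differ from the scheme studied in the paper:

```python
import numpy as np

def quantize(w, bits):
    """Uniform symmetric quantization of weights to the given bit width.
    Returns the dequantized values so the error can be inspected."""
    qmax = 2 ** (bits - 1) - 1                  # e.g. 127 for 8 bits
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

rng = np.random.default_rng(3)
w = rng.standard_normal(10000)
# mean absolute error at 2, 4, and 8 bits of weight precision
errs = {b: np.abs(quantize(w, b) - w).mean() for b in (2, 4, 8)}
```

The same scan over neuron (activation) bit widths gives the weight/neuron precision pairs compared in the paper.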

20: Marius Paraschiv, Lasse Borgholt, Tycho Tax, Marco Singh and Lars Maaløe. Exploiting Nontrivial Connectivity for Automatic Speech Recognition

Nontrivial connectivity has allowed the training of very deep networks by addressing the problem of vanishing gradients and offering a more efficient method of reusing parameters. In this paper we make a comparison between residual networks, densely-connected networks and highway networks on an image classification task. Next, we show that these methodologies can easily be deployed into automatic speech recognition and provide significant improvements to existing models.

1 Brian Mcmahan and Delip Rao. Listening to the World Improves Speech Command Recognition

In this paper, we present a study on transfer learning in convolutional network architectures for recognizing environmental sound events and speech commands. Our primary contribution is to show that representations learned for environmental sound classification can be used to significantly improve accuracies on the unrelated, voice-focused task of speech command recognition. Our second contribution is a simple multiscale input representation that uses dilated convolutions to aggregate larger contexts and increase classification performance. Our third and final contribution is a demonstration of an interaction effect between transfer learning and the multiscale input representations. For different versions of the speech command dataset, the pre-trained networks with multiscale inputs can be trained with only 50%-75% of the speech command training data and achieve accuracies similar to those of non-pre-trained, non-multiscale networks given 100% of the training data.

7: Andros Tjandra, Sakriani Sakti and Satoshi Nakamura. End-to-End Speech Recognition with Local Monotonic Attention

Most attention mechanisms in sequence-to-sequence models are based on a global attention property, which requires computing a weighted summarization of the whole input sequence generated by the encoder states. However, this is computationally expensive and often produces misalignment on longer input sequences. Furthermore, it does not fit the monotonic, left-to-right nature of the speech recognition task. In this paper, we propose a novel attention mechanism with local and monotonic properties. Various ways to control those properties are also explored. Experimental results demonstrate that encoder-decoder based ASR with local monotonic attention achieves significant performance improvements and reduces computational complexity in comparison with the standard global attention architecture.
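A minimal sketch of the local part of such a mechanism: attention weights come from a softmax restricted to a window around a predicted centre position, and monotonicity follows if the centre is only allowed to move left to right across decoding steps. The window width and function name are illustrative assumptions:

```python
import numpy as np

def local_attention(scores, center, width=3):
    """Attention restricted to a window around a predicted centre; the
    decoder only pays for 2*width+1 scores instead of the full sequence."""
    T = scores.shape[0]
    lo, hi = max(0, center - width), min(T, center + width + 1)
    w = np.zeros(T)
    local = scores[lo:hi]
    e = np.exp(local - local.max())   # softmax over the window only
    w[lo:hi] = e / e.sum()
    return w

rng = np.random.default_rng(4)
T = 20
scores = rng.standard_normal(T)       # encoder-state scores for one step
w = local_attention(scores, center=10, width=3)
```

Advancing `center` by a non-negative amount at each decoding step yields the monotonic alignment the abstract describes.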

17: Sri Harsha Dumpala, Rupayan Chakraborty and Sunil Kumar Kopparapu. A novel approach for effective learning in low resourced scenarios

Deep learning based discriminative methods, being the state-of-the-art machine learning techniques, are ill-suited for learning from small amounts of data. In this paper, we propose a novel framework, called simultaneous two sample learning (s2sL), to effectively learn the class discriminative characteristics even from very small amounts of data. In s2sL, more than one sample (here, two samples) is simultaneously considered both to train and to test the classifier. We demonstrate our approach on speech/music discrimination and emotion classification through experiments. Further, we also show the effectiveness of the s2sL approach for classification in low-resource scenarios and for imbalanced data.


29 Yuxuan Wang, RJ Skerry-Ryan, Ying Xiao, Daisy Stanton, Joel Shor, Eric Battenberg, Rob Clark and Rif A. Saurous. Uncovering Latent Style Factors for Expressive Speech Synthesis

Prosodic modeling is a core problem in speech synthesis. The key challenge is producing desirable prosody from textual input containing only phonetic information. In this preliminary study, we introduce the concept of “style tokens” in Tacotron, a recently proposed end-to-end neural speech synthesis model. Using style tokens, we aim to extract independent prosodic styles from training data. We show that without annotation data or an explicit supervision signal, our approach can automatically learn a variety of prosodic variations in a purely data-driven way. Importantly, each style token corresponds to a fixed style factor regardless of the given text sequence. As a result, we can control the prosodic style of synthetic speech in a somewhat predictable and globally consistent way.

13 Younggun Lee, Azam Rabiee and Soo-Young Lee. Emotional End-to-End Neural Speech Synthesizer

In this paper, we introduce an emotional speech synthesizer based on the recent end-to-end neural model Tacotron. Despite its benefits, we found that the original Tacotron suffers from the exposure bias problem and irregularity of the attention alignment. We address these problems by utilizing a context vector and residual connections in the recurrent neural networks (RNNs). Our experiments show that the model can successfully generate speech for given emotion labels.

11:00 AM Marco Marchini. Learning and transforming sound for interactive musical applications (Invited Talk)

11:30 AM 2 Andros Tjandra, Sakriani Sakti and Satoshi Nakamura. Compact Recurrent Neural Network based on Tensor Train for Polyphonic Music Modeling (Talk)

This paper introduces a novel compression method for recurrent neural networks (RNNs) based on the Tensor Train (TT) format. The objectives of this work are to reduce the number of parameters in RNNs while maintaining their expressive power. The key of our approach is to represent the dense weight matrices of the simple RNN and Gated Recurrent Unit (GRU) architectures as n-dimensional tensors in TT-format. To evaluate the proposed models, we compare them with uncompressed RNNs on polyphonic sequence prediction tasks. Our proposed TT-format RNNs are able to preserve performance while reducing the number of RNN parameters significantly, by up to 80 times.
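The TT idea can be sketched with the standard TT-SVD algorithm: reshape a weight matrix into a higher-order tensor and factor it into a "train" of small 3-way cores by sequential truncated SVDs. This illustrates only the parameter saving; the paper trains the cores directly rather than factorizing a pre-trained matrix, and all sizes below are assumptions:

```python
import numpy as np

def tt_svd(tensor, max_rank):
    """TT-SVD: factor a d-way tensor into 3-way cores via sequential
    truncated SVDs, capping every TT-rank at max_rank."""
    shape, cores, r = tensor.shape, [], 1
    c = tensor.reshape(shape[0], -1)
    for k in range(len(shape) - 1):
        u, s, vt = np.linalg.svd(c, full_matrices=False)
        r_new = min(max_rank, len(s))
        cores.append(u[:, :r_new].reshape(r, shape[k], r_new))
        c = (s[:r_new, None] * vt[:r_new]).reshape(r_new * shape[k + 1], -1)
        r = r_new
    cores.append(c.reshape(r, shape[-1], 1))
    return cores

def tt_reconstruct(cores):
    """Contract the train of cores back into a full tensor."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))
    return out.reshape([c.shape[1] for c in cores])

rng = np.random.default_rng(6)
w = rng.standard_normal((4, 4, 4, 4))        # a 16x16 weight matrix, reshaped
exact = tt_reconstruct(tt_svd(w, max_rank=64))   # full rank: lossless
small = tt_svd(w, max_rank=2)                    # rank-capped: lossy, compact
n_params = sum(core.size for core in small)
```

Capping all TT-ranks at 2 stores 48 numbers instead of 256 for this matrix; for large RNN weight matrices the same mechanism yields the order-of-magnitude compression the abstract reports.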

11:50 AM 21 Hyeong-Seok Choi, Ju-Heon Lee and Kyogu Lee. Singing Voice Separation using Generative Adversarial Networks (Talk)

In this paper, we propose a novel approach extending Wasserstein generative adversarial networks (GANs) [3] to separate the singing voice from a mixture signal. We use the mixture signal as a condition to generate singing voices and apply a U-net style network for stable training of the model. Experiments with the DSD100 dataset show promising results and the potential of using GANs for music source separation.

12:10 PM 32 Sungkyun Chang, Juheon Lee, Sankeun Choe and Kyogu Lee. Audio Cover Song Identification using Convolutional Neural Network (Talk)

In this paper, we propose a new approach to cover song identification using a convolutional neural network (CNN). Most previous studies extract feature vectors that characterize the cover song relation from a pair of songs and use them to compute the (dis)similarity between the two songs. Based on the observation that there is a meaningful pattern between cover songs and that this can be learned, we have reformulated the cover song identification problem in a machine learning framework. To do this, we first build the CNN using as input a cross-similarity matrix generated from a pair of songs. We then construct a data set composed of cover song pairs and non-cover song pairs, which are used as positive and negative training samples, respectively. The trained CNN outputs the probability of being in the cover song relation given a cross-similarity matrix generated from any two pieces of music and identifies the cover song by ranking on that probability. Experimental results show that the proposed algorithm achieves performance better than or comparable to the state of the art.
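A cross-similarity matrix of the kind used as CNN input can be sketched as frame-wise cosine similarity between two feature sequences; a cover relation shows up as a high-similarity diagonal stripe. The chroma-like features below are random stand-ins, so the "cover" is simulated by a simple time shift:

```python
import numpy as np

def cross_similarity(a, b):
    """Frame-wise cosine cross-similarity between two feature sequences
    (e.g. beat-synchronous chroma), used as the CNN's 2-D input."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

rng = np.random.default_rng(5)
song = rng.random((40, 12)) + 1e-6      # 40 frames of 12-d chroma-like features
cover = np.roll(song, 5, axis=0)        # crude stand-in for a cover version
S = cross_similarity(song, cover)
# the cover relation appears as a perfect diagonal stripe offset by 5 frames
```

Real cover pairs produce a blurrier, tempo-warped stripe, which is the pattern the CNN learns to detect.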

12:30 PM Lunch Break (Break)

01:30 PM Polyphonic piano transcription using deep neural networks (Invited Talk). Douglas Eck

02:00 PM Deep learning for music recommendation and generation (Invited Talk). Sander Dieleman

02:30 PM Exploring Ad Effectiveness using Acoustic Features (Invited Talk). Matt Prockup, Puya Vahabi

03:00 PM Music and environmental sounds (Coffee break and poster session). Oriol Nieto, Jordi Pons, Bhiksha Raj, Tycho Tax, Benjamin Elizalde, Juhan Nam, Anurag Kumar


14 Jordi Pons, Oriol Nieto, Matthew Prockup, Erik M. Schmidt, Andreas F. Ehmann and Xavier Serra. End-to-end learning for music audio tagging at scale

The lack of data tends to limit the outcomes of deep learning research – especially when dealing with end-to-end learning stacks processing raw data such as waveforms. In this study we make use of musical labels annotated for 1.2 million tracks. This large amount of data allows us to unrestrictedly explore different front-end paradigms: from assumption-free models, using waveforms as input with very small convolutional filters, to models that rely on domain knowledge: log-mel spectrograms with a convolutional neural network designed to learn temporal and timbral features. Results suggest that while spectrogram-based models surpass their waveform-based counterparts, the difference in performance shrinks as more data are employed.

27 Jongpil Lee, Taejun Kim, Jiyoung Park and Juhan Nam. Raw Waveform-based Audio Classification Using Sample-level CNN Architectures

Music, speech, and acoustic scene sounds are often handled separately in the audio domain because of their different signal characteristics. However, as the image domain advances rapidly through versatile image classification models, it is necessary to study similarly extensible classification models in the audio domain as well. In this study, we approach this problem using two types of sample-level deep convolutional neural networks that take raw waveforms as input and use filters with small granularity. One is a basic model that consists of convolution and pooling layers. The other is an improved model that additionally has residual connections, squeeze-and-excitation modules and multi-level concatenation. We show that the sample-level models reach state-of-the-art performance levels for the three different categories of sound. Also, we visualize the filters along layers and compare the characteristics of the learned filters.

38 Alfonso Perez-Carrillo. Estimation of violin bowing features from audio recordings with Convolutional Networks

The acquisition of musical gestures, and particularly of instrument controls, from a musical performance is a field of increasing interest with applications in many research areas. In recent years, the development of novel sensing technologies has allowed the fine measurement of such controls. However, the acquisition process usually involves expensive sensing systems and complex setups that are generally intrusive in practice. An alternative to direct acquisition is the analysis of the audio signal. Such indirect acquisition has many advantages, including its simplicity, low cost and non-intrusive nature. The main challenge is designing detection algorithms robust enough to be as accurate as direct approaches. In this paper, we present an indirect acquisition method that estimates violin bowing controls from audio signal analysis by training Convolutional Neural Networks with a database of multimodal data (bowing controls and sound features) from violin performances.


3 Benjamin Elizalde, Rohan Badlani, Ankit Shah, Anurag Kumar and Bhiksha Raj. NELS: Never-Ending Learner of Sounds

Sounds are essential to how humans perceive and interact with the world. These sounds are captured in recordings and shared on the Internet on a minute-by-minute basis. These recordings, which are predominantly videos, constitute the largest archive of sounds we’ve ever seen. However, most of these recordings have undescribed content, making methods for automatic audio content analysis, indexing and retrieval necessary. These methods have to address multiple challenges, such as the relation between sounds and language, numerous and diverse sound classes, and large-scale evaluation. We propose a system that continuously learns relations between sounds and language from the web, improves its sound recognition models over time, and evaluates its learning competency at large scale without references. We introduce the Never-Ending Learner of Sounds (NELS), a project for continuous learning of sounds and their associated knowledge, available online at nels.cs.cmu.edu.

30 Tycho Tax, Jose Antich, Hendrik Purwins and Lars Maaløe. Utilizing Domain Knowledge in End-to-End Audio Processing

End-to-end neural network based approaches to audio modelling are generally outperformed by models trained on high-level data representations. In this paper we present preliminary work showing the feasibility of training the first layers of a deep convolutional neural network (CNN) to learn the commonly-used log-scaled mel-spectrogram transformation. Secondly, we demonstrate that upon initializing the first layers of an end-to-end CNN classifier with the learned transformation, convergence and performance on the ESC-50 environmental sound classification dataset are similar to those of a CNN-based model trained on the highly pre-processed log-scaled mel-spectrogram features.
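The fixed transformation the first layers are trained to approximate consists of a triangular mel filterbank applied to the magnitude spectrum, followed by a log. A minimal numpy sketch with assumed toy sizes (8 mel bands, 64-point FFT, 8 kHz sampling rate):

```python
import numpy as np

def mel_filterbank(n_mels=8, n_fft=64, sr=8000):
    """Triangular mel filterbank: equally spaced centres on the mel scale,
    linear triangles in the frequency-bin domain."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)   # rising edge
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)   # falling edge
    return fb

fb = mel_filterbank()
spectrum = np.abs(np.random.default_rng(8).random(33))   # stand-in magnitudes
logmel = np.log(fb @ spectrum + 1e-6)                    # log-mel features
```

Since this is just a linear map followed by a pointwise nonlinearity, a convolutional first layer can in principle learn to reproduce it, which is the feasibility result the abstract reports.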

4 Anurag Kumar and Bhiksha Raj. Deep CNN Framework for Audio Event Recognition using Weakly Labeled Web Data

The development of audio event recognition models requires labeled training data, which are generally hard to obtain. One promising source of recordings of audio events is the large amount of multimedia data on the web. In particular, if the audio content analysis must itself be performed on web audio, it is important to train the recognizers themselves from such data. Training from these web data, however, poses several challenges, the most important being the availability of labels: labels, if any, that may be obtained for the data are generally weak, and not of the kind conventionally required for training detectors or classifiers. We propose a robust and efficient deep convolutional neural network (CNN) based framework to learn audio event recognizers from weakly labeled data. The proposed method can train from and analyze recordings of variable length in an efficient manner and outperforms a network trained with strongly labeled web data by a considerable margin.
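Weak clip-level labels can be connected to segment-level predictions through a multiple-instance pooling step: the clip probability for each class is pooled over segments (max pooling is one simple choice; the paper's network design may differ), and the loss is computed against the clip label only:

```python
import numpy as np

def weak_label_loss(segment_probs, clip_label):
    """Multiple-instance view of weak labels: a clip-level prediction is
    the max over segment-level probabilities; only the clip-level label
    is available, so binary cross-entropy is applied at the clip level."""
    clip_prob = segment_probs.max(axis=0)        # pool over segments
    eps = 1e-12                                  # numerical safety for log
    return -(clip_label * np.log(clip_prob + eps)
             + (1 - clip_label) * np.log(1 - clip_prob + eps)).sum()

probs = np.array([[0.1, 0.2],
                  [0.9, 0.1],
                  [0.2, 0.3]])     # 3 segments, 2 event classes
label = np.array([1.0, 0.0])       # event 0 occurs somewhere in the clip
loss = weak_label_loss(probs, label)
```

Because pooling is over however many segments the clip yields, the same loss handles recordings of variable length.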

04:00 PM TBD (Invited Talk)

04:30 PM 15 Ivan Bocharov, Bert de Vries and Tjalling Tjalkens. K-shot Learning of Acoustic Context (Talk)

In order to personalize the behavior of hearing aid devices in different acoustic scenes, we need personalized acoustic scene classifiers. Since we cannot afford to burden an individual hearing aid user with the task to collect a large acoustic database, we will want to train an acoustic scene classifier on one in-situ recorded waveform (of a few seconds duration) per class. In this paper we develop a method that achieves high levels of classification accuracy from a single recording of an acoustic scene.

04:50 PM 18 Aren Jansen, Manoj Plakal, Ratheet Pandya, Daniel P. W. Ellis, Shawn Hershey, Jiayang Liu, R. Channing Moore, Rif A. Saurous. Towards Learning Semantic Audio Representations from Unlabeled Data (Talk)

Our goal is to learn semantically structured audio representations without relying on categorically labeled data. We consider several class-agnostic semantic constraints that are inherent to non-speech audio: (i) sound categories are invariant to additive noise and translations in time, (ii) mixtures of two sound events inherit the categories of the constituents, and (iii) the categories of events in close temporal proximity in a single recording are likely to be the same or related. We apply these constraints to sample training data for triplet-loss embedding models using a large unlabeled dataset of YouTube soundtracks. The resulting low-dimensional representations provide both greatly improved query-by-example retrieval performance and reduced labeled data and model complexity requirements for supervised sound classification.
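Constraint (i) suggests sampling triplets where the positive is a noise-augmented copy of the anchor; training then minimizes a hinge triplet loss. A numpy sketch with random vectors standing in for learned embeddings:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge triplet loss: pull the anchor toward the positive (e.g. a
    noise-augmented copy of the same sound) and push it away from the
    negative by at least the margin, in squared Euclidean distance."""
    d_pos = np.sum((anchor - positive) ** 2, axis=1)
    d_neg = np.sum((anchor - negative) ** 2, axis=1)
    return np.maximum(0.0, d_pos - d_neg + margin).mean()

rng = np.random.default_rng(6)
anchor = rng.standard_normal((8, 16))                     # batch of embeddings
positive = anchor + 0.05 * rng.standard_normal((8, 16))   # constraint (i): noise
negative = rng.standard_normal((8, 16))                   # unrelated sounds
loss = triplet_loss(anchor, positive, negative)
```

Constraints (ii) and (iii) yield analogous positives from mixtures and from temporally adjacent segments, so the same loss covers all three sampling schemes.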

05:10 PM 23 Yunpeng Li, Ivan Kiskin, Davide Zilli, Marianne Sinka, Henry Chan, Kathy Willis, Stephen J Roberts. Cost-sensitive detection with variational autoencoders for environmental acoustic sensing (Talk)

Environmental acoustic sensing involves the retrieval and processing of audio signals to better understand our surroundings. While large-scale acoustic data make manual analysis infeasible, they provide a suitable playground for machine learning approaches. Most existing machine learning techniques developed for environmental acoustic sensing do not provide flexible control of the trade-off between the false positive rate and the false negative rate. This paper presents a cost-sensitive classification paradigm, in which the hyper-parameters of classifiers and the structure of variational autoencoders are selected in a principled Neyman-Pearson framework. We examine the performance of the proposed approach using a dataset from the HumBug project, which aims to detect the presence of mosquitoes using sound collected by simple embedded devices.
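The Neyman-Pearson idea can be sketched as choosing the decision threshold from the empirical score distribution of negative data so the false positive rate is pinned at a target level alpha; model selection then maximizes detection power under that constraint. The Gaussian scores below are synthetic stand-ins for classifier outputs:

```python
import numpy as np

def np_threshold(neg_scores, alpha):
    """Pick the decision threshold as the (1 - alpha) quantile of scores
    on negative data, so the false positive rate is held near alpha."""
    return np.quantile(neg_scores, 1.0 - alpha)

rng = np.random.default_rng(7)
neg = rng.standard_normal(100000)            # scores for "no mosquito"
pos = rng.standard_normal(100000) + 2.0      # scores for "mosquito"
tau = np_threshold(neg, alpha=0.05)
fpr = (neg > tau).mean()                     # pinned near alpha
tpr = (pos > tau).mean()                     # detection power at that fpr
```

Comparing models by their `tpr` at a fixed `fpr` budget is what gives the flexible cost control the abstract highlights.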

05:30 PM Sepp Hochreiter, Karen Livescu, Oriol Nieto, Malcolm Slaney, Hendrik Purwins. Machine learning and audio signal processing: State of the art and future perspectives (Discussion Panel)