

Workshop Machine Learning for Audio Signal Processing at NIPS 2017 (ML4Audio@NIPS17)



Posters: Speech Source Separation, Speech Enhancement, Speech Synthesis, Music, Environmental Sounds


Audio signal processing is currently undergoing a paradigm change, where data-driven machine learning is replacing hand-crafted feature design. This has led some to ask whether audio signal processing is still useful in the “era of machine learning.” There are many challenges, new and old, including the interpretation of learned models in high dimensional spaces, problems associated with data-poor domains, adversarial examples, high computational requirements, and research driven by companies using large in-house datasets that is ultimately not reproducible.

ML4Audio (https://nips.cc/Conferences/2017/Schedule?showEvent=8790) aims to promote progress, systematization, understanding, and convergence of applying machine learning in the area of audio signal processing. Specifically, we are interested in work that demonstrates novel applications of machine learning techniques to audio data, as well as methodological considerations of merging machine learning with audio signal processing. We seek contributions in, but not limited to, the following topics:

  • audio information retrieval using machine learning;
  • audio synthesis with given contextual or musical constraints using machine learning;
  • audio source separation using machine learning;
  • audio transformations (e.g., sound morphing, style transfer) using machine learning;
  • unsupervised learning, online learning, one-shot learning, reinforcement learning, and incremental learning for audio;
  • applications/optimization of generative adversarial networks for audio;
  • cognitively inspired machine learning models of sound cognition;
  • mathematical foundations of machine learning for audio signal processing.

This workshop especially targets researchers, developers and musicians in academia and industry in the area of MIR, audio processing, hearing instruments, speech processing, musical HCI, musicology, music technology, music entertainment, and composition.

Program at NIPS17, Workshop Book (including 45 other workshops)

Preparation of a special journal issue is under way.

Organisation Committee

Hendrik Purwins, Aalborg University Copenhagen, Denmark (hpu@create.aau.dk)
Bob L. Sturm, Queen Mary University of London, UK (b.sturm@qmul.ac.uk)
Mark Plumbley, University of Surrey, UK (m.plumbley@surrey.ac.uk)

Program Committee

Abeer Alwan (University of California, Los Angeles)
Jon Barker (University of Sheffield)
Sebastian Böck (Johannes Kepler University Linz)
Mads Græsbøll Christensen (Aalborg University)
Maximo Cobos (Universitat de Valencia)
Sander Dieleman (Google DeepMind)
Monika Dörfler (University of Vienna)
Shlomo Dubnov (UC San Diego)
Philippe Esling (IRCAM)
Cédric Févotte (IRIT)
Emilia Gómez (Universitat Pompeu Fabra)
Emanuël Habets (International Audio Labs Erlangen)
Jan Larsen (Technical University of Denmark)
Marco Marchini (Spotify)
Rafael Ramirez (Universitat Pompeu Fabra)
Gaël Richard (TELECOM ParisTech)
Fatemeh Saki (UT Dallas)
Sanjeev Satheesh (Baidu SVAIL)
Jan Schlüter (Austrian Research Institute for Artificial Intelligence)
Joan Serrà (Telefonica)
Malcolm Slaney (Google)
Emmanuel Vincent (INRIA Nancy)
Gerhard Widmer (Austrian Research Institute for Artificial Intelligence)
Tao Zhang (Starkey Hearing Technologies)


Invited Talk: Karen Livescu (TTI-Chicago). Acoustic word embeddings for speech search (slides)

For a number of speech tasks, it can be useful to represent speech segments of arbitrary length by fixed-dimensional vectors, or embeddings. In particular, vectors representing word segments — acoustic word embeddings — can be used in query-by-example search, example-based speech recognition, or spoken term discovery. *Textual* word embeddings have been common in natural language processing for a number of years now; the acoustic analogue is only recently starting to be explored. This talk will present our work on acoustic word embeddings and their application to query-by-example search. I will speculate on applications across a wider variety of audio tasks.
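The query-by-example step described above can be sketched independently of how the embeddings are learned: once word segments are represented by fixed-dimensional vectors, search reduces to nearest-neighbor ranking. A minimal illustration (the data and dimensions are hypothetical, not from the talk):

```python
import numpy as np

def cosine_rank(query, index):
    """Rank indexed embeddings by cosine similarity to the query."""
    q = query / np.linalg.norm(query)
    X = index / np.linalg.norm(index, axis=1, keepdims=True)
    scores = X @ q                      # cosine similarity per indexed segment
    return np.argsort(-scores), scores  # best match first

# Toy index: three fixed-dimensional "word" embeddings; the query is a
# noisy copy of entry 1, so entry 1 should rank first.
rng = np.random.default_rng(0)
index = rng.standard_normal((3, 8))
query = index[1] + 0.01 * rng.standard_normal(8)
ranking, _ = cosine_rank(query, index)
print(ranking[0])  # -> 1
```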

Karen Livescu is an Associate Professor at TTI-Chicago. She completed her PhD and post-doc in electrical engineering and computer science at MIT and her Bachelor’s degree in Physics at Princeton University. Karen’s main research interests are at the intersection of speech and language processing and machine learning. Her recent work includes multi-view representation learning, segmental neural models, acoustic word embeddings, and automatic sign language recognition. She is a member of the IEEE Spoken Language Technical Committee, an associate editor for IEEE Transactions on Audio, Speech, and Language Processing, and a technical co-chair of ASRU 2015 and 2017.

Yu-An Chung and James Glass. Learning Word Embeddings from Speech (slides, BibTeX)

In this paper, we propose a novel deep neural network architecture, Sequence-to-Sequence Audio2Vec, for unsupervised learning of fixed-length vector representations of audio segments excised from a speech corpus. The vectors carry semantic information about the segments and lie close to one another in the embedding space when their corresponding segments are semantically similar. The design of the proposed model is based on the RNN Encoder-Decoder framework and borrows the methodology of continuous skip-grams for training. The learned vector representations are evaluated on 13 widely used word similarity benchmarks and achieve results competitive with GloVe. The biggest advantage of the proposed model is its capability of extracting semantic information from audio segments taken directly from raw speech, without relying on other modalities such as text or images, which are challenging and expensive to collect and annotate.
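The skip-gram methodology the abstract borrows can be illustrated without any network: each audio segment acts as a "target" whose encoding must predict the segments around it. A toy sketch of the (target, context) pair construction, with a made-up window size:

```python
def skipgram_pairs(num_segments, window=2):
    """(target, context) index pairs, as in continuous skip-grams, where
    context segments lie within `window` positions of the target segment."""
    pairs = []
    for t in range(num_segments):
        for c in range(max(0, t - window), min(num_segments, t + window + 1)):
            if c != t:
                pairs.append((t, c))
    return pairs

# Four consecutive segments, context window of one neighbor on each side.
pairs = skipgram_pairs(4, window=1)
print(pairs)  # [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]
```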

Soumitro Chakrabarty, Emanuël Habets. Multi-Speaker Localization Using Convolutional Neural Network Trained with Noise (slides, BibTeX)

The problem of multi-speaker localization is formulated as a multi-class multi-label classification problem, which is solved using a convolutional neural network (CNN) based source localization method. Utilizing the common assumption of disjoint speaker activities, we propose a novel method to train the CNN using synthesized noise signals. The proposed localization method is evaluated for two speakers and compared to a well-known steered response power method.
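The multi-class multi-label formulation can be made concrete: each candidate direction of arrival (DOA) becomes a class, and several classes may be active at once. A hypothetical encoding sketch (the 5-degree resolution is an assumption, not taken from the paper):

```python
import numpy as np

def doa_to_multihot(doas_deg, resolution_deg=5):
    """Encode a set of speaker directions of arrival (0-180 degrees) as a
    multi-hot class vector, one class per `resolution_deg`-wide bin."""
    n_classes = 180 // resolution_deg + 1   # classes for 0, 5, ..., 180 degrees
    label = np.zeros(n_classes)
    for doa in doas_deg:
        label[int(round(doa / resolution_deg))] = 1.0
    return label

# Two simultaneously active speakers at 30 and 120 degrees.
y = doa_to_multihot([30, 120])
print(int(y.sum()), np.flatnonzero(y).tolist())  # 2 [6, 24]
```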

Shrikant Venkataramani, Paris Smaragdis. Adaptive Front-ends for End-to-end Source Separation (BibTeX)

Source separation and other audio applications have traditionally relied on the use of short-time Fourier transforms as a front-end frequency domain representation step. We present an auto-encoder neural network that can act as an equivalent to short-time front-end transforms. We demonstrate the ability of the network to learn optimal, real-valued basis functions directly from the raw waveform of a signal and further show how it can be used as an adaptive front-end for end-to-end supervised source separation.
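The idea of a learned transform standing in for the STFT can be sketched with a fixed orthonormal basis in place of trained encoder weights (the random basis and non-overlapping frames are simplifying assumptions; the paper learns the basis end-to-end):

```python
import numpy as np

# An analysis transform with a real-valued basis B replaces the STFT, and
# its transpose acts as the synthesis transform. Here B is a random
# orthonormal stand-in for a learned basis.
rng = np.random.default_rng(0)
frame_len = 32
B, _ = np.linalg.qr(rng.standard_normal((frame_len, frame_len)))

x = rng.standard_normal(4 * frame_len)   # toy waveform
frames = x.reshape(-1, frame_len)        # non-overlapping frames
coeffs = frames @ B                      # analysis (front-end)
x_hat = (coeffs @ B.T).ravel()           # synthesis (back-end)
print(np.allclose(x, x_hat))  # True: an orthonormal basis reconstructs exactly
```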

Invited Talk: Marco Marchini (Spotify). Learning and transforming sound for interactive musical applications 

Recent developments in object recognition (especially convolutional neural networks) have led to a spectacular new application: image style transfer. But what would be the music version of style transfer? In the Flow Machines project, we created diverse tools for generating audio tracks by transforming prerecorded music material. Our artists integrated these tools into their composition process and produced some pop tracks. I present some of those tools, with audio examples, and give an operative definition of music style transfer as an optimization problem. This definition admits an efficient solution that makes possible a multitude of musical applications, from composition to live performance.

Marco Marchini works at Spotify in the Creator Technology Research Lab, Paris. His mission is bridging the gap between creative artists and intelligent technologies. Previously, he was a research assistant for Pierre and Marie Curie University at the Sony Computer Science Laboratory in Paris, where he worked on the Flow Machines project. His earlier academic research includes unsupervised music generation and ensemble performance analysis, carried out during his M.Sc. and Ph.D. at the Music Technology Group (DTIC, Pompeu Fabra University). He has a double degree in Mathematics from Bologna University.

Andros Tjandra, Sakriani Sakti and Satoshi Nakamura. Compact Recurrent Neural Network based on Tensor Train for Polyphonic Music Modeling (slides, BibTeX)

This paper introduces a novel compression method for recurrent neural networks (RNNs) based on the Tensor Train (TT) format. The objectives of this work are to reduce the number of parameters in RNNs while maintaining their expressive power. The key of our approach is to represent the dense weight matrices of the simple RNN and Gated Recurrent Unit (GRU) architectures as n-dimensional tensors in TT-format. To evaluate the proposed models, we compare them with uncompressed RNNs on polyphonic sequence prediction tasks. Our TT-format RNNs are able to preserve performance while reducing the number of parameters significantly, by up to a factor of 80.
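The TT-format itself is easy to demonstrate: a d-way tensor is stored as a chain of small 3-way cores, and contracting the chain recovers the full tensor at a fraction of the parameter count. A sketch with made-up mode sizes and TT-ranks (not the paper's configuration):

```python
import numpy as np

# TT-format stores a d-dimensional tensor as a chain of cores G_k with
# shape (r_{k-1}, n_k, r_k); a dense weight matrix is first reshaped into
# such a tensor.
modes, ranks = [4, 4, 4, 4], [1, 2, 2, 2, 1]
rng = np.random.default_rng(0)
cores = [rng.standard_normal((ranks[k], modes[k], ranks[k + 1]))
         for k in range(len(modes))]

# Contract the chain back into the full tensor to verify the format.
full = cores[0]
for G in cores[1:]:
    full = np.tensordot(full, G, axes=([-1], [0]))
full = full.squeeze(axis=(0, -1))

tt_params = sum(G.size for G in cores)
print(full.shape, tt_params, full.size)  # (4, 4, 4, 4) 48 256
```

Here 48 TT parameters represent a 256-entry tensor; with larger modes and modest ranks the savings grow quickly.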

21 Hyeong-Seok Choi, Ju-Heon Lee and Kyogu Lee. Singing Voice Separation using Generative Adversarial Networks (BibTeX)

In this paper, we propose a novel approach extending Wasserstein generative adversarial networks (GANs) [3] to separate the singing voice from the mixture signal. We use the mixture signal as a condition to generate singing voices and apply a U-net style network for stable training of the model. Experiments with the DSD100 dataset show promising results and the potential of GANs for music source separation.

32 Sungkyun Chang, Juheon Lee, Sankeun Choe and Kyogu Lee. Audio Cover Song Identification using Convolutional Neural Network (BibTeX)

In this paper, we propose a new approach to cover song identification using a CNN (convolutional neural network). Most previous studies extract feature vectors that characterize the cover song relation from a pair of songs and use them to compute the (dis)similarity between the two songs. Based on the observation that there is a meaningful pattern between cover songs and that this can be learned, we have reformulated the cover song identification problem in a machine learning framework. To do this, we first build the CNN using as an input a cross-similarity matrix generated from a pair of songs. We then construct the data set composed of cover song pairs and non-cover song pairs, which are used as positive and negative training samples, respectively. The trained CNN outputs the probability of being in the cover song relation given a cross-similarity matrix generated from any two pieces of music and identifies the cover song by ranking on the probability. Experimental results show that the proposed algorithm achieves performance better than or comparable to the state-of-the-art.
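The cross-similarity matrix used as the CNN input can be sketched directly: it is the pairwise distance matrix between the two songs' frame-level feature sequences, in which a cover relation shows up as low-distance diagonal stripes. A toy example with random stand-in features:

```python
import numpy as np

def cross_similarity(a, b):
    """Pairwise Euclidean distance matrix between two feature sequences
    (frames x feature_dim); a matrix like this is the CNN's input."""
    diff = a[:, None, :] - b[None, :, :]
    return np.linalg.norm(diff, axis=-1)

rng = np.random.default_rng(0)
song_a = rng.standard_normal((5, 12))          # e.g. 5 frames of chroma features
song_b = np.vstack([song_a[2:], song_a[:2]])   # a "cover": rotated copy of A
S = cross_similarity(song_a, song_b)
print(S.shape)  # (5, 5)
# Matching frames produce a zero-distance diagonal pattern the CNN can learn:
print(np.isclose(S[0, 3], 0.0))  # True: frame 0 of A equals frame 3 of B
```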

Invited Talk: Douglas Eck (Google Magenta). Polyphonic piano transcription using deep neural networks (paper)

I’ll discuss the problem of transcribing polyphonic piano music with an emphasis on generalizing to unseen instruments. We optimize for two objectives. We first predict pitch onset events and then conditionally predict pitch at the frame level. I’ll discuss the model architecture, which combines CNNs and LSTMs. I’ll also discuss challenges faced in robust piano transcription, such as obtaining enough data to train a good model. I’ll also provide some demos and links to working code. This collaboration was led by Curtis Hawthorne, Erich Elsen and Jialin Song.

Douglas Eck works on the Google Brain team's Magenta project, an effort to generate music, video, images and text using machine intelligence. He also worked on music search and recommendation for Google Play Music. Before Google, he was an Associate Professor in Computer Science at the University of Montreal, in the BRAMS research center, where he also worked on music performance modeling.

Invited Talk: Sander Dieleman (Google DeepMind). Deep learning for music recommendation and generation (slides)

The advent of deep learning has made it possible to extract high-level information from perceptual signals without having to specify manually and explicitly how to obtain it; instead, this can be learned from examples. This creates opportunities for automated content analysis of musical audio signals. In this talk, I will discuss how deep learning techniques can be used for audio-based music recommendation. I will also discuss my ongoing work on music generation in the raw waveform domain with WaveNet.

Sander Dieleman is a Research Scientist at DeepMind in London, UK, where he has worked on the development of AlphaGo and WaveNet. He was previously a PhD student at Ghent University, where he conducted research on feature learning and deep learning techniques for learning hierarchical representations of musical audio signals. During his PhD he also developed the Theano-based deep learning library Lasagne and won solo and team gold medals respectively in Kaggle’s “Galaxy Zoo” competition and the first National Data Science Bowl. In the summer of 2014, he interned at Spotify in New York, where he worked on implementing audio-based music recommendation using deep learning on an industrial scale.

Invited Talk: Matt Prockup, Puya Vahabi (Pandora). Exploring Ad Effectiveness using Acoustic Features (slides)

Online audio advertising is a form of advertising used abundantly in online music streaming services. In these platforms, providing high quality ads ensures a better user experience and results in longer user engagement. In this paper we describe a way to predict ad quality using hand-crafted, interpretable acoustic features that capture timbre, rhythm, and harmonic organization of the audio signal. We then discuss how the characteristics of the sound can be connected to concepts such as the clarity of the ad and its message.

Matthew K. Prockup is currently a scientist at Pandora working on methods and tools for Music Information Retrieval at scale. He received his Ph.D. in Electrical Engineering from Drexel University. His research interests span a wide scope of topics including audio signal processing, machine learning, and human-computer interaction. He is also an avid percussionist and composer, having performed in and composed for various ensembles large and small. Puya (Hossein) Vahabi is a senior research scientist at Pandora working on audio/video computational advertising. Before Pandora, he was a research scientist at Yahoo Labs and a research associate of the Italian National Research Council for many years. He has a PhD in CS, and his background is in computational advertising, graph mining and information retrieval.

Invited Talk: Bill Freeman. Sight and Sound 

William T. Freeman is the Thomas and Gerd Perkins Professor of Electrical Engineering and Computer Science at MIT. His current research interests include motion re-rendering, computational photography, and learning for vision. He received outstanding paper awards at computer vision or machine learning conferences in 1997, 2006, 2009 and 2012, and recently won “test of time” awards for papers written in 1991 and 1995. Previous research topics include steerable filters and pyramids, the generic viewpoint assumption, color constancy, bilinear models for separating style and content, and belief propagation in networks with loops. He holds 30 patents.

15 Ivan Bocharov, Bert de Vries, and Tjalling Tjalkens. K-shot Learning of Acoustic Context (slides, BibTeX)

In order to personalize the behavior of hearing aid devices in different acoustic scenes, we need personalized acoustic scene classifiers. Since we cannot afford to burden an individual hearing aid user with the task to collect a large acoustic database, we will want to train an acoustic scene classifier on one in-situ recorded waveform (of a few seconds duration) per class. In this paper we develop a method that achieves high levels of classification accuracy from a single recording of an acoustic scene.

18 Aren Jansen, Manoj Plakal, Ratheet Pandya, Daniel P. W. Ellis, Shawn Hershey, Jiayang Liu, R. Channing Moore, Rif A. Saurous. Towards Learning Semantic Audio Representations from Unlabeled Data (slides, BibTeX)

Our goal is to learn semantically structured audio representations without relying on categorically labeled data. We consider several class-agnostic semantic constraints that are inherent to non-speech audio: (i) sound categories are invariant to additive noise and translations in time, (ii) mixtures of two sound events inherit the categories of the constituents, and (iii) the categories of events in close temporal proximity in a single recording are likely to be the same or related. We apply these constraints to sample training data for triplet-loss embedding models using a large unlabeled dataset of YouTube soundtracks. The resulting low-dimensional representations provide both greatly improved query-by-example retrieval performance and reduced labeled data and model complexity requirements for supervised sound classification.
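Constraint-based sampling like the above feeds a standard hinge triplet loss, which can be written down in a few lines (the vectors below are toy stand-ins for learned embeddings):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge triplet loss: pull the anchor toward the positive embedding
    and push it away from the negative one by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

# Constraint (i) in numbers: an anchor and its noisy/translated version
# (positive) versus an unrelated clip (negative).
anchor = np.array([1.0, 0.0])
positive = anchor + 0.1            # same sound, slightly perturbed
negative = np.array([-1.0, 2.0])   # different recording
print(triplet_loss(anchor, positive, negative))          # 0.0: already separated
print(triplet_loss(anchor, positive, anchor + 0.2) > 0)  # True: negative too close
```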

23 Yunpeng Li, Ivan Kiskin, Davide Zilli, Marianne Sinka, Henry Chan, Kathy Willis, Stephen J Roberts. Cost-sensitive detection with variational autoencoders for environmental acoustic sensing (slides, BibTeX)

Environmental acoustic sensing involves the retrieval and processing of audio signals to better understand our surroundings. While large-scale acoustic data make manual analysis infeasible, they provide a suitable playground for machine learning approaches. Most existing machine learning techniques developed for environmental acoustic sensing do not provide flexible control of the trade-off between the false positive rate and the false negative rate. This paper presents a cost-sensitive classification paradigm, in which the hyper-parameters of classifiers and the structure of variational autoencoders are selected in a principled Neyman-Pearson framework. We examine the performance of the proposed approach using a dataset from the HumBug project which aims to detect the presence of mosquitoes using sound collected by simple embedded devices.
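The Neyman-Pearson principle mentioned above can be sketched as threshold selection: maximize the true-positive rate subject to a cap on the false-positive rate. A toy illustration with hypothetical detector scores (not HumBug data):

```python
import numpy as np

def np_threshold(scores_neg, scores_pos, max_fpr=0.1):
    """Pick the detection threshold with the highest true-positive rate
    subject to a false-positive-rate cap (Neyman-Pearson criterion)."""
    candidates = np.sort(np.concatenate([scores_neg, scores_pos]))
    best_t, best_tpr = None, -1.0
    for t in candidates:
        fpr = np.mean(scores_neg >= t)
        tpr = np.mean(scores_pos >= t)
        if fpr <= max_fpr and tpr > best_tpr:
            best_t, best_tpr = t, tpr
    return best_t, best_tpr

neg = np.array([0.1, 0.2, 0.3, 0.4, 0.9])   # scores for non-mosquito clips
pos = np.array([0.5, 0.6, 0.8, 0.95])       # scores for mosquito clips
t, tpr = np_threshold(neg, pos, max_fpr=0.2)
print(t, tpr)  # 0.5 1.0
```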

Discussion Panel: Sepp Hochreiter (Johannes Kepler University Linz), Bo Li (Google), Karen Livescu (Toyota Technological Institute at Chicago), Arindam Mandal (Amazon Alexa), Oriol Nieto (Pandora), Malcolm Slaney (Google), Hendrik Purwins (Aalborg University Copenhagen). Machine learning and audio signal processing: State of the art and future perspectives 

How can end-to-end audio processing be further optimized? How can an audio processing system be built that generalizes across domains, in particular different languages, music styles, or acoustic environments? How can complex musical hierarchical structure be learned? How can we use machine learning to build a music system that is able to react in the same way an improvisation partner would? Can we build a system that could put a composer in the role of a perceptual engineer?

Sepp Hochreiter is the head of the Institute of Bioinformatics at the Johannes Kepler University of Linz. Previously, he was at the Technical University of Berlin, at the University of Colorado at Boulder, and at the Technical University of Munich. Sepp Hochreiter has made numerous contributions in the fields of machine learning and bioinformatics. He developed the long short-term memory (LSTM), widely considered a milestone in the timeline of machine learning. He applied biclustering methods to drug discovery and toxicology. He extended support vector machines to handle kernels that are not positive definite with the “Potential Support Vector Machine” (PSVM) model, and applied this model to feature selection, especially to gene selection for microarray data. Also in biotechnology, he developed “Factor Analysis for Robust Microarray Summarization” (FARMS).

Arindam Mandal is a Senior Manager in machine learning at Amazon and has worked on speech-to-text. He graduated from the University of Washington.

Bo Li is a research scientist on the Google Speech team. He received his Ph.D. from the School of Computing at the National University of Singapore in 2014. He has been actively working on deep learning based robust speech recognition and is one of the main contributors to Google’s Voice Search and Home ASR models.

Malcolm Slaney is a research scientist in the AI for machine hearing group at Google. He is also an Adjunct Professor at Stanford CCRMA (pronounced karma), and an affiliate professor in the Electrical Engineering Department at University of Washington. He has worked on auditory perception problems for many other companies, including Apple, IBM, Yahoo! and Microsoft Research. He can’t think of anything else to say for his fourth sentence.

Panel (from left: A. Mandal, K. Livescu, B. Li, S. Hochreiter, M. Slaney; photo: O. Nieto)


Speech Source Separation


34 Lijiang Guo and Minje Kim. Bitwise Source Separation on Hashed Spectra: An Efficient Posterior Estimation Scheme Using Partial Rank Order Metrics (poster, BibTeX)

This paper proposes an efficient bitwise solution to the single-channel source separation task. Most dictionary-based source separation algorithms rely on iterative update rules during the run time, which becomes computationally costly especially when we employ an overcomplete dictionary and sparse encoding that tend to give better separation results. To avoid this cost we propose a bitwise scheme on hashed spectra that leads to an efficient posterior probability calculation. For each source, the algorithm uses a partial rank order metric to extract robust features that form a binarized dictionary of hashed spectra. Then, for a mixture spectrum, its hash code is compared with each source’s hashed dictionary in one pass. This simple voting-based dictionary search allows a fast and iteration-free estimation of ratio masking at each bin of a signal spectrogram. We verify that the proposed BitWise Source Separation (BWSS) algorithm produces sensible source separation results for the single-channel speech denoising task, with 6-8 dB mean SDR. To our knowledge, this is the first dictionary-based algorithm for this task that is completely iteration-free in both training and testing.
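The partial rank order idea is closely related to winner-take-all hashing: codes built from the relative order of spectral bins are unchanged by any monotonic rescaling of magnitudes. A sketch of that robustness property (a simplified stand-in for the paper's actual hashing scheme):

```python
import numpy as np

def wta_hash(spectrum, perms):
    """Winner-take-all hash: for each permutation, keep the argmax position
    among the first few permuted bins -- a partial rank order code."""
    return np.array([np.argmax(spectrum[p]) for p in perms])

rng = np.random.default_rng(0)
perms = [rng.permutation(16)[:4] for _ in range(8)]   # 8 hash functions
spec = rng.random(16)                                 # toy magnitude spectrum
# Rank-order codes are invariant to monotonic rescaling (e.g. squaring):
print(np.array_equal(wta_hash(spec, perms), wta_hash(spec ** 2, perms)))  # True
```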

33 Minje Kim and Paris Smaragdis. Bitwise Neural Networks for Efficient Single-Channel Source Separation (BibTeX)

We present Bitwise Neural Networks (BNN) as an efficient hardware-friendly solution to single-channel source separation tasks in resource-constrained environments. In the proposed BNN system, we replace all the real-valued operations during the feedforward process of a Deep Neural Network (DNN) with bitwise arithmetic (e.g. the XNOR operation between bipolar binaries in place of multiplications). Thanks to the fully bitwise run-time operations, the BNN system can serve as an alternative solution where efficient real-time processing is critical, for example real-time speech enhancement in embedded systems. Furthermore, we also propose a binarization scheme to convert the input signals into bit strings so that the BNN parameters learn the Boolean mapping between input binarized mixture signals and their target Ideal Binary Masks (IBM). Experiments on the single-channel speech denoising tasks show that the efficient BNN-based source separation system works well with an acceptable performance loss compared to a comprehensive real-valued network, while consuming a minimal amount of resources.
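The core arithmetic trick can be verified in a few lines: for bipolar (+1/-1) values, elementwise multiplication is an XNOR of sign bits, so a dot product reduces to popcount arithmetic. A sketch:

```python
import numpy as np

# For bipolar values, dot = matches - mismatches = 2 * matches - n,
# where "matches" is the popcount of the XNOR of the sign bits.
rng = np.random.default_rng(0)
a = rng.choice([-1, 1], size=64)
b = rng.choice([-1, 1], size=64)

bits_a, bits_b = a > 0, b > 0
matches = np.sum(~(bits_a ^ bits_b))        # XNOR, then popcount
print(2 * int(matches) - 64 == int(a @ b))  # True
```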

6 Mohit Dubey, Garrett Kenyon, Nils Carlson and Austin Thresher. Does Phase Matter For Monaural Source Separation? (poster, BibTeX)

The “cocktail party” problem of fully separating multiple sources from a single channel audio waveform remains unsolved. Current biological understanding of neural encoding suggests that phase information is preserved and utilized at every stage of the auditory pathway. However, current computational approaches primarily discard phase information in order to mask amplitude spectrograms of sound. In this paper, we seek to address whether preserving phase information in spectral representations of sound provides better results in monaural separation of vocals from a musical track by using a neurally plausible sparse generative model. Our results demonstrate that preserving phase information reduces artifacts in the separated tracks, as quantified by the signal to artifact ratio (GSAR). Furthermore, our proposed method achieves state-of-the-art performance for source separation, as quantified by a mean signal to interference ratio (GSIR) of 19.46.

Speech Enhancement


31: Rasool Fakoor, Xiaodong He, Ivan Tashev and Shuayb Zarar. Reinforcement Learning To Adapt Speech Enhancement to Instantaneous Input Signal Quality (BibTeX)

Today, the optimal performance of existing noise-suppression algorithms, both data-driven and those based on classic statistical methods, is range bound to specific levels of instantaneous input signal-to-noise ratios. In this paper, we present a new approach to improve the adaptivity of such algorithms enabling them to perform robustly across a wide range of input signal and noise types. Our methodology is based on the dynamic control of algorithmic parameters via reinforcement learning. Specifically, we model the noise-suppression module as a black box, requiring no knowledge of the algorithmic mechanics except a simple feedback from the output. We utilize this feedback as the reward signal for a reinforcement-learning agent that learns a policy to adapt the algorithmic parameters for every incoming audio frame (16 ms of data). Our preliminary results show that such a control mechanism can substantially increase the overall performance of the underlying noise-suppression algorithm; 42% and 16% improvements in output SNR and MSE, respectively, when compared to no adaptivity.

35: Jong Hwan Ko, Josh Fromm, Matthai Phillipose, Ivan Tashev and Shuayb Zarar. Precision Scaling of Neural Networks for Efficient Audio Processing (BibTeX)

While deep neural networks have shown powerful performance in many audio applications, their large computation and memory demand has been a challenge for real-time processing. In this paper, we study the impact of scaling the precision of neural networks on the performance of two common audio processing tasks, namely, voice-activity detection and single-channel speech enhancement. We determine the optimal pair of weight/neuron bit precision by exploring its impact on both the performance and processing time. Through experiments conducted with real user data, we demonstrate that deep neural networks that use lower bit precision significantly reduce the processing time (up to 30x). However, their performance impact is low (< 3.14%) only in the case of classification tasks such as those present in voice activity detection.
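The weight-precision scaling studied here can be sketched with a uniform symmetric quantizer; as the bit width shrinks, the quantization error grows (the quantizer below is a generic illustration, not the paper's exact scheme):

```python
import numpy as np

def quantize(w, bits):
    """Uniform symmetric quantization of weights to the given bit width."""
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1000)                # toy weight vector
errs = {b: np.mean((w - quantize(w, b)) ** 2) for b in (8, 4, 2)}
for b in (8, 4, 2):
    print(b, float(errs[b]))  # mean squared error grows as precision shrinks
```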

20: Marius Paraschiv, Lasse Borgholt, Tycho Tax, Marco Singh and Lars Maaløe. Exploiting Nontrivial Connectivity for Automatic Speech Recognition (BibTeX)

Nontrivial connectivity has allowed the training of very deep networks by addressing the problem of vanishing gradients and offering a more efficient method of reusing parameters. In this paper we make a comparison between residual networks, densely-connected networks and highway networks on an image classification task. Next, we show that these methodologies can easily be deployed into automatic speech recognition and provide significant improvements to existing models.

1 Brian McMahan and Delip Rao. Listening to the World Improves Speech Command Recognition (BibTeX)

In this paper, we present a study on transfer learning in convolutional network architectures for recognizing environmental sound events and speech commands. Our primary contribution is to show that representations learned for environmental sound classification can be used to significantly improve accuracies for the unrelated, voice-focused task of speech command recognition. Our second contribution is a simple multiscale input representation that uses dilated convolutions to aggregate larger contexts and increase classification performance. Our third and final contribution is a demonstration of an interaction effect between transfer learning and the multiscale input representations. For different versions of the speech command dataset, the pre-trained networks with multiscale inputs can be trained with only 50%-75% of the speech command training data and achieve accuracies similar to non-pre-trained, non-multiscale networks trained with 100% of the training data.

7: Andros Tjandra, Sakriani Sakti and Satoshi Nakamura. End-to-End Speech Recognition with Local Monotonic Attention (poster, BibTeX)

Most attention mechanisms in sequence-to-sequence models are based on a global attention property, which requires computing a weighted summarization of the whole input sequence generated by the encoder states. However, this is computationally expensive and often produces misalignment on longer input sequences. Furthermore, it does not fit the monotonic, left-to-right nature of the speech recognition task. In this paper, we propose a novel attention mechanism that has local and monotonic properties. Various ways to control these properties are also explored. Experimental results demonstrate that encoder-decoder based ASR with local monotonic attention can achieve significant performance improvements and reduce computational complexity in comparison with a standard global attention architecture.
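The local property can be sketched as a softmax restricted to a window around a center position that only moves forward across decoding steps (the hard window below is a simplification of the paper's mechanism):

```python
import numpy as np

def local_attention(scores, center, width=2):
    """Softmax attention restricted to a local window around a predicted
    (monotonically increasing) center position; weights elsewhere are zero."""
    lo, hi = max(0, center - width), min(len(scores), center + width + 1)
    w = np.zeros_like(scores)
    e = np.exp(scores[lo:hi] - np.max(scores[lo:hi]))  # stable softmax
    w[lo:hi] = e / e.sum()
    return w

scores = np.linspace(0.0, 1.0, 10)           # toy encoder-state scores
w = local_attention(scores, center=4, width=2)
print(np.flatnonzero(w).tolist())  # [2, 3, 4, 5, 6] -- only a local window
print(np.isclose(w.sum(), 1.0))    # True
```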

17: Sri Harsha Dumpala, Rupayan Chakraborty and Sunil Kumar Kopparapu. A novel approach for effective learning in low resourced scenarios (poster, BibTeX)

Deep learning based discriminative methods, the state-of-the-art machine learning techniques, are ill-suited for learning from small amounts of data. In this paper, we propose a novel framework, called simultaneous two-sample learning (s2sL), to effectively learn the class-discriminative characteristics even from very little data. In s2sL, more than one sample (here, two samples) is considered simultaneously to both train and test the classifier. We demonstrate our approach for speech/music discrimination and emotion classification through experiments. Further, we also show the effectiveness of the s2sL approach for classification in low-resource scenarios and for imbalanced data.

Speech Synthesis


29 Yuxuan Wang, RJ Skerry-Ryan, Ying Xiao, Daisy Stanton, Joel Shor, Eric Battenberg, Rob Clark and Rif A. Saurous. Uncovering Latent Style Factors for Expressive Speech Synthesis (poster, BibTeX)

Prosodic modeling is a core problem in speech synthesis. The key challenge is producing desirable prosody from textual input containing only phonetic information. In this preliminary study, we introduce the concept of “style tokens” in Tacotron, a recently proposed end-to-end neural speech synthesis model. Using style tokens, we aim to extract independent prosodic styles from training data. We show that without annotation data or an explicit supervision signal, our approach can automatically learn a variety of prosodic variations in a purely data-driven way. Importantly, each style token corresponds to a fixed style factor regardless of the given text sequence. As a result, we can control the prosodic style of synthetic speech in a somewhat predictable and globally consistent way.
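The style-token mechanism amounts to a softmax-weighted combination of a small bank of learned embeddings, which then conditions the synthesizer. A sketch with random stand-in tokens (the bank size and dimensions are made up):

```python
import numpy as np

def style_embedding(attn_logits, tokens):
    """Combine a bank of learned style tokens with softmax attention
    weights; the weighted sum conditions the synthesizer on a style."""
    e = np.exp(attn_logits - np.max(attn_logits))  # stable softmax
    weights = e / e.sum()
    return weights @ tokens

rng = np.random.default_rng(0)
tokens = rng.standard_normal((10, 256))   # 10 global style tokens
logits = np.zeros(10)
logits[3] = 5.0                           # attend mostly to token 3
emb = style_embedding(logits, tokens)
print(emb.shape)  # (256,)
```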

13 Younggun Lee, Azam Rabiee and Soo-Young Lee. Emotional End-to-End Neural Speech Synthesizer (BibTeX)

In this paper, we introduce an emotional speech synthesizer based on the recent end-to-end neural model Tacotron. Despite its benefits, we found that the original Tacotron suffers from the exposure bias problem and from irregularity of the attention alignment. We address these problems by utilizing a context vector and residual connections at the recurrent neural networks (RNNs). Our experiments show that the model can successfully generate speech for given emotion labels.


Music

14 Jordi Pons, Oriol Nieto, Matthew Prockup, Erik M. Schmidt, Andreas F. Ehmann and Xavier Serra. End-to-end learning for music audio tagging at scale (poster, BibTeX)

The lack of data tends to limit the outcomes of deep learning research – especially when dealing with end-to-end learning stacks processing raw data such as waveforms. In this study we make use of musical labels annotated for 1.2 million tracks. This large amount of data allows us to unrestrictedly explore different front-end paradigms: from assumption-free models – using waveforms as input with very small convolutional filters – to models that rely on domain knowledge – log-mel spectrograms processed by a convolutional neural network designed to learn temporal and timbral features. Results suggest that while spectrogram-based models surpass their waveform-based counterparts, the difference in performance shrinks as more data are employed.
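The "assumption-free" end of the spectrum can be illustrated with a single strided convolution over the raw waveform using very small filters. This is a minimal NumPy sketch of the paradigm, not the authors' architecture; the filter count, filter length, and stride below are illustrative assumptions:

```python
import numpy as np

def waveform_frontend(x, filters, stride):
    """Assumption-free front-end: strided 1-D convolution of the raw
    waveform with very small filters, followed by ReLU (a sketch of the
    paradigm, not the paper's exact model)."""
    k = filters.shape[1]                                   # tiny filter length
    frames = [x[i:i + k] for i in range(0, len(x) - k + 1, stride)]
    return np.maximum(np.array(frames) @ filters.T, 0.0)   # (n_frames, n_channels)

rng = np.random.default_rng(0)
x = rng.normal(size=1000)           # raw waveform (illustrative length)
filters = rng.normal(size=(4, 3))   # 4 learned filters of only 3 samples each
features = waveform_frontend(x, filters, stride=3)
```

In the domain-knowledge paradigm, this learned front-end would instead be replaced by a fixed log-mel spectrogram computation feeding the network.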

27 Jongpil Lee, Taejun Kim, Jiyoung Park and Juhan Nam. Raw Waveform-based Audio Classification Using Sample-level CNN Architectures (poster, BibTeX)

Music, speech, and acoustic scene sounds are often handled separately in the audio domain because of their different signal characteristics. However, as the image domain rapidly develops versatile image classification models, it is necessary to study similarly extensible classification models in the audio domain. In this study, we approach this problem using two types of sample-level deep convolutional neural networks that take raw waveforms as input and use filters with small granularity. One is a basic model that consists of convolution and pooling layers. The other is an improved model that additionally has residual connections, squeeze-and-excitation modules and multi-level concatenation. We show that the sample-level models reach state-of-the-art performance levels for the three different categories of sound. We also visualize the filters along the layers and compare the characteristics of the learned filters.
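One of the improved model's additions, the squeeze-and-excitation module, reweights channels by a globally pooled gating signal. The following is a hedged NumPy sketch for a 1-D (channels x time) feature map; the shapes, reduction ratio, and the helper name `squeeze_excite` are assumptions for illustration, not the paper's code:

```python
import numpy as np

def squeeze_excite(feature_map, w1, w2):
    """Squeeze-and-excitation block for a (channels x time) feature map.
    w1 and w2 are the two learned projections of the excitation MLP."""
    z = feature_map.mean(axis=1)                 # squeeze: per-channel global average
    h = np.maximum(w1 @ z, 0.0)                  # bottleneck + ReLU
    gates = 1.0 / (1.0 + np.exp(-(w2 @ h)))      # sigmoid channel gates in (0, 1)
    return feature_map * gates[:, None]          # rescale each channel

rng = np.random.default_rng(0)
fmap = rng.normal(size=(8, 32))   # 8 channels, 32 time steps (illustrative)
w1 = rng.normal(size=(2, 8))      # reduction ratio 4 (assumed)
w2 = rng.normal(size=(8, 2))
out = squeeze_excite(fmap, w1, w2)
```

Since each gate lies in (0, 1), the block can only attenuate channels, letting the network emphasize informative filters per input.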

38 Alfonso Perez-Carrillo, Hendrik Purwins. Estimation of violin bowing features from Audio recordings with Convolutional Networks (poster,BibTeX)

The acquisition of musical gestures, and particularly of instrument controls, from a musical performance is a field of increasing interest with applications in many research areas. In recent years, the development of novel sensing technologies has allowed fine measurement of such controls. However, the acquisition process usually involves expensive sensing systems and complex setups that are generally intrusive in practice. An alternative to direct acquisition is analysis of the audio signal. Such indirect acquisition has many advantages, including the simplicity and low cost of the acquisition and its non-intrusive nature. The main challenge is designing robust detection algorithms that are as accurate as the direct approaches. In this paper, we present an indirect acquisition method to estimate violin bowing controls from audio signal analysis, based on training Convolutional Neural Networks with a database of multimodal data (bowing controls and sound features) of violin performances.

Environmental Sounds


3 Benjamin Elizalde, Rohan Badlani, Ankit Shah, Anurag Kumar, and Bhiksha Raj. NELS - Never-Ending Learner of Sounds (poster, BibTeX)

Sounds are essential to how humans perceive and interact with the world. These sounds are captured in recordings and shared on the Internet on a minute-by-minute basis. These recordings, which are predominantly videos, constitute the largest archive of sounds we’ve ever seen. However, most of these recordings have undescribed content, making methods for automatic audio content analysis, indexing and retrieval necessary. These methods have to address multiple challenges, such as the relation between sounds and language, numerous and diverse sound classes, and large-scale evaluation. We propose a system that continuously learns relations between sounds and language from the web, improves its sound recognition models over time, and evaluates its learning competency at large scale without references. We introduce the Never-Ending Learner of Sounds (NELS), a project for continuous learning of sounds and their associated knowledge, available online at nels.cs.cmu.edu.

30 Tycho Tax, Jose Antich, Hendrik Purwins and Lars Maaløe. Utilizing Domain Knowledge in End-to-End Audio Processing (BibTeX)

End-to-end neural network based approaches to audio modelling are generally outperformed by models trained on high-level data representations. In this paper we present preliminary work that shows the feasibility of training the first layers of a deep convolutional neural network (CNN) model to learn the commonly-used log-scaled mel-spectrogram transformation. Further, we demonstrate that upon initializing the first layers of an end-to-end CNN classifier with the learned transformation, convergence and performance on the ESC-50 environmental sound classification dataset are similar to a CNN-based model trained on the highly pre-processed log-scaled mel-spectrogram features.
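The fixed transformation that the first CNN layers are trained to approximate can be written down directly. Below is a minimal NumPy sketch of a log-scaled mel spectrogram given a power spectrogram; the filterbank here is a random stand-in, whereas in practice it would be a precomputed triangular mel filterbank, and all dimensions are illustrative assumptions:

```python
import numpy as np

def log_mel(power_spec, mel_fb, eps=1e-6):
    """Log-scaled mel spectrogram: the fixed transformation whose behaviour
    the paper trains the first CNN layers to approximate.

    power_spec: (n_fft_bins, n_frames) |STFT|^2 frames
    mel_fb:     (n_mels, n_fft_bins) mel filterbank (assumed precomputed)
    """
    return np.log(mel_fb @ power_spec + eps)   # (n_mels, n_frames)

rng = np.random.default_rng(0)
power_spec = rng.random((257, 40))   # illustrative spectrogram
mel_fb = rng.random((40, 257))       # stand-in for a real triangular mel filterbank
features = log_mel(power_spec, mel_fb)
```

Since this is just a linear map followed by a pointwise log, it is plausible for early convolutional layers with a compressive nonlinearity to approximate it, which is the feasibility the paper investigates.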

4 Anurag Kumar and Bhiksha Raj. Deep CNN Framework for Audio Event Recognition using Weakly Labeled Web Data (BibTeX)

The development of audio event recognition models requires labeled training data, which are generally hard to obtain. One promising source of recordings of audio events is the large amount of multimedia data on the web. In particular, if the audio content analysis must itself be performed on web audio, it is important to train the recognizers themselves from such data. Training from these web data, however, poses several challenges, the most important being the availability of labels: labels, if any, that may be obtained for the data are generally weak, and not of the kind conventionally required for training detectors or classifiers. We propose a robust and efficient deep convolutional neural network (CNN) based framework to learn audio event recognizers from weakly labeled data. The proposed method can train from and analyze recordings of variable length in an efficient manner and outperforms a network trained with strongly labeled web data by a considerable margin.
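Weak labels only say whether an event occurs somewhere in a recording, so a common way to handle variable-length input is to score fixed-length segments and pool them into one recording-level score per class. The sketch below uses max pooling, a standard multiple-instance-learning choice shown here for illustration, not necessarily the authors' exact aggregation:

```python
import numpy as np

def recording_level_scores(segment_scores):
    """Weak-label aggregation: the CNN scores each fixed-length segment of a
    variable-length recording; pooling maps them to one score per event
    class. Max pooling is one common multiple-instance choice (sketch).

    segment_scores: (n_segments, n_classes)
    """
    return segment_scores.max(axis=0)          # (n_classes,)

# 3 segments, 2 event classes (illustrative scores)
scores = np.array([[0.1, 0.9],
                   [0.3, 0.2],
                   [0.8, 0.1]])
rec = recording_level_scores(scores)   # → array([0.8, 0.9])
```

Because the pooling is over however many segments the recording contains, the same network handles recordings of any length, matching the variable-length training the abstract describes.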