WaveGlow


March 21, 2020

Background

Since WaveNet was proposed in 2016, we have moved much closer to human-level speech generation. In SOTA end-to-end text-to-speech (TTS) systems, the whole pipeline is commonly split into two parts, text2feature and feature2wav, with the mel-spectrogram often serving as the feature that connects them.
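To make that interface concrete, here is a minimal sketch of computing the connecting mel-spectrogram with librosa. The sample rate, FFT size, hop length, and the file name "sample.wav" are common TTS defaults assumed for illustration, not values taken from the post.

```python
import librosa
import numpy as np

# Load a waveform (hypothetical file) at a typical TTS sample rate.
wav, sr = librosa.load("sample.wav", sr=22050)

# 80-band mel-spectrogram: the feature handed from text2feature to feature2wav.
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))

print(log_mel.shape)  # (80, n_frames) -- the vocoder's conditioning input
```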

For the feature2wav part (also called the voice encoder, or VoCoder for short), even though WaveNet has been available on Google Cloud as a service for a long time, it is still interesting to explore further based on these observations:

In this post, we will focus on WaveGlow: introduce the work, provide a TensorRT implementation, and look for ways to optimize its training phase, which may take months if you try to train from scratch.

For more background knowledge on flow-based models, please check Dr. Weng's post and Prof. Lee's video lecture in addition to the original papers.

Moreover, there has been a trend for years of trying to replace recurrent neural networks for sequence processing (due to the difficulty of training and parallelizing them). Google prefers attention modules and Facebook prefers convolutional layers for this work; the former led to Attention Is All You Need, which inspired well-known language models like BERT, while the latter led to Conv Seq2Seq and inspired more fully convolutional approaches to solving various problems. We follow the second path in this post.

Fast Neural VoCoder for Text-to-Speech

WaveGlow, proposed by the Nvidia ADLR team in late 2018 and published at ICASSP 2019, is a parallelization-friendly, fast neural VoCoder. As its name suggests, the model is inspired by WaveNet (the VoCoder) and Glow (the SOTA flow-based generative model): it uses a WaveNet-like structure as a single flow step, stacks several of them to form the whole VoCoder, and applies invertible 1x1 convolution layers to fuse cross-channel information.
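To illustrate the 1x1 convolution idea, below is a minimal PyTorch sketch of an invertible 1x1 convolution layer in the spirit of Glow/WaveGlow. It is an illustrative re-implementation under the usual flow-based formulation, not the code from the official repository.

```python
import torch


class Invertible1x1Conv(torch.nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Initialize the kernel as a random orthogonal matrix so the layer
        # is invertible and its log-determinant is well defined.
        w = torch.linalg.qr(torch.randn(channels, channels))[0]
        if torch.det(w) < 0:
            w[:, 0] = -w[:, 0]  # keep det(W) positive so logdet stays real
        self.conv = torch.nn.Conv1d(channels, channels, kernel_size=1, bias=False)
        self.conv.weight.data = w.unsqueeze(-1)

    def forward(self, z):
        # z: (batch, channels, time); return mixed channels + log-det term
        w = self.conv.weight.squeeze(-1)
        log_det = z.size(0) * z.size(2) * torch.logdet(w)
        return self.conv(z), log_det

    def inverse(self, z):
        # Invert the channel mixing at synthesis time.
        w_inv = torch.inverse(self.conv.weight.squeeze(-1))
        return torch.nn.functional.conv1d(z, w_inv.unsqueeze(-1))
```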

THE INPUT of the model is white noise (a high-dimensional Gaussian) with the same size as the output wav, plus a mel-spectrogram (commonly an 80-element vector per time slice, with 1 second usually sliced into about 80 pieces, giving roughly an 80x80 feature matrix to represent 1 second of audio) that controls the transformation from white noise to the generated voice.

THE OUTPUT of the model is a 1-D int16 array: the raw audio data of the generated human speech.
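As a shape-level illustration of this input/output contract, here is a hedged inference sketch. It assumes a trained model object `waveglow` exposing an `infer(mel, sigma)` method as in Nvidia's reference implementation; the exact signature and the sizes used are assumptions for illustration only.

```python
import numpy as np
import torch

# ~1 second of speech as an 80-band mel-spectrogram (assumed example shape).
mel = torch.randn(1, 80, 80)

# Internally the model starts from white noise the same length as the output
# wav and transforms it under the mel conditioning.
with torch.no_grad():
    audio = waveglow.infer(mel, sigma=0.6)  # float waveform, shape (1, n_samples)

# Convert the float output to the 1-D int16 raw audio described above.
audio_int16 = (audio.squeeze().cpu().numpy() * 32767).astype(np.int16)
```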

THE WHOLE PROCESS is like turning the air from our lungs (white noise) into real speech (a meaningful wav) in our mouth, controlled by the tongue.

Faster Inference by TensorRT

Pain in Training from Scratch