WaveGlow
Background
Since WaveNet was proposed in 2016, we have moved much closer to human-level speech generation. In SOTA end-to-end text-to-speech (TTS) systems, the pipeline is commonly split into two parts: text2feature and feature2wav, with the mel-spectrogram often serving as the feature that connects them.
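To make the split concrete, here is a minimal sketch of extracting the mel-spectrogram that bridges the two stages. It assumes librosa is installed and that `speech.wav` is a hypothetical local file:

```python
import librosa

# Extract the "feature" that text2feature must predict
# and feature2wav must turn back into audio.
y, sr = librosa.load("speech.wav", sr=22050)                 # raw 1-D waveform
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)  # shape: (80, n_frames)

# text2feature: text -> mel (e.g. Tacotron 2)
# feature2wav:  mel  -> waveform (the VoCoder, e.g. WaveNet or WaveGlow)
print(mel.shape)
```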
For the feature2wav part (also called the Voice Encoder, or VoCoder for short), even though WaveNet has been available on Google Cloud as a service for a long time, the area is still worth further exploration based on these observations:
- It is a high-dimensional sampling problem (or 1-D, but with millions of sample points). High-quality audio requires an extremely high sampling rate (20 kHz+), which raises the bar for a generative model to work (think of a toy GAN generating 100x100 images versus GANs that generate 4K images); see the back-of-the-envelope comparison after this list.
- Speed is a big concern when deploying SOTA work online: the original WaveNet is purely auto-regressive and takes hours to generate a few seconds of audio. Things are getting better, but considerable cost and engineering burden remain compared to classical parametric and concatenative approaches.
- Interestingly, GANs do not dominate the field of VoCoders; instead, recent works show the rise of flow-based generative models. Why do they outperform GANs, at least for now? :)
- Multilingual TTS is another story, even if you have done well on English datasets.
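As a back-of-the-envelope check on the first observation, the illustrative numbers below compare the dimensionality of a short audio clip with that of a toy GAN image:

```python
sr = 22050                     # a common TTS sampling rate (samples per second)
seconds = 5
audio_dim = sr * seconds       # 110,250 sample points for a 5-second clip
image_dim = 100 * 100          # 10,000 pixels for a toy 100x100 GAN image
print(audio_dim / image_dim)   # ~11x more dimensions than the toy image
```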
In this post, we will focus on WaveGlow: introduce the work, provide a TensorRT implementation, and look for ways to optimize its training phase, which may take months if you train from scratch.
For more background on flow-based models, please check Dr. Weng's post and Prof. Lee's video lecture in addition to the original papers.
Moreover, for years there has been a trend of trying to replace recurrent neural networks for sequence processing (due to how hard they are to train and parallelize). Google prefers attention modules and Facebook prefers convolutional layers for this job; the former path led to Attention Is All You Need, which inspired well-known language models like BERT, while the latter led to Conv Seq2Seq and inspired more fully convolutional approaches to solving various problems. We follow the second path in this post.
Fast Neural VoCoder for Text-to-Speech
WaveGlow, proposed by the Nvidia ADLR team in late 2018 and published at ICASSP 2019, is a parallelization-friendly, fast neural VoCoder. As its name suggests, the model draws on WaveNet (the VoCoder) and Glow (the SOTA flow-based generative model): it uses a WaveNet-like structure as a single flow step, stacks several of them to form the whole VoCoder, and applies invertible 1x1 convolution layers to fuse cross-channel information.
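To give a feel for the structure, below is a simplified, hypothetical PyTorch sketch of one such flow step: an invertible 1x1 convolution followed by an affine coupling layer. The real model predicts the coupling's scale and shift with a WaveNet-like dilated-convolution stack conditioned on the mel-spectrogram; a plain convolution stands in for it here:

```python
import torch
import torch.nn as nn

class FlowStep(nn.Module):
    """One simplified flow step: 1x1 conv + affine coupling (illustrative only)."""
    def __init__(self, channels):
        super().__init__()
        # Invertible 1x1 convolution: fuses information across channels.
        # Initialized as a random orthogonal matrix so it starts out invertible.
        w = torch.linalg.qr(torch.randn(channels, channels))[0]
        self.conv = nn.Conv1d(channels, channels, kernel_size=1, bias=False)
        self.conv.weight.data = w.unsqueeze(-1)
        # Stand-in for the WaveNet-like conditioning net: predicts a log-scale
        # and a shift for half of the channels from the other half.
        self.net = nn.Conv1d(channels // 2, channels, kernel_size=3, padding=1)

    def forward(self, x):                       # x: (batch, channels, time)
        x = self.conv(x)
        xa, xb = x.chunk(2, dim=1)              # split channels for the coupling
        log_s, t = self.net(xa).chunk(2, dim=1)
        xb = torch.exp(log_s) * xb + t          # affine transform; invertible given xa
        return torch.cat([xa, xb], dim=1)
```

Because every transform is invertible and no output sample depends on previously generated samples, all time steps can be computed in parallel, which is what makes this design fast compared to auto-regressive WaveNet.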
THE INPUT of the model is white noise (a high-dimensional Gaussian sample) with the same size as the output wav, plus the mel-spectrogram (commonly an 80-element vector for each time slice; 1 second is usually sliced into about 80 frames, giving an 80x80 feature matrix to represent 1 second of audio) that controls the transformation from white noise to generated voice.
THE OUTPUT of the model is a 1-D int16-typed array: the raw audio data of the generated human speech.
THE WHOLE PROCESS is like turning air from our lungs (white noise) into real speech (a meaningful wav) in our mouth, controlled by the tongue.
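The inference-time shapes can be sketched as follows. Here `waveglow` is assumed to be a trained model loaded beforehand; `infer(mel, sigma)` follows the naming in Nvidia's reference PyTorch repo, but treat the exact API as an assumption:

```python
import torch

mel = torch.randn(1, 80, 80)   # (batch, mel bands, frames): ~1 second, per the text above
with torch.no_grad():
    # `waveglow` is assumed to be loaded elsewhere; it draws the
    # white-noise input internally, scaled by `sigma`.
    audio = waveglow.infer(mel, sigma=0.6)

# The network itself emits floats in [-1, 1]; quantize to get the 1-D int16 array:
pcm = (audio.squeeze().clamp(-1, 1).cpu().numpy() * 32767).astype("int16")
```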