Guidance for Study and Exam Prep
TTS is a relatively complex course. Similar to ASR, it's a task with many sub-tasks, which are often interconnected rather than independent. Also, it's not a classification task like the ones you've likely seen many times before, but a one-to-many generation task. As a result, while preparing this course, I anticipated that you might encounter the following challenges, and I tried to adjust the content based on them. I figured that understanding the motivation behind the slide design might help you understand the content, so I've laid it out below. Since I'm not very familiar with curricula outside of IMS, I'm considering these challenges mainly from the perspective of IMS students.
Generative Models
The intro to deep learning course at the IMS doesn't cover generative models much, but the advanced deep learning course has a topic on them. Considering that 1) advanced DL is not a prerequisite, 2) generative models are widely used in current TTS systems, and 3) students choosing the TTS topic might need to read papers whose models are based on generative models, I think it's important to introduce these model types in more detail.
Maths
Generative models are probabilistic models; they model the uncertainty of a prediction rather than deterministic values. So understanding these models makes it hard to completely avoid topics in probability, statistics, linear algebra, and calculus. Among the four model types we cover, all except GANs aim for a rigorous estimation of the marginal likelihood. I'm aware that some students might lack a math background, so initially I tried to prepare slides without touching on math, but I found that, apart from broadly stating the main ideas, it was difficult to causally relate the various concepts. Presented that way, you'd be forced to memorize concepts, like knowing that the objective of a VAE is the ELBO or that NFs use many invertible functions. But why maximizing the ELBO models the marginal likelihood, or why these functions must be invertible, would remain confusing. Therefore, I decided to include some math in the end.
If you want to better understand these concepts and build a solid foundation in DL, I recommend reading all the slides and exploring the external materials and links provided. However, I understand that for students without math training from their undergraduate studies, fully understanding the concepts would require a significant workload. So, if you just want to pass the exam: the slides marked as asides won't be asked about, and the derivations won't be examined either. The same goes for the formulas: if an objective is asked, you only need to intuitively explain which terms the formula contains and the motivation behind designing each term to get the points.
Of course, if you can logically write down the full formula during the exam without lengthy explanations, you'll also get the points and answer more quickly! You can also try to write it down entirely from memorization, but you'll find that less efficient than understanding the intuition. If you memorize and misplace a subscript, the entire meaning of the formula could change, leading to lost points.
Speech
Speech has unique characteristics, different from images and text. If you lack sufficient understanding of speech and signal processing, you might feel confused when trying to grasp certain model architecture designs, not knowing why they're designed that way. If you do understand the properties of speech, some designs will seem natural, and you won't need to memorize them explicitly. You'll think, "ah sure, that's the way to do it, of course." For example, why does MelGAN use window-based methods? Because sliding windows are common in speech processing. Or why does HiFiGAN need to handle periodicity? Because a speech signal can be decomposed into sinusoidal waves with different periods/frequencies, and the frequencies humans can produce are limited. Without knowing this, you might find these designs odd at first. I tried to include the relevant speech properties when introducing these concepts, but as mentioned in the introduction, having foundational knowledge of speech processing is important for this course.
Of course, if you just need enough credits to graduate and aren't interested in speech itself, you'll still be able to pass the exam; you'll just need more effort to memorize and learn the designs. However, if you're interested in speech technology, learning about phonetics, phonology, and signal processing is crucial (the rest of linguistics is also important for speech understanding tasks, but I assume you already know enough from the Method class or other NLP tasks). So it's worth checking the materials uploaded by Sarina in the Additional Resources folder.
Below are my notes from preparing the TTS part. They're much more concise than the slides and mainly highlight the concepts I consider important. You can use these notes to prepare for the exam. All bullet points marked as advanced are optional. Some of these optional ones were eventually removed from the final slides, but the non-advanced parts should be fully covered. If you learn all the concepts, including the advanced ones, you'll have a better grasp of the content and be better prepared for the project seminar. Skipping them won't affect your exam score, but you'll eventually need them to understand the papers during the project seminar. Simply put, you're not required to understand every single detail of the slides to pass the exam. Enjoy!
Overview:
- different speech synthesis techniques
- the differences between some speech synthesis tasks, e.g., TTS, voice cloning, voice conversion
- tts challenges
- classical tts
- end-to-end tts
Classical TTS Overview
- Text Analysis, Acoustic models, vocoder
- Introduce the input and output, the motivation, the model types
Text Analysis
- normalization
- preprocessing: motivation of preprocessing (non-standard words, NSWs), how to handle NSWs (example solutions: rule-based, neural models)
- tokenization:
- motivation (prepare inputs, ease the other operations), methods (rule-based, neural models, statistics-based methods)
- summary: prepare inputs for other models, ease phonetic analysis, provide prosodic features
- end of sentence prediction: provide prosodic features
- phonetic analysis (G2P)
- motivation, e.g., pronunciation varies for different abbreviations or acronyms
- goal of G2P: turn graphemes into phonemes
- rule-based solution → two issues, e.g., large vocabulary, OOV
- WFST → uses less storage
- neural models → handle OOV (see the toy sketch at the end of this section)
- prosodic analysis
- prosody: what it is, examples, fine-grained prosodic features vs. high-level features
- how to extract prosodic features: signal processing tools, two example applications (how they're used in training, how they're used in dataset construction)
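To make the rule-based vs. neural trade-off above concrete, here is a minimal toy sketch of dictionary-based G2P with a crude fallback for OOV words. The lexicon entries and letter rules are made up for illustration; real systems use full pronunciation lexicons, WFSTs, or neural seq2seq models.

```python
# Toy G2P: dictionary lookup with a trivial letter-to-sound fallback for OOV words.
# The lexicon entries and fallback rules here are illustrative, not a real lexicon.
LEXICON = {
    "speech": ["S", "P", "IY", "CH"],
    "hello": ["HH", "AH", "L", "OW"],
}

# Extremely naive fallback rules (real systems use WFSTs or neural seq2seq models).
LETTER_RULES = {"a": "AH", "e": "EH", "i": "IH", "o": "OW", "u": "UH",
                "s": "S", "p": "P", "h": "HH", "l": "L", "c": "K", "t": "T"}

def g2p(word: str) -> list[str]:
    word = word.lower()
    if word in LEXICON:                      # case 1: exact lexicon hit
        return LEXICON[word]
    # case 2: OOV fallback, map each letter independently (drops all context!)
    return [LETTER_RULES.get(ch, "?") for ch in word]

print(g2p("speech"))   # lexicon hit
print(g2p("spotify"))  # OOV -> crude, context-free guess
```

The OOV fallback ignores context entirely, which is exactly why neural models, which read the whole grapheme sequence, handle OOV words better.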
Acoustic Models
- input and output
- the difference in learning objectives between ASR and TTS
Basic DNN synthesis
- input and output representation (vectorized i/o)
- how to prepare the input
- alignment problem
- a naive solution: simple upsampling
- the problem of this solution: not every phone takes the same number of output frames, so we need a duration model (see the sketch after this list)
- duration model: how to train this model
- how to prepare the labels (forced alignment is needed)
- how to prepare the inputs
- what is the task
- advanced solutions: forced alignment involved
- HMM-GMM ASR acoustic model + searching algorithm
- model implicitly learns the alignment: attention-based mechanisms, or duration prediction trained together with the TTS model (covered in Tacotron and FastSpeech later)
- weakness of DNN model
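As a sketch of the alignment discussion above, here is the naive uniform upsampling next to duration-based upsampling, assuming toy feature vectors and hard-coded durations (in training, the duration targets would come from forced alignment):

```python
import numpy as np

# 4 phones, each a 3-dim linguistic feature vector (toy values).
phone_feats = np.arange(12, dtype=float).reshape(4, 3)

# Naive upsampling: every phone gets the same fixed number of frames.
uniform = np.repeat(phone_feats, 5, axis=0)          # (20, 3)

# Duration-based upsampling: a duration model predicts frames per phone
# (hard-coded here; in training the targets come from forced alignment).
durations = np.array([3, 8, 2, 6])
aligned = np.repeat(phone_feats, durations, axis=0)  # (19, 3)

print(uniform.shape, aligned.shape)
```

The duration-based variant is essentially what FastSpeech's length regulator does, as we'll see later.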
RNN, Tacotron
- overview of the structure, input/output, wave reconstruction algorithm (Griffin-Lim)
- what is autoregressive (AR) generation (see the decoding sketch after this list)
- important designs
- encoder-attention-decoder architecture
- encoder
- k-gram modeling in CBHG blocks
- contextualized representation using RNN
- attention mechanism to handle the alignment problem
- bottleneck in the prenet to make attention easier to learn
- decoder
- predicting multiple spectrogram frames per step
- stop-flag
- post-processing net to smooth the output
- others: Tacotron 2
- simplified building blocks
- advanced vocoder rather than griffin-lim
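To make the AR idea and the stop-flag tangible, here is a schematic decoding loop. The decoder step is a dummy stand-in (the real Tacotron step involves the prenet, attention, and RNN state); everything here is fake, toy data:

```python
import numpy as np

rng = np.random.default_rng(0)

def decoder_step(prev_frame, state):
    """Dummy stand-in for one Tacotron decoder step: returns the next
    mel frame, a stop probability, and the updated recurrent state."""
    frame = rng.normal(size=80)            # fake 80-bin mel frame
    stop_prob = state / 10.0               # fake: stopping gets likelier over time
    return frame, stop_prob, state + 1

frame, state, frames = np.zeros(80), 0, []  # start from an all-zero <GO> frame
for _ in range(1000):                       # hard cap on decoding length
    frame, stop_prob, state = decoder_step(frame, state)
    frames.append(frame)
    if stop_prob > 0.5:                     # stop-flag ends generation
        break

mel = np.stack(frames)                      # (T, 80) spectrogram, T decided online
print(mel.shape)
```

Note how each step consumes the previous output: that dependency is what makes AR inference slow, and what NAR models like FastSpeech remove.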
Transformer, FastSpeech
- motivation, input, output
- what is non-autoregressive (NAR) generation
- FastSpeech: important designs
- length regulator to handle the alignment problem
- duration prediction: how to find ground-truth labels (monotonic alignment search on the encoder-decoder attention matrix, which is provided by another AR model)
- FastSpeech2: important design
- motivation, how to mitigate mode-collapse
- variance adaptor: introduce the prosodic feature each adaptor focuses on, its ground-truth labels, and its training objective (see the predictor sketch after this list)
- duration predictor: removes the need for a stop-flag
- pitch predictor
- energy predictor
- weakness: they are deterministic models, so the outputs lack variation
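For the variance adaptor above, here is a rough sketch of the small predictor module shared by the duration/pitch/energy predictors in the FastSpeech 2 style. The layer sizes are illustrative, not the paper's exact hyperparameters:

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """Sketch of a FastSpeech 2-style predictor used for duration, pitch,
    and energy: a small conv stack mapping each encoder hidden state to
    one scalar. Sizes are illustrative."""
    def __init__(self, d_model=256, d_hidden=256, kernel=3, p_drop=0.5):
        super().__init__()
        self.conv1 = nn.Conv1d(d_model, d_hidden, kernel, padding=kernel // 2)
        self.conv2 = nn.Conv1d(d_hidden, d_hidden, kernel, padding=kernel // 2)
        self.norm1 = nn.LayerNorm(d_hidden)
        self.norm2 = nn.LayerNorm(d_hidden)
        self.drop = nn.Dropout(p_drop)
        self.out = nn.Linear(d_hidden, 1)

    def forward(self, h):                    # h: (batch, phones, d_model)
        x = torch.relu(self.conv1(h.transpose(1, 2))).transpose(1, 2)
        x = self.drop(self.norm1(x))
        x = torch.relu(self.conv2(x.transpose(1, 2))).transpose(1, 2)
        x = self.drop(self.norm2(x))
        return self.out(x).squeeze(-1)       # (batch, phones), e.g. log-durations

h = torch.randn(2, 7, 256)                   # 2 sentences, 7 phones each
print(VariancePredictor()(h).shape)          # torch.Size([2, 7])
```

The length regulator itself is then just the duration-based repeat from the earlier sketch (e.g. `torch.repeat_interleave` along the time axis).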
Generative Models
- the goal, a probabilistic view
- motivation
- intro 4 types
GAN
likelihood-free methods, iterative training
- main architecture and probabilistic graph representation
- explain the model: a minimax game
- objectives of G and D (written out after this list)
- introduce the special training algorithm (iterative G/D updates)
- how to condition GAN on text for TTS
- use an example to show the idea: feed the condition signals into the generator's input. The condition signals could be linguistic features for an acoustic model, or a mel-spectrogram for a vocoder. The signals can be used directly or together with the noise vector.
- introduce why GANs are more common in vocoders than in acoustic models
- (advanced study material) the probabilistic view of its objective; important when we compare it with other models
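For reference, the classic minimax objective (Goodfellow et al.) that the minimax-game slides build on; D maximizes it, G minimizes it:

\[
\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]
\]

Remember: for the exam you only need to explain what each term does (D should score real data high and generated data low; G tries to fool D), not reproduce it symbol-perfectly.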
VAE
variational inference; can only approximate the actual marginal likelihood
- main architecture, and probabilistic graph representation
- the intuition is related to disentanglement
- it introduces z and tries to model \(p(z \vert x)\)
- \(p(x \vert z)\) is needed for the estimation due to Bayes' rule
- resolve a possible conflict: I said generative models model \(p(x)\), but now it seems we want to model \(p(z \vert x)\)
- model comparison
- further emphasize disentanglement, and the weakness of this assumption
- likelihood-free (GAN) vs. MLE (VAE)
- explain the model
- VAE encoder and decoder: what they each try to model
- objective: the ELBO (written out after this list)
- problem: intractability
- (advanced) the essence: we have the ELBO because this is variational inference, and for variational inference we use a Gaussian to do the approximation (the Gaussian distribution can handle one-to-many problems)
- (advanced) resolve the conflict: we claimed we can use MLE to update a VAE, so why do we use the ELBO? show that maximizing the ELBO is equivalent to MLE
- explain the two main terms of the ELBO: the reconstruction and regularization terms
- VAEs in TTS: applications and how to condition on text
- an example to show the main method: redesign the ELBO's encoder \(q_{\phi}(z \vert x)\) to condition on linguistic context (\(z_c\)) or speaker info (\(z_s\)), then sample from it
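For reference, the ELBO as a lower bound on the marginal log-likelihood, with the two terms from the list above labeled:

\[
\log p_\theta(x) \;\ge\; \underbrace{\mathbb{E}_{q_\phi(z \vert x)}\big[\log p_\theta(x \vert z)\big]}_{\text{reconstruction}} \;-\; \underbrace{\mathrm{KL}\big(q_\phi(z \vert x) \,\Vert\, p(z)\big)}_{\text{regularization}}
\]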
NF
exact likelihood estimation, MLE, but limited model selection
- architecture and probabilistic graph representation
- motivation: a VAE is an approximation, NFs can do exact likelihood estimation
- so we need simple transformations
- we need a series of simple transformations to model the actual complex distribution
- we need two directions
- (advanced) explain why we need two directions → Bayes' rule, bijective models
- model comparison
- VAE: just an approximation, because given \(p(x \vert z)\), it's intractable to solve for \(p(z \vert x)\)
- NFs solve this problem by using invertible functions
- (advanced) explain why using this specific kind of function solves the problem
- because of these functions, we have a normalized density; that's why it's called "Normalizing" (see the change-of-variables formula after this list)
- because the capacity of these functions is limited, we need a series of them to approximate a complex distribution; that's why it's called "Flows"
- (advanced) explain the determinant form from a geometric perspective
- NFs in TTS:
- an example, Glow-TTS: the noise vector is not sampled from the unconditional \(p(z)\) but from the conditional distribution \(p(z \vert x)\); \(p(z \vert x)\) can be learned in a similar way as in a VAE, then a vector is sampled from it and fed through the invertible functions
- weakness of NFs
- limited model selection range
- can struggle with high-dimensional data due to the determinant calculation
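For reference, the change-of-variables formula behind the "normalized density" point above: for an invertible \(f\) mapping data \(x\) to the simple base variable \(z = f(x)\),

\[
\log p_X(x) \;=\; \log p_Z\big(f(x)\big) \;+\; \log \left\lvert \det \frac{\partial f(x)}{\partial x} \right\rvert
\]

Composing a series of such functions \(f = f_K \circ \cdots \circ f_1\) just sums the log-determinant terms, which is the "flow".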
DDPM
approximation + Markov assumption for sequence modeling; also uses the ELBO, but with a wider model selection range
- pipeline and the probabilistic graph representation
- intuition of denoising and diffusion
- explain the method
- diffusion: how to add noise, and the Markov assumption
- denoising: how to reduce noise; this is supposed to model the distribution that was used to add the noise, but in practice just modeling the noise also works; the Markov assumption applies here too (see the formulas after this list)
- model comparison
- vs. NFs: how many models they each need
- DDPMs in TTS (how to condition on text, example DiffTTS)
- simply put, the linguistic signal is added to the hidden states of the model that learns the noise distribution
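For reference, the standard DDPM formulas (Ho et al., 2020): one forward (diffusion) step under the Markov assumption, and the simplified training objective where the network \(\epsilon_\theta\) only predicts the added noise:

\[
q(x_t \vert x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big),
\qquad
L_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\Big[\big\lVert \epsilon - \epsilon_\theta(x_t, t) \big\rVert^2\Big]
\]

In DiffTTS-style conditioning, \(\epsilon_\theta\) additionally receives the linguistic/text signal.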
Vocoder
- motivation
- if the STFT is lossless, why can't we go from spectrogram back to wave?
- phase reconstruction problem (see the Griffin-Lim sketch below)
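A minimal Griffin-Lim sketch with librosa (assuming librosa is installed; any mono waveform works in place of the bundled example clip, which is downloaded on first use):

```python
import numpy as np
import librosa

# Load audio and compute its magnitude spectrogram (the phase is discarded).
y, sr = librosa.load(librosa.ex("trumpet"))
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))

# Griffin-Lim iteratively estimates a phase that is consistent with |S|.
y_hat = librosa.griffinlim(S, n_iter=32, hop_length=256, n_fft=1024)
# y_hat approximates y; artefacts remain because the phase is only estimated,
# which is exactly why neural vocoders replaced Griffin-Lim.
```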
WaveNet
CNN-based, autoregressive (AR)
- the conflict between CNNs and AR: a CNN takes future content as input, but AR modeling doesn't allow using future info, so we need stacked dilated causal convolutions (SDCV) to make it causal
- main modules:
- stacked dilated causal conv (SDCV)
- gated activation units
- explain the model:
- input, represented by the SDCV stack
- how to get the output (amplitude per step)
- naive solution: regress the amplitude → but regression is hard due to its continuous nature
- actual solution: a classification task → needs quantization
- (advanced) what is \(\mu\)-law: a nonlinear transformation that compresses the dynamic range of an audio signal, which allows lower amplitudes to keep more detail while higher amplitudes are quantized more coarsely. This compression is especially effective for human speech, where most of the important information is in the mid- to low-amplitude ranges. (see the sketch after this list)
- WaveNet is an AR model, so it can produce speech by conditioning on its own previous outputs; then how do we condition on a spectrogram? (what the condition types are and how to apply them)
- global condition
- local condition
- weakness: slow inference speed
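A sketch of the \(\mu\)-law companding and quantization described above (the \(\mu = 255\), 256-class setup follows the WaveNet convention):

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compand x in [-1, 1] and quantize to mu+1 (=256) classes,
    turning amplitude regression into classification as in WaveNet."""
    companded = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((companded + 1) / 2 * mu + 0.5).astype(np.int64)  # class ids 0..255

def mu_law_decode(q, mu=255):
    """Inverse: class id -> companded value -> approximate amplitude."""
    companded = 2 * q.astype(np.float64) / mu - 1
    return np.sign(companded) * ((1 + mu) ** np.abs(companded) - 1) / mu

x = np.linspace(-1, 1, 5)
print(mu_law_encode(x))                  # low amplitudes get finer resolution
print(mu_law_decode(mu_law_encode(x)))   # close to x, up to quantization error
```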
GAN-based Vocoder
- introduce three important ones, MelGAN, HiFiGAN, and Avocodo, and their important discriminator designs
- MelGAN (what they are)
- MSD (multi-scale discriminator): waveform information is hierarchical
- window-based objective: better than making a decision on the whole waveform; better at capturing fine-grained information and high-frequency artefacts
- HiFiGAN:
- MPD (multi-period discriminator): focuses on periodicity (see the folding sketch after this list)
- Avocodo:
- artefacts in different filter banks and intermediate results
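To make the MPD idea tangible, a small sketch of how a waveform can be folded into a 2-D grid per period, so that samples exactly p steps apart line up in a column and periodic structure becomes visible (the prime periods mirror HiFiGAN's choice; the real model pads to a multiple of p instead of trimming):

```python
import numpy as np

wave = np.random.randn(16000)      # 1 s of fake audio at 16 kHz

def fold_by_period(x, p):
    """Reshape a 1-D signal to (T//p, p): each column holds samples
    that are exactly p steps apart, exposing period-p structure."""
    T = len(x) - len(x) % p        # trim to a multiple of p for simplicity
    return x[:T].reshape(-1, p)

for p in (2, 3, 5, 7, 11):         # one sub-discriminator per period
    print(p, fold_by_period(wave, p).shape)
```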
End-to-end TTS
(this part is mainly explored in the seminar session)
- the difference between one-stage and two-stage methods
- example for one-stage method: NaturalSpeech series
- (the architecture won’t be asked in the exam)
Dataset and Metrics
taken from Florian's original slides
- dataset:
- what is needed
- how to prepare
- metrics:
- objective evaluation: things that can be evaluated automatically, e.g., speaker similarity, phone error rate by ASR
- subjective evaluation: things that must be evaluated by humans, e.g., naturalness, prosody, speaker similarity
- AB preference
- MOS (mean opinion score; see the small sketch below)
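Since MOS comes up a lot, a tiny sketch of how it's computed from raw listener ratings (the ratings are made up; the 95% confidence interval uses a simple normal approximation):

```python
import numpy as np

# MOS is just the mean of listeners' 1-5 ratings for a system; a 95% CI
# is usually reported alongside it.
ratings = np.array([4, 5, 3, 4, 4, 5, 4, 3, 4, 5], dtype=float)
mos = ratings.mean()
ci95 = 1.96 * ratings.std(ddof=1) / np.sqrt(len(ratings))
print(f"MOS = {mos:.2f} +/- {ci95:.2f}")
```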