Guidance of Study and Exam Prep

TTS is a relatively complex course. Similar to ASR, it's a task with many sub-tasks, which are often interconnected rather than independent. Also, it's not a classification task of the kind you may have seen many times before, but a one-to-many generation task. As a result, while preparing this course, I anticipated the following challenges and tried to adjust the content accordingly. I figured that understanding the motivation behind the slide design might help you understand the content, so I describe it below. Since I'm not very familiar with curricula outside of IMS, I'm considering these challenges mainly from the perspective of IMS students.

Generative Models

The intro to deep learning course at the IMS doesn't cover generative models in much depth, but the advanced deep learning course has a topic on them. Considering that 1) advanced DL is not a prerequisite, 2) generative models are widely used in current TTS systems, and 3) students choosing the TTS topic might need to read papers that build their models on generative models, I think it's important to introduce these model types in more detail.

Maths

Generative models are probabilistic models: they model the uncertainty of a prediction rather than deterministic values. So when explaining these models, it's hard to completely avoid topics in probability, statistics, linear algebra, and calculus. Among the four model families we cover, all except GANs aim for a rigorous estimation of the marginal likelihood. I'm aware that some students might lack a math background, so initially I tried to prepare slides without touching on math, but I found that, apart from broadly stating the main ideas, it was difficult to causally relate the various concepts. Presented that way, you'd be forced to memorize facts, like knowing that the objective of a VAE is the ELBO or that NFs use many invertible functions, while why maximizing the ELBO approximates the marginal likelihood, or why these functions must be invertible, would remain confusing. Therefore, I decided to include some math in the end.
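As a compact reminder of the two facts mentioned above (both are standard results, written here in the usual VAE/NF notation rather than the exact notation of the slides):

```latex
% VAE: the marginal log-likelihood is lower-bounded via Jensen's
% inequality, which is why maximizing the ELBO approximates it.
\log p_\theta(x)
  = \log \mathbb{E}_{q_\phi(z \mid x)}\!\left[\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]
  \geq \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
       - \mathrm{KL}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)}_{\text{ELBO}}

% NF: the change-of-variables formula, with x = f(z). Evaluating it
% requires f^{-1} to exist (to recover z from x) and the Jacobian
% determinant to be defined -- hence the invertibility requirement.
\log p_X(x) = \log p_Z\!\left(f^{-1}(x)\right)
  + \log \left|\det \frac{\partial f^{-1}(x)}{\partial x}\right|
```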

If you want to better understand these concepts and build a solid foundation in DL, I recommend reading all the slides and exploring the external materials and links provided. However, I understand that for students without math training from their undergraduate studies, fully understanding the concepts would require a significant workload. So, if you just want to pass the exam: the slides marked as asides won't be asked about, and the derivations won't be examined either. The same goes for the formulas; if an objective is asked, you only need to intuitively explain which terms the formula contains and the motivation behind each term to get the points.

Of course, if you can logically write down the full formula during the exam without lengthy explanations, you'll also get the points and answer more quickly! You could instead try to write it down entirely from memorization, but you'll find that less efficient than understanding the intuition: if you memorize and misplace a subscript, the entire meaning of the formula could change, leading to lost points.

Speech

Speech has unique characteristics that differ from images and text. If you lack sufficient understanding of speech and signal processing, you might feel confused when trying to grasp certain model architecture designs, not knowing why they're built that way. If you do understand the properties of speech, some designs will seem natural, and you won't need to memorize them explicitly; you'll think, "ah sure, that's the way to do it, of course." For example, why does MelGAN use window-based methods? Because sliding windows are common in speech processing. Why does HiFiGAN need to handle periodicity? Because a speech signal can be decomposed into sinusoidal waves with different periods/frequencies, and the frequencies humans can produce are limited. Without knowing this, you might find these designs odd at first. I tried to include the relevant speech properties when introducing these concepts, but as mentioned in the introduction, having foundational knowledge of speech processing is important for this course.
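To make the periodicity point concrete, here is a minimal NumPy sketch. The sample rate, fundamental frequency, and harmonic amplitudes are illustrative choices (not values from the slides): a quasi-periodic "voiced" signal is built as a sum of sinusoids, and the DFT recovers exactly those components.

```python
import numpy as np

# Toy "voiced" signal: a fundamental plus two harmonics, mimicking the
# quasi-periodic structure of voiced speech. All values are illustrative.
sr = 16000                      # sample rate (Hz), common for TTS corpora
t = np.arange(sr) / sr          # one second of time stamps
f0 = 120                        # a plausible fundamental frequency (Hz)
signal = (1.0 * np.sin(2 * np.pi * f0 * t)
          + 0.5 * np.sin(2 * np.pi * 2 * f0 * t)
          + 0.25 * np.sin(2 * np.pi * 3 * f0 * t))

# The DFT decomposes the signal back into its sinusoidal components:
# the three strongest frequency bins sit at f0 and its harmonics.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)
peaks = sorted(freqs[np.argsort(spectrum)[-3:]].tolist())
print(peaks)  # [120.0, 240.0, 360.0]
```

With one second of audio at 16 kHz, the frequency resolution is exactly 1 Hz, so the harmonics land on exact bins with no spectral leakage; real speech is only approximately periodic, but the same harmonic structure is what periodicity-aware vocoder designs exploit.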

Of course, if you just need enough credits to graduate and aren't interested in speech itself, you'll still be able to pass the exam; you'll just need more effort to memorize and learn the designs. However, if you're interested in speech technology, learning about phonetics, phonology, and signal processing is crucial (other areas of linguistics also matter for speech understanding tasks, but I assume you already know enough from the Method class or other NLP courses). So it's worth checking the materials uploaded by Sarina in the Additional Resources folder.

Below are my notes from preparing the TTS part. They're much more concise than the slides and mainly highlight the concepts I consider important. You can use these notes to prepare for the exam. All bullet points marked as advanced are optional. Some of these optional points were eventually removed from the final slides, but the non-advanced parts should be fully covered. If you learn all the concepts, including the advanced ones, you'll have a better grasp of the content and be better prepared for the project seminar. Skipping them won't affect your exam score, but you'll eventually need them to understand the papers during the project seminar. Simply put, you're not required to understand every single detail of the slides to pass the exam. Enjoy!

Overview:

Classical TTS Overview

Text Analysis

Acoustic Models

Basic DNN synthesis

RNN, Tacotron

Transformer, FastSpeech

Generative Models

GAN

likelihood-free methods, iterative training

VAE

variational inference; can only approximate the actual marginal likelihood

NF

exact likelihood computation, MLE, but limited model selection

DDPM

approximation + Markov assumption for sequence modeling; also uses the ELBO, but allows a wider range of model choices

Vocoder

WaveNet

CNN-based, autoregressive (AR)

GAN-based Vocoder

End-to-end TTS

(this part is mainly explored in the seminar session)

Dataset and Metrics

taken from Florian's original slides