Training & Tuning Text-to-Speech with NVIDIA NeMo and W&B
What to expect?
Text-to-speech (TTS), also known as speech synthesis, is the task of generating spoken audio from written text and a common application of deep learning. Modern speech synthesis pipelines typically chain multiple neural networks and are tuned frequently to improve the intonation and inflection of the synthesized speech.
In this hands-on lab, you will learn how to use NVIDIA’s NeMo toolkit to train an end-to-end TTS system and Weights & Biases (W&B) to keep track of your experiments and performance metrics. We will walk you through setting up the environment, explain the different code blocks and tools as we execute the Jupyter Notebook, and then deploy the model to test its performance on specific blocks of text.
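Before running the notebook, the environment needs NeMo and the W&B client installed. A minimal setup sketch is shown below (package names are from PyPI; the exact extras and pinned versions used in the lab image may differ):

```shell
# Install the NeMo toolkit with its TTS extras, plus the W&B client.
pip install "nemo_toolkit[tts]" wandb

# Authenticate the W&B client (prompts for an API key from wandb.ai).
wandb login
```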
NeMo is an open-source Python toolkit that simplifies building state-of-the-art, GPU-accelerated speech AI (ASR, TTS) and NLP models. Weights & Biases stores all of your experiments in one place, including model weights and datasets, which is handy when comparing runs.