Converting Text To Speech Using Python

Introduction

Hello! 😃

In this tutorial I will show you how to easily convert text to an audio file using Python, Tacotron 2 and WaveGlow. 😎

Requirements

Basic knowledge of Python
An NVIDIA GPU with CUDA support (the code in this tutorial moves the models to 'cuda')

Creating The Virtual Environment

First we need to create the virtual environment we will be using for this project. It can be created with the following command:

python3 -m venv env

Then we activate the newly created environment with the following command:

source env/bin/activate

Installing The Requirements

Next we need to install the requirements for the project, so create a file called "requirements.txt" and populate it with the following:

numpy
scipy
torch

The requirements can then be installed via the following command:

pip install -r requirements.txt

Done! Now we can actually get to coding. 😸

Coding The Project

Open up a file called "main.py" and add the following imports at the very top of the file:

import torch
from scipy.io.wavfile import write
import argparse

Next we need to create a method that takes an input text, transforms it into audio and saves the result to a file:

def text_to_audio(input_text):
    # Download (on first run) and load the pretrained Tacotron 2 model
    tacotron2 = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tacotron2', model_math='fp16')
    tacotron2 = tacotron2.to('cuda')
    tacotron2.eval()

    # Load the pretrained WaveGlow model and remove weight normalization for inference
    waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_waveglow', model_math='fp16')
    waveglow = waveglow.remove_weightnorm(waveglow)
    waveglow = waveglow.to('cuda')
    waveglow.eval()

    # Preprocess the input text into the sequences the model expects
    utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')
    sequences, lengths = utils.prepare_input_sequence([input_text])

    # Text -> mel spectrogram -> audio, without tracking gradients
    with torch.no_grad():
        mel, _, _ = tacotron2.infer(sequences, lengths)
        audio = waveglow.infer(mel)
    # Move the first (and only) result back to the CPU as a NumPy array
    audio_numpy = audio[0].data.cpu().numpy()
    rate = 22050

    write("audio.wav", rate, audio_numpy)

The above method downloads and loads both the pretrained Tacotron 2 and WaveGlow models, moves them to the GPU and then sets them to evaluation mode.
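Because the method calls .to('cuda'), it needs a GPU that PyTorch can see. As a rough sketch, a small pre-flight check like the one below could be run before loading the models. The helper name cuda_available is made up for this example; torch.cuda.is_available() is the real PyTorch call it wraps.

```python
import importlib.util


def cuda_available() -> bool:
    """Return True only if PyTorch is installed and can see a CUDA device."""
    # find_spec avoids an ImportError when torch is not installed at all.
    if importlib.util.find_spec("torch") is None:
        return False
    import torch
    return torch.cuda.is_available()


if __name__ == "__main__":
    if cuda_available():
        print("CUDA device found, text_to_audio should be able to use the GPU.")
    else:
        print("No CUDA device found, the .to('cuda') calls would fail.")
```

Running this first gives a clearer message than the error raised halfway through loading the models.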

The utils variable is what we will be using to preprocess the input text. The Tacotron 2 model then generates mel spectrograms (a type of visual representation of the spectrum of frequencies in a sound as it changes over time) from the preprocessed text. The WaveGlow model then generates audio from these mel spectrograms.

Finally we write the result to a file called "audio.wav" with a sample rate of 22050 Hz, which is a common sample rate for speech files.
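To make the sample rate a bit more concrete: a rate of 22050 means 22050 samples per second of audio, so the duration of a file is its number of samples divided by the rate. The sketch below illustrates this with a generated test tone. It uses Python's built-in wave module rather than scipy, purely so it runs without extra packages. The function names, the "tone.wav" file name and the 440 Hz tone are all just illustrative.

```python
import math
import os
import struct
import tempfile
import wave

RATE = 22050  # samples per second, the same rate used in text_to_audio


def write_tone(path, seconds=1.0, freq=440.0, rate=RATE):
    """Write a mono 16-bit PCM sine tone and return the number of samples."""
    n_samples = int(seconds * rate)
    frames = bytearray()
    for i in range(n_samples):
        # Scale the sine wave to half of the 16-bit range to avoid clipping.
        value = int(0.5 * 32767 * math.sin(2 * math.pi * freq * i / rate))
        frames += struct.pack("<h", value)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav.setframerate(rate)
        wav.writeframes(bytes(frames))
    return n_samples


def wav_duration(path):
    """Return the duration in seconds: number of frames / sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "tone.wav")
    samples = write_tone(path, seconds=2.0)
    print(samples, "samples =", wav_duration(path), "seconds")
```

If a file is written with a different rate than the one the audio was generated at, it will play back too fast or too slow, which is why the rate passed to write matters.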

Phew! Now, to finish off the code, we need a main method, which is as follows:

if __name__ == "__main__":
    ap = argparse.ArgumentParser()
    ap.add_argument("-t", "--text", required=True, help="Text to convert")
    args = vars(ap.parse_args())

    text_to_audio(args["text"])
    print("Finished!")

The main method takes a command line argument containing the text we want to convert and passes it to the text_to_audio method, which transforms it into an audio file.
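If you are curious what the parser actually hands back, parse_args also accepts an explicit list of arguments in place of the real command line. The following is only a small sketch to show the resulting dictionary. The build_parser helper is not part of main.py; it is just a wrapper added for this example.

```python
import argparse


def build_parser():
    """Build the same parser as main.py: one required -t/--text option."""
    ap = argparse.ArgumentParser()
    ap.add_argument("-t", "--text", required=True, help="Text to convert")
    return ap


# parse_args accepts an explicit list, which stands in for sys.argv[1:].
args = vars(build_parser().parse_args(["-t", "Hello World"]))
print(args["text"])
```

Because vars turns the parsed namespace into a plain dictionary, the text is read with args["text"], which is exactly what the main method passes to text_to_audio.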

Now that the code is complete you can run the script using the following command:

python3 main.py -t "Hello World"

This should produce an audio file called "audio.wav" in the current directory. Feel free to have a listen. 😆

Conclusion

In this tutorial I have shown how to simply convert text into an audio file. I hope this tutorial was useful to you.

There are some limitations to the above example. For instance, it doesn't work well with long text. You may need to create multiple audio files and then glue them together if you want to convert a long text string.
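One possible way to approach the long text problem is to split the text at sentence boundaries first and convert each chunk separately. The snippet below is only a sketch of the splitting step. The split_text name and the 200 character limit are assumptions made for this example, not values taken from Tacotron 2.

```python
import re


def split_text(text, max_chars=200):
    """Split text at sentence ends into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut mid-word.
    """
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit, so start a new chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for chunk in split_text("Hello there. How are you today? I am fine!", max_chars=35):
        print(chunk)
```

Each chunk could then be passed to text_to_audio in turn. To end up with one file, the method would need a small change so that it returns the audio array instead of writing it. The arrays could then be joined with numpy.concatenate and written once.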

As always, you can find the repo on my GitHub page: https://github.com/ethand91/text-to-speech-sample

Happy Coding! 😎
