Powered by a text-to-speech model using deep convolutional networks with transfer learning. Based on "Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention" and dc_tts, an implementation of the methods described in that paper.
Currently, I'm cleaning up the source code and rewriting it for a newer version of TensorFlow. I hope to eventually release a consolidated CLI to streamline usage of the TTS model.
I made a video about the project; it features a Hamlet soliloquy (my voice) and Obama reading the Navy Seal Copypasta. Check it out here.
When I first started this project, my goal, while lofty ( ^ _ ^ ), was to replicate someone's voice given a short audio clip. To be honest, I still have no idea what I meant by "short": perhaps 5-10 seconds? Yeah, it doesn't take a scientist to figure out that that IS NOT easy. But I still wanted to try something. After trying a lot of different models (Tacotron 1/2, WaveNet, DeepSpeech), I settled on this rather nifty approach using convolutional neural networks, which train a lot faster than their recurrent counterparts. Still, even with this model, I'd need A LOT of data to clone someone's voice at a decent quality. Of course, that's only if you want to train it from scratch; instead, I can leverage the REALLY BIG open-source speech datasets to train a very good generic model, and then simply use that model as a starting point.
The idea to apply transfer learning didn't actually come until I experimented with a few different approaches to voice style conversion. My first thought was to train a model solely responsible for converting this "generic" voice to the target voice (e.g. Obama or Trump). One approach, which actually seemed to have 'reasonable' results, was to randomly initialize a CNN and train it to output the target mel spectrogram. The result: well... you could certainly hear elements of the target voice, but it wasn't immediately recognizable. I also tried adapting an approach from "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis"; however, again, the results were not great (though I think this was due to implementation errors on my part and an insufficient amount of data).
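To make the first conversion approach concrete, here is a minimal numpy sketch of the idea: a convolutional layer that maps "generic" mel-spectrogram frames to target-speaker frames, trained against a reconstruction loss. Everything here is illustrative; the layer sizes, names, and the single-layer architecture are my assumptions, not the actual experiment (which would use a deeper CNN and a real optimizer).

```python
import numpy as np

N_MELS = 80   # number of mel bins (assumed)
KERNEL = 3    # temporal kernel width (assumed)

rng = np.random.default_rng(0)
# One hypothetical conv layer: (out_mels, in_mels, kernel), randomly initialized
weights = rng.standard_normal((N_MELS, N_MELS, KERNEL)) * 0.01
bias = np.zeros(N_MELS)

def conv1d_mel(mel, w, b):
    """1-D convolution over time with 'same' padding: (n_mels, T) -> (n_mels, T)."""
    pad = KERNEL // 2
    padded = np.pad(mel, ((0, 0), (pad, pad)))
    out = np.empty_like(mel)
    for t in range(mel.shape[1]):
        window = padded[:, t:t + KERNEL]               # (in_mels, KERNEL)
        out[:, t] = np.einsum('oik,ik->o', w, window) + b
    return out

# Stand-in data: random frames in place of real generic/target spectrograms
generic = rng.standard_normal((N_MELS, 50))
target = rng.standard_normal((N_MELS, 50))

pred = conv1d_mel(generic, weights, bias)
# Training would minimize this reconstruction loss via gradient descent
loss = np.mean((pred - target) ** 2)
```

The point of the sketch is the shape of the problem: the converter consumes and produces spectrograms of the same dimensions, so it can be dropped between the generic TTS model and the vocoder.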
Given a really well-trained model (meaning a model trained on A LOT of data for A LOT of time), we can significantly reduce the training time of 'derivative' voices if we simply 'transfer' the weights from the pretrained model. Further training is then just fine-tuning the model to fit the specific voice rather than the generic base. You can think of this as training the model to do an impression of the target voice. It's pretty cool, I know!
Surprisingly, it turns out that simply summing the weights of the pretrained model with those of the new model actually performs REALLY well. In hindsight, I suppose this is just 'instructing' the model to fine-tune the general voice, and to do this (to get a general picture of how the target voice should sound) it doesn't really need a lot of data.
All the models you heard in the video were trained on roughly 10-15 minutes of data (consisting of clips ranging between 2-10 seconds). All the datasets were made by hand: splitting audio recordings into small clips and then transcribing them. I experimented with many different voices for this video (e.g. Donkey and Shrek) and found that the best results come from speech where extreme tone is minimized. This is why the Donkey results were never very good; it also didn't help that there is constantly music and other noise in Donkey scenes. TL;DR: if you're trying this at home, remember to get CLEAN data (e.g. no noise in the background, trimmed silence, consistent tone).
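For the "trimmed silence" part of that cleanup, here is a minimal numpy helper along those lines; it strips leading and trailing samples below a simple amplitude threshold. This is my own toy version, not the project's pipeline; a real one would more likely use something like `librosa.effects.trim`.

```python
import numpy as np

def trim_silence(wave, threshold=0.01):
    """Remove leading/trailing samples whose amplitude is below `threshold`."""
    voiced = np.flatnonzero(np.abs(wave) >= threshold)
    if voiced.size == 0:
        return wave[:0]                      # the clip is all silence
    return wave[voiced[0]:voiced[-1] + 1]

# Toy clip: 100 silent samples, 50 samples of "speech", 100 silent samples
clip = np.concatenate([np.zeros(100), 0.5 * np.ones(50), np.zeros(100)])
trimmed = trim_silence(clip)                 # only the 50 voiced samples remain
```

Note this only trims the ends; internal pauses are kept, which is usually what you want for clips that will be transcribed as full sentences.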
The Obama model was trained on 130 samples (12 minutes and 55 seconds total time) sourced from "The President's Speech in Cairo: A New Beginning." Likewise, the Trump model was trained on 136 samples (12 minutes and 11.8 seconds total time) sourced from his Liberty University address in 2017.
Obviously, the quality of the TTS model is far from perfect. Pronunciation seems to be the largest obstacle; however, this is just a consequence of a not-so-great dataset. Still, I think the results are pretty remarkable considering that the model was just trained on a small set of audio clips.