Audio samples from "WavThruVec: Latent speech representation as intermediate features for neural speech synthesis"

✌ WavThruVec: Latent speech representation as intermediate features for neural speech synthesis

Authors: Hubert Siuzdak, Piotr Dura, Pol van Rijn, Nori Jacoby

Paper: arXiv

Abstract: Recent advances in neural text-to-speech research have been dominated by two-stage pipelines that use low-level intermediate speech representations such as mel-spectrograms. However, such predetermined features are fundamentally limited, because they do not exploit the full potential of a data-driven approach through learning hidden representations. For this reason, several end-to-end methods have been proposed. However, such models are harder to train and require a large number of high-quality recordings with transcriptions. Here, we propose WavThruVec, a two-stage architecture that resolves the bottleneck by using high-dimensional wav2vec 2.0 embeddings as the intermediate speech representation. Since these hidden activations provide high-level linguistic features, they are more robust to noise. That allows us to use annotated speech datasets of lower quality to train the first-stage module. At the same time, the second-stage component can be trained on large-scale untranscribed audio corpora, as wav2vec 2.0 embeddings are already time-aligned. This results in better generalization both to out-of-vocabulary words and to unseen speakers. We show that the proposed model not only matches the quality of state-of-the-art neural models, but also presents useful properties enabling tasks like voice conversion or zero-shot synthesis.

Listen to our model reading this abstract:



A high-level comparison of TTS architectures: a) a traditional two-stage pipeline with a mel-spectrogram as the intermediate speech representation; b) end-to-end TTS that generates the waveform directly from input text; c) the proposed two-stage TTS, which leverages a latent speech representation from an external, pretrained model. Green blocks represent learnable neural modules, red blocks represent predetermined features, and blue blocks represent hidden representations. The dashed outline indicates that Wav2Vec is frozen during training and its parameters are not updated.
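To give a feel for the time resolution of the intermediate representation in panel c), the following minimal sketch counts how many wav2vec 2.0 embedding frames a waveform of a given length produces. It assumes the convolutional feature-encoder layout of the standard wav2vec 2.0 configuration (the kernel widths and strides below) and 16 kHz input audio; this page does not state which checkpoint or settings WavThruVec uses, so the constants and the helper names `num_frames` and `total_stride` are illustrative only.

```python
# Assumption: kernel widths and strides of the standard wav2vec 2.0
# convolutional feature encoder. These values are not given on this page.
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)
SAMPLE_RATE = 16_000  # Hz, assumed input sampling rate


def num_frames(num_samples: int) -> int:
    """Return how many embedding frames the encoder emits for num_samples."""
    length = num_samples
    for kernel, stride in zip(KERNELS, STRIDES):
        if length < kernel:
            return 0  # input too short for this convolution (no padding)
        length = (length - kernel) // stride + 1
    return length


def total_stride() -> int:
    """Return the hop between consecutive frames, in input samples."""
    hop = 1
    for stride in STRIDES:
        hop *= stride
    return hop


if __name__ == "__main__":
    hop_ms = 1000 * total_stride() / SAMPLE_RATE
    print(f"hop between frames: {hop_ms:.0f} ms")
    for seconds in (1, 3, 10, 30):
        print(f"{seconds:>2} s of audio -> {num_frames(seconds * SAMPLE_RATE)} frames")
```

Under these assumptions each frame advances by 320 samples (20 ms), so one second of audio maps to 49 frames. This fixed hop is what makes the embeddings time-aligned with the waveform without any transcription.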

Text-to-speech

Text | Ground truth | Tacotron 2 | FastPitch | VITS | WavThruVec (22 kHz)
That decision is for the British Parliament and people.
And a film maker was born.
However, he is a coach, and he remains a coach at heart.
We have not yet received a letter from the Irish.
The Greeks used to imagine that it was a sign from the gods to foretell war or heavy rain.
Neither side would reveal the details of the offer.
I've never seen anything like it.
To the Hebrews it was a token that there would be no more universal floods.
He did not oppose the divorce.
I did not think it was very proper.

Out-of-vocabulary pronunciation (unseen words)

Text | Tacotron 2 | FastPitch | VITS | WavThruVec (22 kHz)
The oldest method is that of horoscope comparison, called synastry.
Many educated Jews were inmates of Theresienstadt.
Though a lot of of the proteins recognized inside the immunoprecipitates be of nuclear source.
He will succeed Sir Ian Wrigglesworth.
This review appears in a themed issue regarding Nanobiotechnology.
There are numerous ethnolinguistic groups in the Kashmir territory.
Reducing conditions are usually maintained by the addition of beta-mercaptoethanol or dithiothreitol.
Compartmented drug release systems were prepared based on nanotubes or nanofibres.
Mrs. Comstock, always antagonistically honest, presents her with an old dress.
Latitudinarianism: Broad church theology of Anglicanism.

Voice conversion

Input | Target voice | VITS | WavThruVec (22 kHz) | WavThruVec

Bonus: Voice conversion with noisy input (end-user device simulation)

Input | Target voice | VITS | WavThruVec (22 kHz) | WavThruVec

Zero-shot text-to-speech

Text | Ground truth | Zero-shot (1 s) | Zero-shot (3 s) | Zero-shot (10 s) | Zero-shot (30 s)
We stopped under the willows by Kempton Park, and lunched.
It is a pretty little spot there: a pleasant grass plateau, running along by the water's edge, and overhung by willows.
The appearance of the island when I came on deck next morning was altogether changed.
The appearance of the island when I came on deck next morning was altogether changed.