Abstract
Recent advances in neural text-to-speech research have been dominated by two-stage pipelines that use low-level
intermediate speech representations such as mel-spectrograms. However, such predetermined features are fundamentally
limited, because they prevent the model from exploiting the full potential of a data-driven approach through learned
hidden representations. For this reason, several end-to-end methods have been proposed. However, such models are harder to
train and require a large number of high-quality recordings with transcriptions. Here, we propose WavThruVec, a
two-stage architecture that resolves this bottleneck by using high-dimensional wav2vec 2.0 embeddings as the
intermediate speech representation. Since these hidden activations provide high-level linguistic features, they are more
robust to noise. That allows us to utilize annotated speech datasets of lower quality to train the first-stage module.
At the same time, the second-stage component can be trained on large-scale untranscribed audio corpora, as
wav2vec 2.0 embeddings are already time-aligned. This results in better generalization both to
out-of-vocabulary words and to unseen speakers. We show that the
proposed model not only matches the quality of state-of-the-art neural models, but also presents useful properties
enabling tasks like voice conversion or zero-shot synthesis.
Listen to our model reading this abstract:
A high-level comparison of TTS architectures: a) a traditional two-stage pipeline with a mel-spectrogram as the
intermediate speech representation; b) end-to-end TTS that generates the waveform directly from input text; c) the proposed
two-stage TTS that leverages a latent speech representation from an external, pretrained model. Green blocks represent
learnable neural modules, red blocks represent predetermined features, and blue blocks represent hidden representations. The dashed
outline indicates that Wav2Vec is frozen during training and its parameters are not updated.
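
The abstract notes that wav2vec 2.0 embeddings are already time-aligned with the audio. The sketch below illustrates what that implies: the publicly documented wav2vec 2.0 feature encoder is a stack of seven 1-D convolutions over 16 kHz audio, so the number of embedding frames follows directly from the number of input samples. The kernel and stride values are the defaults from the wav2vec 2.0 paper and are an assumption here, since this page does not say which checkpoint WavThruVec uses. The helper names `num_frames` and `frame_hop_ms` are hypothetical, and the durations in the loop are the zero-shot reference lengths listed further down this page.

```python
import math

# Default wav2vec 2.0 feature-encoder configuration (assumed, not taken from this page):
# seven 1-D convolutions applied to raw 16 kHz audio, without padding.
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)
SAMPLE_RATE = 16_000  # Hz


def num_frames(num_samples: int) -> int:
    """Number of embedding frames produced for a waveform of `num_samples` samples."""
    length = num_samples
    for kernel, stride in zip(KERNELS, STRIDES):
        if length < kernel:
            return 0  # input too short to pass through this convolution
        length = (length - kernel) // stride + 1
    return length


def frame_hop_ms() -> float:
    """Time between consecutive frames: the product of all strides, in milliseconds."""
    return 1000 * math.prod(STRIDES) / SAMPLE_RATE


if __name__ == "__main__":
    print(f"frame hop: {frame_hop_ms()} ms")
    # Reference-audio durations used in the zero-shot samples on this page.
    for seconds in (1, 3, 10, 30):
        print(f"{seconds:>2} s of audio -> {num_frames(seconds * SAMPLE_RATE)} frames")
```

Because every frame maps to a fixed 20 ms step of the input, a second-stage model that turns embeddings into audio needs no transcription or external aligner, which is consistent with the abstract's claim that it can be trained on untranscribed corpora.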
Text-to-speech
Text
Ground truth
Tacotron 2
FastPitch
VITS
WavThruVec (22 kHz)
That decision is for the British Parliament and people.
And a film maker was born.
However, he is a coach, and he remains a coach at heart.
We have not yet received a letter from the Irish.
The Greeks used to imagine that it was a sign from the gods to foretell war or heavy rain.
Neither side would reveal the details of the offer.
I've never seen anything like it.
To the Hebrews it was a token that there would be no more universal floods.
He did not oppose the divorce.
I did not think it was very proper.
Out-of-vocabulary pronunciation (unseen words)
Text
Tacotron 2
FastPitch
VITS
WavThruVec (22 kHz)
The oldest method is that of horoscope comparison, called synastry.
Many educated Jews were inmates of Theresienstadt.
Though a lot of of the proteins recognized inside the immunoprecipitates be of nuclear source.
He will succeed Sir Ian Wrigglesworth.
This review appears in a themed issue regarding Nanobiotechnology.
There are numerous ethnolinguistic groups in the Kashmir territory.
Reducing conditions are usually maintained by the addition of beta-mercaptoethanol or dithiothreitol.
Compartmented drug release systems were prepared based on nanotubes or nanofibres.
Mrs. Comstock, always antagonistically honest, presents her with an old dress.
Latitudinarianism: Broad church theology of Anglicanism.
Voice conversion
Input
Target voice
VITS
WavThruVec (22 kHz)
WavThruVec
Bonus: Voice conversion with noisy input (end-user device simulation)
Input
Target voice
VITS
WavThruVec (22 kHz)
WavThruVec
Zero-shot text-to-speech
Text
Ground truth
Zero-shot (1 s)
Zero-shot (3 s)
Zero-shot (10 s)
Zero-shot (30 s)
We stopped under the willows by Kempton Park, and lunched.
It is a pretty little spot there: a pleasant grass plateau, running along by the water's edge, and overhung by willows.
The appearance of the island when I came on deck next morning was altogether changed.
The appearance of the island when I came on deck next morning was altogether changed.