AdaSpeech: Adaptive Text to Speech for Custom Voice

Contents

1 Audio Samples
1.1 Adaptation voice on VCTK, LJSpeech and LibriTTS
2 Ablation Studies
2.1 Ablation Study on VCTK
2.2 Utterance-level Visualization Analysis
2.3 Finetune CLN and Finetune Other Decoder Parameters
2.4 Varying Adaptation Data on AdaSpeech
3 Demo Audio for ICLR 2021 Response
3.1 For AnonReviewer5
3.2 For AnonReviewer5 exp1
3.3 For AnonReviewer5 exp2
3.4 For AnonReviewer5 exp3
3.5 For AnonReviewer2

Audio Samples

Adaptation voice on VCTK, LJSpeech and LibriTTS

VCTK speaker: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob.

GT | GT mel + MelGAN | Baseline (Spk Emb) | Baseline (Decoder) | AdaSpeech

VCTK speaker: The Greeks used to imagine that it was a sign from the gods to foretell war or heavy rain.

GT | GT mel + MelGAN | Baseline (Spk Emb) | Baseline (Decoder) | AdaSpeech

LJSpeech speaker: Especially as no more time is occupied or cost incurred in casting setting or printing beautiful letters.

GT | GT mel + MelGAN | Baseline (Spk Emb) | Baseline (Decoder) | AdaSpeech

LJSpeech speaker: Printing in the only sense with which we are at present concerned differs from most if not from all the arts and crafts represented in the exhibition.

GT | GT mel + MelGAN | Baseline (Spk Emb) | Baseline (Decoder) | AdaSpeech

LibriTTS speaker: And so, howsoever reluctantly, she had gone.

GT | GT mel + MelGAN | Baseline (Spk Emb) | Baseline (Decoder) | AdaSpeech

LibriTTS speaker: All that I am doing is to use its logical tenability as a help in the analysis of what occurs when we remember.

GT | GT mel + MelGAN | Baseline (Spk Emb) | Baseline (Decoder) | AdaSpeech

Ablation Studies

Audio Samples for the Ablation Study on VCTK

VCTK speaker: There is, according to legend, a boiling pot of gold at one end.

AdaSpeech | AdaSpeech w/o CLN | AdaSpeech w/o PL-ACM | AdaSpeech w/o UL-ACM

VCTK speaker: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob.

AdaSpeech | AdaSpeech w/o CLN | AdaSpeech w/o PL-ACM | AdaSpeech w/o UL-ACM

Audio Samples for the Utterance-level Visualization Analysis

Pink Point in Brown Circle

You little scamp! Well! why do you not enter?

Blue Point in Brown Circle

The Fairy.

Audio Samples for Finetune CLN and Finetune Other Decoder Parameters

VCTK speaker: Ask her to bring these things with her from the store.

Finetune CLN | Finetune Other Decoder Parameters

Audio Samples for Varying Adaptation Data on AdaSpeech

LJSpeech speaker: Especially as no more time is occupied or cost incurred in casting setting or printing beautiful letters.

1 Adaptation Sample | 2 Adaptation Samples | 5 Adaptation Samples | 10 Adaptation Samples | 20 Adaptation Samples

VCTK speaker: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob.

1 Adaptation Sample | 2 Adaptation Samples | 5 Adaptation Samples | 10 Adaptation Samples | 20 Adaptation Samples

Demo Audio for ICLR 2021 Response

[speaker embedding with the utterance-level vector extracted from a different speaker] for AnonReviewer5

VCTK speaker: Ask her to bring these things with her from the store.

speaker embedding 306F with reference speech 306F | speaker embedding 306F with reference speech 361F | speaker embedding 306F with reference speech 345M
reference speech 306F | reference speech 361F | reference speech 345M

VCTK speaker: She can scoop these things into three red bags, and we will go meet her Wednesday at the train station.

speaker embedding 345M with reference speech 345M | speaker embedding 345M with reference speech 360M | speaker embedding 345M with reference speech 306F
reference speech 345M | reference speech 360M | reference speech 306F

[exp1] for AnonReviewer5

VCTK speaker: Ask her to bring these things with her from the store.

AdaSpeech with Noisy Reference Speech | Noisy Reference Speech | AdaSpeech with Clean Reference Speech | Clean Reference Speech

VCTK speaker: We also need a small plastic snake and a big toy frog for the kids.

AdaSpeech with Noisy Reference Speech | Noisy Reference Speech | AdaSpeech with Clean Reference Speech | Clean Reference Speech

[exp2] for AnonReviewer5

Finetune Dataset

Speech 1 | Speech 2

VCTK speaker: Ask her to bring these things with her from the store.

AdaSpeech with Clean Reference Speech | Clean Reference Speech

VCTK speaker: When the sunlight strikes raindrops in the air, they act as a prism and form a rainbow.

AdaSpeech with Clean Reference Speech | Clean Reference Speech

[exp3] for AnonReviewer5

VCTK speaker: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob.

With Phoneme-level | Without Phoneme-level

LibriTTS speaker: It is not logically necessary to the existence of a memory belief that the event remembered should have occurred, or even that the past should have existed at all.

With Phoneme-level | Without Phoneme-level

[Some speakers don't sound so good] for AnonReviewer2

VCTK speaker: People look, but no one ever finds it.

GT | GT mel + MelGAN | Baseline (Spk Emb) | Baseline (Decoder) | AdaSpeech

VCTK speaker: Please call Stella.

GT | GT mel + MelGAN | Baseline (Spk Emb) | Baseline (Decoder) | AdaSpeech

VCTK speaker: Throughout the centuries people have explained the rainbow in various ways.

GT | GT mel + MelGAN | Baseline (Spk Emb) | Baseline (Decoder) | AdaSpeech

VCTK speaker: Some have accepted it as a miracle without physical explanation.

GT | GT mel + MelGAN | Baseline (Spk Emb) | Baseline (Decoder) | AdaSpeech