Abstract: End-to-end (E2E) sequence-to-sequence (S2S) neural text-to-speech (TTS) models and E2E-S2S neural voice conversion (VC) models can achieve high-quality speech synthesis with a single neural ...
Abstract: Current state-of-the-art text-to-speech (TTS) systems predominantly utilize denoising-based acoustic decoders with language models (LLMs) or with non-autoregressive front-ends, known for ...