Abstract: Current state-of-the-art text-to-speech (TTS) systems predominantly use denoising-based acoustic decoders with large language models (LLMs) or with non-autoregressive front-ends, known for ...
Abstract: Recent TTS models with a decoder-only Transformer architecture, such as SPEAR-TTS and VALL-E, achieve impressive naturalness and demonstrate zero-shot adaptation given a speech ...