Traditional Text-to-Speech (TTS) systems rely on studio-quality speech recorded in controlled settings. Recently, an effort known as “noisy-TTS training” has emerged, aiming to utilize in-the-wild data. However, the lack of dedicated datasets has been a significant limitation. We introduce the TTS In the Wild (TITW) dataset, which is publicly available, created through a fully automated pipeline applied to the VoxCeleb1 dataset. It comprises two training sets: TITW-Hard, derived from the transcription, segmentation, and selection of raw VoxCeleb1 data, and TITW-Easy, which incorporates additional enhancement and data selection based on DNSMOS. State-of-the-art TTS models achieve UTMOS scores above 3.0 with TITW-Easy, while TITW-Hard remains difficult, with UTMOS below 2.8. Beyond TTS, TITW’s unique design, leveraging an automatic speaker recognition dataset, strengthens ethical efforts to counteract malicious use of TTS models by supporting tasks such as speech deepfake detection.
The text-to-speech in the wild (TITW) database
INTERSPEECH 2025, 17-21 August 2025, Rotterdam, The Netherlands
Type:
Conference
City:
Rotterdam
Date:
2025-08-17
Department:
Digital Security
Eurecom Ref:
8326
Copyright:
© ISCA. Personal use of this material is permitted. The definitive version of this paper was published in INTERSPEECH 2025, 17-21 August 2025, Rotterdam, The Netherlands, and is available at: http://dx.doi.org/10.21437/Interspeech.2025-2536
See also:
PERMALINK: https://www.eurecom.fr/publication/8326