The text-to-speech in the wild (TITW) database

Jung, Jee-weon; Zhang, Wangyou; Maiti, Soumi; Wu, Yihan; Wang, Xin; Kim, Ji-Hoon; Matsunaga, Yuta; Um, Seyun; Tian, Jinchuan; Shim, Hye-jin; Evans, Nicholas; Chung, Joon Son; Takamichi, Shinnosuke; Watanabe, Shinji

INTERSPEECH 2025, 17-21 August 2025, Rotterdam, The Netherlands

Traditional Text-to-Speech (TTS) systems rely on studio-quality speech recorded in controlled settings. Recently, an effort known as “noisy-TTS training” has emerged, aiming to utilize in-the-wild data. However, the lack of dedicated datasets has been a significant limitation. We introduce the TTS In the Wild (TITW) dataset, which is publicly available and was created through a fully automated pipeline applied to the VoxCeleb1 dataset. It comprises two training sets: TITW-Hard, derived from the transcription, segmentation, and selection of raw VoxCeleb1 data, and TITW-Easy, which incorporates additional enhancement and data selection based on DNSMOS. State-of-the-art TTS models achieve UTMOS scores above 3.0 with TITW-Easy, while TITW-Hard remains difficult, with UTMOS scores below 2.8. Beyond TTS, TITW’s unique design, which leverages an automatic speaker recognition dataset, strengthens ethical efforts to counteract malicious use of TTS models by supporting tasks such as speech deepfake detection.

Detail

Type: Conference

City: Rotterdam

Date: 2025-08-17

Department: Digital Security

Eurecom Ref: 8326

© ISCA. Personal use of this material is permitted. The definitive version of this paper was published in INTERSPEECH 2025, 17-21 August 2025, Rotterdam, The Netherlands, and is available at: http://dx.doi.org/10.21437/Interspeech.2025-2536