Abstract
Rapid advances in text-to-speech (TTS) and voice conversion (VC) technologies make it necessary to evaluate the quality of synthesized speech. In this paper, we propose a novel network, FUSE-MOS, which combines learned latent representations of raw audio waveforms and of their corresponding log-Mel spectrograms to estimate the posterior distribution of the Mean Opinion Score (MOS). Our method thus learns a broader and more nuanced representation of the speech signal. At inference, it predicts a MOS value (a point estimate) and also provides a measure of the uncertainty of that prediction. By leveraging the combined latent representation, FUSE-MOS achieves significant improvements in performance metrics over existing approaches on benchmark datasets. We also explore an uncertainty-based filtering strategy that removes low-confidence (high-uncertainty) samples. The results show that FUSE-MOS maintains strong performance even with reduced data.
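The uncertainty filtering idea can be illustrated with a minimal sketch. It assumes that each utterance comes with a predicted MOS and a scalar uncertainty (for example, a predicted standard deviation) and that a fixed fraction of the most confident samples is retained. The function names `filter_by_uncertainty` and `pearson_on_kept` and the `keep_fraction` parameter are hypothetical and are not taken from the paper, whose actual filtering rule may differ.

```python
import numpy as np


def filter_by_uncertainty(uncertainty, keep_fraction):
    """Return sorted indices of the most confident samples.

    Assumption: a lower uncertainty value means higher confidence, and
    `keep_fraction` (0 < keep_fraction <= 1) is the share of samples kept.
    """
    uncertainty = np.asarray(uncertainty, dtype=float)
    if uncertainty.ndim != 1 or uncertainty.size == 0:
        raise ValueError("uncertainty must be a non-empty 1-D array")
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = max(1, int(np.ceil(keep_fraction * uncertainty.size)))
    # A stable sort keeps the original order among samples with equal uncertainty.
    order = np.argsort(uncertainty, kind="stable")
    return np.sort(order[:n_keep])


def pearson_on_kept(true_mos, predicted_mos, kept):
    """Pearson correlation between true and predicted MOS on the kept subset."""
    true_mos = np.asarray(true_mos, dtype=float)[kept]
    predicted_mos = np.asarray(predicted_mos, dtype=float)[kept]
    return float(np.corrcoef(true_mos, predicted_mos)[0, 1])


if __name__ == "__main__":
    # Toy values for illustration only; these are not results from the paper.
    true_mos = [3.0, 4.5, 2.0, 1.5, 4.0, 3.5]
    predicted_mos = [3.2, 3.0, 2.1, 2.8, 4.1, 3.4]
    predicted_std = [0.10, 0.90, 0.20, 0.80, 0.15, 0.30]

    kept = filter_by_uncertainty(predicted_std, keep_fraction=0.5)
    print("kept indices:", kept.tolist())
    print("correlation on kept subset:", pearson_on_kept(true_mos, predicted_mos, kept))
```

Sweeping `keep_fraction` from 1.0 downward and recomputing the metric on each kept subset is one simple way to examine how performance changes as high-uncertainty samples are discarded.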
| Original language | English |
|---|---|
| Pages (from-to) | 2350-2354 |
| Number of pages | 5 |
| Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
| State | Published - 2025 |
| Event | 26th Interspeech Conference 2025, Rotterdam, Netherlands; Duration: Aug 17 2025 → Aug 21 2025 |
Keywords
- MOS prediction
- mean opinion score
- speech quality assessment
- uncertainty estimation
- whisper
Title
FUSE-MOS: Fusion of Speech Embeddings for MOS Prediction with Uncertainty Quantification