FUSE-MOS: Fusion of Speech Embeddings for MOS Prediction with Uncertainty Quantification

  • SUNY Buffalo

Research output: Contribution to journal, Conference article, peer-review

Abstract

The rapid advancement of text-to-speech (TTS) and voice conversion (VC) technologies makes it necessary to evaluate the quality of synthesized speech. In this paper, we propose a novel network, FUSE-MOS, which combines latent representations learned from raw audio waveforms and from their corresponding log-Mel spectrograms to estimate the posterior distribution of the Mean Opinion Score (MOS). Our method thus learns a broader and more nuanced representation of the speech signal. At inference, it predicts a MOS value (point estimate) and also provides a measure of the uncertainty of that prediction. By leveraging the combined latent representation, FUSE-MOS achieves significant improvements in performance metrics over existing approaches on benchmark datasets. We also explore an uncertainty-based filtering strategy that removes low-confidence (high-uncertainty) samples. The results show that FUSE-MOS maintains strong performance even with reduced data.
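
The uncertainty filtering idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the filter is implemented, so everything here is an assumption: each prediction is taken to be a point estimate plus a standard deviation, the samples with the lowest predicted standard deviation are kept, and the function names, the keep fraction, and the toy values are all hypothetical.

```python
from math import ceil


def filter_by_uncertainty(stds, keep_fraction):
    """Return indices of the keep_fraction of samples with the lowest std.

    Assumption: a lower predicted standard deviation means higher confidence.
    This is an illustrative rule, not the filter described in the paper.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = max(1, ceil(keep_fraction * len(stds)))
    # Sort sample indices from most confident to least confident.
    order = sorted(range(len(stds)), key=lambda i: stds[i])
    # Return the kept indices in their original order.
    return sorted(order[:n_keep])


def mean_squared_error(pred, target, indices):
    """Mean squared error restricted to the given sample indices."""
    return sum((pred[i] - target[i]) ** 2 for i in indices) / len(indices)


# Hypothetical toy values, not results from the paper.
pred_mos = [3.1, 4.2, 2.0, 3.8]   # predicted MOS (point estimates)
pred_std = [0.2, 0.9, 0.3, 0.1]   # predicted uncertainty per sample
true_mos = [3.0, 3.0, 2.5, 4.0]   # listener-rated MOS

kept = filter_by_uncertainty(pred_std, keep_fraction=0.5)
mse_all = mean_squared_error(pred_mos, true_mos, range(len(pred_mos)))
mse_kept = mean_squared_error(pred_mos, true_mos, kept)
print(kept, mse_all, mse_kept)
```

In this toy data the highest-uncertainty sample is also the one with the largest error, so the error on the kept subset is lower than on the full set. Whether that holds in practice depends on how well the predicted uncertainty is calibrated.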

Original language: English
Pages (from-to): 2350-2354
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
State: Published - 2025
Event: 26th Interspeech Conference 2025 - Rotterdam, Netherlands
Duration: Aug 17 2025 to Aug 21 2025

Keywords

  • MOS prediction
  • mean opinion score
  • speech quality assessment
  • uncertainty estimation
  • whisper
