
PixMus: Video and Text Conditioned Background Music Generation Using Latent Diffusion

  • SUNY Buffalo

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

The seamless integration of background music with video content is vital for enhancing audience engagement and emotional resonance, yet current music synthesis methods face significant challenges. These include limitations in generating high-quality audio directly in WAV format and aligning music with intricate visual and narrative elements, restricting their effectiveness in applications such as film scoring, video games, and social media. To address these challenges, we propose PixMus, a diffusion-based model for music synthesis conditioned on video content and textual prompts. Using a 3D U-Net architecture, PixMus generates extended sequences of latent representations, enabling the production of high-fidelity and long-format music that aligns with the mood, pace, and themes of the video. Evaluation results demonstrate that PixMus outperforms state-of-the-art models, achieving lower Fréchet Audio Distance (FAD) and higher CLAP scores, underscoring its ability to produce contextually precise and musically rich compositions. In addition, we introduce a curated dataset of 53,000 background music samples, producing a valuable resource to advance video-conditioned music synthesis research.

Original language: English
Title of host publication: Proceedings - 2025 IEEE Conference on Artificial Intelligence, CAI 2025
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 761-766
Number of pages: 6
ISBN (Electronic): 9798331524005
State: Published - 2025
Event: 3rd IEEE Conference on Artificial Intelligence, CAI 2025 - Santa Clara, United States
Duration: May 5 2025 to May 7 2025

Publication series

Name: Proceedings - 2025 IEEE Conference on Artificial Intelligence, CAI 2025

Conference

Conference: 3rd IEEE Conference on Artificial Intelligence, CAI 2025
Country/Territory: United States
City: Santa Clara
Period: 05/5/25 to 05/7/25

Keywords

  • Background-Music-Synthesis
  • Deep Learning
  • GenerativeAI
  • Latent diffusion
  • Multimedia
  • Video-to-Music
