TY - GEN
T1 - PixMus
T2 - 3rd IEEE Conference on Artificial Intelligence, CAI 2025
AU - Tilak Sharma, H. K.
AU - Kaushik, Arjun Ramesh
AU - Ratha, Nalini
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - The seamless integration of background music with video content is vital for enhancing audience engagement and emotional resonance, yet current music synthesis methods face significant challenges. These include limitations in generating high-quality audio directly in WAV format and aligning music with intricate visual and narrative elements, restricting their effectiveness in applications such as film scoring, video games, and social media. To address these challenges, we propose PixMus, a diffusion-based model for music synthesis conditioned on video content and textual prompts. Using a 3D U-Net architecture, PixMus generates extended sequences of latent representations, enabling the production of high-fidelity and long-format music that aligns with the mood, pace, and themes of the video. Evaluation results demonstrate that PixMus outperforms state-of-the-art models, achieving lower Fréchet Audio Distance (FAD) and higher CLAP scores, underscoring its ability to produce contextually precise and musically rich compositions. In addition, we introduce a curated dataset of 53,000 background music samples, providing a valuable resource to advance video-conditioned music synthesis research.
AB - The seamless integration of background music with video content is vital for enhancing audience engagement and emotional resonance, yet current music synthesis methods face significant challenges. These include limitations in generating high-quality audio directly in WAV format and aligning music with intricate visual and narrative elements, restricting their effectiveness in applications such as film scoring, video games, and social media. To address these challenges, we propose PixMus, a diffusion-based model for music synthesis conditioned on video content and textual prompts. Using a 3D U-Net architecture, PixMus generates extended sequences of latent representations, enabling the production of high-fidelity and long-format music that aligns with the mood, pace, and themes of the video. Evaluation results demonstrate that PixMus outperforms state-of-the-art models, achieving lower Fréchet Audio Distance (FAD) and higher CLAP scores, underscoring its ability to produce contextually precise and musically rich compositions. In addition, we introduce a curated dataset of 53,000 background music samples, providing a valuable resource to advance video-conditioned music synthesis research.
KW - Background-Music-Synthesis
KW - Deep Learning
KW - GenerativeAI
KW - Latent diffusion
KW - Multimedia
KW - Video-to-Music
UR - https://www.scopus.com/pages/publications/105011281402
U2 - 10.1109/CAI64502.2025.00136
DO - 10.1109/CAI64502.2025.00136
M3 - Conference contribution
AN - SCOPUS:105011281402
T3 - Proceedings - 2025 IEEE Conference on Artificial Intelligence, CAI 2025
SP - 761
EP - 766
BT - Proceedings - 2025 IEEE Conference on Artificial Intelligence, CAI 2025
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 5 May 2025 through 7 May 2025
ER -