Language as the Semantic Bridge in Audio, Music, and Multimodal Artificial Intelligence: A Systematic Review (2021-2025)

Ratnasari, Novia; Wibawa, Aji Prasetya

Ratnasari, Novia and Wibawa, Aji Prasetya (2026) Language as the Semantic Bridge in Audio, Music, and Multimodal Artificial Intelligence: A Systematic Review (2021-2025). Buletin Ilmiah Sarjana Teknik Elektro, 8 (2). pp. 344-379.

Official URL: https://journal2.uad.ac.id/index.php/biste/article...

Abstract

This study presents a systematic review of research in Audio, Music, and Multimodal Artificial Intelligence published between 2021 and 2025, investigating how language operates as a semantic mediation layer between acoustic signals and high-level meaning. The research addresses the fragmentation of existing surveys by introducing a Domain-Modality-Technique-Task (D-M-T-T) taxonomy that systematically differentiates domain focus, modality configuration, modeling techniques, and task objectives. Its contribution is a structured analytical framework that offers a more granular perspective than architecture-centered surveys of Multimodal Large Language Models. Following the PRISMA 2020 protocol, 2,197 Scopus-indexed publications were screened, yielding 369 eligible studies. Language is defined here as a representational layer, encompassing natural language and structured symbolic encodings, that connects acoustic embeddings to semantic interpretation and generative reasoning. Multimodal systems that align audio and vision without explicit textual grounding are included and analyzed as non-linguistic alignment architectures within the taxonomy. The findings reveal a shift from recognition-based models toward unified multimodal systems in which language conditions alignment, reasoning, and generative synthesis. For instance, text-conditioned music generation demonstrates how linguistic prompts guide compositional structure and emotional expression. These developments reflect an epistemic transition from signal recognition paradigms to language-mediated generative intelligence. Emerging gaps include limited explainability in generative audio systems and insufficient cross-modal semantic grounding in low-resource settings.

Item Type:	Article
Subjects:	T Technology > TK Electrical engineering. Electronics. Nuclear engineering
Depositing User:	Alfian Ma'arif
Date Deposited:	24 Apr 2026 17:47
Last Modified:	24 Apr 2026 17:47
URI:	https://alxiv.org/id/eprint/59
