Recent speech-aware large language models (Speech-LLMs) rely on a pre-trained speech encoder to convert audio into semantic-rich representations consumable by LLM. In this work, instead, we explore: ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results