TY - GEN
T1 - MIKO
T2 - 32nd ACM International Conference on Multimedia, MM 2024
AU - Lu, Feihong
AU - Wang, Weiqi
AU - Luo, Yangyifei
AU - Zhu, Ziqin
AU - Sun, Qingyun
AU - Xu, Baixuan
AU - Shi, Haochen
AU - Gao, Shiqi
AU - Li, Qian
AU - Song, Yangqiu
AU - Li, Jianxin
N1 - Publisher Copyright:
© 2024 ACM.
PY - 2024/10/28
Y1 - 2024/10/28
N2 - Social media has become ubiquitous for connecting with others, staying updated with news, expressing opinions, and finding entertainment. However, understanding the intention behind social media posts remains challenging due to the implicit and commonsense nature of these intentions, the need for cross-modality understanding of both text and images, and the presence of noisy information such as hashtags, misspelled words, and complicated abbreviations. To address these challenges, we present MIKO, a Multimodal Intention Knowledge DistillatiOn framework that collaboratively leverages a Large Language Model (LLM) and a Multimodal Large Language Model (MLLM) to uncover users' intentions. Specifically, our approach uses an MLLM to interpret the image, an LLM to extract key information from the text, and another LLM to generate intentions. By applying MIKO to publicly available social media datasets, we construct an intention knowledge base featuring 1,372K intentions rooted in 137,287 posts. Moreover, we conduct a two-stage annotation to verify the quality of the generated knowledge and benchmark the performance of widely used LLMs for intention generation, and further apply MIKO to a sarcasm detection dataset and distill a student model to demonstrate the downstream benefits of applying intention knowledge.
AB - Social media has become ubiquitous for connecting with others, staying updated with news, expressing opinions, and finding entertainment. However, understanding the intention behind social media posts remains challenging due to the implicit and commonsense nature of these intentions, the need for cross-modality understanding of both text and images, and the presence of noisy information such as hashtags, misspelled words, and complicated abbreviations. To address these challenges, we present MIKO, a Multimodal Intention Knowledge DistillatiOn framework that collaboratively leverages a Large Language Model (LLM) and a Multimodal Large Language Model (MLLM) to uncover users' intentions. Specifically, our approach uses an MLLM to interpret the image, an LLM to extract key information from the text, and another LLM to generate intentions. By applying MIKO to publicly available social media datasets, we construct an intention knowledge base featuring 1,372K intentions rooted in 137,287 posts. Moreover, we conduct a two-stage annotation to verify the quality of the generated knowledge and benchmark the performance of widely used LLMs for intention generation, and further apply MIKO to a sarcasm detection dataset and distill a student model to demonstrate the downstream benefits of applying intention knowledge.
KW - intention knowledge distillation
KW - large language model
KW - large vision language model
KW - social media
UR - https://www.scopus.com/pages/publications/85209776177
U2 - 10.1145/3664647.3681339
DO - 10.1145/3664647.3681339
M3 - Conference contribution
AN - SCOPUS:85209776177
T3 - MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
SP - 3303
EP - 3312
BT - MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
Y2 - 28 October 2024 through 1 November 2024
ER -