TY - GEN
T1 - EventFormer
T2 - 33rd ACM International Conference on Multimedia, MM 2025
AU - Su, Qile
AU - Zhu, Shoutai
AU - Zhang, Shuai
AU - Liang, Baoyu
AU - Tong, Chao
N1 - Publisher Copyright:
© 2025 ACM.
PY - 2025/10/27
Y1 - 2025/10/27
N2 - Script event induction, which aims to predict the subsequent event based on the context, is a challenging NLP task that has achieved remarkable success in practical applications. However, human events are mostly recorded and presented as videos rather than scripts, yet related research in the vision domain is lacking. To address this gap, we introduce AVEP (Action-centric Video Event Prediction), a task that distinguishes itself from existing video prediction tasks through its more complex logic and richer semantic information. To support this task, we present a large structured dataset built upon existing video event datasets, consisting of about 35K annotated videos and more than 178K event video clips. The dataset offers fine-grained annotations in which the atomic unit is a multimodal event argument node, providing better structured representations of video events. Due to the complexity of event structures, traditional visual models that take patches or frames as input are not well suited to AVEP. We propose EventFormer, a video event prediction model based on node-graph hierarchical attention, which captures both the relationships between events and their arguments and the coreferential relationships between arguments. We conducted experiments on AVEP with several state-of-the-art video prediction models as well as LVLMs, demonstrating both the complexity of the task and the value of the dataset. Our approach outperforms all of these video prediction models. We will release the dataset and code for replicating the experiments and annotations.
AB - Script event induction, which aims to predict the subsequent event based on the context, is a challenging NLP task that has achieved remarkable success in practical applications. However, human events are mostly recorded and presented as videos rather than scripts, yet related research in the vision domain is lacking. To address this gap, we introduce AVEP (Action-centric Video Event Prediction), a task that distinguishes itself from existing video prediction tasks through its more complex logic and richer semantic information. To support this task, we present a large structured dataset built upon existing video event datasets, consisting of about 35K annotated videos and more than 178K event video clips. The dataset offers fine-grained annotations in which the atomic unit is a multimodal event argument node, providing better structured representations of video events. Due to the complexity of event structures, traditional visual models that take patches or frames as input are not well suited to AVEP. We propose EventFormer, a video event prediction model based on node-graph hierarchical attention, which captures both the relationships between events and their arguments and the coreferential relationships between arguments. We conducted experiments on AVEP with several state-of-the-art video prediction models as well as LVLMs, demonstrating both the complexity of the task and the value of the dataset. Our approach outperforms all of these video prediction models. We will release the dataset and code for replicating the experiments and annotations.
KW - computer vision
KW - event graph
KW - hierarchical attention
KW - video event prediction
UR - https://www.scopus.com/pages/publications/105024077368
U2 - 10.1145/3746027.3755556
DO - 10.1145/3746027.3755556
M3 - Conference contribution
AN - SCOPUS:105024077368
T3 - MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia
SP - 4698
EP - 4707
BT - MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
Y2 - 27 October 2025 through 31 October 2025
ER -