TY - GEN
T1 - CASEG
T2 - 31st IEEE International Conference on Image Processing, ICIP 2024
AU - Huang, Suyuan
AU - Zhang, Haoxin
AU - Xu, Yanyu
AU - Gao, Yan
AU - Hu, Yao
AU - Qin, Zengchang
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Video action segmentation aims to identify and localize actions. Existing models have achieved impressive performance with pre-extracted frame-level features, but this may limit zero-shot learning and cross-dataset inference, especially for new actions or scenes. To overcome this problem, we propose a novel end-to-end network designed for robust performance across both familiar and novel action segmentation scenarios. Our approach combines a plug-and-play visual prompt module that enhances the temporal understanding of CLIP features, and a learnable text prompt that enriches label semantics and refines the model's focus, significantly boosting performance. Our results demonstrate that CLIP features can assist in action segmentation tasks, and that prompts can improve task effectiveness. Furthermore, our findings show that CLIP features contain information that I3D features do not. We evaluate the proposed method on several video datasets, including Georgia Tech Egocentric Activities (GTEA), 50Salads, and Breakfast, and the results show that the proposed model outperforms existing SOTA models.
AB - Video action segmentation aims to identify and localize actions. Existing models have achieved impressive performance with pre-extracted frame-level features, but this may limit zero-shot learning and cross-dataset inference, especially for new actions or scenes. To overcome this problem, we propose a novel end-to-end network designed for robust performance across both familiar and novel action segmentation scenarios. Our approach combines a plug-and-play visual prompt module that enhances the temporal understanding of CLIP features, and a learnable text prompt that enriches label semantics and refines the model's focus, significantly boosting performance. Our results demonstrate that CLIP features can assist in action segmentation tasks, and that prompts can improve task effectiveness. Furthermore, our findings show that CLIP features contain information that I3D features do not. We evaluate the proposed method on several video datasets, including Georgia Tech Egocentric Activities (GTEA), 50Salads, and Breakfast, and the results show that the proposed model outperforms existing SOTA models.
KW - Action Segmentation
KW - Cross-dataset Inference
KW - Learnable Text Prompt
KW - Video Understanding
KW - Zero-shot Learning
UR - https://www.scopus.com/pages/publications/85216902514
U2 - 10.1109/ICIP51287.2024.10647731
DO - 10.1109/ICIP51287.2024.10647731
M3 - Conference contribution
AN - SCOPUS:85216902514
T3 - Proceedings - International Conference on Image Processing, ICIP
SP - 2201
EP - 2207
BT - 2024 IEEE International Conference on Image Processing, ICIP 2024 - Proceedings
PB - IEEE Computer Society
Y2 - 27 October 2024 through 30 October 2024
ER -