TY - JOUR
T1 - SportSal
T2 - Hypernetwork-Based Saliency Prediction for Sports Videos
AU - Xu, Mai
AU - Wen, Shijie
AU - Jiang, Lai
AU - Qiao, Minglang
AU - Li, Shengxi
AU - Xu, Tao
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2026
Y1 - 2026
AB - Saliency prediction is crucial for improving the efficiency of sports video processing, and thereby for providing an enriched viewing experience to a wide audience. However, well-established eye-tracking datasets and learning-based approaches tailored to sports videos have long been lacking. In this paper, we establish a large-scale eye-tracking dataset dubbed audio-visual sports (AVS). AVS consists of 1,000 high-quality sports videos with eye fixations from 60 participants. Through data analysis on AVS, we observe that human attention patterns vary significantly with the specific scene context of the sport. Motivated by these observations, we propose a sports-aware saliency prediction approach, named SportSal, which adaptively predicts saliency maps in a hypernetwork-driven manner. Specifically, a hypernetwork is introduced to learn sports-aware priors. Meanwhile, an audio-visual fusion (AVF) block is developed to fuse features from the visual and audio backbones. Given the learned priors and fused audio-visual features, we propose the hyper deformable convolutional (HDC) block and the hyper upsampling (HU) block for dynamic feature extraction and upsampling, respectively. The two blocks are connected alternately to adaptively predict saliency maps. Experimental results show that our approach outperforms 21 state-of-the-art saliency prediction approaches on three sports video eye-tracking datasets. Finally, we demonstrate the application of SportSal to perceptual video compression. The dataset and code will be available at https://github.com/WeNsHiJIe-19950103/SportSal
KW - Sports videos
KW - hypernetwork
KW - saliency prediction
UR - https://www.scopus.com/pages/publications/105019558550
U2 - 10.1109/TCSVT.2025.3621424
DO - 10.1109/TCSVT.2025.3621424
M3 - Article
AN - SCOPUS:105019558550
SN - 1051-8215
VL - 36
SP - 2980
EP - 2998
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
IS - 3
ER -