What I see is what you see: Joint attention learning for first and third person video co-analysis

  • Huangyue Yu
  • Minjie Cai
  • Yunfei Liu
  • Feng Lu*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

In recent years, an increasing number of videos have been captured from the first-person viewpoint by wearable cameras. Such first-person video provides additional information beyond traditional third-person video, and thus has a wide range of applications. However, techniques for analyzing first-person video can be fundamentally different from those for third-person video, and it is even more difficult to explore the information shared between the two viewpoints. In this paper, we propose a novel method for first- and third-person video co-analysis. At the core of our method is the notion of “joint attention”, a learnable representation that corresponds to the attention regions shared across viewpoints and thus links them. To this end, we develop a multi-branch deep network with a triplet loss to extract the joint attention from first- and third-person videos via self-supervised learning. We evaluate our method on a public dataset with cross-viewpoint video matching tasks, where it outperforms the state of the art both qualitatively and quantitatively. We also demonstrate how the learned joint attention can benefit various applications through a set of additional experiments.
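The abstract's core mechanism, a triplet loss over embeddings from different viewpoint branches, can be sketched in a minimal form. The function name, embedding dimensions, and margin below are illustrative assumptions, not the paper's actual architecture or hyperparameters:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style triplet loss: pull the anchor embedding toward the
    positive (matching view) and push it away from the negative
    (non-matching view) by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)   # squared distance to positive
    d_neg = np.sum((anchor - negative) ** 2)   # squared distance to negative
    return max(0.0, d_pos - d_neg + margin)

# Toy embeddings standing in for attention-region features from a
# first-person branch (anchor), the matching third-person branch
# (positive), and a non-matching clip (negative).
anchor   = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])
negative = np.array([-1.0, 0.5])

loss = triplet_loss(anchor, positive, negative)
```

In self-supervised training of this kind, positive pairs come from temporally synchronized first- and third-person recordings, so no manual labels are required.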

Original language: English
Title of host publication: MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia
Publisher: Association for Computing Machinery, Inc.
Pages: 1358-1366
Number of pages: 9
ISBN (Electronic): 9781450368896
DOIs
State: Published - 15 Oct 2019
Event: 27th ACM International Conference on Multimedia, MM 2019 - Nice, France
Duration: 21 Oct 2019 - 25 Oct 2019

Publication series

Name: MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia

Conference

Conference: 27th ACM International Conference on Multimedia, MM 2019
Country/Territory: France
City: Nice
Period: 21/10/19 - 25/10/19

Keywords

  • Co-analysis
  • Cross-view
  • Deep learning
  • First-person video
  • Joint attention
  • Shared representation
  • Third-person video
