Joint discriminative representation learning for end-to-end person search

  • Pengcheng Zhang
  • Xiaohan Yu
  • Xiao Bai*
  • Chen Wang
  • Jin Zheng
  • Xin Ning

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Person search simultaneously detects and retrieves a query person from uncropped scene images. Existing methods are either two-step or end-to-end. The former employ two standalone models for the two sub-tasks, while the latter conduct person search with a unified model. Despite encouraging progress, most existing end-to-end methods focus on balancing the model between the detection and retrieval sub-tasks while neglecting to enhance the learned representation for retrieval, which leaves their accuracy below that of two-step approaches. To address this, we propose a novel hierarchical framework that jointly optimizes instance-aware and part-aware embeddings to enable discriminative representation learning. Specifically, we develop a region-of-interest cosegment (ROICoseg) module that captures part-aware information without requiring extra annotations, enabling fine-grained discriminative representation. On top of that, we introduce a Contextual Instance Batch Sampling (CIBS) method that uses contextual information to construct training batches, thus facilitating effective instance-aware representation learning. We further introduce the first cross-door person search dataset (CDPS), in which a target person is retrieved from outdoor cameras using an image captured indoors, or vice versa. Extensive experiments show that our proposed model achieves competitive performance on CUHK-SYSU and outperforms state-of-the-art end-to-end methods on the more challenging PRW and CDPS.

Original language: English
Article number: 110053
Journal: Pattern Recognition
Volume: 147
State: Published - Mar 2024

Keywords

  • Batch sampling
  • Part segmentation
  • Person re-identification
  • Person search
