Towards zero-shot human-object interaction detection via vision-language integration.


Authors: Qi Liu, Yuxiao Wang, Zhenao Wei, Xiaofen Xing, Xiangmin Xu, Weiying Xue

Language: English

Classification:

Publication: Neural Networks: The Official Journal of the International Neural Network Society, United States, 2025

Physical description:

Collection: NCBI

ID: 708052

Human-object interaction (HOI) detection aims to locate human-object pairs and identify their interaction categories in images. Most existing methods focus on supervised learning, which relies on extensive manual HOI annotations; this heavy reliance on closed-set supervision limits their generalization to unseen object categories. Inspired by the remarkable zero-shot capabilities of vision-language models (VLMs), we propose a novel framework, termed Knowledge Integration to HOI (KI2HOI), that effectively integrates VLM knowledge to improve zero-shot HOI detection. Specifically, we propose an HO-pair encoder that supplies contextual and interaction-specific semantic representations to the decoder of our model. Additionally, we propose two fusion strategies to facilitate prior-knowledge transfer from the VLM: visual-level fusion, which produces interaction features with richer global context, and language-level fusion, which further enhances the capability of the VLM for HOI detection. Extensive experiments on the mainstream HICO-DET and V-COCO datasets demonstrate that our model outperforms previous methods in various zero-shot and fully-supervised settings. The source code is available at https://github.com/xwyscut/K2HOI.
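To illustrate the two fusion strategies mentioned in the abstract, the sketch below shows one plausible, simplified reading: visual-level fusion blends detector interaction features with a VLM's global image feature, and language-level fusion scores the fused feature against VLM text embeddings of HOI category prompts by cosine similarity, which is what enables zero-shot classification. The function names, the fixed blending weight `alpha`, and the use of plain NumPy are illustrative assumptions; the paper's actual fusion modules are learned components, not fixed arithmetic.

```python
import numpy as np

def visual_level_fusion(interaction_feat, vlm_visual_feat, alpha=0.5):
    """Hypothetical visual-level fusion: blend per-pair interaction
    features with a VLM global image feature to add global context.
    (The paper's real module is learned; alpha here is a stand-in.)"""
    return alpha * interaction_feat + (1.0 - alpha) * vlm_visual_feat

def language_level_fusion(interaction_feat, text_embeds):
    """Hypothetical language-level fusion: cosine similarity between an
    interaction feature and VLM text embeddings of HOI category prompts.
    Unseen categories are scored the same way, giving zero-shot behavior."""
    f = interaction_feat / np.linalg.norm(interaction_feat)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    return t @ f  # one similarity score per HOI category
```

In this reading, adding a new interaction category at test time only requires embedding its text prompt; no detector retraining is needed.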
