
HOIGPT: Learning Long-Sequence Hand-Object Interaction with Language Models

  • Mingzhen Huang
  • Fu Jen Chu
  • Bugra Tekin
  • Kevin J. Liang
  • Haoyu Ma
  • Weiyao Wang
  • Xingyu Chen
  • Pierre Gleize
  • Hongfei Xue
  • Siwei Lyu
  • Kris Kitani
  • Matt Feiszli
  • Hao Tang
  • Meta
  • SUNY Buffalo

Research output: Contribution to journal › Conference article › peer-review

1 Scopus citations

Abstract

We introduce HOIGPT, a token-based generative method that unifies 3D hand-object interactions (HOI) perception and generation, offering the first comprehensive solution for captioning and generating high-quality 3D HOI sequences from a diverse range of conditional signals (e.g., text, objects, partial sequences). At its core, HOIGPT utilizes a large language model to predict the bidirectional transformation between HOI sequences and natural language descriptions. Given text inputs, HOIGPT generates a sequence of hand and object meshes; given (partial) HOI sequences, HOIGPT generates text descriptions and completes the sequences. To facilitate HOI understanding with a large language model, this paper introduces two key innovations: (1) a novel physically grounded HOI tokenizer, the hand-object decomposed VQ-VAE, for discretizing HOI sequences, and (2) a motion-aware language model trained to process and generate both text and HOI tokens. Extensive experiments demonstrate that HOIGPT sets new state-of-the-art performance on both text generation (+2.01% R Precision) and HOI generation (-2.56 FID) across multiple tasks and benchmarks.

Original language: English
Pages (from-to): 7136-7146
Number of pages: 11
Journal: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
State: Published - 2025
Event: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025 - Nashville, United States
Duration: Jun 11 2025 → Jun 15 2025
