I'M HOI: Inertia-aware Monocular Capture of 3D Human-Object Interactions

CVPR 2024


Chengfeng Zhao1, Juze Zhang1,2,3, Jiashen Du1, Ziwei Shan1, Junye Wang1, Jingyi Yu1, Jingya Wang1, Lan Xu1,*

1ShanghaiTech University    2Shanghai Advanced Research Institute, Chinese Academy of Sciences
3University of Chinese Academy of Sciences
*Corresponding author

Abstract



We are living in a world surrounded by diverse and “smart” devices with rich sensing modalities. Yet conveniently capturing the interactions between humans and these objects remains challenging. In this paper, we present I'm-HOI, a monocular scheme that faithfully captures the 3D motions of both the human and the object in a novel setting: a single RGB camera paired with an object-mounted Inertial Measurement Unit (IMU). It combines general motion inference and category-aware refinement. For the former, we introduce a holistic human-object tracking method that fuses the IMU signals with the RGB stream to progressively recover the human motion and, subsequently, the companion object motion. For the latter, we tailor a category-aware motion diffusion model conditioned on both the raw IMU observations and the results of the previous stage under an over-parameterized representation. It significantly refines the initial results and generates vivid body, hand, and object motions. Moreover, we contribute a large-scale dataset with ground-truth human and object motions, dense RGB inputs, and rich object-mounted IMU measurements. Extensive experiments demonstrate the effectiveness of I'm-HOI under this hybrid capture setting. Our dataset and code will be released to the community.


Pipeline



I'm-HOI combines general motion inference and category-aware refinement. In the first stage, a holistic human-object tracking method fuses the IMU signals with the RGB stream to progressively recover the human motion and then the companion object motion. In the second stage, a category-aware motion diffusion model, conditioned on both the raw IMU observations and the first-stage results under an over-parameterized representation, significantly refines the initial estimates and generates vivid body, hand, and object motions. A minimal code sketch of this two-stage flow is given below.
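
To make the data flow concrete, here is a self-contained Python sketch of the two-stage pipeline described above. Every class, function, tensor shape, and the example object category are hypothetical placeholders chosen for illustration; this is not the authors' released implementation, and the stubs return trivial values.

import numpy as np

class HolisticTracker:
    """Stage 1 (hypothetical stub): fuse the RGB stream with the
    object-mounted IMU signals, recovering the human motion first
    and then the companion object motion."""
    def track(self, rgb_frames, imu_readings):
        num_frames = len(rgb_frames)
        # Placeholder outputs: per-frame human pose parameters and
        # per-frame 6-DoF object transforms (identity here).
        human_poses = np.zeros((num_frames, 52, 3))
        object_poses = np.tile(np.eye(4), (num_frames, 1, 1))
        return {"human": human_poses, "object": object_poses}

class CategoryAwareDiffusion:
    """Stage 2 (hypothetical stub): a category-aware motion diffusion
    model conditioned on the raw IMU observations and the stage-1
    estimate under an over-parameterized representation."""
    def __init__(self, category):
        self.category = category

    def refine(self, initial, imu_readings):
        # A real model would iteratively denoise the motion;
        # this stub simply passes the initial estimate through.
        return initial

def capture(rgb_frames, imu_readings, object_category):
    initial = HolisticTracker().track(rgb_frames, imu_readings)
    return CategoryAwareDiffusion(object_category).refine(initial, imu_readings)

# Toy usage: 90 dummy RGB frames and 6-axis IMU readings (accel + gyro).
frames = np.zeros((90, 256, 256, 3), dtype=np.uint8)
imu = np.zeros((90, 6))
result = capture(frames, imu, object_category="skateboard")
print(result["human"].shape, result["object"].shape)  # (90, 52, 3) (90, 4, 4)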


IMHD2 Dataset



On the left, we show sampled highlights of the Inertial and Multi-view Highly Dynamic human-object interactions Dataset (IMHD2); on the right, the 10 well-scanned objects. In total, the dataset comprises 295 sequences and about 892k frames of data.
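
As a concrete reference point, below is a hypothetical Python sketch of how one IMHD2 sequence might be organized once loaded. The field names, shapes, and the example object name are assumptions for illustration, not the released data format.

from dataclasses import dataclass
import numpy as np

@dataclass
class IMHD2Sequence:
    rgb: np.ndarray        # (T, H, W, 3) RGB frames for one camera view
    imu: np.ndarray        # (T, 6) object-mounted accelerometer + gyroscope
    human_gt: np.ndarray   # (T, ...) ground-truth human motion parameters
    object_gt: np.ndarray  # (T, 4, 4) ground-truth 6-DoF object poses
    object_name: str       # which of the 10 scanned objects was used

# Toy instance with dummy values (object name is hypothetical).
seq = IMHD2Sequence(
    rgb=np.zeros((120, 720, 1280, 3), dtype=np.uint8),
    imu=np.zeros((120, 6)),
    human_gt=np.zeros((120, 52, 3)),
    object_gt=np.tile(np.eye(4), (120, 1, 1)),
    object_name="skateboard",
)
print(seq.rgb.shape, seq.imu.shape)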


Capture Results



I'm-HOI consistently outperforms baseline methods on multiple datasets, especially on IMHD2, which is characterized by fast, highly dynamic interaction motions.


Rhobin Challenge



We welcome you to participate in the Rhobin Challenge on human-object interaction reconstruction tasks! For details, please refer to the challenge website.


Citation


@inproceedings{zhao2024imhoi,
  title={I'm {HOI}: Inertia-aware Monocular Capture of {3D} Human-Object Interactions},
  author={Zhao, Chengfeng and Zhang, Juze and Du, Jiashen and Shan, Ziwei and Wang, Junye and Yu, Jingyi and Wang, Jingya and Xu, Lan},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={729--741},
  year={2024}
}