I'M HOI: Inertia-aware Monocular Capture of 3D Human-Object Interactions

Abstract

We live in a world surrounded by diverse "smart" devices with rich sensing modalities. Conveniently capturing the interactions between humans and these objects remains challenging. In this paper, we present I'm-HOI, a monocular scheme to faithfully capture the 3D motions of both the human and the object in a novel setting: a minimal setup of one RGB camera and one object-mounted Inertial Measurement Unit (IMU). It combines general motion inference and category-aware refinement. For the former, we introduce a holistic human-object tracking method that fuses the IMU signals with the RGB stream and progressively recovers the human motions and then the companion object motions. For the latter, we tailor a category-aware motion diffusion model, which is conditioned on both the raw IMU observations and the results from the previous stage in an over-parameterized representation. It significantly refines the initial results and generates vivid body, hand, and object motions. Moreover, we contribute a large dataset with ground-truth human and object motions, dense RGB inputs, and rich object-mounted IMU measurements. Extensive experiments demonstrate the effectiveness of I'm-HOI under a hybrid capture setting. Our dataset and code will be released to the community.

Pipeline

I'm-HOI combines general motion inference and category-aware refinement. For the former, we introduce a holistic human-object tracking method that fuses the IMU signals with the RGB stream, then progressively recovers the human and companion object motions. For the latter, we tailor a category-aware motion diffusion model, which is conditioned on both the raw IMU observations and the results from the previous stage in an over-parameterized representation. It significantly refines the initial results and generates vivid body, hand, and object motions.
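
The text above does not spell out which over-parameterized representation is used. As an illustration only, the sketch below assumes the widely used continuous 6D rotation representation, in which a 3D rotation is stored as the first two columns of its rotation matrix (six numbers instead of the minimal three) and recovered by Gram-Schmidt orthogonalization. The function names `rotation_to_6d`, `sixd_to_rotation`, and `rot_z` are hypothetical helpers for this example and are not taken from the I'm-HOI code.

```python
import math


def rotation_to_6d(R):
    """Encode a 3x3 rotation matrix (nested row-major lists) as its first two columns."""
    return [R[0][0], R[1][0], R[2][0], R[0][1], R[1][1], R[2][1]]


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def sixd_to_rotation(d6):
    """Decode six numbers into a valid rotation matrix via Gram-Schmidt."""
    a1, a2 = d6[:3], d6[3:]
    b1 = _normalize(a1)
    # Remove the component of a2 along b1, then normalize.
    dot = sum(x * y for x, y in zip(b1, a2))
    b2 = _normalize([y - dot * x for x, y in zip(b1, a2)])
    # The third column is fixed by the right-hand rule.
    b3 = _cross(b1, b2)
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


def rot_z(theta):
    """Rotation by theta radians about the z-axis (used here only as test input)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
```

Under this assumption, any six numbers with two linearly independent halves decode to a valid rotation, so a network can predict them without satisfying a hard constraint. This is one common motivation for over-parameterized motion representations in generative models.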

IMHD² Dataset

We show sample highlights of the Inertial and Multi-view Highly Dynamic human-object interactions Dataset (IMHD²) on the left, and 10 well-scanned objects on the right. In total, the dataset contains 295 sequences and about 892k frames of data.

Capture Results

I'm-HOI consistently outperforms the baselines on multiple datasets, especially on IMHD², which is characterized by fast interaction motions.

Citation

@InProceedings{zhao2024imhoi,
      author    = {Zhao, Chengfeng and Zhang, Juze and Du, Jiashen and Shan, Ziwei and Wang, Junye and Yu, Jingyi and Wang, Jingya and Xu, Lan},
      title     = {I'M HOI: Inertia-aware Monocular Capture of 3D Human-Object Interactions},
      booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      month     = {June},
      year      = {2024},
      pages     = {729-741}
}

Acknowledgments

We thank Jingyan Zhang and Hongdi Yang for setting up the capture system. We thank Jingyan Zhang, Zining Song, Jierui Xu, Weizhi Wang, Gubin Hu, Yelin Wang, Zhiming Yu, Xuanchen Liang, af and zr for data collection. We thank Xiao Yu, Yuntong Liu and Xiaofan Gu for data checking and annotations.

Taking a monocular RGB video and a single inertial measurement unit (IMU) sensor recording as input, I'm-HOI can capture challenging and dynamic human-object interaction (HOI) motions such as skateboarding.