MoMa: Skinned Motion Retargeting Using Masked Pose Modeling

1University of Trento, 2CNIT

Abstract

Motion retargeting requires careful analysis of the differences in both skeletal structure and body shape between source and target characters. Existing skeleton-aware and shape-aware approaches can handle such differences individually, but they struggle when the source and target characters differ significantly in both skeleton (such as joint count and bone length) and shape (such as geometry and mesh properties).

In this work we introduce MoMa, a novel approach for skinned motion retargeting which is both skeleton-aware and shape-aware. Our skeleton-aware module learns to retarget animations by recovering the differences between source and target using a custom transformer-based auto-encoder coupled with a spatio-temporal masking strategy. The auto-encoder transfers motion between source and target skeletons by reconstructing the masked skeletal differences, using the shared joints as a reference. Unlike previous approaches, MoMa can also retarget between skeletons with a varying number of leaf joints. Our shape-aware module incorporates a novel face-based optimizer that adapts skeleton positions to limit collisions between body parts. In contrast to conventional vertex-based methods, our face-based optimizer excels at resolving surface collisions within a body shape, resulting in more accurate retargeted motions.

The proposed architecture outperforms state-of-the-art methods on the Mixamo dataset, both quantitatively and qualitatively.

Overview

Overview. On the left: the skeleton-aware module can retarget motion between 'isomorphic' (yellow), 'homeomorphic' (orange), and 'non-homeomorphic' (green) skeletons. On the right: after skinning, the shape-aware module adapts the skeleton position to avoid mesh collisions while preserving the dynamics of the motion.

Animation Representation

Animation Representation. Each skeleton can be represented as a graph whose nodes (joints) are indexed by n ∈ [1,Nk] and whose parent-child relationships are defined by the kinematic chain. Each skeleton is further described by a static representation Sk, containing the offsets (bone lengths), and a motion representation A(Qk).
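The representation above can be illustrated with a minimal Python sketch. It is not the authors' code: the `Skeleton` class, the `rest_motion` helper, the toy joint names, and the use of unit quaternions for the motion part are all assumptions made for illustration. The static part (parents and offsets) plays the role of Sk, and the per-frame rotations play the role of the motion representation.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Quat = Tuple[float, float, float, float]  # (w, x, y, z); an assumed rotation format


@dataclass
class Skeleton:
    """Static description of one skeleton (hypothetical stand-in for Sk)."""
    names: List[str]
    parents: List[int]   # parents[n] is the parent of joint n; -1 marks the root
    offsets: List[Vec3]  # rest-pose offset of each joint from its parent

    def bone_lengths(self) -> List[float]:
        # Euclidean norm of every offset vector.
        return [sum(c * c for c in o) ** 0.5 for o in self.offsets]

    def leaf_joints(self) -> List[int]:
        # A leaf is a joint that never appears as somebody's parent.
        has_child = set(self.parents)
        return [n for n in range(len(self.parents)) if n not in has_child]

    def chain_to_root(self, n: int) -> List[int]:
        # Walk the kinematic chain from joint n up to the root.
        chain = [n]
        while self.parents[chain[-1]] != -1:
            chain.append(self.parents[chain[-1]])
        return chain


def rest_motion(skeleton: Skeleton, num_frames: int) -> List[List[Quat]]:
    """Motion as one rotation per joint per frame; here every joint stays at rest."""
    identity: Quat = (1.0, 0.0, 0.0, 0.0)
    return [[identity] * len(skeleton.parents) for _ in range(num_frames)]


# Toy five-joint skeleton, invented for this example.
toy = Skeleton(
    names=["hips", "spine", "head", "left_hand", "right_hand"],
    parents=[-1, 0, 1, 1, 1],
    offsets=[(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.5, 0.0),
             (3.0, 4.0, 0.0), (-3.0, 4.0, 0.0)],
)
```

Storing the tree as a parent-index array keeps the kinematic chain implicit in the data, so two skeletons with different joint counts or different leaf joints differ only in the length and content of these lists.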

Architecture

Architecture. Starting from an input animation A(Qk), an encoder embeds each input joint into a set of tokens Ek. Next, we randomly mask (black squares) a subset of Ek and concatenate the remaining missing joints to cover all the possible topologies, resulting in EMC. To model the relationships between the embedded joints in EMC, we add a spatio-temporal positional embedding εN + εW. To this representation, we concatenate the learnable token Sk representing the static information to form the input of the encoding transformer. The transformer learns to map the masked input to a latent feature representation containing the embedded motion EC, in which all the masked joints have been predicted. Finally, the decoder extracts the super-skeleton motion A(QC) from the latent space using the learnt token (Sk at training time and St at test time), from which we can derive the reconstructed input motion A'(Sk,Qk) and the retargeted motion A(St,Qt). We train our auto-encoder to predict the masked joints by enforcing an MSE loss between the input A(Sk,Qk) and the reconstructed A'(Sk,Qk).
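The masking, padding, positional-embedding, and loss steps described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the published implementation: the function names are hypothetical, masking is assumed to pick a fixed fraction of (frame, joint) tokens uniformly at random, missing super-skeleton joints are assumed to be filled with the same mask token, and εN and εW are assumed to index joints and frames respectively. The transformer itself is omitted.

```python
import random
from typing import List, Optional, Set, Tuple

Token = List[float]
Grid = List[List[Optional[Token]]]


def mask_and_pad(tokens: List[List[Token]], joint_ids: List[int],
                 num_super_joints: int, mask_ratio: float,
                 rng: random.Random) -> Tuple[Grid, Set[Tuple[int, int]]]:
    """Scatter tokens[t][i] into a frame x super-joint grid.

    joint_ids[i] is the super-skeleton index of input joint i. Cells that are
    masked, or that belong to joints absent from the input skeleton, stay None.
    """
    num_frames = len(tokens)
    cells = [(t, i) for t in range(num_frames) for i in range(len(joint_ids))]
    masked = set(rng.sample(cells, int(mask_ratio * len(cells))))
    grid: Grid = [[None] * num_super_joints for _ in range(num_frames)]
    for t, i in cells:
        if (t, i) not in masked:
            grid[t][joint_ids[i]] = list(tokens[t][i])
    return grid, masked


def add_positional(grid: Grid, joint_emb: List[Token], frame_emb: List[Token],
                   mask_token: Token) -> List[List[Token]]:
    """Replace None by the mask token, then add joint and frame embeddings."""
    out = []
    for t, row in enumerate(grid):
        out_row = []
        for j, tok in enumerate(row):
            base = mask_token if tok is None else tok
            out_row.append([b + pj + pt for b, pj, pt
                            in zip(base, joint_emb[j], frame_emb[t])])
        out.append(out_row)
    return out


def mse(pred: List[List[Token]], target: List[List[Token]]) -> float:
    """Mean squared error over every frame, joint, and feature."""
    diffs = [(p - q) ** 2
             for pf, tf in zip(pred, target)
             for pj, tj in zip(pf, tf)
             for p, q in zip(pj, tj)]
    return sum(diffs) / len(diffs)
```

In a training loop, the output of `add_positional` would be fed to the encoding transformer together with the static token, and `mse` would compare the decoded motion with the unmasked input. Passing an explicit `random.Random` instance keeps the masking reproducible.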

Results

Results. Qualitative results for retargeting between isomorphic skeletons. Our method (last column) achieves a lower Skeleton Collision Error (SCE) than state-of-the-art methods.

BibTeX

@article{martinelli2024moma,
  title={MoMa: Skinned motion retargeting using masked pose modeling},
  author={Martinelli, Giulia and Garau, Nicola and Bisagno, Niccol{\'o} and Conci, Nicola},
  journal={Computer Vision and Image Understanding},
  volume={249},
  pages={104141},
  year={2024},
  publisher={Elsevier}
}