Skeleton-Aware Motion Retargeting Using Masked Pose Modeling

University of Trento

Abstract

Motion retargeting aims at transferring a given motion from a source character to a target one. The task becomes increasingly challenging as the differences in body shape and skeletal structure between the source and target characters grow.

We present a novel approach for motion retargeting between skeletons, whose goal is to transfer the motion from a source skeleton to a target one in a different format. Our approach works when the two skeletons differ in scale, bone length, and number of joints. Unlike previous approaches, our method can also retarget between skeletons that differ in hierarchy and topology, such as retargeting between animals and humans. We train a transformer with a random masking strategy in both time and space: by reconstructing the masked joints of the input skeleton, the model learns a deep representation of the motion. At test time, our method retargets the input motion to different skeletons, bridging the disparities between source and target.

Our method outperforms the state of the art on the Mixamo dataset, which features high variance across skeleton formats. Moreover, we show that our approach generalizes effectively across domains, transferring between human and quadruped motion and vice versa.

Topologies

Topologies. Skeletons (a) and (b) are isomorphic: they have the same joints but different bone lengths. Both (a) and (b) are homeomorphic to skeleton (c), which has a different number of joints; they are homeomorphic because they can all be reduced to a common primal skeleton (d). The skeleton of a quadruped (e) is neither homeomorphic nor isomorphic to the others because its topology differs, so (e) cannot be reduced to the common primal skeleton.
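The homeomorphism test described above can be sketched in code: two skeletons reduce to the same primal skeleton when collapsing every chain of internal joints (joints with exactly two neighbors) yields the same graph. This is an illustrative sketch, not the paper's implementation; the joint names and edge-list representation are assumptions.

```python
from collections import defaultdict

def primal_skeleton(edges):
    """Collapse every joint with exactly two neighbors (a chain joint),
    keeping only end effectors and branching joints, and return the set
    of edges of the resulting primal skeleton."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    # Joints that survive the reduction: anything whose degree is not 2.
    keep = {j for j, nbrs in adj.items() if len(nbrs) != 2}
    primal = set()
    for start in keep:
        for nxt in adj[start]:
            prev, cur = start, nxt
            # Walk along the chain until the next surviving joint.
            while cur not in keep:
                prev, cur = cur, next(n for n in adj[cur] if n != prev)
            primal.add(frozenset((start, cur)))
    return primal

# Toy skeletons: (a) has three spine joints between hips and head,
# (c) has only one, yet both reduce to the same primal skeleton.
a = [("hips", "s1"), ("s1", "s2"), ("s2", "head"),
     ("hips", "legL"), ("hips", "legR")]
c = [("hips", "s1"), ("s1", "head"),
     ("hips", "legL"), ("hips", "legR")]
print(primal_skeleton(a) == primal_skeleton(c))  # prints True
```

A quadruped skeleton, by contrast, has a different branching structure, so its primal skeleton differs from (d) no matter how its chains are collapsed.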

Architecture

Architecture.

Starting with an input animation A(Qk), a tokenizer first converts each joint of the input into a series of tokens Ek. We then mask (indicated by black squares) a random subset of the Ek tokens. The remaining unmasked tokens are concatenated across every possible topology, forming EMC. To capture the relationships between the joints in EMC, we add a spatio-temporal positional embedding, εN + εW. This representation is augmented with a learnable token Sk encoding static information, and forms the input to the transformer encoder. The transformer maps the masked input to a latent feature representation EC, which embeds the motion with all masked joints predicted. A decoder then extracts the super-skeleton motion A(QC) from this latent space, conditioned on the learned static token (Sk during training, St during testing). From this, we derive the reconstructed input motion A'(Sk, Qk) and the retargeted motion A(St, Qt). The auto-encoder is trained to predict the masked joints by minimizing the mean squared error (MSE) between the original input A(Sk, Qk) and the reconstructed output A'(Sk, Qk).
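The training loop above can be sketched as a minimal masked pose-modeling objective in PyTorch. This is a simplified illustration, not the authors' implementation: the linear tokenizer, the model dimensions, and the single-frame (spatial-only) setup are all assumptions; the paper additionally masks across time, conditions on static tokens, and decodes to a super-skeleton.

```python
import torch
import torch.nn as nn

class MaskedPoseAutoencoder(nn.Module):
    """Toy masked pose modeling: tokenize joints, mask a random subset,
    let a transformer fill them in, decode back to joint coordinates."""
    def __init__(self, n_joints=24, joint_dim=3, d_model=64):
        super().__init__()
        self.tokenizer = nn.Linear(joint_dim, d_model)   # per-joint tokens E_k
        self.mask_token = nn.Parameter(torch.zeros(d_model))
        self.pos_emb = nn.Parameter(torch.zeros(n_joints, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.decoder = nn.Linear(d_model, joint_dim)     # back to coordinates

    def forward(self, pose, mask):
        tokens = self.tokenizer(pose)                    # (B, J, d_model)
        # Replace masked joints (the "black squares") with the mask token.
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token, tokens)
        tokens = tokens + self.pos_emb                   # positional embedding
        latent = self.encoder(tokens)                    # masked joints filled in
        return self.decoder(latent)                      # reconstructed pose

# One training step on random data.
model = MaskedPoseAutoencoder()
pose = torch.randn(8, 24, 3)                             # batch of toy poses
mask = torch.rand(8, 24) < 0.5                           # random joint masking
recon = model(pose, mask)
loss = nn.functional.mse_loss(recon, pose)               # MSE reconstruction loss
loss.backward()
```

At test time, swapping the static conditioning (Sk for St in the paper's notation) is what turns reconstruction into retargeting; this sketch only shows the reconstruction half of that pipeline.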

Results

Results. Qualitative results for retargeting between isomorphic skeletons. The outputs are overlaid with the ground truth, shown as the green skeleton. Our results demonstrate greater stability and closer alignment with the ground truth. The second row presents the mesh visualization; notably, our method achieves superior qualitative results even without employing any mesh collision penalizer.

Real-world results

Real-world results. Our approach can retarget from real-world characters, whose motion can be modeled by a common SMPL mesh, even for challenging and unseen motions such as a backflip.

BibTeX

Coming soon