MEGA: Masked Generative Autoencoder for Human Mesh Recovery

1CentraleSupélec, IETR UMR CNRS 6164, France - 2Inria, Univ. Grenoble Alpes, CNRS, LJK, France - 3Institut de Robòtica i Informàtica Industrial, CSIC-UPC, Spain

ArXiv 2024
[Teaser figure]

Abstract

Human Mesh Recovery (HMR) from a single RGB image is a highly ambiguous problem, as similar 2D projections can correspond to multiple 3D interpretations. Nevertheless, most HMR methods overlook this ambiguity and make a single prediction without accounting for the associated uncertainty. A few approaches generate a distribution of human meshes, enabling the sampling of multiple predictions; however, none of them is competitive with the latest single-output model when making a single prediction. This work proposes a new approach based on masked generative modeling. By tokenizing the human pose and shape, we formulate the HMR task as generating a sequence of discrete tokens conditioned on an input image. We introduce MEGA, a MaskEd Generative Autoencoder trained to recover human meshes from images and partial human mesh token sequences. Given an image, our flexible generation scheme allows us to predict a single human mesh in deterministic mode or to generate multiple human meshes in stochastic mode. MEGA enables us to propose multiple outputs and to evaluate the uncertainty of the predictions. Experiments on in-the-wild benchmarks show that MEGA achieves state-of-the-art performance in deterministic and stochastic modes, outperforming single-output and multi-output approaches.

Architecture of the model


[Architecture figure]

MEGA is a multi-output HMR approach based on self-supervised learning and masked generative modeling of tokenized human meshes. It relies on a Mesh-VQ-VAE to encode/decode a 3D human mesh to/from a sequence of discrete tokens.

Our training process unfolds in two steps: (1) Firstly, akin to (vector quantized) masked autoencoders, we pre-train MEGA in a self-supervised manner to reconstruct human mesh tokens from partially visible inputs. This leverages large amounts of motion capture data without the need for paired image data. (2) Subsequently, for HMR from RGB images, we train MEGA to predict randomly masked human mesh tokens conditioned on image feature embeddings. During inference, we begin with a fully masked sequence of tokens and generate a human mesh conditioned on an input image.
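The masking step of this training recipe can be sketched as follows. This is a minimal illustration, not the paper's implementation: the token count (54), codebook size (512), and mask ratio are hypothetical placeholders, and the real model predicts the hidden tokens with an image-conditioned transformer.

```python
import numpy as np

def random_mask(tokens, mask_ratio, mask_id, rng):
    """Hide a random subset of mesh tokens behind a reserved [MASK] id.

    tokens: (N,) int array of discrete mesh-token indices
    mask_ratio: fraction of tokens to hide
    mask_id: reserved vocabulary index standing for [MASK]
    """
    n = len(tokens)
    n_masked = max(1, int(round(mask_ratio * n)))
    idx = rng.choice(n, size=n_masked, replace=False)  # positions to hide
    masked = tokens.copy()
    masked[idx] = mask_id
    return masked, idx

rng = np.random.default_rng(0)
tokens = rng.integers(0, 512, size=54)  # hypothetical: 54 tokens, codebook of 512
masked, idx = random_mask(tokens, 0.6, mask_id=512, rng=rng)
# Training minimizes cross-entropy on the masked positions only, with the
# model conditioned on image feature embeddings (step 2) or nothing (step 1).
```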

We propose two distinct generation modes: (2.a) In deterministic mode, MEGA predicts all tokens in a single forward pass, ensuring speed and accuracy; (2.b) In stochastic mode, the generation process involves iteratively sampling human mesh tokens, enabling MEGA to produce multiple predictions from a single image for uncertainty quantification.
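The stochastic mode above follows the iterative masked-generation pattern (as in MaskGIT-style decoders): start fully masked, sample every masked token, commit the most confident ones, and re-mask the rest on a decaying schedule. The sketch below illustrates that pattern under assumed placeholders (a toy logits function, a small vocabulary, a cosine schedule); it is not the paper's exact sampler.

```python
import numpy as np

def stochastic_decode(predict_logits, n_tokens, vocab, mask_id, steps, rng):
    """Iterative sampling: commit high-confidence tokens, re-mask the rest."""
    tokens = np.full(n_tokens, mask_id)
    for step in range(steps):
        logits = predict_logits(tokens)                 # (n_tokens, vocab)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        sampled = np.array([rng.choice(vocab, p=p) for p in probs])
        conf = probs[np.arange(n_tokens), sampled]
        masked = tokens == mask_id
        sampled = np.where(masked, sampled, tokens)     # keep committed tokens
        conf = np.where(masked, conf, np.inf)           # never re-mask them
        # cosine schedule: fraction of tokens still masked after this step
        n_remask = int(np.floor(np.cos(np.pi / 2 * (step + 1) / steps) * n_tokens))
        if n_remask > 0:
            sampled[np.argsort(conf)[:n_remask]] = mask_id
        tokens = sampled
    return tokens

rng = np.random.default_rng(1)
def toy_logits(tokens):
    # stand-in for the real image-conditioned transformer
    return rng.normal(size=(len(tokens), 8))

out = stochastic_decode(toy_logits, n_tokens=12, vocab=8, mask_id=8, steps=4, rng=rng)
```

In deterministic mode, this loop collapses to a single forward pass with an argmax over every position, which is why that mode is fast while remaining accurate.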


Stochastic mode

In stochastic mode, we iteratively sample human mesh tokens to build predictions, which leads to multiple outputs given a single image. We can then compute the uncertainty of each vertex as its standard deviation over the predictions.
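The per-vertex uncertainty described above can be computed as follows. One caveat: how the three coordinate axes are aggregated into a single scalar per vertex is our assumption (RMS distance to the mean vertex position); the sample count and the SMPL-sized mesh in the demo are also illustrative.

```python
import numpy as np

def vertex_uncertainty(meshes):
    """Per-vertex uncertainty from K stochastic predictions.

    meshes: (K, V, 3) array of K sampled meshes with V vertices each.
    Returns a (V,) array: RMS distance of each vertex to its mean position.
    """
    mean = meshes.mean(axis=0)                     # (V, 3) mean mesh
    dev = np.linalg.norm(meshes - mean, axis=-1)   # (K, V) deviation per sample
    return np.sqrt((dev ** 2).mean(axis=0))        # (V,) scalar per vertex

rng = np.random.default_rng(0)
samples = rng.normal(scale=0.02, size=(10, 6890, 3))  # e.g. 10 samples, SMPL-sized mesh
unc = vertex_uncertainty(samples)
```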


[Reconstruction figure]


We visualise multiple outputs given single images in stochastic mode.



BibTeX

@article{fiche2024mega,
  title={MEGA: Masked Generative Autoencoder for Human Mesh Recovery},
  author={Fiche, Gu{\'e}nol{\'e} and Leglaive, Simon and Alameda-Pineda, Xavier and Moreno-Noguer, Francesc},
  journal={arXiv preprint arXiv:2405.18839},
  year={2024}
}