VQ-HPS: Human Pose and Shape Estimation in a Vector-Quantized Latent Space

1CentraleSupélec, IETR UMR CNRS 6164, France - 2Inria, Univ. Grenoble Alpes, CNRS, LJK, France - 3Institut de Robòtica i Informàtica Industrial, CSIC-UPC, Spain

European Conference on Computer Vision (ECCV) 2024

We present the results of VQ-HPS on two challenging scenarios with in-the-wild conditions and poor illumination, comparing its performance to that of HMR, CLIFF and FastMETRO-S when trained on little data.

Abstract

Previous works on Human Pose and Shape Estimation (HPSE) from RGB images can be broadly categorized into two main groups: parametric and non-parametric approaches. Parametric techniques leverage a low-dimensional statistical body model for realistic results, whereas recent non-parametric methods achieve higher precision by directly regressing the 3D coordinates of the human body mesh. This work introduces a novel paradigm to address the HPSE problem, involving a low-dimensional discrete latent representation of the human mesh and framing HPSE as a classification task. Instead of predicting body model parameters or 3D vertex coordinates, we focus on predicting the proposed discrete latent representation, which can be decoded into a registered human mesh. This innovative paradigm offers two key advantages. Firstly, predicting a low-dimensional discrete representation confines our predictions to the space of anthropomorphic poses and shapes even when little training data is available. Secondly, by framing the problem as a classification task, we can harness the discriminative power inherent in neural networks. The proposed model, VQ-HPS, predicts the discrete latent representation of the mesh. The experimental results demonstrate that VQ-HPS outperforms the current state-of-the-art non-parametric approaches while yielding results as realistic as those produced by parametric methods when trained with little data. VQ-HPS also shows promising results when training on large-scale datasets, highlighting the significant potential of the classification approach for HPSE.



[Figure: Architecture of the model]


Architecture

This work introduces a method that differs significantly from prior human pose and shape estimation (HPSE) approaches. Instead of predicting the parameters of a human body model or 3D vertex coordinates, we predict a learned discrete latent representation of 3D meshes, which turns HPSE into a classification problem.
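To make the classification framing concrete, here is a minimal NumPy sketch. It assumes the discrete representation is a sequence of N indices into a codebook of K entries, and that training uses a standard cross-entropy loss over those indices; the page does not state the training objective, so the loss, the function names, and the sizes below are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def index_cross_entropy(logits, target_indices):
    """Mean cross-entropy between logits (N, K) and target indices (N,).

    Hypothetical helper: each of the N latent positions is treated as an
    independent K-way classification problem.
    """
    # Numerically stable log-softmax over the K classes.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    rows = np.arange(len(target_indices))
    return -log_probs[rows, target_indices].mean()

def predict_indices(logits):
    """Pick the most likely codebook index at every latent position."""
    return logits.argmax(axis=1)

# Toy example with assumed sizes: N = 4 positions, K = 8 codebook entries.
logits = np.zeros((4, 8))
targets = np.array([0, 1, 2, 3])
loss = index_cross_entropy(logits, targets)  # uniform prediction
```

With uniform logits the loss equals log K, the usual baseline for a K-way classifier that has learned nothing yet.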

To learn our discrete latent representation of meshes, we build on the vector-quantized variational autoencoder (VQ-VAE) framework and adapt it to the fully convolutional mesh autoencoder. The encoder of the proposed model, called Mesh-VQ-VAE, provides a low-dimensional discrete latent representation that preserves the spatial structure of the mesh.
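The quantization step at the heart of any VQ-VAE can be sketched as a nearest-neighbour lookup in a learned codebook. The snippet below is a generic illustration of that step only: the mesh encoder and decoder are omitted, and the codebook size, latent dimension, and number of latent positions are hypothetical values chosen for the example, not figures taken from this page.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry.

    latents:  (N, D) continuous encoder outputs, one per latent position
    codebook: (K, D) embedding vectors
    Returns (indices, quantized) with shapes (N,) and (N, D).
    """
    # Squared Euclidean distance between every latent and every entry.
    diff = latents[:, None, :] - codebook[None, :, :]
    dists = np.sum(diff ** 2, axis=-1)  # (N, K)
    indices = np.argmin(dists, axis=1)
    return indices, codebook[indices]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 9))  # hypothetical K = 512, D = 9
latents = rng.normal(size=(54, 9))    # hypothetical N = 54 positions
indices, quantized = quantize(latents, codebook)
```

The integer array `indices` is the discrete representation; decoding would start from `quantized`, the corresponding codebook vectors.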

We then propose a Transformer-based encoder-decoder model, called VQ-HPS, that predicts the discrete Mesh-VQ-VAE representation of the human mesh from image features. To ease the learning of the low-dimensional mesh representation, the predicted mesh is non-oriented and centered on the origin: we call it a canonical mesh. To obtain the final oriented mesh, we then need to predict the rotation, and for better alignment with the image, we also regress the weak-perspective camera.
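The last step, going from a canonical mesh to an oriented mesh aligned with the image, can be sketched as follows. This assumes the rotation is available as a 3x3 matrix and that the weak-perspective camera is parameterized by a scale and a 2D translation applied as `scale * xy + translation`; other conventions exist, and the page does not say which one is used. The function name is hypothetical.

```python
import numpy as np

def orient_and_project(canonical_vertices, rotation, scale, translation_2d):
    """Rotate a canonical mesh, then project it with a weak-perspective camera.

    canonical_vertices: (V, 3) mesh centered on the origin
    rotation:           (3, 3) global rotation matrix
    scale:              scalar camera scale
    translation_2d:     (2,) image-plane translation
    Returns (oriented (V, 3), projected (V, 2)).
    """
    # Apply the global rotation to every vertex (row-vector convention).
    oriented = canonical_vertices @ rotation.T
    # Weak perspective: drop depth, then scale and shift in the image plane.
    projected = scale * oriented[:, :2] + translation_2d
    return oriented, projected

# Toy example: rotate one vertex by 90 degrees about the z-axis.
rot_z_90 = np.array([[0.0, -1.0, 0.0],
                     [1.0,  0.0, 0.0],
                     [0.0,  0.0, 1.0]])
verts = np.array([[1.0, 0.0, 0.0]])
oriented, projected = orient_and_project(
    verts, rot_z_90, scale=2.0, translation_2d=np.array([0.5, -0.5])
)
```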



Mesh-VQ-VAE


The Mesh-VQ-VAE is trained on the AMASS dataset and fine-tuned on the 3DPW training set. The final reconstruction error is 4.7 mm. Some reconstruction samples from the 3DPW dataset can be found below.
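The page does not say which metric the reconstruction error refers to. One common choice for registered meshes is the mean per-vertex Euclidean distance, sketched below under the assumption that vertex coordinates are stored in meters; the function name is hypothetical.

```python
import numpy as np

def mean_per_vertex_error_mm(pred_m, gt_m):
    """Mean Euclidean distance between corresponding vertices, in millimeters.

    pred_m, gt_m: (V, 3) vertex arrays in meters with identical topology.
    """
    return 1000.0 * np.linalg.norm(pred_m - gt_m, axis=-1).mean()

# Toy example: every vertex is displaced by 3 mm along x.
gt = np.zeros((5, 3))
pred = gt + np.array([0.003, 0.0, 0.0])
error_mm = mean_per_vertex_error_mm(pred, gt)
```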


[Figure: Mesh-VQ-VAE reconstruction samples from the 3DPW dataset]

Training on scarce data

We train VQ-HPS separately on the 3DPW, COCO, and EMDB training sets to see how it performs when trained on limited data. We compare our performance with HMR, CLIFF, and FastMETRO trained with the same data. Qualitative results can be found below.


[Figure: qualitative results when training on scarce data]

Training on large-scale datasets

Following standard practice, we train VQ-HPS on Human3.6M, MPI-INF-3DHP, COCO, and MPII. Qualitative results on the 3DPW and EMDB datasets are available below. For video results, the frames are processed independently, but we apply a linear moving average with a window of size 10 to the vertices to smooth the results.
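The temporal smoothing can be sketched as a moving average over per-frame vertex positions. The version below assumes a trailing window (each frame is averaged with up to nine preceding frames) that shrinks at the start of the sequence; the page only gives the window size, so whether the actual window is trailing or centered is an assumption here, as is the function name.

```python
import numpy as np

def smooth_vertices(vertices, window=10):
    """Trailing moving average over time for a sequence of meshes.

    vertices: (T, V, 3) per-frame vertex positions
    window:   number of frames averaged (fewer at the sequence start)
    """
    vertices = np.asarray(vertices, dtype=float)
    num_frames = vertices.shape[0]
    # Prefix sums over time make every window sum a single subtraction.
    csum = np.concatenate(
        [np.zeros_like(vertices[:1]), np.cumsum(vertices, axis=0)], axis=0
    )
    ends = np.arange(1, num_frames + 1)
    starts = np.maximum(ends - window, 0)
    counts = (ends - starts).reshape(-1, 1, 1)
    return (csum[ends] - csum[starts]) / counts

# Toy sequence: 20 frames, 2 vertices, every coordinate equals the frame index.
sequence = np.arange(20.0).reshape(20, 1, 1) * np.ones((1, 2, 3))
smoothed = smooth_vertices(sequence, window=10)
```

A trailing window introduces a small lag but needs no future frames; a centered window would remove the lag at the cost of looking ahead.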


[Figure: qualitative results when training on large-scale datasets]


BibTeX

@inproceedings{fiche2024vq,
  title={{VQ-HPS}: Human Pose and Shape Estimation in a Vector-Quantized Latent Space},
  author={Fiche, Gu{\'e}nol{\'e} and Leglaive, Simon and Alameda-Pineda, Xavier and Agudo, Antonio and Moreno-Noguer, Francesc},
  booktitle={European Conference on Computer Vision ({ECCV})},
  year={2024}
}