Journal club: Wu S. et al., 2020

Unsupervised Learning of Probably Symmetric Deformable 3D Objects from Images in the Wild

Shangzhe Wu | Christian Rupprecht | Andrea Vedaldi

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 1-10.

arXiv: https://arxiv.org/abs/1911.11130

This paper was presented at CVPR 2020, where it received the Best Paper Award. It tackles the 3D reconstruction of objects (mainly faces) from single raw images of the subject through a learning-based approach, without any kind of external supervision. The paper is a striking example of the potential of self-supervised learning, which is, in turn, extremely appealing for computer vision applications in the medical domain.

Problem and Related Work

3D reconstruction from 2D images is a well-established problem in computer graphics. For the problem to be well-posed, at least two views of the object are needed, with corresponding key-points identified on each of them. An alternative is the structure-from-motion approach, where multiple views coupled with camera motion cues are used to estimate the 3D structure. Another family of models is the learning-based one: by learning suitable object priors from the training data, these models can solve the problem even in its ill-posed formulation (e.g. from a single view). Traditionally, learning-based methods have been trained with different kinds of supervision signals (3D shapes, 2D key-point annotations, 2D object masks, etc.). In this work, a deep learning model that estimates the 3D shape from a single view of the object is built and trained end-to-end in a completely self-supervised way.

The method

To reconstruct the 3D shape of the object from a single image, the image is factored into four components (a depth map, an albedo map, a viewpoint and an illumination model), which are then recombined to generate the 3D shape and to render it back onto the image plane, producing a synthetic image that should match the original one. In this way the learning loop is closed without any external supervision signal: the distance between the original image and the reconstructed one is the loss to be optimized. The four components are estimated from the image by four independent neural networks. The core insight of this approach is that each component acquires its semantic meaning ("depth", "albedo", "viewpoint", "illumination") only through the way it is used in the reconstruction and rendering operations.

Such an approach is, of course, extremely prone to degenerate solutions: as long as the loss is minimized, the network has no incentive to learn the expected, meaningful factorization. This ambiguity is inevitable given the ill-posed nature of reconstructing a 3D object from a single image without supervision. To resolve it, the objects are assumed to be (probably) bilaterally symmetric; the model additionally predicts a per-pixel confidence map that tolerates asymmetric details (hence the "probably" in the title). Under this hypothesis, the network can be effectively trained end-to-end.
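To make the pipeline concrete, below is a minimal PyTorch sketch of the factor-and-reconstruct loop. It is not the authors' implementation: the viewpoint branch and the differentiable renderer are elided (the canonical frame is identified with the image frame), lighting is reduced to a single Lambertian direction, the learned confidence map is replaced by a plain symmetry consistency penalty, and `ConvHead`, `normals_from_depth` and the 0.1 symmetry weight are illustrative placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvHead(nn.Module):
    """Toy fully convolutional net standing in for the paper's encoders."""
    def __init__(self, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_ch, 3, padding=1),
        )
    def forward(self, x):
        return self.net(x)

depth_net  = ConvHead(1)       # per-pixel depth map
albedo_net = ConvHead(3)       # per-pixel reflectance
light_net  = nn.Sequential(    # global lighting direction (3-vector)
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 3),
)

def normals_from_depth(d):
    """Surface normals via finite differences of the depth map."""
    dzdx = F.pad(d[..., :, 1:] - d[..., :, :-1], (0, 1))
    dzdy = F.pad(d[..., 1:, :] - d[..., :-1, :], (0, 0, 0, 1))
    n = torch.cat([-dzdx, -dzdy, torch.ones_like(d)], dim=1)
    return F.normalize(n, dim=1)

def reconstruct(img):
    # The outputs only *become* depth, albedo and light through their
    # role in this simplified Lambertian image-formation step.
    d = depth_net(img)
    a = torch.sigmoid(albedo_net(img))
    l = F.normalize(light_net(img), dim=1)
    shading = (normals_from_depth(d) * l[:, :, None, None]) \
        .sum(1, keepdim=True).clamp(min=0)
    return a * shading, d, a

def loss_fn(img):
    recon, d, a = reconstruct(img)
    photometric = (recon - img).abs().mean()
    # Symmetry prior: depth and albedo should be (probably) invariant to a
    # horizontal flip in the canonical frame. The paper enforces this through
    # reconstruction from flipped factors under a learned confidence map;
    # a plain consistency penalty is sketched here instead.
    symmetry = (d - torch.flip(d, dims=[3])).abs().mean() \
             + (a - torch.flip(a, dims=[3])).abs().mean()
    return photometric + 0.1 * symmetry

img = torch.rand(2, 3, 64, 64)   # dummy batch of "in the wild" images
loss = loss_fn(img)
loss.backward()
```

Note that nothing in the code labels the network outputs as "depth" or "albedo"; only their use inside `reconstruct` gives them that meaning, which is exactly why the symmetry prior is needed to rule out degenerate factorizations.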

Discussion

This paper proposes a new learning-based method that reconstructs the 3D shape of an object from a single 2D image, together with a completely self-supervised, end-to-end training strategy. Although the work does not belong to the medical field, the idea of using self-supervision in a medical context is extremely appealing. Deep learning models for surgical applications would benefit from training on huge amounts of data, in order to capture their intrinsic variability; on the other hand, labelling medical data is extremely difficult and expensive. Developing reliable self-supervised approaches would let us efficiently exploit the large amounts of unlabelled data that can nowadays be collected, without the expensive burden of annotating them.