Diffusion-SDF: Conditional Generative Modeling of Signed Distance Functions
Gene Chou, Yuval Bahat, Felix Heide
arXiv Preprint
Probabilistic diffusion models have achieved state-of-the-art results for image synthesis, inpainting, and text-to-image tasks. However, they are still in the early stages of generating complex 3D shapes.
This work proposes Diffusion-SDF, a generative model for shape completion, single-view reconstruction, and reconstruction of real-scanned point clouds. We use neural signed distance functions (SDFs) as our 3D representation to parameterize the geometry of various signals (e.g., point clouds, 2D images) through neural networks. Neural SDFs are implicit functions and diffusing them amounts to learning the reversal of their neural network weights, which we solve using a custom modulation module.
Extensive experiments show that our method is capable of both realistic unconditional generation and conditional generation from partial inputs. This work expands the domain of diffusion models from learning 2D, explicit representations to 3D, implicit representations.
Modulating SDFs
We first create a compressed representation of SDFs. Directly diffusing thousands of SDFs, where each SDF represents one object, is difficult: every SDF must first be trained individually, and the distribution of thousands of SDFs is challenging to learn. Thus, we map SDFs, represented by MLPs, to 1D latent vectors by jointly training a conditional SDF network and a VAE. This joint training serves two objectives: the diffusion model can effectively learn and sample from the distribution of latent vectors, and any output generated by the diffusion model can be mapped back into an SDF.
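The sketch below illustrates this idea in PyTorch: a VAE bottleneck compresses a shape feature into a 1D modulation vector, and a conditional SDF network decodes that vector, together with query points, into signed distances. The module names, layer sizes, and the assumption of a pre-existing shape feature encoder are illustrative, not the paper's exact implementation.

```python
# Minimal sketch (PyTorch) of SDF modulation; dimensions and names are assumptions.
import torch
import torch.nn as nn

class VAEBottleneck(nn.Module):
    """Compress a shape feature into a 1D latent (modulation) vector."""
    def __init__(self, feat_dim=512, latent_dim=256):
        super().__init__()
        self.to_mu = nn.Linear(feat_dim, latent_dim)
        self.to_logvar = nn.Linear(feat_dim, latent_dim)

    def forward(self, feat):
        mu, logvar = self.to_mu(feat), self.to_logvar(feat)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL to N(0, I)
        return z, kl

class ConditionalSDF(nn.Module):
    """MLP mapping (query point, modulation vector) -> signed distance."""
    def __init__(self, latent_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, xyz, z):
        # xyz: (B, N, 3) query points; z: (B, latent_dim) modulation vector
        z_exp = z.unsqueeze(1).expand(-1, xyz.shape[1], -1)
        return self.net(torch.cat([xyz, z_exp], dim=-1)).squeeze(-1)

# Joint objective per batch (hypothetical weighting beta):
#   z, kl = bottleneck(shape_feature);  sdf_pred = sdf_net(query_xyz, z)
#   loss = L1(sdf_pred, sdf_gt) + beta * kl
```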
Diffusing Modulation Vectors
Next, we use the sampled latent vectors $z$ from the previous step as the sample space for the proposed diffusion probabilistic model. In every training iteration, Gaussian noise is added to the latent vectors at random timesteps, and the model learns to denoise them. At generation time, the denoised modulation vector $z'$ is passed back to our SDF-VAE model to obtain the final generated SDF.
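A minimal sketch of one such training step follows, assuming a standard DDPM formulation with a linear noise schedule; the denoiser interface, schedule, and hyperparameters are placeholders rather than the paper's configuration.

```python
# Hedged sketch of a DDPM-style training step on modulation vectors z.
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # assumed linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(denoiser, z0):
    """z0: (B, latent_dim) clean modulation vectors from the SDF-VAE encoder."""
    t = torch.randint(0, T, (z0.shape[0],), device=z0.device)  # random timesteps
    a_bar = alphas_cumprod.to(z0.device)[t].unsqueeze(-1)      # (B, 1)
    noise = torch.randn_like(z0)
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise     # forward noising
    noise_pred = denoiser(z_t, t)                              # predict the added noise
    return F.mse_loss(noise_pred, noise)

# At sampling time, pure Gaussian noise is iteratively denoised into a vector z',
# which the SDF-VAE decoder then turns into a full signed distance function.
```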
Furthermore, given some condition $y$, we can train a custom encoder $\Upsilon$ to extract shape features and leverage cross-attention to guide multi-modal generations.
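As a rough illustration of this conditioning, the block below injects condition features into the latent via cross-attention; it assumes the encoder $\Upsilon$ has already produced a sequence of condition tokens and treats the modulation vector as a short token sequence purely for illustration.

```python
# Hedged sketch of cross-attention conditioning; interfaces are assumptions.
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Inject condition features into latent tokens via cross-attention."""
    def __init__(self, latent_dim=256, cond_dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(latent_dim, heads,
                                          kdim=cond_dim, vdim=cond_dim,
                                          batch_first=True)
        self.norm = nn.LayerNorm(latent_dim)

    def forward(self, z_tokens, cond_tokens):
        # z_tokens:    (B, Nz, latent_dim) noisy latent treated as a token sequence
        # cond_tokens: (B, Nc, cond_dim) shape features produced by the encoder Upsilon
        attn_out, _ = self.attn(query=z_tokens, key=cond_tokens, value=cond_tokens)
        return self.norm(z_tokens + attn_out)       # residual connection + layer norm
```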
Tuning End-to-End
Our joint SDF-VAE model and diffusion model can be trained end-to-end. As shown by the gray arrow, the output of the VAE can be used directly as input to the diffusion model, whose output can then be fed into the VAE decoder to compute its SDF loss. In practice, we found that training end-to-end from scratch took longer than training the modules separately, since there are many building blocks. After the two modules finish training, however, we fine-tune them end-to-end. We find that this fine-tuning improves generation diversity and output complexity.
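The following is a hedged sketch of one fine-tuning step, reusing the noise schedule (`T`, `alphas_cumprod`) and the modules from the sketches above; the one-step latent estimate and the loss weights are assumptions for illustration, not the paper's exact procedure.

```python
# Hedged sketch of an end-to-end fine-tuning step: the VAE latent feeds the
# diffusion model, and the (estimated) denoised latent is decoded back into an
# SDF so both losses can be backpropagated jointly.
import torch
import torch.nn.functional as F

def finetune_step(shape_feat, query_xyz, sdf_gt,
                  encoder, bottleneck, sdf_net, denoiser, optimizer,
                  beta=1e-4, lam=1.0):
    optimizer.zero_grad()

    z0, kl = bottleneck(encoder(shape_feat))          # VAE: shape feature -> latent z0

    # Forward-noise z0, predict the noise, and form a one-step estimate z0_hat.
    t = torch.randint(0, T, (z0.shape[0],), device=z0.device)
    a_bar = alphas_cumprod.to(z0.device)[t].unsqueeze(-1)
    noise = torch.randn_like(z0)
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise
    noise_pred = denoiser(z_t, t)
    dif_loss = F.mse_loss(noise_pred, noise)
    z0_hat = (z_t - (1.0 - a_bar).sqrt() * noise_pred) / a_bar.sqrt()

    # Decode the diffusion output back into an SDF and supervise it.
    sdf_pred = sdf_net(query_xyz, z0_hat)
    sdf_loss = F.l1_loss(sdf_pred, sdf_gt)

    loss = sdf_loss + beta * kl + lam * dif_loss      # joint objective (weights assumed)
    loss.backward()
    optimizer.step()
    return loss.item()
```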
Unconditional Generations
Different Modalities
Generations Guided by Partial Point Clouds
Interpolation