Diffusion-SDF: Conditional Generative Modeling of
Signed Distance Functions

Probabilistic diffusion models have achieved state-of-the-art results for image synthesis, inpainting, and text-to-image tasks. However, they are still in the early stages of generating complex 3D shapes.

This work proposes Diffusion-SDF, a generative model for shape completion, single-view reconstruction, and reconstruction of real-scanned point clouds. We use neural signed distance functions (SDFs) as our 3D representation to parameterize the geometry of various signals (e.g., point clouds, 2D images) through neural networks. Neural SDFs are implicit functions, and diffusing them amounts to learning the reversal of their neural network weights, which we solve using a custom modulation module.

Extensive experiments show that our method is capable of both realistic unconditional generation and conditional generation from partial inputs. This work expands the domain of diffusion models from learning 2D, explicit representations to 3D, implicit representations.



Gene Chou, Yuval Bahat, Felix Heide
Preprint

Modulating SDFs

We first create a compressed representation of SDFs. Directly diffusing thousands of SDFs, where each SDF represents one object, is difficult: every SDF must first be trained individually, and the distribution of thousands of SDFs is challenging to learn. We therefore map SDFs, represented by MLPs, to 1D latent vectors by jointly training a conditional SDF network and a VAE. This serves two objectives: the diffusion model can effectively learn and sample from the distribution of latent vectors, and the generated outputs of the diffusion model can be mapped back into an SDF.
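
As a concrete illustration, the sketch below couples a small VAE over per-shape features with an SDF network conditioned on the decoded feature, roughly matching the description above. It is a minimal sketch in PyTorch: the names (`ModulationVAE`, `ConditionalSDF`), the dimensions, and the source of the per-shape feature (e.g., a point-cloud encoder) are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn


class ModulationVAE(nn.Module):
    """Compress a per-shape feature into a 1D latent (modulation) vector and back."""

    def __init__(self, feat_dim=512, latent_dim=256):
        super().__init__()
        self.to_mu = nn.Linear(feat_dim, latent_dim)
        self.to_logvar = nn.Linear(feat_dim, latent_dim)
        self.decode = nn.Linear(latent_dim, feat_dim)

    def forward(self, feat):
        mu, logvar = self.to_mu(feat), self.to_logvar(feat)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        return self.decode(z), z, mu, logvar


class ConditionalSDF(nn.Module):
    """Predict signed distances at query points, conditioned on a shape feature."""

    def __init__(self, feat_dim=512, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, shape_feat, xyz):
        # shape_feat: (B, feat_dim) decoded shape feature, xyz: (B, N, 3) query points
        feat = shape_feat.unsqueeze(1).expand(-1, xyz.shape[1], -1)
        return self.mlp(torch.cat([feat, xyz], dim=-1)).squeeze(-1)  # (B, N) signed distances
```

Joint training would minimize an SDF reconstruction loss on points queried through `ConditionalSDF`, plus a KL term on $(\mu, \log\sigma^2)$, so that the latent vectors form a distribution the diffusion model can later learn and sample from.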

Diffusing Modulation Vectors

Next, we use our sampled latent vectors $z$ from the previous step as the sample space for the proposed diffusion probabilistic model. In every training iteration, Gaussian noise is added to the latent vectors at random timesteps, and the model learns to denoise them. The denoised modulation vector $z'$ is then passed back to our SDF-VAE model to obtain the final generated SDF.
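
The denoising objective follows the standard DDPM recipe, applied to 1D latent vectors instead of images. Below is a minimal sketch of a single training step; `denoiser` and `alphas_cumprod` (the cumulative product of the noise schedule) are assumed components, and the exact network and schedule are not taken from the paper.

```python
import torch
import torch.nn.functional as F


def diffusion_training_step(denoiser, z, alphas_cumprod):
    """One DDPM-style training step on latent modulation vectors z of shape (B, D).

    alphas_cumprod is the (T,)-shaped cumulative product of the noise schedule;
    denoiser(z_t, t) predicts the Gaussian noise added at timestep t.
    """
    B = z.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (B,), device=z.device)  # random timesteps
    a_bar = alphas_cumprod[t].unsqueeze(-1)                               # (B, 1)
    eps = torch.randn_like(z)                                             # Gaussian noise
    z_t = a_bar.sqrt() * z + (1.0 - a_bar).sqrt() * eps                   # forward process q(z_t | z_0)
    return F.mse_loss(denoiser(z_t, t), eps)                              # learn to predict the noise
```

At sampling time, one would run the learned denoiser through the reverse process starting from Gaussian noise to obtain $z'$, which is then decoded by the SDF-VAE as described above.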

Furthermore, given some condition $y$, we can train a custom encoder $\Upsilon$ to extract shape features and leverage cross-attention to guide multi-modal generations.
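
As a rough sketch of how such conditioning can be wired in, the block below lets latent tokens inside the denoiser cross-attend to features produced by the condition encoder $\Upsilon$. The module name, dimensions, and residual layout are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class CrossAttentionBlock(nn.Module):
    """Latent tokens attend to features of the condition y (e.g., a partial point cloud)."""

    def __init__(self, dim=256, cond_dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=cond_dim, vdim=cond_dim,
                                          batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, z_tokens, cond_feats):
        # z_tokens:   (B, L, dim)      noisy-latent tokens inside the denoiser
        # cond_feats: (B, M, cond_dim) features extracted from the condition y by the encoder
        attended, _ = self.attn(query=z_tokens, key=cond_feats, value=cond_feats)
        return self.norm(z_tokens + attended)  # residual connection
```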

Tuning End-to-End

Our joint SDF-VAE model and diffusion model can be trained end-to-end. As shown by the gray arrow, the output of the VAE can be used directly as input to the diffusion model, whose output can in turn be fed into the VAE decoder to compute its SDF loss. In practice, we found that training end-to-end from scratch took longer than training the modules separately, since the pipeline contains many building blocks. After the two modules complete their separate training, however, we fine-tune them end-to-end. We find that this fine-tuning improves generation diversity and output complexity.
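
A single fine-tuning step along this gray-arrow path might look like the sketch below, which reuses the `ModulationVAE`, `ConditionalSDF`, and `denoiser` components sketched earlier and combines the diffusion loss with the SDF loss of the decoded, denoised latent. The loss weighting and the $\hat{z}_0$ estimate are illustrative choices, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F


def finetune_step(vae, sdf_net, denoiser, shape_feat, xyz, sdf_gt, alphas_cumprod):
    """End-to-end fine-tuning sketch: VAE latent -> diffusion -> decoded SDF loss."""
    # 1. Encode the (assumed) per-shape feature into a modulation vector.
    _, z, mu, logvar = vae(shape_feat)

    # 2. Diffuse the latent at a random timestep and predict the added noise.
    t = torch.randint(0, alphas_cumprod.shape[0], (z.shape[0],), device=z.device)
    a_bar = alphas_cumprod[t].unsqueeze(-1)
    eps = torch.randn_like(z)
    z_t = a_bar.sqrt() * z + (1.0 - a_bar).sqrt() * eps
    eps_pred = denoiser(z_t, t)
    diffusion_loss = F.mse_loss(eps_pred, eps)

    # 3. Estimate the clean latent from the prediction and decode it to an SDF.
    z0_hat = (z_t - (1.0 - a_bar).sqrt() * eps_pred) / a_bar.sqrt()
    sdf_pred = sdf_net(vae.decode(z0_hat), xyz)
    sdf_loss = F.l1_loss(sdf_pred, sdf_gt)

    # 4. KL term keeps the latent distribution easy for the diffusion model to sample.
    kl = -0.5 * torch.mean(1.0 + logvar - mu.pow(2) - logvar.exp())

    return diffusion_loss + sdf_loss + 1e-4 * kl  # illustrative weighting
```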

Unconditional Generations

Different Modalities

Generations Guided by Partial Point Clouds

Interpolation