Inverse Neural Rendering
for Explainable Multi-Object Tracking

We propose to recast 3D multi-object tracking (MOT) from RGB cameras as an Inverse Neural Rendering (INR) problem: we optimize, via a differentiable rendering pipeline, over the latent space of pre-trained 3D object representations to retrieve the latents that best represent the object instances in a given input image.

Today, most methods for image understanding tasks rely on feed-forward neural networks. While this approach has allowed for empirical accuracy, efficiency, and task adaptation via fine-tuning, it also comes with fundamental disadvantages. Existing networks often struggle to generalize across different datasets, even on the same task. By design, these networks ultimately reason about high-dimensional scene features, which are challenging to analyze.
To address this, we optimize an image loss over generative latent spaces that inherently disentangle shape and appearance properties. Beyond offering an alternative take on tracking, our method enables examining the generated objects, reasoning about failure cases, and resolving ambiguous cases. We validate the generalization and scaling capabilities of our method on automotive datasets that are completely unseen by our method, without any fine-tuning.
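To make the core idea concrete, the following is a minimal sketch of such a test-time optimization, assuming a PyTorch-style setup. The frozen generator `G`, the differentiable renderer `render`, the object mask, and the 256-dimensional latents are illustrative placeholders, not the actual interfaces of our implementation.

```python
import torch

def fit_latents(image, mask, G, render, pose, steps=100, lr=1e-2):
    """Fit shape/texture latents of a frozen object generator to one observation."""
    z_shape = torch.zeros(1, 256, requires_grad=True)    # shape code z_S
    z_tex = torch.zeros(1, 256, requires_grad=True)      # texture code z_T
    opt = torch.optim.Adam([z_shape, z_tex], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        rendered = render(G(z_shape, z_tex), pose)        # differentiable rendering
        loss = ((rendered - image) * mask).abs().mean()   # image loss on the object region (assumed mask)
        loss.backward()
        opt.step()                                        # only the latent codes are updated; G stays frozen
    return z_shape.detach(), z_tex.detach()
```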

Dataset unseen by the method.

Paper

Julian Ost*, Tanushree Banerjee*, Mario Bijelic, Felix Heide

arXiv Preprint

Approach

We cast object tracking as a test-time inverse rendering problem that fits generated multi-object scenes to the observed image frames.

For each 3D detection bounding box $\left[ \psi, \mathbf{t}, s \right]$ from an object detector, we initialize the embedding codes of an object generator, $\mathbf{z}_S$ for shape and $\mathbf{z}_T$ for texture. This pre-trained generative object model is frozen; only the embedding representations of both modalities, together with the pose and size, are optimized through inverse rendering to best fit the image observation at timestep $t_{k}$. In this way, we model the underlying 3D scene of a frame observation as a composition of all object instances, without the background scene. The inverse-rendered texture and shape embeddings and the refined object locations are passed to the matching stage, which associates the predicted states of previously tracked objects with the new observations. Shape, texture, and pose of matched and new tracklets are updated, while unmatched detections or tracklets are ultimately discarded before predicting states at the next timestep $t_{k+1}$.

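As a rough illustration of the matching stage, the snippet below sketches one possible association of predicted track states with inverse-rendered observations via Hungarian matching. The cost function, its weighting, and the dictionary layout (numpy arrays under `z_shape`, `z_tex`, `center`) are assumptions made for illustration, not our actual matching criterion.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def latent_pose_cost(track, obs, w_pose=1.0):
    """L2 distance between shape/texture latents plus weighted 3D center distance."""
    d_lat = (np.linalg.norm(track["z_shape"] - obs["z_shape"])
             + np.linalg.norm(track["z_tex"] - obs["z_tex"]))
    d_pos = np.linalg.norm(track["center"] - obs["center"])
    return d_lat + w_pose * d_pos

def match_tracks(predicted_tracks, observations, max_cost=5.0):
    """Match predicted track states to inverse-rendered observations."""
    if not predicted_tracks or not observations:
        return [], list(range(len(predicted_tracks))), list(range(len(observations)))
    cost = np.array([[latent_pose_cost(t, o) for o in observations]
                     for t in predicted_tracks])
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_cost]
    matched_t = {m[0] for m in matches}
    matched_o = {m[1] for m in matches}
    unmatched_tracks = [r for r in range(len(predicted_tracks)) if r not in matched_t]
    unmatched_obs = [c for c in range(len(observations)) if c not in matched_o]
    return matches, unmatched_tracks, unmatched_obs
```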

Results on Waymo Open Dataset

Without changing the model or training on the dataset, 3D multi-object tracking through Inverse Neural Rendering generalizes well to multiple autonomous driving datasets, including the Waymo Open Dataset. We overlay the captured images with the closest generated object and predicted 3D bounding boxes. The color of the bounding boxes for each object corresponds to the predicted tracklet ID. Even on this unseen dataset, our method does not lose any tracks across diverse scenes, validating that the approach generalizes.

Dataset unseen by the method.

Results on nuScenes

Additional MOT results via Inverse Neural Rendering on nuScenes underline the generalization capabilities of the method. We overlay the captured images with the closest generated object and predicted 3D bounding boxes. The color of the bounding boxes for each object corresponds to the predicted tracklet ID. Even in these diverse scenes, our method does not lose any tracks and performs robustly, although the dataset is unseen.

Dataset unseen by the method.

Optimization Process

From left to right, we show (i) the observed image, (ii) the rendering predicted by the initial latent embeddings, (iii) the generated objects after the texture code is optimized, (iv) the objects after the translation, scale, and rotation are optimized, and (v) the objects after the shape is optimized. The ground-truth images are faded to show our rendered objects clearly. Our method refines the predicted texture, pose, and shape over several optimization steps, even if initialized with poses or appearance far from the target, all corrected through inverse rendering, as sketched below.

Panels, left to right: Input Frame · Initial Guess · Texture Fitting · Pose Fitting · Shape Fitting
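A minimal sketch of this staged schedule is shown below, assuming PyTorch and a differentiable `render_object(z_shape, z_tex, pose)`; the interface, learning rate, and step counts are illustrative placeholders rather than our actual implementation.

```python
import torch

def staged_fit(image, z_shape, z_tex, pose, render_object, steps=(50, 50, 50)):
    # z_shape, z_tex, and pose are assumed to be leaf tensors with requires_grad=True.
    stages = [[z_tex],       # (iii) texture fitting
              [pose],        # (iv) translation, scale, and rotation fitting
              [z_shape]]     # (v) shape fitting
    for params, n_steps in zip(stages, steps):
        opt = torch.optim.Adam(params, lr=1e-2)
        for _ in range(n_steps):
            opt.zero_grad()
            loss = (render_object(z_shape, z_tex, pose) - image).abs().mean()
            loss.backward()
            opt.step()       # only the current stage's parameters are updated
    return z_shape, z_tex, pose
```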

Related Publications

[1] Julian Ost, Fahim Mannan, Nils Thuerey, Julian Knodt, and Felix Heide. Neural Scene Graphs for Dynamic Scenes. CVPR 2021