Towards generalizable and interpretable three-dimensional tracking with inverse neural rendering
Nature Machine Intelligence
We propose to recast vision problems with RGB inputs as an Inverse Neural Rendering (INR) problem: optimizing, via a differentiable rendering pipeline, over the embeddings of pre-trained 3D object representations to retrieve the latents that best represent the object instances in a given input image.
Feed-forward neural networks, which have enabled strong empirical accuracy, efficiency, and task adaptation, also come with fundamental disadvantages, such as reasoning in high-dimensional scene features; this is particularly true when predicting 3D from 2D. Specifically, we solve the task of 3D multi-object tracking by optimizing an image loss over generative embeddings that inherently disentangle shape and appearance.
We not only investigate an alternate take on tracking but also enable examining the generated objects, reasoning about failure situations, and analyzing ambiguous cases. We validate the generalization and scaling capabilities of the method by learning a generative prior from synthetic data only and assessing camera-based 3D tracking on two unseen large-scale autonomous driving datasets without fine-tuning, confirming robust transferability.
Paper
Julian Ost*, Tanushree Banerjee*, Mario Bijelic, Felix Heide
(* indicates equal contribution)
Nature Machine Intelligence
Approach
We cast object tracking as a test-time inverse rendering problem that fits generated multi-object scenes to the observed image frames. We find that this test-time optimization approach generalizes across datasets.
For each 3D detection produced by an object detector, we initialize the embedding representation $\mathbf{z}$ of a generic differentiable forward generation method, e.g., $\mathbf{z}_S$ for shape and $\mathbf{z}_T$ for texture. This pre-trained generative object model is frozen, and only the embedding representation, together with the pose and size, is optimized through inverse rendering to best fit the image observation at timestep $t_{k}$. In this way, we model the underlying 3D scene of a frame observation as a composition of all object instances, without the background scene. The inverse-rendered embeddings and refined object locations are passed to the matching stage, which matches the predicted states of previously tracked objects with the new observations. Matched and new tracklets are updated, and unmatched tracklets are ultimately discarded before states are predicted for the next step.
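As a rough illustration of this test-time fitting, the sketch below optimizes shape and texture codes together with pose and size against an image loss while the generative model stays frozen. The class FrozenObjectRenderer, the latent dimensions, and the optimizer settings are illustrative assumptions, not the implementation used in the paper.

```python
# Minimal sketch of the per-detection inverse-rendering fit, assuming a frozen
# generative object model wrapped by a differentiable renderer. The decoder
# below is a stand-in, not the paper's actual generative prior or renderer.
import torch
import torch.nn as nn


class FrozenObjectRenderer(nn.Module):
    """Placeholder for a pre-trained generative object model plus differentiable
    renderer: maps (shape code, texture code, pose, size) to an RGB crop."""

    def __init__(self, latent_dim=64, image_size=64):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Linear(2 * latent_dim + 7 + 3, 256), nn.ReLU(),
            nn.Linear(256, 3 * image_size * image_size), nn.Sigmoid(),
        )
        self.image_size = image_size

    def forward(self, z_shape, z_texture, pose, size):
        x = torch.cat([z_shape, z_texture, pose, size], dim=-1)
        return self.decoder(x).view(-1, 3, self.image_size, self.image_size)


def fit_detection(observed_crop, renderer, latent_dim=64, steps=100, lr=5e-2):
    """Optimize latent codes, pose, and size so the rendered object matches the
    observed image crop; the generative model itself stays frozen."""
    z_shape = torch.zeros(1, latent_dim, requires_grad=True)
    z_texture = torch.zeros(1, latent_dim, requires_grad=True)
    pose = torch.zeros(1, 7, requires_grad=True)   # translation (3) + rotation quaternion (4)
    size = torch.ones(1, 3, requires_grad=True)    # object extents
    for p in renderer.parameters():                # freeze the generative prior
        p.requires_grad_(False)

    opt = torch.optim.Adam([z_shape, z_texture, pose, size], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        rendered = renderer(z_shape, z_texture, pose, size)
        loss = torch.nn.functional.l1_loss(rendered, observed_crop)
        loss.backward()
        opt.step()
    # The fitted latents and refined pose/size are what the matching stage consumes.
    return z_shape.detach(), z_texture.detach(), pose.detach(), size.detach()


# Example usage with a random "observation" in place of a real image crop.
renderer = FrozenObjectRenderer()
observation = torch.rand(1, 3, 64, 64)
z_s, z_t, pose, size = fit_detection(observation, renderer)
```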

Optimization Process
From left to right, we show (i) the observed image, (ii) the rendering predicted by the initial latent embeddings, (iii) the generated objects after the texture code is optimized, (iv) the objects after the translation, scale, and rotation are optimized, and (v) the objects after the shape is optimized. The ground-truth images are faded to show our rendered objects clearly. Our method is capable of refining the predicted texture, pose, and shape over several optimization steps, even when initialized with poses or appearances far from the target, all corrected through inverse rendering.

Figure panels, left to right: Input Frame, Initial Guess, Texture Fitting, Pose Fitting, Shape Fitting.
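The staged order of the panels above (texture, then pose, then shape) can be sketched as a simple phase schedule in which each phase unlocks only one group of variables. Here render_and_loss, the phase lengths, and the learning rate are placeholders for the actual rendering objective and published settings.

```python
# Illustrative sketch of a staged fitting schedule: texture first, then pose,
# then shape. The objective below is a toy stand-in for rendering the object
# with a frozen generative model and comparing it against the observation.
import torch


def render_and_loss(params):
    # Placeholder objective; the real pipeline renders from the latents/pose
    # and returns an image loss against the observed frame.
    target = torch.ones(8)
    return sum(((p - target) ** 2).mean() for p in params.values())


params = {
    "z_texture": torch.zeros(8, requires_grad=True),
    "pose": torch.zeros(8, requires_grad=True),
    "z_shape": torch.zeros(8, requires_grad=True),
}

# Each phase optimizes only one group of variables, mirroring the panels above.
schedule = [("z_texture", 50), ("pose", 50), ("z_shape", 50)]
for name, steps in schedule:
    opt = torch.optim.Adam([params[name]], lr=1e-1)
    for _ in range(steps):
        opt.zero_grad()
        loss = render_and_loss(params)
        loss.backward()
        opt.step()
```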
Results on Waymo Open Dataset
Without changing the model or training on the dataset, 3D multi-object tracking through Inverse Neural Rendering generalizes well to multiple autonomous driving datasets, including the Waymo Open Dataset. We overlay the captured images with the closest generated object and the predicted 3D bounding boxes. The color of the bounding box for each object corresponds to the predicted tracklet ID. Our method does not lose any tracks even on this unseen dataset with diverse scenes, validating that the approach generalizes.
Dataset unseen by the method.
Generalization Results on nuScenes
Additional MOT results via Inverse Neural Rendering on nuScenes underline the generalization capabilities of the method, showing consistent performance across datasets and object classes (see the cars and the motorcycle in the third scene). We overlay the captured images with the closest generated object and the predicted 3D bounding boxes. The color of the bounding box for each object corresponds to the predicted tracklet ID. Even in these diverse scenarios, our method does not lose any tracks and performs robustly, although the dataset is unseen.
Dataset unseen by the method.
Related Publications
[1] Julian Ost, Fahim Mannan, Nils Thuerey, Julian Knodt, and Felix Heide. Neural Scene Graphs for Dynamic Scenes. CVPR 2021