Dirty Pixels: Towards End-to-End Image Processing and Perception

We propose an end-to-end architecture for joint demosaicking, denoising, deblurring, and classification that makes classification robust in low-light scenarios. The proposed architecture learns a processing pipeline optimized for classification, which enhances fine details relevant to this high-level task (at the expense of more noise as measured by the conventional metrics PSNR and SSIM) and improves on state-of-the-art accuracy. The proposed architecture has a principled and modular design and generalizes across light levels and cameras.

Real-world imaging systems acquire measurements that are degraded by noise, optical aberrations, and other imperfections that make image processing for human viewing and higher-level perception tasks challenging. Conventional imaging processes the RAW sensor measurements in a sequential pipeline of steps, such as demosaicking, denoising, deblurring, tone-mapping, and compression. This pipeline is optimized to obtain a visually pleasing image. High-level processing, on the other hand, involves steps such as feature extraction, classification, tracking, and fusion. While this siloed design approach allows for efficient development, it also dictates compartmentalized performance metrics that are defined without knowledge of the higher-level task of the camera system. For example, today’s demosaicking and denoising algorithms are designed using perceptual image quality metrics but not with domain-specific tasks such as object detection in mind. We propose an end-to-end differentiable architecture that jointly performs demosaicking, denoising, deblurring, tone-mapping, and classification. The architecture does not require any intermediate losses based on perceived image quality and learns processing pipelines whose outputs differ from those of existing ISPs optimized for perceptual quality, preserving fine detail at the cost of increased noise and artifacts. We show that state-of-the-art ISPs discard information that is essential in corner cases, such as extremely low-light conditions, where conventional imaging and perception stacks fail. We demonstrate on captured and simulated data that our model substantially improves perception in low light and other challenging conditions, which is imperative for real-world applications like autonomous driving, robotics, and surveillance.
Finally, we found that the proposed model also achieves state-of-the-art accuracy when optimized for image reconstruction in low-light conditions, validating the architecture itself as a potentially useful drop-in network for reconstruction and analysis tasks beyond the applications demonstrated in this work.
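To make the degradations described above concrete, the following is a minimal sketch of how a low-light RAW frame could be simulated from a clean RGB image. It is an illustration under stated assumptions, not the simulation used in this work: the RGGB layout, the Poisson-plus-Gaussian noise model, and the names bayer_mask, simulate_low_light_raw, photons_per_unit, and read_noise_std are all hypothetical choices made for this example.

```python
import numpy as np


def bayer_mask(height, width):
    """Return an (H, W, 3) binary mask for an RGGB Bayer pattern (assumed layout)."""
    mask = np.zeros((height, width, 3), dtype=np.float64)
    mask[0::2, 0::2, 0] = 1.0  # red samples
    mask[0::2, 1::2, 1] = 1.0  # green samples on red rows
    mask[1::2, 0::2, 1] = 1.0  # green samples on blue rows
    mask[1::2, 1::2, 2] = 1.0  # blue samples
    return mask


def simulate_low_light_raw(rgb, photons_per_unit=10.0, read_noise_std=0.02, seed=0):
    """Simulate a noisy single-channel RAW frame from a clean RGB image in [0, 1].

    Assumed noise model: Poisson shot noise plus Gaussian read noise.
    A smaller photons_per_unit corresponds to a darker scene.
    """
    rng = np.random.default_rng(seed)
    height, width, _ = rgb.shape
    # Each pixel keeps only the color channel selected by the mosaic.
    mosaic = (rgb * bayer_mask(height, width)).sum(axis=2)
    # Shot noise: photon counts are Poisson distributed around the expected count.
    shot = rng.poisson(mosaic * photons_per_unit) / photons_per_unit
    # Read noise: additive, signal-independent Gaussian noise.
    read = rng.normal(0.0, read_noise_std, size=mosaic.shape)
    return np.clip(shot + read, 0.0, 1.0)
```

Lowering photons_per_unit in this sketch increases the relative strength of the shot noise, which is the regime in which the text above reports that conventional pipelines fail.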



Steven Diamond, Vincent Sitzmann, Frank Julca-Aguilar, Stephen Boyd, Gordon Wetzstein, Felix Heide


End-to-end Differentiable Image Processing Framework

We propose an end-to-end architecture (top) for joint denoising, demosaicking, (deblurring), and classification. It combines a novel low-level Anscombe network block and a high-level task-specific network component in a single stack that takes in RAW CFA sensor data and outputs image labels. The Anscombe network component (zoom-in at the bottom) applies the calibrated image formation model and a learned proximal operator in a recurrent manner. The proximal operator is a recurrent residual U-Net model with dense skip connections across all operator “iterations”. The high-level model takes the output of the Anscombe network unit (either a feature tensor or an image) and feeds it into a standard classification network trunk. A partly unrolled network is shown at the bottom.
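The recurrent structure described above, alternating between the image formation model and a proximal operator, can be illustrated with a generic unrolled proximal gradient loop. This is only a sketch under assumptions: the formation model is reduced to an RGGB mosaic mask, and box_blur_prox is a hypothetical fixed smoothing filter standing in for the learned recurrent residual U-Net, which is not reproduced here. The names rggb_mask, unrolled_reconstruction, num_iters, and step are likewise introduced for illustration.

```python
import numpy as np


def rggb_mask(height, width):
    """Binary (H, W, 3) mask of an RGGB color filter array (assumed layout)."""
    mask = np.zeros((height, width, 3), dtype=np.float64)
    mask[0::2, 0::2, 0] = 1.0
    mask[0::2, 1::2, 1] = 1.0
    mask[1::2, 0::2, 1] = 1.0
    mask[1::2, 1::2, 2] = 1.0
    return mask


def box_blur_prox(x):
    """Hypothetical stand-in for a learned proximal operator: 3x3 box blur per channel."""
    height, width, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)), mode="edge")
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + height, dx:dx + width, :]
    return out / 9.0


def unrolled_reconstruction(raw, mask, prox=box_blur_prox, num_iters=4, step=1.0):
    """Run a fixed number of proximal gradient iterations.

    raw:  (H, W) mosaicked measurements y.
    mask: (H, W, 3) binary mask defining the forward model A (mosaicking).
    """
    x = mask * raw[..., None]  # initialize with A^T y
    for _ in range(num_iters):
        residual = (mask * x).sum(axis=2) - raw      # A x - y
        x = x - step * mask * residual[..., None]    # gradient step on 0.5 * ||A x - y||^2
        x = prox(x)                                  # prior step (learned in the real model)
    return x
```

In an end-to-end setting, the output x (or an intermediate feature tensor) would be passed to the classification trunk and the whole stack trained on the classification loss; that training loop is omitted from this sketch.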

Selected Results

Classification Performance under Low-Light Conditions

Compared with conventional ISPs (Darktable and Movidius) and bilinear demosaicking, the proposed Anscombe network not only removes noise but also selectively amplifies structures of the target class, which benefits the overall classification accuracy of the model.

Low-light RAW images, and their corresponding classification results after processing with conventional ISPs (Darktable and Movidius), bilinear demosaicking, and Anscombe Networks.
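For reference, the bilinear demosaicking baseline named above can be sketched as a normalized convolution over the sampled pixels of each color channel. This is a generic textbook formulation and not necessarily the exact baseline implementation used in the comparison; the RGGB layout and the helper names conv3x3 and bilinear_demosaic are assumptions made for this example.

```python
import numpy as np


def conv3x3(img, kernel):
    """Filter a 2-D image with a symmetric 3x3 kernel using zero padding."""
    height, width = img.shape
    padded = np.pad(img, 1, mode="constant")
    out = np.zeros((height, width), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + height, dx:dx + width]
    return out


def bilinear_demosaic(raw):
    """Bilinear demosaicking of an (H, W) RAW frame (H, W >= 2), assuming RGGB."""
    height, width = raw.shape
    masks = np.zeros((3, height, width), dtype=np.float64)
    masks[0, 0::2, 0::2] = 1.0  # red
    masks[1, 0::2, 1::2] = 1.0  # green on red rows
    masks[1, 1::2, 0::2] = 1.0  # green on blue rows
    masks[2, 1::2, 1::2] = 1.0  # blue
    k_rb = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=np.float64)
    k_g = np.array([[0, 1, 0], [1, 4, 1], [0, 1, 0]], dtype=np.float64)
    channels = []
    for mask, kernel in zip(masks, (k_rb, k_g, k_rb)):
        weighted = conv3x3(raw * mask, kernel)
        weights = conv3x3(mask, kernel)
        # Dividing by the filtered mask averages only over available samples,
        # which also handles the image borders.
        channels.append(weighted / weights)
    return np.stack(channels, axis=2)
```

Because this interpolation is fixed and applies no denoising, sensor noise passes straight through to the output image, which is why it serves as a simple point of comparison.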

Single-Image Low-Light Photography (w/o Classification)

Anscombe Networks also achieve state-of-the-art low-light performance for human viewing. We employ the identical Anscombe network architecture but, instead of concatenating this model with a higher-level domain-specific network, we minimize a loss formulated directly on the output image of the Anscombe network. Our method outperforms the U-Net-based deep ISP [Chen et al. 2018] both qualitatively and quantitatively.

Qualitative low-light denoising results for human viewing using the traditional Darktable ISP, the U-Net model proposed by [Chen et al. 2018], and the proposed Anscombe network.
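The quantitative comparison is not broken down on this page. As an illustration of one of the conventional metrics named in the abstract, PSNR between a reference image and a reconstruction can be computed as in the sketch below. The function name psnr and the default peak value of 1.0 (images scaled to [0, 1]) are assumptions for this example.

```python
import numpy as np


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

Higher values indicate a reconstruction closer to the reference in the mean-squared sense, which need not coincide with what is most useful for a downstream classifier.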

Real-time Ultra Low-light Classification on a Mobile Prototype

We have implemented our joint low/high-level classification architecture on a smartphone prototype along with a remote TensorFlow model server. The mobile prototype performs classification tasks robustly even in extreme low-light scenarios. We achieve an inference throughput of about 60 FPS without any inference optimization or integer quantization.

Related Publications

[1] Ethan Tseng, Felix Yu, Yuting Yang, Fahim Mannan, Karl St-Arnaud, Derek Nowrouzezahrai, Jean-François Lalonde, and Felix Heide. Hyperparameter optimization in black-box image processing using differentiable proxies. ACM Transactions on Graphics (SIGGRAPH), 38(4):27, 2019

[2] Ali Mosleh, Avinash Sharma, Emmanuel Onzon, Fahim Mannan, Nicolas Robidoux, and Felix Heide. Hardware-in-the-loop End-to-end Optimization of Camera Image Processing Pipelines. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2020

[3] Emmanuel Onzon, Fahim Mannan, and Felix Heide. Neural Auto Exposure for High-Dynamic Range Object Detection. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2021

[4] Nicolas Robidoux, Luis Eduardo García Capel, Dong-eun Seo, Avinash Sharma, Federico Ariza, and Felix Heide. End-to-end High Dynamic Range Camera Pipeline Optimization. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2021