Differentiable 3D Gaussian splatting (GS) is emerging as a prominent technique in computer vision and graphics for reconstructing 3D scenes. GS represents a scene as a set of 3D Gaussians with varying opacities and employs a computationally efficient splatting operation along with analytical derivatives to estimate the 3D Gaussian parameters from scene images captured at various viewpoints. Unfortunately, capturing surround-view (360° viewpoint) images is impossible or impractical in many real-world imaging scenarios, including underwater imaging, rooms inside a building, and autonomous navigation. In these restricted-baseline imaging scenarios, the GS algorithm suffers from the well-known "missing cone" problem, which results in poor reconstruction along the depth axis. In this paper, we demonstrate that transient data (from sonars) allows us to address the missing-cone problem by sampling high-frequency data along the depth axis. We extend the Gaussian splatting algorithm to two commonly used sonars and propose fusion algorithms that simultaneously utilize RGB camera data and sonar data. Through simulations, emulations, and hardware experiments across various imaging scenarios, we show that the proposed fusion algorithms lead to significantly better novel view synthesis (5 dB improvement in PSNR) and 3D geometry reconstruction (60% lower Chamfer distance).
In this paper, we extend Gaussian splatting to sonar and build fusion techniques that reconstruct geometry using the complementary information from cameras and sonars. Our extension develops splatting operations along the z-axis tailored to these sensor types. The specific contributions of this paper include:
Ray View Transformation and Z-Axis Splatting. (a) This illustration shows the camera view, which transforms the Gaussians from the world view to the camera view. (b) The Gaussians are transformed into the ray view through a local affine approximation of the projection transform using the Jacobian (J). (c) The transformed 3D Gaussian is then projected (splatted) onto the xy-plane for rendering the camera image and onto the z-axis for rendering the echosounder measurement (for a collocated camera and echosounder). The gray Gaussian is occluded by the Gaussian in front of it, so its transmittance (T) is smaller than that of the others, independent of whether we are rendering the camera or the sonar. Each ray undergoes splatting independently, ensuring that if a Gaussian is rasterized by multiple rays, it is splatted multiple times.
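To make the projection in (c) concrete, the sketch below transforms a 3D Gaussian into ray space and splats it two ways: onto the xy-plane (camera) and onto the z-axis (echosounder), by marginalizing the ray-space covariance. This is a minimal NumPy sketch under our reading of the caption; the function names, the 3x4 world-to-camera matrix W, and the simplified scalar transmittance update are illustrative assumptions, not the paper's implementation.

import numpy as np

def splat_ray_space_gaussian(mu_world, Sigma_world, W, J):
    """Transform a 3D Gaussian to ray space and splat it two ways.

    mu_world, Sigma_world : mean (3,) and covariance (3, 3) in world coordinates
    W : 3x4 world-to-camera transform (rotation | translation)
    J : 3x3 Jacobian of the local affine approximation to the projective transform

    Returns the 2D (xy-plane) splat used for camera rendering and the 1D
    (z-axis) splat used for echosounder rendering.
    """
    # World -> camera -> ray space (local affine approximation; for a sketch we
    # apply the same affine map to the mean as to the covariance)
    mu_ray = J @ (W[:, :3] @ mu_world + W[:, 3])
    M = J @ W[:, :3]
    Sigma_ray = M @ Sigma_world @ M.T

    # Camera (xy) splat: marginalize z by keeping the upper-left 2x2 block
    mu_xy, Sigma_xy = mu_ray[:2], Sigma_ray[:2, :2]

    # Echosounder (z) splat: marginalize x, y by keeping the scalar zz entry
    mu_z, var_z = mu_ray[2], Sigma_ray[2, 2]

    return (mu_xy, Sigma_xy), (mu_z, var_z)

def render_z_profile(z_splats, opacities, depths, t_bins):
    """Accumulate 1D z-splats into a depth-intensity (transient) profile.

    z_splats  : list of (mu_z, var_z) pairs from splat_ray_space_gaussian
    opacities : per-Gaussian opacity alpha_i
    depths    : per-Gaussian depth used for front-to-back sorting
    t_bins    : depth sample locations of the echosounder
    """
    order = np.argsort(depths)                 # front-to-back compositing order
    T = 1.0                                    # accumulated transmittance
    profile = np.zeros_like(t_bins, dtype=float)
    for i in order:
        mu_z, var_z = z_splats[i]
        g = np.exp(-0.5 * (t_bins - mu_z) ** 2 / var_z)   # 1D Gaussian footprint
        profile += T * opacities[i] * g
        T *= (1.0 - opacities[i])              # occluded Gaussians receive smaller T
    return profile

Marginalizing a Gaussian onto a plane or an axis only keeps the corresponding entries of its ray-space mean and covariance, which is why the same transformed Gaussian can serve both the camera and the sonar renderer, with the transmittance shared between them.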
In small-baseline imaging scenarios, camera images fail to constrain the variance and covariances along the depth axis, resulting in the missing-cone problem in Fourier space. This limits 3D reconstruction fidelity due to the missing frequency information. Our approach leverages time-resolved measurements from sonar to capture the z-axis projection of the volume, addressing the missing-cone issue. By combining sonar data with camera images, we enhance 3D reconstruction, particularly in scenarios with limited camera baselines.
Sonar measurements provide complementary information. (a) Volumetric scene captured with three pairs of cameras and sonars (echosounders). We assume the sensors are in the far field (i.e., the local affine approximation to the projective transform used in Gaussian splatting research is valid). For the center camera-sonar pair, camera measurements are obtained by projecting the volumetric data along the vertical axis, and sonar measurements are obtained by projecting the volumetric data along the horizontal axis. (b) If only camera measurements are considered, then by the Fourier-slice theorem, we capture only a few slices of the Fourier transform of the volume and miss information on a large cone. (c) Sonar (time-resolved) data captures orthogonal slices in Fourier space; hence, 3D reconstruction of the scene is better conditioned with camera-sonar fusion than with camera data alone.
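The Fourier-slice relationship behind (b) and (c) can be checked numerically: the Fourier transform of a projection of the volume equals the slice of the volume's 3D Fourier transform perpendicular to the projection direction. The snippet below is a toy NumPy verification under an orthographic projection model, which is our simplification of the caption's far-field assumption.

import numpy as np

rng = np.random.default_rng(0)
vol = rng.random((32, 32, 32))            # toy volume, axes ordered (x, y, z)

# Camera-like measurement: integrate along z (orthographic projection onto xy)
proj_xy = vol.sum(axis=2)
# Sonar-like measurement: integrate over x and y (transient profile along z)
proj_z = vol.sum(axis=(0, 1))

F = np.fft.fftn(vol)

# Fourier-slice theorem: the FT of a projection is a slice of the volume's FT
# taken perpendicular to the projection direction.
assert np.allclose(np.fft.fft2(proj_xy), F[:, :, 0])   # kz = 0 plane (camera)
assert np.allclose(np.fft.fft(proj_z), F[0, 0, :])     # kz axis line (sonar)

A small camera baseline only samples slices near the kz = 0 plane, leaving a cone of frequencies around the kz (depth) axis unobserved; the sonar's projection along the depth axis samples exactly along that axis, which is why the fused reconstruction is better conditioned.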
Simulation and emulation pipeline for both echosounder and FLS fusion techniques. (a) Raw depth image captured with a Time-of-Flight (ToF) camera. (b) An RGB image captured with a camera. (c) Simulated echosounder intensity, generated from the depth histogram and used as ground truth during training. (d) A 3D Gaussian scene. We use xy-splatting to render RGB images and z-splatting to render the echosounder depth-intensity distribution. (e) Simulated FLS intensity, generated by histogramming depth per row. (f) A 3D Gaussian scene; we splat along the xy-direction to render the RGB image and along the yz-direction to render the FLS image. We minimize the sum of the RGB loss and the corresponding depth loss to train the camera-sonar fusion algorithms.
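The simulated ground-truth signals in (c) and (e), and the training objective stated at the end of the caption, can be sketched as follows. This is a minimal sketch under our assumptions: the bin count, normalization, L1 losses, and weighting term lam are illustrative choices; the caption only specifies that the echosounder intensity is the depth histogram, the FLS intensity is a per-row depth histogram, and the RGB and depth losses are summed.

import numpy as np

def simulate_echosounder(depth, n_bins, d_max):
    """Echosounder ground truth: histogram of all depths in the ToF image."""
    hist, _ = np.histogram(depth.ravel(), bins=n_bins, range=(0.0, d_max))
    return hist / max(hist.sum(), 1)          # normalize to an intensity profile

def simulate_fls(depth, n_bins, d_max):
    """FLS ground truth: one depth histogram per image row."""
    rows = [np.histogram(r, bins=n_bins, range=(0.0, d_max))[0] for r in depth]
    fls = np.stack(rows).astype(float)
    return fls / max(fls.sum(), 1)

def fusion_loss(rgb_render, rgb_gt, sonar_render, sonar_gt, lam=1.0):
    """Sum of the RGB loss and the corresponding depth-intensity loss."""
    rgb_loss = np.abs(rgb_render - rgb_gt).mean()         # L1 photometric loss
    sonar_loss = np.abs(sonar_render - sonar_gt).mean()   # L1 on depth intensity
    return rgb_loss + lam * sonar_loss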
@article{qu2024z,
  title={Z-Splat: Z-Axis Gaussian Splatting for Camera-Sonar Fusion},
  author={Qu, Ziyuan and Vengurlekar, Omkar and Qadri, Mohamad and Zhang, Kevin and Kaess, Michael and Metzler, Christopher and Jayasuriya, Suren and Pediredla, Adithya},
  journal={arXiv preprint arXiv:2404.04687},
  year={2024}
}