PVSDNet: Joint Depth Prediction and View Synthesis via Shared Latent Spaces in Real-Time

¹Mid Sweden University, Sweden · ²Technical University of Berlin, Germany · ³HTW Berlin - University of Applied Sciences, Germany

Supplementary Video

PVSDNet allows querying both a novel view and its corresponding depth map from a single input image

Abstract

Real-time novel view synthesis (NVS) and depth estimation are pivotal for immersive applications, particularly in augmented telepresence. While state-of-the-art monocular depth estimation methods could be employed to predict depth maps for novel views, they process each novel view independently and therefore often produce temporal inconsistencies, such as flickering artifacts in the depth maps. To address this, we present a unified multimodal framework that generates both novel view images and their corresponding depth maps, ensuring geometric and visual consistency.
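To make the interface implied by the abstract concrete, below is a minimal PyTorch-style sketch of a single-image model with a shared latent space and two decoder heads, one for the novel-view image and one for its depth map. All names and layer choices (PVSDNetSketch, the pose conditioning, the head designs) are hypothetical illustrations of the idea, not the actual PVSDNet implementation.

```python
# Minimal sketch of a joint view/depth predictor with a shared latent space.
# Names and layers are hypothetical; they illustrate the interface, not the paper's model.
import torch
import torch.nn as nn


class PVSDNetSketch(nn.Module):
    def __init__(self, latent_dim: int = 256):
        super().__init__()
        # Shared image encoder: one latent representation feeds both tasks.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, latent_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Target camera pose (e.g., flattened 3x4 extrinsics) conditions the latent.
        self.pose_proj = nn.Linear(12, latent_dim)
        # Two decoder heads: novel-view RGB and its aligned depth map.
        self.view_head = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )
        self.depth_head = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 1, 4, stride=2, padding=1), nn.Softplus(),
        )

    def forward(self, image: torch.Tensor, target_pose: torch.Tensor):
        latent = self.encoder(image)                       # (B, C, H/4, W/4)
        pose = self.pose_proj(target_pose)[..., None, None]
        latent = latent + pose                             # simple additive conditioning
        return self.view_head(latent), self.depth_head(latent)


# Usage: query a novel view and its depth from a single image and a target pose.
model = PVSDNetSketch()
rgb = torch.rand(1, 3, 256, 256)
pose = torch.rand(1, 12)
novel_view, novel_depth = model(rgb, pose)
```

Because both heads decode the same latent, the synthesized view and its depth map are produced in one forward pass rather than by running a separate monocular depth network on each rendered view.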



View and Depth Estimation in the Wild (without any fine-tuning)



Zero-Shot Depth Estimation Results in the Wild

Zero-shot relative depth estimation on: KITTI, DIODE, ETH3D, DDAD, and Sintel

Previews here are downscaled for faster loading on the web.
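For context on how the galleries below are typically assessed: zero-shot relative depth is usually evaluated by aligning the prediction to ground truth with a least-squares scale and shift before computing metrics such as AbsRel and δ₁. The sketch below illustrates this standard protocol; it is an assumed example of the common evaluation recipe, not the exact script behind these results.

```python
# Sketch of the standard scale-and-shift-invariant protocol commonly used for
# zero-shot relative depth evaluation (illustrative, not the paper's script).
import numpy as np


def align_scale_shift(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Least-squares scale s and shift t so that s*pred + t best matches gt."""
    p, g = pred[mask], gt[mask]
    A = np.stack([p, np.ones_like(p)], axis=1)        # (N, 2) design matrix
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return s * pred + t


def relative_depth_metrics(pred: np.ndarray, gt: np.ndarray):
    """AbsRel and delta_1 after scale/shift alignment, over valid GT pixels."""
    mask = gt > 0                                     # valid ground-truth depths
    aligned = np.clip(align_scale_shift(pred, gt, mask), 1e-6, None)
    abs_rel = np.mean(np.abs(aligned[mask] - gt[mask]) / gt[mask])
    ratio = np.maximum(aligned[mask] / gt[mask], gt[mask] / aligned[mask])
    delta1 = np.mean(ratio < 1.25)
    return abs_rel, delta1


# Example with random arrays standing in for a predicted and a GT depth map.
pred = np.random.rand(352, 1216)
gt = np.random.rand(352, 1216) + 0.1
print(relative_depth_metrics(pred, gt))
```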

KITTI Dataset: 1216x352

[Image comparisons: RGB Input vs. Ours · RGB Input vs. Depth Anything]

DIODE Dataset: 1024x768

[Image comparisons: RGB Input vs. Ours · RGB Input vs. Depth Anything]

ETH3D Dataset: 6215x4141

[Image comparisons: RGB Input vs. Ours · RGB Input vs. Depth Anything]

NYU Dataset

[Image comparisons: RGB Input vs. Ours · RGB Input vs. Depth Anything]

Sintel Dataset: 1024x436

[Image comparisons: RGB Input vs. Ours · RGB Input vs. Depth Anything]

BibTeX

Will be added later.