MOSAIC: Generating Consistent, Privacy-Preserving Scenes from Multiple Depth Views in Multi-Room Environments

Zhixuan Liu1, Haokun Zhu1, Rui Chen1, Jonathan Francis1,2, Soonmin Hwang3, Ji Zhang1, Jean Oh1
1CMU   2Bosch Center for AI   3Hanyang University
Under Review

TL;DR: We generate multi-view consistent images from depth maps captured at arbitrary viewpoints along a trajectory.

Abstract

We introduce a diffusion-based approach for generating privacy-preserving digital twins of multi-room indoor environments from depth images only. Central to our approach is a novel Multi-view Overlapped Scene Alignment with Implicit Consistency (MOSAIC) model that explicitly models cross-view dependencies within the same scene in a probabilistic sense. MOSAIC operates through an inference-time optimization that avoids both the error accumulation common in sequential approaches and the single-room constraint of panorama-based approaches. MOSAIC scales to complex scenes without any additional training and provably reduces the variance of the denoising process as more overlapping views are added, leading to improved generation quality. Experiments show that MOSAIC outperforms state-of-the-art baselines on image fidelity metrics when reconstructing complex multi-room environments.

MOSAIC: Multi-view Overlapped Scene Alignment
with Implicit Consistency

We introduce a novel zero-shot approach that explicitly considers cross-view dependencies within the same scene in the probabilistic sense.

MOSAIC Pipeline Illustration
Overview of the MOSAIC pipeline.

Design Features:

  • (a) MOSAIC denoising process: at each diffusion denoising step, MOSAIC applies Multi-Channel Test-Time Optimization to the formulated consistency objective.
  • (b) Projection loss: we prove that the consistency objective acts as a cross-view projection loss, and we propose a weighted projection loss for practical implementation.
  • (c) Pixel-space refinement loss: because the VAE decoder is non-linear, consistency in latent space does not guarantee consistency in pixel space. MOSAIC addresses this with a pixel-space refinement applied during the final denoising stages.
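
To make the idea of a cross-view projection loss in (b) concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a pinhole camera with shared intrinsics `K`, a known relative pose (`R_ab`, `t_ab`) from view A to view B, nearest-neighbor sampling, and an optional per-pixel weight map. The function names `project_pixels` and `weighted_projection_loss` are hypothetical and chosen only for illustration.

```python
import numpy as np

def project_pixels(depth_a, K, R_ab, t_ab, eps=1e-8):
    """Map each pixel of view A to pixel coordinates in view B (pinhole model, assumed)."""
    h, w = depth_a.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    # Back-project pixels to 3D points in camera A's frame using the depth map.
    pts_a = (np.linalg.inv(K) @ pix.T) * depth_a.reshape(1, -1)
    # Express the points in camera B's frame, then project with the intrinsics.
    pts_b = R_ab @ pts_a + t_ab.reshape(3, 1)
    z = pts_b[2]
    z_safe = np.where(z > eps, z, 1.0)  # avoid division by zero for points behind B
    uv_b = (K @ pts_b)[:2] / z_safe
    return uv_b.T.reshape(h, w, 2), z.reshape(h, w)

def weighted_projection_loss(feat_a, feat_b, depth_a, K, R_ab, t_ab, weights=None):
    """Weighted mean squared difference between A's features and B's features
    sampled at the locations where A's pixels land in B (illustrative only)."""
    h, w = depth_a.shape
    uv_b, z_b = project_pixels(depth_a, K, R_ab, t_ab)
    u_b = np.rint(uv_b[..., 0]).astype(int)
    v_b = np.rint(uv_b[..., 1]).astype(int)
    # Keep only pixels that lie in front of camera B and inside its image.
    valid = (z_b > 1e-8) & (u_b >= 0) & (u_b < w) & (v_b >= 0) & (v_b < h)
    if not valid.any():
        return 0.0  # no overlap between the two views
    if weights is None:
        weights = np.ones((h, w))  # uniform weighting is an assumption
    diff = feat_a[valid] - feat_b[v_b[valid], u_b[valid]]
    sq = np.sum(diff ** 2, axis=-1)
    wv = weights[valid]
    return float(np.sum(wv * sq) / np.sum(wv))
```

In this sketch `feat_a` and `feat_b` are arrays of shape `(H, W, C)`, which could stand for latents or decoded pixels. The loss is zero when the overlapping regions of the two views agree, and it only considers pixels of A that land inside B's image.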

Citation


@misc{liu2025mosaicgeneratingconsistentprivacypreserving,
  title={MOSAIC: Generating Consistent, Privacy-Preserving Scenes from Multiple Depth Views in Multi-Room Environments}, 
  author={Zhixuan Liu and Haokun Zhu and Rui Chen and Jonathan Francis and Soonmin Hwang and Ji Zhang and Jean Oh},
  year={2025},
  eprint={2503.13816},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.13816}, 
}
          

Acknowledgements

We thank Yanbo Xu, Yifan Pu, Zhipeng Bao, and Zongtai Li for their helpful input. This work was supported in part by NSF IIS-2112633.