+ TLDR:
+ This method evaluates an image by comparing it with multiple views of
+ the same scene through cross-attention, eliminating the need for a
+ pre-aligned ground truth image.
+
+
+ Application: evaluating rendered images from novel view
+ synthesis (NVS) systems, where ground truth references are unavailable.
+
Abstract
+
+ We introduce a novel Cross-Reference image quality assessment
+ method that fills a gap in the image assessment landscape,
+ complementing the array of established evaluation schemes:
+ Full-Reference metrics like SSIM,
+ No-Reference metrics such as NIQE,
+ General-Reference metrics including FID, and
+ Multi-Modal-Reference metrics, e.g. CLIPScore.
+
+ Utilising a neural network with the cross-attention mechanism and a unique data collection
+ pipeline from NVS optimisation, our method enables accurate image quality assessment without
+ requiring ground truth references.
+ By comparing a query image against multiple views of the same scene, our method addresses
+ the limitations of existing metrics in novel view synthesis (NVS) and similar tasks where
+ direct reference images are unavailable.
+ Experimental results show that our method correlates closely with the
+ Full-Reference metric SSIM while requiring no ground truth references.
+
Method
+
+ Our goal is to evaluate the quality of a query image using a set of reference images
+ that capture the same scene as the query image but from other viewpoints.
+ From the NVS application perspective, the query image is often a rendered image
+ with artefacts, and the reference images consist of the real captured images.
+
Network
+
+ We propose a network that takes a query image and a set of reference images,
+ and predicts a dense score map for the query image.
+ Our network consists of three components:
+
+
+
an image encoder which extracts feature maps from input images;
+
a cross-reference module that associates a query image with multi-view reference images; and
+
a score regression head that regresses a CrossScore for each pixel of the query image.
+
+
+ In practice, we adapt
+ a pretrained DINOv2-small model as the image encoder,
+ a Transformer Decoder for the cross-reference module, and
+ a shallow MLP for the score regression head.
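+
+ A minimal sketch of this three-part design (assuming PyTorch; class and
+ parameter names are our illustration, and a simple patch embedding stands in
+ for the pretrained DINOv2-small encoder to keep the snippet self-contained):
+
+     import torch
+     import torch.nn as nn
+     import torch.nn.functional as F
+
+     class CrossScoreNet(nn.Module):
+         """Sketch of the three components described above.
+
+         encoder: per-image feature extractor; the paper uses DINOv2-small
+                  (ViT-S/14, 384-d tokens), emulated here by a patch embedding.
+         cross:   Transformer decoder whose cross-attention lets query tokens
+                  attend to tokens from all reference views.
+         head:    shallow MLP regressing a score for every query pixel.
+         """
+
+         def __init__(self, dim: int = 384, patch: int = 14, layers: int = 2):
+             super().__init__()
+             self.patch = patch
+             self.encoder = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
+             block = nn.TransformerDecoderLayer(d_model=dim, nhead=6, batch_first=True)
+             self.cross = nn.TransformerDecoder(block, num_layers=layers)
+             self.head = nn.Sequential(
+                 nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, patch * patch))
+
+         def forward(self, query: torch.Tensor, refs: torch.Tensor) -> torch.Tensor:
+             """query: (B, 3, H, W); refs: (B, N, 3, H, W) -> scores (B, H, W)."""
+             B, N = refs.shape[:2]
+             q = self.encoder(query)                     # (B, D, h, w)
+             h, w = q.shape[-2:]
+             q = q.flatten(2).transpose(1, 2)            # (B, h*w, D) query tokens
+             r = self.encoder(refs.flatten(0, 1))        # (B*N, D, h, w)
+             r = r.flatten(2).transpose(1, 2).reshape(B, N * h * w, -1)
+             t = self.cross(q, r)                        # query attends to references
+             s = self.head(t).sigmoid().transpose(1, 2)  # (B, patch*patch, h*w)
+             s = F.fold(s, output_size=(h * self.patch, w * self.patch),
+                        kernel_size=self.patch, stride=self.patch)
+             return s.squeeze(1)                         # dense per-pixel score map
+
+     net = CrossScoreNet()
+     score_map = net(torch.rand(2, 3, 224, 224), torch.rand(2, 5, 3, 224, 224))
+     print(score_map.shape)  # torch.Size([2, 224, 224])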
+
+
+
Self-supervised Training
+
+ We leverage existing NVS systems and abundant multi-view datasets to generate
+ SSIM maps for our training.
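+
+ For reference, an SSIM map stores at every pixel the standard local SSIM
+ (Wang et al., 2004) of a window around that pixel, with the usual constants:
+
+     \mathrm{SSIM}(x, y)
+       = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}
+              {(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)},
+     \qquad C_1 = (0.01\,L)^2, \quad C_2 = (0.03\,L)^2,
+
+ where \mu and \sigma denote the local means, variances, and covariance of the
+ two windows, and L is the image dynamic range.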
+
+
+
+ Specifically, we select Neural Radiance Field (NeRF)-style NVS systems as
+ our data engine.
+ Given a set of images, a NeRF recovers a neural representation of a scene by
+ iteratively reconstructing the given image set with photometric losses.
+
+
+ By rendering images with the camera parameters of the original captured
+ image set at multiple NeRF training checkpoints, we generate a large number of
+ images that contain various types of artefacts at various levels.
+ From these renders, we compute SSIM maps between the
+ rendered images and the corresponding real captured images, which serve as
+ our training objectives.
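+
+ A minimal sketch of this data engine (our illustration: `nerf.load` and
+ `nerf.render` are hypothetical APIs standing in for any NeRF implementation;
+ the dense SSIM map comes from scikit-image):
+
+     from skimage.metrics import structural_similarity
+
+     def make_training_pairs(nerf, images, cameras, checkpoints):
+         """Render every captured camera at several training checkpoints, so
+         artefact types and severities vary, then pair each render with its
+         per-pixel SSIM map against the real capture."""
+         pairs = []
+         for ckpt in checkpoints:          # e.g. sampled across optimisation
+             nerf.load(ckpt)               # hypothetical checkpoint loader
+             for real, cam in zip(images, cameras):
+                 rendered = nerf.render(cam)   # H x W x 3 floats in [0, 1]
+                 # full=True returns the per-pixel SSIM map alongside the mean.
+                 _, ssim_map = structural_similarity(
+                     real, rendered, channel_axis=-1, data_range=1.0, full=True)
+                 # One target score per pixel: average over colour channels.
+                 pairs.append((rendered, ssim_map.mean(axis=-1)))
+         return pairs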
+
Additional Results
+
Ablation: Enable and Disable Reference Images
+
+ Here, we show that our method effectively leverages reference views while
+ evaluating a query image.
+ With reference images enabled (ON), the score map predicted
+ by our method contains more details than when reference images
+ are disabled (OFF), where the model tends to assign
+ a high score everywhere.
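+
+ In terms of the sketch network above, the OFF condition can be emulated by
+ stripping the multi-view signal, e.g. by feeding copies of the query in place
+ of the references (our illustration, not necessarily the paper's exact protocol):
+
+     # ON: cross-attention sees real captured views of the same scene.
+     score_on = net(query, refs)
+
+     # OFF: references carry no information beyond the query itself, so the
+     # model can only guess, typically predicting high scores everywhere.
+     score_off = net(query, query.unsqueeze(1).expand_as(refs))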
+
Attention Weights Visualisation
+
+ We further illustrate that our model indeed attends to related context
+ in the reference images, as evidenced by the attention map visualisations below.
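+
+ A rough way to obtain such maps from the sketch network above is to re-run
+ the first decoder layer's cross-attention module with attention weights
+ enabled (an approximation for inspection, since it skips the decoder's
+ preceding self-attention):
+
+     import torch
+
+     net.eval()
+     B, N, H, W = 1, 5, 224, 224
+     query, refs = torch.rand(B, 3, H, W), torch.rand(B, N, 3, H, W)
+     # Tokenise query and references exactly as in CrossScoreNet.forward.
+     q = net.encoder(query).flatten(2).transpose(1, 2)            # (B, h*w, D)
+     r = net.encoder(refs.flatten(0, 1)).flatten(2).transpose(1, 2)
+     r = r.reshape(B, -1, r.shape[-1])                            # (B, N*h*w, D)
+     with torch.no_grad():
+         _, attn = net.cross.layers[0].multihead_attn(
+             q, r, r, need_weights=True, average_attn_weights=True)
+     h = w = H // net.patch
+     # For each query token: one attention map over every reference view.
+     attn_maps = attn.reshape(B, h, w, N, h, w)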
+
Acknowledgement
+
+ This research is supported by an ARIA
+ research gift grant from Meta Reality Lab.
+ We gratefully thank
+ Shangzhe Wu,
+ Tengda Han, and
+ Zihang Lai
+ for insightful discussions, and
+ Michael Hobley
+ for proofreading.
+
BibTeX
+
+ @article{wang2024crossscore,
+   title={CrossScore: Towards Multi-View Image Evaluation and Scoring},
+   author={Zirui Wang and Wenjing Bian and Omkar Parkhi and Yuheng Ren and Victor Adrian Prisacariu},
+   journal={arXiv preprint arXiv:2404.14409},
+   year={2024}
+ }
+