
CrossScore: Towards Multi-View Image Evaluation and Scoring


Zirui Wang1    Wenjing Bian1    Omkar Parkhi2    Yuheng Ren2    Victor Adrian Prisacariu1


1University of Oxford     2Meta Reality Lab

arXiv    Code (Coming Soon)

TLDR: This method evaluates an image by comparing it with multiple views of the same scene through cross-attention, eliminating the need for a pre-aligned ground truth image.


Application: Evaluate rendered images from novel view synthesis (NVS) applications where ground truth references are unavailable.

We introduce an image assessment method that examines query images by referencing multiple views of the same scene, producing results termed CrossScore maps. Our results show that CrossScore is closely correlated with SSIM across diverse datasets, without requiring pre-aligned ground truth images. Colour coding: red represents the highest score, followed by orange, green, and blue, indicating decreasing scores respectively.

Abstract


We introduce a novel Cross-Reference image quality assessment method that fills a gap in the image assessment landscape, complementing the array of established evaluation schemes: Full-Reference metrics like SSIM, No-Reference metrics such as NIQE, General-Reference metrics including FID, and Multi-Modal-Reference metrics, e.g. CLIPScore.

We propose a novel cross-reference (CR) image quality assessment (IQA) scheme, which evaluates a query image using multiple unregistered reference images captured from different viewpoints. This approach sets a new research trajectory apart from conventional IQA schemes such as full-reference (FR), general-reference (GR), no-reference (NR), and multi-modal-reference (MMR).

Utilising a neural network with a cross-attention mechanism and a unique data collection pipeline from NVS optimisation, our method enables accurate image quality assessment without requiring ground truth references. By comparing a query image against multiple views of the same scene, our method addresses the limitations of existing metrics in novel view synthesis (NVS) and similar tasks where direct reference images are unavailable. Experimental results show that our method is closely correlated with the full-reference metric SSIM, while not requiring ground truth references.


Method


Our goal is to evaluate the quality of a query image using a set of reference images that capture the same scene as the query image but from other viewpoints. From the NVS application perspective, the query image is often a rendered image with artefacts, and the reference images consist of the real captured images.

Method Overview. Left: Our NVS-based data engine, which supplies query and reference images along with SSIM maps to drive the self-supervised training of our model. Right: Our model, which takes a query image and a set of reference images as input and predicts a score map for the query image.

Network


We propose a network that takes a query image and a set of reference images and predicts a dense score map for the query image. Our network consists of three components:

1. an image encoder, which extracts feature maps from input images;
2. a cross-reference module, which associates a query image with multi-view reference images; and
3. a score regression head, which regresses a CrossScore for each pixel of the query image.

In practice, we adapt a pretrained DINOv2-small model as the image encoder, a Transformer Decoder for the cross-reference module, and a shallow MLP for the score regression head.
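To make these three components concrete, below is a minimal PyTorch sketch. The class name CrossScoreNet, the layer counts, and the score range are illustrative assumptions; only the choice of a DINOv2-small encoder, a Transformer decoder, and a shallow MLP head follows the text above. The sketch predicts per-patch scores, which would be upsampled to a dense per-pixel map.

import torch
import torch.nn as nn

class CrossScoreNet(nn.Module):
    """Illustrative sketch: DINOv2-small encoder + cross-attention decoder + MLP head.

    Hyperparameters (heads, layers) are assumptions, not the published config.
    """

    def __init__(self, feat_dim=384, num_heads=6, num_layers=2):
        super().__init__()
        # Pretrained DINOv2-small: 384-d patch tokens (inputs sized to multiples of 14).
        self.encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
        # Cross-reference module: query tokens attend to reference tokens.
        layer = nn.TransformerDecoderLayer(d_model=feat_dim, nhead=num_heads,
                                           batch_first=True)
        self.cross_ref = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Shallow MLP regressing a score per query patch.
        self.head = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.GELU(),
                                  nn.Linear(feat_dim, 1), nn.Sigmoid())

    def forward(self, query, refs):
        # query: (B, 3, H, W); refs: (B, N, 3, H, W), N unregistered reference views.
        B, N = refs.shape[:2]
        q_tok = self.encoder.forward_features(query)["x_norm_patchtokens"]  # (B, P, C)
        r_tok = self.encoder.forward_features(refs.flatten(0, 1))["x_norm_patchtokens"]
        r_tok = r_tok.reshape(B, N * r_tok.shape[1], -1)                    # (B, N*P, C)
        fused = self.cross_ref(tgt=q_tok, memory=r_tok)  # query attends across all views
        return self.head(fused).squeeze(-1)              # (B, P); upsample for dense map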


Self-supervised Training


We leverage existing NVS systems and abundant multi-view datasets to generate SSIM maps for our training.


Specifically, we select Neural Radiance Field (NeRF)-style NVS systems as our data engine. Given a set of images, a NeRF recovers a neural representation of a scene by iteratively reconstructing the given image set with photometric losses.


By rendering images with the camera parameters of the originally captured image set at multiple NeRF training checkpoints, we generate a large number of images that contain various types of artefacts at various levels. From these, we compute SSIM maps between rendered images and the corresponding real captured images, which serve as our training objectives.
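A rough sketch of this data engine follows, assuming a hypothetical NeRF trainer exposing load_checkpoint and render methods (the real interface will differ) and using scikit-image's structural_similarity with full=True to obtain dense SSIM maps:

from skimage.metrics import structural_similarity

def build_training_samples(nerf, cameras, real_images, checkpoints):
    """Pair NeRF renders with dense SSIM maps as training targets.

    `nerf`, `load_checkpoint`, and `render` are hypothetical placeholders.
    """
    samples = []
    for ckpt in checkpoints:            # early/mid/late states -> varied artefact levels
        nerf.load_checkpoint(ckpt)
        for cam, real in zip(cameras, real_images):
            rendered = nerf.render(cam)             # (H, W, 3) float in [0, 1]
            # Dense SSIM map between the render and the real capture.
            _, ssim_map = structural_similarity(
                rendered, real, channel_axis=-1, data_range=1.0, full=True)
            ssim_map = ssim_map.mean(axis=-1)       # per-pixel training objective
            samples.append((rendered, real, ssim_map))
    return samples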


Additional Results

Evaluating images rendered from a popular NVS method (Gaussian Splatting) using CrossScore and SSIM. CrossScore is highly correlated with SSIM, while not requiring ground truth images.

Ablation: Enable and Disable Reference Images


Here, we show that our method effectively leverages reference views while evaluating a query image. With reference images enabled (ON), the score map predicted by our method contains more detail than when reference images are disabled (OFF), where the model tends to assign a high score everywhere.

Ablation study on the importance of reference images.
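One plausible way to wire up this ON/OFF switch, reusing the CrossScoreNet sketch from the Network section, is to replace the cross-attention memory with the query's own tokens in the OFF condition, so that no multi-view evidence reaches the decoder. This is an assumed mechanism, not necessarily the paper's exact protocol.

def predict(model, query, refs, use_refs=True):
    # Hypothetical ablation helper for the CrossScoreNet sketch above.
    q_tok = model.encoder.forward_features(query)["x_norm_patchtokens"]
    if use_refs:                                    # ON: attend to reference views
        B, N = refs.shape[:2]
        r_tok = model.encoder.forward_features(refs.flatten(0, 1))["x_norm_patchtokens"]
        memory = r_tok.reshape(B, N * r_tok.shape[1], -1)
    else:                                           # OFF: query attends to itself only
        memory = q_tok
    return model.head(model.cross_ref(tgt=q_tok, memory=memory)).squeeze(-1)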

Attention Weights Visualisation


We further illustrate that our model indeed attends to related context in the reference images, as evidenced by the visualisation of attention maps below.

Attention weights visualisation of our model. Top left: a query image with a region of interest (centre of image) highlighted with a magenta box. Right column: three reference images from our cross-reference set with attention maps overlaid. The attention maps illustrate the attention that is paid to predicting image quality at the query region. Red and blue denote high and low attention weights respectively. Note that we use 5 reference images in our experiment, but only 3 are shown due to space constraints. Bottom: predicted CrossScore map and SSIM map. Red and blue denote high- and low-quality image regions respectively.
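To reproduce this kind of visualisation with the sketch above, the decoder's cross-attention weights can be recovered by calling the last decoder layer's attention module directly on the encoded tokens. This reruns the attention outside the full decoder stack, so the weights are an approximation, and all module names refer to the illustrative sketch rather than released code.

import torch

@torch.no_grad()
def cross_attention_weights(model, query, refs):
    """Approximate cross-attention weights of the last decoder layer."""
    B, N = refs.shape[:2]
    q_tok = model.encoder.forward_features(query)["x_norm_patchtokens"]
    r_tok = model.encoder.forward_features(refs.flatten(0, 1))["x_norm_patchtokens"]
    r_tok = r_tok.reshape(B, N * r_tok.shape[1], -1)
    mha = model.cross_ref.layers[-1].multihead_attn   # cross-attention sub-module
    _, weights = mha(q_tok, r_tok, r_tok,
                     need_weights=True, average_attn_weights=True)
    # weights: (B, P_query, N*P_ref). Slice one query patch, split the last axis
    # into N views, and reshape each to the patch grid to overlay on the images.
    return weights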

Acknowledgement


This research is supported by an ARIA research gift grant from Meta Reality Lab. We gratefully thank Shangzhe Wu, Tengda Han, and Zihang Lai for insightful discussions, and Michael Hobley for proofreading.


BibTeX

@article{wang2024crossscore,
  title={CrossScore: Towards Multi-View Image Evaluation and Scoring},
  author={Zirui Wang and Wenjing Bian and Omkar Parkhi and Yuheng Ren and Victor Adrian Prisacariu},
  journal={arXiv preprint arXiv:2404.14409},
  year={2024}
}