Relational Visual Similarity

——— arXiv 2025 ———
Thao Nguyen1     Sicheng Mo3     Krishna Kumar Singh2     Yilin Wang2     Jing Shi2     Nicholas Kolkin2
Eli Shechtman2     Yong Jae Lee1,2, ★     Yuheng Li1, ★
★: Equal advising
1. University of Wisconsin-Madison     2. Adobe Research     3. UCLA
Paper · Code · HuggingFace Dataset · Data Viewer · BibTeX Citation
Slides (TBD) · Poster (TBD) · Qualitative Gallery · Image Retrieval

TL;DR: We introduce a new notion of visual similarity: relational visual similarity, which complements traditional attribute similarity.


📜 Abstract

Humans do not just see attribute similarity---we also see relational similarity. This ability to perceive and recognize relational similarity is argued by cognitive scientists to be what distinguishes humans from other species. Yet all widely used visual similarity metrics today (e.g., LPIPS, CLIP, DINO) focus solely on perceptual attribute similarity and fail to capture the rich, often surprising relational similarities that humans perceive.

[Figure: peach-Earth analogy]

An apple is like a peach because both are reddish fruit, but the Earth is also like a peach: its crust, mantle, and core correspond to the peach’s skin, flesh, and pit.
(The figure above is modified from the original Plate Tectonic Metaphor Illustrations (CMU).)

How can we go beyond the visible content of an image to capture its relational properties? To answer this question, we first formulate relational image similarity as a measurable problem: two images are relationally similar when their internal relations or functions among visual elements correspond, even if their visual attributes differ.
We then curate a 114k image-caption dataset in which the captions are anonymized---describing the underlying relational logic of the scene rather than its surface content. Using this dataset, we finetune a Vision Language Model to measure the relational similarity between images. This model serves as a first step toward connecting images by their underlying relational structure rather than their visible appearance. Our study shows that while relational similarity has many real-world applications, existing image similarity models fail to capture it---revealing a critical gap in visual computing.
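To make this formulation concrete, here is a minimal conceptual sketch in Python (not the paper's actual pipeline): it assumes two hypothetical helpers, describe_relations (which produces an anonymized relational description of an image) and text_similarity (which compares two such descriptions).

# Conceptual sketch only: relational similarity compares the anonymized relational
# description of each image, whereas attribute metrics (LPIPS, CLIP, DINO) compare
# surface appearance. Both helper functions below are hypothetical placeholders.
def relational_similarity(image_a, image_b, describe_relations, text_similarity):
    caption_a = describe_relations(image_a)  # e.g., "a layered sphere: outer shell, soft middle, hard core"
    caption_b = describe_relations(image_b)  # same relational logic, possibly very different appearance
    return text_similarity(caption_a, caption_b)

Under this view, a peach and the Earth score high even though their pixels, colors, and object categories do not match.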


βš’οΈ Usage Example

Below is an example usage of relsim. For more details, please check: github/relsim
Install
pip install relsim
Quick Run
from relsim.relsim_score import relsim
from PIL import Image

# Load the pretrained relsim model (a LoRA-finetuned vision-language model) and its image preprocessor
model, preprocess = relsim(
    pretrained=True,
    checkpoint_dir="thaoshibe/relsim-qwenvl25-lora"
)

# Preprocess the two images to compare (replace the paths with your own image files)
img1 = preprocess(Image.open("image_path_1"))
img2 = preprocess(Image.open("image_path_2"))

# Compute the relational similarity score between the two images
similarity = model(img1, img2)
print(f"relsim score: {similarity:.3f}")

We present a qualitative gallery of image retrieval results using (i) attribute-based metrics (e.g., LPIPS, CLIP, DINO) and (ii) our relation-based metric, relsim.
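As a rough sketch of how such a retrieval can be reproduced with the package above (the query image, gallery folder, and file pattern below are illustrative placeholders, not paths from the paper):

from pathlib import Path

from PIL import Image
from relsim.relsim_score import relsim

# Load the pretrained relsim model and preprocessor, as in the Quick Run example
model, preprocess = relsim(
    pretrained=True,
    checkpoint_dir="thaoshibe/relsim-qwenvl25-lora"
)

query = preprocess(Image.open("query.jpg"))            # image to retrieve matches for
gallery_paths = sorted(Path("gallery").glob("*.jpg"))  # candidate images to rank

# Score every candidate against the query, then rank by relational similarity
scores = [(path, model(query, preprocess(Image.open(path)))) for path in gallery_paths]
scores.sort(key=lambda item: item[1], reverse=True)

for path, score in scores[:5]:
    print(f"{path.name}: relsim = {score:.3f}")

An attribute-based baseline would rank the same gallery by appearance instead, which is the contrast shown in the gallery.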


πŸ” Dataset Viewer

This is a data viewer for the datasets used in the paper. Click on an image to view the corresponding dataset. You can also browse the datasets on HuggingFace.






Seed Groups (Live View)
500+ {Image Group, Anonymous Caption}
or HuggingFace datasets/seed-groups

Anonymous Captions (Live View)
114k+ {Image, Anonymous Caption}
or HuggingFace datasets/anonymous-captions-114k
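For programmatic access, a minimal loading sketch with the Hugging Face datasets library might look like the following; the repository id and the assumption that each record pairs an image with its anonymized caption should be verified against the dataset cards linked above.

from datasets import load_dataset

# Assumed repository id; check the HuggingFace dataset card linked above for the exact name
ds = load_dataset("thaoshibe/anonymous-captions-114k", split="train")

example = ds[0]
print(example.keys())  # expected: an image field and its anonymized (relational) caption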


📚 Citation

@inproceedings{relsim,
  title={Relational Visual Similarity},
  author={Nguyen, Thao and Mo, Sicheng and Singh, Krishna Kumar and Wang, Yilin and Shi, Jing and Kolkin, Nicholas and Shechtman, Eli and Lee, Yong Jae and Li, Yuheng},
  booktitle={PUT ARXIV LINK HERE},
  year={2025},
}

You've reached the end.
This website template is adapted from visii (NeurIPS 2023) and DreamFusion (ICLR 2023); the source code can be found here and here. Thank you! (.❛ ᴗ ❛.)