Relational Visual Similarity

——— arXiv 2025 ———
Thao Nguyen1     Sicheng Mo3     Krishna Kumar Singh2     Yilin Wang2     Jing Shi2     Nicholas Kolkin2
Eli Shechtman2     Yong Jae Lee1,2, β˜…     Yuheng Li1, β˜…
β˜…: Equal advising
1. University of Wisconsin-Madison     2. Adobe Research     3. UCLA
arXiv Paper · Code (Ready!) · HuggingFace Model · HuggingFace Dataset · BibTeX Citation · Poster · Data Viewer · Qualitative Gallery · Image Retrieval · Did You Know?

Thao: "I created 2 videos, but can't decide which one is better 😭 So I will put both here πŸ˜‚"[video-version-1] [video-version-2]

TL;DR: We introduce a new notion of visual similarity, relational visual similarity (relsim),
which complements traditional attribute-based similarity metrics (e.g., LPIPS, CLIP, DINO).


πŸ“œ Abstract

Humans do not just see attribute similarity---we also see relational similarity. This ability to perceive and recognize relational similarity is argued by cognitive scientists to be what distinguishes humans from other species. Yet, all widely used visual similarity metrics today (e.g., LPIPS, CLIP, DINO) focus solely on perceptual attribute similarity and fail to capture the rich, often surprising relational similarities that humans perceive.

[Figure: peach-earth]
An apple is like a peach because both are reddish fruit, but the Earth is also like a peach: its crust, mantle, and core correspond to the peach's skin, flesh, and pit. (The figure above is modified from the original version created by Chad Edward (thank you Chad!); original image link: Plate Tectonic Metaphor Illustrations (CMU).)

How can we go beyond the visible content of an image to capture its relational properties? How can we bring images with the same relational logic closer together in representation space? To answer these questions, we first formulate relational image similarity as a measurable problem: two images are relationally similar when the internal relations or functions among their visual elements correspond, even if their visual attributes differ. We then curate a 114k image-caption dataset in which the captions are anonymized---they describe the underlying relational logic of the scene rather than its surface content. Using this dataset, we finetune a vision-language model to measure the relational similarity between images. This model serves as a first step toward connecting images by their underlying relational structure rather than their visible appearance. Our study shows that while relational similarity has many real-world applications, existing image similarity models fail to capture it---revealing a critical gap in visual computing.


βš’οΈ Usage Example

This is an example of how to use relsim. For more details, please check: github/relsim .
This code has been tested with Python 3.10 on (i) an NVIDIA A100 80GB (torch 2.5.1+cu124) and (ii) an NVIDIA RTX A6000 48GB (torch 2.9.1+cu128). Other hardware setups haven't been tested, but they should still work. Please install PyTorch and torchvision according to your machine configuration.
Install
pip install relsim
Quick Run
from relsim.relsim_score import relsim
from PIL import Image

# Load the pretrained relsim model and its image preprocessor
model, preprocess = relsim(
    pretrained=True,
    checkpoint_dir="thaoshibe/relsim-qwenvl25-lora")

# Replace with the paths to the two images you want to compare
img1 = preprocess(Image.open("image_path_1"))
img2 = preprocess(Image.open("image_path_2"))

# Compute the relational similarity score between the two images
similarity = model(img1, img2)
print(f"relsim score: {similarity:.3f}")

We present a qualitative gallery of image retrieval results using (i) attribute-based metrics (e.g., LPIPS, CLIP, DINO) and (ii) our relational similarity metric, relsim.
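To make the comparison concrete, below is a minimal retrieval sketch built only on the relsim API shown above: it scores every candidate image in a folder against a query and ranks them by relational similarity. The query path, gallery folder, and file filtering are illustrative assumptions, not part of the released code.

import os

from PIL import Image
from relsim.relsim_score import relsim

# Load the pretrained relsim model and preprocessor (same call as in the Quick Run above)
model, preprocess = relsim(
    pretrained=True,
    checkpoint_dir="thaoshibe/relsim-qwenvl25-lora")

# Illustrative paths: a query image and a folder of candidate images
query = preprocess(Image.open("query.jpg"))
gallery_dir = "gallery/"

ranked = []
for name in sorted(os.listdir(gallery_dir)):
    if not name.lower().endswith((".jpg", ".jpeg", ".png")):
        continue
    candidate = preprocess(Image.open(os.path.join(gallery_dir, name)))
    # Higher score = the candidate shares more relational structure with the query
    ranked.append((model(query, candidate), name))

# Print candidates from most to least relationally similar
for score, name in sorted(ranked, reverse=True):
    print(f"{score:.3f}  {name}")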


πŸ” Dataset Viewer

This is a data viewer for the datasets used in the paper. Please click on an image to view the corresponding dataset. You can also browse the datasets on HuggingFace; a minimal loading sketch follows the dataset descriptions below.






Seed Groups (Live View): 500+ {Image Group, Anonymous Caption} (also on HuggingFace: datasets/seed-groups)

Anonymous Captions (Live View): 114k+ {Image, Anonymous Caption} (also on HuggingFace: datasets/anonymous-captions-114k)
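For programmatic access, here is a minimal loading sketch using the Hugging Face datasets library. The full repository IDs are abbreviated on this page, so the ID used below (under the thaoshibe namespace) and the split name are assumptions; please take the canonical names from the HuggingFace links above.

from datasets import load_dataset

# Assumed repository ID (the page abbreviates it as datasets/anonymous-captions-114k)
# and assumed split name; replace both with the exact values from the HuggingFace page.
ds = load_dataset("thaoshibe/anonymous-captions-114k", split="train")

# Each record pairs an image with its anonymized (relational) caption
print(len(ds))
print(ds[0].keys())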


❓ Did You Know?

Relational similarity isn't new!
In 1997, in the highly influential paper Structure Mapping in Analogy and Similarity, published in American Psychologist, Prof. Dedre Gentner and Prof. Arthur B. Markman proposed a hand-drawn Similarity Space with two axes: attributes shared (x-axis) and relations shared (y-axis).

Figure cropped from the original paper.

Believe it or not, we spent months struggling to recreate this theoretical figure (yes we did!), ... and ... after ... a lot of effort ... to ..... pull ..... things .....together .......... We finally made it!
After almost 30 years, thanks to relsim, the theory can now be brought to life!
Cool, isn’t it?? (Λ΅ β€’Μ€ α΄— - Λ΅ ) ✧
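If you want to sketch a toy version of that 2D Similarity Space yourself, here is a minimal, illustrative recipe: relsim supplies the y-axis (relations shared), while the x-axis (attributes shared) comes from any attribute metric of your choice, left here as a hypothetical attribute_sim placeholder. The image pairs and plotting layout are assumptions for illustration, not the exact procedure behind the figure above.

import matplotlib.pyplot as plt
from PIL import Image
from relsim.relsim_score import relsim

model, preprocess = relsim(
    pretrained=True,
    checkpoint_dir="thaoshibe/relsim-qwenvl25-lora")

def attribute_sim(img_a, img_b):
    # Hypothetical placeholder: plug in any attribute metric here,
    # e.g., cosine similarity of CLIP or DINO embeddings.
    raise NotImplementedError

# Illustrative image pairs to place in the similarity space
pairs = [("apple.jpg", "peach.jpg"), ("earth.jpg", "peach.jpg")]

xs, ys = [], []
for path_a, path_b in pairs:
    img_a, img_b = Image.open(path_a), Image.open(path_b)
    xs.append(attribute_sim(img_a, img_b))                   # x-axis: attributes shared
    ys.append(model(preprocess(img_a), preprocess(img_b)))   # y-axis: relations shared (relsim)

plt.scatter(xs, ys)
plt.xlabel("Attributes shared (attribute similarity)")
plt.ylabel("Relations shared (relsim)")
plt.title("Similarity Space (Gentner & Markman, 1997)")
plt.show()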

πŸ“š Citation

@misc{nguyen2025relationalvisualsimilarity,
title={Relational Visual Similarity},
author={Thao Nguyen and Sicheng Mo and Krishna Kumar Singh and Yilin Wang and Jing Shi and Nicholas Kolkin and Eli Shechtman and Yong Jae Lee and Yuheng Li},
year={2025},
eprint={2512.07833},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.07833},
}

You've reached the end.
This website template is adapted from visii (NeurIPS 2023) and DreamFusion (ICLR 2023); the source code can be found here and here. You are more than welcome to use this website's source code for your own project; just add credit back to here. Thank you! (.❛ α΄— ❛.).