Personal Camera Roll
Visual Question Answering
explore
The project lives across a paper, a conversational agent, a benchmark dataset, and three interactive pages — a live question board, an agent demo, and a dataset viewer. Pick where you'd like to start.
abstract
We study the personal camera roll visual question answering setting. In this setting, a conversational AI assistant can access a user's personal camera roll and retrieve relevant photos to answer queries, ranging from simple factual questions (e.g., "Name of the food I tried yesterday?") to more open-ended ones (e.g., "Recommend some dishes I have never eaten before").
Given the scale of a personal camera roll (multiple years, hundreds to thousands of photos), a successful AI assistant needs to understand a long-horizon, highly personalized stream of visual content in order to navigate it and locate the correct or relevant information. To support this, we collect and manually annotate questions that mimic real-world usage. The final dataset, camroll, contains 50 users, 31,476 images, and 2,500 QA pairs.
We further design camroll-agent, a conversational AI agent equipped with hierarchical memory and a minimal set of tools for efficient navigation over large, personalized visual memory. Experimental results show that camroll-agent outperforms numerous baselines and existing methods for long-context understanding in AI agent systems.
Together, the camroll dataset and camroll-agent highlight a gap in AI agents' long-context reasoning: personalized visual memory requires different approaches from standard long-context textual memory, especially where consistency, visual details, and user-specific context matter.
data & reproducibility
Everything is open and reproducible:
- 50 users, 31,476 images, 2,500 QA pairs — sourced from YFCC100M under each photo's original Creative Commons license and served by Flickr's CDN; nothing is re-uploaded.
- Manual annotations — every question/answer pair was hand-written to mimic real-world camera-roll usage, then grounded in the user's actual photos.
- Agent traces — every benchmark question ships with the full hierarchical-memory + tool-call trace that produced its answer, so results are step-by-step reproducible.
- Code — camroll-agent loop, the four-tool retrieval set, eval harness, and data-prep pipeline all live in the GitHub repo.
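To give a sense of scale, the headline counts above can be turned into per-user averages. The short Python sketch below uses only the three totals stated on this page. The function name `per_user_averages` is ours, chosen for illustration, and the results are plain means: they say nothing about how photos or questions are actually distributed across individual users.

```python
# Headline counts exactly as stated on this page.
NUM_USERS = 50
NUM_IMAGES = 31_476
NUM_QA_PAIRS = 2_500


def per_user_averages(users, images, qa_pairs):
    """Return simple means derived from the three dataset totals.

    These are averages only; the real per-user distribution is not
    given on this page and may be uneven.
    """
    return {
        "images_per_user": images / users,
        "qa_pairs_per_user": qa_pairs / users,
        "images_per_qa_pair": images / qa_pairs,
    }


averages = per_user_averages(NUM_USERS, NUM_IMAGES, NUM_QA_PAIRS)
for name, value in averages.items():
    print(f"{name}: {value:.2f}")
# images_per_user: 629.52
# qa_pairs_per_user: 50.00
# images_per_qa_pair: 12.59
```

An average of roughly 630 photos per user is consistent with the "hundreds to thousands of photos" range described in the abstract.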
The interactive pages on this site all run client-side — they fetch only the per-user JSON files from this folder; no personal data ever touches a server.
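For readers who want to work with the per-user JSON files offline, here is a minimal Python sketch of loading one file and checking that each question is grounded in that user's photos. The schema is an assumption made purely for illustration: the field names `user_id`, `images`, `qa_pairs`, and `evidence`, along with the sample values, are hypothetical and may not match the released files, so check the repo for the actual layout.

```python
import json

# Hypothetical file content. Field names and values are invented for
# illustration and are NOT taken from the released dataset.
SAMPLE = """
{
  "user_id": "user_000",
  "images": [
    {"id": "img_1", "url": "https://example.com/img_1.jpg"},
    {"id": "img_2", "url": "https://example.com/img_2.jpg"}
  ],
  "qa_pairs": [
    {
      "question": "Name of the food I tried yesterday?",
      "answer": "(example answer)",
      "evidence": ["img_2"]
    }
  ]
}
"""


def summarize_user(raw):
    """Parse one per-user JSON document and count grounded QA pairs."""
    record = json.loads(raw)
    image_ids = {img["id"] for img in record["images"]}
    # Treat a QA pair as grounded when every evidence id refers to a
    # photo that exists in this user's camera roll.
    grounded = [
        qa for qa in record["qa_pairs"]
        if set(qa["evidence"]) <= image_ids
    ]
    return {
        "user_id": record["user_id"],
        "num_images": len(image_ids),
        "num_qa": len(record["qa_pairs"]),
        "grounded_qa": len(grounded),
    }


summary = summarize_user(SAMPLE)
print(summary)
```

To run this against a real file, replace `SAMPLE` with the text of a downloaded per-user JSON file and adjust the key names to the actual schema.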
cite
@article{nguyen2026camroll,
  title   = {Personal Camera Roll Visual Question Answering},
  author  = {Nguyen, Thao and Li, Yuheng and Singh, Krishna Kumar
             and Kim, Donghyun and Lee, Yong Jae},
  journal = {arXiv preprint arXiv:2606.XXXXX},
  year    = {2026},
  note    = {Thao Nguyen and Yuheng Li contributed equally.},
  url     = {https://thaoshibe.github.io/camroll/}
}