camroll Personal Camera Roll Visual Question Answering · paper, code, dataset, demo
research project · 2026

Personal Camera Roll
Visual Question Answering

if an AI could see your whole camera roll, what would you ask?
¹UW–Madison · ²Adobe Research · ³Korea University
* equal contribution
Pre-print · 2026

explore

The project lives across a paper, a conversational agent, a benchmark dataset, and three interactive pages — a live question board, an agent demo, and a dataset viewer. Pick where you'd like to start.

abstract

We study the personal camera roll visual question answering setting. In this setting, a conversational AI assistant can access a user's personal camera roll and retrieve relevant photos to answer queries, ranging from simple factual questions (e.g., "Name of the food I tried yesterday?") to more open-ended ones (e.g., "Recommend some dishes I have never eaten before").

A personal camera roll is vast: it spans multiple years and holds hundreds to thousands of photos. A successful AI assistant therefore needs to understand a long-horizon, highly personalized stream of visual content in order to navigate it and locate the correct or relevant information. To support this, we collect and manually annotate questions that mimic real-world usage. The final dataset, camroll, contains 50 users, 31,476 images, and 2,500 QA pairs.

We further design camroll-agent, a conversational AI agent equipped with hierarchical memory and a minimal set of tools for efficient navigation over large, personalized visual memory. Experimental results show that camroll-agent outperforms numerous baselines and long-context understanding methods for AI agent systems.
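To make "hierarchical memory" concrete, here is a minimal sketch of one possible layout: photos grouped by year, then by month, with a coarse summary the agent could scan before opening individual photos. The levels, the `Photo` fields, and the function names are assumptions made for illustration; this page does not specify how camroll-agent actually organizes its memory.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Photo:
    # Hypothetical record; field names are illustrative, not the dataset schema.
    photo_id: str
    taken_on: date
    caption: str


def build_hierarchy(photos):
    """Group photos as year -> month -> list of photos (an assumed layout)."""
    tree = defaultdict(lambda: defaultdict(list))
    for photo in sorted(photos, key=lambda p: p.taken_on):
        tree[photo.taken_on.year][photo.taken_on.month].append(photo)
    return tree


def summarize(tree):
    """Coarse level of the memory: photo counts per month, per year."""
    return {
        year: {month: len(items) for month, items in months.items()}
        for year, months in tree.items()
    }
```

The idea behind a layout like this is that the agent can read the small summary first and descend only into the months that look relevant, rather than scanning every photo.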

Together, the camroll dataset and camroll-agent highlight a gap in AI agents' long-context reasoning: personalized visual memory requires different approaches from standard long-context textual memory, especially when consistency, visual details, and user-specific context matter.

  • users: 50 · real personal albums
  • images: 31,476 · long-horizon visual streams
  • qa pairs: 2,500 · manually annotated
  • agent: camroll-agent · hierarchical memory + tools

how it works

[ teaser figure: the agent retrieving photos (image not yet provided) ]
Figure 1. Given a free-form question, the agent picks among four retrieval tools (caption search, vector search, date filter, event lookup), reads the returned photos, and answers in natural language — with citations back to the source photos.
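The four tools named in the caption can be sketched as plain functions behind a small dispatch table. This is a toy illustration under stated assumptions, not the code in the repo: the `Photo` fields, the tool signatures, and `run_tool` are hypothetical, and the "vector search" here is a bag-of-words cosine similarity standing in for whatever embedding model the real agent uses.

```python
import math
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Photo:
    # Hypothetical record; field names are illustrative only.
    photo_id: str
    taken_on: date
    caption: str
    event: str


def _bow(text):
    """Bag-of-words vector (a stand-in for a real embedding)."""
    return Counter(text.lower().split())


def caption_search(photos, keyword):
    """Case-insensitive substring match on captions."""
    return [p for p in photos if keyword.lower() in p.caption.lower()]


def vector_search(photos, query, k=1):
    """Top-k photos by cosine similarity between query and caption vectors."""
    q = _bow(query)

    def cosine(photo):
        d = _bow(photo.caption)
        dot = sum(q[t] * d[t] for t in q)
        norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
            sum(v * v for v in d.values())
        )
        return dot / norm if norm else 0.0

    return sorted(photos, key=cosine, reverse=True)[:k]


def date_filter(photos, start, end):
    """Photos taken between start and end, inclusive."""
    return [p for p in photos if start <= p.taken_on <= end]


def event_lookup(photos, event):
    """Photos tagged with the given event label."""
    return [p for p in photos if p.event == event]


TOOLS = {
    "caption_search": caption_search,
    "vector_search": vector_search,
    "date_filter": date_filter,
    "event_lookup": event_lookup,
}


def run_tool(name, photos, **kwargs):
    """Call one tool and return the photo ids the answer could cite."""
    hits = TOOLS[name](photos, **kwargs)
    return {"tool": name, "citations": [p.photo_id for p in hits]}
```

In a full agent, a language model would choose `name` and `kwargs` from the question; here the caller supplies them directly, for example `run_tool("caption_search", photos, keyword="ramen")`.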

data & reproducibility

Everything is open and reproducible:

  • 50 users, 31,476 images, 2,500 QA pairs — sourced from YFCC100M under each photo's original Creative Commons license and served by Flickr's CDN; nothing is re-uploaded.
  • Manual annotations — every question/answer pair was hand-written to mimic real-world camera-roll usage, then grounded in the user's actual photos.
  • Agent traces — every benchmark question ships with the full hierarchical-memory + tool-call trace that produced its answer, so results are step-by-step reproducible.
  • Code — camroll-agent loop, the four-tool retrieval set, eval harness, and data-prep pipeline all live in the GitHub repo.
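As an illustration of what a step-by-step reproducible trace makes checkable, the sketch below defines a made-up trace record and verifies that every cited photo was actually returned by some tool call in that trace. The JSON keys, the example values, and the `uncited_sources` helper are all assumptions for this example; the released trace format may differ.

```python
import json

# Invented example record; keys and values are hypothetical, not real data.
EXAMPLE_TRACE = json.loads("""
{
  "user_id": "user_000",
  "question": "Name of the food I tried yesterday?",
  "steps": [
    {"tool": "date_filter",
     "args": {"start": "2024-05-01", "end": "2024-05-01"},
     "photo_ids": ["p1", "p7"]},
    {"tool": "caption_search",
     "args": {"keyword": "food"},
     "photo_ids": ["p1"]}
  ],
  "answer": "Ramen",
  "citations": ["p1"]
}
""")


def uncited_sources(trace):
    """Return cited photo ids that no tool call in the trace ever returned."""
    seen = {pid for step in trace["steps"] for pid in step["photo_ids"]}
    return [pid for pid in trace["citations"] if pid not in seen]
```

An empty result means the answer's citations are all grounded in photos the agent retrieved during that run; a non-empty result flags a citation that the trace cannot account for.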

The interactive pages on this site all run client-side — they fetch only the per-user JSON files from this folder; no personal data ever touches a server.

cite

If the camroll dataset or camroll-agent helps your work, we'd love a citation:
@article{nguyen2026camroll,
  title   = {Personal Camera Roll Visual Question Answering},
  author  = {Nguyen, Thao and Li, Yuheng and Singh, Krishna Kumar
             and Kim, Donghyun and Lee, Yong Jae},
  journal = {arXiv preprint arXiv:2606.XXXXX},
  year    = {2026},
  note    = {Thao Nguyen and Yuheng Li contributed equally.},
  url     = {https://thaoshibe.github.io/camroll/}
}