research project · 2026

Personal AI Agent for Camera Roll VQA

if an AI could see your whole camera roll, what would you ask?
¹UW–Madison · ²Korea University · ³Adobe Research
equal advising
Pre-print · 2026

explore

The project spans a paper, a conversational agent, a benchmark dataset, and two interactive pages: a live question board and an agent demo.

abstract

We study camera roll VQA with a personal AI agent. In this setting, a conversational AI assistant can access a user's personal camera roll and retrieve relevant photos to answer queries, ranging from simple factual questions (e.g., "Name of the food I tried yesterday?") to more open-ended ones (e.g., "Recommend some dishes I have never eaten before").

A personal camera roll is vast: it spans multiple years and contains hundreds to thousands of photos. A successful AI assistant therefore needs to understand a long-horizon, highly personalized stream of visual content in order to navigate it and locate the correct or relevant information. To study this setting, we collect and manually annotate questions that mimic real-world usage. The final dataset, camroll, contains 50 users, 31,476 images, and 2,500 QA pairs.

We further design camroll-agent, a conversational AI agent equipped with hierarchical memory and a minimal set of tools for efficient navigation over large, personalized visual memory. Experimental results show that camroll-agent outperforms numerous baselines and existing methods for long-context understanding in AI agent systems.

Together, the camroll dataset and camroll-agent highlight a gap in AI agents' long-context reasoning: personalized visual memory requires different approaches than standard long-context textual memory, especially when consistency, visual details, and user-specific context matter.

  • users: 50 (real personal albums)
  • images: 31,476 (long-horizon visual streams)
  • QA pairs: 2,500 (manually annotated)
  • agent: camroll-agent (hierarchical memory + tools)

how it works

camroll-agent is an AI agent that performs visual question answering (VQA) over a personal camera roll.

  • index your camera roll into a hierarchical queryable memory (events << captions << images).
  • the agent answers questions over that memory using 5 atomic tools: search, grep, list_by_date, get, and view_image.
Figure: Hierarchical memory for personal camera rolls, organized from low-level visual pixels (I) to higher-level semantic abstractions (captions C, events E). Agent interactions are designed accordingly, ranging from expensive tools (view_image, get) to cheaper ones (search, grep, list_by_date).
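
To make the memory layout and the tool set more concrete, here is a minimal Python sketch. It is not the camroll-agent implementation. The class names, field names, and the exact behavior of each tool are assumptions made for illustration; only the three levels (events, captions, images) and the five tool names come from the description above. In particular, keyword overlap stands in for whatever retrieval method the real search tool uses.

```python
from dataclasses import dataclass, field
from datetime import date
import re


@dataclass
class Image:
    image_id: str
    taken_on: date
    caption: str   # caption-level entry (C) describing this image
    event_id: str  # event (E) that this image is grouped under
    path: str      # where the raw pixels (I) live


@dataclass
class Event:
    event_id: str
    summary: str
    image_ids: list[str] = field(default_factory=list)


class CameraRollMemory:
    """Toy three-level memory: events, captions, images (hypothetical design)."""

    def __init__(self, images, events):
        self.images = {img.image_id: img for img in images}
        self.events = {ev.event_id: ev for ev in events}

    # Cheaper tools: they only touch text and metadata.
    def search(self, query):
        """Rank events by keyword overlap with the query (assumed behavior)."""
        words = set(query.lower().split())
        scored = []
        for ev in self.events.values():
            overlap = len(words & set(ev.summary.lower().split()))
            if overlap:
                scored.append((overlap, ev.event_id))
        return [eid for _, eid in sorted(scored, key=lambda s: (-s[0], s[1]))]

    def grep(self, pattern):
        """Return ids of images whose caption matches a regular expression."""
        rx = re.compile(pattern, re.IGNORECASE)
        return sorted(i.image_id for i in self.images.values() if rx.search(i.caption))

    def list_by_date(self, start, end):
        """Return ids of images taken between start and end, inclusive."""
        hits = [i for i in self.images.values() if start <= i.taken_on <= end]
        return [i.image_id for i in sorted(hits, key=lambda i: (i.taken_on, i.image_id))]

    # More expensive tools: they return full records or raw pixels.
    def get(self, item_id):
        """Fetch the full record for an event id or an image id."""
        if item_id in self.events:
            return self.events[item_id]
        return self.images[item_id]

    def view_image(self, image_id):
        """Return the image path; a real agent would hand the pixels to a vision model."""
        return self.images[image_id].path


# Made-up example data, not taken from the camroll dataset.
images = [
    Image("img1", date(2024, 5, 1), "a bowl of pho on a wooden table", "ev1", "photos/img1.jpg"),
    Image("img2", date(2024, 5, 1), "friends at a noodle restaurant", "ev1", "photos/img2.jpg"),
    Image("img3", date(2024, 7, 9), "sunset over a beach", "ev2", "photos/img3.jpg"),
]
events = [
    Event("ev1", "dinner at a vietnamese restaurant", ["img1", "img2"]),
    Event("ev2", "beach trip at sunset", ["img3"]),
]
memory = CameraRollMemory(images, events)

print(memory.search("beach sunset"))  # ['ev2']
print(memory.grep("pho|noodle"))      # ['img1', 'img2']
print(memory.view_image("img3"))      # photos/img3.jpg
```

In this sketch an agent would narrow candidates with the text-only tools first and call view_image only on the few images that remain, which mirrors the cheap-to-expensive ordering in the figure caption.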

dataset

  • Dataset: 50 users, 31,476 images, 2,500 QA pairs — sourced from YFCC100M under each photo's original Creative Commons license and served by Flickr's CDN; nothing is re-uploaded.
  • Manual annotations — every question/answer pair was hand-written to mimic real-world camera-roll usage, then grounded in the user's actual photos.
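
As a rough sense of scale, the headline counts can be turned into per-user averages. The short Python snippet below uses only the three totals stated above. The averages are derived here, not reported by the authors, and the actual per-user distribution is not given on this page.

```python
# Totals as stated for the camroll dataset.
NUM_USERS = 50
NUM_IMAGES = 31_476
NUM_QA_PAIRS = 2_500

# Simple averages; individual users may differ substantially.
images_per_user = NUM_IMAGES / NUM_USERS
qa_per_user = NUM_QA_PAIRS / NUM_USERS
images_per_question = NUM_IMAGES / NUM_QA_PAIRS

print(f"{images_per_user:.1f} images per user on average")         # 629.5
print(f"{qa_per_user:.0f} QA pairs per user on average")           # 50
print(f"{images_per_question:.1f} images per QA pair on average")  # 12.6
```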

For dataset access, please contact yuhli@adobe.com and krishsin@adobe.com.

cite

If the camroll dataset or camroll-agent helps your work, we'd love a citation:
@misc{camroll,
      title={Personal AI Agent for Camera Roll VQA}, 
      author={Thao Nguyen and Krishna Kumar Singh and Donghyun Kim and Yong Jae Lee and Yuheng Li},
      year={2026},
      eprint={2606.05275},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.05275}, 
}