🌋 Yo'LLaVA 👵🏻

Your Personalized Language and Vision Assistant

——— NeurIPS 2024 ———
🦡 University of Wisconsin--Madison

Given just a few images of a novel subject (e.g., a dog named <bo>), Yo’LLaVA learns to facilitate textual/visual conversations centered around that subject.

📜 Abstract

Large Multimodal Models (LMMs) have shown remarkable capabilities across a variety of tasks (e.g., image captioning, visual question answering). While broad, their knowledge remains generic (e.g., recognizing a dog), and they are unable to handle personalized subjects (e.g., recognizing a user's pet dog).
Human reasoning, in contrast, typically operates within the context of specific subjects in our surroundings. For example, one might ask, "What should I buy for my dog's birthday?", as opposed to a generic inquiry like "What should I buy for a dog's birthday?". Similarly, when looking at a friend's image, the interest lies in seeing their activities rather than merely observing generic human actions: "my friend is holding a cat" vs. "a man is holding a cat".

In this paper, we introduce the novel task of personalizing LMMs, so that they can have conversations about a specific subject. We propose Yo'LLaVA, which learns to embed a personalized subject into a set of latent tokens given a handful of example images of the subject. Our qualitative and quantitative analyses reveal that Yo'LLaVA can learn the concept more efficiently using fewer tokens and more effectively encode the visual attributes compared to strong prompting baselines (e.g., LLaVA).

👤 Personalizing Large Multimodal Models

Given a handful of images I_1, ..., I_n of a person or subject (e.g., 5 images of your friend <thao>), our goal is to embed this subject into a pre-trained LMM (in our case, LLaVA), so that both the user and the model can communicate using an identifier (e.g., <thao>) for that subject, while also retaining the broad pre-trained knowledge.

After being personalized, our method (Yo’LLaVA) can:

  • (1) recognize the subject in new images during testing
    (e.g., Yo’LLaVA can determine whether <thao> is in a photo or not)
  • (2) support visual question answering about the subject
    (e.g., given a new photo, one can ask about <thao>’s location)
  • (3) support text-only conversations about the subject without any test-time reference images
    (e.g., one can ask about <thao>'s intrinsic attributes, such as color)
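
To make these three modes concrete, below is an illustrative set of queries one might issue once a subject like <thao> has been learned. The prompt wording and image path are hypothetical examples, not part of any released interface.

    # Illustrative queries for the three capabilities above (hypothetical examples).
    recognition_query = {"image": "new_photo.jpg",
                         "prompt": "Can you see if <thao> is in this photo?"}
    vqa_query = {"image": "new_photo.jpg",
                 "prompt": "Where is <thao> in this photo?"}
    text_only_query = {"image": None,  # no reference image needed at test time
                       "prompt": "What color is <thao>?"}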


🖼️ Examples of Personalized Conversation

Concept Library:
<T>
<bo>
<mam>
<Y>
<characterC>
<characterE>


🌋 Yo'LLaVA: Your Personalized Language & Vision Assistant 👵🏻

⚙️ Training Pipeline:

We define a personalized soft-prompt for the subject as:

<sks> is <token1><token2>...<tokenk>

Here, <sks> is a newly added vocabulary token that serves as an identifier for the subject, allowing both the user and the model to reference this subject when asking or answering questions. The tokens <token1><token2>...<tokenk> are soft tokens that are learned to embed visual details about the subject.
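
As a concrete illustration, the sketch below shows one way such a prompt could be set up on a Hugging Face LLaVA-style checkpoint: register <sks> and the k soft tokens as new vocabulary entries, then train only their embedding rows while the rest of the model stays frozen. This is a minimal sketch, not the official Yo'LLaVA code; the checkpoint name, k = 16, and the gradient-masking trick are assumptions.

    # Minimal sketch (not the official Yo'LLaVA code): add <sks> plus k soft tokens
    # and train only the new embedding rows of a frozen LLaVA-style model.
    import torch
    from transformers import AutoTokenizer, LlavaForConditionalGeneration

    k = 16                                   # number of soft tokens (hyperparameter)
    model_id = "llava-hf/llava-1.5-7b-hf"    # assumed checkpoint name

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(model_id)

    # Register the identifier and the k soft tokens as new vocabulary entries.
    new_tokens = ["<sks>"] + [f"<token{i}>" for i in range(1, k + 1)]
    tokenizer.add_tokens(new_tokens, special_tokens=True)
    model.resize_token_embeddings(len(tokenizer))
    new_ids = tokenizer.convert_tokens_to_ids(new_tokens)

    # Freeze all pre-trained weights; only the new rows of the input embedding
    # receive gradient updates (masked via a gradient hook). The output-embedding
    # rows for the new tokens could be unfrozen in the same way if desired.
    for p in model.parameters():
        p.requires_grad_(False)
    emb = model.get_input_embeddings()
    emb.weight.requires_grad_(True)
    mask = torch.zeros_like(emb.weight)
    mask[new_ids] = 1.0
    emb.weight.register_hook(lambda grad: grad * mask)

    # Every training example then references the subject through its learned prompt:
    #   "<sks> is <token1><token2>...<tokenk>"
    soft_prompt = "<sks> is " + "".join(f"<token{i}>" for i in range(1, k + 1))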

🛠 Training Dataset Creation:
To help the model learn the new visual concept, we generate conversational training data triplets {Image, Question, Answer}:

  • (1) Learning to Engage in Natural Conversations.
    We create generic conversational training data (e.g., visual Q&A) focused on the subject's visual characteristics.
    Note: No input images are given during training!
    Q: What type of object is <sks>?
    A: <sks> is a stuffed animal.
  • (2) Enhancing Recognition with Hard Negative Mining.
    A mixture of positive and negative examples helps the model understand the visual attributes of the subject (see the data-creation sketch after this list).
    • Positive: Provided by the user.
      Q: Can you see if <sks> is in this photo?
      A: Yes, <sks> is in this photo.
    • Negative: A diverse range of items that are visually similar but not identical to <sks>, either sampled or retrieved from LAION-5B.
      Q: Can you check if <sks> is in this photo?
      A: I have analyzed the image, and I can confirm that <sks> is not present in the photo.
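
A minimal sketch of how such {Image, Question, Answer} triplets could be assembled is shown below. It is illustrative only, not the released data-generation code; the question and answer templates mirror the examples above, and the image lists are placeholders (hard negatives would come from sampling or retrieval over LAION-5B).

    # Illustrative assembly of {Image, Question, Answer} training triplets
    # (a sketch, not the released Yo'LLaVA data-generation code).
    import random

    RECOGNITION_QUESTIONS = [
        "Can you see if <sks> is in this photo?",
        "Can you check if <sks> is in this photo?",
    ]
    POSITIVE_ANSWER = "Yes, <sks> is in this photo."
    NEGATIVE_ANSWER = ("I have analyzed the image, and I can confirm that "
                       "<sks> is not present in the photo.")

    def build_recognition_triplets(positive_images, negative_images):
        """Mix user-provided positives with sampled/retrieved negatives."""
        triplets = []
        for path in positive_images:
            triplets.append({"image": path,
                             "question": random.choice(RECOGNITION_QUESTIONS),
                             "answer": POSITIVE_ANSWER})
        for path in negative_images:
            triplets.append({"image": path,
                             "question": random.choice(RECOGNITION_QUESTIONS),
                             "answer": NEGATIVE_ANSWER})
        random.shuffle(triplets)
        return triplets

    # Text-only conversational data about the subject uses no image at all:
    text_only_example = {"image": None,
                         "question": "What type of object is <sks>?",
                         "answer": "<sks> is a stuffed animal."}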

📥 BibTeX

    @misc{nguyen2024yollavapersonalizedlanguagevision,
          title={Yo'LLaVA: Your Personalized Language and Vision Assistant},
          author={Thao Nguyen and Haotian Liu and Yuheng Li and Mu Cai and Utkarsh Ojha and Yong Jae Lee},
          year={2024},
          eprint={2406.09400},
          archivePrefix={arXiv},
          primaryClass={cs.CV},
          url={https://arxiv.org/abs/2406.09400},
    }

💌 Acknowledgement

🤗 This work was supported in part by NSF CAREER IIS2150012, an Adobe Data Science award, the Microsoft Accelerate Foundation Models Research Program, and Institute of Information & Communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) (No. 2022-0-00871, Development of AI Autonomy and Knowledge Enhancement for AI Agent Collaboration; No. RS-2022-00187238, Development of Large Korean Language Model Technology for Efficient Pre-training).

Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of CLIP, LLaMA, Vicuna, and GPT-4. The dataset is released under CC BY-NC 4.0 (allowing only non-commercial use), and models trained on the dataset should not be used outside of research purposes.

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Thank you (.❛ ᴗ ❛.).