Visual Instruction Inversion:
Image Editing via Visual Prompting

——— NeurIPS 2023 ———
University of Wisconsin - Madison
Paper · Poster · Code

TL;DR: A framework for inverting visual prompts into editing instructions for text-to-image diffusion models.


Abstract

Text-conditioned image editing has emerged as a powerful tool for manipulating images. However, in many situations, language can be ambiguous and ineffective in describing specific image edits. When faced with such challenges, visual prompts can be a more informative and intuitive way to convey ideas.

We present a method for image editing via visual prompting. Given example pairs that represent the "before" and "after" images of an edit, our goal is to learn a text-based editing direction that can be used to perform the same edit on new images. We leverage the rich, pretrained editing capabilities of text-to-image diffusion models by inverting visual prompts into editing instructions.

Our results show that with just one example pair, we can achieve competitive results compared to state-of-the-art text-conditioned image editing frameworks.



Prior work: Text-conditioned image editing 📍 Ours: Visual prompting image editing

Text-conditioned scheme (prior work): the model takes a test image and a text prompt (e.g., "Make it a drawing", "Turn it into an aerial photo") to perform the desired edit.

Visual prompting scheme (ours): given a before → after example pair of an edit, our goal is to learn an implicit text-based editing instruction and then apply it to new test images.
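
To make the contrast concrete, here is what the prior-work, text-conditioned scheme looks like in practice with the Hugging Face diffusers InstructPix2Pix pipeline; the file names are illustrative. In our scheme, the hand-written prompt is replaced by an instruction learned from a before/after pair, as sketched under "How does it work?" below.

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from diffusers.utils import load_image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = load_image("test.png")  # illustrative test image
# In the text-conditioned scheme, the desired edit must be written out in words.
edited = pipe("Make it a drawing", image=image,
              num_inference_steps=50, image_guidance_scale=1.5).images[0]
edited.save("edited_text_prompt.png")
```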

💡 Use Case: What if the desired edit is difficult to describe in words?
For example, it is hard to describe in words how to transform the Starbucks logo into a "3D version"; in this case, a visual prompt is more effective than a text prompt.

Before: [before image]

After: [after image]
🧚 Inspired by this Reddit post, we tested Visii + InstructPix2Pix on the Starbucks and Gandour logos.

Test: [test image]

Results of the learned instruction ‹ins› concatenated with extra text: "Wonder Woman", "Scarlet Witch", "Daenerys Targaryen", "Neytiri in Avatar", "She-Hulk", and "Maleficent".

How does it work?

Given an example before-and-after image pair, we optimize the latent text instruction that converts the “before” image to the “after” image using a frozen image editing diffusion model, e.g., InstructPix2Pix.

[Framework overview figure]
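
Below is a minimal sketch of this optimization, assuming the Hugging Face diffusers InstructPix2Pix pipeline as the frozen editor. The hyperparameters, the number of learnable tokens, and the file names are illustrative, and the sketch uses only the standard denoising loss; it is not a drop-in reproduction of the released implementation.

```python
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF
from diffusers import DDPMScheduler, StableDiffusionInstructPix2PixPipeline
from diffusers.utils import load_image

device = "cuda"
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float32
).to(device)
pipe.unet.requires_grad_(False)
pipe.vae.requires_grad_(False)
pipe.text_encoder.requires_grad_(False)

# A plain DDPM scheduler with the same noise schedule is convenient for the training loss.
noise_scheduler = DDPMScheduler.from_config(pipe.scheduler.config)

def preprocess(path):
    # Load an image, resize, and map to [-1, 1] as a (1, 3, H, W) tensor.
    img = load_image(path).resize((512, 512))
    return TF.to_tensor(img).unsqueeze(0).to(device) * 2.0 - 1.0

before_img = preprocess("before.png")  # illustrative file names for the example pair
after_img = preprocess("after.png")

with torch.no_grad():
    # Target latents of the "after" image and conditioning latents of the "before" image.
    after_latents = pipe.vae.encode(after_img).latent_dist.sample() * pipe.vae.config.scaling_factor
    cond_latents = pipe.vae.encode(before_img).latent_dist.mode()
    # Initialize the embedding sequence from an empty prompt.
    ids = pipe.tokenizer("", padding="max_length",
                         max_length=pipe.tokenizer.model_max_length,
                         return_tensors="pt").input_ids.to(device)
    base_emb = pipe.text_encoder(ids)[0]

num_tokens = 10  # number of learnable instruction tokens (illustrative)
ins_tokens = torch.nn.Parameter(base_emb[:, 1:1 + num_tokens].clone())  # the learnable ‹ins›
optimizer = torch.optim.Adam([ins_tokens], lr=1e-3)

for step in range(1000):
    # Splice the learnable tokens into the otherwise fixed embedding sequence.
    emb = torch.cat([base_emb[:, :1], ins_tokens, base_emb[:, 1 + num_tokens:]], dim=1)
    noise = torch.randn_like(after_latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps, (1,), device=device)
    noisy = noise_scheduler.add_noise(after_latents, noise, t)
    # InstructPix2Pix conditions the UNet on the "before" image via channel concatenation.
    unet_in = torch.cat([noisy, cond_latents], dim=1)
    pred = pipe.unet(unet_in, t, encoder_hidden_states=emb).sample
    loss = F.mse_loss(pred, noise)  # standard denoising loss on the noise prediction
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```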

Hybrid instruction

We only optimize a fixed number of tokens, so we have the flexibility to concatenate additional information to the learned instruction during inference. Users can provide extra text to combine with or guide the learned instruction according to their preferences (see the sketch after the figures below).

Instruction optimization: we optimize the instruction embedding ‹ins›. Instruction concatenation: at test time, we can add extra information to the learned instruction ‹ins› to further guide the edit.

We can concatenate extra information into the learned instruction ‹ins› to steer the edit. This allows us to achieve more fine-grained control over the resulting images.
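
Below is a rough sketch of how such a hybrid instruction can be assembled at test time, reusing the variables from the optimization sketch above. It assumes the pipeline accepts precomputed prompt embeddings via prompt_embeds; the extra text and file names are illustrative.

```python
with torch.no_grad():
    # Encode the extra text with the frozen CLIP text encoder.
    extra_ids = pipe.tokenizer("Wonder Woman", padding="max_length",
                               max_length=pipe.tokenizer.model_max_length,
                               return_tensors="pt").input_ids.to(device)
    extra_emb = pipe.text_encoder(extra_ids)[0]

# Keep the learned instruction tokens, append the extra-text tokens, and truncate
# back to the text encoder's maximum sequence length so the shape stays (1, 77, dim).
hybrid = torch.cat([base_emb[:, :1],       # start-of-text embedding
                    ins_tokens.detach(),   # learned ‹ins›
                    extra_emb[:, 1:]], dim=1)[:, :pipe.tokenizer.model_max_length]

test_image = load_image("test.png")        # illustrative new image to edit
edited = pipe(image=test_image, prompt_embeds=hybrid,
              num_inference_steps=50, guidance_scale=7.5,
              image_guidance_scale=1.5).images[0]
edited.save("edited_hybrid.png")
```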

Citation

@inproceedings{nguyen2023visual,
    title={Visual Instruction Inversion: Image Editing via Visual Prompting},
    author={Nguyen, Thao and Li, Yuheng and Ojha, Utkarsh and Lee, Yong Jae},
    booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
    year={2023},
    url={https://openreview.net/forum?id=l9BsCh8ikK}
}

This website template is adapted from DreamFusion and Imagic; the source code can be found here and here. Photo credit: Bo the Shiba. Thank you! (.❛ ᴗ ❛.).