What do we learn from inverting CLIP models?

Changes 18/NOV/2024

Added invert-ga-overengineered.py
Usage: Same as the other scripts; see the examples below or the argument definitions in the code.
Better text embeddings -> Better model inversion:

~ Who needs a diffusion model? Img2Img with the 'text encoder'! 😉 ~

Input Image: ResNet neuron. As interpreted by CLIP ViT.

Changes 30/AUG/2024

Added Gradient Ascent (GA): Uses an input image instead of a text prompt.
Optimizes text embeddings for cosine similarity with image embeddings
Prints CLIP's 'opinion' about image to console
Uses text embeddings for inversion image generation
⚠️ As with text-prompt inversion without GA, innocent images (or prompts) can lead to nefarious and NSFW inversions.
Refer to the paper by the original authors for details (see below).
✅ Usage example (only use this code for --use_image; use invert.py for a text --prompt):

python invert-ga.py --num_iters 3400 --use_image "in/catshoe.jpg" --img_size 64 --tv 0.0005 --batch_size 13 --bri 0.4 --con 0.4 --sat 0.4 --save_every 10 --print_every 10 --model_name ViT-L/14
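The GA step described above (optimizing text embeddings for cosine similarity with an image embedding) can be illustrated with a dependency-free toy. This is only a sketch of the idea, not the code in invert-ga.py: the short vectors stand in for CLIP embeddings, the gradient is written out by hand instead of using autograd, and the names `cosine_grad` and `ascend` as well as the step count and learning rate are assumptions made for this example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def cosine_grad(t, v):
    """Analytic gradient of cosine(t, v) with respect to t."""
    nt, nv = math.sqrt(dot(t, t)), math.sqrt(dot(v, v))
    c = dot(t, v) / (nt * nv)
    return [vi / (nt * nv) - c * ti / (nt * nt) for ti, vi in zip(t, v)]

def ascend(text_emb, image_emb, steps=300, lr=0.1):
    """Gradient ascent: move text_emb uphill on cosine similarity (hypothetical helper)."""
    t = list(text_emb)
    for _ in range(steps):
        g = cosine_grad(t, image_emb)
        t = [ti + lr * gi for ti, gi in zip(t, g)]
    return t

# Toy stand-ins; real CLIP embeddings have hundreds of dimensions.
image_emb = [1.0, 2.0, -1.0, 0.5]   # fixed target (the image embedding)
text_emb = [-0.5, 0.3, 1.0, 1.0]    # trainable vector (the text embedding)

before = cosine(text_emb, image_emb)
optimized = ascend(text_emb, image_emb)
after = cosine(optimized, image_emb)
print(f"cosine similarity: {before:.3f} -> {after:.3f}")
```

In the actual script the image embedding comes from CLIP's image encoder and the optimized text embeddings are then used for the inversion, as listed above; the toy only shows why maximizing cosine similarity pulls the text vector toward the image vector.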

✅ Added support for ViT-L/14@336px (to all scripts), usage example:

python invert.py --num_iters 3400 --prompt "an ai robot" --img_size 64 --tv 0.005 --batch_size 13 --bri 0.4 --con 0.4 --sat 0.4 --save_every 10 --print_every 10 --model_name ViT-L/14@336px

GA + Inversion examples (generated with my improved ViT-L/14 fine-tune):

Original CLIP Gradient Ascent Script: Used with permission from @advadnoun (Twitter / X)

Original README.MD by the authors:

What do we learn from inverting CLIP models?

Warning: This paper contains sexually explicit images and language, offensive visuals and terminology, discussions on pornography, gender bias, and other potentially unsettling, distressing, and/or offensive content for certain readers.

Paper

Installing requirements:

pip install -r requirements.txt

How to run:

python invert.py \
    --num_iters 3400 \
    --prompt "The map of the African continent" \
    --img_size 64 \
    --tv 0.005 \
    --batch_size 13 \
    --bri 0.4 \
    --con 0.4 \
    --sat 0.4 \
    --save_every 100 \
    --print_every 100 \
    --model_name ViT-B/16

Arguments:

--num_iters: Number of iterations during the inversion process.
--prompt: The text prompt to invert.
--img_size: Size of the image at iteration 0.
--tv: Total Variation weight.
--batch_size: How many augmentations to use at each iteration.
--bri: ColorJitter Augmentation brightness degree.
--con: ColorJitter Augmentation contrast degree.
--sat: ColorJitter Augmentation saturation degree.
--save_every: Frequency at which to save intermediate results.
--print_every: Frequency at which to print intermediate information.
--model_name: One of ['RN50', 'RN101', 'RN50x4', 'RN50x16', 'ViT-B/32', 'ViT-B/16'].
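
To give a feel for what the Total Variation weight (--tv) controls, here is a small dependency-free sketch. It is an assumption, not the formula used in invert.py: it uses one common (anisotropic, summed) definition of total variation on a tiny grayscale grid, and the `objective` helper is a hypothetical illustration of how a TV term is typically added to a similarity loss.

```python
def total_variation(img):
    """Anisotropic TV: sum of absolute differences between neighbouring pixels."""
    h, w = len(img), len(img[0])
    vertical = sum(abs(img[y + 1][x] - img[y][x])
                   for y in range(h - 1) for x in range(w))
    horizontal = sum(abs(img[y][x + 1] - img[y][x])
                     for y in range(h) for x in range(w - 1))
    return vertical + horizontal

def objective(similarity, img, tv_weight=0.005):
    """Hypothetical loss: reward similarity, penalize noisy images."""
    return -similarity + tv_weight * total_variation(img)

# A flat image has no variation; a checkerboard has the maximum for its size.
smooth = [[0.5] * 4 for _ in range(4)]
checker = [[float((x + y) % 2) for x in range(4)] for y in range(4)]

tv_smooth = total_variation(smooth)
tv_checker = total_variation(checker)
print("TV smooth:", tv_smooth, "TV checker:", tv_checker)

# With the same similarity score, the noisier image gets the worse (higher) loss.
loss_smooth = objective(0.3, smooth)
loss_checker = objective(0.3, checker)
print("loss smooth:", loss_smooth, "loss checker:", loss_checker)
```

Under this reading, a larger --tv value pushes the optimization toward smoother images, while a smaller one (such as the 0.0005 used in the invert-ga.py example) tolerates more high-frequency detail.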
