Robot Holmes and the Vision-Language Murder Mysteries

Johannes Kolbe

December 12, 2024

Transcript

  1. About Me
     Johannes Kolbe
     • Data Scientist at celebrate company
       ◦ Focus on Computer Vision
       ◦ Some expertise in NLP
     • M.Sc. in Computer Science at TU Berlin
     • Hugging Face Fellow
       ◦ Leading the CV Study Group on Discord: https://huggingface.co/join/discord
       ◦ Past study groups: https://github.com/huggingface/community-events/tree/main/computer-vision-study-group
     @[email protected] | linkedin.com/in/johko | @johko990 | huggingface.co/johko
  2. (image-only slide)

  3. (image-only slide)

  4. Visual Question Answering
     What is in the middle of the street? a police box.
     Could it also be a Tardis? yes.
     Is it a Tardis? no.
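
A question-and-answer exchange like this can be reproduced with the generic visual-question-answering pipeline from transformers. This is a minimal sketch, assuming the pipeline's default ViLT checkpoint (dandelin/vilt-b32-finetuned-vqa) rather than whichever model produced the answers on the slide, and a stand-in image URL:

    import requests
    from PIL import Image
    from transformers import pipeline

    # Stand-in image; the talk uses its own street scene with a police box.
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)

    vqa = pipeline("visual-question-answering")  # defaults to dandelin/vilt-b32-finetuned-vqa

    for question in ["What is in the middle of the street?",
                     "Could it also be a Tardis?",
                     "Is it a Tardis?"]:
        answer = vqa(image=image, question=question, top_k=1)[0]
        print(f"{question} -> {answer['answer']} ({answer['score']:.2f})")
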
  5. (image-only slide)

  6. CLIP
     Paper: "Learning Transferable Visual Models From Natural Language Supervision" by Radford, Kim et al., 2021, OpenAI
  7. CLIP
     Candidate prompts scored by CLIP:
     • inside a fancy victorian era shop
     • a dark city street in victorian era london at night
     • an old manor in the woods
     • a busy city street in victorian era at day time
     (similarity scores shown on the slide: 9%, 85%, 1%, 60%)
  8. CLIP
     Candidate prompts:
     • inside a fancy victorian era shop
     • a dark city street in victorian era london at night
     • an old manor in the woods
     • a busy city street in victorian era at day time
  9. The text prompts are encoded into text embeddings T_1 ... T_4:
     • inside a fancy victorian era shop
     • a dark city street in victorian era london at night
     • an old manor in the woods behind a big iron fence
     • a busy city street in victorian era at day time
  10. CLIP
      Candidate prompts:
      • inside a fancy victorian era shop
      • a dark city street in victorian era london at night
      • an old manor in the woods
      • a busy city street in victorian era at day time
      (diagram: image embeddings I_1 ... I_4 are compared against text embeddings T_1 ... T_4 in a pairwise similarity matrix)
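
The pairwise matching in this diagram can be reproduced with the CLIP classes from Hugging Face transformers. This is a minimal sketch, assuming the openai/clip-vit-base-patch32 checkpoint and stand-in COCO images in place of the Victorian scenes shown in the talk:

    import torch
    import requests
    from PIL import Image
    from transformers import CLIPProcessor, CLIPModel

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    prompts = [
        "inside a fancy victorian era shop",
        "a dark city street in victorian era london at night",
        "an old manor in the woods",
        "a busy city street in victorian era at day time",
    ]
    # Stand-in images; the talk uses its own Victorian scenes.
    urls = [
        "http://images.cocodataset.org/val2017/000000039769.jpg",
        "http://images.cocodataset.org/val2017/000000397133.jpg",
    ]
    images = [Image.open(requests.get(u, stream=True).raw) for u in urls]

    with torch.no_grad():
        text_inputs = processor(text=prompts, return_tensors="pt", padding=True)
        image_inputs = processor(images=images, return_tensors="pt")
        text_emb = model.get_text_features(**text_inputs)     # text embeddings  T_1 ... T_4
        image_emb = model.get_image_features(**image_inputs)  # image embeddings I_1 ... I_n

        # Cosine similarities: one row per image, one column per prompt
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
        image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
        logits = model.logit_scale.exp() * image_emb @ text_emb.T
        print(logits.softmax(dim=-1))  # per-image probabilities like the percentages on the slides
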
  11. CLIP - Interrogation (CLIP Playground)
      Prompted with:
      • A piece of red cloth
      • A piece from a murderer's clothing
      • A piece of cloth an innocent citizen lost
      Scores shown on the slide: 1%, 96%, 1%
  12. CLIP - Interrogation (CLIP Playground)
      Prompted with:
      • A piece of red cloth
      • A piece from a murderer's clothing
      • A piece of cloth an innocent citizen lost
      Scores shown on the slide: 97%, 2%
  13. CLIP - Interrogation (CLIP Playground)
      Prompted with:
      • A piece of red cloth
      • A piece of cloth an innocent citizen lost
      • A piece of cloth from a murderer's clothing
      Scores shown on the slide: 49%, 50%
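
This "interrogation" is essentially zero-shot image classification over hand-written prompts. A minimal sketch of the same experiment with the transformers zero-shot-image-classification pipeline, assuming the same CLIP checkpoint and a stand-in URL in place of the evidence photo:

    from transformers import pipeline

    classifier = pipeline("zero-shot-image-classification",
                          model="openai/clip-vit-base-patch32")

    candidate_labels = [
        "A piece of red cloth",
        "A piece of cloth an innocent citizen lost",
        "A piece of cloth from a murderer's clothing",
    ]

    # Stand-in image URL; the talk uses its own evidence photo.
    results = classifier("http://images.cocodataset.org/val2017/000000039769.jpg",
                         candidate_labels=candidate_labels)
    for result in results:
        print(f"{result['label']}: {100 * result['score']:.0f}%")

As slides 11-13 show, small changes in prompt wording ("A piece from a murderer's clothing" vs. "A piece of cloth from a murderer's clothing") shift the scores considerably, so the phrasing of the candidate labels matters for this kind of interrogation.
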
  14. OWL-ViT Training
      Stage I: contrastive image-text pre-training, as in CLIP, with prompts like:
      • inside a fancy victorian era shop
      • a dark city street in victorian era london at night
      • an old manor in the woods
      • a busy city street in victorian era at day time
      Stage II: detection fine-tuning with object queries like: a police box, a street lamp, a bridge, a horse
  15. OWL-ViT Training
      Object queries: a police box, a street lamp, a bridge, a horse
      (diagram: query embeddings T_1 ... T_4 are matched against per-region image embeddings I_1 ... I_4; predicted labels: a street lamp, a street lamp, a police box, a police box)
  16. BLIP-2 “Q”
      Paper: "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models" by Junnan Li, Dongxu Li et al., 2023, Salesforce
  17. (image-only slide)

  18. BLIP-2 “Q”
      (diagram: a Vision Transformer (ViT-L/16) encodes the image into embeddings I_1 ... I_4; a text transformer, the Q-Former, with layers L_1 ... L_x connects them to a frozen LLM)
      Is there a Tardis in the picture? no.
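
To make the bridge in this diagram concrete: the Q-Former compresses the frozen image encoder's patch embeddings into a small, fixed number of query embeddings, which are then projected into the frozen LLM's input space. A minimal sketch, assuming the Blip2Model class and its get_qformer_features helper from transformers (names as in the library docs, not from the talk):

    import torch
    import requests
    from PIL import Image
    from transformers import Blip2Processor, Blip2Model

    processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
    model = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b")

    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        # Query embeddings produced by the Q-Former from the frozen ViT features;
        # these, not the raw patch embeddings, are what the frozen LLM conditions on.
        qformer_outputs = model.get_qformer_features(**inputs)

    print(qformer_outputs.last_hidden_state.shape)  # (batch, num_query_tokens, hidden_size)
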
  19. BLIP-2 - Interrogation
      what is shown in the image? a robot
      is he an innocent citizen? yes.
      could he be responsible for murders? no.
  20. BLIP-2 - Interrogation
      why not? because he is a robot
      can robots not commit murders or be responsible for them? no.
      so he could potentially commit a murder? yes.
  21. BLIP-2 - Interrogation
      could he also be a potential mob leader? yes...
      would you be shocked if i told you he is dead? yes...
      do you think he was a mob leader? yes...
  22. (image-only slide)

  23. CLIP

      from PIL import Image
      import requests
      from transformers import CLIPProcessor, CLIPModel

      model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
      processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

      url = "http://images.cocodataset.org/val2017/000000039769.jpg"
      image = Image.open(requests.get(url, stream=True).raw)

      inputs = processor(text=["a photo of a cat", "a photo of a dog"],
                         images=image, return_tensors="pt", padding=True)

      outputs = model(**inputs)
      logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
      probs = logits_per_image.softmax(dim=1)      # we can take the softmax to get the probabilities
  24. OWL-ViT

      import requests
      from PIL import Image
      import torch
      from transformers import OwlViTProcessor, OwlViTForObjectDetection

      processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
      model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

      url = "http://images.cocodataset.org/val2017/000000039769.jpg"
      image = Image.open(requests.get(url, stream=True).raw)

      texts = [["a photo of a cat", "a photo of a dog"]]
      inputs = processor(text=texts, images=image, return_tensors="pt")
      outputs = model(**inputs)
  25. OWL-ViT

      # Target image sizes (height, width) to rescale box predictions [batch_size, 2]
      target_sizes = torch.Tensor([image.size[::-1]])
      # Convert outputs (bounding boxes and class logits) to COCO API
      results = processor.post_process(outputs=outputs, target_sizes=target_sizes)

      i = 0  # Retrieve predictions for the first image for the corresponding text queries
      text = texts[i]
      boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]

      score_threshold = 0.1
      for box, score, label in zip(boxes, scores, labels):
          box = [round(coord, 2) for coord in box.tolist()]
          if score >= score_threshold:
              print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")
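
To visualise the detections, the filtered boxes can also be drawn onto the image with Pillow. A small follow-up sketch that reuses the variables from the snippet above:

    from PIL import ImageDraw

    draw = ImageDraw.Draw(image)
    for box, score, label in zip(boxes, scores, labels):
        if score < score_threshold:
            continue
        xmin, ymin, xmax, ymax = box.tolist()
        draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=2)
        draw.text((xmin, ymin), f"{text[label]} ({score.item():.2f})", fill="red")

    image.save("owlvit_detections.jpg")
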
  26. BLIP-2

      from PIL import Image
      import requests
      from transformers import Blip2Processor, Blip2ForConditionalGeneration
      import torch

      device = "cuda" if torch.cuda.is_available() else "cpu"

      processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
      model = Blip2ForConditionalGeneration.from_pretrained(
          "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
      )
      model.to(device)

      url = "http://images.cocodataset.org/val2017/000000039769.jpg"
      image = Image.open(requests.get(url, stream=True).raw)

      inputs = processor(images=image, return_tensors="pt").to(device, torch.float16)

      generated_ids = model.generate(**inputs)
      generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
      print(generated_text)
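
The interrogation dialogues on slides 19-21 use BLIP-2 for prompted question answering rather than plain captioning. A minimal sketch of that mode, reusing the processor, model, device and image from the snippet above and the "Question: ... Answer:" prompt format from the model card:

    question = "Could he be responsible for murders?"
    prompt = f"Question: {question} Answer:"

    inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, torch.float16)
    generated_ids = model.generate(**inputs, max_new_tokens=20)
    answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
    print(answer)
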