Robot Holmes and the Vision-Language Murder Mysteries

Johannes Kolbe

December 12, 2024

Transcript

  1. About Me
     Johannes Kolbe
     • Data Scientist at celebrate company
       ◦ Focus on Computer Vision
       ◦ Some expertise in NLP
     • M.Sc. in Computer Science at TU Berlin
     • Hugging Face Fellow
       ◦ Leading the CV Study Group on Discord: https://huggingface.co/join/discord
       ◦ Past study groups: https://github.com/huggingface/community-events/tree/main/computer-vision-study-group
     @[email protected] | linkedin.com/in/johko | @johko990 | huggingface.co/johko
  2. (image-only slide)

  3. (image-only slide)

  4. Visual Question Answering
     What is in the middle of the street? a police box.
     Could it also be a Tardis? yes.
     Is it a Tardis? no.
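
A question-and-answer exchange like this can be reproduced with the generic visual-question-answering pipeline from transformers. This is a minimal sketch, assuming the pipeline's default ViLT checkpoint (dandelin/vilt-b32-finetuned-vqa) rather than whichever model produced the answers on the slide, and a stand-in image URL:

    import requests
    from PIL import Image
    from transformers import pipeline

    # Stand-in image; the talk uses its own street scene with a police box.
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)

    vqa = pipeline("visual-question-answering")  # defaults to dandelin/vilt-b32-finetuned-vqa

    for question in ["What is in the middle of the street?",
                     "Could it also be a Tardis?",
                     "Is it a Tardis?"]:
        answer = vqa(image=image, question=question, top_k=1)[0]
        print(f"{question} -> {answer['answer']} ({answer['score']:.2f})")
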
  5. (image-only slide)

  6. CLIP
     Paper: "Learning Transferable Visual Models From Natural Language Supervision" by Radford, Kim et al., 2021, OpenAI
  7. CLIP
     Candidate prompts scored by CLIP:
     • inside a fancy victorian era shop
     • a dark city street in victorian era london at night
     • an old manor in the woods
     • a busy city street in victorian era at day time
     (similarity scores shown on the slide: 9%, 85%, 1%, 60%)
  8. CLIP
     Candidate prompts:
     • inside a fancy victorian era shop
     • a dark city street in victorian era london at night
     • an old manor in the woods
     • a busy city street in victorian era at day time
  9. The text prompts are encoded into text embeddings T_1 ... T_4:
     • inside a fancy victorian era shop
     • a dark city street in victorian era london at night
     • an old manor in the woods behind a big iron fence
     • a busy city street in victorian era at day time
  10. CLIP
      Candidate prompts:
      • inside a fancy victorian era shop
      • a dark city street in victorian era london at night
      • an old manor in the woods
      • a busy city street in victorian era at day time
      (diagram: image embeddings I_1 ... I_4 are compared against text embeddings T_1 ... T_4 in a pairwise similarity matrix)
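
The pairwise matching in this diagram can be reproduced with the CLIP classes from Hugging Face transformers. This is a minimal sketch, assuming the openai/clip-vit-base-patch32 checkpoint and stand-in COCO images in place of the Victorian scenes shown in the talk:

    import torch
    import requests
    from PIL import Image
    from transformers import CLIPProcessor, CLIPModel

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    prompts = [
        "inside a fancy victorian era shop",
        "a dark city street in victorian era london at night",
        "an old manor in the woods",
        "a busy city street in victorian era at day time",
    ]
    # Stand-in images; the talk uses its own Victorian scenes.
    urls = [
        "http://images.cocodataset.org/val2017/000000039769.jpg",
        "http://images.cocodataset.org/val2017/000000397133.jpg",
    ]
    images = [Image.open(requests.get(u, stream=True).raw) for u in urls]

    with torch.no_grad():
        text_inputs = processor(text=prompts, return_tensors="pt", padding=True)
        image_inputs = processor(images=images, return_tensors="pt")
        text_emb = model.get_text_features(**text_inputs)     # text embeddings  T_1 ... T_4
        image_emb = model.get_image_features(**image_inputs)  # image embeddings I_1 ... I_n

        # Cosine similarities: one row per image, one column per prompt
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
        image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
        logits = model.logit_scale.exp() * image_emb @ text_emb.T
        print(logits.softmax(dim=-1))  # per-image probabilities like the percentages on the slides
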
  11. CLIP - Interrogation (CLIP Playground)
      Prompted with:
      • A piece of red cloth
      • A piece from a murderer's clothing
      • A piece of cloth an innocent citizen lost
      Scores shown on the slide: 1%, 96%, 1%
  12. CLIP - Interrogation (CLIP Playground)
      Prompted with:
      • A piece of red cloth
      • A piece from a murderer's clothing
      • A piece of cloth an innocent citizen lost
      Scores shown on the slide: 97%, 2%
  13. CLIP - Interrogation (CLIP Playground)
      Prompted with:
      • A piece of red cloth
      • A piece of cloth an innocent citizen lost
      • A piece of cloth from a murderer's clothing
      Scores shown on the slide: 49%, 50%
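
This "interrogation" is essentially zero-shot image classification over hand-written prompts. A minimal sketch of the same experiment with the transformers zero-shot-image-classification pipeline, assuming the same CLIP checkpoint and a stand-in URL in place of the evidence photo:

    from transformers import pipeline

    classifier = pipeline("zero-shot-image-classification",
                          model="openai/clip-vit-base-patch32")

    candidate_labels = [
        "A piece of red cloth",
        "A piece of cloth an innocent citizen lost",
        "A piece of cloth from a murderer's clothing",
    ]

    # Stand-in image URL; the talk uses its own evidence photo.
    results = classifier("http://images.cocodataset.org/val2017/000000039769.jpg",
                         candidate_labels=candidate_labels)
    for result in results:
        print(f"{result['label']}: {100 * result['score']:.0f}%")

As slides 11-13 show, small changes in prompt wording ("A piece from a murderer's clothing" vs. "A piece of cloth from a murderer's clothing") shift the scores considerably, so the phrasing of the candidate labels matters for this kind of interrogation.
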
  14. OWL-ViT Training
      Stage I: contrastive image-text pre-training, as in CLIP, with prompts like:
      • inside a fancy victorian era shop
      • a dark city street in victorian era london at night
      • an old manor in the woods
      • a busy city street in victorian era at day time
      Stage II: detection fine-tuning with object queries like: a police box, a street lamp, a bridge, a horse
  15. OWL-ViT Training
      Object queries: a police box, a street lamp, a bridge, a horse
      (diagram: query embeddings T_1 ... T_4 are matched against per-region image embeddings I_1 ... I_4; predicted labels: a street lamp, a street lamp, a police box, a police box)
  16. BLIP-2 “Q”
      Paper: "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models" by Junnan Li, Dongxu Li et al., 2023, Salesforce
  17. (image-only slide)

  18. BLIP-2 “Q”
      (diagram: a Vision Transformer (ViT-L/16) encodes the image into embeddings I_1 ... I_4; a text transformer, the Q-Former, with layers L_1 ... L_x connects them to a frozen LLM)
      Is there a Tardis in the picture? no.
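
To make the bridge in this diagram concrete: the Q-Former compresses the frozen image encoder's patch embeddings into a small, fixed number of query embeddings, which are then projected into the frozen LLM's input space. A minimal sketch, assuming the Blip2Model class and its get_qformer_features helper from transformers (names as in the library docs, not from the talk):

    import torch
    import requests
    from PIL import Image
    from transformers import Blip2Processor, Blip2Model

    processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
    model = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b")

    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        # Query embeddings produced by the Q-Former from the frozen ViT features;
        # these, not the raw patch embeddings, are what the frozen LLM conditions on.
        qformer_outputs = model.get_qformer_features(**inputs)

    print(qformer_outputs.last_hidden_state.shape)  # (batch, num_query_tokens, hidden_size)
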
  19. BLIP-2 - Interrogation
      what is shown in the image? a robot
      is he an innocent citizen? yes.
      could he be responsible for murders? no.
  20. BLIP-2 - Interrogation
      why not? because he is a robot
      can robots not commit murders or be responsible for them? no.
      so he could potentially commit a murder? yes.
  21. BLIP-2 - Interrogation
      could he also be a potential mob leader? yes...
      would you be shocked if i told you he is dead? yes...
      do you think he was a mob leader? yes...
  22. (image-only slide)

  23. CLIP

      from PIL import Image
      import requests
      from transformers import CLIPProcessor, CLIPModel

      model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
      processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

      url = "http://images.cocodataset.org/val2017/000000039769.jpg"
      image = Image.open(requests.get(url, stream=True).raw)

      inputs = processor(text=["a photo of a cat", "a photo of a dog"],
                         images=image, return_tensors="pt", padding=True)

      outputs = model(**inputs)
      logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
      probs = logits_per_image.softmax(dim=1)      # we can take the softmax to get the probabilities
  24. OWL-ViT

      import requests
      from PIL import Image
      import torch
      from transformers import OwlViTProcessor, OwlViTForObjectDetection

      processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
      model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

      url = "http://images.cocodataset.org/val2017/000000039769.jpg"
      image = Image.open(requests.get(url, stream=True).raw)

      texts = [["a photo of a cat", "a photo of a dog"]]
      inputs = processor(text=texts, images=image, return_tensors="pt")
      outputs = model(**inputs)
  25. OWL-ViT

      # Target image sizes (height, width) to rescale box predictions [batch_size, 2]
      target_sizes = torch.Tensor([image.size[::-1]])
      # Convert outputs (bounding boxes and class logits) to COCO API
      results = processor.post_process(outputs=outputs, target_sizes=target_sizes)

      i = 0  # Retrieve predictions for the first image for the corresponding text queries
      text = texts[i]
      boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]

      score_threshold = 0.1
      for box, score, label in zip(boxes, scores, labels):
          box = [round(coord, 2) for coord in box.tolist()]
          if score >= score_threshold:
              print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")
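
To visualise the detections, the filtered boxes can also be drawn onto the image with Pillow. A small follow-up sketch that reuses the variables from the snippet above:

    from PIL import ImageDraw

    draw = ImageDraw.Draw(image)
    for box, score, label in zip(boxes, scores, labels):
        if score < score_threshold:
            continue
        xmin, ymin, xmax, ymax = box.tolist()
        draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=2)
        draw.text((xmin, ymin), f"{text[label]} ({score.item():.2f})", fill="red")

    image.save("owlvit_detections.jpg")
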
  26. BLIP-2

      from PIL import Image
      import requests
      from transformers import Blip2Processor, Blip2ForConditionalGeneration
      import torch

      device = "cuda" if torch.cuda.is_available() else "cpu"

      processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
      model = Blip2ForConditionalGeneration.from_pretrained(
          "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
      )
      model.to(device)

      url = "http://images.cocodataset.org/val2017/000000039769.jpg"
      image = Image.open(requests.get(url, stream=True).raw)

      inputs = processor(images=image, return_tensors="pt").to(device, torch.float16)

      generated_ids = model.generate(**inputs)
      generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
      print(generated_text)
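
The interrogation dialogues on slides 19-21 use BLIP-2 for prompted question answering rather than plain captioning. A minimal sketch of that mode, reusing the processor, model, device and image from the snippet above and the "Question: ... Answer:" prompt format from the model card:

    question = "Could he be responsible for murders?"
    prompt = f"Question: {question} Answer:"

    inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, torch.float16)
    generated_ids = model.generate(**inputs, max_new_tokens=20)
    answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
    print(answer)
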