
apidays Paris 2024 - Embeddings: Core Concepts for Developers, Jocelyn Matthews, Pinecone

Embeddings: Core Concepts for Developers
Jocelyn Matthews, DevRel, Head of Developer Community at Pinecone

apidays Paris 2024 - The Future API Stack for Mass Innovation
December 3 - 5, 2024

------

Check out our conferences at https://www.apidays.global/

Do you want to sponsor or talk at one of our conferences?
https://apidays.typeform.com/to/ILJeAaV8

Learn more on APIscene, the global media made by the community for the community:
https://www.apiscene.io

Explore the API ecosystem with the API Landscape:
https://apilandscape.apiscene.io/

Transcript

  1. What are embeddings? Embeddings are numerical representations that capture the essential features and relationships of discrete objects, like words or documents, in a continuous vector space.
  2. Embeddings:
     • Are dynamic and context-sensitive
     • Capture the essence of the data they represent
     • Are influenced by the context in which they are used
     • Are adaptable, which makes them powerful
     Humans think in sensations, words, and ideas. Computers think in numbers.
  3. You don't need to memorize this now.
     Vector: a list of numbers that tells us about something.
     Vector space: an environment in which vectors exist.
     Semantics: the study of meaning communicated through language.
  4. Vectors. A vector is a mathematical structure with a size and a direction. For example, we can think of a vector as a point in space, with the "direction" being an arrow from (0,0,0) to that point in the vector space.
  5. Vectors. As developers, it might be easier to think of a vector as an array containing numerical values. For example: vector = [0, -2, ... 4]
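     To make slide 4's "size and direction" concrete in code, here is a minimal NumPy sketch; the values are a toy stand-in, not from the talk:

     import numpy as np

     vector = np.array([0.0, -2.0, 4.0])  # toy 3-D vector for illustration

     # Magnitude ("size"): the Euclidean length of the vector
     magnitude = np.linalg.norm(vector)   # ~4.47

     # Direction: the unit vector pointing the same way
     direction = vector / magnitude       # [0.0, -0.447, 0.894]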
  6. Vectors. When we look at a bunch of vectors in one space, we can say that some are closer to one another, while others are far apart. Some vectors can seem to cluster together, while others could be sparsely distributed in the space.
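     "Closer" and "farther apart" are usually measured with cosine similarity. A minimal sketch with invented toy vectors (real embeddings would come from a model):

     import numpy as np

     def cosine_similarity(a, b):
         # 1.0 = same direction, 0.0 = unrelated, -1.0 = opposite
         return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

     coffee = np.array([0.9, 0.1, 0.0])   # toy vectors, invented for illustration
     tea    = np.array([0.8, 0.2, 0.1])
     galaxy = np.array([0.0, 0.1, 0.9])

     print(cosine_similarity(coffee, tea))     # high: these vectors cluster together
     print(cosine_similarity(coffee, galaxy))  # low: these vectors are far apart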
  7. An example you can bank on:
     🏦 Where is the Bank of England?
     🌱 Where is the grassy bank?
     🛩️ How does a plane bank?
     🐝 "the bees decided to have a mutiny against their queen"
     🐝 "flying stinging insects rebelled in opposition to the matriarch"
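     The two bee sentences share almost no vocabulary, yet a sentence-embedding model places them close together. A sketch using the sentence-transformers library; the model name is a common default, not necessarily the one used in the talk:

     from sentence_transformers import SentenceTransformer, util

     model = SentenceTransformer("all-MiniLM-L6-v2")

     a = "the bees decided to have a mutiny against their queen"
     b = "flying stinging insects rebelled in opposition to the matriarch"

     emb_a, emb_b = model.encode([a, b])

     # High cosine similarity despite almost no shared words
     print(util.cos_sim(emb_a, emb_b))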
  8. Word arithmetic: king – man + woman = queen. Image: Peter Sutor, "Metaconcepts: Isolating Context in Word Embeddings"
  9. Word arithmetic: king – man + woman = queen. "Distributed Representations of Words and Phrases and their Compositionality" (Mikolov et al., 2013)
  10. Word arithmetic: king – man + woman = queen. "Adding the vectors associated with the words king and woman while subtracting man is equal to the vector associated with queen. This describes a gender relationship." – MIT Technology Review, 2015
  11. Word arithmetic: Paris – France + Poland = Warsaw. "In this case, the vector difference between Paris and France captures the concept of capital city." – MIT Technology Review, 2015
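     Both analogies can be reproduced with classic pretrained word vectors. A sketch using gensim's downloader and GloVe embeddings (an assumed setup; whether "queen" and "warsaw" come out on top depends on the model you load):

     import gensim.downloader as api

     # Pretrained 100-dimensional GloVe vectors (a sizable one-time download)
     model = api.load("glove-wiki-gigaword-100")

     # king – man + woman ≈ queen
     print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

     # Paris – France + Poland ≈ Warsaw (this model lowercases its vocabulary)
     print(model.most_similar(positive=["paris", "poland"], negative=["france"], topn=1))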
  12. Together and apart: Coffee, Hospital, Music, Restaurant, School, Cup, Caffeine, Morning, Galaxy, Dinosaur, Doctor, Patient, Surgery, Volcano, Unicorn, Song, Melody, Instrument, Asteroid, Bacteria, Food, Menu, Waiter, Nebula, Dragon, Teacher, Classroom, Student, Volcano, Spaceship, Exam
  13. Dimensionality! (same word cloud as slide 12)
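     Real embeddings live in hundreds of dimensions, so clusters like the word cloud above are usually visualized by projecting down to 2-D. A sketch with scikit-learn's PCA over the same GloVe vectors as before (assumed setup, not the talk's own code):

     import gensim.downloader as api
     from sklearn.decomposition import PCA

     model = api.load("glove-wiki-gigaword-100")

     words = ["coffee", "cup", "caffeine", "doctor", "patient", "surgery",
              "song", "melody", "instrument", "galaxy", "asteroid", "nebula"]

     # Stack the 100-D word vectors and project them down to two dimensions
     points = PCA(n_components=2).fit_transform([model[w] for w in words])

     for word, (x, y) in zip(words, points):
         print(f"{word:12s} {x:+.2f} {y:+.2f}")  # related words land near each other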
  14.–17. What's The Fallacy? Why "Green : Blue :: Orange : Red" is imperfect as a teaching tool (one slide, repeated four times as a progressive build):
     • Simplicity of relationships: linear vs. nuanced
     • Lack of context: how are the words used?
     • Dimensionality: 3D vs. hundreds of dimensions
     • Oversimplification
  18.–19. Check the vectors. The distance between red and orange is strikingly similar to the distance between blue and green. But when we tested this to verify it, we got interesting results that reveal the model's semantic "understanding" of the relationship:

     # Find the term whose distance and direction from "red" best match
     # the distance and direction from "green" to "blue"
     target_distance = distance_green_blue
     target_direction = direction_green_blue

     # Define a list of terms to compare
     terms = ["red", "orange", "yellow", "green", "blue",
              "purple", "pink", "black", "white", "gray"]

     # Get the embedding for each term
     term_embeddings = {term: get_embedding(term) for term in terms}

     # Find the term with the closest distance and same direction
     # relative to the target distance and direction
     closest_term = None
     closest_distance = float('inf')
     start_term = "red"
     start_embedding = get_embedding(start_term)

     for term, embedding in term_embeddings.items():
         if term == start_term:
             continue
         distance, direction = cosine_distance_and_direction(start_embedding, embedding)
         if direction == target_direction and abs(distance - target_distance) < closest_distance:
             closest_distance = abs(distance - target_distance)
             closest_term = term

     closest_term, closest_distance

     This yields ('purple', np.float64(0.006596347059928065)): purple, not orange.
  20. Why not 'orange'? The result ('purple', np.float64(0.006596347059928065)) suggests that, in this model's embedding space, "purple" sits at almost exactly the same cosine distance and direction from "red" as "blue" does from "green", while "orange" does not. In other words, the model judged "red" and "purple" to be the closer semantic pair, most likely because of the specific contexts and relationships it captured during training. (Same code as slides 18–19.)
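     The snippet above leans on two helpers the slides never define. One plausible reconstruction (an assumption, not the speaker's actual code) is that get_embedding wraps an embedding API and cosine_distance_and_direction returns a cosine distance plus a coarse direction signal:

     import numpy as np
     from openai import OpenAI  # assumption: any embedding API would work here

     client = OpenAI()

     def get_embedding(text, model="text-embedding-3-small"):
         # Hypothetical helper: fetch one embedding vector for `text`
         response = client.embeddings.create(model=model, input=text)
         return np.array(response.data[0].embedding)

     def cosine_distance_and_direction(a, b):
         # Cosine distance: 1 minus cosine similarity
         distance = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
         # Coarse "direction": the sign of the summed difference vector.
         # This is a guess at what the slides meant; the talk never defines it.
         direction = np.sign(np.sum(b - a))
         return distance, direction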
  21. What are embeddings? Embeddings are numerical representations that capture the essential features and relationships of discrete objects, like words or documents, in a continuous vector space.
  22. The most important thing to understand: embeddings are numerical representations of data that capture semantic meaning and allow for efficient comparison of similarity.
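     In practice, "efficient comparison" often means one vectorized operation over many embeddings at once. A minimal NumPy sketch of brute-force top-k similarity search over random stand-in embeddings (production systems use a vector index instead):

     import numpy as np

     rng = np.random.default_rng(0)
     corpus = rng.normal(size=(10_000, 384))  # 10k fake document embeddings
     corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

     query = rng.normal(size=384)
     query /= np.linalg.norm(query)

     scores = corpus @ query                  # cosine similarities in one matmul
     top_k = np.argsort(scores)[::-1][:5]     # indices of the 5 nearest documents
     print(top_k, scores[top_k])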
  23. Key points about embeddings:
     1. They can represent various data types, not just text.
     2. Dimensionality matters: the number of values in each vector shapes what it can capture.
     3. Context sensitivity affects interpretation and application.
  24. Applications of embeddings include:
     - Semantic search
     - Question-answering applications
     - Image search
     - Audio search
     - Recommender systems
     - Anomaly detection
     "Generate your own embeddings" (Inference API)
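     The "Generate your own embeddings" pointer refers to Pinecone's Inference API. A sketch of what that looks like in the Pinecone Python SDK (model name and parameters are illustrative; check the current docs for the exact surface):

     from pinecone import Pinecone

     pc = Pinecone(api_key="YOUR_API_KEY")

     # Generate embeddings server-side via the Inference API
     embeddings = pc.inference.embed(
         model="multilingual-e5-large",
         inputs=["Where is the grassy bank?", "Where is the Bank of England?"],
         parameters={"input_type": "passage"},
     )

     for e in embeddings:
         print(len(e.values))  # dimensionality of each returned vector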
  25. © 2024 Pinecone – All rights reserved
     1. Questions? #hallwaytrack
     2. Recording? YouTube!
     3. Slides? Ask me