Data Engineering in the Large Language Models era by Ismaël Mejía

Data Engineering in the Large Language Models era

The free lunch is over: we 'really' have to deal with unstructured data!
When engineers think about unstructured data, the first idea that comes to mind is usually those pesky legacy files we need to parse and load into some 'good old' structured table somewhere. But recent improvements in Machine Learning and the growing popularity of Large Language Models (LLMs) have opened a Pandora's box of interest and requirements for Data Engineers. Users want to access and analyze data from unstructured sources using natural language processing, and we also need to maintain those unstructured sources.
In this talk we will go into detail about what we need to do to get up to speed with the recent developments. We will talk about processing audio, images and text, about using vector embeddings, and about the requirements for unstructured data pipelines and how we can meet them by relying on Microsoft Fabric, AI services and open-source technologies like Apache Spark and SynapseML.

About Ismaël:

Ismaël Mejía is a Senior Cloud Advocate at Microsoft working on the Azure Data team. He has more than a decade of experience architecting systems for startups and financial companies. He focuses on distributed data and data engineering, and is a contributor to Apache Beam, Apache Avro and many other open-source projects. He is also a member of the Apache Software Foundation (ASF).

LinkedIn: https://www.linkedin.com/in/iemejia/
Twitter: https://twitter.com/iemejia

Azure Zurich User Group

October 04, 2023

Transcript

  1. Data Engineering in the Large Language Models (LLM) era
    Ismaël Mejía
    Senior Cloud Advocate

  2. About me
    • Software/Data Engineer
    • ~10y experience in ‘Big-data’ / cloud systems
    • Real-time (and batch) data at scale
    • Apache Avro and Beam PMC/committer
    • Apache Software Foundation (ASF) member
    ‘An open-source data systems person’, so why do I care about AI/LLMs?

  3. September 28, 2022

  4. November 30, 2022

  5. Artificial Intelligence (1956)
    The field of computer science that seeks to create intelligent machines that can replicate or exceed human intelligence
    Machine Learning (1997)
    Subset of AI that enables machines to learn from existing data and improve upon that data to make decisions or predictions
    Deep Learning (2017)
    A machine learning technique in which layers of neural networks are used to process data and make decisions
    Generative AI (2021)
    Create new written, visual, and auditory content given prompts or existing data

  6. Human parity (timeline): 2016, 2017, 2018, 2019, 2020, 2021, 2021

  7. AI innovation fueled by research
    Research locations: Redmond, WA; Montreal, QC; New York, NY; Boston, MA; Cambridge, UK; India; Beijing, China; Shanghai, China
    Figures shown: global research centers; researchers employed worldwide; AI-related patents; AI research papers published; to human parity on vision, speech, and language

  8. AI is already mainstream
    Top 3 commonly adopted AI use cases and benefits
    1. Intelligent document automation: automate processes and improve operational efficiency
    2. Sales and demand forecasting / inventory management: accelerate time-to-market
    3. Hyper-personalization for up-sell and cross-sell: build digital trust and improve user experiences

  9. Also: an LLM revolution of new products / features
    • Content generation
    • Summarization
    • Semantic search
    • Prompt Engineering
    • Copilots
    • Agents / Assistants

  10. User expectations
    What if I had that?
    • Be able to interact like that,
    have an ‘expert’ assistant
    • Use my own data
    • Integrate my systems / workflow

  11. AI/LLM revolution consequences for data practitioners
    1. Better tools to do our work faster
    2. Make structured and unstructured data accessible
    3. Offer similar features/services to others

  12. Microsoft Analytics Portfolio
    Data Factory Synapse DW Purview Event Hub
    Data Explorer Azure AI Power BI Synapse Spark Azure Databricks

  13. Data analytics for the era of AI

  14. Power BI + Synapse
    Marrying the ease of use of Power BI with the scalability and depth of Synapse
    Data Integration, Data Warehouse, Real Time Analytics, Business Intelligence, Data Science, Data Lake, Governance, Spark Engines

  15. Microsoft Fabric
    Data Integration, Data Engineering, Data Warehousing, Data Science, Real Time Analytics, Business Intelligence
    OneLake
    • Unified analytics platform
    • Lake centric and open
    • Empower every Office user
    • Pervasive security and governance

  16. AI/LLM revolution consequences for data practitioners
    1. Better tools to do our work faster
    2. Make structured and unstructured data accessible
    3. Offer similar features/services to others

  17. Microsoft Fabric
    Data analytics for the era of AI
    • Complete Analytics Platform: Everything, unified; SaaS-ified; Secured and governed
    • Lake Centric and Open: OneLake; One copy; Open at every tier
    • Empower Every Business User: Familiar and intuitive; Built into Microsoft 365; Insight to action
    • AI Powered: Copilot accelerated; GPT on your data; AI-driven insights

  18. Microsoft Fabric
    Data analytics for the era of AI
    • Complete Analytics Platform: Everything, unified; SaaS-ified; Secured and governed
    • Lake Centric and Open: OneLake; One copy; Open at every tier
    • Empower Every Business User (highlighted): Familiar and intuitive; Built into Microsoft 365; Insight to action
    • AI Powered: Copilot accelerated; GPT on your data; AI-driven insights

  19. Persona optimized experiences

  20. Microsoft Fabric
    Data analytics for the era of AI
    • Complete Analytics Platform: Everything, unified; SaaS-ified; Secured and governed
    • Lake Centric and Open: OneLake; One copy; Open at every tier
    • Empower Every Business User: Familiar and intuitive; Built into Microsoft 365; Insight to action
    • AI Powered (highlighted): Copilot accelerated; GPT on your data; AI-driven insights

  21. Copilot in Power BI
    • Create beautiful and insightful reports just by chatting with Copilot
    • Define metrics and calculations for your data model just by describing them in natural language
    • Use Copilot to find and summarize insights in your data
    • Stay focused on your business outcomes and unlock insights in your data with Copilot

  22. Copilot in Power BI

  23. Copilot in Notebooks
    • Use Copilot to enrich, model, analyze and explore your data in notebooks
    • Work with Copilot to understand how best to analyze your data
    • Chat with Copilot to create and configure ML models
    • Write code faster with inline code suggestions from Copilot
    • Use Copilot to summarize and explain code to understand how it works

  24. Copilot in Notebooks

  25. Microsoft Fabric
    Data analytics for the era of AI
    • Complete Analytics Platform: Everything, unified; SaaS-ified; Secured and governed
    • Lake Centric and Open: OneLake; One copy; Open at every tier
    • Empower Every Business User: Familiar and intuitive; Built into Microsoft 365; Insight to action
    • AI Powered: Copilot accelerated; GPT on your data; AI-driven insights

  26. Microsoft Fabric
    Data analytics for the era of AI
    • Complete Analytics Platform: Everything, unified; SaaS-ified; Secured and governed
    • Lake Centric and Open (highlighted): OneLake; One copy; Open at every tier
    • Empower Every Business User: Familiar and intuitive; Built into Microsoft 365; Insight to action
    • AI Powered: Copilot accelerated; GPT on your data; AI-driven insights

  27. OneLake for all Data
    “The OneDrive for Data”
    • A single SaaS lake for the whole organization
    • Provisioned automatically with the tenant
    • All workloads automatically store their data in the OneLake workspace folders
    • All the data is organized in an intuitive hierarchical namespace
    • The data in OneLake is automatically indexed for discovery, MIP labels, lineage, PII scans, sharing, governance and compliance

  28. One Copy for all computes
    Real separation of compute and storage
    • All the compute engines store their data automatically in OneLake
    • The data is stored in a single common format
    • Delta – Parquet, an open standards format, is the storage format for all tabular data in Analytics vNext
    • Once data is stored in the lake, it is directly accessible by all the engines without needing any import/export
    • All the compute engines have been fully optimized to work with Delta Parquet as their native format
    • Shared universal security model is enforced across all the engines
    Diagram: Serverless Compute (T-SQL, Spark, KQL, Analysis Services) over data in Delta – Parquet Format (Customers 360, Finance, Service Telemetry, Business KPIs)

  29. Taking One Copy to the Next Level
    Shortcuts
    • Sharing data in OneLake is as easy as sharing files in OneDrive, removing the need for data duplication
    • With shortcuts, data throughout OneLake can be composed together without any data movement
    • Shortcuts also allow instant linking of data already existing in Azure and in other clouds, without any data duplication and movement, making OneLake a multi-cloud data lake
    • With support for industry standard APIs, OneLake data can be directly accessed by any application or service
    Diagram: Customers 360, Finance, Service Telemetry, Business KPIs; Amazon, Google, Azure

  30. that different

  31. Making Unstructured data accessible
    • Query/Index logs to extract information
    e.g. Observability
    • Just put everything in some database; we will care about it later
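
The "query/index logs to extract information" idea on this slide can be sketched in a few lines. The log layout, field names and sample lines below are assumptions made for illustration only; they do not come from the talk.

```python
import re
from collections import Counter

# Hypothetical log layout: "<timestamp> <LEVEL> <service>: <message>"
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<service>[\w-]+):\s+(?P<message>.*)$"
)

def parse_log_line(line):
    """Return a dict of named fields, or None when the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None

# Made-up sample lines, including one that has no recognizable structure.
sample_lines = [
    "2023-10-04T18:00:01Z INFO ingest: loaded 120 files",
    "2023-10-04T18:00:05Z ERROR ingest: failed to parse report.pdf",
    "not a structured line",
]

# Keep only the lines that could be turned into structured records.
records = [r for r in (parse_log_line(l) for l in sample_lines) if r is not None]

# A first "index": how many records were seen per log level.
level_counts = Counter(r["level"] for r in records)
```

Lines that do not match are dropped here; a real pipeline would more likely route them to a side output so that nothing is lost silently.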

  32. Does all data have some hidden structure?
    • File Format Metadata (e.g. headers in images)
    • Automatic structure ‘extraction’ - Parsing
    • Look for structure conceptually - Features
    • Manual structure (aka Human labeling)
    LLMs
    • Represent data in a model space (aka embeddings)
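
As a small illustration of the "file format metadata" bullet, the sketch below reads the width and height stored in the header of a PNG file. The function name and the hand-built header bytes are illustrative assumptions; the byte layout follows the PNG format (8-byte signature, then an IHDR chunk that starts with big-endian width and height).

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def read_png_size(data):
    """Return (width, height) taken from the IHDR chunk of PNG bytes."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    # After the signature: 4-byte chunk length, 4-byte chunk type.
    _length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR":
        raise ValueError("IHDR chunk missing")
    # IHDR data begins with two big-endian 32-bit integers: width, height.
    width, height = struct.unpack(">II", data[16:24])
    return width, height

# Hand-built header bytes for illustration only (not a complete image):
# signature + IHDR length (13) + chunk type + width 640 + height 480.
fake_header = PNG_SIGNATURE + struct.pack(">I4sII", 13, b"IHDR", 640, 480)
size = read_png_size(fake_header)
```

Even before any model is involved, this kind of header already gives a pipeline structured columns (format, width, height) to filter and partition on.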

  33. Semantic Search & Power Of Embeddings
    Image / Audio / Text → Embedding Model → [0.027, -0.001, 0.002, …, 0.011]

  34. Cosine Similarity
    Cosine similarity - Wikipedia

  35. Cosine Similarity
    Cosine similarity - Wikipedia
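
Cosine similarity compares two embedding vectors by the angle between them: the dot product divided by the product of the vector lengths. A minimal sketch in plain Python follows; the function name is an illustrative choice, not something shown on the slides.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
```

The result ranges from -1 (opposite direction) through 0 (orthogonal) to 1 (same direction). Vector length does not matter, which is why it is a common choice for comparing embeddings.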

  36. Vector Databases
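
At its core, a vector database stores embeddings under a key and answers "which stored vectors are closest to this query?". The toy index below does that with an exhaustive scan over cosine similarity. It is a sketch under stated assumptions: the class name, keys and three-dimensional vectors are made up, and production systems typically use approximate nearest-neighbour indexes rather than scanning every item.

```python
import math

class TinyVectorIndex:
    """In-memory index: stores unit-length vectors and scans them all on search."""

    def __init__(self):
        self._items = []  # list of (key, normalized vector)

    @staticmethod
    def _normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        if norm == 0:
            raise ValueError("cannot index or query a zero vector")
        return [x / norm for x in vector]

    def add(self, key, vector):
        self._items.append((key, self._normalize(vector)))

    def search(self, query, k=3):
        """Return the k most similar (key, score) pairs, best first."""
        q = self._normalize(query)
        # For unit-length vectors the dot product equals the cosine similarity.
        scored = [(sum(a * b for a, b in zip(q, vec)), key) for key, vec in self._items]
        scored.sort(reverse=True)
        return [(key, score) for score, key in scored[:k]]

index = TinyVectorIndex()
index.add("invoice.pdf", [0.9, 0.1, 0.0])   # made-up embeddings
index.add("meeting.wav", [0.0, 0.2, 0.9])
index.add("receipt.png", [0.8, 0.3, 0.1])
results = index.search([1.0, 0.2, 0.0], k=2)
```

Normalizing once at insert time keeps each query to one dot product per stored item.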

  37. OpenAI Fine-Tuning API
    Fine-tuning - OpenAI API
    {"messages": [
    {"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."},
    {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"},
    {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]
    }
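
Training examples in this chat format are usually collected in a JSON Lines file, one JSON object per line. The sketch below serializes the example from the slide that way using only the standard library. The validation rules (allowed roles, at least one assistant message) and the function names are assumptions for illustration, not the OpenAI API's own checks.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}  # assumption: chat roles used here

def validate_example(example):
    """Raise ValueError if the example does not look like a chat training record."""
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError("example needs a non-empty 'messages' list")
    for message in messages:
        if message.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"unexpected role: {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            raise ValueError("every message needs a string 'content'")
    if not any(m["role"] == "assistant" for m in messages):
        raise ValueError("example has no assistant message to learn from")

def to_jsonl(examples):
    """Serialize validated examples as JSON Lines text (one object per line)."""
    lines = []
    for example in examples:
        validate_example(example)
        lines.append(json.dumps(example, ensure_ascii=False))
    return "\n".join(lines) + "\n"

examples = [
    {"messages": [
        {"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."},
        {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"},
        {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"},
    ]}
]
jsonl_text = to_jsonl(examples)
```

Checking records before upload is cheap compared with a failed or low-quality fine-tuning run.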

  38. Turing: Rich language understanding
    Z-Code: 100 languages translation
    Florence: Breakthrough visual recognition
    OpenAI: GPT-3/GPT-4 (human-like language generation), DALL-E (realistic image generation), Codex (advanced code generation), ChatGPT (conversation generation)
    Azure AI services: Vision, Speech, Language, Decision, OpenAI Service, Cognitive Search, Form Recognizer, Immersive Reader, Bot Service, Video Analyzer
    • Better search and Q&A
    • Better customer engagement and support
    • Better matching and content moderation
    • Better email management and meeting preparation
    • Better knowledge management
    • Better meeting management
    • Better reading and writing assistance
    • Better content moderation

  39. Data Science in Microsoft Fabric
    End-to-end data science for predictive business insights
    Developer friendly
    • SaaS experiences with quick setup
    • Starter pools with fast cluster startup
    • Code authoring experiences in Notebooks and IDE
    • VSCode integration
    • Git integration (CI/CD)
    Data Centric
    • Easy and secure access to lake centric data
    • Open Delta Lake support promotes reproducibility
    • Native integration with data infrastructure
    Secure collaboration
    • Unified platform for all analytics roles incl. data scientists
    • Secure and easy sharing of data, code, models and experiments
    Rich ML tools
    • Supports MLflow model and experiment management
    • MLflow Autologging
    • Large set of built-in, scalable ML tools with SynapseML library
    • Serve predictions swiftly to Power BI with Direct Lake mode

  40. The Data Science Process in Fabric
    Stages: Problem formulation/ideation; Data discovery and pre-processing; Experiment & Model; Enrich & Operationalize; Insight
    Steps: Explore, Prepare, Model, Evaluate
    Tools: Data Wrangler, SynapseML, Batch PREDICT, Direct Lake

  41. Models and Experiments with MLflow
    • Built-in model & experiment tracking enables data scientists to track and compare their different experiment runs and model versions.
    • Automatically capture model metrics & parameters with built-in support for MLflow auto-logging
    • Users can create and manage model artifacts in Trident
    • MLflow compatible. Model registry is powered by AzureML

  42. SynapseML and Microsoft AI services
    Distributed ML
    Model Training
    MLflow
    support
    Cognitive
    Services
    OpenAI
    LLMs

  43. SynapseML: Cognitive Services Built-in to Trident (Private Preview)
    Service Name: API Type
    Vision: OCR, Analyze Image, Recognize Text, Read Image, Recognize Domain Specific Content, Generate Thumbnails, Tag Image, Describe Image
    Form Recognizer: Analyze Layout, Analyze Receipts, Analyze Business Cards, Analyze Invoices, Analyze ID Documents, Analyze Custom Model, Analyze Documents, List Custom Models
    Bing: Bing Image Search
    Face: Detect Face, Find Similar Face, Group Faces, Identify Faces, Verify Faces
    Speech: Speech to Text, Conversation Transcription, Text to Speech, Emotion Recognition
    OpenAI: Completion*, Chat*, Embeddings*
    Text: Entity Detector*, Key Phrase Extractor*, Language Detector*, PII*, Sentiment*, Healthcare, Analyze Text
    Translation: Translate*, Transliterate*, Detect Language*, Break Sentence*, Dictionary Lookup*, Dictionary Examples*, Document Translator
    Azure Search: Add Documents
    Anomaly Detection: Detect Last Anomaly*, Detect Anomalies*, Simple Detect Anomalies, Custom Detection, Multivariate Detection
    Azure Maps: Address Geocoding, Reverse Geocoding, Check Point in Polygon
    *Supported in Trident built-in endpoint at //build

  44. AI Plug-ins for your data
    • Create AI plug-ins to deliver custom generative AI experiences for your data
    • Enable custom Q&A on your data in Fabric
    • Define custom business semantics and grounding unique to your organization
    • Deploy plug-ins to work seamlessly with Copilot in Business Chat

  45. Want to learn more?
    microsoft.com/fabric
    Microsoft Fabric

  46. aka.ms/learnlive-get-started-microsoft-fabric
    Date Title
    August 29, 2023 Get started with end-to-end analytics and Lakehouses in Microsoft Fabric
    September 5, 2023 Use Apache Spark in Microsoft Fabric
    September 12, 2023 Work with Delta Lake tables in Microsoft Fabric
    September 19, 2023 Use Data Factory pipelines in Microsoft Fabric
    September 26, 2023 Ingest Data with Dataflows Gen2 in Microsoft Fabric
    October 3, 2023 Get started with data warehouses in Microsoft Fabric
    October 10, 2023 Get started with Real-Time Analytics in Microsoft
    October 17, 2023 Get started with data science in Microsoft Fabric
    October 24, 2023 Administer Microsoft Fabric

  47. aka.ms/fabric-csc
    Compete: Benchmark your progress against friends and coworkers. It's always better when we learn together.
    Learn: Increase your understanding with easy-to-read instruction and stay up on the bleeding edge of technology.
    Develop skills: By the end of the challenge, you will have marketable skills to better yourself and your career.

  48. Microsoft Fabric Community Resources
    ✓ Try Microsoft Fabric for free: https://aka.ms/try-fabric
    ✓ Join the Fabric community: https://aka.ms/fabriccommunity
    ✓ Share and vote for ideas to improve Fabric: https://aka.ms/fabricideas
    ✓ Read and comment on our blog: https://aka.ms/fabricblog
    ▪ Product announcement: https://aka.ms/fabric
    ▪ Digital Event at Build (videos): https://aka.ms/build-with-analytics
    ▪ Product website: https://aka.ms/microsoft-fabric
    ▪ Documentation: https://aka.ms/fabric-docs
    ▪ Fabric e-book: https://aka.ms/fabric-get-started-ebook
    ▪ Microsoft Learn: https://aka.ms/learn-fabric
    ▪ End-to-end scenario tutorials: https://aka.ms/fabric-tutorials
    ▪ Fabric Notes: https://aka.ms/fabric-notes

  49. What’s coming next?

  50. Roadmap
    Stages: Problem formulation/ideation; Data discovery and pre-processing; Experiment & Model; Enrich & Operationalize; Insight
    • Tagging support
    • Hyperparameter tuning
    • AutoML (FLAML)
    • Model interpretability
    • DNN training
    • Built-in pre-trained AI models (Cognitive Services)
    • Data Wrangler on Spark
    • Explore Power BI datasets from Notebooks (Semantic Link)
    • CI/CD
    • ALM support for ML models
    • Trident SDK for ML
    • Improved model batch scoring with containerization
    • Model endpoints
    • Trident ML model support in PBI dataflows
    • Monitoring of models
    • Feature store
    • Integration with Power BI Metrics
    • Open Notebook from BI report visual (Semantic Link)
    Copilot experiences and Azure OpenAI Integration

  51. AutoML with FLAML (Private Preview)
    • Automate the process of building machine learning models with FLAML
    • Code-first integration to parallelize AutoML trials with Spark
    • Run AutoML
    • Integrated with MLflow to automatically capture runs & metrics
    Microsoft Confidential: Content is shared under NDA

  52. Extra slides

  53. Microsoft Fabric simplicity
    Microsoft Fabric is a unified product for all your data and analytics workloads. Rather than provisioning and managing separate compute for each workload, with Microsoft Fabric, your bill is determined by two variables: the amount of compute you provision and the amount of storage you use.
    COMPUTE: A shared pool of capacity that powers all capabilities in Microsoft Fabric, from data modeling and data warehousing to business intelligence. Pay-as-you-go (per-second billing with a one-minute minimum).
    STORAGE: A single place to store all data. Pay-as-you-go ($ per GB / month).
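
The two billing variables on this slide can be turned into a small back-of-the-envelope estimate. This is a sketch only: the prices in the usage lines are placeholder numbers rather than real Fabric rates, the function names are hypothetical, and rounding partial seconds up and charging nothing for zero usage are assumptions.

```python
import math

def billed_compute_seconds(run_seconds, minimum_seconds=60):
    """Seconds billed for one capacity run: per-second billing with a minimum."""
    if run_seconds <= 0:
        return 0  # assumption: no usage, no charge
    # Assumption: partial seconds are rounded up before the minimum is applied.
    return max(math.ceil(run_seconds), minimum_seconds)

def estimate_bill(run_seconds, price_per_compute_hour, stored_gb, price_per_gb_month):
    """Compute cost for one run plus one month of storage, in the same currency."""
    compute_cost = billed_compute_seconds(run_seconds) * price_per_compute_hour / 3600
    storage_cost = stored_gb * price_per_gb_month
    return compute_cost + storage_cost

# Placeholder prices, chosen only to make the arithmetic easy to follow.
short_run = billed_compute_seconds(20)      # the one-minute minimum applies
long_run = billed_compute_seconds(90.2)     # rounded up to whole seconds
monthly_total = estimate_bill(3600, 2.0, 100, 0.02)
```

With these placeholder numbers, one hour of compute at 2.0 per hour plus 100 GB at 0.02 per GB-month comes to 4.0 in total.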

  54. Pricing
    Pay as you go

  55. Azure OpenAI + Cognitive Search

  56. Azure OpenAI + Plug-in
    Introduction - OpenAI API
    Diagram: OpenAI API; Plug-Ins; App Orchestrator; API; Data Source; Others
