Data Engineering in the Large Language Models era by Ismaël Mejía

The free lunch is over: we 'really' have to deal with unstructured data!
When engineers think about unstructured data, the first thing that comes to mind is usually those pesky legacy files that we need to parse and load into some 'good old' structured table somewhere. But recent advances in Machine Learning and the growing popularity of Large Language Models (LLMs) have opened a Pandora's box of interest and requirements for Data Engineers. Users want to access and analyze data from unstructured data sources using natural language processing, and we are expected to maintain those unstructured sources as well.
In this talk we will go into detail about what we need to do to get up to speed with these developments. We will talk about processing audio, images and text, about using vector embeddings, and about the requirements for unstructured data pipelines and how we can meet them by relying on Microsoft Fabric, AI services and open-source technologies like Apache Spark and SynapseML.

About Ismaël:

Ismaël Mejía is a Senior Cloud Advocate at Microsoft working on the Azure Data team. He has more than a decade of experience architecting systems for startups and financial companies. He has focused on distributed data and data engineering, and he is a contributor to Apache Beam, Apache Avro and many other open-source projects. He is also a member of the Apache Software Foundation (ASF).

LinkedIn: https://www.linkedin.com/in/iemejia/
Twitter: https://twitter.com/iemejia

Azure Zurich User Group

October 04, 2023

Transcript

  1. About me • Software/Data Engineer • ~10y experience in ‘Big-data’ / cloud systems • Real-time (and batch) data at scale • Apache Avro and Beam PMC/committer • Apache Software Foundation (ASF) member • ‘An open-source data systems person’, so why do I care about AI/LLMs?
  2. Artificial Intelligence • 1956, Artificial Intelligence: the field of computer science that seeks to create intelligent machines that can replicate or exceed human intelligence • 1997, Machine Learning: subset of AI that enables machines to learn from existing data and improve upon that data to make decisions or predictions • 2017, Deep Learning: a machine learning technique in which layers of neural networks are used to process data and make decisions • 2021, Generative AI: create new written, visual, and auditory content given prompts or existing data
  3. Human parity milestones: 2016, 2017, 2018, 2019, 2020, 2021, 2021
  4. AI innovation fueled by research • Global research centers: Redmond WA, Montreal QB, New York NY, Boston MA, Cambridge UK, India, Beijing China, Shanghai China • Researchers employed worldwide • AI-related patents • AI research papers published • To human parity on vision, speech, and language
  5. AI is already mainstream: top 3 commonly adopted AI use cases and benefits • 1. Intelligent document automation: automate processes and improve operational efficiency • 2. Sales and Demand Forecasting / Inventory Management: accelerate time-to-market • 3. Hyper-personalization for up-sell and cross-sell: build digital trust and improve user experiences
  6. Also LLM revolution of new products / features • Content generation • Summarization • Semantic search • Prompt Engineering • Copilots • Agents / Assistants
  7. User expectations: What if I have that! • Be able to interact like that, have an ‘expert’ assistant • Use my own data • Integrate my systems / workflow
  8. AI/LLM revolution consequences for data practitioners • 1. Better tools to do our work faster • 2. Make structured and unstructured data accessible • 3. Offer similar features/services to others
  9. Microsoft Analytics Portfolio • Data Factory • Synapse DW • Purview • Event Hub • Data Explorer • Azure AI • Power BI • Synapse Spark • Azure Databricks
  10. Power BI + Synapse: Marrying the ease of use of Power BI with the scalability and depth of Synapse • Data Integration • Data Warehouse • Real Time Analytics • Business Intelligence • Data Science • Data Lake • Governance • Spark Engines
  11. Microsoft Fabric: Unified analytics platform • Data Integration • Data Engineering • Data Warehousing • Data Science • Real Time Analytics • Business Intelligence • OneLake • Lake centric and open • Empower every Office user • Pervasive security and governance
  12. AI/LLM revolution consequences for data practitioners • 1. Better tools to do our work faster • 2. Make structured and unstructured data accessible • 3. Offer similar features/services to others
  13. Microsoft Fabric: Data analytics for the era of AI • Complete Analytics Platform: Everything, unified; SaaS-ified; Secured and governed • Lake Centric and Open: OneLake; One copy; Open at every tier • Empower Every Business User: Familiar and intuitive; Built into Microsoft 365; Insight to action • AI Powered: Copilot accelerated; GPT on your data; AI-driven insights
  14. Complete Analytics Platform: Everything, unified; SaaS-ified; Secured and governed • AI Powered: Copilot accelerated; GPT on your data; AI-driven insights • Empower Every Business User: Familiar and intuitive; Built into Microsoft 365; Insight to action • Lake Centric and Open: OneLake; One copy; Open at every tier • Microsoft Fabric: Data analytics for the era of AI
  15. AI Powered: Copilot accelerated; GPT on your data; AI-driven insights • Complete Analytics Platform: Everything, unified; SaaS-ified; Secured and governed • Empower Every Business User: Familiar and intuitive; Built into Microsoft 365; Insight to action • Lake Centric and Open: OneLake; One copy; Open at every tier • Microsoft Fabric: Data analytics for the era of AI
  16. Copilot in Power BI • Create beautiful and insightful reports just by chatting with Copilot • Define metrics and calculations for your data model just by describing them in natural language • Use Copilot to find and summarize insights in your data • Stay focused on your business outcomes and unlock insights in your data with Copilot
  17. Copilot in Notebooks • Use Copilot to enrich, model, analyze and explore your data in notebooks • Work with Copilot to understand how best to analyze your data • Chat with Copilot to create and configure ML models • Write code faster with inline code suggestions from Copilot • Use Copilot to summarize and explain code to understand how it works
  18. Microsoft Fabric: Data analytics for the era of AI • Complete Analytics Platform: Everything, unified; SaaS-ified; Secured and governed • Lake Centric and Open: OneLake; One copy; Open at every tier • Empower Every Business User: Familiar and intuitive; Built into Microsoft 365; Insight to action • AI Powered: Copilot accelerated; GPT on your data; AI-driven insights
  19. Lake Centric and Open: OneLake; One copy; Open at every tier • Complete Analytics Platform: Everything, unified; SaaS-ified; Secured and governed • AI Powered: Copilot accelerated; GPT on your data; AI-driven insights • Empower Every Business User: Familiar and intuitive; Built into Microsoft 365; Insight to action • Microsoft Fabric: Data analytics for the era of AI
  20. OneLake for all Data: “The OneDrive for Data” • A single SaaS lake for the whole organization • Provisioned automatically with the tenant • All workloads automatically store their data in the OneLake workspace folders • All the data is organized in an intuitive hierarchical namespace • The data in OneLake is automatically indexed for discovery, MIP labels, lineage, PII scans, sharing, governance and compliance
  21. One Copy for all computes • Real separation of compute and storage • All the compute engines store their data automatically in OneLake • The data is stored in a single common format • Delta – Parquet, an open standards format, is the storage format for all tabular data in Analytics vNext • Once data is stored in the lake, it is directly accessible by all the engines without needing any import/export • All the compute engines have been fully optimized to work with Delta Parquet as their native format • Shared universal security model is enforced across all the engines • Diagram labels: Serverless Compute; Customers 360, Finance, Service Telemetry, Business KPIs; Delta – Parquet Format; T-SQL, Spark, KQL, Analysis Services
  22. Taking One Copy to the Next Level: Shortcuts • Sharing data in OneLake is as easy as sharing files in OneDrive, removing the need for data duplication • With shortcuts, data throughout OneLake can be composed together without any data movement • Shortcuts also allow instant linking of data already existing in Azure and in other clouds, without any data duplication and movement, making OneLake a multi-cloud data lake • With support for industry standard APIs, OneLake data can be directly accessed by any application or service • Diagram labels: Customers 360, Finance, Service Telemetry, Business KPIs; Amazon, Google, Azure
  23. Making unstructured data accessible <the old days> • Query/Index logs to extract information, e.g. Observability • Just put everything in some database, we will care about it later
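The "query/index logs to extract information" approach on this slide can be illustrated with a minimal sketch. The log layout, the field names and the sample line are assumptions made up for this example; they are not taken from the talk.

```python
import re

# Hypothetical log layout assumed for illustration:
# "<timestamp> <LEVEL> <service>: <free-text message>"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<service>[\w-]+): (?P<message>.*)"
)


def parse_log_line(line):
    """Return the named fields as a dict, or None when the line does not match."""
    match = LOG_PATTERN.fullmatch(line.strip())
    return match.groupdict() if match else None


# Made-up sample line; real logs would need their own pattern.
sample = "2023-10-04T18:30:00Z ERROR checkout-api: payment timeout after 30s"
record = parse_log_line(sample)
print(record)
```

Each parsed record is a flat dictionary, so it can be loaded into a table or an index, while lines that do not fit the pattern come back as None and can be set aside for later inspection.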
  24. Does all data have some hidden structure? • File format metadata (e.g. headers in images) • Automatic structure ‘extraction’ - Parsing • Look for structure conceptually - Features • Manual structure (aka human labeling) • LLMs: represent data in a model space (aka embeddings)
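To make "represent data in a model space (aka embeddings)" concrete, here is a small sketch of nearest-neighbour lookup by cosine similarity. The three-dimensional vectors and file names are toy values invented for this example; real embeddings would come from a model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, assumed for illustration only.
documents = {
    "invoice.pdf": [0.9, 0.1, 0.0],
    "meeting.wav": [0.1, 0.8, 0.3],
    "receipt.png": [0.8, 0.2, 0.1],
}


def nearest(query, docs):
    """Return the name of the document whose vector is most similar to the query."""
    return max(docs, key=lambda name: cosine_similarity(query, docs[name]))


query = [1.0, 0.0, 0.0]
best = nearest(query, documents)
print(best)
```

Once audio, images and text are mapped into the same vector space, one similarity function can compare all of them, which is why embeddings are a natural fit for semantic search over unstructured sources.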
  25. OpenAI Fine-Tuning API (Fine-tuning - OpenAI API) {"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
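The training example on this slide is one record of a chat-style fine-tuning dataset, which is typically stored as JSON Lines (one JSON object per line). The sketch below serializes the slide's conversation into such a line. The helper name `to_jsonl_line` and the restriction to the three roles shown on the slide are assumptions of this example, not part of any official client library.

```python
import json

# Roles that appear in the slide's example; limiting to these is an assumption.
ALLOWED_ROLES = {"system", "user", "assistant"}


def to_jsonl_line(messages):
    """Validate a conversation and serialize it as a single JSON Lines record."""
    for message in messages:
        if message.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"unexpected role: {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            raise ValueError("content must be a string")
    return json.dumps({"messages": messages}, ensure_ascii=False)


example = [
    {"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."},
    {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"},
    {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"},
]

line = to_jsonl_line(example)
print(line)
```

Writing one such line per conversation to a file gives a dataset in the shape the slide shows; validating up front catches malformed records before an upload is attempted.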
  26. Turing: Rich language understanding • Z-Code: 100 languages translation • Florence: Breakthrough visual recognition • OpenAI • GPT-3/GPT-4: Human-like language generation • DALL-E: Realistic image generation • Codex: Advanced code generation • ChatGPT: Conversation generation • Azure AI services: Vision, Speech, Language, Decision, OpenAI Service, Cognitive Search, Form Recognizer, Immersive Reader, Bot Service, Video Analyzer • Better search and Q&A • Better customer engagement and support • Better matching and content moderation • Better email management and meeting preparation • Better knowledge management • Better meeting management • Better reading and writing assistance • Better content moderation
  27. Data Science in Microsoft Fabric: End-to-end data science for predictive business insights • Developer friendly: SaaS experiences with quick setup; Starter pools with fast cluster startup; Code authoring experiences in Notebooks and IDE; VSCode integration; Git integration (CI/CD) • Data Centric: Easy and secure access to lake centric data; Open Delta Lake support promotes reproducibility; Native integration with data infrastructure • Secure collaboration: Unified platform for all analytics roles incl. data scientists; Secure and easy sharing of data, code, models and experiments • Rich ML tools: Supports MLflow model and experiment management; MLflow autologging; Large set of built-in, scalable ML tools with SynapseML library; Serve predictions swiftly to Power BI with Direct Lake mode
  28. The Data Science Process in Fabric • Problem formulation/ideation • Experiment & Model • Enrich & Operationalize • Insight • Data discovery and pre-processing • Prepare • Model • Evaluate • Explore • Data Wrangler • SynapseML • Batch PREDICT • Direct Lake
  29. Models and Experiments with MLflow • Built-in model & experiment tracking enables data scientists to track and compare their different experiment runs and model versions. • Automatically capture model metrics & parameters with built-in support for MLflow auto-logging • Users can create and manage model artifacts in Trident • MLflow compatible. Model registry is powered by AzureML
  30. SynapseML: Cognitive Services Built-in to Trident (Service Name: API Type)
      Vision: OCR, Analyze Image, Recognize Text, Read Image, Recognize Domain Specific Content, Generate Thumbnails, Tag Image, Describe Image
      Form Recognizer: Analyze Layout, Analyze Receipts, Analyze Business Cards, Analyze Invoices, Analyze ID Documents, Analyze Custom Model, Analyze Documents, List Custom Models
      Bing: Bing Image Search
      Face: Detect Face, Find Similar Face, Group Faces, Identify Faces, Verify Faces
      Speech: Speech to Text, Conversation Transcription, Text to Speech
      Emotion Recognition
      OpenAI: Completion*, Chat*, Embeddings*
      Text: Entity Detector*, Key Phrase Extractor*, Language Detector*, PII*, Sentiment*, Healthcare, Analyze Text
      Translation: Translate*, Transliterate*, Detect Language*, Break Sentence*, Dictionary Lookup*, Dictionary Examples*, Document Translator
      Azure Search: Add Documents
      Anomaly Detection: Detect Last Anomaly*, Detect Anomalies*, Simple Detect Anomalies, Custom Detection, Multivariate Detection
      Azure Maps: Address Geocoding, Reverse Geocoding, Check Point in Polygon
      *Supported in Trident built-in endpoint at //build • Private Preview
  31. AI Plug-ins for your data • Create AI plug-ins to deliver custom generative AI experiences for your data • Enable custom Q&A on your data in Fabric • Define custom business semantics and grounding unique to your organization • Deploy plug-ins to work seamlessly with Copilot in Business Chat
  32. aka.ms/learnlive-get-started-microsoft-fabric (Date: Title)
      August 29, 2023: Get started with end-to-end analytics and Lakehouses in Microsoft Fabric
      September 5, 2023: Use Apache Spark in Microsoft Fabric
      September 12, 2023: Work with Delta Lake tables in Microsoft Fabric
      September 19, 2023: Use Data Factory pipelines in Microsoft Fabric
      September 26, 2023: Ingest Data with Dataflows Gen2 in Microsoft Fabric
      October 3, 2023: Get started with data warehouses in Microsoft Fabric
      October 10, 2023: Get started with Real-Time Analytics in Microsoft
      October 17, 2023: Get started with data science in Microsoft Fabric
      October 24, 2023: Administer Microsoft Fabric
  33. aka.ms/fabric-csc • Compete: Benchmark your progress against friends and coworkers. It's always better when we learn together. • Learn: Increase your understanding with easy-to-read instruction and stay up on the bleeding-edge of technology. • Develop skills: By the end of the challenge, you will have marketable skills to better yourself and your career.
  34. Microsoft Fabric Community Resources ✓ Try Microsoft Fabric for free: https://aka.ms/try-fabric ✓ Join the Fabric community: https://aka.ms/fabriccommunity ✓ Share and vote for ideas to improve Fabric: https://aka.ms/fabricideas ✓ Read and comment on our blog: https://aka.ms/fabricblog ▪ Product announcement: https://aka.ms/fabric ▪ Digital Event at Build (videos): https://aka.ms/build-with-analytics ▪ Product website: https://aka.ms/microsoft-fabric ▪ Documentation: https://aka.ms/fabric-docs ▪ Fabric e-book: https://aka.ms/fabric-get-started-ebook ▪ Microsoft Learn: https://aka.ms/learn-fabric ▪ End-to-end scenario tutorials: https://aka.ms/fabric-tutorials ▪ Fabric Notes: https://aka.ms/fabric-notes
  35. Roadmap • Phases: Problem formulation/ideation; Experiment & Model; Enrich & Operationalize; Insight; Data discovery and pre-processing • Tagging support • Hyperparameter tuning • AutoML (FLAML) • Model interpretability • DNN training • Built-in pre-trained AI models (Cognitive Services) • Data Wrangler on Spark • Explore Power BI datasets from Notebooks (Semantic Link) • CI/CD • ALM support for ML models • Trident SDK for ML • Improved model batch scoring with containerization • Model endpoints • Trident ML model support in PBI dataflows • Monitoring of models • Feature store • Integration with Power BI Metrics • Copilot experiences and Azure OpenAI integration • Open Notebook from BI report visual (Semantic Link)
  36. AutoML with FLAML (Private Preview) • Automate the process of building machine learning models with FLAML • Code-first integration to parallelize AutoML trials with Spark • Run AutoML • Integrated with MLflow to automatically capture runs & metrics • Microsoft Confidential: Content is shared under NDA
  37. Microsoft Fabric simplicity • Microsoft Fabric is a unified product for all your data and analytics workloads. Rather than provisioning and managing separate compute for each workload, with Microsoft Fabric, your bill is determined by two variables: the amount of compute you provision and the amount of storage you use. • COMPUTE: A shared pool of capacity that powers all capabilities in Microsoft Fabric, from data modeling and data warehousing to business intelligence. Pay-as-you-go (per sec billing with one minute minimum). • STORAGE: A single place to store all data. Pay-as-you-go ($ per GB / month).
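The two-variable bill described on this slide can be worked through as simple arithmetic. The sketch below is only an illustration: all prices are placeholder numbers, and applying the one-minute minimum to each individual compute run is an assumption of this example, since the slide does not say how the minimum is applied.

```python
def billable_seconds(run_seconds, minimum_seconds=60):
    """Per-second billing with a minimum charge (assumed here to apply per run)."""
    return max(run_seconds, minimum_seconds)


def monthly_bill(run_durations_s, price_per_compute_second, stored_gb, price_per_gb_month):
    """Compute cost plus storage cost; both prices are hypothetical inputs."""
    compute_seconds = sum(billable_seconds(s) for s in run_durations_s)
    compute_cost = compute_seconds * price_per_compute_second
    storage_cost = stored_gb * price_per_gb_month
    return compute_cost + storage_cost


# Placeholder figures, not real Microsoft Fabric prices.
example_total = monthly_bill(
    run_durations_s=[30, 90, 600],   # the 30 s run is billed as 60 s
    price_per_compute_second=0.01,
    stored_gb=100,
    price_per_gb_month=0.02,
)
print(round(example_total, 2))
```

With these placeholder inputs the compute side is billed for 750 seconds rather than the 720 seconds actually used, which shows how a minimum charge matters most for many short runs.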
  38. Azure OpenAI + Plug-in (Introduction - OpenAI API) • OpenAI API • Plug-Ins • App • Orchestrator • API • Data Source • Others