Slide 1

Slide 1 text

Data Engineering in the Large Language Models (LLM) era
Ismaël Mejía, Senior Cloud Advocate

Slide 2

Slide 2 text

About me
• Software/Data Engineer
• ~10y experience in ‘Big-data’ / cloud systems
• Real-time (and batch) data at scale
• Apache Avro and Beam PMC/committer
• Apache Software Foundation (ASF) member
‘An open-source data systems person’, so why do I care about AI/LLMs?

Slide 3

Slide 3 text

September 28, 2022

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

November 30, 2022

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

• 1956, Artificial Intelligence: The field of computer science that seeks to create intelligent machines that can replicate or exceed human intelligence
• 1997, Machine Learning: Subset of AI that enables machines to learn from existing data and improve upon that data to make decisions or predictions
• 2017, Deep Learning: A machine learning technique in which layers of neural networks are used to process data and make decisions
• 2021, Generative AI: Create new written, visual, and auditory content given prompts or existing data

Slide 15

Slide 15 text

 2016  Human parity  2017  Human parity  2018  Human parity  2019  Human parity  2020  Human parity  2021  Human parity  2021  Human parity

Slide 16

Slide 16 text

AI innovation fueled by research
Locations: Redmond WA; Montreal QC; New York NY; Boston MA; Cambridge, UK; India; Beijing, China; Shanghai, China
• Global research centers
• Researchers employed worldwide
• AI-related patents
• AI research papers published
• To human parity on vision, speech, and language

Slide 17

Slide 17 text

AI is already mainstream
Top 3 most commonly adopted AI use cases and their benefits
1. Intelligent document automation: Automate processes and improve operational efficiency
2. Sales and Demand Forecasting / Inventory Management: Accelerate time-to-market
3. Hyper-personalization for up-sell and cross-sell: Build digital trust and improve user experiences

Slide 18

Slide 18 text

The LLM revolution also brings new products / features
• Content generation
• Summarization
• Semantic search
• Prompt Engineering
• Copilots
• Agents / Assistants

Slide 19

Slide 19 text

User expectations: What if I had that too?
• Be able to interact like that, have an ‘expert’ assistant
• Use my own data
• Integrate my systems / workflow

Slide 20

Slide 20 text

AI/LLM revolution consequences for data practitioners
1. Better tools to do our work faster
2. Make structured and unstructured data accessible
3. Offer similar features/services to others

Slide 21

Slide 21 text

Microsoft Analytics Portfolio
• Data Factory
• Synapse DW
• Purview
• Event Hub
• Data Explorer
• Azure AI
• Power BI
• Synapse Spark
• Azure Databricks

Slide 22

Slide 22 text

Data analytics for the era of AI

Slide 23

Slide 23 text

Power BI + Synapse
Marrying the ease of use of Power BI with the scalability and depth of Synapse
• Data Integration
• Data Warehouse
• Real Time Analytics
• Business Intelligence
• Data Science
• Data Lake
• Governance
• Spark Engines

Slide 24

Slide 24 text

Microsoft Fabric
Workloads: Data Integration; Data Engineering; Data Warehousing; Data Science; Real Time Analytics; Business Intelligence; all on OneLake
• Unified analytics platform
• Lake centric and open
• Empower every Office user
• Pervasive security and governance

Slide 25

Slide 25 text

AI/LLM revolution consequences for data practitioners
1. Better tools to do our work faster
2. Make structured and unstructured data accessible
3. Offer similar features/services to others

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

Microsoft Fabric: Data analytics for the era of AI
• Complete Analytics Platform: Everything, unified; SaaS-ified; Secured and governed
• Lake Centric and Open: OneLake; One copy; Open at every tier
• Empower Every Business User: Familiar and intuitive; Built into Microsoft 365; Insight to action
• AI Powered: Copilot accelerated; GPT on your data; AI-driven insights

Slide 28

Slide 28 text

Microsoft Fabric: Data analytics for the era of AI
• Complete Analytics Platform: Everything, unified; SaaS-ified; Secured and governed
• Lake Centric and Open: OneLake; One copy; Open at every tier
• Empower Every Business User: Familiar and intuitive; Built into Microsoft 365; Insight to action
• AI Powered: Copilot accelerated; GPT on your data; AI-driven insights
In focus: Empower Every Business User

Slide 29

Slide 29 text

Persona optimized experiences

Slide 30

Slide 30 text

Microsoft Fabric: Data analytics for the era of AI
• Complete Analytics Platform: Everything, unified; SaaS-ified; Secured and governed
• Lake Centric and Open: OneLake; One copy; Open at every tier
• Empower Every Business User: Familiar and intuitive; Built into Microsoft 365; Insight to action
• AI Powered: Copilot accelerated; GPT on your data; AI-driven insights
In focus: AI Powered

Slide 31

Slide 31 text

Copilot in Power BI
• Create beautiful and insightful reports just by chatting with Copilot
• Define metrics and calculations for your data model just by describing them in natural language
• Use Copilot to find and summarize insights in your data
• Stay focused on your business outcomes and unlock insights in your data with Copilot

Slide 32

Slide 32 text

Copilot in Power BI

Slide 33

Slide 33 text

Copilot in Notebooks
• Use Copilot to enrich, model, analyze and explore your data in notebooks
• Work with Copilot to understand how best to analyze your data
• Chat with Copilot to create and configure ML models
• Write code faster with inline code suggestions from Copilot
• Use Copilot to summarize and explain code to understand how it works

Slide 34

Slide 34 text

Copilot in Notebooks

Slide 35

Slide 35 text

at scale

Slide 36

Slide 36 text

Microsoft Fabric: Data analytics for the era of AI
• Complete Analytics Platform: Everything, unified; SaaS-ified; Secured and governed
• Lake Centric and Open: OneLake; One copy; Open at every tier
• Empower Every Business User: Familiar and intuitive; Built into Microsoft 365; Insight to action
• AI Powered: Copilot accelerated; GPT on your data; AI-driven insights

Slide 37

Slide 37 text

Microsoft Fabric: Data analytics for the era of AI
• Complete Analytics Platform: Everything, unified; SaaS-ified; Secured and governed
• Lake Centric and Open: OneLake; One copy; Open at every tier
• Empower Every Business User: Familiar and intuitive; Built into Microsoft 365; Insight to action
• AI Powered: Copilot accelerated; GPT on your data; AI-driven insights
In focus: Lake Centric and Open

Slide 38

Slide 38 text

OneLake for all Data: “The OneDrive for Data”
• A single SaaS lake for the whole organization
• Provisioned automatically with the tenant
• All workloads automatically store their data in the OneLake workspace folders
• All the data is organized in an intuitive hierarchical namespace
• The data in OneLake is automatically indexed for discovery, MIP labels, lineage, PII scans, sharing, governance and compliance

Slide 39

Slide 39 text

One Copy for all computes
• Real separation of compute and storage
• All the compute engines store their data automatically in OneLake
• The data is stored in a single common format
• Delta – Parquet, an open standards format, is the storage format for all tabular data in Analytics vNext
• Once data is stored in the lake, it is directly accessible by all the engines without needing any import/export
• All the compute engines have been fully optimized to work with Delta Parquet as their native format
• Shared universal security model is enforced across all the engines
Diagram labels: Serverless Compute (T-SQL, Spark, KQL, Analysis Services); Delta – Parquet Format; Customers 360, Finance, Service Telemetry, Business KPIs

Slide 40

Slide 40 text

Taking One Copy to the Next Level: Shortcuts
• Sharing data in OneLake is as easy as sharing files in OneDrive, removing the need for data duplication
• With shortcuts, data throughout OneLake can be composed together without any data movement
• Shortcuts also allow instant linking of data already existing in Azure and in other clouds, without any data duplication or movement, making OneLake a multi-cloud data lake
• With support for industry standard APIs, OneLake data can be directly accessed by any application or service
Diagram labels: Customers 360, Finance, Service Telemetry, Business KPIs; Amazon, Google, Azure

Slide 41

Slide 41 text

that different

Slide 42

Slide 42 text

Making unstructured data accessible
• Query/index logs to extract information, e.g. observability
• Just put everything in some database, we will care about it later

Slide 43

Slide 43 text

Does all data have some hidden structure?
• File format metadata (e.g. headers in images)
• Automatic structure ‘extraction’: parsing
• Look for structure conceptually: features
• Manual structure (aka human labeling)
LLMs
• Represent data in a model space (aka embeddings)

Slide 44

Slide 44 text

Semantic Search & Power Of Embeddings
Image / Audio / Text → Embedding Model → [0.027, -0.001, 0.002, …, 0.011]

Slide 45

Slide 45 text

Cosine Similarity
Source: Cosine similarity - Wikipedia

Slide 46

Slide 46 text

Cosine Similarity
Source: Cosine similarity - Wikipedia
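The slide only names the measure, so here is a minimal sketch of how cosine similarity between two embedding vectors can be computed: the dot product divided by the product of the two vector lengths. The function name `cosine_similarity` and the sample vectors are illustrative assumptions, not taken from the talk.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # The angle is undefined when one vector has no direction.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical 3-dimensional "embeddings"; real ones have hundreds of dimensions.
    print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
    print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))
```

Vectors pointing the same way score close to 1, orthogonal vectors score 0, and opposite vectors score -1, which is why the measure ignores vector magnitude and only compares direction.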

Slide 47

Slide 47 text

Vector Databases
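As a rough illustration of what a vector database does at its core, the sketch below stores vectors under keys and returns the k entries most similar to a query. It is a brute-force scan over all entries, whereas real vector databases typically use approximate nearest-neighbour indexes to scale; the class `TinyVectorIndex` and the sample data are hypothetical.

```python
import heapq
import math


def _cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TinyVectorIndex:
    """Toy in-memory index: exact search by scanning every stored vector."""

    def __init__(self):
        self._items = {}

    def add(self, key, vector):
        self._items[key] = list(vector)

    def search(self, query, k=3):
        """Return up to k (key, score) pairs, most similar first."""
        scored = ((_cosine(query, vec), key) for key, vec in self._items.items())
        return [(key, score) for score, key in heapq.nlargest(k, scored)]


if __name__ == "__main__":
    index = TinyVectorIndex()
    # Hypothetical 2-dimensional embeddings, for illustration only.
    index.add("cat", [1.0, 0.1])
    index.add("dog", [0.9, 0.2])
    index.add("car", [0.0, 1.0])
    print(index.search([1.0, 0.0], k=2))
```

The linear scan costs one similarity computation per stored vector, which is fine for a demo and is the baseline that dedicated vector indexes aim to beat.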

Slide 48

Slide 48 text

OpenAI Fine-Tuning API
Source: Fine-tuning - OpenAI API
{"messages": [
  {"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."},
  {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"},
  {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}
]}
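From a data engineering angle, the interesting part of the example above is preparing and checking the training file. The sketch below serializes chat examples as JSON Lines (one JSON object per line) and runs a few sanity checks before upload. The helper names and the specific checks (allowed roles, at least one assistant message) are assumptions made for illustration, not the API's official validation rules.

```python
import json

# Roles that appear in the slide's example; treated here as the allowed set (assumption).
ALLOWED_ROLES = {"system", "user", "assistant"}

# The training example shown on the slide.
EXAMPLE = {
    "messages": [
        {"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."},
        {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"},
        {"role": "assistant",
         "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"},
    ]
}


def validate_example(line):
    """Parse one JSONL line and check the chat structure (hypothetical checks)."""
    record = json.loads(line)
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        raise ValueError("'messages' must be a non-empty list")
    for msg in messages:
        if not isinstance(msg, dict) or msg.get("role") not in ALLOWED_ROLES:
            raise ValueError("each message needs a role of system, user or assistant")
        if not isinstance(msg.get("content"), str):
            raise ValueError("'content' must be a string")
    if not any(msg["role"] == "assistant" for msg in messages):
        raise ValueError("an example needs an assistant reply to learn from")
    return record


def to_jsonl(examples):
    """Serialize examples as JSON Lines: one compact JSON object per line."""
    return "\n".join(json.dumps(example, ensure_ascii=False) for example in examples)


if __name__ == "__main__":
    for line in to_jsonl([EXAMPLE]).splitlines():
        validate_example(line)
    print("training file looks well-formed")
```

Running checks like these in the pipeline that builds the file catches malformed rows before a fine-tuning job is submitted.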

Slide 49

Slide 49 text

No content

Slide 50

Slide 50 text

• Turing: Rich language understanding
• Z-Code: 100 languages translation
• Florence: Breakthrough visual recognition
• OpenAI
  - GPT-3/GPT-4: Human-like language generation
  - DALL-E: Realistic image generation
  - Codex: Advanced code generation
  - ChatGPT: Conversation generation
• Azure AI services: Vision; Speech; Language; Decision; OpenAI Service; Cognitive Search; Form Recognizer; Immersive Reader; Bot Service; Video Analyzer
• Better search and Q&A
• Better customer engagement and support
• Better matching and content moderation
• Better email management and meeting preparation
• Better knowledge management
• Better meeting management
• Better reading and writing assistance
• Better content moderation

Slide 51

Slide 51 text

No content

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

Data Science in Microsoft Fabric
End-to-end data science for predictive business insights
Developer friendly
• SaaS experiences with quick setup
• Starter pools with fast cluster startup
• Code authoring experiences in Notebooks and IDE
• VSCode integration
• Git integration (CI/CD)
Data Centric
• Easy and secure access to lake centric data
• Open Delta Lake support promotes reproducibility
• Native integration with data infrastructure
Secure collaboration
• Unified platform for all analytics roles incl. data scientists
• Secure and easy sharing of data, code, models and experiments
Rich ML tools
• Supports MLflow model and experiment management
• MLflow autologging
• Large set of built-in, scalable ML tools with SynapseML library
• Serve predictions swiftly to Power BI with Direct Lake mode

Slide 54

Slide 54 text

The Data Science Process in Fabric
Stages: Problem formulation/ideation; Data discovery and pre-processing; Experiment & Model; Enrich & Operationalize; Insight
Steps: Prepare; Model; Evaluate; Explore
Tools: Data Wrangler; SynapseML; Batch PREDICT; Direct Lake

Slide 55

Slide 55 text

Notebooks

Slide 56

Slide 56 text

Models and Experiments with MLflow
• Built-in model & experiment tracking enables data scientists to track and compare their different experiment runs and model versions
• Automatically capture model metrics & parameters with built-in support for MLflow auto-logging
• Users can create and manage model artifacts in Trident
• MLflow-compatible model registry, powered by AzureML

Slide 57

Slide 57 text

SynapseML and Microsoft AI services
• Distributed ML Model Training
• MLflow support
• Cognitive Services
• OpenAI LLMs

Slide 58

Slide 58 text

SynapseML: Cognitive Services Built-in to Trident (Private Preview)
Service Name: API Type
• Vision: OCR; Analyze Image; Recognize Text; Read Image; Recognize Domain Specific Content; Generate Thumbnails; Tag Image; Describe Image
• Form Recognizer: Analyze Layout; Analyze Receipts; Analyze Business Cards; Analyze Invoices; Analyze ID Documents; Analyze Custom Model; Analyze Documents; List Custom Models
• Bing: Bing Image Search
• Face: Detect Face; Find Similar Face; Group Faces; Identify Faces; Verify Faces
• Speech: Speech to Text; Conversation Transcription; Text to Speech; Emotion Recognition
• OpenAI: Completion*; Chat*; Embeddings*
• Text: Entity Detector*; Key Phrase Extractor*; Language Detector*; PII*; Sentiment*; Healthcare; Analyze Text
• Translation: Translate*; Transliterate*; Detect Language*; Break Sentence*; Dictionary Lookup*; Dictionary Examples*; Document Translator
• Azure Search: Add Documents
• Anomaly Detection: Detect Last Anomaly*; Detect Anomalies*; Simple Detect Anomalies; Custom Detection; Multivariate Detection
• Azure Maps: Address Geocoding; Reverse Geocoding; Check Point in Polygon
*Supported in Trident built-in endpoint at //build

Slide 59

Slide 59 text

AI Plug-ins for your data
• Create AI plug-ins to deliver custom generative AI experiences for your data
• Enable custom Q&A on your data in Fabric
• Define custom business semantics and grounding unique to your organization
• Deploy plug-ins to work seamlessly with Copilot in Business Chat

Slide 60

Slide 60 text

No content

Slide 61

Slide 61 text

Want to learn more?
Microsoft Fabric: microsoft.com/fabric

Slide 62

Slide 62 text

aka.ms/learnlive-get-started-microsoft-fabric
Date: Title
• August 29, 2023: Get started with end-to-end analytics and Lakehouses in Microsoft Fabric
• September 5, 2023: Use Apache Spark in Microsoft Fabric
• September 12, 2023: Work with Delta Lake tables in Microsoft Fabric
• September 19, 2023: Use Data Factory pipelines in Microsoft Fabric
• September 26, 2023: Ingest Data with Dataflows Gen2 in Microsoft Fabric
• October 3, 2023: Get started with data warehouses in Microsoft Fabric
• October 10, 2023: Get started with Real-Time Analytics in Microsoft
• October 17, 2023: Get started with data science in Microsoft Fabric
• October 24, 2023: Administer Microsoft Fabric

Slide 63

Slide 63 text

aka.ms/fabric-csc
• Compete: Benchmark your progress against friends and coworkers. It's always better when we learn together.
• Learn: Increase your understanding with easy-to-read instruction and stay up on the bleeding-edge of technology.
• Develop skills: By the end of the challenge, you will have marketable skills to better yourself and your career.

Slide 64

Slide 64 text

Microsoft Fabric Community Resources
✓ Try Microsoft Fabric for free: https://aka.ms/try-fabric
✓ Join the Fabric community: https://aka.ms/fabriccommunity
✓ Share and vote for ideas to improve Fabric: https://aka.ms/fabricideas
✓ Read and comment on our blog: https://aka.ms/fabricblog
▪ Product announcement: https://aka.ms/fabric
▪ Digital Event at Build (videos): https://aka.ms/build-with-analytics
▪ Product website: https://aka.ms/microsoft-fabric
▪ Documentation: https://aka.ms/fabric-docs
▪ Fabric e-book: https://aka.ms/fabric-get-started-ebook
▪ Microsoft Learn: https://aka.ms/learn-fabric
▪ End-to-end scenario tutorials: https://aka.ms/fabric-tutorials
▪ Fabric Notes: https://aka.ms/fabric-notes

Slide 65

Slide 65 text

Thank you!

Slide 66

Slide 66 text

What’s coming next?

Slide 67

Slide 67 text

Roadmap
Stages: Problem formulation/ideation; Data discovery and pre-processing; Experiment & Model; Enrich & Operationalize; Insight
• Tagging support
• Hyperparameter tuning
• AutoML (FLAML)
• Model interpretability
• DNN training
• Built-in pre-trained AI models (Cognitive Services)
• Data Wrangler on Spark
• Explore PowerBI datasets from Notebooks (Semantic Link)
• CI/CD
• ALM support for ML models
• Trident SDK for ML
• Improved model batch scoring with containerization
• Model endpoints
• Trident ML model support in PBI dataflows
• Monitoring of models
• Feature store
• Integration with PowerBI Metrics
• Open Notebook from BI report visual (Semantic link)
Copilot experiences and Azure Open AI Integration

Slide 68

Slide 68 text

No content

Slide 69

Slide 69 text

AutoML with FLAML (Private Preview)
• Automate the process of building machine learning models with FLAML
• Code-first integration to parallelize AutoML trials with Spark
• Integrated with MLFlow to automatically capture runs & metrics
Screenshot label: Run AutoML
Microsoft Confidential: Content is shared under NDA

Slide 70

Slide 70 text

Extra slides

Slide 71

Slide 71 text

Microsoft Fabric simplicity
Microsoft Fabric is a unified product for all your data and analytics workloads. Rather than provisioning and managing separate compute for each workload, with Microsoft Fabric, your bill is determined by two variables: the amount of compute you provision and the amount of storage you use.
• COMPUTE: A shared pool of capacity that powers all capabilities in Microsoft Fabric, from data modeling and data warehousing to business intelligence. Pay-as-you-go (per sec billing with one minute minimum).
• STORAGE: A single place to store all data. Pay-as-you-go ($ per GB / month).
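The two-variable bill described above can be sketched as simple arithmetic. This is only an illustration of "per-second billing with a one-minute minimum" plus "$ per GB per month" as read from the slide: the function names are hypothetical, the rates in the usage example are made-up placeholders rather than real prices, and the way the minimum is applied (per usage period) is an assumption.

```python
import math


def billed_compute_seconds(seconds_used, minimum_seconds=60):
    """Round usage up to whole seconds and apply the one-minute minimum.

    Assumption: the minimum applies only when some compute was actually used.
    """
    if seconds_used <= 0:
        return 0
    return max(math.ceil(seconds_used), minimum_seconds)


def estimate_bill(seconds_used, compute_rate_per_second,
                  storage_gb, storage_rate_per_gb_month):
    """Estimate one month's bill: compute part plus storage part."""
    compute_cost = billed_compute_seconds(seconds_used) * compute_rate_per_second
    storage_cost = storage_gb * storage_rate_per_gb_month
    return compute_cost + storage_cost


if __name__ == "__main__":
    # Placeholder rates, NOT actual Microsoft Fabric prices.
    print(billed_compute_seconds(10))
    print(estimate_bill(seconds_used=120, compute_rate_per_second=0.01,
                        storage_gb=100, storage_rate_per_gb_month=0.02))
```

The point of the sketch is that only two inputs drive the estimate, compute time and stored volume, regardless of which workload consumed them.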

Slide 72

Slide 72 text

Pricing Pay as you go

Slide 73

Slide 73 text

Azure OpenAI + Cognitive Search

Slide 74

Slide 74 text

Azure OpenAI + Plug-in
Source: Introduction - OpenAI API
Diagram labels: App; Orchestrator; OpenAI API; Plug-Ins; API; Data Source; Others