
12 April 2024 - Princeton AI Max Conference - Cloudera - Real-time AI Streaming

Timothy Spann
April 15, 2024

Transcript

  1. Building Real-Time Generative AI Pipelines

    Tim Spann, Principal Developer Advocate. April 12, 2024, AI Max Summit. © 2024 Cloudera, Inc. All rights reserved.
  2. FLaNK Stack Weekly by Tim Spann

    This week in Apache NiFi, Apache Flink, Apache Kafka, ML, AI, Apache Spark, Apache Iceberg, Python, Java and Open Source friends. https://bit.ly/32dAJft https://www.meetup.com/futureofdata-princeton/
  3. Future of Data - NYC + NJ + Philly + Virtual

    @PaasDev https://www.meetup.com/futureofdata-princeton/ From Big Data to AI to Streaming to Containers to Cloud to Analytics to Cloud Storage to Fast Data to Machine Learning to Microservices to ... https://linktr.ee/tspannhw
  4. Tim Spann

    Principal Developer Advocate, Princeton Future of Data Meetup. Ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-HPE, ex-PwC, ex-EY. Twitter: @PaasDev Blog: datainmotion.dev https://medium.com/@tspann https://github.com/tspannhw
  5. RAPID INNOVATION IN THE LLM SPACE

    Too much to cover today, but you should know the common LLMs, frameworks, and tools. Open Community & Open Models.
    Notable LLMs
    • Closed models: GPT3.5, GPT4, Claude2
    • Open models: Llama2, Code Llama, Mistral7B, Mixtral8x7B
    • ++ 100s more… check out the HuggingFace LLM Leaderboard (pretrained, domain fine-tuned, chat models, …)
    Popular LLM Frameworks
    • Langchain is a framework for developing apps powered by LLMs. Python and JavaScript libraries. Provides modules for LLM Interface, Retrieval, & Agents.
    • LlamaIndex is a framework designed specifically for RAG apps. Python and JavaScript libraries. Provides built-in optimizations / techniques for advanced RAG.
    • HuggingFace is an ML community for hosting & collaborating on models, datasets, and ML applications. The latest open source LLMs are on HuggingFace, plus great learning resources / demos. https://huggingface.co/
    When to use one over the other? Use Langchain if you need a general-purpose framework with flexibility and extensibility. Consider LlamaIndex if you’re building a RAG-only app (retrieval/search).
    Some common Vector DBs
    Open Source vs Self Hosted vs SaaS option
  6. ENTERPRISE WIDE USE CASES FOR AN LLM

    Enterprise Knowledge Base / Chatbot / Q&A
    - Customer Support & Troubleshooting
    - Enable open ended conversations with user provided prompts
    Code assistant
    - Provide relevant snippets of code as a response to a request written in natural language.
    - Assist with creating test cases and synthetic test data.
    - Reference other relevant data such as a company’s documentation to help provide more accurate responses.
    Social and emotional sensing
    - Gauge emotions and opinions based on a piece of text.
    - Understand and deliver a more nuanced message back based on sentiment.
    Classification and Clustering
    - Categorize and sort large volumes of data into common themes and trends to support more informed decision making.
    Language Translation
    - Globalize your content by feeding web pages through LLMs for translation.
    - Combine with chatbots to provide multilingual support to your customer base.
    Document Summarization
    - Distill large amounts of text down to the most relevant points.
    Content Generation
    - Provide detailed and contextually relevant prompts to develop outlines, brainstorm ideas and approaches for content.
    Adoption depends on an enterprise’s risk tolerance, restrictions, decision rights and disclosure obligations.
  7. Which Model and When?

    Use the right model for the right job: closed or open-source. Closed Source Usage can easily scale but so can your costs Rapidly improving AI models Most advanced AI models Excel at more specialized tasks Great for a wide range of tasks Open Source Better cost planning Compliance, privacy, and security risks More control over where & how models are deployed
  8. Adoption of Generative AI is a Journey

    Identifying AI challenges in the enterprise. Each challenge is paired with what’s missing:
    • Data integration barriers: streamlined access to enterprise data
    • Rigid model infrastructure: modularity, flexibility, AI Ops
    • Lack of security and transparency: model control, built-in security, visibility & governance
  9. Data = Organization Context

    Your data enables contextually accurate responses from LLMs.
    • Without organization context: User Query -> Large Language Model -> Contextually Inaccurate Response
    • With organization context: User Query + Data (Organization Context) -> Large Language Model -> Contextually Accurate Response
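The contrast on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `build_prompt` and the prompt wording are assumptions, not something shown in the deck, and no real LLM is called.

```python
def build_prompt(user_query: str, context_snippets: list[str]) -> str:
    """Combine organization context (if any) with the user query into one prompt."""
    if not context_snippets:
        # Without context the model only sees the bare question.
        return f"Question: {user_query}\nAnswer:"
    # With context, each snippet is listed before the question.
    context = "\n".join(f"- {snippet}" for snippet in context_snippets)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {user_query}\nAnswer:"
    )


if __name__ == "__main__":
    print(build_prompt("Is the Princeton office open Friday?", []))
    print(build_prompt(
        "Is the Princeton office open Friday?",
        ["Hypothetical policy note: offices close on Fridays in August."],
    ))
```

The second call gives the model organization-specific text to ground its answer; the first leaves it to guess.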
  10. Generative AI ecosystem (diagram)

    • CLOSED-SOURCE FOUNDATION MODELS: APIs: OpenAI (GPT-4 Turbo); Amazon Bedrock: Anthropic (Claude 2), Cohere…
    • OPEN SOURCE FOUNDATION MODELS: Meta (Llama 2)
    • MODEL HUBS: Hugging Face
    • FINE-TUNED MODELS
    • PRIVATE VECTOR STORE: Milvus, Solr*
    • MANAGED VECTOR STORE: Pinecone
    • Applied Machine Learning Prototypes (AMPs)
    • CLOUD INFRASTRUCTURE, SPECIALIZED HARDWARE
    • Open Data Lakehouse: DATA WRANGLING, REAL-TIME DATA INGEST & ROUTING, AI MODEL TRAINING, INFERENCE & SERVING, DATA STORE & VISUALIZATION
    • AI APPLICATIONS
  11. ARCHITECTURE (diagram)

    • Data sources: Live Q&A, Travel Advisories, Weather Reports, Documents, Social Media, Databases, Transactions, Public Data Feeds, S3 / Files, Logs, ATM Data, Live Chat, …
    • Stages: INTERACT, COLLECT, STORE, ENRICH, REPORT (Collect, Distribute, Report, Visualize, Automate)
    • AI BASED ENHANCEMENTS: Predict, Automate
    • Components: Data Flow, Messaging Broker, SQL Stream Builder, Data Warehouse, Machine Learning, VECTOR DATABASE, LLM, Data Visualization
    • Data labels: Input Sentences, Timestamps, Enrichments, Aggregations, Generated Text, Real-time alerting
  12. LLM USE CASE: Data in Motion on Cloudera Data Platform (CDP)

    Capture, process & distribute any data, anywhere. Diagram labels: Unstructured file types, Other enterprise data, Structured Sources, Applications/APIs, Streams, Vector DB, AI Model, Open Data Lakehouse, Materialized Views.
  13. NiFi 2.0.0 Features

    • Python Integration
    • Parameters
    • JDK 21+
    • JSON Flow Serialization
    • Rules Engine for Development Assistance
    • Run Process Group as Stateless
    • flow.json.gz
    https://medium.com/cloudera-inc/getting-ready-for-apache-nifi-2-0-5a5e6a67f450
    https://cwiki.apache.org/confluence/display/NIFI/NiFi+2.0+Release+Goals
  14. DataFlow Pipelines Can Help

    • External Context Ingest: ingesting, routing, cleaning, enriching, transforming, parsing, chunking and vectorizing structured, unstructured, semistructured, binary data and documents
    • Prompt engineering: crafting and structuring queries to optimize LLM responses
    • Context Retrieval: enhancing the LLM with external context, such as Retrieval Augmented Generation (RAG)
    • Roundtrip Interface: act as a Discord, REST, Kafka, SQL, or Slack bot to roundtrip discussions
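The chunking step named above can be sketched as follows. This is a minimal illustration that assumes fixed-size character chunks with overlap; the function name `chunk_text` and its defaults are hypothetical, and it is not the logic of any NiFi processor.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap their neighbors."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("abcdefghij", chunk_size=4, overlap=2):
        print(chunk)
```

Overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, which tends to help retrieval later. Each chunk would then be embedded and written to a vector store.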
  15. UNSTRUCTURED DATA WITH NIFI

    • Archives - tar, gzipped, zipped, …
    • Images - PNG, JPG, GIF, BMP, …
    • Documents - HTML, Markdown, RSS, PDF, Doc, RTF, Plain Text, …
    • Videos - MP4, Clips, Mov, Youtube URL…
    • Sound - MP3, …
    • Social / Chat - Slack, Discord, Twitter, REST, Email, …
    • Identify Mime Types, Chunk Documents, Store to Vector Database
    • Parse Documents - HTML, Markdown, PDF, Word, Excel, Powerpoint
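Routing by MIME type, as listed above, can be illustrated with Python's standard `mimetypes` module. Note the assumptions: this sketch guesses from the file extension only, whereas a real flow would typically inspect content, and the route names (`image`, `document`, and so on) are made up for the example.

```python
import mimetypes


def route_by_mime_type(filename: str) -> str:
    """Pick a hypothetical processing route from the guessed MIME type."""
    mime_type, encoding = mimetypes.guess_type(filename)
    if encoding is not None:
        # A content encoding such as gzip means the file is compressed.
        return "archive"
    if mime_type is None:
        return "unknown"
    major = mime_type.split("/")[0]
    if major in ("image", "video", "audio"):
        return major
    if major == "text" or mime_type == "application/pdf":
        return "document"
    return "other"


if __name__ == "__main__":
    for name in ("photo.png", "report.pdf", "page.html", "logs.tar.gz"):
        print(name, "->", route_by_mime_type(name))
```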
  16. CLOUD ML/DL/AI/Vector Database Services

    • Cloudera ML
    • Amazon Polly, Translate, Textract, Transcribe, Bedrock, …
    • Hugging Face
    • IBM Watson X.AI
    • Vector Stores Anywhere: Weaviate, Pinecone, Milvus, Chroma DB, SOLR, …
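What a vector store does at query time can be sketched with a toy in-memory version. This is a simplified stand-in for the idea, not the API of Weaviate, Pinecone, Milvus, Chroma or Solr. The two-dimensional vectors are invented for readability; real embeddings come from a model and have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector: list[float], stored: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the texts of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


if __name__ == "__main__":
    stored = [
        ("weather report", [1.0, 0.0]),
        ("travel advisory", [0.0, 1.0]),
        ("weather and travel", [1.0, 1.0]),
    ]
    print(top_k([1.0, 0.1], stored, k=2))
```

Production stores replace the linear scan with approximate nearest-neighbor indexes, but the contract is the same: vectors in, most similar items out.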
  17. Extract Entities

    • Python 3.10+
    • NLP, spaCy
    • Extract locations
    • Extract organizations
    • Extract money
    • Extract time
    • Extract events
    • Extract countries
    • Extract objects, food, people, quantities
    https://github.com/tspannhw/FLaNK-python-processors/blob/main/ExtractEntities.py
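A rough idea of entity extraction with spaCy, grouped by label, is sketched below. This is not the code of the linked processor. The function names are hypothetical, and it assumes spaCy and the `en_core_web_sm` model are installed. The grouping helper is kept separate so it runs without spaCy.

```python
from collections import defaultdict


def group_entities(entities) -> dict[str, list[str]]:
    """Group (text, label) pairs by label, dropping repeats but keeping order."""
    grouped = defaultdict(list)
    for text, label in entities:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


def extract_entities(text: str, model_name: str = "en_core_web_sm") -> dict[str, list[str]]:
    """Run spaCy NER over text and group the entities by label (GPE, ORG, MONEY, ...)."""
    import spacy  # imported lazily so group_entities works without spaCy installed

    nlp = spacy.load(model_name)
    doc = nlp(text)
    return group_entities((ent.text, ent.label_) for ent in doc.ents)


if __name__ == "__main__":
    pairs = [("Princeton", "GPE"), ("Cloudera", "ORG"), ("Princeton", "GPE")]
    print(group_entities(pairs))
```

In a flow, the grouped values would typically be written out as attributes or a JSON record for downstream routing.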
  18. Extract Company Names

    • Python 3.10+
    • Hugging Face, NLP, spaCy, PyTorch
    https://github.com/tspannhw/FLaNK-python-ExtractCompanyName-processor
  19. WatsonX SDK To Foundation

    • Python 3.10+
    • LLM
    • WatsonX.AI Foundation Models
    • Inference
    • Secure
    • Official SDK from IBM
    https://github.com/tspannhw/FLaNK-python-watsonx-processor
  20. CaptionImage

    • Python 3.10+
    • Hugging Face
    • Salesforce/blip-image-captioning-large
    • Generate Captions for Images
    • Adds captions to FlowFile Attributes
    • Does not require download or copies of your images
    https://github.com/tspannhw/FLaNK-python-processors
  21. RESNetImageClassification

    • Python 3.10+
    • Hugging Face
    • Transformers
    • PyTorch
    • Datasets
    • microsoft/resnet-50
    • Adds classification label to FlowFile Attributes
    • Does not require download or copies of your images
    https://github.com/tspannhw/FLaNK-python-processors
  22. NSFWImageDetection

    • Python 3.10+
    • Hugging Face
    • Transformers
    • Falconsai/nsfw_image_detection
    • Adds normal and nsfw to FlowFile Attributes
    • Gives score on safety of image
    • Does not require download or copies of your images
    https://github.com/tspannhw/FLaNK-python-processors
  23. FacialEmotionsImageDetection

    • Python 3.10+
    • Hugging Face
    • Transformers
    • facial_emotions_image_detection
    • Image Classification
    • Adds labels/scores to FlowFile Attributes
    • Does not require download or copies of your images
    https://github.com/tspannhw/FLaNK-python-processors
  24. Other Python Processors

    • Chunk Document, Parse Document
    • Prompt Chat GPT
    • Put Chroma, Query Chroma
    • Put Pinecone, Query Pinecone
  25. FLINK SQL -> CLOUDERA MACHINE LEARNING MODELS
  26. FLINK SQL -> NIFI -> HUGGING FACE GOOGLE GEMINI
  27. SSB UDF JS/JAVA + GenAI = Real-Time GenAI SQL

    https://medium.com/cloudera-inc/adding-generative-ai-results-to-sql-streams-513e1fd2a6af
    SELECT CALLLLM(CAST(messagetext as STRING)) as generatedtext, messagerealname, messageusername, messagetext, messageusertz, messageid, threadts, ts FROM flankslackmessages WHERE messagetype = 'message'
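To make the query's per-row behavior concrete, here is a Python sketch of the same filter-and-enrich logic. It is not how SSB runs the query, since the slide says the UDF is written in JS or Java. `call_llm` is a hypothetical stub standing in for the `CALLLLM` UDF, so no model is contacted.

```python
def call_llm(text: str) -> str:
    """Hypothetical stand-in for the CALLLLM UDF; a real one would call a model endpoint."""
    return f"[generated reply to: {text}]"


def enrich_messages(rows):
    """Yield one enriched record per row where messagetype = 'message'."""
    for row in rows:
        # Mirrors: WHERE messagetype = 'message'
        if row.get("messagetype") != "message":
            continue
        # Mirrors the SELECT list (trimmed to a few columns for brevity).
        yield {
            "generatedtext": call_llm(str(row["messagetext"])),
            "messageusername": row.get("messageusername"),
            "messagetext": row["messagetext"],
            "ts": row.get("ts"),
        }


if __name__ == "__main__":
    slack_rows = [
        {"messagetype": "message", "messagetext": "What is NiFi?", "messageusername": "tim", "ts": "1"},
        {"messagetype": "channel_join", "messagetext": "joined", "messageusername": "bot", "ts": "2"},
    ]
    for record in enrich_messages(slack_rows):
        print(record)
```

The point of the slide is that the streaming SQL engine applies this logic continuously to each arriving Slack message, so generated text becomes just another column.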
  28. SSB MATERIALIZED VIEWS

    Key takeaway: MVs allow data scientists, analysts and developers to consume data from the firehose.
  29. Apache Flink SQL

    Democratize access to real-time data with just SQL
  30. YES, FRANZ, IT’S KAFKA

    Let’s do a metamorphosis on your data. Don’t fear changing data. You don’t need to be a brilliant writer to stream data.
    “Franz Kafka was a German-speaking Bohemian novelist and short-story writer, widely regarded as one of the major figures of 20th-century literature. His work fuses elements of realism and the fantastic.” (Wikipedia)
  31. Streams Replication Manager (SRM)

    • Event Replication engine for Kafka
    • Supports active-active, multi-cluster, cross DC replication scenarios
    • Leverage Kafka Connect for scalability and HA
    • Replicate data and configurations (ACL, partitioning, new topics, etc)
    • Offset translation for simplified failover
    • Integrate replication monitoring with SMM
  32. Cloudera’s Open Data Lakehouse

    ❏ Multi-function analytics for Streaming, Data Engineering, Data Warehouse and AI/ML with integrated data services
    ❏ Common security and governance policies and data lineage with SDX integration
    ❏ Common dataset with all CDP analytics engines without data duplication and movement
    ❏ Deployment freedom with Multi-Hybrid Cloud
    Diagram labels: Iceberg Tables; DATA WAREHOUSE, MACHINE LEARNING, DATA ENGINEERING, DATA FLOW, STREAM PROCESSING; Multi-Hybrid Cloud; Metadata | Security | Encryption | Control | Governance
  33. Compute Engine Interoperability & SDX Integration

    • Snapshot isolation ensures consistent data access and processing with various compute engines including Hive, Spark, Impala and NiFi
    • Security & Governance support (e.g. FGAC) through Ranger integration
    • Data lineage support through Atlas integration
    Diagram labels: Apache Impala, Iceberg Tables, Ranger, Atlas
  34. FLINK & ICEBERG INTEGRATION

    Robust Next Generation Architecture for Data Driven Business. Unified Processing Engine. Massive Open table format. Iceberg Support for Flink APIs through SSB.
    • Maximally open
    • Maximally flexible
    • Ultra high performance for MASSIVE data
    • Can be used as Source and Sink
    • Supports batch and streaming modes
    • Supports time travel
  35. NIFI & ICEBERG INTEGRATION

    • PutIceberg processor in CFM 2.1.6
    • PutIcebergCDC
  36. CSP Community Edition

    • Docker compose file of CSP to run from command line w/o any dependencies, including Flink, SQL Stream Builder, Kafka, Kafka Connect, Streams Messaging Manager and Schema Registry.
      ◦ $> docker compose up
    • Licensed under the Cloudera Community License
    • Unsupported Commercially (Community Help - Ask Tim)
    • Community Group Hub for CSP
    • Find it on docs.cloudera.com (see QR Code)
    • Kafka, Kafka Connect, SMM, SR, Flink, Flink SQL, MV, Postgresql, SSB
    • Develop apps locally
  37. Open Source Edition

    • Apache NiFi in Docker
    • Try new features quickly
    • Develop applications locally
    • Docker NiFi
      ◦ docker run --name nifi -p 8443:8443 -d -e SINGLE_USER_CREDENTIALS_USERNAME=admin -e SINGLE_USER_CREDENTIALS_PASSWORD=ctsBtRBKHRAx69EqUghvvgEvjnaLjFEB apache/nifi:latest
    • Licensed under the ASF License
    • Unsupported
    • NiFi 1.25 and NiFi 2.0.0-M2
    https://hub.docker.com/r/apache/nifi
  38. Resources

    CEM, CDF, CSP
    • https://medium.com/cloudera-inc/watching-airport-traffic-in-real-time-32c522a6e386
    • https://medium.com/cloudera-inc/building-a-real-time-data-pipeline-a-comprehensive-tutorial-on-minifi-nifi-kafka-and-flink-ee03ee6722cb
    • https://medium.com/cloudera-inc/finding-the-best-way-around-7491c76ca4cb
    • https://medium.com/cloudera-inc/nyc-traffic-are-you-kidding-me-6d3fa853903b
    CDF
    • https://medium.com/@tspann/building-a-travel-advisory-app-with-apache-nifi-in-k8-969b44c84958
    LLM, GenAI, HuggingFace, WatsonX, OLLAMA, Mistral, NiFi, Python, Slack, PyTorch
    • https://medium.com/@tspann/using-ollama-with-mistral-and-apache-nifi-720c17f5ff12
    • https://medium.com/cloudera-inc/google-gemma-for-real-time-lightweight-open-llm-inference-88efe98e580f
    • https://medium.com/@tspann/image-processing-with-custom-python-and-nifi-2-0-06eadc62c03c
    • https://medium.com/@tspann/ai-augmented-devrel-part-1-4058af905a89
    • https://medium.com/cloudera-inc/mixtral-generative-sparse-mixture-of-experts-in-dataflows-59744f7d28a9
    • https://medium.com/@tspann/building-an-llm-bot-for-meetups-and-conference-interactivity-c211ea6e3b61
    • https://medium.com/@tspann/yet-another-python-processor-45aaae6fe406
  39. CLOUDERA STREAM PROCESSING

    Two Major Capabilities: Enterprise Messaging and Powerful Stream Processing
    • Cloudera Streams Messaging (CSM), powered by Apache Kafka: Enterprise grade messaging products for Apache Kafka. Streams Messaging Manager to monitor/operate clusters, Streams Replication Manager for HA/DR deployments, Schema Registry for centralized schema management, and support for Kafka Connect and Cruise Control.
    • Cloudera Streaming Analytics (CSA), powered by Apache Flink: Powered by Apache Flink with SQL StreamBuilder, it provides low-latency stream processing capabilities with advanced windowing & state management made simple with SQL.
  40. ENTERPRISE MANAGEMENT CAPABILITIES FOR APACHE KAFKA

    Extend streams messaging services for Schema Mgmt, Replication & Monitoring
    • Schema Registry: Kafka Schema Governance
    • Streams Replication Manager: Kafka Replication Service for Disaster Recovery
    • Streams Messaging Manager: Management & Monitoring Service for all of your Kafka clusters
  41. ENTERPRISE MANAGEMENT CAPABILITIES FOR APACHE KAFKA

    Kafka Data Movement, Operations and Security Made Easier
    • Kafka Connect Support: Simple Data Movement, Change Data Capture Connectors, Build Custom Connectors with NiFi
    • Ranger Security: Improved ACL and Audit for Kafka, KConnect and Schema Registry
    • Cruise Control Support: Intelligent Rebalancing & Self-Healing of your Kafka Clusters
  42. CLOUDERA STREAM PROCESSING

    Two Major Capabilities: Enterprise Messaging and Powerful Stream Processing
    • Cloudera Streams Messaging (CSM), powered by Apache Kafka: Enterprise grade messaging products for Apache Kafka. Streams Messaging Manager to monitor/operate clusters, Streams Replication Manager for HA/DR deployments, Schema Registry for centralized schema management, and support for Kafka Connect and Cruise Control.
    • Cloudera Streaming Analytics (CSA), powered by Apache Flink: Powered by Apache Flink with SQL StreamBuilder, it provides low-latency stream processing capabilities with advanced windowing & state management made simple with SQL.
  43. NEXT GENERATION STREAMING ANALYTICS WITH APACHE FLINK

    Low latency stateful stream processing
    • Flink is a distributed data processing system ideally suited for real-time, event driven applications.
    • Unifies stream and batch processing
    • Advanced features - late arriving data, checkpointing, event time processing, Exactly Once Processing
    Real-Time Insights, Event Processing, Low Latency
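Event time processing and late arriving data can be illustrated with a small Python sketch. It is a deliberately simplified model, not the Flink API. The watermark here is just the largest event time seen so far, and the function name and parameters are assumptions made for the example.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size: int, allowed_lateness: int = 0):
    """Count events per (window_start, key) using event time.

    events: iterable of (event_time, key) in arrival order, integer times.
    Returns (counts, late_events). An event is late when its window's end plus
    the allowed lateness is at or before the current watermark.
    """
    counts = defaultdict(int)
    late_events = []
    watermark = float("-inf")  # simplified: max event time observed so far
    for event_time, key in events:
        window_start = (event_time // window_size) * window_size
        window_end = window_start + window_size
        if window_end + allowed_lateness <= watermark:
            late_events.append((event_time, key))
        else:
            counts[(window_start, key)] += 1
        watermark = max(watermark, event_time)
    return dict(counts), late_events


if __name__ == "__main__":
    arrivals = [(1, "a"), (3, "a"), (12, "a"), (4, "a"), (13, "b")]
    print(tumbling_window_counts(arrivals, window_size=10))
    print(tumbling_window_counts(arrivals, window_size=10, allowed_lateness=5))
```

In the first call the event at time 4 arrives after time 12 has been seen, so its window [0, 10) is already closed and it is reported as late. With five units of allowed lateness it is still counted.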
  44. SQL STREAM BUILDER (SSB)

    SQL STREAM BUILDER allows developers, analysts, and data scientists to write streaming applications with industry standard SQL. No Java or Scala code development required.
    • Simplifies access to data in Kafka & Flink
    • Connectors to batch data in HDFS, Kudu, Hive, S3, JDBC, CDC and more
    • Enrich streaming data with batch data in a single tool
    Democratize access to real-time data with just SQL
  45. LLMs ARE FOUNDATION MODELS

    Base models that can be adapted for a wide range of use cases
    Diagram: Terabytes of Data (Multiple Formats) -> Train -> Foundation Models (Billions of Parameters) -> Adapt -> Question/Answering, Sentiment Analysis, Doc summarization, … ++ more
    ➔ Historically, data scientists trained specialized models against narrow datasets to solve specific tasks.
    ➔ LLMs are Foundation models that can be adapted to perform a variety of tasks.
      ◆ It is faster to “adapt” a foundation model than it is to train a specialized model from scratch
      ◆ Decouples “knowledge” from “intelligence”
      ◆ Opens up AI use cases to software developers (instead of just specialised data scientists)