
NLIT 2024 - Seattle - Real-Time AI Streaming with Cloudera

Timothy Spann
April 15, 2024

Transcript

  1. NLIT - Cloudera Streaming

    Tim Spann, Principal Developer Advocate
    April 9-10, 2024
    © 2024 Cloudera, Inc. All rights reserved.
  2. FLaNK Stack Weekly by Tim Spann

    This week in Apache NiFi, Apache Flink, Apache Kafka, ML, AI, Apache Spark, Apache Iceberg, Python, Java and Open Source friends.
    https://bit.ly/32dAJft
    https://www.meetup.com/futureofdata-princeton/
  3. Future of Data - NYC + NJ + Philly + Virtual

    @PaasDev
    https://www.meetup.com/futureofdata-princeton/
    From Big Data to AI to Streaming to Containers to Cloud to Analytics to Cloud Storage to Fast Data to Machine Learning to Microservices to ...
  4. RAPID INNOVATION IN THE LLM SPACE

    Too much to cover today, but you should know the common LLMs, Frameworks, Tools.

    Notable LLMs
    • Closed Models: GPT3.5, GPT4, Claude2
    • Open Models: Llama2, Code Llama, Mistral7B, Mixtral8x7B
    • ++ 100s more… check out the HuggingFace LLM Leaderboard (pretrained, domain fine-tuned, chat models, …)

    Popular LLM Frameworks
    • Langchain is a framework for developing apps powered by LLMs
      - Python and JavaScript Libraries
      - Provides modules for LLM Interface, Retrieval, & Agents
    • LlamaIndex is a framework designed specifically for RAG apps
      - Python and JavaScript Libraries
      - Provides built in optimizations / techniques for advanced RAG
    • When to use one over the other? Use Langchain if you need a general-purpose framework with flexibility and extensibility. Consider LlamaIndex if you’re building a RAG only app (retrieval/search).

    Open Community & Open Models
    • HuggingFace is an ML community for hosting & collaborating on models, datasets, and ML applications
      - Latest open source LLMs are in HuggingFace
      - + great learning resources / demos
      - https://huggingface.co/

    Some common Vector DBs
    • Open Source vs Self Hosted vs SaaS option
  5. ENTERPRISE WIDE USE CASES FOR AN LLM

    Enterprise Knowledge Base / Chatbot / Q&A
    - Customer Support & Troubleshooting
    - Enable open ended conversations with user provided prompts

    Code assistant
    - Provide relevant snippets of code as a response to a request written in natural language.
    - Assist with creating test cases and synthetic test data.
    - Reference other relevant data such as a company’s documentation to help provide more accurate responses.

    Social and emotional sensing
    - Gauge emotions and opinions based on a piece of text.
    - Understand and deliver a more nuanced message back based on sentiment.

    Classification and Clustering
    - Categorize and sort large volumes of data into common themes and trends to support more informed decision making.

    Language Translation
    - Globalize your content by feeding web pages through LLMs for translation.
    - Combine with chatbots to provide multilingual support to your customer base.

    Document Summarization
    - Distill large amounts of text down to the most relevant points.

    Content Generation
    - Provide detailed and contextually relevant prompts to develop outlines, brainstorm ideas and approaches for content.

    Adoption dependent upon an Enterprise’s risk tolerance, restrictions, decision rights and disclosure obligations.
  6. Which Model and When?

    Use the right model for the right job: closed or open-source.

    Closed Source
    • Usage can easily scale but so can your costs
    • Rapidly improving AI models
    • Most advanced AI models
    • Excel at more specialized tasks
    • Great for a wide range of tasks

    Open Source
    • Better cost planning
    • Compliance, privacy, and security risks
    • More control over where & how models are deployed
  7. Adoption of Generative AI is a Journey

    Identifying AI challenges in the enterprise

    Challenge: Data integration barriers
    What’s missing: • Streamlined access to enterprise data

    Challenge: Rigid model infrastructure
    What’s missing: • Modularity • Flexibility • AI Ops

    Challenge: Lack of security and transparency
    What’s missing: • Model control • Built-in security • Visibility & governance
  8. Data = Organization Context

    Your data enables contextually accurate responses from LLMs.
    • Without organization data: User Query -> Large Language Model -> Contextually Inaccurate Response
    • With organization data: User Query + Data (Organization Context) -> Large Language Model -> Contextually Accurate Response
  9. [Architecture diagram; labels recovered from the slide]

    • CLOSED-SOURCE FOUNDATION MODELS: APIs: OpenAI (GPT-4 Turbo); Amazon Bedrock: Anthropic (Claude 2), Cohere…
    • OPEN SOURCE FOUNDATION MODELS: Meta (Llama 2)
    • MODEL HUBS: Hugging Face
    • FINE-TUNED MODELS
    • PRIVATE VECTOR STORE: Milvus, Solr*
    • MANAGED VECTOR STORE: Pinecone
    • Applied Machine Learning Prototypes (AMPs)
    • CLOUD INFRASTRUCTURE, SPECIALIZED HARDWARE
    • Open Data Lakehouse: DATA WRANGLING, REAL-TIME DATA INGEST & ROUTING, AI MODEL TRAINING & INFERENCE / SERVING, DATA STORE & VISUALIZATION
    • AI APPLICATIONS
  10. ARCHITECTURE

    [Diagram; labels recovered from the slide]
    • Sources: Live Q&A, Travel Advisories, Weather Reports, Documents, Social Media, Databases, Transactions, Public Data Feeds, S3 / Files, Logs, ATM Data, Live Chat, …
    • Stages: INTERACT, COLLECT, STORE, ENRICH, REPORT (Distribute, Collect, Report, Visualize, Automate)
    • AI BASED ENHANCEMENTS: Predict, Automate
    • Components: VECTOR DATABASE, LLM, Machine Learning, Data Flow, Messaging Broker, SQL Stream Builder, Data Warehouse, Data Visualization
    • Data and outputs: Input Sentences, Generated Text, Timestamps, Enrichments, Aggregations, Real-time alerting
  11. LLM USE CASE

    Data in Motion on Cloudera Data Platform (CDP): Capture, process & distribute any data, anywhere
    [Diagram; labels recovered from the slide]
    • Unstructured file types, Other enterprise data, Structured Sources, Applications/APIs, Streams
    • Vector DB, AI Model, Open Data Lakehouse, Materialized Views
  12. FLINK & ICEBERG INTEGRATION

    Robust Next Generation Architecture for Data Driven Business
    Unified Processing Engine
    Massive Open table format
    Iceberg Support for Flink APIs through SSB
    • Maximally open
    • Maximally flexible
    • Ultra high performance for MASSIVE data
    • Can be used as Source and Sink
    • Supports batch and streaming modes
    • Supports time travel
  13. Streams Replication Manager (SRM)

    • Event Replication engine for Kafka
    • Supports active-active, multi-cluster, cross DC replication scenarios
    • Leverage Kafka Connect for scalability and HA
    • Replicate data and configurations (ACL, partitioning, new topics, etc)
    • Offset translation for simplified failover
    • Integrate replication monitoring with SMM
  14. NiFi 2.0.0 Features

    • Python Integration
    • Parameters
    • JDK 21+
    • JSON Flow Serialization
    • Rules Engine for Development Assistance
    • Run Process Group as Stateless
    • flow.json.gz
    https://medium.com/cloudera-inc/getting-ready-for-apache-nifi-2-0-5a5e6a67f450
    https://cwiki.apache.org/confluence/display/NIFI/NiFi+2.0+Release+Goals
  15. UNSTRUCTURED DATA WITH NIFI

    • Archives - tar, gzipped, zipped, …
    • Images - PNG, JPG, GIF, BMP, …
    • Documents - HTML, Markdown, RSS, PDF, Doc, RTF, Plain Text, …
    • Videos - MP4, Clips, Mov, YouTube URL, …
    • Sound - MP3, …
    • Social / Chat - Slack, Discord, Twitter, REST, Email, …
    • Identify Mime Types, Chunk Documents, Store to Vector Database
    • Parse Documents - HTML, Markdown, PDF, Word, Excel, Powerpoint
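The "Identify Mime Types, Chunk Documents" step on this slide can be illustrated outside NiFi with a minimal Python sketch. It is not the NiFi processors' implementation: `guess_mime_type` and `chunk_text` are hypothetical helper names, the MIME guess uses only the file extension, and the fixed-size character window with overlap is just one simple chunking strategy assumed here for illustration.

```python
import mimetypes


def guess_mime_type(filename: str) -> str:
    """Guess a MIME type from the file extension only (no content sniffing)."""
    mime, _encoding = mimetypes.guess_type(filename)
    return mime or "application/octet-stream"


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each chunk would then be embedded and written to a vector database; that part depends on the embedding model and store chosen and is left out here.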
  16. CLOUD ML/DL/AI/Vector Database Services

    • Cloudera ML
    • Amazon Polly, Translate, Textract, Transcribe, Bedrock, …
    • Hugging Face
    • IBM WatsonX.AI
    • Vector Stores Anywhere: Weaviate, Pinecone, Milvus, Chroma DB, Solr, …
  17. DataFlow Pipelines Can Help

    • External Context Ingest: Ingesting, routing, cleaning, enriching, transforming, parsing, chunking and vectorizing structured, unstructured, semi-structured and binary data and documents
    • Prompt engineering: Crafting and structuring queries to optimize LLM responses
    • Context Retrieval: Enhancing the LLM with external context, as in Retrieval Augmented Generation (RAG)
    • Roundtrip Interface: Act as a Discord, REST, Kafka, SQL or Slack bot to roundtrip discussions
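The context retrieval and prompt engineering roles listed on this slide can be sketched in a few lines of Python. This is an illustrative toy, not the pipeline from the talk: the vectors are hand-written stand-ins for real embeddings, the in-memory list stands in for a vector database, and `cosine_similarity`, `retrieve` and `build_prompt` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vector, store, top_k=2):
    """Return the texts of the top_k entries most similar to the query.

    `store` is a list of (text, vector) pairs standing in for a vector database.
    """
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _vector in ranked[:top_k]]


def build_prompt(question, context_chunks):
    """Assemble a prompt that places the retrieved context before the question."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
    )
```

The resulting prompt string is what would be sent to the LLM; in a real flow the query vector would come from the same embedding model used when the documents were ingested.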
  18. Extract Company Names

    • Python 3.10+
    • HuggingFace, NLP, spaCy, PyTorch
    https://github.com/tspannhw/FLaNK-python-ExtractCompanyName-processor
  19. Get Compound GTFS Data

    • Python 3.10+
    • GTFS to JSON
    https://github.com/tspannhw/FLaNK-python-processors/blob/main/GetGTFSCompoundFeed.py
  20. Extract Text from Web VTT

    • Python 3.10+
    • Web VTT to Text
    • Web Video Text Tracks Format Extractor
    https://developer.mozilla.org/en-US/docs/Web/API/WebVTT_API
    https://github.com/tspannhw/FLaNK-python-processors/blob/main/TranslateWebVTT.py

    WEBVTT

    1
    00:00:06.066 --> 00:00:07.166
    Now let's talk about

    2
    00:00:07.166 --> 00:00:12.033
    data retrieval, views, and materialized views.
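To make the "Web VTT to Text" idea concrete, here is a minimal standalone sketch in Python. It is not the code of the linked processor: `webvtt_to_text` is a hypothetical helper that drops the `WEBVTT` header, cue identifiers and timing lines and keeps only caption text. It assumes simple cues like the sample on the slide and does not handle NOTE or STYLE blocks or inline cue tags.

```python
def webvtt_to_text(vtt: str) -> str:
    """Return only the caption text from a simple WebVTT document."""
    lines = vtt.splitlines()
    text_lines = []
    for index, line in enumerate(lines):
        stripped = line.strip()
        if not stripped:
            continue  # blank separator between cues
        if index == 0 and stripped.lstrip("\ufeff").startswith("WEBVTT"):
            continue  # file header, possibly preceded by a byte order mark
        if "-->" in stripped:
            continue  # cue timing line
        next_line = lines[index + 1].strip() if index + 1 < len(lines) else ""
        if "-->" in next_line:
            continue  # cue identifier sits directly above a timing line
        text_lines.append(stripped)
    return " ".join(text_lines)
```

Applied to the two-cue sample on the slide, the function joins the caption lines into a single sentence.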
  21. WatsonX SDK To Foundation

    • Python 3.10+
    • LLM
    • WatsonX.AI Foundation Models
    • Inference
    • Secure
    • Official SDK from IBM
    https://github.com/tspannhw/FLaNK-python-watsonx-processor
  22. System / Process Monitoring

    • Python 3.10+
    • psutil
    • Swap memory, disk, networks
  23. Generate Synthetic Records w/ Faker

    • Python 3.10+
    • faker
    • Choose as many as you want
    • Attribute output
  24. Download a Wiki Page as HTML or WikiFormat (Text)

    • Python 3.10+
    • Wikipedia-api
    • HTML or Text
    • Choose your wiki page dynamically
  25. Get GTFS Data

    • Python 3.10+
    • GTFS from Transit URL
    • Alerts, Trip Updates or Vehicle Positions
    • Returns JSON
    • google.transit and google.protobuf
  26. CaptionImage

    • Python 3.10+
    • Hugging Face
    • Salesforce/blip-image-captioning-large
    • Generate Captions for Images
    • Adds captions to FlowFile Attributes
    • Does not require download or copies of your images
    https://github.com/tspannhw/FLaNK-python-processors
  27. RESNetImageClassification

    • Python 3.10+
    • Hugging Face
    • Transformers
    • PyTorch
    • Datasets
    • microsoft/resnet-50
    • Adds classification label to FlowFile Attributes
    • Does not require download or copies of your images
    https://github.com/tspannhw/FLaNK-python-processors
  28. NSFWImageDetection

    • Python 3.10+
    • Hugging Face
    • Transformers
    • Falconsai/nsfw_image_detection
    • Adds normal and nsfw to FlowFile Attributes
    • Gives score on safety of image
    • Does not require download or copies of your images
    https://github.com/tspannhw/FLaNK-python-processors
  29. FacialEmotionsImageDetection

    • Python 3.10+
    • Hugging Face
    • Transformers
    • facial_emotions_image_detection
    • Image Classification
    • Adds labels/scores to FlowFile Attributes
    • Does not require download or copies of your images
    https://github.com/tspannhw/FLaNK-python-processors
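As an illustration of "Adds labels/scores to FlowFile Attributes", the sketch below flattens classifier output into string key/value pairs, which is the shape FlowFile attributes take. It assumes predictions arrive as a list of dictionaries with `label` and `score` keys, the format returned by Hugging Face image-classification pipelines. The function name `scores_to_attributes` and the attribute names are hypothetical and are not taken from the linked processors.

```python
def scores_to_attributes(predictions, prefix="image"):
    """Flatten [{'label': ..., 'score': ...}, ...] into string attributes."""
    attributes = {}
    for prediction in predictions:
        # Normalize the label so it is safe to use inside an attribute name.
        label = prediction["label"].lower().replace(" ", "_")
        attributes[f"{prefix}.{label}"] = f"{prediction['score']:.4f}"
    if predictions:
        # Record the highest-scoring label separately for easy routing.
        top = max(predictions, key=lambda p: p["score"])
        attributes[f"{prefix}.top_label"] = top["label"]
    return attributes
```

Because only attributes are added, the image content itself passes through the flow untouched, which matches the slide's note that no download or copy of the image is required.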
  30. Other Python Processors

    • Updated Pinecone (Vector DB Interface)
    • ChunkDocument, ParseDocument
    • ConvertCSVtoExcel
    • DetectObjectInImage
    • PromptChatGPT
    • PutChroma, QueryChroma (Vector DB Interface)
  31. LLM, NiFi, Kafka & Flink

    Architecture in the context of Codeless GenAI Pipelines
    [Diagram; labels recovered from the slide]
    • Sources, DataFlow / NiFi, Kafka topics, Flink SQL w/ SSB
    • Database, Machine learning, Lakehouse, Data Viz, Monitoring, Alerting
  32. SSB -> CDF -> HUGGING FACE, GOOGLE GEMINI
  33. SSB UDF JS/JAVA + GenAI = Real-Time GenAI SQL

    https://medium.com/cloudera-inc/adding-generative-ai-results-to-sql-streams-513e1fd2a6af

    SELECT CALLLLM(CAST(messagetext AS STRING)) AS generatedtext,
           messagerealname, messageusername, messagetext,
           messageusertz, messageid, threadts, ts
    FROM flankslackmessages
    WHERE messagetype = 'message'
  34. Related articles by Tim Spann (Cloudera, Medium)

    • FLaNK for Halifax Canada Transit - NiFi, Kafka, Flink, SQL, GTFS-RT | by Tim Spann | Cloudera | Dec, 2023 | Medium
    • Never Get Lost in the Stream. NiFi-Kafka-Flink for getting to work… | by Tim Spann | Cloudera | Dec, 2023 | Medium
    • Iteration 1: Building a System to Consume All the Real-Time Transit Data in the World At Once | by Tim Spann | Cloudera | Medium
    • Watching Airport Traffic in Real-Time | by Tim Spann | Cloudera | Medium