Slide 1

Slide 1 text

© 2024 Cloudera, Inc. All rights reserved.
Building Real-time Pipelines with FLaNK: A Case Study with Transit Data
Tim Spann, Principal Developer Advocate
April 2, 2024

Slide 2

Slide 2 text

Tim Spann
Twitter: @PaasDev // Blog: datainmotion.dev
Principal Developer Advocate. Field Engineer. Princeton/NYC Future of Data Meetups.
ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC, ex-HPE, ex-E&Y.
https://medium.com/@tspann
https://github.com/tspannhw

Slide 3

Slide 3 text

FLaNK Stack Weekly by Tim Spann
This week in Apache NiFi, Apache Flink, Apache Kafka, ML, AI, Apache Spark, Apache Iceberg, Python, Java, LLM, GenAI, Vector DB and Open Source friends.
https://bit.ly/32dAJft
https://www.meetup.com/futureofdata-princeton/

Slide 4

Slide 4 text

Future of Data - NYC + NJ + Philly + Virtual
@PaasDev
https://www.meetup.com/futureofdata-princeton/
https://www.meetup.com/futureofdata-newyork/
From Big Data to AI to Streaming to Containers to Cloud to Analytics to Cloud Storage to Fast Data to Machine Learning to Microservices to ...

Slide 5

Slide 5 text

FLaNK-MTA / Urban Transportation

Slide 6

Slide 6 text


Slide 7

Slide 7 text


Slide 8

Slide 8 text


Slide 9

Slide 9 text

Philadelphia SEPTA
https://medium.com/@tspann/septa-transit-real-time-81082878b485

Slide 10

Slide 10 text

Street Cameras
https://medium.com/cloudera-inc/streaming-street-cams-to-yolo-v8-with-python-and-nifi-to-minio-s3-3277e73723ce

Slide 11

Slide 11 text

NYC Subway
https://medium.com/cloudera-inc/subways-and-transit-updates-in-real-time-30c104c359ef

Slide 12

Slide 12 text

● FLaNK for Halifax Canada Transit: NiFi, Kafka, Flink, SQL, GTFS-RT | by Tim Spann | Cloudera | Dec, 2023 | Medium
● Never Get Lost in the Stream. NiFi-Kafka-Flink for getting to work… | by Tim Spann | Cloudera | Dec, 2023 | Medium
● Iteration 1: Building a System to Consume All the Real-Time Transit Data in the World At Once | by Tim Spann | Cloudera | Medium
● Watching Airport Traffic in Real-Time | by Tim Spann | Cloudera | Medium

Slide 13

Slide 13 text

SSB MATERIALIZED VIEWS
Key Takeaway: MVs allow data scientists, analysts and developers to consume data from the firehose
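As a rough illustration of the takeaway above, a materialized view can be thought of as a continuously updated "latest row per key" snapshot that readers poll instead of consuming the raw stream. The sketch below is plain Python, not SSB code; the class name, the event fields and the `vehicle_id` key are assumptions made up for the example.

```python
from typing import Any, Dict, Iterable, List


class MaterializedView:
    """Keeps only the latest row per key, like a continuously updated snapshot."""

    def __init__(self, key_field: str) -> None:
        self.key_field = key_field
        self._rows: Dict[Any, Dict[str, Any]] = {}

    def apply(self, event: Dict[str, Any]) -> None:
        # Upsert: a newer event for the same key replaces the older row.
        self._rows[event[self.key_field]] = event

    def consume(self, events: Iterable[Dict[str, Any]]) -> None:
        for event in events:
            self.apply(event)

    def snapshot(self) -> List[Dict[str, Any]]:
        # What a reader sees when polling the view.
        return list(self._rows.values())


# Hypothetical transit events; the field names are invented for illustration.
stream = [
    {"vehicle_id": "bus-1", "lat": 40.35, "lon": -74.65, "ts": 1},
    {"vehicle_id": "bus-2", "lat": 40.71, "lon": -74.00, "ts": 2},
    {"vehicle_id": "bus-1", "lat": 40.36, "lon": -74.66, "ts": 3},
]

view = MaterializedView(key_field="vehicle_id")
view.consume(stream)
latest = view.snapshot()
```

Three events arrive, but the snapshot holds two rows, one per vehicle, with the newer position for bus-1.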

Slide 14

Slide 14 text

Apache Flink SQL
Democratize access to real-time data with just SQL

Slide 15

Slide 15 text

Infer Tables from Kafka Topics with JSON or Avro
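To make the idea of inferring a table concrete, here is a minimal sketch that derives a Flink SQL `CREATE TABLE` statement from one sample JSON record. It is not how SSB does it internally; the type mapping is simplified, and the table name, topic, broker address and sample fields are all assumptions for illustration. The `WITH` options are the standard Flink Kafka connector options.

```python
import json
from typing import Any


def flink_type(value: Any) -> str:
    """Map a sample JSON value to a Flink SQL type (simplified mapping)."""
    # bool must be checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "DOUBLE"
    # Strings, nulls and nested values fall back to STRING in this sketch;
    # a fuller version would emit ROW and ARRAY types for nested data.
    return "STRING"


def infer_ddl(table: str, topic: str, sample_json: str, brokers: str) -> str:
    """Build a CREATE TABLE statement from a single sample record."""
    record = json.loads(sample_json)
    columns = ",\n".join(
        f"  `{name}` {flink_type(value)}" for name, value in record.items()
    )
    return (
        f"CREATE TABLE `{table}` (\n{columns}\n) WITH (\n"
        f"  'connector' = 'kafka',\n"
        f"  'topic' = '{topic}',\n"
        f"  'properties.bootstrap.servers' = '{brokers}',\n"
        f"  'scan.startup.mode' = 'earliest-offset',\n"
        f"  'format' = 'json'\n)"
    )


# Hypothetical sample message from a transit topic.
sample = '{"vehicle_id": "bus-1", "lat": 40.35, "delay_seconds": 120, "in_service": true}'
ddl = infer_ddl("transit_events", "transit-events", sample, "localhost:9092")
print(ddl)
```

A single sample can mislead (a field that is sometimes null, or an integer that is sometimes a float), so a real inference step would look at many records or at an Avro schema.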

Slide 16

Slide 16 text

NiFi 2.0.0 Features
● Python Integration
● Parameters
● JDK 21+
● JSON Flow Serialization
● Rules Engine for Development Assistance
● Run Process Group as Stateless
● flow.json.gz
https://medium.com/cloudera-inc/getting-ready-for-apache-nifi-2-0-5a5e6a67f450
https://cwiki.apache.org/confluence/display/NIFI/NiFi+2.0+Release+Goals

Slide 17

Slide 17 text

UNSTRUCTURED DATA WITH NIFI
• Archives - tar, gzipped, zipped, …
• Images - PNG, JPG, GIF, BMP, …
• Documents - HTML, Markdown, RSS, PDF, Doc, RTF, Plain Text, …
• Videos - MP4, Clips, MOV, YouTube URL, …
• Sound - MP3, …
• Social / Chat - Slack, Discord, Twitter, REST, Email, …
• Identify MIME Types, Chunk Documents, Store to Vector Database
• Parse Documents - HTML, Markdown, PDF, Word, Excel, PowerPoint
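The "Chunk Documents" step above can be sketched as fixed-size character windows with overlap, so that text cut at a boundary still appears whole in a neighboring chunk. This is a generic illustration in plain Python, not the NiFi processor's implementation; the function name and the default sizes are assumptions.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into windows of chunk_size characters that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Tiny stand-in document so the overlap is easy to inspect.
document = "abcdefghij" * 5
chunks = chunk_text(document, chunk_size=20, overlap=5)
```

Each chunk would then be embedded and written to the vector database; splitting on sentences or tokens instead of characters is a common refinement.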

Slide 18

Slide 18 text

CLOUD ML/DL/AI/Vector Database Services
• Cloudera ML
• Amazon Polly, Translate, Textract, Transcribe, Bedrock, …
• Hugging Face
• IBM watsonx.ai
• Vector Stores Anywhere: Weaviate, Pinecone, Milvus, Chroma DB, Solr, …

Slide 19

Slide 19 text

PROVENANCE

Slide 20

Slide 20 text

https://medium.com/cloudera-inc/google-gemma-for-real-time-lightweight-open-llm-inference-88efe98e580f

Slide 21

Slide 21 text

DataFlow Pipelines Can Help
• External Context Ingest: ingesting, routing, cleaning, enriching, transforming, parsing, chunking and vectorizing structured, unstructured, semi-structured and binary data and documents
• Prompt Engineering: crafting and structuring queries to optimize LLM responses
• Context Retrieval: enhancing the LLM with external context, as in Retrieval Augmented Generation (RAG)
• Roundtrip Interface: acting as a Discord, REST, Kafka, SQL or Slack bot to roundtrip discussions
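A minimal sketch of the context retrieval and prompt engineering steps: rank stored chunks by similarity to the question, then fold the best matches into the prompt. The `embed` function here is a toy word-count vector standing in for a real embedding model, the list stands in for a vector database, and the sample chunks are invented for the example.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: List[str], top_k: int = 2) -> List[str]:
    """Return the top_k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, context: List[str]) -> str:
    """Place the retrieved context ahead of the question."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


# Hypothetical chunks that an ingest flow might have stored.
chunks = [
    "The Q train is delayed 10 minutes at Times Square.",
    "SEPTA regional rail is running on schedule.",
    "Halifax transit route 1 has a detour downtown.",
]
question = "Is the Q train delayed?"
context = retrieve(question, chunks, top_k=1)
prompt = build_prompt(question, context)
```

The resulting prompt is what the flow would send to the LLM; the model call itself is left out of the sketch.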

Slide 22

Slide 22 text

FLINK SQL -> CLOUDERA MACHINE LEARNING MODELS

Slide 23

Slide 23 text

FLINK SQL -> NIFI -> HUGGING FACE GOOGLE GEMINI

Slide 24

Slide 24 text

SSB UDF JS/JAVA + GenAI = Real-Time GenAI SQL
https://medium.com/cloudera-inc/adding-generative-ai-results-to-sql-streams-513e1fd2a6af

SELECT CALLLLM(CAST(messagetext AS STRING)) AS generatedtext,
       messagerealname, messageusername, messagetext, messageusertz,
       messageid, threadts, ts
FROM flankslackmessages
WHERE messagetype = 'message'
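The query above applies a UDF to every matching row of the stream. SSB UDFs are written in JavaScript or Java, so the Python below is only a sketch of the row-wise semantics, not something that runs inside SSB. `call_llm_stub` is a hypothetical stand-in that calls no model, and the sample rows are invented.

```python
from typing import Any, Callable, Dict, Iterable, Iterator


def call_llm_stub(prompt: str) -> str:
    """Stand-in for the CALLLLM UDF; a real UDF would call a model endpoint."""
    return f"[generated reply to: {prompt}]"


def genai_select(
    rows: Iterable[Dict[str, Any]],
    llm: Callable[[str], str] = call_llm_stub,
) -> Iterator[Dict[str, Any]]:
    """Mimic the SELECT: filter on messagetype, then add generatedtext per row."""
    for row in rows:
        # WHERE messagetype = 'message'
        if row.get("messagetype") != "message":
            continue
        # CALLLLM(CAST(messagetext AS STRING)) AS generatedtext
        yield {
            "generatedtext": llm(str(row["messagetext"])),
            "messageusername": row["messageusername"],
            "messagetext": row["messagetext"],
            "ts": row["ts"],
        }


# Hypothetical Slack events as they might arrive on the stream.
slack_rows = [
    {"messagetype": "message", "messageusername": "tim",
     "messagetext": "When is the next Q train?", "ts": "1712000000.000100"},
    {"messagetype": "channel_join", "messageusername": "bot",
     "messagetext": "", "ts": "1712000001.000200"},
]

results = list(genai_select(slack_rows))
```

Because the function runs once per row, model latency and cost scale with message volume, which is one reason to filter in the WHERE clause before calling it.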

Slide 25

Slide 25 text

RESOURCES
● https://medium.com/cloudera-inc/watching-airport-traffic-in-real-time-32c522a6e386
● https://medium.com/cloudera-inc/building-a-real-time-data-pipeline-a-comprehensive-tutorial-on-minifi-nifi-kafka-and-flink-ee03ee6722cb
● https://medium.com/cloudera-inc/finding-the-best-way-around-7491c76ca4cb
● https://medium.com/cloudera-inc/nyc-traffic-are-you-kidding-me-6d3fa853903b
● https://medium.com/@tspann/building-a-travel-advisory-app-with-apache-nifi-in-k8-969b44c84958
● https://medium.com/@tspann/using-ollama-with-mistral-and-apache-nifi-720c17f5ff12
● https://medium.com/cloudera-inc/google-gemma-for-real-time-lightweight-open-llm-inference-88efe98e580f
● https://medium.com/@tspann/image-processing-with-custom-python-and-nifi-2-0-06eadc62c03c
● https://medium.com/@tspann/ai-augmented-devrel-part-1-4058af905a89
● https://medium.com/cloudera-inc/mixtral-generative-sparse-mixture-of-experts-in-dataflows-59744f7d28a9
● https://medium.com/@tspann/building-an-llm-bot-for-meetups-and-conference-interactivity-c211ea6e3b61
● https://medium.com/@tspann/yet-another-python-processor-45aaae6fe406

Slide 26

Slide 26 text

THANK YOU