Building a Real-Time IoT Application with Apache Pulsar and Apache Pinot (Timothy Spann, Cloudera) | RTA Summit 2023

We will walk step by step, with live code and demos, through building a real-time IoT application with Pinot + Pulsar.

First, we stream sensor data from an edge device monitoring location conditions to Pulsar via a Python application.

We have our Apache Pinot “realtime” table connected to Pulsar via the pinot-pulsar stream ingestion connector.

Our data streams into the Pinot table, and we visualize it with Superset.
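
As a rough illustration of the first step, the sketch below builds one JSON sensor reading and, optionally, publishes it to Pulsar. The field names are a subset of those in the schema command shown in the transcript; the topic name, broker URL, epoch-millisecond timestamp, and sample values are assumptions, and the send step needs the separately installed `pulsar-client` package. It is not the code from the linked repository.

```python
import json
import socket
import time


def build_reading(temperature, humidity, co2, host=None, ipaddress="127.0.0.1"):
    """Return one sensor reading as a dict (a subset of the thermal fields)."""
    return {
        "host": host or socket.gethostname(),
        "ipaddress": ipaddress,
        "ts": int(time.time() * 1000),  # epoch milliseconds (assumed unit)
        "temperature": float(temperature),
        "humidity": float(humidity),
        "co2": float(co2),
    }


def encode_reading(reading):
    """Serialize a reading to UTF-8 JSON bytes, the payload sent to Pulsar."""
    return json.dumps(reading).encode("utf-8")


def send_reading(reading, service_url="pulsar://localhost:6650",
                 topic="persistent://public/default/thermal"):
    """Publish one reading; needs `pip install pulsar-client` and a broker.

    The service URL and topic are placeholder assumptions.
    """
    import pulsar  # imported lazily so the rest runs without the client

    client = pulsar.Client(service_url)
    try:
        producer = client.create_producer(topic)
        producer.send(encode_reading(reading))
    finally:
        client.close()


if __name__ == "__main__":
    print(encode_reading(build_reading(72.5, 41.0, 615.0, host="thermal-pi")))
```

Keeping the payload builder separate from the send call makes it possible to check the JSON shape without a running broker.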

https://medium.com/@tspann/building-a-real-time-iot-application-with-apache-pulsar-and-apache-pinot-1e3baf8c1824

Source Code
https://github.com/tspannhw/pulsar-thermal-pinot

Reference
https://docs.pinot.apache.org/basics/data-import/pinot-stream-ingestion/apache-pulsar
https://dev.startree.ai/docs/pinot/recipes/pulsar

StarTree

May 23, 2023
Transcript

  1. Notes

    We will walk step by step, with live code and demos, through building a real-time IoT application with Pinot + Pulsar. First, we stream sensor data from an edge device monitoring location conditions to Pulsar via a Python application. We have our Apache Pinot “realtime” table connected to Pulsar via the pinot-pulsar stream ingestion connector. Our data streams into the Pinot table, and we visualize it with Superset.

    https://medium.com/@tspann/building-a-real-time-iot-application-with-apache-pulsar-and-apache-pinot-1e3baf8c1824
    https://github.com/tspannhw/pulsar-thermal-pinot
  2. FLiPN-FLaNK Stack

    Tim Spann, @PaasDev
    Blog: www.datainmotion.dev
    Principal Developer Advocate, Cloudera
    Princeton Future of Data Meetup
    ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC
    https://github.com/tspannhw/EverythingApacheNiFi
    https://medium.com/@tspann
    Apache NiFi x Apache Kafka x Apache Flink x Java
  3. • Introduction to Pinot
    • Introduction to Apache Pulsar
    • NiFi to Pulsar to Pinot (FLiPN)
    • NiFi to Kafka to Pinot (P-FLaNK)
    • FLaNK Ingest
    • Demos
  4. Assets

    Apache NiFi: Flows
    Apache Pinot: Real-Time Tables
    Apache Kafka: Topics
    Apache Pulsar: Topics
    Apache Flink SQL: Virtual Tables
  5. [Diagram: consumers attached to a Pulsar topic/partition through four subscription modes, spanning streaming and messaging: Exclusive, Failover (in case of failure in Consumer B-0), Shared, and Key-Shared.]
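
To make the four subscription modes concrete, here is a toy model (not Pulsar client code) of which consumer receives each message. It assumes round-robin delivery for Shared and a CRC32 hash modulo the consumer count for Key-Shared; a real broker uses hash ranges for Key-Shared and takes consumer availability into account, so treat this purely as an illustration. The function and consumer names are hypothetical.

```python
import zlib


def dispatch(messages, consumers, mode, failed=()):
    """Map each (key, value) message to a consumer name under one mode.

    messages  -- list of (key, value) tuples, in publish order
    consumers -- consumer names in the order they subscribed
    failed    -- names of consumers considered down (used by failover)
    """
    alive = [c for c in consumers if c not in failed]
    if not alive:
        raise ValueError("no live consumers")
    out = []
    for i, (key, value) in enumerate(messages):
        if mode == "exclusive":
            # Only a single attached consumer may read.
            if len(consumers) != 1:
                raise ValueError("exclusive allows exactly one consumer")
            target = consumers[0]
        elif mode == "failover":
            # One active consumer; the next one takes over if it fails.
            target = alive[0]
        elif mode == "shared":
            # Messages are spread across consumers (round-robin here).
            target = alive[i % len(alive)]
        elif mode == "key_shared":
            # Messages with the same key always go to the same consumer.
            target = alive[zlib.crc32(key.encode("utf-8")) % len(alive)]
        else:
            raise ValueError(f"unknown mode: {mode}")
        out.append((target, value))
    return out


if __name__ == "__main__":
    msgs = [("pi-1", "m0"), ("pi-2", "m1"), ("pi-1", "m2"), ("pi-2", "m3")]
    print(dispatch(msgs, ["B-0", "B-1"], "failover"))
    print(dispatch(msgs, ["B-0", "B-1"], "failover", failed={"B-0"}))
    print(dispatch(msgs, ["C-0", "C-1"], "shared"))
```

In the Python client, the mode is chosen when subscribing; to the best of my knowledge the corresponding values are `pulsar.ConsumerType.Exclusive`, `Failover`, `Shared`, and `KeyShared`, but check the client documentation for the version in use.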
  6. STREAMING FROM … TO .. WHILE ..

    Data distribution as a first class citizen

    [Diagram labels: IoT Devices; Log Data Sources; On-Prem Data Sources; Big Data Cloud Services; Cloud Business Process Services*; Cloud Data* Analytics/Service (Cloudera DW); App Logs; Laptops/Servers; Mobile Apps; Security Agents; Cloud Warehouse; Universal Data Distribution (Ingest, Transform, Deliver); Ingest Processors; Ingest Gateway; Router, Filter & Transform Processors; Destination Processors.]
  7. End to End Streaming Pipeline Example

    [Diagram labels: Enterprise sources, Weather, Errors, Aggregates, Alerts, Stocks, ETL, Analytics, Clickstream, Market data, Machine logs, Social, SQL.]
  8. Apache Kafka

    • Highly reliable distributed messaging system
    • Decouples applications, enables many-to-many patterns
    • Publish-Subscribe semantics
    • Horizontal scalability
    • Efficient implementation to operate at speed with big data volumes
    • Organized by topic to support several use cases

    [Diagram labels: Source System (x3), Kafka, Fraud Detection, Security Systems, Real-Time Monitoring; Many-To-Many Publish-Subscribe; Point-To-Point Request-Response.]

    © 2019 Cloudera, Inc. All rights reserved.
  9. CSP Community Edition

    • Kafka, KConnect, SMM, SR, Flink, and SSB in Docker
    • Runs in Docker
    • Try new features quickly
    • Develop applications locally
    • Docker compose file of CSP to run from command line w/o any dependencies, including Flink, SQL Stream Builder, Kafka, Kafka Connect, Streams Messaging Manager and Schema Registry
        ◦ $> docker compose up
    • Licensed under the Cloudera Community License
    • Unsupported
    • Community Group Hub for CSP
    • Find it on docs.cloudera.com under Applications
  10. Cloudera Flow and Edge Management

    Enable easy ingestion, routing, management and delivery of any data anywhere (Edge, cloud, data center) to any downstream system with built in end-to-end security and provenance.

    Advanced tooling to industrialize flow development (Flow Development Life Cycle)

    ACQUIRE
    • Over 300 Prebuilt Processors
    • Easy to build your own
    • Parse, Enrich & Apply Schema
    • Filter, Split, Merge & Route
    • Throttle & Backpressure
    FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG

    PROCESS
    HASH, MERGE, EXTRACT, DUPLICATE, SPLIT, ENCRYPT, TALL, EVALUATE, EXECUTE, GEOENRICH, SCAN, REPLACE, TRANSLATE, CONVERT, ROUTE TEXT, ROUTE CONTENT, ROUTE CONTEXT, ROUTE RATE, DISTRIBUTE LOAD

    DELIVER
    • Guaranteed Delivery
    • Full data provenance from acquisition to delivery
    • Diverse, Non-Traditional Sources
    • Eco-system integration
    FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG
  11. Cloudera DataFlow: Universal Data Distribution Service

    Connect to Any Data Source Anywhere then Process and Deliver to Any Destination

    [Diagram labels: Process, Route, Filter, Enrich, Transform, Distribute, Connectors, Any destination, Deliver, Ingest, Active, Passive, Connectors, Gateway Endpoint, Connect & Pull, Send, Data born in the cloud, Data born outside the cloud, Universal Data Distribution.]
  12. What is Apache NiFi?

    Apache NiFi is a scalable, real-time streaming data platform that collects, curates, and analyzes data so customers gain key insights for immediate actionable intelligence.

    © 2023 Cloudera, Inc. All rights reserved.
  13. Apache NiFi

    Enable easy ingestion, routing, management and delivery of any data anywhere (Edge, cloud, data center) to any downstream system with built in end-to-end security and provenance.

    ACQUIRE, PROCESS, DELIVER
    • Over 300 Prebuilt Processors
    • Easy to build your own
    • Parse, Enrich & Apply Schema
    • Filter, Split, Merge & Route
    • Throttle & Backpressure
    • Guaranteed Delivery
    • Full data provenance from acquisition to delivery
    • Diverse, Non-Traditional Sources
    • Eco-system integration

    Advanced tooling to industrialize flow development (Flow Development Life Cycle)

    FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG (shown twice on the slide)

    HASH, MERGE, EXTRACT, DUPLICATE, SPLIT, ROUTE TEXT, ROUTE CONTENT, ROUTE CONTEXT, CONTROL RATE, DISTRIBUTE LOAD, GEOENRICH, SCAN, REPLACE, TRANSLATE, CONVERT, ENCRYPT, TALL, EVALUATE, EXECUTE
  14. Apache NiFi Pulsar Connector

    https://streamnative.io/apache-nifi-connector/
  15. Flink SQL

    https://www.datainmotion.dev/2021/04/cloudera-sql-stream-builder-ssb-updated.html

    • Streaming Analytics
    • Continuous SQL
    • Continuous ETL
    • Complex Event Processing
    • Standard SQL Powered by Apache Calcite
  16. Flink SQL

        -- specify Kafka partition key on output
        SELECT foo AS _eventKey FROM sensors

        -- use event time timestamp from kafka
        -- exactly once compatible
        SELECT eventTimestamp FROM sensors

        -- nested structures access
        SELECT foo.'bar' FROM table; -- must quote nested column

        -- timestamps
        SELECT * FROM payments
        WHERE eventTimestamp > CURRENT_TIMESTAMP-interval '10' second;

        -- unnest
        SELECT b.*, u.* FROM bgp_avro b, UNNEST(b.path) AS u(pathitem)

        -- aggregations and windows
        SELECT card, MAX(amount) as theamount,
               TUMBLE_END(eventTimestamp, interval '5' minute) as ts
        FROM payments
        WHERE lat IS NOT NULL AND lon IS NOT NULL
        GROUP BY card, TUMBLE(eventTimestamp, interval '5' minute)
        HAVING COUNT(*) > 4 -- >4==fraud

        -- try to do this ksql!
        SELECT us_west.user_score+ap_south.user_score
        FROM kafka_in_zone_us_west us_west
        FULL OUTER JOIN kafka_in_zone_ap_south ap_south
        ON us_west.user_id = ap_south.user_id;

    Key Takeaway: Rich SQL grammar with advanced time and aggregation tools
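
The "aggregations and windows" query on this slide can be hard to read at a glance, so here is a small batch emulation of what it computes: payments are bucketed into 5-minute tumbling windows per card, and a window is kept only when it holds more than four payments. This is a sketch of the semantics only, not how Flink executes the query; the function name and the epoch-millisecond `ts` field are assumptions.

```python
from collections import defaultdict

WINDOW_MS = 5 * 60 * 1000  # interval '5' minute


def tumble_max(payments, min_count=5):
    """Emulate GROUP BY card, TUMBLE(ts, 5 min) HAVING COUNT(*) > 4.

    payments -- iterable of dicts with card, amount, ts (epoch ms), lat, lon
    Returns sorted rows of (card, max_amount, window_end_ms).
    """
    groups = defaultdict(list)
    for p in payments:
        # WHERE lat IS NOT NULL AND lon IS NOT NULL
        if p.get("lat") is None or p.get("lon") is None:
            continue
        # Tumbling windows are fixed-size, non-overlapping, epoch-aligned.
        window_start = p["ts"] - p["ts"] % WINDOW_MS
        groups[(p["card"], window_start)].append(p["amount"])
    rows = []
    for (card, window_start), amounts in groups.items():
        if len(amounts) >= min_count:  # HAVING COUNT(*) > 4
            # TUMBLE_END is the exclusive upper bound of the window.
            rows.append((card, max(amounts), window_start + WINDOW_MS))
    return sorted(rows)


if __name__ == "__main__":
    sample = [
        {"card": "A", "amount": a, "ts": i * 1000, "lat": 40.0, "lon": -74.0}
        for i, a in enumerate([10, 20, 30, 40, 50])
    ]
    print(tumble_max(sample))
```

A streaming engine emits each window's row once its watermark passes the window end; the batch version above only shows which rows would eventually be produced.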
  17. SQL Stream Builder (SSB)

    SQL STREAM BUILDER allows developers, analysts, and data scientists to write streaming applications with industry standard SQL.

    • No Java or Scala code development required.
    • Simplifies access to data in Kafka & Flink.
    • Connectors to batch data in HDFS, Kudu, Hive, S3, JDBC, CDC and more
    • Enrich streaming data with batch data in a single tool
    • Democratize access to real-time data with just SQL
  18. Easy Docker Demo

        docker exec -it pinot-controller /bin/bash

        docker exec -it pinot-controller bin/pinot-admin.sh JsonToPinotSchema \
          -timeColumnName ts \
          -metrics "temperature,humidity,co2,totalvocppb,equivalentco2ppm,pressure,temperatureicp,cputempf" \
          -dimensions "host,ipaddress" \
          -pinotSchemaName=thermal \
          -jsonFile=/data/thermal.json \
          -outputDir=/config

        docker exec -it pinot-controller bin/pinot-admin.sh AddSchema \
          -schemaFile /config/thermalschema.json \
          -exec
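
For orientation, the sketch below assembles by hand the kind of schema the `JsonToPinotSchema` step is meant to produce from the metric, dimension, and time column names on this slide. The top-level keys follow Pinot's schema JSON layout; the data types (`DOUBLE`, `STRING`, `LONG`) and the epoch-millisecond time format are assumptions, and the tool's actual output may differ.

```python
import json

METRICS = ["temperature", "humidity", "co2", "totalvocppb", "equivalentco2ppm",
           "pressure", "temperatureicp", "cputempf"]
DIMENSIONS = ["host", "ipaddress"]


def build_schema(name="thermal", time_column="ts"):
    """Assemble a Pinot schema dict; the data types here are assumptions."""
    return {
        "schemaName": name,
        "dimensionFieldSpecs": [
            {"name": d, "dataType": "STRING"} for d in DIMENSIONS
        ],
        "metricFieldSpecs": [
            {"name": m, "dataType": "DOUBLE"} for m in METRICS
        ],
        "dateTimeFieldSpecs": [{
            "name": time_column,
            "dataType": "LONG",
            "format": "1:MILLISECONDS:EPOCH",  # assumed time unit
            "granularity": "1:MILLISECONDS",
        }],
    }


if __name__ == "__main__":
    print(json.dumps(build_schema(), indent=2))
```

Comparing this against the generated file is a quick way to confirm that every sensor field landed in the intended category before the schema is added.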
  19. Local Apache Pinot Admin

        curl -X DELETE "http://localhost:9000/tables/thermal?type=realtime" -H "accept: application/json"

        curl -X DELETE "http://localhost:9000/schemas/thermal" -H "accept: application/json"

        docker exec -it pinot-controller bin/pinot-admin.sh AddSchema \
          -schemaFile /config/thermalschema.json \
          -exec

        curl -X POST "http://localhost:9000/tables" -H "accept: application/json" -H " ….
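
The controller calls on this slide can also be scripted. The sketch below builds the two DELETE requests and a table-creation POST without sending them (it skips the `AddSchema` step, which runs inside the container). The table config is hypothetical, since the slide's payload is truncated: the `streamConfigs` keys follow the Pinot Pulsar ingestion documentation as I understand it, while the topic name, broker URL, and flush thresholds are assumptions to verify against the linked reference.

```python
import json
import urllib.request

CONTROLLER = "http://localhost:9000"


def table_config(topic="persistent://public/default/thermal",
                 broker="pulsar://localhost:6650"):
    """Hypothetical REALTIME table config with Pulsar stream settings."""
    return {
        "tableName": "thermal",
        "tableType": "REALTIME",
        "segmentsConfig": {"timeColumnName": "ts", "replicasPerPartition": "1"},
        "tenants": {},
        "tableIndexConfig": {
            "loadMode": "MMAP",
            "streamConfigs": {
                "streamType": "pulsar",
                "stream.pulsar.topic.name": topic,
                "stream.pulsar.bootstrap.servers": broker,
                "stream.pulsar.consumer.type": "lowlevel",
                "stream.pulsar.consumer.prop.auto.offset.reset": "smallest",
                "stream.pulsar.decoder.class.name":
                    "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
                "stream.pulsar.consumer.factory.class.name":
                    "org.apache.pinot.plugin.stream.pulsar.PulsarConsumerFactory",
                "realtime.segment.flush.threshold.rows": "1000000",
                "realtime.segment.flush.threshold.time": "6h",
            },
        },
        "metadata": {"customConfigs": {}},
    }


def admin_requests():
    """Build (not send) the calls: drop table, drop schema, create table."""
    headers = {"accept": "application/json"}
    body = json.dumps(table_config()).encode("utf-8")
    return [
        urllib.request.Request(f"{CONTROLLER}/tables/thermal?type=realtime",
                               headers=headers, method="DELETE"),
        urllib.request.Request(f"{CONTROLLER}/schemas/thermal",
                               headers=headers, method="DELETE"),
        urllib.request.Request(
            f"{CONTROLLER}/tables", data=body, method="POST",
            headers={**headers, "Content-Type": "application/json"}),
    ]


if __name__ == "__main__":
    for req in admin_requests():
        print(req.get_method(), req.full_url)
```

Passing each request to `urllib.request.urlopen` would execute it against a running controller; expect an `HTTPError` from the DELETE calls when the table or schema does not exist yet.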