Dunith Dhanushka
September 16, 2024

Building Streaming ETL Pipelines with Redpanda Ecosystem

Streaming ETL pipelines ingest and process data as it arrives, minimizing latency. Traditional batch ETL pipelines, by contrast, process data in periodic batches, which often leaves their output inconsistent with the source data set. Streaming ETL pipelines feed various downstream systems, including real-time analytics, BI, and machine learning, all of which need cleansed and normalized data to operate smoothly.
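The difference between the two styles can be sketched in a few lines of Python. This is only an illustration, not code from the talk: the in-memory `source` generator stands in for a real event stream, and the names `streaming_etl`, `transform`, and `trace` are hypothetical.

```python
from typing import Callable, Iterable, Iterator

Event = dict


def streaming_etl(
    source: Iterable[Event], transform: Callable[[Event], Event]
) -> Iterator[Event]:
    """Transform and emit each event as soon as it arrives (no batching)."""
    for event in source:
        yield transform(event)


trace = []  # records the order in which events are produced and loaded


def source() -> Iterator[Event]:
    # Hypothetical stand-in for a live event stream.
    for i in range(3):
        trace.append(f"produced {i}")
        yield {"id": i}


def transform(event: Event) -> Event:
    # Placeholder transformation: mark the event as processed.
    return {**event, "processed": True}


for out in streaming_etl(source(), transform):
    trace.append(f"loaded {out['id']}")  # the "load" step of ETL
```

Because the pipeline is lazy, `trace` interleaves production and loading event by event, whereas a batch pipeline would wait for all three events before loading any of them.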

Making streaming ETL pipelines serverless offers several benefits for businesses and technical teams. From a business perspective, serverless streaming ETL can reduce operational costs, improve scalability, and enhance agility. Since there is no need to provision or manage infrastructure, organizations can save on hardware, software, and IT resources.
From a technical standpoint, serverless streaming ETL simplifies the development and deployment process, enabling teams to focus on building business logic rather than managing infrastructure.

In this talk, we design, build, and run a streaming ETL pipeline using several serverless streaming data technologies. That includes Redpanda, a streaming data platform with Kafka API compatibility, Redpanda Connect as the streaming ETL engine, and Apache Pinot as the serving layer database.
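Since Redpanda exposes the Kafka API, ingestion can in principle use any Kafka client. The sketch below is an assumption-laden illustration rather than the talk's code: the topic name `payments`, the event fields, and the helper names are hypothetical. It takes any object with a `send(topic, key=..., value=...)` method, which is the shape of `KafkaProducer.send` in the kafka-python library, and uses an in-memory stand-in so it runs without a broker.

```python
import json

PAYMENTS_TOPIC = "payments"  # hypothetical topic name


def encode_payment(event: dict) -> tuple:
    """Serialize a payment event into (key, value) bytes for a Kafka-style record."""
    key = str(event["payment_id"]).encode("utf-8")
    value = json.dumps(event, sort_keys=True).encode("utf-8")
    return key, value


def publish(producer, event: dict, topic: str = PAYMENTS_TOPIC) -> None:
    """Send one event through any producer exposing send(topic, key=, value=)."""
    key, value = encode_payment(event)
    producer.send(topic, key=key, value=value)


class RecordingProducer:
    """In-memory stand-in for a Kafka client, used here so no broker is needed."""

    def __init__(self):
        self.records = []

    def send(self, topic, key=None, value=None):
        self.records.append((topic, key, value))


fake = RecordingProducer()
publish(fake, {"payment_id": 42, "amount": "19.99", "currency": "usd"})
```

Against a real cluster, `fake` would be replaced by a Kafka client configured with the cluster's bootstrap servers and credentials; those settings depend on the deployment and are not shown here.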

Data professionals, including developers, data engineers, and architects, will benefit from this talk by learning how to piece together different serverless technologies to build a real-time data pipeline. They will also see the pipeline in action, scaling up and down to accommodate demand spikes that simulate realistic situations.

Transcript

  1. © 2023 REDPANDA DATA. About the presenter: Dunith Dhanushka, Senior Developer Advocate, Redpanda Data • Event streaming, real-time analytics, and stream processing enthusiast • Frequent blogger, speaker, and educator
  2. Agenda 1. Two ways of building data pipelines. 2. Streaming ETL use case - Payment processing. 3. Making it serverless. 4. Demo. 5. Wrap up.
  3. A data pipeline is a series of processes and tools that automate the flow, transformation, and storage of data from various sources to a final destination for analysis or use.
  4. Batch Pipelines • Comparatively easy to build, debug, and learn. • Increased latency is a concern for data freshness. • Most common in the space and easy to get started with.
  5. Streaming Pipelines • Extracts, transforms, and loads data as it is generated. • Ideal for latency-sensitive use cases, like fraud detection, recommendation, etc. • Challenging to implement, debug, and scale.
  6. Pipeline goals • PI redaction - Scrub sensitive fields for compliance. • Data transformation - Normalize and optimize data for downstream systems.
  7. Redpanda Serverless • A Kafka API-compatible streaming data platform. • Written in C++, offering more performance and resource efficiency than Kafka. • Simpler to work with and developers love it!
  8. Redpanda’s role in the solution • Support high-throughput, low-latency payment data ingestion. • Offer scalable and cost-efficient long-term data retention. • Store transformed data and allow scalable downstream consumption.
  9. Decodable • A serverless platform for building real-time ETL pipelines. • Managed Apache Flink and Debezium as a service.
  10. Decodable’s role in the solution • Redact PIs and transform payment events with Flink SQL. • Manage the Flink job. • Scale the processing as needed.
  11. Benefits of making it serverless • Management overhead is taken care of by vendors. • Usage-based pricing, pay-as-you-grow! • Gentler learning curve for developers, reduced onboarding time. • On-demand scaling, storage, and compute.
  12. Concerns • Data sovereignty. • Security - data at rest as well as data in transit. • Interoperability.
  13. © 2024 REDPANDA DATA. Keep Learning • University: Self-paced, online courses. https://university.redpanda.com • Docs: Get a peek under the hood. https://docs.redpanda.com/ • Slack: Engage with our community. https://redpanda.com/slack • Blogs: Keep up to date with Redpanda. https://redpanda.com/blog • Code: Check out the source. https://github.com/redpanda-data • Serverless: Or just get started in seconds! https://redpanda.com/try-redpanda
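The two pipeline goals named in the transcript, redacting sensitive fields and normalizing data for downstream systems, can be sketched as a single transformation step. The slides describe doing this with Flink SQL; the Python below is a stand-in illustration only, and the field names (`card_number`, `cardholder_name`, `email`, `amount`, `currency`) and the `[REDACTED]` marker are hypothetical, not taken from the talk.

```python
from decimal import Decimal

# Hypothetical list of sensitive fields in a payment event.
SENSITIVE_FIELDS = ("card_number", "cardholder_name", "email")
REDACTED = "[REDACTED]"


def redact(event: dict) -> dict:
    """Return a copy of the event with sensitive fields scrubbed."""
    out = dict(event)
    for field in SENSITIVE_FIELDS:
        if field in out:
            out[field] = REDACTED
    return out


def normalize(event: dict) -> dict:
    """Return a copy with an upper-case currency code and an integer amount in cents."""
    out = dict(event)
    out["currency"] = str(out["currency"]).upper()
    # Decimal avoids float rounding errors when converting to cents.
    out["amount_cents"] = int(Decimal(str(out.pop("amount"))) * 100)
    return out


def transform(event: dict) -> dict:
    """Apply redaction first, then normalization."""
    return normalize(redact(event))


raw = {
    "payment_id": 42,
    "card_number": "4111111111111111",
    "cardholder_name": "Jane Doe",
    "amount": "19.99",
    "currency": "usd",
}
clean = transform(raw)
```

Both functions return copies rather than mutating their input, so the raw event stays available for auditing or replay if the pipeline design calls for it.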