
ETL is dead; long-live streams

nehanarkhede
November 09, 2016


Slides from my keynote at QCon SF 2016 https://qconsf.com/sf2016/keynote/rise-real-time



Transcript

  1. “#1: Single-server databases are replaced by a myriad of distributed data platforms that operate at company-wide scale”
  2. [Diagram: “A giant mess!” of point-to-point connections among apps, search, Hadoop, the DWH, monitoring, security, message queues, and caches]
  3. [Diagram: a streaming platform at the center, connecting the DWH, Hadoop, security, apps, search, NoSQL, and monitoring via request-response, messaging or stream processing, streaming data pipelines, and changelogs]
  4. “BUT … ETL tools have been around for a long time, yet data coverage in data warehouses is still low! WHY?”
  5. “#3: Operational cost of ETL is high; it is slow, time- and resource-intensive”
  6. “#4: ETL tools were built to narrowly focus on connecting databases and the data warehouse in a batch fashion”
  7. Old world: scale or timely data, pick one. EAI is real-time but not scalable; ETL is scalable but batch.
  8. New world: streaming, real-time and scalable. EAI is real-time but not scalable; ETL is scalable but batch; a streaming platform is real-time AND scalable.
  9. Event-centric thinking: [Diagram: an “A product was viewed” event flows from the mobile app, web app, and APIs through the streaming platform to Hadoop, security, monitoring, and the rec engine]
  10. [Diagram: a streaming platform connecting the DWH, Hadoop, and many apps via request-response, messaging or stream processing, streaming data pipelines, and changelogs]
  11. “#3: Enable forward-compatible data architecture; the ability to add more applications that need to process the same data … differently”
  12. Old-world pipeline for app logs: #1: Extract as unstructured text; #2: Transform1 = data cleansing = “what is a product view”; #3: Load into the DWH; #4: Transform2 = drop PII fields
  13. The same pipeline, repeated per destination: extract the same logs as unstructured text again, run the same cleansing (“what is a product view”) and PII-dropping transforms again, then load the cleansed data into the DWH and into Cassandra separately
  14. New-world pipeline: #1: Extract as structured product view events; #2: Transform = drop PII fields; #3: Load the product view stream into the streaming platform; #4.1: Load the filtered product view stream into the DWH; #4.2: Load the filtered product view stream into Cassandra
  15. “To enable forward compatibility, redefine the T in ETL: data transformations, not data cleansing!”
  16. #1: Extract once as structured product view events; #2: Transform once = drop PII fields and enrich with product metadata; #4.1: Load the filtered and enriched product views stream into the DWH; #4.2: Load the filtered and enriched product views stream into Cassandra
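The extract-once / transform-once / load-many pattern on this slide can be sketched in a few lines of illustrative Python. This is a toy in-memory stand-in, not Kafka code: the event fields, PII field names, product-metadata table, and list-backed “sinks” are all made up for the example.

```python
# Assumed PII fields and product-metadata lookup (illustrative only).
PII_FIELDS = {"user_id", "ip_address"}
PRODUCT_METADATA = {"p-42": {"category": "books"}}

def extract(raw_events):
    """#1: Extract once, as structured product-view events."""
    return [e for e in raw_events if e.get("type") == "product_view"]

def transform(event):
    """#2: Transform once: drop PII fields, enrich with product metadata."""
    clean = {k: v for k, v in event.items() if k not in PII_FIELDS}
    clean.update(PRODUCT_METADATA.get(event["product_id"], {}))
    return clean

def load(events, *sinks):
    """#4.x: Load the same transformed stream into every destination."""
    for sink in sinks:
        sink.extend(events)

raw = [
    {"type": "product_view", "product_id": "p-42",
     "user_id": "u1", "ip_address": "10.0.0.1"},
    {"type": "heartbeat"},  # dropped during extraction
]
dwh, cassandra = [], []          # stand-ins for the DWH and Cassandra
views = [transform(e) for e in extract(raw)]
load(views, dwh, cassandra)      # both sinks receive identical clean data
```

The point of the sketch is that the extraction and the PII/enrichment transform each run once, and every new destination is just another argument to `load`.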
  17. “Forward compatibility = Extract clean data once; Transform it in many different ways before Loading into the respective destinations … as and when required”
  18. “In summary, the needs of a modern data integration solution: scale, diversity, latency, and forward compatibility”
  19. Requirements for a modern streaming data integration solution: fault tolerance, parallelism, latency, delivery semantics, operations and monitoring, schema management
  20. Data integration, platform vs. tool: a platform is central, reusable infrastructure for many use cases; a tool is a one-off, non-reusable solution for a particular use case
  21. New shiny future of ETL: a streaming platform [Diagram: the platform connecting NoSQL, RDBMS, Hadoop, the DWH, apps, search, monitoring, and real-time analytics]
  22. “The streaming platform serves as the central nervous system for a company’s data in the following ways ...”
  23. “#2: Serves as the source-of-truth pipeline for feeding all data processing destinations: Hadoop, the DWH, NoSQL systems, and more”
  24. Agenda: a short history of data integration; drawbacks of ETL; needs and requirements for a streaming platform; the new, shiny future of ETL (a streaming platform); what a streaming platform looks like and how it enables streaming ETL
  25. “What role does Kafka play in the new shiny future for data integration?”
  26. The log: [Diagram: an ordered sequence of records at offsets 0–7, with a next-write position at the tail and two readers at independent offsets]
  27. The log & pub-sub: [Diagram: the same log with a publisher appending at the tail and two subscribers reading from independent offsets]
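The log-as-pub-sub idea on these two slides can be sketched as a toy in-memory Python model (a hypothetical stand-in for a Kafka partition, not the real client API): the publisher only appends, data is never removed on read, and each subscriber owns its own offset.

```python
class Log:
    """Append-only log: the publisher appends at the tail;
    reading never consumes or removes records."""
    def __init__(self):
        self.records = []

    def append(self, record):           # publisher side
        self.records.append(record)
        return len(self.records) - 1    # offset of the new record

class Subscriber:
    """Each subscriber tracks its own read position independently."""
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        """Return everything published since the last poll."""
        batch = self.log.records[self.offset:]
        self.offset = len(self.log.records)
        return batch

log = Log()
s1, s2 = Subscriber(log), Subscriber(log)
log.append("a"); log.append("b")
first = s1.poll()        # s1 catches up to the tail
log.append("c")          # s2 has read nothing yet; no data is lost
```

Because readers only move a cursor, a slow or newly added subscriber still sees every record, which is what makes the log usable as both messaging and a source-of-truth pipeline.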
  28. Kafka’s Connect API: streaming data ingestion [Diagram: a source app feeds the Connect API (Extract), and the Connect API feeds a sink (Load), alongside the messaging APIs]
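Connectors in Kafka Connect are configured declaratively rather than coded. As a hedged illustration, this standalone-mode properties sketch follows the shape of the FileStreamSource example connector that ships with Kafka; the file path and topic name are made up.

```properties
# Hypothetical standalone Connect config for a file source (Extract side).
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/var/log/app/events.txt   # made-up source file
topic=product-views            # made-up destination topic
```

A matching sink connector on the Load side would be another such config file, which is why the slide shows Extract and Load as two Connect API boxes around the platform.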
  29. Kafka’s Streams API: stream processing (transforms) [Diagram: apps embed the Streams API between the Connect-based Extract and Load stages, performing the Transforms]
  30. Kafka’s Connect API = Connectors Made Easy! - Scalability: leverages Kafka for scalability - Fault tolerance: builds on Kafka’s fault-tolerance model - Management and monitoring: one way of monitoring all connectors - Schemas: offers an option for preserving schemas from source to sink
  31. Two visions for stream processing: real-time MapReduce (a central cluster; custom packaging, deployment & monitoring; suitable for analytics-type use cases) vs. event-driven microservices (an embedded library in any Java app; just Kafka and your app; makes stream processing accessible to any use case)
  32. “#2: Convenient DSL with all sorts of operators: join(), map(), filter(), windowed aggregates, etc.”
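To make the operator style concrete, here is a toy Python imitation of such a DSL over a finite list of timestamped events. This is not the real Kafka Streams API (which is Java); the operator names, event format, and tumbling-window helper are invented for the sketch.

```python
from collections import Counter

# Each record is a (timestamp_ms, value) pair.
def stream_map(records, fn):
    """map(): transform each value, keeping its timestamp."""
    return [(t, fn(v)) for t, v in records]

def stream_filter(records, pred):
    """filter(): keep only records whose value matches the predicate."""
    return [(t, v) for t, v in records if pred(v)]

def windowed_count(records, window_ms):
    """A windowed aggregate: count records per tumbling window."""
    return Counter(t // window_ms for t, _ in records)

events = [(10, "view:p1"), (20, "view:p2"),
          (1010, "click:p1"), (1020, "view:p1")]
views = stream_filter(events, lambda v: v.startswith("view:"))
products = stream_map(views, lambda v: v.split(":", 1)[1])
counts = windowed_count(products, window_ms=1000)
```

The real DSL composes the same kinds of operators over unbounded streams instead of lists, with the windowing handling out-of-order and late-arriving data.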
  33. New shiny future of ETL: Kafka [Diagram: the Connect API handles Extract from the source and Load into the sink, while the Streams API in the app performs the Transforms]
  34. [Diagram, repeated: “A giant mess!” of point-to-point connections among apps, search, Hadoop, the DWH, monitoring, security, message queues, and caches]
  35. All your data … everywhere … now [Diagram: a streaming platform connecting the DWH, Hadoop, and many apps via request-response, messaging or stream processing, streaming data pipelines, and changelogs]
  36. VISION: All your data … everywhere … now [Diagram: the same streaming-platform hub as on the previous slide]