
ETL is dead; long-live streams

nehanarkhede
November 09, 2016


Slides from my keynote at QCon SF 2016 https://qconsf.com/sf2016/keynote/rise-real-time


Transcript

  1. ETL is dead;
    long-live streams
    Neha Narkhede,
    Co-founder & CTO, Confluent



  2. Data and data systems have really
    changed in the past decade


  3. Old world: Two popular locations for data:
    operational databases and the relational data warehouse
    [Diagram: several operational DBs feeding a central DWH]


  4. Several recent data trends are driving a
    dramatic change in the ETL architecture



  5. #1: Single-server databases are replaced
    by a myriad of distributed data
    platforms that operate at company-wide
    scale



  6. #2: Many more types of data sources
    beyond transactional data - logs, sensors,
    metrics...



  7. #3: Stream data is increasingly
    ubiquitous; need for faster processing
    than daily



  8. The end result? This is what data
    integration ends up looking like in
    practice


  9. [Diagram: apps wired point-to-point to search, Hadoop,
    the DWH, monitoring, security, MQs, and caches]

  10. A giant mess!
    [Same diagram: apps wired point-to-point to search, Hadoop,
    the DWH, monitoring, security, MQs, and caches]


  11. We will see how transitioning to streams
    cleans up this mess and works towards...


  12. [Diagram: a streaming platform at the center; apps, search,
    NoSQL, monitoring, security, the DWH, and Hadoop connect to it
    via request-response, messaging, or stream processing, with
    streaming data pipelines carrying changelogs]

  13. A short history of data
    integration



  14. ETL surfaced in the 1990s in retail
    organizations for analyzing buyer trends


  15. Extract data from databases
    Transform into destination warehouse schema
    Load into a central data warehouse
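The three steps above can be sketched as a tiny batch job (an illustrative sketch only; the table shape and field names are made up):

```python
# Minimal sketch of classic batch ETL (all names are hypothetical).

def extract(source_rows):
    """Extract: pull raw rows from an operational database."""
    return list(source_rows)

def transform(rows):
    """Transform: reshape rows into the destination warehouse schema."""
    return [
        {"product_id": r["id"], "amount_usd": round(r["amount"], 2)}
        for r in rows
        if r.get("amount") is not None  # drop malformed rows during cleansing
    ]

def load(warehouse, rows):
    """Load: append the transformed rows into the central warehouse table."""
    warehouse.extend(rows)

warehouse = []
source = [{"id": 1, "amount": 9.99}, {"id": 2, "amount": None}]
load(warehouse, transform(extract(source)))
```

Note how the cleansing logic lives inside the one transform step; the later slides argue this coupling is exactly what breaks down.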



  16. BUT … although ETL tools have been around for a long
    time, data coverage in data warehouses is still low! WHY?

  17. ETL has drawbacks


  18. #1: The need for a global schema



  19. #2: Data cleansing and curation are
    manual and fundamentally error-prone


  20. #3: The operational cost of ETL is high;
    it is slow, time- and resource-intensive


  21. #4: ETL tools were built to narrowly
    focus on connecting databases and the
    data warehouse in a batch fashion



  22. Early take on real-time ETL
    =
    Enterprise Application Integration (EAI)



  23. EAI: A different class of data integration
    technology for connecting applications in
    real-time



  24. EAI employed Enterprise Service Buses
    and MQs, which weren’t scalable

  25. ETL and EAI are
    outdated!


  26. Old world: scale or timely data, pick one
    [Chart, real-time vs. scale: EAI is real-time but not
    scalable; ETL is scalable but batch]


  27. Data integration and ETL in the modern
    world need a
    complete revamp


  28. New world: streaming, real-time and scalable
    [Chart, real-time vs. scale: EAI is real-time but not scalable;
    ETL is scalable but batch; a streaming platform is both
    real-time and scalable]


  29. The modern streaming world has a new set
    of requirements for data integration


  30. #1: Ability to process high-volume and
    high-diversity data



  31. #2: Real-time from the ground up; a
    fundamental transition to
    event-centric thinking

  32. Event-Centric Thinking
    [Diagram: a web app publishes the event “A product was viewed”
    to the streaming platform; Hadoop consumes it]

  33. Event-Centric Thinking
    [Diagram: a web app, a mobile app, and APIs all publish
    “A product was viewed” to the streaming platform; Hadoop consumes it]

  34. Event-Centric Thinking
    [Diagram: the web app, mobile app, and APIs publish
    “A product was viewed”; Hadoop, security, monitoring, and a
    recommendation engine all consume it]


  35. Event-centric thinking, when applied at a
    company-wide scale, leads to this
    simplification ...


  36. [Diagram: a streaming platform at the center connecting many
    apps, the DWH, and Hadoop via request-response, messaging, or
    stream processing, with streaming data pipelines carrying changelogs]


  37. #3: Enable forward-compatible
    data architecture; the ability to add more
    applications that need to process the
    same data … differently



  38. To enable forward compatibility, redefine the T in ETL:
    Clean data in; Clean data out

  39. [Diagram: app logs from many apps flowing into the DWH]
    #1: Extract as unstructured text
    #2: Transform1 = data cleansing = “what is a product view”
    #3: Load into DWH
    #4: Transform2 = drop PII fields

  40. [Diagram: the same pipeline duplicated per destination]
    For the DWH: extract as unstructured text; Transform1 = data
    cleansing = “what is a product view”; load cleansed data;
    Transform2 = drop PII fields.
    For Cassandra: extract as unstructured text again, then the same
    cleansing, load, and PII-dropping steps are repeated.

  41. [Diagram: the same pipelines via a streaming platform]
    #1: Extract as structured product view events
    #2: Transform = drop PII fields
    #4.1: Load the product view stream into the DWH
    #4.2: Load the filtered product view stream into Cassandra


  42. To enable forward compatibility, redefine the T in ETL:
    Data transformations, not data cleansing!

  43. [Diagram: via the streaming platform]
    #1: Extract once as structured product view events
    #2: Transform once = drop PII fields and enrich with product metadata
    #4.1: Load the product views stream into the DWH
    #4.2: Load the filtered and enriched product views stream into Cassandra


  44. Forward compatibility =
    Extract clean data once; Transform it many different
    ways before Loading into the respective destinations
    … as and when required
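The “extract once, transform many” idea can be sketched as follows (event shape, transform names, and the catalog are hypothetical; plain Python, not an actual Kafka API):

```python
# Extract once: a single stream of structured product-view events.
events = [
    {"user": "alice", "product": "book", "ip": "10.0.0.1"},
    {"user": "bob", "product": "mug", "ip": "10.0.0.2"},
]

def drop_pii(event):
    """Transform for every destination: remove personally identifiable fields."""
    return {k: v for k, v in event.items() if k not in {"user", "ip"}}

def enrich(event, catalog):
    """Transform for the serving store: add product metadata."""
    return {**event, "category": catalog.get(event["product"], "unknown")}

catalog = {"book": "media", "mug": "kitchen"}

# Each destination applies its own transforms to the SAME extracted stream.
dwh_stream = [drop_pii(e) for e in events]                          # load into the DWH
cassandra_stream = [enrich(drop_pii(e), catalog) for e in events]   # load into Cassandra
```

A new destination with different needs just adds another transform over the same clean stream; nothing re-extracts from the source.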



  45. In summary, what does a modern data
    integration solution need?
    Scale, diversity, latency and forward compatibility

  46. Requirements for a modern streaming data
    integration solution:
    - Fault tolerance
    - Parallelism
    - Latency
    - Delivery semantics
    - Operations and monitoring
    - Schema management

  47. Data integration: platform vs tool
    A platform is central, reusable infrastructure for many
    use cases; a tool is a one-off, non-reusable solution
    for a particular use case

  48. New shiny future of ETL: a streaming platform
    [Diagram: the streaming platform connecting NoSQL, RDBMS,
    Hadoop, the DWH, apps, search, monitoring, and real-time analytics]


  49. Streaming platform serves as the
    central nervous system for a
    company’s data in the following ways ...



  50. #1: Serves as the real-time, scalable
    messaging bus for applications; no
    EAI



  51. #2: Serves as the source-of-truth
    pipeline for feeding all data processing
    destinations: Hadoop, the DWH, NoSQL
    systems and more


  52. #3: Serves as the building block for
    stateful stream processing
    microservices



  53. Batch data integration
    Streaming



  54. Batch ETL
    Streaming


  55. - a short history of data integration
    - drawbacks of ETL
    - needs and requirements for a streaming platform
    - new, shiny future of ETL: a streaming platform
    - What does a streaming platform look like, and how
    does it enable streaming ETL?

  56. Apache Kafka:
    a distributed streaming platform

  57. Apache Kafka 6 years ago

  58. > 1,400,000,000,000
    messages processed / day

  59. Now adopted at thousands of companies worldwide


  60. What role does Kafka play in the new
    shiny future for data integration?



  61. #1: Kafka is the de-facto storage of
    choice for stream data


  62. The log
    [Diagram: an append-only sequence of records at offsets 0–7;
    the next write goes to offset 8; reader 1 and reader 2 each
    consume from their own position]
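The log abstraction in the diagram can be sketched in a few lines (a toy in-memory model, not Kafka’s actual implementation):

```python
class Log:
    """A toy append-only log: writers append at the next offset,
    and each reader tracks its own position independently."""

    def __init__(self):
        self.records = []

    def append(self, record):
        """Append a record and return its offset."""
        self.records.append(record)
        return len(self.records) - 1

    def read(self, offset):
        """Return (record, next_offset), or None if the reader is caught up."""
        if offset >= len(self.records):
            return None
        return self.records[offset], offset + 1

log = Log()
for msg in ["a", "b", "c"]:
    log.append(msg)

# Two readers at independent positions, like reader 1 and reader 2 on the slide.
r1, r2 = 0, 2
record1, r1 = log.read(r1)  # reader 1 consumes from the beginning
record2, r2 = log.read(r2)  # reader 2 is further along in the same log
```

Because consumers only track offsets, adding a new reader never disturbs existing ones; this is the property the rest of the talk builds on.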


  63. The log & pub-sub
    [Same diagram: a publisher appends to the log; subscriber 1
    and subscriber 2 each read from their own offset]


  64. #2: Kafka offers a scalable
    messaging backbone for application
    integration


  65. Kafka messaging APIs: scalable EAI
    [Diagram: apps call produce(message) and consume(message)
    against the messaging APIs]


  66. #3: Kafka enables building streaming
    data pipelines (E & L in ETL)


  67. Kafka’s Connect API: streaming data ingestion
    [Diagram: a source feeds the Connect API (Extract) into the
    messaging APIs; the Connect API (Load) feeds a sink]
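A Connect source or sink is configured rather than coded. As one concrete example, the file-source connector shipped with the Kafka quickstart is configured roughly like this (file and topic names follow the quickstart defaults; adjust for a real deployment):

```properties
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=test.txt
topic=connect-test
```

Run through a Connect worker, this extracts each line of the file into the connect-test topic; a matching sink connector config handles the Load side.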



  68. #4: Kafka is the basis for stream
    processing and transformations


  69. Kafka’s Streams API: stream processing (transforms)
    [Diagram: the Connect API extracts from a source; apps built
    on the Streams API run the transforms; the Connect API loads
    into a sink]

  70. Kafka’s connect API
    =
    E and L in Streaming ETL


  71. Connectors!
    [Diagram: connectors for NoSQL, RDBMS, Hadoop, the DWH,
    search, monitoring, real-time analytics, and apps]

  72. How to keep data centers in sync?

  73. Sources and sinks
    [Diagram: the Connect API extracts from a source and loads
    into a sink]

  74. changelogs


  75. Transforming changelogs


  76. Kafka’s Connect API = Connectors Made Easy!
    - Scalability: leverages Kafka for scalability
    - Fault tolerance: builds on Kafka’s fault tolerance model
    - Management and monitoring: one way of monitoring all connectors
    - Schemas: offers an option for preserving schemas from source to sink

  77. Kafka all the things!
    Connect API


  78. Kafka’s streams API
    =
    The T in STREAMING ETL



  79. Stream processing =
    transformations on stream data


  80. Two visions for stream processing:
    real-time MapReduce vs event-driven microservices

  81. Two visions for stream processing
    Real-time MapReduce:
    - Central cluster
    - Custom packaging, deployment & monitoring
    - Suitable for analytics-type use cases
    Event-driven microservices:
    - Embedded library in any Java app
    - Just Kafka and your app
    - Makes stream processing accessible to any use case

  82. Vision 1: real-time MapReduce

  83. Vision 2: event-driven microservices => Kafka’s Streams API
    [Diagram: a microservice embeds the Streams API to run its transforms]


  84. Kafka’s Streams API = Easiest way to do
    stream processing using Kafka



  85. #1: Powerful and lightweight Java
    library; you need just Kafka and your app


  86. #2: Convenient DSL with all sorts of
    operators: join(), map(), filter(), windowed
    aggregates, etc.

  87. Word count program using Kafka’s streams API
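The slide’s code itself didn’t survive the transcript. As a rough sketch of the same flat-map, group-by, count logic (plain Python standing in for the Streams DSL, not the actual API):

```python
import re
from collections import Counter

def word_count(lines):
    """Mimic the word-count topology: flat-map each line into
    lowercased words, group by word, and keep a running count."""
    counts = Counter()
    for line in lines:  # each record arriving on the input stream
        for word in re.findall(r"\w+", line.lower()):
            counts[word] += 1  # update the per-word aggregate
    return counts

counts = word_count(["all streams lead to Kafka", "hello kafka streams"])
```

In the real Streams API the equivalent Java pipeline is a few DSL calls (flatMapValues, groupBy, count) over a KStream, with the counts maintained as a continuously updated table.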



  88. #3: True event-at-a-time stream
    processing; no microbatching



  89. #4: Dataflow-style windowing based on
    event-time; handles late-arriving data
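To make the idea concrete, here is a toy sketch of event-time tumbling windows in which a late-arriving record still counts toward its original window (window size and event shape are arbitrary choices; not the Streams API):

```python
from collections import defaultdict

WINDOW_MS = 60_000  # one-minute tumbling windows (arbitrary choice)

def window_start(event_time_ms):
    """Assign a record to a window by its event-time, not its arrival time."""
    return event_time_ms - (event_time_ms % WINDOW_MS)

counts = defaultdict(int)

# (event_time_ms, value) pairs; the third record ARRIVES late but
# carries an event-time inside the first window.
events = [(5_000, "view"), (65_000, "view"), (30_000, "view")]
for ts, _ in events:
    counts[window_start(ts)] += 1  # the late record updates window 0 retroactively
```

Processing-time windowing would have lumped the late record into whatever window was open when it arrived; event-time windowing credits it to where it belongs.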



  90. #5: Out-of-the-box support for local
    state; supports fast stateful processing


  91. External state


  92. local state


  93. Fault-tolerant local state
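The mechanism behind fault-tolerant local state can be sketched simply: every update to the local store is also appended to a changelog, so a restarted instance rebuilds its store by replaying it (a toy model with a list standing in for a compacted Kafka topic; not Kafka’s implementation):

```python
class FaultTolerantStore:
    """Toy local state store backed by a changelog."""

    def __init__(self, changelog):
        self.changelog = changelog
        self.state = {}
        for key, value in changelog:  # restore: replay the changelog on startup
            self.state[key] = value

    def put(self, key, value):
        self.changelog.append((key, value))  # record the update durably first
        self.state[key] = value

changelog = []
store = FaultTolerantStore(changelog)
store.put("clicks:alice", 1)
store.put("clicks:alice", 2)

# Simulate a crash: a fresh instance rebuilds its state from the changelog.
recovered = FaultTolerantStore(changelog)
```

The local store stays fast (in-memory or on local disk) while the changelog provides durability, which is how fast stateful processing and fault tolerance coexist.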



  94. #6: Kafka’s Streams API allows
    reprocessing; useful to upgrade apps or
    do A/B testing


  95. reprocessing


  96. Real-time dashboard for security monitoring


  97. Kafka’s Streams API: simple is beautiful
    [Diagram contrasting Vision 1 and Vision 2]

  98. Logs unify batch and stream processing


  99. New shiny future of ETL: Kafka
    [Diagram: the Connect API extracts from a source; an app on
    the Streams API runs the transforms; the Connect API loads
    into a sink]

  100. A giant mess!
    [Same diagram as before: apps wired point-to-point to search,
    Hadoop, the DWH, monitoring, security, MQs, and caches]

  101. All your data … everywhere … now
    [Diagram: the streaming platform at the center connecting many
    apps, the DWH, and Hadoop via request-response, messaging, or
    stream processing, with streaming data pipelines carrying changelogs]

  102. VISION: All your data … everywhere … now
    [Same diagram as the previous slide]

  103. Thank you!
    @nehanarkhede
