Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Data Lake, Real-time Analytics, or Both? Exploring Presto and ClickHouse

Ahana
March 22, 2023

Data Lake, Real-time Analytics, or Both? Exploring Presto and ClickHouse

Big data these days means big, fast, or both, and there’s a lot of technologies that promise to fulfill various pieces of that big data architecture.

Join us for this webinar we’re delivering in partnership with Altinity where we’ll explore open source big data solutions. We’ll contrast Presto, the leading SQL Query engine for data lakes, with ClickHouse, the DBMS champ for real-time analytics. After framing the problem with relevant use cases we’ll dig into solutions using Presto and ClickHouse.

You can expect a deep dive into the plumbing that exposes key trade-offs between approaches. Join us for an enlightening discussion filled with practical advice for your next big data project!

Ahana

March 22, 2023
Tweet

More Decks by Ahana

Other Decks in Technology

Transcript

  1. Data Lake, Real-time
    Analytics, or Both
    Robert Hodges – Rohan Pednekar
    1

    View Slide

  2. Let’s make some introductions
    2
    Robert Hodges
    Database geek with 30+ years
    on DBMS systems. Day job:
    CEO at Altinity
    Rohan Pednekar
    Sr. Product Manager at Ahana,
    Open Source Evangelist

    View Slide

  3. …And introduce our companies
    3
    Altinity.Cloud Platform for
    ClickHouse
    Real-time data in the cloud,
    on Kubernetes, and on-prem
    Fully managed Presto Service
    on AWS
    Query your AWS S3 Data
    Lakes with SQL

    View Slide

  4. 4
    Let’s discuss data lake and
    real-time analytic approaches
    Data Lakes & Real Time Analytics

    View Slide

  5. What are data lake analytics?
    Beyond
    Enterprise Data
    IoT, Third-party,
    Telemetry, Event
    1000X
    More Data
    Terabytes to
    Petabytes
    Open &
    Flexible
    Open Source,
    Open Formats
    Reporting &
    Dashboarding
    Data
    Science
    In-data lake
    transformation
    Data Lake
    5

    View Slide

  6. What are real-time analytics?
    6
    Rapid Data
    Ingest
    (Millions of
    rows/sec.)
    Financial
    Services
    Network
    Management
    Security Event & Incident
    Monitoring (SEIM)
    Observability
    Real-time
    Monitoring
    Unaggregated source data
    BI tools - Slicing & dicing queries
    (Stable, sub-second response)
    Aggregated
    data in views
    APIs - Instant aggregates
    (< 20ms response)

    View Slide

  7. 7
    How do you know which approach
    is best for you?
    Data Lakes & Real Time Analytics

    View Slide

  8. Presto: SQL
    Query Engine for
    big data
    8

    View Slide

  9. Today’s Challenges for Data Engineers & Data Architects
    ● Storage and Compute
    ● Diverse Data Sources
    ● Managing different SQL dialects
    ● Onboarding time
    ● Cost of proprietary systems
    9

    View Slide

  10. What do we need for cloud based analytics?
    10

    View Slide

  11. What do we need for cloud based analytics?
    11

    View Slide

  12. What do we need for cloud based analytics?
    12

    View Slide

  13. What do we need for cloud based analytics?
    13

    View Slide

  14. Cloud Data Warehouse is an answer
    14

    View Slide

  15. Where can we save money?
    15

    View Slide

  16. Where can we save money? Storage!
    16

    View Slide

  17. Metadata tables -> Catalog
    17

    View Slide

  18. Metadata tables -> Catalog
    18

    View Slide

  19. You probably already use something else here
    19

    View Slide

  20. May as well be Open Source!
    20

    View Slide

  21. But what about SQL?
    21

    View Slide

  22. Welcome to the Open SQL Data Lakehouse!
    22

    View Slide

  23. What is Presto?
    23
    ● Open source,
    distributed SQL query
    engine for the data lake
    & lakehouse
    ● Designed from ground
    up for fast analytic
    queries against data of
    any size
    ● Query in place - no
    need to move data
    ● Federated querying -
    join data from different
    source formats

    View Slide

  24. Outcome - Presto for Data Lake Analytics
    ● Storage-Compute segregation
    ● Query Federation
    ● Unified SQL access
    ● Faster Onboarding and No Data Downtimes
    ● Better price performance
    24

    View Slide

  25. Let’s look at an
    eCommerce app
    powered by
    Presto
    25

    View Slide

  26. Open SQL Data Lakehouse for eCommerce: Powered by Presto
    26
    Reporting &
    Dashboarding
    Data
    Science
    Data Governance
    Unified SQL Engine
    Ingestion Cloud Data Lakes
    Batch
    Streaming
    Open SQL Data Lakehouse
    Source Tables
    Derived Tables

    View Slide

  27. Presto Scalable Architecture
    27

    View Slide

  28. Demo Time
    1. Query S3 Data
    2. Join AWS glue table and MySQL table with Presto
    28

    View Slide

  29. Real-time
    Analytics with
    ClickHouse
    29

    View Slide

  30. Real-time analytic challenges
    ● Load millions of rows per second from event stream fire hoses
    ● Fixed, low latency response to arbitrary slicing/dicing queries
    ● ~10ms response to requests from services
    ● Scale to very large datasets
    ● High cost efficiency
    30

    View Slide

  31. Understands SQL
    Runs on bare metal to cloud
    Shared nothing architecture
    Stores data in columns
    Parallel and vectorized execution
    Scales to many petabytes
    Is Open source (Apache 2.0)
    ClickHouse is a SQL Data Warehouse
    It’s a popular engine for
    real-time analytics
    ClickHouse
    Event
    Streams
    ELT
    Object
    Storage
    Interactive
    Graphics
    Dashboards
    APIs
    31

    View Slide

  32. Seeing is believing
    32
    Demo Time!

    View Slide

  33. Round up the usual performance suspects
    33
    Data
    Partitioning
    Codecs
    Compression Skip
    Indexes
    Projections
    Sharding
    Vectorized Query
    Data
    Types
    Read
    Replicas
    Tiered Storage
    Primary key index
    In-RAM dictionaries

    View Slide

  34. “One Big Table” design: multiple entities in a single table
    34
    Restart
    ● msg_type=’restart’
    ● sensor_id
    ● time
    Reading
    ● msg_type=’reading’
    ● sensor_id
    ● time
    ● temperature Error
    ● msg_type=’err’
    ● sensor_id
    ● time
    ● message

    View Slide

  35. What does the sensor table look like?
    35
    CREATE TABLE IF NOT EXISTS readings_zstd (
    sensor_id Int32 Codec(DoubleDelta, ZSTD(1)),
    sensor_type UInt16 Codec(ZSTD(1)),
    location LowCardinality(String) Codec(ZSTD(1)),
    time DateTime Codec(DoubleDelta, ZSTD(1)),
    date ALIAS toDate(time),
    temperature Decimal(5,2) Codec(T64, ZSTD(10))
    )
    Engine = MergeTree
    PARTITION BY toYYYYMM(time)
    ORDER BY (location, sensor_id, time);
    Optimized data
    types
    Codecs + ZSTD
    compression
    ALIAS column
    Sorting by key
    columns + time
    Time-based
    partitioning

    View Slide

  36. Linear query scaling using -If combinators
    36
    --Query over 1 Billion rows
    set max_threads = 16;
    SELECT
    toYYYYMM(time),
    countIf(msg_type = 'reading'),
    countIf(msg_type = 'restart'),
    min(temperature),
    round(avg(temperature)),
    max(temperature)
    FROM test.readings_multi
    WHERE sensor_id BETWEEN 0 and 10000
    GROUP BY month ORDER BY month ASC;

    View Slide

  37. msg_type sensor_id time temperature
    sensor_id restart_time
    time temperature
    sensor_id restart_time
    time temperature
    What about joins within a big table schema?
    Use case: join restarts with temperature readings
    37
    sensor_id uptime
    time temperature
    Restart times
    msg_type
    ‘restart’
    sensor_id time
    Temperature readings
    Temperatures after restart
    msg_type sensor_id time temperature
    JOIN key
    msg_type
    ‘reading’
    sensor_id time temperature

    View Slide

  38. msg_type sensor_id time temperature
    msg_type sensor_id time temperature
    Aggregation can implement joins!
    38
    sensor_id uptime
    time temperature
    Restart and temperature records
    msg_type sensor_id time
    msg_type sensor_id time temperature
    Temperatures after restart
    sensor_id
    restart_time: t1
    reading_time: [t1, t2, t3, t4, …]
    temp: [76.44, 90.39, 82.08, 48.12, ..]
    236
    236
    236
    236

    t1
    t2
    t3
    t4
    ...
    76.44
    90.39
    82.08
    48.12
    ...
    30
    90
    150
    210
    ...
    GROUP BY
    key
    Grouped array values
    ARRAY JOIN to pivot
    on arrays

    View Slide

  39. Finding the last restart is an aggregation task!
    39
    236 2019-01-10 20:00:13 restart
    sensor_id time msg_type 236 2019-01-10 21:07:56 restart
    sensor_id time msg_type
    236 2019-01-10 21:07:56 restart
    sensor_id time msg_type
    Merge
    GROUP BY key
    Max value
    Matching
    row value

    View Slide

  40. Use materialized views to “index” data
    Finding the last restart on a sensor
    40
    Block lands in
    source table
    Block(s) land
    in materialized
    view target
    table
    SELECT
    sensor_id,
    max(time) AS time
    FROM readings_multi
    WHERE msg_type = 'restart'
    GROUP BY sensor_id
    “Last point query”
    MergeTree Table
    AggregatingMergeTree
    Table

    View Slide

  41. Outcome - ClickHouse for Real-Time Analytics
    ● Convenient integration to ingest: event streams, object storage, ELT, …
    ● Fast response on unaggregated source data
    ● Pre-aggregated response within time to render a web pages server-side
    ● Scale resources to maintain constant response
    ● Cost-efficient user-facing tenant APIs and visualization
    41

    View Slide

  42. Mixing and
    Matching
    42

    View Slide

  43. 43
    Let’s look at how you might deploy
    these architectures together
    Data Lakes & Real Time Analytics

    View Slide

  44. Presto Clickhouse Connector
    44

    View Slide

  45. Query Federation
    Join AWS glue table and Clickhouse Table with Presto
    select name, sum(totalprice) as total
    from clickhouse.ahana.customer AS c
    LEFT JOIN
    glue.ecom.orders AS o
    ON c.custkey=o.custkey
    GROUP BY name
    ORDER BY total DESC LIMIT 10;
    45

    View Slide

  46. ClickHouse can read data from S3
    46
    S3 Object Storage
    Parquet
    File
    s3()
    Table
    Function
    SELECT
    max(a),
    sum(b)
    FROM s3(...)

    View Slide

  47. Example of reading Parquet data in ClickHouse
    SELECT max(temperature), min(temperature)
    FROM
    s3('https://s3.us-east-1.amazonaws.com/.../readings*.parquet',
    'Parquet')
    WHERE sensor_type=1
    max(temperature)|min(temperature)|
    ----------------+----------------+
    125.62| -11.11|
    47

    View Slide

  48. Wrap-up
    48

    View Slide

  49. Summary points
    ● Data lakes with Presto gives data engineers & data architects more flexibility,
    better price performance and 1 unified interface for their data
    ● Real-time analytics with ClickHouse offer fast reaction and constant query
    response on rapidly arriving data
    ● You can mix approaches, too
    ○ Deploy Clickhouse with Presto to get access to your real-time data along with your other data
    sources and data lakes
    ○ Read data lake files directly from ClickHouse
    49

    View Slide

  50. 50
    Thank you!
    Questions?
    https://altinity.com
    Altinity.Cloud
    Contact Altinity
    https://ahana.io
    Ahana Cloud
    Contact Ahana

    View Slide

  51. Quick,
    Iterative
    Exploration
    Real-time analytics in action: service log management
    51
    Question Answer
    Why do node.js
    backends fail to
    process transactions?
    Slice and dice
    queries on detailed
    log data
    Bug introduced in
    latest upgrade

    View Slide

  52. And here’s the code…
    52
    SELECT sensor_id, reading_time, temp, reading_time,
    reading_time - restart_time AS uptime
    FROM (
    WITH toDateTime('2019-04-17 11:00:00') as start_of_range
    SELECT sensor_id, groupArrayIf(time, msg_type = 'reading') AS
    reading_time,
    groupArrayIf(temperature, msg_type = 'reading') AS temp,
    anyIf(time, msg_type = 'restart') AS restart_time
    FROM test.readings_multi rm
    WHERE (sensor_id = 2555)
    AND time BETWEEN start_of_range AND start_of_range + 600
    GROUP BY sensor_id)
    ARRAY JOIN reading_time, temp Not everyone’s cup of tea,
    but it works!!!

    View Slide

  53. 53
    It is a capital mistake to theorize
    before one has data.
    Sherlock Holmes
    (aka Arthur Conan Doyle)
    A Scandal in Bohemia
    A famous data scientist on the subject of data…

    View Slide

  54. ClickHouse Server Architecture
    54
    Query Parser Query Interpreter Query Pipeline
    Query
    Columnar data in block storage
    Columnar data in object
    storage
    OS Page Cache
    Kafka Event
    Stream
    MergeTree AggregatingMergeTree S3 Kafka
    Table Engines
    Table Primary
    Key Indexes
    Column blocks
    from storage
    Joined data
    (hash tables)
    Intermediate Results
    (hash tables)

    View Slide