Druid Ecosystem @ Yahoo!

Imply
April 02, 2019

Presentation by Niketh Sabbineni, Principal Engineer @ Yahoo!, for the San Francisco Bay Area Druid Meetup at Unity.

Transcript

  1. Niketh Sabbineni
    Druid Ecosystem @ Yahoo

  2. Who am I
    2 Yahoo Confidential & Proprietary
    ▪ Principal Engineer @ Yahoo
    ▪ CTO @ Bookpad
    ▪ SDE @ Amazon

  3. Flurry Overview
    ▪ Measure → Analyse → Insights → Action
    ▪ 1M Apps
    ▪ 2.1B Devices
    ▪ 100B+ Events (daily)
    ▪ 10B Sessions (daily)
    ▪ Raw data well over 20PB

  4. Features
    ▪ Realtime
    ▪ Crash
    ▪ Technical
    ▪ Audience
    ▪ Retention
    ▪ ....
    ▪ ....
    ▪ Free Free Free!

  5. Why Druid ?
    ▪ Realtime + Batch
    ▪ Horizontally Scalable
    ▪ Sub Second Query Latency
    ▪ Resilient to failures
    ▪ Custom plugins

  6. Architecture
    [Architecture diagram; labels: Collectors, HBase, Storm, Druid, Kafka, MapReduce]

  7. Architecture
    [Architecture diagram; labels: Collectors, HBase, Storm, Druid, Kafka, MapReduce, Druid Metrics Cluster, UI, Programmatic, Alerts, Hive, Pivot, External]

  8. Architecture
    [Architecture diagram; the pipeline labels (Collectors, HBase, Storm, Druid, Kafka, MapReduce) appear twice, alongside a "Replication" label; other labels: Druid Metrics Cluster, UI, Programmatic, Alerts, Hive, Pivot, External]

  9. Architecture
    ● 300 Historicals - 256GB RAM / 7TB SSD
    ● 80 Middle Managers
    ● HDFS / Kafka
    ● 5 clusters in Flurry
    ● 18 Clusters in Yahoo/Oath
    ● Imply Pivot / Superset
    ● Hive / SQL
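
The Historical figures above imply an aggregate capacity that is easy to work out. The following is a minimal sketch of that arithmetic, assuming every one of the 300 Historicals has the listed spec and that decimal units apply (1 TB = 1000 GB); the function name `fleet_totals` is hypothetical and not from the talk.

```python
# Figures taken from the slide; decimal units (1 TB = 1000 GB) are an assumption.
HISTORICALS = 300
RAM_GB_PER_NODE = 256
SSD_TB_PER_NODE = 7


def fleet_totals(nodes, ram_gb, ssd_tb):
    """Return (total RAM in TB, total SSD in TB, RAM-to-SSD ratio)."""
    total_ram_tb = nodes * ram_gb / 1000
    total_ssd_tb = nodes * ssd_tb
    return total_ram_tb, total_ssd_tb, total_ram_tb / total_ssd_tb


ram_tb, ssd_tb, ratio = fleet_totals(HISTORICALS, RAM_GB_PER_NODE, SSD_TB_PER_NODE)
print(f"RAM: {ram_tb:.1f} TB, SSD: {ssd_tb} TB, RAM/SSD: {ratio:.1%}")
```

Under these assumptions the fleet holds roughly 76.8 TB of RAM against 2100 TB of SSD, so only a few percent of segment data can be memory-resident at any time.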

  10. Lessons Learnt
    ▪ Querying
    ▪ Ingestion
    ▪ Monitoring

  11. Querying
    ▪ Column Types - String/Float/Double/Long/Custom
    ▪ Heterogeneous Nodes - Ensure constant RAM/disk ratio
    ▪ Cost Balancer - diskNormalized
    ▪ Partitioning - Broker uses ShardSpec & less paging
    ▪ Spill to Disk
    ▪ Sketch (Count Distinct) Size - Adjust sketch sizes
    ▪ LimitSpec
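
Two of the points above (sketch size and LimitSpec) show up directly in a native Druid query. The following is a minimal sketch of a groupBy query body that sets an explicit theta sketch `size` and a `limitSpec`. The datasource, the `country` dimension, the `user_id_sketch` column, and the function name are hypothetical placeholders, not names from the talk.

```python
import json


def build_distinct_users_query(datasource, interval, sketch_size=16384, limit=10):
    """Build a native groupBy query with a theta sketch and a limitSpec.

    All column and dimension names here are illustrative assumptions.
    """
    # Theta sketch sizes are expected to be powers of two.
    if sketch_size <= 0 or sketch_size & (sketch_size - 1) != 0:
        raise ValueError("sketch_size must be a power of two")
    return {
        "queryType": "groupBy",
        "dataSource": datasource,
        "intervals": [interval],
        "granularity": "all",
        "dimensions": ["country"],
        "aggregations": [
            {
                "type": "thetaSketch",
                "name": "unique_users",
                "fieldName": "user_id_sketch",
                "size": sketch_size,
            }
        ],
        # Keep only the top `limit` groups, ordered by the sketch estimate.
        "limitSpec": {
            "type": "default",
            "limit": limit,
            "columns": [{"dimension": "unique_users", "direction": "descending"}],
        },
    }


query = build_distinct_users_query("events", "2019-04-01/2019-04-02")
print(json.dumps(query, indent=2))
```

A smaller `size` trades count-distinct accuracy for less memory and smaller intermediate results, which is presumably the adjustment the slide refers to.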

  12. Ingestion
    ▪ Query / Segment Granularity (50% space)
    ▪ Staggered Runs with Replication (30% compute)
    ▪ Partitioning - Can result in smaller segment sizes
    ▪ Reindexing (Dimensions)
    ▪ Late Arriving Data - Periodic Backfills
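
The space saving credited to query granularity comes from rollup: rows that share a truncated timestamp and the same dimension values are merged at ingestion time. The following is a minimal pure-Python sketch of that mechanism, not Druid's implementation; the rows and the `rollup` helper are made up for illustration, and the resulting row counts say nothing about the 50% figure on the slide.

```python
from collections import Counter
from datetime import datetime

# strftime patterns that truncate a timestamp to the chosen granularity.
_FORMATS = {"minute": "%Y-%m-%dT%H:%M", "hour": "%Y-%m-%dT%H", "day": "%Y-%m-%d"}


def rollup(rows, query_granularity="hour"):
    """Sum counts for rows sharing (truncated timestamp, dimension value)."""
    fmt = _FORMATS[query_granularity]
    totals = Counter()
    for ts, app, count in rows:
        bucket = datetime.fromisoformat(ts).strftime(fmt)
        totals[(bucket, app)] += count
    return dict(totals)


# Hypothetical raw events: (timestamp, app dimension, count metric).
rows = [
    ("2019-04-02T10:01:00", "app_a", 1),
    ("2019-04-02T10:01:30", "app_a", 1),
    ("2019-04-02T10:45:00", "app_a", 1),
    ("2019-04-02T11:05:00", "app_a", 1),
    ("2019-04-02T10:20:00", "app_b", 1),
]

for granularity in ("minute", "hour", "day"):
    print(granularity, len(rollup(rows, granularity)))
```

Coarser granularity leaves fewer stored rows for the same events, at the cost of no longer being able to query below that granularity.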

  13. Monitoring
    ▪ Health Checks - Processes / Disks / RAM
    ▪ Querying - Time, Failure counts, GC Time, Paging Time
    ▪ Ingestion Tasks - Ingest lag, Waiting task counts
    ▪ Coordinator - Load Queue Size, Disk Size
    ▪ SSL Certificate Expiry
    ▪ Metrics Cluster
    ▪ Use Pivot / Turnilo for root cause analysis
    ▪ Kafka / HDFS - Name node storage
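
Of the checks listed above, SSL certificate expiry is simple to automate. The following is a minimal sketch, assuming the certificate's `notAfter` string has already been obtained (for example from `ssl.SSLSocket.getpeercert()`). The helper names and the 30-day threshold are hypothetical choices, not something stated in the talk.

```python
import ssl
import time

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now=None):
    """Days remaining before a certificate's notAfter time.

    `not_after` uses the certificate time format, e.g. "Jan  5 09:34:43 2018 GMT".
    `now` is seconds since the epoch; defaults to the current time.
    """
    expiry = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expiry - current) / SECONDS_PER_DAY


def needs_renewal(not_after, threshold_days=30, now=None):
    """True when the certificate expires within `threshold_days` (assumed value)."""
    return days_until_expiry(not_after, now) < threshold_days
```

A check like this can run on a schedule and raise an alert when `needs_renewal` returns true, well before clients start failing TLS handshakes.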

  14. Q/A
    ▪ Niketh Sabbineni
    ▪ Ankit Kothari

  15. Flurry Demo