
Query Processing Resiliency in Pinot (Vivek Iyer & Jia Guo, LinkedIn) | RTA Summit 2023

As with any distributed system, from the moment Pinot receives a query request to the moment the user receives a response, several hosts are at play. As the number of these hosts increases, the blast radius for failures also increases. LinkedIn's large Pinot deployment faced a number of resiliency issues related to failures, slowness, and undesirable user access patterns, so providing a robust query processing framework resilient to such issues becomes paramount. To improve the resiliency and stability of our production clusters, we strengthened query processing resiliency in Pinot with two main capabilities:

1. Adaptive Server Selection at the Pinot Broker to route queries intelligently to the best available servers
2. Runtime Query Killing to protect Pinot Servers from crashing

Adaptive Server Selection

When the Pinot Broker routes a query, it can pick from several candidate servers based on the replication factor. Pinot currently uses a lightweight server-selection layer that follows a naive round-robin approach. Pinot Servers do the bulk of the heavy lifting during query processing, so selecting the right server(s) to process a query has a significant impact on query latency (e.g., a single slow server can become the overall bottleneck).

LinkedIn developed an intelligent yet lightweight Server Selection framework to adaptively route queries to the best available server(s). The framework collects various stats about the health and capacity of each server, such as server latency and the number of in-flight requests. Using these server statistics, the broker now intelligently routes queries to the best available server.
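As a rough illustration of one such policy (selection by the number of in-flight requests), here is a minimal, hypothetical sketch; the class and method names are illustrative and not Pinot's actual API:

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a broker-local, stats-based server selector.
public class InFlightRequestSelector {
  // Number of queries currently outstanding per server, maintained by the broker.
  private final ConcurrentHashMap<String, AtomicInteger> inFlight = new ConcurrentHashMap<>();

  /** Pick the candidate replica with the fewest in-flight requests. */
  public String select(List<String> candidateServers) {
    String best = candidateServers.get(0);
    int bestCount = Integer.MAX_VALUE;
    for (String server : candidateServers) {
      int count = inFlight.computeIfAbsent(server, s -> new AtomicInteger()).get();
      if (count < bestCount) {
        bestCount = count;
        best = server;
      }
    }
    return best;
  }

  /** Called just before a query is routed to the chosen server. */
  public void onQuerySent(String server) {
    inFlight.computeIfAbsent(server, s -> new AtomicInteger()).incrementAndGet();
  }

  /** Called when the server's response (or error) is received. */
  public void onResponse(String server) {
    inFlight.computeIfAbsent(server, s -> new AtomicInteger()).decrementAndGet();
  }
}
```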

LinkedIn has rolled out this framework to its entire production deployment. In the first part of this presentation, we will cover the design, implementation details, and the benefits observed in LinkedIn production. It has improved the overall availability of our deployment and significantly reduced the need for manual intervention.

Query Pre-emption

In the second part of the presentation, we plan to cover the Query Killing framework design, instrumentation instructions, production benefits, and future extensions.

Pinot has successfully supported high-QPS analytical queries on large datasets within LinkedIn. However, resource usage accounting and control remain a challenge. A typical scenario: when unplanned expensive queries or ad-hoc data exploration queries land, Pinot brokers/servers can be brought down by out-of-memory errors or held up by a handful of CPU-intensive queries. To deal with this, we introduced a lightweight yet responsive framework to account for the CPU and memory footprint of each query. This framework employs a lockless thread usage sampling algorithm that can easily instrument generic multi-threaded query execution code, with an adjustable accuracy-overhead trade-off. The usage stats sampled from all threads serving the same query are periodically aggregated and ranked. On top of these collected usage numbers, we implemented a mechanism to interrupt culprit queries that degrade cluster availability or latency SLAs. Our experiments and production numbers demonstrate that this framework can effectively find and kill the offending queries while incurring minimal performance overhead.
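To make the lock-free sampling idea concrete, here is a minimal sketch under assumed names (not Pinot's actual classes): each execution thread is the sole writer of its own volatile usage fields, and the accounting thread only reads them, so no locks are needed on the hot path.

```java
// Hypothetical sketch: lock-free exchange between an execution thread and the accountant.
public class QueryThreadEntry {
  // Written only by the owning execution thread, read by the accounting thread.
  volatile long allocatedBytesSample;
  volatile long cpuTimeNsSample;
  // Written by the accounting thread when a query must be pre-empted,
  // polled by the execution thread between blocks of data.
  volatile boolean terminateRequested;

  /** Execution thread: publish the latest usage sample (no locks, no CAS). */
  void report(long allocatedBytes, long cpuTimeNs) {
    allocatedBytesSample = allocatedBytes;
    cpuTimeNsSample = cpuTimeNs;
  }

  /** Execution thread: cheap check between blocks; return early with an error if set. */
  boolean shouldTerminate() {
    return terminateRequested;
  }
}
```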

StarTree

May 23, 2023

Transcript

  1. Strengthening Pinot’s Query Processing Engine with
    Adaptive Server Selection and Runtime Query Killing
    Vivek Iyer Vaidyanathan
    Software Engineer,
    LinkedIn
    Jia Guo
    Software Engineer,
    LinkedIn
    Query Processing Resiliency
    Real-Time Analytics Summit 2023

  2. Adaptive Server Selection

  3. Agenda: Adaptive Server Selection
    1. Problem Statement
    2. Design
    3. Regression Benchmarking
    4. Results after prod rollout at LinkedIn

  4. Problem Statement
    Query Routing in Pinot
    • Pinot Servers host the segments that contain the data to be queried
    • Each segment is hosted on multiple servers, controlled by the replication factor
    • The Pinot Broker receives the query
    • The Broker uses a round-robin approach to pick the servers to process a query
    Issues with Round-Robin Routing
    • Pinot Servers are susceptible to both transient and permanent slowness issues: GC pauses, network issues, and disk failures
    • With round-robin selection, queries are sent to servers regardless of server performance, which can result in slower/failed responses
    • A more intelligent approach is needed to optimize server selection and improve overall performance

  5. Scale @ LinkedIn
    250K+ Queries Per Second
    5000+ Pinot Server Hosts
    1400+ Pinot Broker Hosts
    4500+ Pinot Tables

  6. Pain Points @ LinkedIn Scale
    ● The Pinot team spends ~72+ engineering hours every quarter troubleshooting server slowness issues
    ● Customers face availability degradation when there are Pinot server failures
    ● Latency SLAs are breached when Pinot Servers experience slowness or failures: 30+ events per quarter
    ● Elevated urgency when triaging latency-increase alerts

  7. Agenda: Adaptive Server Selection
    1. Problem Statement
    2. Design
    3. Regression Benchmarking
    4. Results after prod rollout at LinkedIn

  8. Building Blocks of the Feature
    Design Components: Smarter Query Routing at the Broker
    Stats Collector
    • Local to each broker
    • Stats are stored in-memory
    • Per-server stats that are maintained:
      - # of in-flight queries
      - EWMA of in-flight queries
      - EWMA of observed latencies
    Server Selector
    • Uses an intelligent selection policy to pick the best server
    • Decisions are made based on local state
    • Selection policies supported:
      - Uses the # of in-flight queries
      - Uses latencies
      - A sophisticated policy using latency and # of in-flight queries
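As a rough sketch of how such a Stats Collector might maintain these per-server numbers (the class name and smoothing factor below are illustrative, not Pinot's actual code):

```java
// Hypothetical per-server stats entry kept in the broker's Stats Collector.
public class ServerQueryStats {
  private static final double ALPHA = 0.3;   // illustrative EWMA smoothing factor

  private volatile int numInFlightQueries;
  private volatile double inFlightEwma;
  private volatile double latencyEwmaMs;

  /** Update before a query is routed to this server. */
  public synchronized void onQuerySubmitted() {
    numInFlightQueries++;
    inFlightEwma = ALPHA * numInFlightQueries + (1 - ALPHA) * inFlightEwma;
  }

  /** Update after the server's response is received. */
  public synchronized void onResponseReceived(double latencyMs) {
    numInFlightQueries--;
    inFlightEwma = ALPHA * numInFlightQueries + (1 - ALPHA) * inFlightEwma;
    latencyEwmaMs = ALPHA * latencyMs + (1 - ALPHA) * latencyEwmaMs;
  }

  // Lock-free reads for the Server Selector (fields are volatile).
  public int getNumInFlightQueries() { return numInFlightQueries; }
  public double getInFlightEwma() { return inFlightEwma; }
  public double getLatencyEwmaMs() { return latencyEwmaMs; }
}
```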

  9. Workflow
    [Diagram: a Pinot Broker with a Server Selector and a Stats Collector routing queries Q1 -> S1, Q2 -> S2, Q3 -> S2, Q4 -> S3 across Pinot Servers S1-S3]
    Stats Collector
    ● Async stats collection for minimal overhead
    ● Before routing to a server, the # of in-flight requests is updated
    ● After receiving the response from a server, the latency and # of in-flight requests are updated

  10. Server Selection Policy
    • Minimal overhead to query processing
    • Quick Detection: must quickly detect slow servers and tune down their QPS
    • Auto Recovery: must cope and quickly react when servers recover
    • Well-behaved: must avoid herd behaviors
    • Used independently, signals like current queue size and latency are raw and delayed
    [Diagrams: herd behavior, where brokers shift all queries away from a slow server, before vs. after; and auto recovery, where queries return once the server recovers]

  11. Hybrid Server Selector
    ● Rank each server based on a score; pick the server with the lowest score
    ● EWMA smoothens the values, giving priority to changes in the recent past
    ● Avoids herd behavior by forecasting the future queue size for a server
    ● Reacts faster to changes on the server by using an exponential function
    ServerScore = (estimated_queue)^N x Latency_EWMA
    estimated_queue = Q + Q_EWMA + 1
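A minimal sketch of this scoring function, assuming hypothetical accessors and an illustrative exponent N:

```java
import java.util.Map;

// Hypothetical scoring for the hybrid selection policy:
//   ServerScore = (estimated_queue)^N * Latency_EWMA
//   estimated_queue = Q + Q_EWMA + 1
public final class HybridServerScore {
  private static final double N = 3.0;   // illustrative exponent, not Pinot's actual default

  static double score(int inFlightQueries, double inFlightEwma, double latencyEwmaMs) {
    double estimatedQueue = inFlightQueries + inFlightEwma + 1;
    return Math.pow(estimatedQueue, N) * latencyEwmaMs;
  }

  /** Pick the candidate replica with the lowest score. */
  static String pickBest(Map<String, double[]> statsByServer) {
    // Each value holds {numInFlightQueries, inFlightEwma, latencyEwmaMs}.
    String best = null;
    double bestScore = Double.MAX_VALUE;
    for (Map.Entry<String, double[]> entry : statsByServer.entrySet()) {
      double[] s = entry.getValue();
      double candidateScore = score((int) s[0], s[1], s[2]);
      if (candidateScore < bestScore) {
        bestScore = candidateScore;
        best = entry.getKey();
      }
    }
    return best;
  }
}
```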

  12. Agenda: Adaptive Server Selection
    1. Problem Statement
    2. Design
    3. Regression Benchmarking
    4. Results after prod rollout at LinkedIn

  13. Latency Overhead
    [Charts: latency overhead of the query routing phase vs. total query processing time]

  14. Agenda: Adaptive Server Selection
    1. Problem Statement
    2. Design
    3. Regression Benchmarking
    4. Results after prod rollout at LinkedIn

  15. Prolonged Server Slowness

  16. Transient Server Slowness

  17. Outcome Highlight
    Before: 60 slow-server latency alerts per quarter
    After: 2 slow-server alerts per quarter

  18. Outcome Highlight
    Before: 72 engineering hours per quarter spent debugging transient server slowness issues
    After: 8 engineering hours per quarter spent debugging transient server slowness issues

  19. Prevention of Latency Degradation
    ● Single server slowness causes latency degradation for ~33.33% of queries when RG = 3
    ● Adaptive Server Selection reduces the chances of latency degradation when one or more servers slow down
    At LinkedIn, this work helped prevent latency degradation in the event of server slowness for more than 90% of queries in production.

  20. OOM Protection Using Automatic Query Killing

  21. Agenda: Query Killing
    1. Motivation and Challenges
    2. Design
    3. Results after prod rollout at LinkedIn

  22. Pain Points @ LinkedIn
    ● A CPU/memory-intensive query can silently slow down other queries before it actually triggers an OOM
    ● Users face SLA breaches & large-scale availability degradation
    ● The user gets no proper warning for expensive queries
    ● OOM exceptions are hard to triage/reproduce after the fact, and our oncalls typically spend 4 hours to root-cause each occurrence

  23. Problem Statement
    Goals
    ● OOM protection for servers and brokers
    ● Kill high-risk queries on the fly
    Challenges
    ● No runtime memory tracking for Pinot queries
    ● Java's opaque memory management
    ● Overhead of memory accounting
    ● Correctly aggregating accounting stats for multi-threaded execution with minimal overhead
    [Diagram: OOM can occur on the broker during the gather phase and on each server during the scatter phase]

  24. Agenda: Query Killing
    1. Motivation and Challenges
    2. Design:
       ● Stats collection
       ● Global Accounting
    3. Results after prod rollout at LinkedIn

  25. Getting Usage Statistics with Instrumentation
    Generic status/usage reporting for multi-threaded query execution code
    ● Tree-like runtime query status context: a runner thread fans out work to worker threads (Q1_0* ... Q1_4)
    ● Thread-reported generic usages, e.g. ThreadMXBean.getThreadAllocatedBytes()
    ● Per-thread stats sample kept as a volatile primitive; per-thread task status kept as an AtomicReference
    ● Lock free: low overhead
    *Q1_0 denotes task 0 for query 1
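For reference, the allocation counter mentioned above is exposed through the com.sun.management extension of ThreadMXBean; a minimal standalone example of reading it for the current thread (assuming a HotSpot JVM) might look like this:

```java
import java.lang.management.ManagementFactory;

public class AllocationSampler {
  public static void main(String[] args) {
    // On HotSpot, the platform ThreadMXBean also implements the com.sun.management
    // extension that exposes per-thread allocation counters.
    com.sun.management.ThreadMXBean bean =
        (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

    long tid = Thread.currentThread().getId();
    long before = bean.getThreadAllocatedBytes(tid);

    byte[] scratch = new byte[1 << 20];   // allocate ~1 MB so the counter moves

    long after = bean.getThreadAllocatedBytes(tid);
    System.out.println("Approx. bytes allocated by this thread: " + (after - before)
        + " (scratch length " + scratch.length + ")");
  }
}
```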

  26. Getting Usage Statistics with Instrumentation
    Query execution thread (worker thread as an example):
    Setup query status -> work on a block of data -> report usage -> if not finished, work on the next block; otherwise return the result
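A hypothetical rendering of that per-thread loop; the reporter and processor interfaces below are illustrative placeholders rather than Pinot's actual classes:

```java
// Hypothetical instrumented worker-thread loop.
public class InstrumentedWorker implements Runnable {
  private final ThreadUsageReporter reporter;   // publishes usage samples (illustrative)
  private final BlockProcessor processor;       // does the actual per-block work (illustrative)

  public InstrumentedWorker(ThreadUsageReporter reporter, BlockProcessor processor) {
    this.reporter = reporter;
    this.processor = processor;
  }

  @Override
  public void run() {
    reporter.setupQueryStatus();            // 1. register this thread's task status
    try {
      while (!processor.isFinished()) {
        processor.workOnNextBlock();        // 2. process one block of data
        reporter.reportUsage();             // 3. publish the latest usage sample
      }
      processor.returnResult();             // 4. hand the result back to the runner
    } finally {
      reporter.clearQueryStatus();          // always clear the per-thread status
    }
  }

  // Illustrative collaborator interfaces.
  interface ThreadUsageReporter { void setupQueryStatus(); void reportUsage(); void clearQueryStatus(); }
  interface BlockProcessor { boolean isFinished(); void workOnNextBlock(); void returnResult(); }
}
```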

  27. Getting Usage Statistics with Instrumentation: Example
    Setting up query status: example of a worker thread
    ● Get the context from the runner thread
    ● Set up the context when a worker thread spawns
    ● Clear the context when the worker thread finishes
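One common way to realize this runner-to-worker handoff is a thread-local context; the following is a generic sketch with hypothetical names, not Pinot's actual context class:

```java
// Hypothetical thread-local execution context passed from runner to worker threads.
public final class QueryThreadContext {
  private static final ThreadLocal<QueryThreadContext> CURRENT = new ThreadLocal<>();

  private final String queryId;
  private final int taskId;

  public QueryThreadContext(String queryId, int taskId) {
    this.queryId = queryId;
    this.taskId = taskId;
  }

  /** Runner thread: expose its context so it can be handed to workers. */
  public static QueryThreadContext get() {
    return CURRENT.get();
  }

  /** Worker thread: install the runner's context (with its own task id) when it spawns. */
  public static void setup(QueryThreadContext runnerContext, int taskId) {
    CURRENT.set(new QueryThreadContext(runnerContext.queryId, taskId));
  }

  /** Worker thread: clear the context when it finishes, so pooled threads don't leak state. */
  public static void clear() {
    CURRENT.remove();
  }

  public String getQueryId() { return queryId; }
  public int getTaskId() { return taskId; }
}
```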

  28. Getting Usage Statistics with Instrumentation: Example
    Instrumenting query execution code: example of segment processing code
    Operator chain applied to a block of data: DocIdSetOperator -> ProjectionOperator -> TransformOperator -> GroupByOperator / AggregationFunction
    One-shot usage collection: inject 1 line of code in the operator execution codepath
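Schematically, that single injected line sits inside an operator's block-processing method; the operator and accountant classes below are hypothetical placeholders:

```java
// Hypothetical operator showing where the single usage-collection call is injected.
public class InstrumentedGroupByOperator {

  public Block getNextBlock(Block input) {
    // One-shot usage collection: the single injected line in the operator codepath.
    ThreadAccountant.sampleUsage();

    return doGroupBy(input);   // existing per-block processing, unchanged
  }

  private Block doGroupBy(Block input) {
    // ... actual aggregation work would happen here ...
    return input;
  }

  // Illustrative placeholders.
  public static class Block {}
  static class ThreadAccountant {
    static void sampleUsage() { /* publish this thread's usage sample */ }
  }
}
```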

  29. High-Level Query Killing Workflow
    Building accounting/killing upon the execution instrumentation
    Query execution thread: setup task status -> work on a block of data -> report usage -> check status: if killed, return with an error; otherwise continue operator execution and finally return the result
    Accounting thread: record thread-level status/usage -> aggregate usage by query -> if there is OOM risk, terminate the query with the most memory (kill its query threads) -> reschedule after X ms
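A highly simplified sketch of the accounting thread's periodic loop; the names, the heap-usage check, and the 95% threshold are all illustrative assumptions rather than the framework's actual values:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical periodic accounting loop that aggregates per-thread samples and
// kills the most memory-hungry query when the process is at risk of OOM.
public class QueryAccountantLoop implements Runnable {
  private final List<ThreadSample> threadSamples;   // one entry per execution thread
  private final long sleepMs;                        // "reschedule after X ms"

  public QueryAccountantLoop(List<ThreadSample> threadSamples, long sleepMs) {
    this.threadSamples = threadSamples;
    this.sleepMs = sleepMs;
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      // 1. Record thread status/usage and aggregate it by query id.
      Map<String, Long> bytesByQuery = new HashMap<>();
      for (ThreadSample sample : threadSamples) {
        bytesByQuery.merge(sample.getQueryId(), sample.getAllocatedBytes(), Long::sum);
      }
      // 2. If there is OOM risk, terminate the query using the most memory.
      if (heapUsageRatio() > 0.95) {
        bytesByQuery.entrySet().stream()
            .max(Map.Entry.comparingByValue())
            .ifPresent(entry -> killQuery(entry.getKey()));
      }
      // 3. Reschedule after X ms.
      try {
        Thread.sleep(sleepMs);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
      }
    }
  }

  private double heapUsageRatio() {
    Runtime rt = Runtime.getRuntime();
    return (double) (rt.totalMemory() - rt.freeMemory()) / rt.maxMemory();
  }

  private void killQuery(String queryId) {
    // Placeholder: flag the query's threads so they return with an error.
  }

  // Illustrative read-only view of a thread's latest sample.
  interface ThreadSample {
    String getQueryId();
    long getAllocatedBytes();
  }
}
```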

  30. Query Usage Accounting Algorithm
    Inspect the thread's task status: if the current task is not the same as the previously recorded one, merge the finished task into the `partially finished` aggregator.
    Example: T1's recorded task Q2_1 (3000B) differs from its current task Q3_4 (200B sample), i.e. Q2_1 != Q3_4, so Q2_1 is merged into the partially finished aggregates: Q2: 400B + 3000B, Q3: 5000B.
    *For simplicity we demonstrate only 1 thread from the threadpool

  31. Query Usage Accounting Algorithm
    If the task is the same as the previous one (T1 is still on Q3_4), simply record the new usage stats (200B) and the new task status; the partially finished aggregates stay at Q2: 3400B, Q3: 5000B.

  32. Query Usage Accounting Algorithm
    Periodically, usage is aggregated per query by combining the partially finished aggregates (Q2: 3400B, Q3: 5000B) with the currently recorded per-thread samples (T1: 200B on Q3_4), giving Q2: 3400B and Q3: 5000B + 200B, and the result is checked for OOM risk.
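The per-thread bookkeeping sketched on slides 30-32 could look roughly like this (hypothetical types; the real accountant also tracks CPU time and handles thread lifecycle):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-thread bookkeeping used by the accounting thread.
public class ThreadUsageAggregator {
  private String recordedTaskId;        // e.g. "Q2_1"
  private long recordedBytes;           // last sample recorded for that task
  // Usage of tasks this thread has already finished, keyed by query id (e.g. "Q2").
  private final Map<String, Long> partiallyFinishedBytes = new HashMap<>();

  /** Called by the accounting thread with the thread's current task and usage sample. */
  public void record(String currentTaskId, long currentBytes) {
    if (recordedTaskId != null && !recordedTaskId.equals(currentTaskId)) {
      // Task changed: the previously recorded task is finished, merge it.
      partiallyFinishedBytes.merge(queryIdOf(recordedTaskId), recordedBytes, Long::sum);
    }
    // Record the new task status and usage stats.
    recordedTaskId = currentTaskId;
    recordedBytes = currentBytes;
  }

  /** Aggregate usage per query: partially finished tasks plus the in-progress sample. */
  public Map<String, Long> aggregateByQuery() {
    Map<String, Long> result = new HashMap<>(partiallyFinishedBytes);
    if (recordedTaskId != null) {
      result.merge(queryIdOf(recordedTaskId), recordedBytes, Long::sum);
    }
    return result;
  }

  private static String queryIdOf(String taskId) {
    // Task ids look like "Q2_1": everything before '_' identifies the query.
    int idx = taskId.indexOf('_');
    return idx < 0 ? taskId : taskId.substring(0, idx);
  }
}
```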

  33. Query Usage Aggregation Algorithm
    Global Stats Aggregation
    ● Handles fixed and non-fixed threadpools
    ● Lives outside of the query code path
    ● Sampling: only tracks big queries, ignoring short-lived ones
    ● Uses O(threads) space/time, not proportional to QPS
    ● Returns the killing code & usage info
    [The query usage aggregation flowchart from the previous slides is shown alongside]

  34. Agenda: Query Killing
    1. Motivation and Challenges
    2. Design
    3. Results after prod rollout at LinkedIn

  35. Production Results - Overhead, Observability, Perf Tuning
    Negligible Overhead
    ● Overhead = 1% (filtered) x 35.987% = 0.3%
    Observability
    ● Publish heap usage; alert on the broker and server `queryKilled` metric
    ● Internal dashboard filtering the killed and top resource-intensive queries from centralized logs, grouped by unique request ids
    ● Return the killing message to the customer and warn them not to retry
    Perf Optimization
    ● G1GC can be quite 'lazy' and cause heap usage shoot-ups & long major GC pauses
    ● Shenandoah GC (SGC) keeps the heap usage lower
    ● SGC helps eliminate the risk of false positives and potentially improves tail latency

  36. Outcome Highlight
    ~10 queries triggered OOMs per quarter
    > 85%: prevented OOM crashes and the cascading impact of resource-intensive queries by killing more than 85% of such queries

  37. Outcome Highlight
    Before: 40 hrs per quarter spent triaging OOMs
    After: < 4 hrs, a more than 90% toil reduction to (1) identify resource-intensive queries and (2) root-cause OOM crashes and chase culprit queries

  38. Future Work
    1. Fair Scheduler & Workload Management
    2. Query Killing: killing decision propagation
    3. Adaptive Server Selection: enriched stats & enhanced server selection algorithms
    4. Query Admission Control

  39. Thank you!
    Looking forward to these resiliency features being used and improved
    Adaptive Server Selection Doc
    Query Killing Doc
