Query Processing Resiliency in Pinot (Vivek Iyer & Jia Guo, LinkedIn) | RTA Summit 2023

As with any distributed system, several hosts are at play from the moment Pinot receives a query request to the moment the user receives a response. As the number of these hosts increases, the blast radius for failures also increases. LinkedIn’s large Pinot deployment faced a number of resiliency issues related to failures, slowness, and undesirable user access patterns. Providing a robust query processing framework that is resilient to such issues therefore becomes paramount. To improve the resiliency and stability of our production clusters, we strengthened query processing resiliency in Pinot by providing two main capabilities:

1. Adaptive Server Selection at Pinot Broker to route queries intelligently to the best available servers
2. Runtime Query Killing to protect the Pinot Servers from crashing

Adaptive Server Selection

When the Pinot Broker routes a query, it can choose among the servers that host each segment's replicas, as determined by the replication factor. Pinot currently uses a light-weight server-selection layer that follows a naive round-robin approach. Pinot Servers do the bulk of the heavy lifting during query processing, so selecting the right server(s) to process a query has a significant impact on query latency (e.g., a single slow server can become the overall bottleneck).

LinkedIn developed an intelligent yet light-weight server selection framework to adaptively route queries to the best available server(s). The framework collects various stats about the health and capabilities of each server, such as server latency and the number of in-flight requests. Using these server statistics, the broker now intelligently routes queries to the best available server.
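
To make the idea concrete, here is a minimal, hypothetical sketch (class and field names are ours, not Pinot's actual implementation) of a broker keeping in-memory per-server stats and preferring the least-loaded replica:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of broker-side per-server stats; not Pinot's actual classes.
final class ServerStats {
    volatile int inFlightRequests;   // requests sent but not yet answered
    volatile double latencyEwmaMs;   // exponentially weighted moving average of latency
}

final class AdaptiveRouter {
    private final Map<String, ServerStats> statsByServer = new ConcurrentHashMap<>();

    /** Pick the replica with the fewest in-flight requests, breaking ties by latency EWMA. */
    String pickServer(List<String> candidateServers) {
        String best = candidateServers.get(0);
        for (String server : candidateServers) {
            ServerStats s = statsByServer.computeIfAbsent(server, k -> new ServerStats());
            ServerStats b = statsByServer.computeIfAbsent(best, k -> new ServerStats());
            if (s.inFlightRequests < b.inFlightRequests
                || (s.inFlightRequests == b.inFlightRequests && s.latencyEwmaMs < b.latencyEwmaMs)) {
                best = server;
            }
        }
        return best;
    }
}
```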

LinkedIn has rolled out this framework to its entire production deployment. In the first part of this presentation, we cover the design, implementation details, and the benefits observed in LinkedIn production. The framework has improved the overall availability of our deployment and significantly reduced the need for manual intervention.

Query Pre-emption

In the second part of the presentation, we plan to cover the Query Killing framework design, instrumentation instructions, production benefits, and future extensions.

Pinot has successfully supported high-QPS analytical queries on large data sets within LinkedIn. However, resource usage accounting and control remain a challenge. A typical scenario: when unplanned expensive queries or ad-hoc data exploration queries land, Pinot brokers and servers can be brought down by out-of-memory errors or held up by a handful of CPU-intensive queries. To deal with this, we introduced a light-weight yet responsive framework to account for the CPU and memory footprint of each query. The framework employs a lockless thread usage sampling algorithm that can easily instrument generic multi-threaded query execution code with an adjustable accuracy-overhead trade-off. The usage stats sampled from all threads serving the same query are periodically aggregated and ranked. On top of these collected usage numbers, we implemented a mechanism to interrupt culprit queries that degrade the cluster's availability or latency SLA. Our experiments and production numbers demonstrate that this framework can effectively find and kill offending queries while incurring minimal performance overhead.
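
The "lockless thread usage sampling" described above can be illustrated with a small sketch; the names and structure below are assumptions for illustration, not the framework's real classes. Each worker thread publishes its own usage into volatile fields that an accounting thread reads periodically, so no locks sit on the query execution path:

```java
// Hypothetical sketch of lock-free per-thread usage reporting; names are illustrative.
final class ThreadResourceUsage {
    // Written only by the owning worker thread, read by the accounting thread.
    private volatile long allocatedBytes;
    private volatile long cpuTimeNanos;

    void report(long allocatedBytes, long cpuTimeNanos) {   // called from the worker thread
        this.allocatedBytes = allocatedBytes;
        this.cpuTimeNanos = cpuTimeNanos;
    }

    long getAllocatedBytes() { return allocatedBytes; }     // called from the accounting thread
    long getCpuTimeNanos() { return cpuTimeNanos; }
}
```

Because each field has a single writer, plain volatile writes are enough to make the latest sample visible to the accounting thread without any synchronization.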

StarTree

May 23, 2023

Transcript

  1. Strengthening Pinot’s Query Processing Engine with Adaptive Server Selection and Runtime Query Killing.
    Vivek Iyer Vaidyanathan, Software Engineer, LinkedIn; Jia Guo, Software Engineer, LinkedIn.
    Query Processing Resiliency, Real-Time Analytics Summit 2023
  2. Agenda (Adaptive Server Selection): Problem Statement, Design, Regression Benchmarking, Results after prod rollout at LinkedIn
  3. Problem Statement: Query Routing in Pinot
    • Pinot Servers host the segments that contain the data to be queried
    • Each segment is hosted on multiple servers, controlled by the replication factor
    • The Pinot Broker receives the query
    • The Broker uses a round-robin approach to pick the servers that process a query
    Issues with Round-robin Routing:
    • Pinot Servers are susceptible to both transient and permanent slowness issues: GC pauses, network issues, and disk failures
    • With round-robin selection, queries are sent to servers regardless of server performance, which can result in slower or failed responses
    • A more intelligent approach is needed to optimize server selection and improve overall performance
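
For contrast with the adaptive approach discussed later, the round-robin baseline criticized on this slide can be sketched as follows (illustrative code, not Pinot's actual selector):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative round-robin replica selection: it ignores how loaded or slow each server is.
final class RoundRobinSelector {
    private final AtomicInteger counter = new AtomicInteger();

    String select(List<String> replicaServers) {
        int idx = Math.floorMod(counter.getAndIncrement(), replicaServers.size());
        return replicaServers.get(idx);
    }
}
```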
  4. Scale @ LinkedIn: 250K+ Queries Per Second, 5000+ Pinot Server Hosts, 1400+ Pinot Broker Hosts, 4500+ Pinot Tables
  5. Pain Points @ LinkedIn Scale
    • The Pinot team spends ~72+ engineering hours every quarter troubleshooting server slowness issues
    • Customers face availability degradation when there are Pinot server failures
    • Latency SLAs are breached when Pinot Servers experience slowness or failures - 30+ events per quarter
    • Elevated urgency when triaging latency-increase alerts
  6. Agenda (Adaptive Server Selection): Problem Statement, Design, Regression Benchmarking, Results after prod rollout at LinkedIn
  7. Design Components: Smarter Query Routing at the Broker
    Stats Collector:
    • Local to each broker; stats are stored in-memory
    • Per-server stats maintained: # of in-flight queries, EWMA of in-flight queries, EWMA of observed latencies
    Server Selector:
    • Uses an intelligent selection policy to pick the best server; decisions are made based on local state
    • Selection policies supported: # of in-flight queries, latencies, or a sophisticated hybrid policy using both latency and # of in-flight queries
  8. Workflow
    [Diagram: queries Q1-Q4 arrive at the Pinot Broker, whose Server Selector and Stats Collector route them to Pinot Servers S1-S3 (Q1 -> S1, Q2 -> S2, Q3 -> S2, Q4 -> S3) while per-server stats flow back to the Stats Collector.]
    Stats Collector:
    • Async stats collection for minimal overhead
    • Before routing to a server, the # of in-flight requests is updated
    • After receiving the response from a server, the latency and # of in-flight requests are updated
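
A sketch of how the per-server stats from slides 7 and 8 could be maintained; the smoothing factor, method names, and synchronization choices are illustrative assumptions rather than Pinot's actual code:

```java
// Illustrative EWMA-based stats entry maintained by the broker for one server.
final class ServerStatsEntry {
    private static final double ALPHA = 0.3;  // assumed smoothing factor

    private int inFlightRequests;
    private double inFlightEwma;
    private double latencyEwmaMs;

    // Called just before the broker routes a query to this server.
    synchronized void onQuerySent() {
        inFlightRequests++;
        inFlightEwma = ALPHA * inFlightRequests + (1 - ALPHA) * inFlightEwma;
    }

    // Called when the broker receives this server's response.
    synchronized void onResponseReceived(double latencyMs) {
        inFlightRequests--;
        inFlightEwma = ALPHA * inFlightRequests + (1 - ALPHA) * inFlightEwma;
        latencyEwmaMs = ALPHA * latencyMs + (1 - ALPHA) * latencyEwmaMs;
    }

    synchronized int getInFlightRequests() { return inFlightRequests; }
    synchronized double getInFlightEwma() { return inFlightEwma; }
    synchronized double getLatencyEwmaMs() { return latencyEwmaMs; }
}
```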
  9. Server Selection Policy
    • Minimal overhead to query processing
    • Quick Detection: must quickly detect slow servers and tune down the QPS sent to them
    • Auto Recovery: must cope and react quickly when servers recover
    • Well-behaved: must avoid herd behaviors
    • Used independently, signals like the current queue size and latency are raw and delayed
    [Diagrams: before/after routing around a slow server, auto recovery once the server is healthy again, and the herd behavior to avoid.]
  10. Hybrid Server Selector
    • Rank each server based on a score; pick the server with the lowest score
    • EWMA smoothens the values, giving priority to changes in the recent past
    • Avoids herd behavior by forecasting the future queue size for a server
    • Reacts faster to changes on the server by using an exponential function
    ServerScore = (estimated_queue)^N x Latency_EWMA, where estimated_queue = Q + Q_EWMA + 1
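
A minimal sketch of the score on this slide; the exponent value below is purely illustrative, since the actual N used in production is not stated here:

```java
// Illustrative computation of the hybrid server score:
// score = (estimated_queue)^N * latencyEWMA, with estimated_queue = Q + Q_EWMA + 1.
final class HybridScore {
    static double serverScore(int inFlight, double inFlightEwma, double latencyEwmaMs, int n) {
        double estimatedQueue = inFlight + inFlightEwma + 1;
        return Math.pow(estimatedQueue, n) * latencyEwmaMs;
    }

    public static void main(String[] args) {
        int n = 3;  // assumed exponent, for illustration only
        // A slow server (longer queue, higher latency EWMA) gets a much higher score and is avoided.
        System.out.println(serverScore(2, 1.5, 20.0, n));   // healthy server
        System.out.println(serverScore(6, 5.0, 180.0, n));  // slow server
    }
}
```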
  11. Agenda (Adaptive Server Selection): Problem Statement, Design, Regression Benchmarking, Results after prod rollout at LinkedIn
  12. Agenda (Adaptive Server Selection): Problem Statement, Design, Regression Benchmarking, Results after prod rollout at LinkedIn
  13. Outcome Highlight: number of slow-server latency alerts per quarter dropped from 60 (before) to 2 (after)
  14. Outcome Highlight: engineering hours spent debugging transient server slowness issues dropped from 72 (before) to 8 (after)
  15. Prevention of Latency Degradation
    • Slowness on a single server causes latency degradation for ~33.33% of queries when RG = 3, since round-robin routing spreads queries evenly across the 3 replicas and roughly one in three queries lands on the slow one
    • Adaptive Server Selection reduces the chance of latency degradation when one or more servers slow down. At LinkedIn, this work prevented latency degradation during server slowness events for more than 90% of queries in production.
  16. Agenda (Query Killing): Motivation and Challenges, Design, Results after prod rollout at LinkedIn
  17. Pain Points @ LinkedIn
    • A CPU/memory-intensive query can silently slow down other queries before it actually triggers an OOM
    • Users face SLA breaches and large-scale availability degradation
    • The user gets no proper warning for expensive queries
    • OOM exceptions are hard to triage or reproduce after the fact, and our oncalls typically spend ~4 hours root-causing each occurrence
  18. Problem Statement
    Goals:
    • OOM protection for servers and brokers
    • Kill high-risk queries on the fly
    Challenges:
    • No runtime memory tracking for Pinot queries
    • Java’s opaque memory management
    • Overhead of memory accounting
    • Correctly aggregating accounting stats for multi-threaded execution with minimal overhead
    [Diagram: OOM can occur on the broker during the gather phase or on any server during the scatter phase.]
  19. Agenda (Query Killing): Motivation and Challenges; Design: Stats Collection, Global Accounting; Results after prod rollout at LinkedIn
  20. Getting Usage Statistics with Instrumentation: generic status/usage reporting for multi-threaded query execution code
    • Tree-like runtime query status context: a runner thread plus worker threads handling tasks Q1_0 ... Q1_4 (Q1_0 denotes task 0 for query 1)
    • Per-thread stats sample kept in a volatile primitive; per-thread task status kept in an AtomicReference
    • Thread-reported generic usages, e.g. ThreadMXBean.getThreadAllocatedBytes()
    • Lock-free: low overhead
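
The ThreadMXBean call named on this slide is part of the JDK's HotSpot-specific management API; here is a minimal sketch of sampling the current thread's usage (the wrapper class is ours, only the MXBean calls come from the slide and the JDK):

```java
import java.lang.management.ManagementFactory;

// Sketch: sampling the current thread's allocated bytes and CPU time for usage reporting.
final class ThreadUsageSampler {
    private static final com.sun.management.ThreadMXBean MX_BEAN =
        (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

    static long currentThreadAllocatedBytes() {
        // Cumulative bytes allocated by this thread; callers typically diff two samples.
        return MX_BEAN.getThreadAllocatedBytes(Thread.currentThread().getId());
    }

    static long currentThreadCpuTimeNanos() {
        return MX_BEAN.getCurrentThreadCpuTime();
    }
}
```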
  21. Getting Usage Statistics with Instrumentation: query execution thread (worker thread as an example)
    Setup Query Status -> Work on a block of data -> Report Usage -> Finished? If no, work on the next block; if yes, Return Result
  22. Getting Usage Statistics with Instrumentation: setting up query status, example of a worker thread
    • Worker threads get the context from the runner thread
    • Set up the context when a worker thread spawns
    • Clear the context when a worker thread finishes
  23. Getting Usage Statistics with Instrumentation: instrumenting query execution code, example of segment processing code
    Operator chain: DocIdSetOperator -> ProjectionOperator -> TransformOperator -> (a block of data) -> GroupByOperator / AggregationFunction
    One-shot usage collection: inject 1 line of code in the operator execution codepath
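
The "inject 1 line of code" idea can be pictured as follows; the operator and publish method are hypothetical stand-ins, not the Pinot operators named on the slide:

```java
import java.lang.management.ManagementFactory;

// Illustrative operator with a single usage-sampling call injected into its execution path.
final class InstrumentedOperator {
    private static final com.sun.management.ThreadMXBean MX_BEAN =
        (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

    Object getNextBlock() {
        // The single injected line: one-shot usage collection in the operator codepath.
        publishSample(MX_BEAN.getThreadAllocatedBytes(Thread.currentThread().getId()));
        return produceBlock();
    }

    private void publishSample(long allocatedBytes) {
        // In the real framework this would write into the per-thread usage slot
        // read by the accounting thread; here it is just a placeholder.
    }

    private Object produceBlock() {
        return new Object();  // placeholder for the actual segment-processing work
    }
}
```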
  24. High-Level Query Killing Workflow: building accounting/killing upon execution instrumentation
    Query execution thread: set up task status -> work on a block of data -> report usage -> check status; if killed, return with an error, otherwise continue operator execution and return the result when finished
    Accounting thread: record thread status/usage -> aggregate usage by query -> OOM risk? If yes, terminate (kill the threads of) the query with the most memory; if no, reschedule after X ms
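
A hedged sketch of the accounting loop on this slide; the heap threshold, sleep interval, and kill mechanism are all assumptions for illustration:

```java
import java.util.Map;

// Illustrative accounting loop: aggregate usage per query and kill the most expensive
// query when heap pressure looks dangerous. All thresholds and names are assumptions.
final class QueryAccountant implements Runnable {
    private final long heapDangerBytes = (long) (Runtime.getRuntime().maxMemory() * 0.9);

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            Map<String, Long> usageByQuery = aggregateUsageByQuery();  // bytes per query id
            long usedHeap = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
            if (usedHeap > heapDangerBytes && !usageByQuery.isEmpty()) {
                String victim = usageByQuery.entrySet().stream()
                    .max(Map.Entry.comparingByValue())
                    .get().getKey();
                interruptQueryThreads(victim);  // workers observe the flag and return with an error
            }
            try {
                Thread.sleep(50);  // reschedule after X ms
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    private Map<String, Long> aggregateUsageByQuery() { return Map.of(); }  // placeholder
    private void interruptQueryThreads(String queryId) { /* placeholder */ }
}
```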
  25. Query Usage Accounting Algorithm (for simplicity, only 1 thread from the thread pool is shown)
    Thread T1: current task Q3_4 with current usage sample 200B; recorded task Q2_1 with recorded sample 3000B. Partially finished aggregator: Q2: 400B, Q3: 5000B.
    Inspect the thread task status: is the task the same as the previous one? No (Q2_1 ≠ Q3_4), so merge the finished task into the `partially finished` aggregator, giving Q2: 400B + 3000B.
  26. Query Usage Accounting Algorithm (continued)
    Since the task changed, record the new usage stats (200B) and the new task status (Q3_4) for T1. Partially finished aggregator: Q2: 3400B, Q3: 5000B.
  27. Query Usage Accounting Algorithm (continued)
    Aggregate usage per query: the partially finished totals (Q2: 3400B, Q3: 5000B) plus the live samples give Aggregated Q2: 3400B, Q3: 5000B + 200B. Then check for OOM risk and act if needed.
  28. Global Stats Aggregation (Query Usage Aggregation Algorithm)
    • Handles fixed and non-fixed thread pools
    • Lives outside of the query code path
    • Sampling: only tracks big queries, ignoring short-lived ones
    • Uses O(#threads) space/time, not proportional to QPS
    • Returns the kill code and usage info
    (Flowchart as in the previous slides: inspect thread task status; if the task changed, merge the finished task into the `partially finished` aggregator and record the new task status; record new usage stats; aggregate usage per query; check OOM risk.)
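
A compact sketch of the bookkeeping walked through on slides 25-28; the types and the task-to-query mapping are assumptions, but the structure follows the flowchart: if a thread's task changed, fold its previously recorded sample into that query's partially-finished total, then record the new task and sample, and report partial totals plus live samples per query:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative per-thread bookkeeping for the usage accounting algorithm.
final class UsageAggregator {
    static final class ThreadSample {
        final String taskId;  // e.g. "Q2_1" = task 1 of query Q2
        final long bytes;     // cumulative usage reported for this task
        ThreadSample(String taskId, long bytes) { this.taskId = taskId; this.bytes = bytes; }
    }

    private final Map<Long, ThreadSample> recordedByThread = new HashMap<>();  // last seen per thread
    private final Map<String, Long> partiallyFinished = new HashMap<>();       // bytes per query id

    /** One accounting pass over the current samples; returns total usage per query. */
    Map<String, Long> aggregate(Map<Long, ThreadSample> currentByThread) {
        Map<String, Long> totals = new HashMap<>(partiallyFinished);
        for (Map.Entry<Long, ThreadSample> e : currentByThread.entrySet()) {
            ThreadSample current = e.getValue();
            ThreadSample recorded = recordedByThread.get(e.getKey());
            if (recorded != null && !recorded.taskId.equals(current.taskId)) {
                // Task changed: the recorded task is finished, fold it into the partial total.
                partiallyFinished.merge(queryOf(recorded.taskId), recorded.bytes, Long::sum);
                totals.merge(queryOf(recorded.taskId), recorded.bytes, Long::sum);
            }
            recordedByThread.put(e.getKey(), current);                        // record new task + usage
            totals.merge(queryOf(current.taskId), current.bytes, Long::sum);  // add the live sample
        }
        return totals;
    }

    private static String queryOf(String taskId) {  // "Q2_1" -> "Q2"
        return taskId.substring(0, taskId.indexOf('_'));
    }
}
```

Running the slide's example through this sketch reproduces the numbers shown: Q2 ends up at 400B + 3000B = 3400B and Q3 at 5000B + 200B.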
  29. Agenda (Query Killing): Motivation and Challenges, Design, Results after prod rollout at LinkedIn
  30. Production Results - Overhead, Observability, Perf Tuning
    Negligible overhead: Overhead = 1% (filtered) * 35.987% = 0.3%
    Observability:
    • Publish heap usage; alert on the broker and server `queryKilled` metric
    • Internal dashboard filtering the killed and top resource-intensive queries from centralized logs, grouped by unique request IDs
    • Return the kill message to the customer with a warning not to retry
    Perf Optimization:
    • G1GC can be quite 'lazy' and cause heap usage to shoot up along with long major GC pauses
    • Shenandoah GC (SGC) keeps heap usage lower
    • SGC helps eliminate the risk of false positives and potentially improves tail latency
  31. Outcome Highlight: ~10 queries trigger OOM per quarter; more than 85% of such queries were killed, preventing OOM crashes and the cascading impact of resource-intensive queries
  32. Outcome Highlight: time spent triaging OOMs dropped from 40 hours per quarter to under 4 hours, a more than 90% toil reduction in (1) identifying resource-intensive queries and (2) root-causing OOM crashes and chasing culprit queries
  33. Future Work: (1) Fair Scheduler & Workload Management, (2) Query Killing: killing-decision propagation, (3) Adaptive Server Selection: enriched stats & enhanced server selection algorithms, (4) Query Admission Control
  34. Thank you! Looking forward to these resiliency features being used and improved.
    Adaptive Server Selection Doc | Query Killing Doc