Powering Uber Marketplace’s Real-Time Data Needs with Elasticsearch

Uber, 2017-03-09
Jae Hyeon Bae, Isaac Brodsky

HOW MANY UBERXS WERE OPEN IN SF IN THE PAST 10 MINUTES IN THE FINANCIAL DISTRICT?

Dynamic Pricing: Supply & Demand
• Different granularities across different time ranges, in real time

Recommendations to route partners
• What are the conditions in different parts of the marketplace?
• Should we recommend some actions?

Marketplace Health and Data Science
• How can we measure and visualize the efficiency of the marketplace?
• Dynamic pricing mean factors
• Completed/Requested
• Estimated Time to Arrive/Actual Time to Arrive
• Time: right now and past several months
• Location: city scope to hexagon


TECHNICAL CHALLENGES

Complex Data Models
[Figure: trip state diagrams. States: Looking, Finding nearby cars, Waiting for pickup, On trip, Completed; Open, Requested, En route, On trip, Completed. Transitions: Request, Accept, Start, Complete, Cancel.]

Fine-Grained Geo/Temporal
• 1440 minutes in a day

Combinatorial Explosion of Dimensions
[Figure] x trip states (Open, Requested, En route, On trip, Completed) x …

REQUIREMENTS

Queries: Flexible Boolean Filters
"FILTER": {
  "TYPE": "AND",
  "FIELDS": [
    { "TYPE": "IN", "DIMENSION": "CITY", "VALUES": ["17"] },
    { "TYPE": "EQ", "DIMENSION": "STATUS", "VALUE": "OPEN" },
    …

Queries: Various Aggregation Features
"BY": ["HEXAGON_ID", "STATUS"],
"AGGREGATIONS": [
  { "FUNCTION": "SUM", "METRIC": "APP_OPENED" },
  { "FUNCTION": "AVG", "METRIC": "APP_CLOSED" }
]
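The filter and aggregation fragments on the two query slides could plausibly be translated into an Elasticsearch search body along the following lines. This is a minimal sketch, not the actual Uber query layer: the function names are hypothetical, only the node types shown on the slides (AND, IN, EQ, SUM, AVG) are handled, dimension and metric names are assumed to map one-to-one to Elasticsearch field names, and grouping by the "BY" fields is left out to keep the sketch short.

```python
# Hypothetical translation of the slide's query fragments into Elasticsearch
# Query DSL. Field names are assumed to equal the dimension/metric names.

ES_METRIC_AGGS = {"SUM": "sum", "AVG": "avg"}


def to_es_filter(node):
    """Translate one node of the filter tree into an Elasticsearch query clause."""
    kind = node["TYPE"]
    if kind == "AND":
        return {"bool": {"filter": [to_es_filter(f) for f in node["FIELDS"]]}}
    if kind == "IN":
        return {"terms": {node["DIMENSION"]: node["VALUES"]}}
    if kind == "EQ":
        return {"term": {node["DIMENSION"]: node["VALUE"]}}
    raise ValueError(f"unsupported filter type: {kind}")


def to_es_metric_aggs(aggregations):
    """Translate the AGGREGATIONS list into Elasticsearch metric aggregations."""
    aggs = {}
    for spec in aggregations:
        func = ES_METRIC_AGGS[spec["FUNCTION"]]
        field = spec["METRIC"]
        aggs[f"{func}_{field}"] = {func: {"field": field}}
    return aggs


def to_es_body(query):
    """Build a search body that returns aggregations only (no hits)."""
    return {
        "size": 0,
        "query": to_es_filter(query["FILTER"]),
        "aggs": to_es_metric_aggs(query["AGGREGATIONS"]),
    }


example_query = {
    "FILTER": {
        "TYPE": "AND",
        "FIELDS": [
            {"TYPE": "IN", "DIMENSION": "CITY", "VALUES": ["17"]},
            {"TYPE": "EQ", "DIMENSION": "STATUS", "VALUE": "OPEN"},
        ],
    },
    "AGGREGATIONS": [
        {"FUNCTION": "SUM", "METRIC": "APP_OPENED"},
        {"FUNCTION": "AVG", "METRIC": "APP_CLOSED"},
    ],
}
es_body = to_es_body(example_query)
```

Keeping the public query format this small means the query layer, rather than each client, decides how a request is expressed in Elasticsearch.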

Queries: Time-Series Analysis

Ingestion: Data Pipeline Integration
• Low latency: end-to-end ingestion within a few seconds

Data Model
• Single doc, single metric
• Entity, idempotent
• Join

Scalable/Extendible
• Linearly scalable
• Data retention: TTL
• Monitoring, interoperability

MULTIDIMENSIONAL GEO/TEMPORAL REALTIME ANALYTICS

Key-Value Stores: Cassandra
Scalability and high availability without compromising performance
(Cassandra feature vs. Elasticsearch counterpart)
• CompactionStrategy vs. index deletion
• Lightweight transaction vs. upsert with scripting
• ALLOW FILTERING, restricted IN vs. Query DSL
• Logged batch mutation vs. Bulk API

Druid
High-performance, column-oriented, distributed data store
• Pros:
  • Native time-series support with data roll-up
  • Very efficient ingestion/query
  • Batch indexing through Hadoop, index hand-over
• Cons:
  • Eight moving pieces: Historical, Broker, Coordinator, Indexing Service, Realtime, HDFS, Metadata Storage, ZooKeeper
  • Strict immutability will require complex stream processing

ELASTICSEARCH
DISTRIBUTED, RESTFUL SEARCH AND ANALYTICS, CAPABLE OF SOLVING A GROWING NUMBER OF USE CASES

Document Storage: Double-Edged Sword
Mutable data model, the killer feature
• Idempotent: uniqueness/exactly once
• Join
• Performance and scalability issues on large-scale aggregations

Data Model
How to calculate the Completed/Requested ratio with two different event streams?

Data Model: With ETL/Query Join
• Stream processing should merge them: what if the trip takes an hour?
• Query should join them: what if we want to aggregate by T1 and L1?
[Figure: Requested and Completed event streams joined by ETL or at query time]

Data Model: Entity
With a join on the same trip ID: the trip document gets a ‘completed:true’ flag on the Trip Completed event, using upsert.
[Figure: Requested and Completed events merged into one Trip Document]
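The entity model above can be sketched as a partial-document upsert keyed by trip ID. The example below is an illustration under stated assumptions, not the talk's actual pipeline: the event shape (`trip_id`, `type`), the flag names and the helper functions are hypothetical, and it uses the Update API's `doc` plus `doc_as_upsert` form, whereas a scripted upsert would be another option. Older Elasticsearch versions also expect a `_type` in the action line. A small in-memory stand-in shows why replayed events do not change the Completed/Requested ratio.

```python
import json

# Hypothetical mapping from event type to the flag set on the trip document.
FLAGS = {"requested": {"requested": True}, "completed": {"completed": True}}


def trip_upsert(index, event):
    """Build the bulk action and body that merge one trip event into its trip document."""
    if event["type"] not in FLAGS:
        raise ValueError(f"unknown event type: {event['type']}")
    action = {"update": {"_index": index, "_id": event["trip_id"]}}
    body = {
        "doc": {"trip_id": event["trip_id"], **FLAGS[event["type"]]},
        "doc_as_upsert": True,  # create the document if it does not exist yet
    }
    return action, body


def bulk_payload(index, events):
    """Serialize events as newline-delimited JSON for the Bulk API."""
    lines = []
    for event in events:
        lines.extend(json.dumps(part) for part in trip_upsert(index, event))
    return "\n".join(lines) + "\n"


def apply_locally(store, action, body):
    """Mimic a partial-document upsert against an in-memory dict (illustration only)."""
    store.setdefault(action["update"]["_id"], {}).update(body["doc"])


def completed_ratio(store):
    """Completed/Requested over the trip documents in the store."""
    requested = sum(1 for doc in store.values() if doc.get("requested"))
    completed = sum(1 for doc in store.values() if doc.get("completed"))
    return completed / requested if requested else 0.0


events = [
    {"trip_id": "t1", "type": "requested"},
    {"trip_id": "t2", "type": "requested"},
    {"trip_id": "t1", "type": "completed"},
    {"trip_id": "t1", "type": "completed"},  # duplicate delivery
]
store = {}
for event in events:
    apply_locally(store, *trip_upsert("trips", event))
ratio = completed_ratio(store)
payload = bulk_payload("trips", events)
```

Because each event only sets a flag on the document identified by the trip ID, delivering the same event twice leaves the document unchanged.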

Architecture
[Figure: architecture diagram with Marketplace Services, Kafka, Samza, Spark, HDFS, Elasticsearch, Storage Manager and Query Layer]

Data Pipeline
• Samza for Kafka consumer
• Bulk indexing client with RestClient from the Elasticsearch-Hadoop connector
• Spark-Elasticsearch
• Configurable to ignore a few exceptions: document already exists, version conflict, document missing
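Ignoring a configurable set of exceptions could look roughly like the filter below, applied to a parsed Bulk API response. This is a sketch under assumptions: the function name is hypothetical, the response is assumed to carry structured errors with a `type` field (as in Elasticsearch 2.x and later), and the exact error type strings vary between Elasticsearch versions, so the set should be checked against the version in use.

```python
# Error types treated as harmless replays or races. The names are assumptions
# and differ between Elasticsearch versions.
IGNORABLE_ERRORS = frozenset({
    "document_already_exists_exception",
    "version_conflict_engine_exception",
    "document_missing_exception",
})


def fatal_bulk_errors(response, ignorable=IGNORABLE_ERRORS):
    """Return (operation, document id, error type) for every non-ignorable bulk failure."""
    fatal = []
    for item in response.get("items", []):
        # Each item is a single-key dict such as {"index": {...}} or {"update": {...}}.
        for operation, result in item.items():
            error = result.get("error")
            if error and error.get("type") not in ignorable:
                fatal.append((operation, result.get("_id"), error.get("type")))
    return fatal


sample_response = {
    "errors": True,
    "items": [
        {"update": {"_id": "t1", "status": 200}},
        {"update": {"_id": "t2", "status": 409,
                    "error": {"type": "version_conflict_engine_exception", "reason": "..."}}},
        {"index": {"_id": "t3", "status": 400,
                   "error": {"type": "mapper_parsing_exception", "reason": "..."}}},
    ],
}
fatal = fatal_bulk_errors(sample_response)
```

Only the remaining failures would then need a retry or an alert, while expected conflicts from replayed events are dropped.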

Cluster Deployment
[Figure: Kafka, Samza and Spark jobs, Elasticsearch Tier 0, Tier 1 and Tier 2 clusters, each behind a Query Layer]

Query Planning: Routing
• Route queries to different Elasticsearch clusters.
• Clusters for different data sources, SLAs, time ranges...
[Figure: Tier 1 Application, Query Layer, Elasticsearch Tier 0 and Tier 1, Samza]

Query Planning: Routing
"DEMAND": {
  "CLUSTERS": {
    "TIER0": { "CLUSTERS": ["ES_CLUSTER_TIER0"] },
    "TIER2": { "CLUSTERS": ["ES_CLUSTER_TIER2"] }
  },
  "INDEX": "MARKETPLACE_DEMAND-",
  "SUFFIXFORMAT": "YYYYMM.WW",
  "ROUTING": "PRODUCT_ID"
}

Query Planning: Reducing Complexity of Queries
• Group by A, then B:
  "aggs": { "a": { "terms": { "field": "a" }, "aggs": { "b": { "terms": { "field": "b" } } } } }
• Or:
  "by": ["a", "b"]
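Expanding the short "by" list into the nested terms aggregations shown above can be sketched in a few lines. The function name and the optional `leaf_aggs` parameter are hypothetical; the sketch assumes each "by" entry is used both as the aggregation name and as the field name, as on the slide.

```python
def by_to_aggs(by, leaf_aggs=None):
    """Expand ["a", "b"] into nested terms aggregations: a, then b inside a."""
    aggs = dict(leaf_aggs or {})
    # Build from the innermost field outwards so the first field ends up on top.
    for field in reversed(by):
        node = {"terms": {"field": field}}
        if aggs:
            node["aggs"] = aggs
        aggs = {field: node}
    return aggs


nested = by_to_aggs(["a", "b"])
with_metric = by_to_aggs(["a"], {"total": {"sum": {"field": "count"}}})
```

The list order determines nesting depth, so `["a", "b"]` and `["b", "a"]` produce different bucket hierarchies over the same data.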

Query Planning: Prohibiting Dangerous Queries
• Terms * Terms * Terms... → OutOfMemoryError
• Estimate the cardinality of queries before executing, enforce sane limits
  "cityId": { "cardinality": 500 }, "status": { "cardinality": 10 }

Query Planning: Estimating Query Cardinality
"DEMAND": {
  "FIELDS": {
    "DIMENSIONS": {
      "CITY": { "CARDINALITY": 10 },
      "HEXAGON_ID": { "CARDINALITY": 10000, "REQUIREDFILTERS": ["PRODUCT_ID", "CITY"] },
      "PRODUCT_ID": { "CARDINALITY": 100 }
    },
    "METRICS": { "COUNT": NULL, "ETA": NULL }
  }
}
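A pre-execution check based on this configuration might multiply the declared cardinalities of the group-by dimensions and reject the query when the product exceeds a limit or a required filter is missing. The sketch below is an assumption about how such a check could work, not the talk's implementation: the function and exception names and the `MAX_BUCKETS` limit are hypothetical, and only the cardinalities come from the slide.

```python
# Dimension metadata copied from the slide's "DEMAND" configuration.
DIMENSIONS = {
    "CITY": {"CARDINALITY": 10},
    "HEXAGON_ID": {"CARDINALITY": 10000, "REQUIREDFILTERS": ["PRODUCT_ID", "CITY"]},
    "PRODUCT_ID": {"CARDINALITY": 100},
}
MAX_BUCKETS = 100_000  # hypothetical limit, not from the talk


class QueryRejected(ValueError):
    """Raised when a query looks too expensive or lacks a required filter."""


def estimate_buckets(by, filtered_dimensions, dimensions=DIMENSIONS, max_buckets=MAX_BUCKETS):
    """Upper-bound the number of buckets as the product of group-by cardinalities."""
    estimate = 1
    for name in by:
        spec = dimensions[name]
        missing = set(spec.get("REQUIREDFILTERS", [])) - set(filtered_dimensions)
        if missing:
            raise QueryRejected(f"{name} requires filters on {sorted(missing)}")
        estimate *= spec["CARDINALITY"]
    if estimate > max_buckets:
        raise QueryRejected(f"estimated {estimate} buckets exceeds {max_buckets}")
    return estimate


small = estimate_buckets(["CITY", "PRODUCT_ID"], filtered_dimensions=[])
hexagons = estimate_buckets(["HEXAGON_ID"], filtered_dimensions=["PRODUCT_ID", "CITY"])
```

The product is a worst-case bound, since filters usually shrink the real bucket count, but it is cheap to compute before any shard is touched.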

Query Planning: Prohibiting Dangerous Queries
• Arbitrary scripts → halting problem
• Only permit trusted scripts
  while(true) { print ':(' }

Index Management: Delete Outdated Indices
• Elasticsearch Curator: a few difficulties getting along with it
• Lesson learned: do not mix the weekly format with the monthly or daily format.
  • Joda time format specification: xxxx.ww
  • Week numbering is not aligned to the calendar year, so dates like 201701.52 are possible. Joda may not be able to parse YYYYMM.ww.
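The week-numbering pitfall is easy to reproduce. The sketch below uses Python's ISO-8601 calendar as a stand-in for Joda's week-year fields (an assumption: Joda's default chronology follows ISO week rules); the function names are hypothetical.

```python
from datetime import date


def weekly_suffix(day):
    """Week-year plus week number, the analogue of Joda's xxxx.ww."""
    iso_year, iso_week, _ = day.isocalendar()
    return f"{iso_year:04d}.{iso_week:02d}"


def mixed_suffix(day):
    """Calendar year and month combined with the ISO week number (the risky mix)."""
    _, iso_week, _ = day.isocalendar()
    return f"{day:%Y%m}.{iso_week:02d}"


new_years_day = date(2017, 1, 1)        # a Sunday, still in ISO week 52 of 2016
consistent = weekly_suffix(new_years_day)
confusing = mixed_suffix(new_years_day)
```

Here `consistent` is "2016.52" while `confusing` is "201701.52": the mixed format pairs January 2017 with a week that belongs to week-year 2016, which is the kind of suffix that index cleanup tooling can fail to parse or order correctly.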

Monitoring: General Metric Dashboard

Monitoring: Ingestion latencies went up unexpectedly
• No obvious cause from Elasticsearch metrics.
• Checked GC time, indexing time, etc., but inter-node latency was 30 seconds.

Monitoring: Node-to-Node Latency
Network hardware introduced some latency
• Elasticsearch won’t clearly tell you what’s wrong
• Solution: Patch Elasticsearch to tell you which hosts or shards are slow
• Solution: Avoid actions that need to wait for all nodes in a cluster.

Additional Lessons

Imperfect Circuit Breaker: Client Nodes Down on Heavy Aggregations
• Heavy aggregations kill client nodes
• GC pauses or, even worse, OOM
• Data nodes are safe; no circuit breaker is even triggered
• Many shards and large fan-out

Imperfect Circuit Breaker: Heavy Aggregations
"BY": ["HEXAGON_ID", "STATUS"],
"TIME": { "START": "2017-02-01", "END": "2016-02-02", "GRANULARITY": "1M" }

Imperfect Circuit Breaker: Trial and Error
• Aggregation bucket overhead?
  • org.elasticsearch.search.aggregations.Aggregator
• In-flight request circuit breaker?
  • TransportChannel
• Merge partial results from all shards: what if they are huge?
• How to make the response size smaller?

Imperfect Circuit Breaker: Solutions (small fan-outs with custom routing)
• Skewed data routing is risky
• More pressure on data nodes, more likely to trigger circuit breaker

Imperfect Circuit Breaker: Solutions (split the query into smaller time ranges)
• Hard to set up the optimal boundary
• Additional latency
"time": { "start": "2016-09-01", "end": "2016-10-01", "granularity": "1h" }
is split into:
"time": { "start": "2016-09-01", "end": "2016-09-10", "granularity": "1h" }
"time": { "start": "2016-09-11", "end": "2016-09-20", "granularity": "1h" }
"time": { "start": "2016-09-21", "end": "2016-09-30", "granularity": "1h" }
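Splitting one long query into several shorter ones can be sketched as below. This is an illustration, not the talk's implementation: the function names and the `max_days` parameter are hypothetical, and the sketch treats `end` as exclusive and produces contiguous windows, whereas the slide's sub-ranges are written with inclusive-looking end dates. The caller would still have to run the sub-queries and merge their results, which is where the additional latency comes from.

```python
from datetime import date, timedelta


def split_time_range(start, end, max_days):
    """Split [start, end) into contiguous windows of at most max_days days."""
    if max_days < 1:
        raise ValueError("max_days must be at least 1")
    windows = []
    cursor = start
    while cursor < end:
        window_end = min(cursor + timedelta(days=max_days), end)
        windows.append((cursor, window_end))
        cursor = window_end
    return windows


def split_query(query, max_days):
    """Return copies of the query, one per time window, leaving other keys untouched."""
    time_spec = query["time"]
    start = date.fromisoformat(time_spec["start"])
    end = date.fromisoformat(time_spec["end"])
    return [
        {**query, "time": {**time_spec, "start": s.isoformat(), "end": e.isoformat()}}
        for s, e in split_time_range(start, end, max_days)
    ]


month_query = {
    "by": ["hexagon_id", "status"],
    "time": {"start": "2016-09-01", "end": "2016-10-01", "granularity": "1h"},
}
sub_queries = split_query(month_query, max_days=10)
```

Choosing `max_days` is the "optimal boundary" problem from the slide: too large and a single sub-query can still overwhelm a client node, too small and the number of round trips grows.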

Imperfect Circuit Breaker: Solutions (in-flight response limit)
• #11401
• Record allocated memory for the single query in ResponseHandler
• Cannot track the entire in-flight response memory size, but intermittent malicious queries can be prevented

Looking Forward to Elasticsearch 5 and Beyond
• #22274 Allow an index to be partitioned with custom routing
  • Really useful technique for scaling large indexes. Enables scaling very large indexes while avoiding hotspot problems. Workarounds are available.
• Aggregation query profiler
  • Very helpful for debugging query performance.
• Aggregation partitioning
  • Partially executing aggregations allows incrementally executing bigger queries.

More Questions? Visit us at the AMA
