Powering Uber Marketplace’s Real-Time Data Needs with Elasticsearch

Elasticsearch plays a key role in Uber’s Marketplace Dynamics core data system, aggregating business metrics to control critical marketplace behaviors such as dynamic (surge) pricing and supply positioning, and to assess overall marketplace diagnostics, all in real time.

In this talk, Jae and Isaac will share how Uber uses Elasticsearch to support multiple use cases at the company, handling more than 1,000 QPS at peak. They will not only address why they ultimately chose Elasticsearch, but will also delve into key technical challenges they’re solving, such as how to model Uber’s marketplace data to express aggregated metrics efficiently, and how to run multiple layers of Elasticsearch clusters depending on criticality, among others.

Jae Hyeon Bae | Technical Lead | Uber
Isaac Brodsky | Software Engineer | Uber

Elastic Co

March 09, 2017

Transcript

  1. 2

  2. HOW MANY UBERXS WERE OPEN IN SF IN THE PAST 10 MINUTES IN FINANCIAL DISTRICT?
  3. Recommendations to route partners

    What are the conditions in different parts of the marketplace? Should we recommend some actions?
  4. Marketplace Health and Data Science

    • How can we measure and visualize efficiency of marketplace?
    • Dynamic pricing mean factors
    • Completed/Requested
    • Estimated Time to Arrive/Actual Time to Arrive
    • Time: right now and past several months
    • Location: city scope to hexagon
  5. Complex Data Models

    [Figure: state diagrams. State labels: Looking, Finding nearby cars, Waiting for pickup, On trip, Completed; Open, Requested, En route, On trip, Completed. Transition labels: Request, Accept, Start, Complete, Cancel.]
  6. Queries: Flexible Boolean Filters

    "FILTER": {
      "TYPE": "AND",
      "FIELDS": [
        { "TYPE": "IN", "DIMENSION": "CITY", "VALUES": ["17"] },
        { "TYPE": "EQ", "DIMENSION": "STATUS", "VALUE": "OPEN" },
        …
  7. Queries: Various Aggregation Features

    "BY": ["HEXAGON_ID", "STATUS"],
    "AGGREGATIONS": [
      { "FUNCTION": "SUM", "METRIC": "APP_OPENED" },
      { "FUNCTION": "AVG", "METRIC": "APP_CLOSED" }
    ]
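As a rough illustration of how the filter and aggregation specs on the two Queries slides might map onto Elasticsearch's Query DSL, here is a minimal sketch. The translation is an assumption, not something shown in the talk: the function names `to_query_dsl` and `to_metric_aggs`, the lowercase keys, the `or` branch, and the generated aggregation names are all hypothetical. Only the target constructs (`bool`, `terms`, `term`, `sum`, `avg`) are standard Query DSL.

```python
def to_query_dsl(node):
    """Translate one filter node (hypothetical spec format) into a Query DSL dict."""
    kind = node["type"].lower()
    if kind == "and":
        # All sub-filters must match; filter context skips scoring.
        return {"bool": {"filter": [to_query_dsl(f) for f in node["fields"]]}}
    if kind == "or":
        # Assumed extension: at least one sub-filter must match.
        return {
            "bool": {
                "should": [to_query_dsl(f) for f in node["fields"]],
                "minimum_should_match": 1,
            }
        }
    if kind == "in":
        return {"terms": {node["dimension"]: node["values"]}}
    if kind == "eq":
        return {"term": {node["dimension"]: node["value"]}}
    raise ValueError(f"unsupported filter type: {node['type']}")


def to_metric_aggs(aggregations):
    """Map {"function", "metric"} pairs onto single-value metric aggregations."""
    aggs = {}
    for spec in aggregations:
        function = spec["function"].lower()
        if function not in {"sum", "avg"}:
            raise ValueError(f"unsupported function: {spec['function']}")
        # Aggregation name such as "sum_app_opened" is an illustrative choice.
        aggs[f"{function}_{spec['metric']}"] = {function: {"field": spec["metric"]}}
    return aggs


example_filter = {
    "type": "and",
    "fields": [
        {"type": "in", "dimension": "city", "values": ["17"]},
        {"type": "eq", "dimension": "status", "value": "open"},
    ],
}
example_query = to_query_dsl(example_filter)
example_aggs = to_metric_aggs(
    [
        {"function": "sum", "metric": "app_opened"},
        {"function": "avg", "metric": "app_closed"},
    ]
)
```

Keeping the public spec this small means the query layer, not the caller, decides how a filter is expressed in Elasticsearch.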
  8. Key-Value Stores: Cassandra (scalability and high availability without compromising performance)

    • CompactionStrategy vs Index Deletion
    • Light-Weight-Transaction vs Upsert with scripting
    • Allow Filtering, restricted IN vs Query DSL
    • Logged batch mutation vs Bulk API
  9. Druid (high-performance, column-oriented, distributed data store)

    • Pros:
      • Native time-series support with data roll-up
      • Very efficient ingestion/query
      • Batch indexing through Hadoop, index hand-over
    • Cons:
      • Eight moving pieces: Historical, Broker, Coordinator, Indexing Service, Realtime, HDFS, Metadata Storage, ZooKeeper
      • Strict immutability will require complex stream processing
  10. Document Storage: Double-edged sword (Mutable Data Model, Killer Feature)

    • Idempotent: Uniqueness/Exactly Once
    • Join
    • Performance and scalability issues on large-scale aggregation
  11. Data Model: With ETL/Query Join

    [Figure labels: Requested, Completed, ETL/Query]

    • Stream processing should merge them: what if the trip takes an hour?
    • Query should join them: what if we want to aggregate by T1 and L1?
  12. Data Model: Entity

    With join on the same trip ID: trip gets ‘completed:true’ flag on Trip Completed event using Upsert

    [Figure labels: Trip Document, Requested, Completed, …]
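The entity model above can be sketched as an upsert keyed by trip ID. This is a minimal illustration under stated assumptions, not the pipeline from the talk: the index name, field names, and both helper functions are hypothetical. The `update` action with `doc_as_upsert` is part of the Elasticsearch Bulk API; depending on the Elasticsearch version, the action line may also need a `_type`.

```python
import json


def bulk_upsert_lines(index, trip_id, fields):
    """Build the two NDJSON lines of one Bulk API 'update' action.

    With doc_as_upsert, the partial doc is merged into an existing document,
    or indexed as a new document if the trip has not been seen yet.
    """
    action = {"update": {"_index": index, "_id": trip_id}}
    payload = {"doc": fields, "doc_as_upsert": True}
    return json.dumps(action) + "\n" + json.dumps(payload) + "\n"


def apply_upsert(store, trip_id, fields):
    """Local stand-in showing the effect of the upsert on the stored document."""
    store.setdefault(trip_id, {}).update(fields)
    return store[trip_id]


# Hypothetical events for one trip, arriving as two separate messages.
store = {}
apply_upsert(store, "trip-1", {"requested": True, "product_id": "uberx"})
apply_upsert(store, "trip-1", {"completed": True})

request_body = bulk_upsert_lines("marketplace_demand-example", "trip-1", {"completed": True})
```

Because both events land in the same document, a later aggregation can group by fields from the request and the completion without a query-time join.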
  13. Data Pipeline

    • Samza for Kafka consumer
    • Bulk indexing client with RestClient from Elasticsearch-Hadoop connector
    • Spark-Elasticsearch
    • Configurable to ignore a few exceptions
      • document already exists, version conflict, document missing
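One way to make a bulk indexing client "configurable to ignore a few exceptions" is to filter the per-item errors in the bulk response. The sketch below is an assumption about how that could look, not the client described in the talk. The function name and sample response are hypothetical, and the error type strings are the names I would expect Elasticsearch to report for the three cases on the slide; they should be checked against the version in use.

```python
# Assumed error type names for: document already exists, version conflict,
# document missing. Verify against your Elasticsearch version.
IGNORABLE_ERROR_TYPES = frozenset(
    {
        "document_already_exists_exception",
        "version_conflict_engine_exception",
        "document_missing_exception",
    }
)


def fatal_errors(bulk_response, ignorable=IGNORABLE_ERROR_TYPES):
    """Return (action, doc_id, error_type) for every non-ignorable item error."""
    failures = []
    for item in bulk_response.get("items", []):
        # Each item is a single-key dict such as {"index": {...}}.
        for action, result in item.items():
            error = result.get("error")
            if error is None:
                continue
            # Older versions report the error as a plain string.
            error_type = error.get("type") if isinstance(error, dict) else str(error)
            if error_type not in ignorable:
                failures.append((action, result.get("_id"), error_type))
    return failures


# Hypothetical response: one success, one ignorable conflict, one real failure.
sample_response = {
    "errors": True,
    "items": [
        {"update": {"_id": "trip-1", "status": 200}},
        {"update": {"_id": "trip-2", "status": 409,
                    "error": {"type": "version_conflict_engine_exception", "reason": "..."}}},
        {"index": {"_id": "trip-3", "status": 400,
                   "error": {"type": "mapper_parsing_exception", "reason": "..."}}},
    ],
}
remaining = fatal_errors(sample_response)
```

Only the items left in `remaining` would need a retry or an alert; the ignorable ones are expected side effects of replayed or out-of-order events.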
  14. Cluster Deployment

    [Architecture diagram. Labels: Kafka; Samza, Samza, Spark; Elasticsearch Tier 0, Elasticsearch Tier 1, Elasticsearch Tier 2; Query Layer (three instances).]
  15. Query Planning: Routing

    Route queries to different Elasticsearch clusters. Clusters for different data sources, SLAs, time ranges...

    [Diagram labels: Samza, Samza; Elasticsearch Tier 0, Elasticsearch Tier 1; Query Layer; Tier 1 Application.]
  16. Query Planning: Routing

    "DEMAND": {
      "CLUSTERS": {
        "TIER0": { "CLUSTERS": ["ES_CLUSTER_TIER0"] },
        "TIER2": { "CLUSTERS": ["ES_CLUSTER_TIER2"] }
      },
      "INDEX": "MARKETPLACE_DEMAND-",
      "SUFFIXFORMAT": "YYYYMM.WW",
      "ROUTING": "PRODUCT_ID"
    }
  17. Query Planning: Reducing Complexity of Queries

    • Group by A, then B:
      • "aggs": { "a": { "terms": { "field": "a" }, "aggs": { "b": { "terms": { "field": "b" } } } } }
    • Or:
      • "by": ["a", "b"]
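The expansion from the short `"by"` form to nested `terms` aggregations can be sketched as follows. The function name `nested_terms` is hypothetical and this is only one plausible way a query layer might do it; the output shape matches the `aggs` example on the slide.

```python
def nested_terms(by):
    """Expand ["a", "b", ...] into nested terms aggregations, outermost first."""
    aggs = {}
    # Build from the innermost field outwards so each level wraps the previous one.
    for field in reversed(by):
        level = {field: {"terms": {"field": field}}}
        if aggs:
            level[field]["aggs"] = aggs
        aggs = level
    return {"aggs": aggs} if aggs else {}


expanded = nested_terms(["a", "b"])
```

The caller writes one flat list, and the nesting order (and therefore bucket order) is decided in a single place.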
  18. Query Planning: Prohibiting Dangerous Queries

    • Terms * Terms * Terms... → OutOfMemoryError
    • Estimate the cardinality of queries before executing, enforce sane limits
    • "cityId": { "cardinality": 500 }, "status": { "cardinality": 10 }
  19. Query Planning: Estimating Query Cardinality

    "DEMAND": {
      "FIELDS": {
        "DIMENSIONS": {
          "CITY": { "CARDINALITY": 10 },
          "HEXAGON_ID": { "CARDINALITY": 10000, "REQUIREDFILTERS": ["PRODUCT_ID", "CITY"] },
          "PRODUCT_ID": { "CARDINALITY": 100 }
        },
        "METRICS": { "COUNT": NULL, "ETA": NULL }
      }
    }
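A minimal sketch of the pre-execution check, assuming the estimate is simply the product of the declared cardinalities of the grouped dimensions. That is an upper bound and an assumption on my part; the talk does not say how the estimate is computed. The function name, the lowercase keys, and the `max_buckets` limit are hypothetical; the cardinalities and required filters mirror the config on the slide.

```python
from math import prod

# Mirrors the slide's config, with lowercase keys as an illustrative choice.
DIMENSIONS = {
    "city": {"cardinality": 10},
    "hexagon_id": {"cardinality": 10000, "required_filters": ["product_id", "city"]},
    "product_id": {"cardinality": 100},
}


def check_query(by, filtered, dimensions=DIMENSIONS, max_buckets=100_000):
    """Reject a query before execution if it looks too expensive.

    by:        dimensions the query groups by
    filtered:  dimensions the query filters on
    Returns the estimated bucket count if the query is allowed.
    """
    for dim in by:
        required = dimensions[dim].get("required_filters", [])
        missing = [name for name in required if name not in filtered]
        if missing:
            raise ValueError(f"grouping by {dim} requires filters on {missing}")
    # Worst case: every combination of values produces a bucket.
    estimate = prod(dimensions[dim]["cardinality"] for dim in by)
    if estimate > max_buckets:
        raise ValueError(f"estimated {estimate} buckets exceeds limit {max_buckets}")
    return estimate


small = check_query(["city", "product_id"], filtered=set())
scoped = check_query(["hexagon_id"], filtered={"product_id", "city"})
```

Grouping by `hexagon_id` and `product_id` together would be rejected under this limit, because the worst-case estimate is 10000 * 100 buckets.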
  20. Query Planning: Prohibiting Dangerous Queries

    • Arbitrary scripts → halting problem
    • Only permit trusted scripts
    • while(true) { print ':(' }
  21. Index Management: Delete Outdated Indices

    • Elasticsearch Curator, a few difficulties to get along with
    • Lesson learned: Do not mix weekly format with monthly or daily format.
    • Joda time format specification: xxxx.ww
    • Week numbering is not aligned to calendar year, so dates like 201701.52 are possible. Joda may not be able to parse YYYYMM.ww.
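The week-numbering pitfall is easy to reproduce. The sketch below uses Python's ISO 8601 week numbering as a stand-in for Joda's week fields (an assumption that the two agree, which holds for ISO week rules); the function names are hypothetical.

```python
from datetime import date


def mixed_suffix(day):
    """Calendar year and month combined with the ISO week number (the risky mix)."""
    _, iso_week, _ = day.isocalendar()
    return f"{day:%Y%m}.{iso_week:02d}"


def week_year_suffix(day):
    """Week-based year combined with the ISO week number (a consistent pairing)."""
    iso_year, iso_week, _ = day.isocalendar()
    return f"{iso_year}.{iso_week:02d}"


new_years_day = date(2017, 1, 1)  # a Sunday, still inside ISO week 52 of 2016
risky = mixed_suffix(new_years_day)
consistent = week_year_suffix(new_years_day)
```

Here `risky` is `201701.52` and `consistent` is `2016.52`: the first pairs a January date with week 52, which is exactly the kind of suffix that is hard to parse back into a date.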
  22. Monitoring: Ingestion latencies went up unexpectedly

    No obvious cause from Elasticsearch metrics. Checked GC time, indexing time, etc… but inter-node latency is 30 secs
  23. Monitoring: Node To Node Latency (Network Hardware introduced some latency)

    • Elasticsearch won’t clearly tell you what’s wrong
    • Solution: Patch Elasticsearch to tell you which hosts or shards are slow
    • Solution: Avoid actions that need to wait for all nodes in a cluster.
  24. Imperfect Circuit Breaker: Client Nodes Down on Heavy Aggregations

    • Heavy aggregations kill client nodes
    • GC pause, or even worse, OOM
    • Data nodes are safe; no circuit breaker is even triggered
    • Many shards and large fan-out
  25. Imperfect Circuit Breaker: Heavy Aggregations

    "BY": ["HEXAGON_ID", "STATUS"],
    "TIME": {
      "START": "2017-02-01",
      "END": "2016-02-02",
      "GRANULARITY": "1M"
    }
  26. Imperfect Circuit Breaker: Trial and Error

    • Aggregation bucket overhead?
      • org.elasticsearch.search.aggregations.Aggregator
    • In-flight Request circuit breaker?
      • TransportChannel
    • Merge partial results from all shards, what if they are huge?
    • How to make response size smaller?
  27. Imperfect Circuit Breaker, Solutions: Small fan-outs with custom routing

    • Skewed data routing is risky
    • More pressure on data nodes, more likely to trigger circuit breaker
  28. Imperfect Circuit Breaker, Solutions: Split the query with small time range

    "time": { "start": "2016-09-01", "end": "2016-10-01", "granularity": "1h" }

    "time": { "start": "2016-09-01", "end": "2016-09-10", "granularity": "1h" }
    "time": { "start": "2016-09-11", "end": "2016-09-20", "granularity": "1h" }
    "time": { "start": "2016-09-21", "end": "2016-09-30", "granularity": "1h" }

    • Hard to set up the optimal boundary
    • Additional latency
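Splitting one long time range into several smaller queries can be sketched like this. It is an illustration under assumptions, not the query layer from the talk: the function names are hypothetical, the window size is a free parameter, and the windows are half-open (each window's end equals the next window's start), whereas the slide shows closed-looking date ranges.

```python
from datetime import date, timedelta


def split_range(start, end, days):
    """Split [start, end) into consecutive half-open windows of at most `days` days."""
    if days <= 0:
        raise ValueError("days must be positive")
    step = timedelta(days=days)
    windows = []
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)
        windows.append((cursor, upper))
        cursor = upper
    return windows


def to_time_clauses(windows, granularity):
    """Render each window as a "time" clause in the style of the slide."""
    return [
        {"start": lo.isoformat(), "end": hi.isoformat(), "granularity": granularity}
        for lo, hi in windows
    ]


windows = split_range(date(2016, 9, 1), date(2016, 10, 1), days=10)
clauses = to_time_clauses(windows, "1h")
```

The sub-queries can then run one after another and their results be concatenated, which trades extra latency for a smaller response per request. Choosing `days` is the hard part the slide refers to.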
  29. Imperfect Circuit Breaker, Solutions: In-Flight Response Limit

    • #11401
    • Record allocated memory for the single query in ResponseHandler
    • Cannot track entire in-flight response memory size but intermittent malicious queries can be prevented
  30. Looking Forward To Elasticsearch 5 And Beyond

    • #22274 Allow an index to be partitioned with custom routing
      • Really useful technique for scaling large indexes. Enables scaling very large indexes while avoiding hotspot problems. Workarounds are available.
    • Aggregation query profiler
      • Very helpful for debugging query performance.
    • Aggregation partitioning
      • Partially executing aggregations allows incrementally executing bigger queries.