Building Real World High Performance Application using Elastic Stack

Gaurav Bahrani, CTO of MeTripping, describes the company's journey of using the Elastic Stack as a primary data store.

Aravind Putrevu

February 03, 2018

Transcript

  1. Introduction
    • Gaurav Bahrani, CTO, MeTripping
      ◦ Building an intelligent search engine for travel
      ◦ Expertise in building large scale distributed systems
        ▪ SQL, NoSQL, Big Data
        ▪ Database engines
        ▪ Fault-tolerant systems
      ◦ Ex-VPE Cloud Lending Solutions (Fin-tech startup), Ex-Yahoo, Ex-MS, Ex-HP
  2. Agenda
    1. MeTripping - Introduction
    2. MeTripping - Challenges
    3. New architecture - Elasticsearch way
    4. Learnings
    5. Best practices
  3. MeTripping - Challenges
    • Tons of dynamic data
      ◦ 50MB of dynamic data per rank list page
    • Response time
      ◦ Static data + dynamic data + ML scoring < 30-45 secs (current performance)
    • Static data problem
      ◦ Multiple data sources and formats
  4. MeTripping - Static data problem
    • Sources, formats, APIs
    [Diagram labels: Postgres / Mongo / Couchbase / ES; src-1, src-2, src-3, src-4; UI]
  5. New Architecture
    • Merge data using data pipelines
    • Host data in Elasticsearch
    • Elasticsearch usage
      ◦ NoSQL DB
      ◦ Geo queries
      ◦ Indexing
      ◦ Scoring
      ◦ Auto-complete, search suggestions
    [Diagram labels: src-1, src-2, src-3, src-4; data pipeline; ES; UI]
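The "merge data using data pipelines, then host it in Elasticsearch" step above could be sketched roughly as follows. This is an illustration only, not MeTripping's actual pipeline: the record shapes, the shared `id` key, the `hotels` index name, and the helpers `merge_sources` and `to_bulk_body` are all hypothetical. The sketch merges per-source records into one document per id and renders them in the newline-delimited JSON format that the `_bulk` endpoint accepts.

```python
import json
from collections import OrderedDict


def merge_sources(*sources):
    """Merge records from several sources into one document per id.

    Assumption for this sketch: every record carries a shared "id" key.
    Later sources overwrite earlier ones when field names collide.
    """
    merged = OrderedDict()
    for records in sources:
        for record in records:
            doc = merged.setdefault(record["id"], {})
            doc.update(record)
    return list(merged.values())


def to_bulk_body(docs, index, doc_type="doc"):
    """Render documents as an NDJSON body for the _bulk endpoint.

    The "_type" entry matches the 5.x series used in the deck; newer
    Elasticsearch versions drop mapping types, so it would be omitted there.
    """
    lines = []
    for doc in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": str(doc["id"])}}
        lines.append(json.dumps(action, sort_keys=True))
        lines.append(json.dumps(doc, sort_keys=True))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


# Hypothetical records from two of the sources (src-1, src-2).
src_1 = [{"id": 1, "name": "Hotel A"}, {"id": 2, "name": "Hotel B"}]
src_2 = [{"id": 1, "location": {"lat": 12.97, "lon": 77.59}}]

docs = merge_sources(src_1, src_2)
bulk_body = to_bulk_body(docs, index="hotels")
print(bulk_body)
```

The resulting string could be sent with `Content-Type: application/x-ndjson` to `/_bulk`; a real pipeline would also need format normalization and conflict rules, which this sketch leaves out.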
  6. Elasticsearch Setup
    • 2 node cluster
      ◦ t2.large (2 vCPUs, 8GB RAM, SSD)
    • Standard ES docker (elasticsearch:5.6.7)
    • Code: elasticsearch-dsl python package
    • Development Tool: Sense Chrome plugin (need to move to Kibana)
    • Monitoring
      ◦ Prometheus exporter for ES (justwatch/elasticsearch_exporter:1.0.2)
    • Indexes
      ◦ Locations: 100K docs (100+ fields)
      ◦ Hotels: 2M docs (50+ fields)
      ◦ Routes: 10M docs (25+ fields)
  7. Elasticsearch Learnings
    • Excellent NoSQL DB
    • Better suited for query performance
      ◦ Tens of inserts / second vs. thousands of queries / second
    • Ease of indexing
      ◦ No need to spend much effort on query optimization
    • Custom scoring is extremely powerful (using the Painless scripting language)
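As a rough illustration of the custom scoring point above, the request body below wraps a text match in a `function_score` query whose score comes from a Painless script. The field names (`name`, `rating`), the weighting formula, and the helper `scored_query` are assumptions made for this sketch; the deck does not show MeTripping's scoring logic.

```python
import json


def scored_query(text, rating_weight=1.0):
    """Build a function_score search body scored by a Painless script.

    Hypothetical fields: "name" (analyzed text) and "rating" (numeric).
    The script key is "source" as of Elasticsearch 5.6; earlier 5.x
    releases used "inline" instead.
    """
    return {
        "query": {
            "function_score": {
                "query": {"match": {"name": text}},
                "script_score": {
                    "script": {
                        "lang": "painless",
                        # Text relevance plus a weighted numeric field.
                        "source": "_score + params.weight * doc['rating'].value",
                        "params": {"weight": rating_weight},
                    }
                },
                # Use the script result as the final score.
                "boost_mode": "replace",
            }
        }
    }


body = scored_query("beach resort", rating_weight=2.0)
print(json.dumps(body, indent=2))
```

Passing the weight through `params` rather than formatting it into the script text lets Elasticsearch reuse the compiled script across requests.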
  8. Elasticsearch Best Practices
    • Avoid use of ‘type’ field
    • Disable dynamic schema discovery in production
    • Index only required fields (‘index’ default: true)
      ◦ Significantly improves insert performance
    • Enable ‘doc_values’ only where needed (default: true)
    • Understand ‘text’ vs. ‘keyword’ data type differences
      ◦ Use ‘text’ data type only where fuzzy match is needed
    • Use manageably sized shards
    • Use replicas (cluster) for redundancy, scalability, and performance
    • Use aliases for easy index switchover (eases index refreshes / upgrades)
    • System planning
      ◦ CPU: For typical use-cases (index and search) ES is extremely efficient, so CPU needs are low
      ◦ Memory: For best performance, the complete index should fit in system memory
      ◦ Hard disk: Use SSD. Plan spare capacity for index upgrades.
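Several of the practices above map onto index settings and the alias API. The sketch below shows one way they might look as request bodies, written in the 5.x style the deck uses (a single mapping type). The index, alias, and field names, the shard and replica counts, and the helper functions are all hypothetical.

```python
def hotels_index_body(shards=1, replicas=1, doc_type="doc"):
    """Create-index body illustrating the mapping-related practices."""
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,  # replicas for redundancy
        },
        "mappings": {
            # A single mapping type per index (5.x style).
            doc_type: {
                # Reject documents with unmapped fields instead of
                # letting the schema grow dynamically.
                "dynamic": "strict",
                "properties": {
                    "name": {"type": "text"},      # analyzed, for fuzzy/full-text match
                    "city": {"type": "keyword"},   # exact match, sorting, aggregations
                    "location": {"type": "geo_point"},
                    "rating": {"type": "float"},
                    # Kept in _source only: not searchable, no doc values.
                    "raw_payload": {
                        "type": "keyword",
                        "index": False,
                        "doc_values": False,
                    },
                },
            }
        },
    }


def alias_switch_actions(alias, old_index, new_index):
    """Body for POST /_aliases that repoints an alias in one request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


index_body = hotels_index_body(shards=2, replicas=1)
switch = alias_switch_actions("hotels", "hotels_v1", "hotels_v2")
```

Clients that always query the `hotels` alias never see the switchover: a rebuilt index such as `hotels_v2` is filled in the background, then the alias is moved in a single `_aliases` call.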