Building Real World High Performance Application using Elastic Stack

Gaurav Bahrani, CTO of MeTripping, explains their journey of using the Elastic Stack as a primary data store.

Aravind Putrevu

February 03, 2018

Transcript

  1. Introduction
     • Gaurav Bahrani, CTO, MeTripping
       ◦ Building intelligent search engine for travel
       ◦ Expertise in building large scale distributed systems
         ▪ SQL, NoSQL, Big Data
         ▪ Database engines
         ▪ Fault-tolerant systems
       ◦ Ex-VPE Cloud Lending Solutions (Fin-tech startup), Ex-Yahoo, Ex-MS, Ex-HP
  2. Agenda
     1. MeTripping - Introduction
     2. MeTripping - Challenges
     3. New architecture - Elasticsearch way
     4. Learnings
     5. Best practices
  3. MeTripping - Challenges
     • Tons of dynamic data
       ◦ 50MB of dynamic data per rank list page
     • Response time
       ◦ Static data + dynamic data + ML scoring < 30 - 45 secs (current performance)
     • Static data problem
       ◦ Multiple data sources and formats
  4. MeTripping - Static data problem
     • Sources, formats, APIs
     [Diagram labels: Postgres / Mongo / Couchbase / ES; src-1, src-2, src-3, src-4; UI]
  5. New Architecture
     • Merge data using data pipelines
     • Host data in Elasticsearch
     • Elasticsearch usage
       ◦ NoSQL DB
       ◦ Geo queries
       ◦ Indexing
       ◦ Scoring
       ◦ Auto-complete, Search suggestions
     [Diagram labels: src-1, src-2, src-3, src-4; data pipeline; ES; UI]
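The slide lists geo queries among the ways Elasticsearch is used. A minimal sketch of what such a request body might look like, built as a plain Python dict so it runs without a cluster. The field names `name` and `location` (assumed to be mapped as `geo_point`), the coordinates, and the helper name are hypothetical and do not come from the deck.

```python
import json


def nearby_hotels_query(lat, lon, distance_km, text=None):
    """Build a search body: optional full-text match, filtered by distance.

    `name` and `location` are hypothetical field names; `location` is
    assumed to be mapped as a geo_point.
    """
    # Fall back to match_all when no search text is given.
    must = [{"match": {"name": text}}] if text else [{"match_all": {}}]
    return {
        "query": {
            "bool": {
                "must": must,
                # Filters do not affect the relevance score.
                "filter": {
                    "geo_distance": {
                        "distance": f"{distance_km}km",
                        "location": {"lat": lat, "lon": lon},
                    }
                },
            }
        }
    }


body = nearby_hotels_query(12.97, 77.59, 5, text="lake view")
print(json.dumps(body, indent=2))
```

The resulting dict could be sent as the body of a `_search` request by whichever client is in use; putting the distance restriction in `filter` keeps it out of scoring.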
  6. Elasticsearch Setup
     • 2 node cluster
       ◦ t2.large (2 vCPUs, 8GB RAM, SSD)
     • Standard ES docker (elasticsearch:5.6.7)
     • Code: elasticsearch-dsl python package
     • Development Tool: Sense Chrome plugin (need to move to Kibana)
     • Monitoring
       ◦ Prometheus exporter for ES (justwatch/elasticsearch_exporter:1.0.2)
     • Indexes
       ◦ Locations: 100K docs (100+ fields)
       ◦ Hotels: 2M docs (50+ fields)
       ◦ Routes: 10M docs (25+ fields)
  7. Elasticsearch Learnings
     • Excellent NoSQL DB
     • Better suited for query performance
       ◦ 10s of inserts / second vs. 1000s of queries / second
     • Ease of indexing
       ◦ No need to spend tons of effort on query optimization
     • Custom scoring is extremely powerful (using Painless scripting language)
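Custom scoring with Painless is typically expressed as a `script_score` inside a `function_score` query. The sketch below only builds the request body as a Python dict; the field names `name` and `rating`, the parameter `w`, and the formula are hypothetical illustrations, not MeTripping's actual scoring. Depending on the Elasticsearch version, the key holding the script text is `source` or `inline`.

```python
import json


def custom_score_query(text, rating_weight):
    """Build a function_score body that rescales relevance by a doc field.

    `name` and `rating` are hypothetical fields; the formula is only an
    illustration of combining _score with a numeric doc value.
    """
    return {
        "query": {
            "function_score": {
                "query": {"match": {"name": text}},
                "script_score": {
                    "script": {
                        "lang": "painless",
                        # Older 5.x releases name this key "inline".
                        "source": "_score * (1 + params.w * doc['rating'].value)",
                        # Passing values via params lets the script be cached.
                        "params": {"w": rating_weight},
                    }
                },
                # Use the script result as the final score.
                "boost_mode": "replace",
            }
        }
    }


score_body = custom_score_query("beach resort", 0.2)
print(json.dumps(score_body, indent=2))
```

Keeping tunable numbers in `params` rather than in the script text means the same compiled script can be reused across requests with different weights.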
  8. Elasticsearch Best Practices
     • Avoid use of ‘type’ field
     • Disable dynamic schema discovery in production
     • Index only required columns (default: true)
       ◦ Significantly improves insert performance
     • Include ‘doc_values’ where needed (default: true)
     • Understand ‘text’ vs. ‘keyword’ data type differences
       ◦ Use ‘text’ data type only where fuzzy match needed
     • Use manageable size shards
     • Use replicas (cluster) for redundancy, scalability, and performance
     • Use aliases for easy index switchover (eases index refreshes / upgrades)
     • System planning
       ◦ CPU: For typical use-cases (index and search) ES is extremely efficient, so low CPU needs
       ◦ Memory: For best performance, complete index should fit in system memory
       ◦ Hard disk: Use SSD. Plan spare capacity for index upgrades.
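Several of the practices above (no dynamic mapping, selective `index` and `doc_values`, `text` vs. `keyword`, alias switchover) can be sketched as request bodies. This is an illustration under assumptions: all index, alias, and field names are hypothetical, and `doc` is an arbitrary placeholder for the mapping type name that Elasticsearch 5.x still requires.

```python
import json


def hotels_index_body(doc_type="doc"):
    """Build an index-creation body with an explicit, closed mapping."""
    properties = {
        # Analyzed full-text field, for fuzzy / partial matching.
        "name": {"type": "text"},
        # Exact-value field, for filters, sorting and aggregations.
        "city_code": {"type": "keyword"},
        # Stored in _source only: not searchable, no doc_values on disk.
        "raw_payload": {"type": "keyword", "index": False, "doc_values": False},
        "price": {"type": "float"},
    }
    # "strict" rejects documents containing unmapped fields.
    mapping = {"dynamic": "strict", "properties": properties}
    return {"mappings": {doc_type: mapping}}


def alias_switch_actions(alias, old_index, new_index):
    """Build an _aliases body that moves an alias in one atomic request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


index_body = hotels_index_body()
switch_body = alias_switch_actions("hotels", "hotels_v1", "hotels_v2")
print(json.dumps(index_body, indent=2))
print(json.dumps(switch_body, indent=2))
```

With this pattern the application always queries the alias (`hotels` here), so a rebuilt index can be swapped in without changing client code.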