Slide 1

Slide 1 text

Building Real-World, High-Performance Applications Using Elasticsearch
Gaurav Bahrani, CTO, MeTripping
3rd Feb 2018

Slide 2

Slide 2 text

Introduction
● Gaurav Bahrani, CTO, MeTripping
  ○ Building an intelligent search engine for travel
  ○ Expertise in building large-scale distributed systems
    ■ SQL, NoSQL, Big Data
    ■ Database engines
    ■ Fault-tolerant systems
  ○ Ex-VPE at Cloud Lending Solutions (fintech startup), Ex-Yahoo, Ex-MS, Ex-HP

Slide 3

Slide 3 text

Agenda
1. MeTripping - Introduction
2. MeTripping - Challenges
3. New architecture - Elasticsearch way
4. Learnings
5. Best practices

Slide 4

Slide 4 text

MeTripping - Introduction (1)

Slide 5

Slide 5 text

MeTripping - Introduction (2)
[Figure labels: static data, dynamic data]

Slide 6

Slide 6 text

MeTripping - Challenges
● Tons of dynamic data
  ○ 50MB of dynamic data per rank list page
● Response time
  ○ Static data + dynamic data + ML scoring < 30-45 secs (current performance)
● Static data problem
  ○ Multiple data sources and formats

Slide 7

Slide 7 text

MeTripping - Static data problem
● Sources, formats, APIs
[Diagram labels: src-1, src-2, src-3, src-4; Postgres / Mongo / Couchbase / ES; UI]

Slide 8

Slide 8 text

New Architecture
● Merge data using data pipelines
● Host data in Elasticsearch
● Elasticsearch usage
  ○ NoSQL DB
  ○ Geo queries
  ○ Indexing
  ○ Scoring
  ○ Auto-complete, search suggestions
[Diagram labels: src-1, src-2, src-3, src-4; data pipeline; ES; UI]
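The geo query and auto-complete usages listed above can be sketched as plain request bodies. This is a minimal illustration, not the MeTripping implementation: the field names `location` and `suggest`, the suggestion name `name_suggest`, and the helper functions are all hypothetical, and the bodies are built as Python dicts so they could be passed to any Elasticsearch client's search call.

```python
import json


def nearby_filter(lat, lon, distance="50km", field="location"):
    """Build a search body that keeps only documents within `distance` of a point.

    Assumption: `field` is mapped as `geo_point` (hypothetical field name).
    """
    return {
        "query": {
            "bool": {
                "filter": {
                    "geo_distance": {
                        "distance": distance,
                        field: {"lat": lat, "lon": lon},
                    }
                }
            }
        }
    }


def autocomplete(prefix, field="suggest", size=5):
    """Build a completion-suggester body for as-you-type suggestions.

    Assumption: `field` is mapped as `completion` (hypothetical field name).
    """
    return {
        "suggest": {
            "name_suggest": {
                "prefix": prefix,
                "completion": {"field": field, "size": size},
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON that would be sent as the search request body.
    print(json.dumps(nearby_filter(48.8566, 2.3522), indent=2))
    print(json.dumps(autocomplete("par"), indent=2))
```

Putting the geo condition in a `bool` filter keeps it out of relevance scoring, so it only narrows the candidate set.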

Slide 9

Slide 9 text

New Architecture - Improvements seen
● APIs reduced from 25+ to ~10
● Avg. response time < 200ms

Slide 10

Slide 10 text

Elasticsearch Setup
● 2-node cluster
  ○ t2.large (2 vCPUs, 8GB RAM, SSD)
● Standard ES Docker image (elasticsearch:5.6.7)
● Code: elasticsearch-dsl Python package
● Development tool: Sense Chrome plugin (need to move to Kibana)
● Monitoring
  ○ Prometheus exporter for ES (justwatch/elasticsearch_exporter:1.0.2)
● Indexes
  ○ Locations: 100K docs (100+ fields)
  ○ Hotels: 2M docs (50+ fields)
  ○ Routes: 10M docs (25+ fields)

Slide 11

Slide 11 text

Elasticsearch Learnings
● Excellent NoSQL DB
● Better suited for query performance
  ○ 10s of inserts / second vs. 1000s of queries / second
● Ease of indexing
  ○ No need to spend tons of effort on query optimization
● Custom scoring is extremely powerful (using the Painless scripting language)
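Custom scoring with Painless is typically expressed through a `function_score` query with a `script_score` function. The sketch below is only an illustration under stated assumptions: the fields `name` and `rating`, the `weight` parameter, and the scoring formula are hypothetical and are not taken from the talk. The script key is written as `source`; some older releases use `inline` instead.

```python
import json


def scored_search(text, boost_field="rating", weight=1.5):
    """Build a function_score body that rescales the text-match score.

    Assumptions (hypothetical): documents have a `name` text field and a
    numeric field named by `boost_field`.
    """
    return {
        "query": {
            "function_score": {
                # Base relevance comes from a plain full-text match.
                "query": {"match": {"name": text}},
                "script_score": {
                    "script": {
                        "lang": "painless",
                        # Scale the match score by a weighted numeric field.
                        "source": "_score * (1 + params.weight * doc[params.field].value)",
                        "params": {"weight": weight, "field": boost_field},
                    }
                },
                # Use the script result as the final score.
                "boost_mode": "replace",
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(scored_search("beach resort"), indent=2))
```

Passing the weight and field through `params` rather than formatting them into the script text lets Elasticsearch reuse the compiled script across requests.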

Slide 12

Slide 12 text

Elasticsearch Best Practices
● Avoid use of the ‘type’ field
● Disable dynamic schema discovery in production
● Index only required columns (‘index’ default: true)
  ○ Significantly improves insert performance
● Include ‘doc_values’ only where needed (default: true)
● Understand ‘text’ vs. ‘keyword’ data type differences
  ○ Use the ‘text’ data type only where fuzzy match is needed
● Use manageably sized shards
● Use replicas (cluster) for redundancy, scalability, and performance
● Use aliases for easy index switchover (eases index refreshes / upgrades)
● System planning
  ○ CPU: For typical use cases (index and search) ES is extremely efficient, so CPU needs are low
  ○ Memory: For best performance, the complete index should fit in system memory
  ○ Hard disk: Use SSD. Plan spare capacity for index upgrades.
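Several of the practices above map directly onto index settings and the aliases API. The following is a hedged sketch for an ES 5.x-style index with a single mapping type: the index names, alias, field names, shard counts, and the type name `doc` are all hypothetical. Later Elasticsearch versions drop the type level from the mapping body.

```python
import json


def hotels_index_body(doc_type="doc"):
    """Build a create-index body illustrating the mapping practices.

    All field names and shard/replica counts are illustrative assumptions.
    """
    return {
        "settings": {"number_of_shards": 2, "number_of_replicas": 1},
        "mappings": {
            doc_type: {
                # Reject documents with unmapped fields (no dynamic discovery).
                "dynamic": "strict",
                "properties": {
                    # Analyzed full-text field.
                    "name": {"type": "text"},
                    # Exact-match field for filters and aggregations.
                    "city_code": {"type": "keyword"},
                    "rating": {"type": "float"},
                    "location": {"type": "geo_point"},
                    # Kept in _source only: not searchable, sortable, or aggregatable.
                    "raw_payload": {
                        "type": "keyword",
                        "index": False,
                        "doc_values": False,
                    },
                },
            }
        },
    }


def alias_switch(alias, old_index, new_index):
    """Build an _aliases body that moves `alias` from one index to another.

    Both actions are applied atomically by the aliases API.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    print(json.dumps(hotels_index_body(), indent=2))
    print(json.dumps(alias_switch("hotels", "hotels_v1", "hotels_v2"), indent=2))
```

With this pattern the application always queries the alias (`hotels` here), so a rebuilt index can be swapped in without any client-side change.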

Slide 13

Slide 13 text

Future tasks
● Create ES indexes in data pipeline using Spark
● Elasticsearch as GraphDB

Slide 14

Slide 14 text

Q & A

Slide 15

Slide 15 text

Thank You! Gaurav ([email protected])