Slide 1

Managing your Big Billion Day Logs
Aravind Putrevu, Engineer | Evangelist (@aravindputrevu)

Slide 2

No content

Slide 3

No content

Slide 4

Agenda
1. Data Platform Architectures
2. Elasticsearch Cluster Sizing
3. Optimal Bulk Size
4. Distribute the Load
5. Optimizing Disk IO
6. Final Remarks

Slides 5-9

(The agenda from slide 4 repeats on slides 5-9, highlighting one item at a time.)

Slide 10

Data Platform Architectures

Slide 11

_Search? Outside the box: life:universe, user:soulmate, city:restaurant, car:model, fridge:leftovers, work:dreamjob

Slide 12

No content

Slide 13

Logging

Slide 14

Metrics

Slide 15

Security Analytics

Slide 16

APM

Slide 17

The Elastic Journey of Data (diagram: Beats shipping log, metrics, wire, and your{beat} data)

Slide 18

The Elastic Journey of Data (diagram: Beats feed an Elasticsearch cluster with master, ingest, and data nodes)

Slide 19

The Elastic Journey of Data (diagram: adds Kibana instances (X) on top of the Elasticsearch cluster)

Slide 20

The Elastic Journey of Data (diagram: adds Logstash nodes (X) between Beats and Elasticsearch)

Slide 21

The Elastic Journey of Data (diagram: adds further data sources: web, social, sensors)

Slide 22

The Elastic Journey of Data (diagram: adds Logstash outputs for notification, queues, storage, and metrics)

Slide 23

The Elastic Journey of Data (diagram: adds Kafka/Redis messaging between Beats and Logstash)

Slide 24

The Elastic Journey of Data (diagram: adds X-Pack across Elasticsearch, Kibana, and Logstash)

Slide 25

Provision and manage multiple Elastic Stack environments and provide search-aaS, logging-aaS, BI-aaS, and data-aaS to your entire organization

Slide 26

Hosted Elasticsearch & Kibana
• Includes X-Pack features
• Starts at $45/mo
• Available on Amazon Web Services and Google Cloud Platform

Slide 27

Elasticsearch Cluster Sizing

Slide 28

Terminology (diagram: cluster my_cluster, with Server 1 running Node A, which holds index twitter with documents d1-d12 and index logs with documents d1-d6)

Slide 29

Partition (diagram: index twitter is split into shards 0-4 and index logs into shards 0-1; documents are spread across the shards)

Slide 30

Distribution (diagram: the primary shards of twitter (P0-P4) and logs (P0-P1) are distributed across Node A on Server 1 and Node B on Server 2)

Slide 31

Replication (diagram: every primary shard (P) of twitter and logs has a replica shard (R) allocated on the other node)
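
A layout like this is requested at index-creation time; a minimal sketch matching the diagram (twitter: 5 primary shards 0-4, each with one replica):

    PUT /twitter
    {
      "settings": {
        "number_of_shards":   5,
        "number_of_replicas": 1
      }
    }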

Slide 32

Scaling Data

Slides 33-34

(The Scaling Data diagram builds across slides 32-34.)

Slide 35

Scaling Big Data (diagram: an index growing across many shards and nodes)

Slide 36

Scaling
• In Elasticsearch, shards are the working unit
• More data -> more shards
• But how many shards?

Slide 37

How much data?
• ~1000 events per second
• 60s * 60m * 24h * 1000 events => ~87M events per day
• 1kb per event => ~82GB per day
• 3 months => ~7TB

Slide 38

Shard Size
• It depends on many different factors
‒ document size, mapping, use case, kinds of queries being executed, desired response time, peak indexing rate, budget, ...
• After a shard-sizing exercise*, assume each shard can handle ~45GB
• Up to 10 shards per machine

* https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing

Slide 39

How many shards?
• Data size: ~7TB
• Shard size: ~45GB*
• Total shards: ~160
• Shards per machine: 10*
• Total servers: 16
(diagram: my_cluster holding 3 months of logs)

* https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing

Slide 40

But...
• How many indices?
• What do you do if the daily data grows?
• What do you do if you want to delete old data?

Slide 41

Time-Based Data
• Logs, social media streams, time-based events
• Timestamp + data
• Documents do not change
• Typically you search for recent events
• Older documents become less important
• Hard to predict the data size

Slide 42

Time-Based Data
• Time-based indices are the best option
‒ create a new index each day, week, month, year, ...
‒ search only the indices you need in the same request
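
Two operations that make this pattern pay off; a minimal sketch, assuming the daily logs-YYYY-MM-DD indices shown on the following slides:

    # Search several daily indices in one request (wildcards also work)
    GET /logs-2017-10-06,logs-2017-10-07/_search
    GET /logs-*/_search

    # Deleting old data is a cheap index drop, not an expensive delete-by-query
    DELETE /logs-2017-07-06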

Slide 43

Daily Indices (diagram: my_cluster holds logs-2017-10-06)

Slide 44

Daily Indices (diagram: my_cluster holds logs-2017-10-06 and logs-2017-10-07)

Slide 45

Daily Indices (diagram: my_cluster holds logs-2017-10-06, logs-2017-10-07, and logs-2017-10-08)

Slide 46

Templates
• Every newly created index whose name starts with 'logs-' will have
‒ 2 shards
‒ 1 replica (for each primary shard)
‒ 60 seconds refresh interval (more on that later)

    PUT _template/logs
    {
      "template": "logs-*",
      "settings": {
        "number_of_shards": 2,
        "number_of_replicas": 1,
        "refresh_interval": "60s"
      }
    }

Slide 47

Alias (diagram: the application writes through the logs-write alias and reads through the logs-read alias, both pointing at logs-2017-10-06)

Slide 48

Alias (diagram: logs-write now points at logs-2017-10-07, while logs-read covers both daily indices)

Slide 49

Alias (diagram: logs-write now points at logs-2017-10-08, while logs-read covers all three daily indices)
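
The daily alias swap can be done atomically with the _aliases API; a sketch using the index names from the diagram:

    POST /_aliases
    {
      "actions": [
        { "remove": { "index": "logs-2017-10-07", "alias": "logs-write" } },
        { "add":    { "index": "logs-2017-10-08", "alias": "logs-write" } },
        { "add":    { "index": "logs-2017-10-08", "alias": "logs-read"  } }
      ]
    }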

Slide 50

Detour: Rollover API https://www.elastic.co/guide/en/elasticsearch/reference/5.6/indices-rollover-index.html
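
A minimal sketch of a rollover call, assuming the logs-write alias points at an index with a numeric suffix (e.g. logs-000001), which the API needs in order to derive the next index name; the thresholds are illustrative:

    POST /logs-write/_rollover
    {
      "conditions": {
        "max_age":  "1d",
        "max_docs": 100000000
      }
    }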

Slide 51

Do not Overshard
• 3 different logs, 1 index per day each, 1GB each
• 5 shards (default): 200MB per shard vs. the ~45GB target
• 6 months retention
• ~900 shards for ~180GB (per log type)
• we needed ~4 shards! don't keep default values!
(diagram: my_cluster with access-, application-, and mysql- indices)

Slide 52

No content

Slide 53

Detour: Shrink API https://www.elastic.co/guide/en/elasticsearch/reference/5.6/indices-shrink-index.html
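
A sketch following the documented two-step flow: first make the source index read-only and gather all shard copies on one node, then shrink it (index and node names are illustrative):

    PUT /logs-2017-10-06/_settings
    {
      "settings": {
        "index.routing.allocation.require._name": "shrink_node_name",
        "index.blocks.write": true
      }
    }

    POST /logs-2017-10-06/_shrink/logs-2017-10-06-small
    {
      "settings": { "index.number_of_shards": 1 }
    }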

Slide 54

Scaling Big Data (diagram: one set of shards serving 1M users)
But what happens if we have 2M users?

Slide 55

Scaling Big Data (diagram: a second copy of the shards serving another 1M users)

Slide 56

Scaling Big Data (diagram: a third copy of the shards, 1M users each)

Slide 57

Scaling Big Data (diagram: users spread across replicated copies of the data)

Slide 58

Shards are the working unit
• Primaries
‒ more data -> more shards
‒ write throughput (more writes -> more primary shards)
• Replicas
‒ high availability (1 replica is the default)
‒ read throughput (more reads -> more replicas)
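
Unlike the primary count, which is fixed at index creation, the replica count can be changed on a live index; a sketch with an illustrative index name:

    PUT /logs-2017-10-06/_settings
    {
      "number_of_replicas": 2
    }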

Slide 59

Optimal Bulk Size

Slide 60

What is Bulk?
(diagram: Beats, Logstash, or an application indexing into Elasticsearch)
1000 index requests with 1 document each vs. 1 bulk request with 1000 documents
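
A minimal sketch of the newline-delimited _bulk format (index and type names are illustrative):

    POST /_bulk
    { "index": { "_index": "logs-2017-10-06", "_type": "log" } }
    { "@timestamp": "2017-10-06T10:00:00Z", "message": "event 1" }
    { "index": { "_index": "logs-2017-10-06", "_type": "log" } }
    { "@timestamp": "2017-10-06T10:00:01Z", "message": "event 2" }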

Slide 61

What is the optimal bulk size?
For 1000 documents: 4 * 250? 2 * 500? 1 * 1000?

Slide 62

It depends...
• on your application (language, libraries, ...)
• document size (100b, 1kb, 100kb, 1mb, ...)
• number of nodes
• node size
• number of shards
• shards distribution

Slide 63

Test it ;)
Indexing 1,000,000 documents (requests * bulk size -> time):
4000 * 250  -> 160s
2000 * 500  -> 164s
1000 * 1000 -> 155s

Slide 64

Test it ;)

    DATE=`date +%Y.%m.%d`
    LOG=logs/logs.txt

    exec_test () {
      curl -s -XDELETE "http://USER:PASS@HOST:9200/logstash-$DATE"
      sleep 10
      export SIZE=$1
      # ... (rest of the function truncated on the slide)
    }

Logstash pipeline used for the test (truncated on the slide):

    input  { stdin {} }
    filter {}
    output { ... }

In Beats, set "bulk_max_size" in the output.elasticsearch section.
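
A sketch of that Beats setting, e.g. in filebeat.yml (host and value are illustrative):

    output.elasticsearch:
      hosts: ["localhost:9200"]
      # documents sent per bulk request
      bulk_max_size: 1000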

Slide 65

Test it ;)
• 2 node cluster (m3.large)
‒ 2 vCPU, 7.5GB memory, 1x32GB SSD
• 1 index server (m3.large)
‒ logstash
‒ kibana

#docs      100     500    1000    3000    5000   10000
time(s)  191.7   161.9   163.5   160.7   160.7   161.5

Slide 66

Distribute the Load

Slide 67

Avoid Bottlenecks
(diagram: 1,000,000 documents from Beats/Logstash/application sent to a single node vs. round robin across Node 1 and Node 2)

Slide 68

Clients
• Most clients implement round robin
‒ you specify a seed list
‒ the client sniffs the cluster
‒ the client implements different selectors
• Logstash allows an array of hosts (no sniffing)
• Beats allows an array of hosts (no sniffing)
• Kibana only connects to one single node

    output {
      elasticsearch {
        hosts => ["10.12.145.189", "10.121.140.167"]
      }
    }

Slide 69

Load Balancer
(diagram: 1,000,000 documents from Beats/Logstash/application routed through a load balancer to Node 1 and Node 2)

Slide 70

Coordinating-only Node
(diagram: 1,000,000 documents from Beats/Logstash/application sent to Node 3, a coordinating-only node, which distributes them to Node 1 and Node 2)
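
A coordinating-only node is one with every specialized role disabled in elasticsearch.yml (5.x settings); it then only routes and merges requests:

    node.master: false
    node.data: false
    node.ingest: false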

Slide 71

Test it ;)
• 2 node cluster (m3.large)
‒ 2 vCPU, 7.5GB memory, 1x32GB SSD
• 1 index server (m3.large)
‒ logstash (round robin configured)
‒ hosts => ["10.12.145.189", "10.121.140.167"]
‒ kibana

#docs             100     500    1000
No round robin   191.7   161.9   163.5
Round robin      189.7   159.7   159.0

Slide 72

Optimizing Disk IO

Slide 73

Durability
(diagram: indexed documents accumulate in an in-memory buffer over time; a Lucene flush writes the buffer out as a segment)

Slide 74

refresh_interval
• Dynamic per-index setting
• Increase it to get better write throughput to an index
• New documents will take longer to become available for search

    PUT logstash-2017.05.16/_settings
    {
      "index": {
        "refresh_interval": "60s"
      }
    }

#docs          100     500    1000
1s refresh    189.7   159.7   159.0
60s refresh   185.8   152.1   152.6

Slide 75

Durability
(diagram: each indexed document goes into the in-memory buffer and its operation is appended to the translog; a Lucene flush writes a segment, and an Elasticsearch flush performs a Lucene commit and clears the translog)

Slide 76

Translog fsync every 5s (1.7)
(diagram: each doc operation is buffered and appended to the translog on both the primary and the replica)
Redundancy doesn't help if all nodes lose power.

Slide 77

Translog fsync on every request
• For low-volume indexing, an fsync per request matters less
• For high-volume indexing, we can amortize the cost and fsync once per bulk
• Concurrent requests can share an fsync
(diagram: bulk 1 and bulk 2 sharing a single fsync)

Slide 78

Async Transaction Log
• index.translog.durability
‒ request (default)
‒ async
• index.translog.sync_interval (only if async is set)
• Dynamic per-index settings
• Be careful, you are relaxing the safety guarantees

#docs            100     500    1000
Request fsync   185.8   152.1   152.6
5s sync         154.8   143.2   143.1
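
A sketch of relaxing translog durability on one index (name illustrative); documents acknowledged within the last sync window can be lost if the node crashes:

    PUT /logs-2017-10-06/_settings
    {
      "index.translog.durability": "async",
      "index.translog.sync_interval": "5s"
    }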

Slide 79

Final Remarks

Slide 80

Final Remarks
(diagram: the full architecture from slide 24: Beats and data sources, Kafka/Redis messaging, Logstash nodes, Elasticsearch master/ingest/data nodes, Kibana instances, X-Pack throughout)

Slide 81

Final Remarks
• Primaries
‒ more data -> more shards
‒ do not overshard!
• Replicas
‒ high availability (1 replica is the default)
‒ read throughput (more reads -> more replicas)

Slide 82

Final Remarks
• Bulk and test
• Distribute the load
• Refresh interval
• Async translog (careful)

#docs             100     500    1000
Default          191.7s  161.9s  163.5s
RR+60s+Async5s   154.8s  143.2s  143.1s

Slide 83

References
bit.ly/ArchitectMeetup

Slide 84

Managing your Big Billion Day Logs
Aravind Putrevu, Engineer | Evangelist (@aravindputrevu)