Elasticsearch in production

Slide 1

Slide 1 text

Elasticsearch in production
Monitoring, problems, and recommended setup
in 15 minutes (or less!)

Slide 2

Slide 2 text

Javier Rey
Tryolabs
@vierja
github.com/vierja

Slide 3

Slide 3 text

Monitoring ⌚

Things to measure
• Cluster state
    Shards
    Segments
    Tasks
• Node state
    Hardware
    JVM
    Cache
• Application usage

"Be warned that being an expert is more than understanding how a system is supposed to work. Expertise is gained by investigating why a system doesn't work." - Brian Redman

Slide 4

Slide 4 text

Shards and Segments
• Status (duh!)
• relocating_shards
• initializing_shards
• unassigned_shards
• delayed_unassigned_shards
• Percentage of deleted documents
• Number of segments: 50-150 per index
• Number of documents, size

GET /_cluster/health

    {
      "cluster_name": "elasticsearch",
      "status": "yellow",
      "timed_out": false,
      "number_of_nodes": 1,
      "number_of_data_nodes": 1,
      "active_primary_shards": 10,
      "active_shards": 10,
      "relocating_shards": 0,
      "initializing_shards": 0,
      "unassigned_shards": 10,
      "delayed_unassigned_shards": 0,
      "number_of_pending_tasks": 0,
      "number_of_in_flight_fetch": 0,
      "task_max_waiting_in_queue_millis": 0,
      "active_shards_percent_as_number": 50
    }

❗ Tasks
• Pending tasks
• Is any task hung?
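As a rough illustration of how the health fields above could feed an alert, here is a minimal Python sketch. It assumes the response body of GET /_cluster/health has already been fetched and parsed into a dict; the function name `health_warnings` and the choice of which counters to flag are illustrative assumptions, not part of the talk.

```python
def health_warnings(health):
    """Return warnings for a parsed GET /_cluster/health response body.

    Hypothetical helper: the set of counters checked here is an assumption.
    """
    warnings = []
    status = health.get("status")
    if status != "green":
        warnings.append("status is %s" % status)
    # Any non-zero value in these counters deserves a look.
    for key in (
        "relocating_shards",
        "initializing_shards",
        "unassigned_shards",
        "delayed_unassigned_shards",
        "number_of_pending_tasks",
    ):
        value = health.get(key, 0)
        if value > 0:
            warnings.append("%s = %d" % (key, value))
    return warnings


# Values copied from the sample response on this slide.
sample = {
    "status": "yellow",
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 10,
    "delayed_unassigned_shards": 0,
    "number_of_pending_tasks": 0,
}

print(health_warnings(sample))
```

With the sample from the slide, the single-node cluster is reported as yellow because its 10 replica shards have nowhere to be assigned.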

Slide 5

Slide 5 text

Hardware
• CPU load
• Free space
• Open file descriptors / Max file descriptors
• Swap? ☠

JVM
• Heap usage
• Garbage collection (count & times, young & old)

Cache (filter_cache, fielddata)
• Cache size
• Evictions
• Fielddata circuit breakers

Application usage
• Request rate
• Query latency
• Per shard query latency
• Index rate
• Delete rate
• "etc rate"
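The node-level checks listed above could be sketched as follows. This is only a sketch under assumptions: the dict is expected to look like one node entry from the node stats API, the field names (`jvm.mem.heap_used_percent`, `process.open_file_descriptors`, `process.max_file_descriptors`, `os.swap.used_in_bytes`) may differ between Elasticsearch versions, and the thresholds are arbitrary placeholders rather than recommendations from the talk.

```python
def node_alerts(node, heap_limit_pct=75, fd_limit_pct=80):
    """Return alerts for one node's stats (hypothetical helper).

    `node` is assumed to be a dict shaped like a single node entry of the
    node stats API; the thresholds are illustrative, not official values.
    """
    alerts = []

    heap_pct = node["jvm"]["mem"]["heap_used_percent"]
    if heap_pct > heap_limit_pct:
        alerts.append("heap at %d%%" % heap_pct)

    proc = node["process"]
    # Integer percentage of the file descriptor limit currently in use.
    fd_pct = proc["open_file_descriptors"] * 100 // proc["max_file_descriptors"]
    if fd_pct > fd_limit_pct:
        alerts.append("file descriptors at %d%%" % fd_pct)

    # Any swap usage at all is treated as a problem.
    if node["os"]["swap"]["used_in_bytes"] > 0:
        alerts.append("swap in use")

    return alerts


# Made-up values for illustration only.
example_node = {
    "jvm": {"mem": {"heap_used_percent": 82}},
    "process": {"open_file_descriptors": 900, "max_file_descriptors": 1000},
    "os": {"swap": {"used_in_bytes": 0}},
}

print(node_alerts(example_node))
```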

Slide 6

Slide 6 text

How to measure them

Ideally
• New Relic Elasticsearch plugin: https://github.com/s12v/newrelic-elasticsearch
• Marvel: https://www.elastic.co/products/marvel

"Plugins" (no installation needed)
• Whatson: https://github.com/xyu/elasticsearch-whatson
• elastichq: http://www.elastichq.org/app/index.php
• kopf: https://github.com/lmenezes/elasticsearch-kopf
• bigdesk: http://bigdesk.org/
• paramedic: https://github.com/karmi/elasticsearch-paramedic

API (BYO)
• Cat API: GET /_cat

Slide 7

Slide 7 text

GET /_cat/

    =^.^=
    /_cat/allocation
    /_cat/shards
    /_cat/shards/{index}
    /_cat/master
    /_cat/nodes
    /_cat/indices
    /_cat/indices/{index}
    /_cat/segments
    /_cat/segments/{index}
    /_cat/count
    /_cat/count/{index}
    /_cat/recovery
    /_cat/recovery/{index}
    /_cat/health
    /_cat/pending_tasks
    /_cat/aliases
    /_cat/aliases/{alias}
    /_cat/thread_pool
    /_cat/plugins
    /_cat/fielddata
    /_cat/fielddata/{fields}
    /_cat/nodeattrs
    /_cat/repositories
    /_cat/snapshots/{repository}

GET /_cat/segments?v&h=shard,segment,docs.count,size.memory

    shard segment docs.count size.memory
    0     _c      12855      10511
    0     _l      18655      13055
    0     _m      2747       5394
    0     _n      49         3483
    0     _o      319        3460
    0     _p      3364       5851
    0     _q      2148       4743
    1     _m      17124      12742
    1     _v      23005      14209
    2     _c      9987       9236
    2     _l      2992       5565
    2     _m      17604      13041
    2     _n      165        4533
    2     _o      2866       5448
    2     _p      3213       5522
    2     _q      33         3335
    2     _r      556        3702
    2     _s      2408       5003
    3     _l      22082      14204
    3     _v      11427      9837
    3     _w      514        3722
    3     _x      2693       5291
    3     _y      168        4614
    3     _z      2900       5363
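Because the Cat API returns plain whitespace-separated text, it is easy to post-process. The sketch below counts segments per shard from output shaped like the `/_cat/segments?v&h=...` example above. It assumes the `v` flag was used (so the first line is a header) and that the output covers a single index, since shard numbers repeat across indices; `segments_per_shard` is a hypothetical helper name.

```python
from collections import Counter


def segments_per_shard(cat_output):
    """Count segments per shard in `_cat/segments?v` style text output.

    Assumes a header row containing a `shard` column and a single index.
    """
    lines = cat_output.strip().splitlines()
    header = lines[0].split()
    shard_col = header.index("shard")

    counts = Counter()
    for line in lines[1:]:
        fields = line.split()
        if fields:  # skip blank lines
            counts[int(fields[shard_col])] += 1
    return dict(counts)


# A few rows taken from the output on this slide.
sample_output = """
shard segment docs.count size.memory
0 _c 12855 10511
0 _l 18655 13055
0 _m 2747 5394
1 _m 17124 12742
1 _v 23005 14209
"""

print(segments_per_shard(sample_output))
```

Summing the counts for an index gives the number to compare against the 50-150 segments per index guideline mentioned earlier.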

Slide 8

Slide 8 text

Typical problems
• Node
• Configuration
• Cluster
• (Mis)use

Slide 9

Slide 9 text

Node
• Swap configured
• Heap size misconfigured
• Memory pressure
    Heavy garbage collection
    filter_cache full
• Storage
    Network storage
    Slow
    Misconfigured (max open files)
• Merge throttling
• Circuit breakers

Slide 10

Slide 10 text

Configuration
• Number of shards
    Over-sharding
• Misconfigured mappings
    Default mappings
    Unnecessary analyzed fields
• Data path

Cluster ☁
• Split brain
• Bandwidth problems that affect replication
• Multicast

(Mis)use
• Structuring: SQL → NoSQL
• Very large nested objects
• Huge paginations without _scroll (bots)
• Many updates → many deleted documents → lots of merging → ???
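To see why huge paginations without _scroll hurt, a back-of-the-envelope calculation helps. With regular from/size paging, each shard has to produce its own top `from + size` hits, and the coordinating node then merges all of them to return a single page. The sketch below only does that arithmetic; the function name and the example numbers are illustrative assumptions.

```python
def hits_merged_for_page(from_, size, shards):
    """Hits the coordinating node must merge for one from/size page.

    Each shard returns its own top (from_ + size) hits, so the total
    grows linearly with both the page offset and the shard count.
    """
    return shards * (from_ + size)


# First page vs. a deep page (e.g. a bot crawling far into the results),
# on a hypothetical index with 5 shards.
print(hits_merged_for_page(0, 10, 5))
print(hits_merged_for_page(10000, 10, 5))
```

Both requests return 10 documents to the client, but the deep page forces the cluster to collect and sort three orders of magnitude more hits.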

Slide 11

Slide 11 text

Recommended setup ✅

Upgrade mechanism
• No downtime
• No data loss (or easy to repopulate)

Elasticsearch 2.X
• doc_values by default
• Better query execution planner using filters
• Query profiler
• Geo query optimization
• Merging optimizations
• Better recovery

Hardware
• RAM > CPU
• >= 2 cores (indexing is CPU-bound)
• SSD
• No more than 32 GB

Slide 12

Slide 12 text

Configuration and important settings
• Max 50 GB / shard
• Non-default mapping
• Rolling indices (time series data)
• Dedicated master nodes
• Number of replicas: 1 (if no more nodes are added)

    bootstrap.mlockall: true
    cluster.name: 'Cluster name'
    discovery.zen.minimum_master_nodes: 2
    index.refresh_interval: >1s ?
    path:
      data:
      logs:
      plugins:
    discovery.zen.ping.unicast.hosts: ["host1", …]
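The value 2 for discovery.zen.minimum_master_nodes is not arbitrary: the usual guideline for avoiding split brain is a quorum of the master-eligible nodes, (master_eligible_nodes / 2) + 1 with integer division. A minimal sketch of that calculation, assuming the talk's setup has three master-eligible nodes (the slide does not state the node count):

```python
def minimum_master_nodes(master_eligible_nodes):
    """Quorum of master-eligible nodes: (N / 2) + 1 with integer division."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


# Quorum for a few cluster sizes; 3 master-eligible nodes gives 2.
for n in (1, 2, 3, 5):
    print(n, minimum_master_nodes(n))
```

Note that with only two master-eligible nodes the quorum is also 2, so losing either node leaves the cluster without a master; three dedicated masters is the smallest setup that tolerates one failure.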

Slide 13

Slide 13 text

Questions?