Application Logging with Elasticsearch at Naver

Elastic Co
December 12, 2017

Jaeik Lee | Lead In-house Log Management Platform | Naver


Transcript

  1. Application Logging Platform with Elasticsearch in Naver
    Lee, Jae Ik, NELO Team, Naver, 2017/12/12
  2. Agenda
    1. Introduction to NELO
    2. How NELO was using Elasticsearch
    3. Problem of multi-tenant logging platform
    4. Our Solution: Multi-cluster & Multi-indices per day
  3. NELO
  4. Introduction to NELO: Features
    • SDK/Agent: log forwarding for various platforms
    • Collect/search/analyze logs
    • Real-time/scheduled alerts
    • Crash log symbolication
    • Own webapp + Kibana (dashboard)
    • OpenAPI for custom data processing
  5. Introduction to NELO: Architecture (diagram labels)
    • Mobile/desktop applications, SDK
    • System log forwarding, Syslog
    • Open-source logging agents
    • Thrift, HTTP, HTTPS
    • Collector server cluster: Collector Server 1, Collector Server 2, Collector Server 3, Collector Server N
    • Filter, Convert, Queue Sink
    • Distributed queue: Kafka
    • Symbolicator, alert server, crash aggregation, search/analysis server
    • Webapp & Kibana, NELO OpenAPI, metadata DB
  6. Elasticsearch in NELO
  7. Scale
    • 7 instances, 9 clusters
    • Nodes: 388
    • Documents (total number of logs): 2630B
    • Size (total size of logs): 627T
  8. Index Model: Time-based model
    • 1 index per day → daily index lifecycle management
    • Retention time varies by instance (1M, 3M, 2Y, 5Y)
    • One type per project → mappings vary per project
    Example indices: nelo2-2017-08-19, nelo2-2017-08-20, …, nelo2-2017-09-18
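
The daily index model above can be sketched in a few lines. This is only an illustration of the naming scheme and of retention by index name: the function names and the idea of computing expired indices from a date range are assumptions, not NELO's actual lifecycle code.

```python
from datetime import date, timedelta
from typing import List


def daily_index_name(day: date, prefix: str = "nelo2") -> str:
    """Build the daily index name, e.g. nelo2-2017-08-19."""
    return f"{prefix}-{day:%Y-%m-%d}"


def expired_indices(today: date, retention_days: int, oldest: date,
                    prefix: str = "nelo2") -> List[str]:
    """List daily indices older than the retention window.

    `oldest` is the first day for which an index exists (hypothetical input).
    """
    cutoff = today - timedelta(days=retention_days)
    names = []
    day = oldest
    while day < cutoff:
        names.append(daily_index_name(day, prefix))
        day += timedelta(days=1)
    return names


if __name__ == "__main__":
    print(daily_index_name(date(2017, 8, 19)))
    print(expired_indices(date(2017, 9, 18), 30, date(2017, 8, 17)))
```

With one index per day, applying a retention policy reduces to deleting whole indices by name, which is what makes the time-based model convenient for lifecycle management.
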
  9. Custom Routing: Depends on project size
    • Use custom routing in both indexing and search
      ‒ Small project: store in only one shard (custom routing: project name)
      ‒ Big project: distribute logs over all shards (default routing)
    Diagram: clients writing to index nelo2-2015-09-18, shards 0 to 9
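
A minimal sketch of the routing decision described above, assuming a hand-maintained set of big projects. The set contents, the function names, and the way parameters are assembled are illustrative assumptions; only the `routing` request parameter itself is standard Elasticsearch.

```python
from typing import Dict, Optional

# Hypothetical list of projects considered "big"; how NELO classifies
# projects is not stated in the slides.
BIG_PROJECTS = {"naverapp", "line", "band"}


def routing_for(project: str) -> Optional[str]:
    """Small projects route by project name; big projects use default routing."""
    return None if project in BIG_PROJECTS else project


def request_params(project: str, index: str) -> Dict[str, str]:
    """Build parameters for an index or search request against `index`."""
    params = {"index": index}
    routing = routing_for(project)
    if routing is not None:
        # The same routing value must be used for indexing and searching,
        # otherwise a search would miss the shard holding the documents.
        params["routing"] = routing
    return params


if __name__ == "__main__":
    print(request_params("tiny-project", "nelo2-2015-09-18"))
    print(request_params("line", "nelo2-2015-09-18"))
```
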
  10. Hot-Warm Architecture
    • Recent data in SSD
    Diagram labels: Web UI (search query), Indexer (index query), Client Node, Master Nodes, SSD Data Node, HDD Data Node, Search
  11. Problems
  12. Mapping Explosion: Too many mappings
    • Numbers: 3,000 projects → 3,000 mappings
    • Mapping size: 6MB
  13. Mapping Explosion: Cluster stopping due to mapping updates
    [2017-05-30 21:36:57,773][WARN ][cluster.service ] [elastic09.nelo2] cluster state update task [put-mapping [naver-project],put-mapping [naver-project]] took 5.1m above the warn threshold of 30s
  14. Mapping Explosion: Indexing lag
  15. Shard Size Distribution: Skewed shards due to routing
    [Chart: shard size distribution]
  16. Multi-Cluster, Multi-Indices per day
  17. Tribe: Introduction
    • A federated client across multiple clusters
    • Limits
      ‒ Cannot handle indices with the same name in multiple clusters
      ‒ Master-level write operations are not allowed
    • Will be replaced with cross-cluster search
    Diagram: Client, Tribe Node, Cluster A, Cluster B
  18. Tribe: Sample
        cluster.name: es-tribe
        tribe:
          es1:
            cluster.name: es1
            discovery.zen.ping.unicast.hosts: ['10.3.8.76']
          es2:
            cluster.name: es2
            discovery.zen.ping.unicast.hosts: ['10.3.8.75']
  19. Tribe in NELO
    Diagram labels:
    • Webapp (search), Indexer (index)
    • Tribe Nodes: search hot data, search cold data
    • Hot Cluster (master nodes), Cold Cluster (master nodes)
    • HDFS: snapshot, restore
  20. Index Model Change: Introduction
    • Time-based index model + project-based index model
      ‒ Same policy for daily index creation
      ‒ For big projects, split indices
      ‒ For small projects, share index
    Example indices: nelo2-2017-09-18 → nelo2-2017-09-18, nelo2-2017-09-18-naverapp, nelo2-2017-09-18-line, nelo2-2017-09-18-band
  21. Index Model Change: Aliases
    • Use aliases regardless of whether a project is indexed in the common index or in its own index.
      ‒ Alias name: <project name>-yyyyMMdd
      ‒ Index name: nelo2-log-yyyy-MM-dd-<project name>
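
The naming convention above can be illustrated with a short sketch. The alias and own-index patterns come from the slide; the name of the shared index for small projects (`nelo2-log-yyyy-MM-dd` with no suffix) and all function names are assumptions made for the example.

```python
from datetime import date
from typing import Dict


def alias_name(project: str, day: date) -> str:
    """Alias pattern from the slide: <project name>-yyyyMMdd."""
    return f"{project}-{day:%Y%m%d}"


def index_name(project: str, day: date, has_own_index: bool) -> str:
    """Own index: nelo2-log-yyyy-MM-dd-<project name>.

    The shared index name without a suffix is an assumption.
    """
    base = f"nelo2-log-{day:%Y-%m-%d}"
    return f"{base}-{project}" if has_own_index else base


def alias_action(project: str, day: date, has_own_index: bool) -> Dict:
    """One 'add' action in the shape accepted by the _aliases API."""
    return {
        "add": {
            "index": index_name(project, day, has_own_index),
            "alias": alias_name(project, day),
        }
    }


if __name__ == "__main__":
    day = date(2017, 9, 18)
    print(alias_action("line", day, has_own_index=True))
    print(alias_action("tiny-project", day, has_own_index=False))
```

Because clients only ever address the alias, a project can be moved between the shared index and its own index without changing client code. An alias onto a shared index would probably also need a filter or routing value to isolate one project's documents; that part is not shown here.
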
  22. Index Model Change: Shard size estimation
    • Shard count = Estimated Index Size / Shard Size Threshold
      ‒ Estimated Index Size: average values of past logs
      ‒ Shard Size Threshold: configured (by test)
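
As a worked example of the shard count formula above: the sketch below averages past daily index sizes and divides by a threshold. Rounding up and enforcing a minimum of one shard are assumptions, as are the function names and the sample numbers; the slides state only the ratio.

```python
import math
from typing import Sequence


def estimated_index_size(past_sizes_gb: Sequence[float]) -> float:
    """Estimate the next daily index size as the average of past sizes."""
    return sum(past_sizes_gb) / len(past_sizes_gb)


def shard_count(estimated_size_gb: float, shard_size_threshold_gb: float) -> int:
    """Estimated Index Size / Shard Size Threshold, rounded up, at least 1."""
    return max(1, math.ceil(estimated_size_gb / shard_size_threshold_gb))


if __name__ == "__main__":
    # Hypothetical sizes of the last three daily indices, in GB.
    size = estimated_index_size([90.0, 100.0, 95.0])
    print(size, shard_count(size, shard_size_threshold_gb=30.0))
```
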
  23. After changing model: Stabilized indexing
  24. After changing model: Shard size distribution
    [Charts: shard size distribution]
  25. Q&A