Slide 1

Slide 1 text

Application Logging Platform with Elasticsearch in Naver
Lee, Jae Ik, NELO Team, Naver
2017/12/12

Slide 2

Slide 2 text

Agenda
1 Introduction to NELO
2 How NELO was using Elasticsearch
3 Problems of a multi-tenant logging platform
4 Our solution: Multi-cluster & multi-indices per day

Slide 3

Slide 3 text

NELO

Slide 4

Slide 4 text

Introduction to NELO: Features
• SDK/Agent: log forwarding for various platforms
• Collect, search, and analyze logs
• Real-time and scheduled alerts
• Crash log symbolication
• Own webapp + Kibana (dashboard)
• OpenAPI for custom data processing

Slide 5

Slide 5 text

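Introduction to NELO: Architecture
[Diagram] Components shown:
• Log sources: mobile/desktop applications (SDK), system log forwarding, open-source logging agents
• Protocols: Thrift, Syslog, HTTP, HTTPS
• Collector server cluster: Collector Server 1, 2, 3, … N
• Processing stages: Filter, Convert, Queue Sink
• Distributed queue: Kafka
• Servers: symbolicator, alert server, crash aggregation, search/analysis server
• Front ends: Webapp & Kibana, NELO OpenAPI
• Metadata DB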

Slide 6

Slide 6 text

Elasticsearch in NELO

Slide 7

Slide 7 text

Scale
• Nodes: 388 (7 instances, 9 clusters)
• Documents (total number of logs): 2630B
• Size (total size of logs): 627T

Slide 8

Slide 8 text

Index Model: Time-based model
• 1 index per day → daily index lifecycle management
• Retention time varies by instance (1M, 3M, 2Y, 5Y)
• One type per project → mappings vary per project
Example indices: nelo2-2017-08-19, nelo2-2017-08-20, … nelo2-2017-09-18
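The daily index lifecycle above can be sketched as follows. This is a minimal illustration, not NELO's actual implementation: the `nelo2-yyyy-MM-dd` naming comes from the slide, while the function names and the choice to express retention as a number of days (for example, treating 1M as 30 days) are assumptions.

```python
from datetime import date, timedelta
from typing import List

INDEX_PREFIX = "nelo2"  # prefix taken from the slide's example index names


def daily_index_name(day: date) -> str:
    """Return the daily index name for a date, e.g. nelo2-2017-08-19."""
    return f"{INDEX_PREFIX}-{day:%Y-%m-%d}"


def expired_indices(existing: List[str], today: date, retention_days: int) -> List[str]:
    """Return the indices whose date falls before the retention window.

    retention_days is an assumed representation of the retention time.
    """
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in existing:
        # Strip "<prefix>-" and parse the remaining yyyy-MM-dd part.
        day = date.fromisoformat(name[len(INDEX_PREFIX) + 1:])
        if day < cutoff:
            expired.append(name)
    return expired
```

With one index per day, retention becomes a matter of dropping whole indices rather than deleting individual documents.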

Slide 9

Slide 9 text

Custom Routing: Depends on project size
• Use custom routing for both indexing and search
‒ Small project: store in only one shard (custom routing: project name)
‒ Big project: distribute logs over all shards (default routing)
[Diagram: three clients and shards 0-9 of index nelo2-2015-09-18]
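The routing rule above can be sketched as follows. The function names are hypothetical, and `zlib.crc32` is only a stand-in for Elasticsearch's internal routing hash. The point is that a fixed routing value always lands on the same shard, while default routing (by document id) spreads documents out.

```python
import zlib
from typing import Optional

NUM_SHARDS = 10  # the slide's diagram shows shards 0-9


def routing_key(project: str, is_big: bool) -> Optional[str]:
    """Small projects route by project name; big projects use default routing."""
    return None if is_big else project


def target_shard(routing: Optional[str], doc_id: str) -> int:
    """Pick a shard from the routing value, falling back to the document id.

    Elasticsearch uses its own hash function; crc32 is a stand-in here.
    """
    key = routing if routing is not None else doc_id
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS
```

The same routing value has to be supplied at search time so that a small project's query touches only its one shard.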

Slide 10

Slide 10 text

Hot-Warm Architecture
• Recent data on SSD
[Diagram: Web UI (search query), Indexer (index query), client node, SSD data node, HDD data node, master nodes]
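The slide does not say how indices move from SSD to HDD nodes. One common approach is Elasticsearch's shard allocation filtering, sketched below under stated assumptions: the node attribute name `box_type`, its values `hot` and `warm`, and the 7-day threshold are all hypothetical choices, not details from the talk.

```python
from datetime import date

HOT_DAYS = 7  # hypothetical: how long an index stays on SSD (hot) nodes


def allocation_settings(index_day: date, today: date) -> dict:
    """Build an index-settings body that pins an index to hot or warm nodes.

    Assumes data nodes are tagged with a custom attribute named box_type.
    """
    age_days = (today - index_day).days
    tier = "hot" if age_days < HOT_DAYS else "warm"
    return {"index.routing.allocation.require.box_type": tier}
```

Applying the returned body through the index settings API would cause Elasticsearch to relocate the index's shards to nodes carrying the matching attribute.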

Slide 11

Slide 11 text

Problems

Slide 12

Slide 12 text

Mapping Explosion: Too many mappings
• Number: 3,000 projects → 3,000 mappings
• Mapping size: 6MB

Slide 13

Slide 13 text

Mapping Explosion: Cluster stalls during mapping updates
[2017-05-30 21:36:57,773][WARN ][cluster.service ] [elastic09.nelo2] cluster state update task [put-mapping [naver-project],put-mapping [naver-project]] took 5.1m above the warn threshold of 30s

Slide 14

Slide 14 text

Mapping Explosion: Indexing lag

Slide 15

Slide 15 text

Shard Size Distribution: Skewed shards due to routing
[Chart: distribution of shard sizes]
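Why routing by project name skews shards can be shown with a small simulation. This is an illustrative sketch: the project names and sizes in any call are made up, and `zlib.crc32` stands in for Elasticsearch's routing hash. Each project lands entirely on one shard, so one large project inflates whichever shard it hashes to.

```python
import zlib
from typing import Dict, List


def shard_sizes(project_sizes: Dict[str, int], num_shards: int) -> List[int]:
    """Sum project sizes per shard when each project is routed by its name."""
    sizes = [0] * num_shards
    for project, size in project_sizes.items():
        shard = zlib.crc32(project.encode("utf-8")) % num_shards  # stand-in hash
        sizes[shard] += size
    return sizes


def skew_ratio(sizes: List[int]) -> float:
    """Largest shard divided by the mean shard size (1.0 means perfectly even)."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean if mean else 0.0
```

For example, `skew_ratio(shard_sizes({"a": 100, "b": 1, "c": 1}, 10))` is far above 1.0, because the shard holding project `a` carries almost all of the data.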

Slide 16

Slide 16 text

Multi-Cluster Multi-Indices per day

Slide 17

Slide 17 text

Tribe: Introduction
• A federated client across multiple clusters
• Limits
‒ Cannot handle indices with the same name in multiple clusters
‒ Master-level write operations are not allowed
• Will be replaced by cross-cluster search
[Diagram: Client, Tribe Node, Cluster A, Cluster B]

Slide 18

Slide 18 text

Tribe: Sample
cluster.name: es-tribe
tribe:
  es1:
    cluster.name: es1
    discovery.zen.ping.unicast.hosts: ['10.3.8.76']
  es2:
    cluster.name: es2
    discovery.zen.ping.unicast.hosts: ['10.3.8.75']

Slide 19

Slide 19 text

Tribe: In NELO
[Diagram] Components: Webapp, Indexer, Tribe Nodes, Hot Cluster (with master nodes), Cold Cluster (with master nodes), HDFS
Edge labels: index, search, search hot data, search cold data, snapshot, restore

Slide 20

Slide 20 text

Index Model Change: Introduction
• Time-based index model + project-based index model
‒ Same policy for daily index creation
‒ Big projects get their own indices
‒ Small projects share an index
Before: nelo2-2017-09-18
After: nelo2-2017-09-18, nelo2-2017-09-18-naverapp, nelo2-2017-09-18-line, nelo2-2017-09-18-band
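The combined model can be sketched as a lookup from project to index name. The three big-project names come from the slide's example indices. The function name and the idea of a fixed set of big projects are assumptions made for illustration.

```python
from datetime import date
from typing import Set

# Project names taken from the slide's example index names.
BIG_PROJECTS: Set[str] = {"naverapp", "line", "band"}


def index_for(project: str, day: date) -> str:
    """Return the project's own daily index if it is big, else the shared one."""
    base = f"nelo2-{day:%Y-%m-%d}"
    return f"{base}-{project}" if project in BIG_PROJECTS else base
```

Usage: `index_for("line", date(2017, 9, 18))` gives `nelo2-2017-09-18-line`, while any project outside the set maps to the shared `nelo2-2017-09-18`.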

Slide 21

Slide 21 text

Index Model Change: Aliases
• Use aliases regardless of whether a project is indexed in the shared index or in its own index.
‒ Alias name: -yyyyMMdd
‒ Index name: nelo2-log-yyyy-MM-dd-
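A per-project alias lets clients ignore where a project's logs physically live. The sketch below builds a request body for Elasticsearch's `_aliases` API. The alias pattern shown on the slide is incomplete, so the `<project>-yyyyMMdd` form used here is an assumption, as are the function name and the `projectName` field used to filter a shared index.

```python
from datetime import date


def alias_action(project: str, day: date, index: str, shared: bool) -> dict:
    """Build an _aliases request body that points a project alias at an index."""
    alias = f"{project}-{day:%Y%m%d}"  # assumed alias pattern
    add = {"index": index, "alias": alias}
    if shared:
        # In a shared index, restrict the alias to this project's documents.
        # "projectName" is a hypothetical field name.
        add["filter"] = {"term": {"projectName": project}}
    return {"actions": [{"add": add}]}
```

With this arrangement a project can later move from the shared index to its own index by changing only the alias target, without changes on the client side.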

Slide 22

Slide 22 text

Index Model Change: Shard size estimation
• Shard count = Estimated Index Size / Shard Size Threshold
‒ Estimated Index Size: average size of past logs
‒ Shard Size Threshold: configured (determined by testing)
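The shard count formula above can be worked through in a few lines. The slide gives only the ratio. Rounding up and enforcing a minimum of one shard are assumptions, as are the function names and the choice of gigabytes as the unit.

```python
import math
from typing import Sequence


def estimate_index_size(past_sizes_gb: Sequence[float]) -> float:
    """Estimated index size: the average size of past daily indices."""
    return sum(past_sizes_gb) / len(past_sizes_gb)


def shard_count(past_sizes_gb: Sequence[float], shard_size_threshold_gb: float) -> int:
    """Shard count = estimated index size / shard size threshold.

    Rounded up, with at least one shard (both are assumptions).
    """
    estimated = estimate_index_size(past_sizes_gb)
    return max(1, math.ceil(estimated / shard_size_threshold_gb))
```

For example, past sizes of 90, 110, and 100 GB average to 100 GB. With a 30 GB threshold that gives 100 / 30 ≈ 3.33, which rounds up to 4 shards.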

Slide 23

Slide 23 text

After Changing the Model: Stabilized indexing

Slide 24

Slide 24 text

After Changing the Model: Shard size distribution
[Charts: distribution of shard sizes]

Slide 25

Slide 25 text

Q&A