Slide 1

2019 DevDay LINT (LINE Improvement for Next Ten Years) > Shunsuke Nakamura > LINE LINT TF Engineering manager & Software engineer

Slide 2

Agenda > LINT background > Event delivery core > LINT Epics • HTTP/2 and Push • Event delivery mechanism

Slide 3

Agenda > LINT background > Event delivery core > LINT Epics • HTTP/2 and Push • Event delivery mechanism

Slide 4

LINE started as a service in 2011
> A few kinds of servers: the messaging service plus other services
> iPhone / Android clients
> ~1 million users
> clusters of ~10s of nodes

Slide 5

The messaging service grew into a platform
> Many kinds of servers, demanding reliability & flexibility
> iPhone / Android / Desktop / Chrome / Watch / iPad clients, and other 2nd~3rd parties
> approx. 200 million users
> clusters of ~1000 nodes

Slide 6

A platform sitting on top of technical debts
> Higher reliability: ensure event delivery with robustness and resiliency
> More flexibility: on-demand data delivery
Example inconsistencies: UserA blocks UserB yet still receives a message from UserB ("Blocked UserB though?"), or the badge shows 1 unread message when there is none ("No unread message though?").

Slide 7

The debts were known, but no one could focus on them: iPhone client dev, Android client dev, front-end server dev, back-end server dev, planner, QA.

Slide 8

LINT (LINE Improvement for Next Ten Years)
> Gather the BIG issues that block future growth
> Work on them systematically toward a common goal
A task force across multiple teams: the LINT TF was born to fight the big technical debts, spanning Technical PM, iOS, Android, Desktop, Chrome, front-end server, back-end server, auth server, storage server, and QA.

Slide 9

LINT Epics in 2019: picked the highest-priority ones from many Epics
> 1. HTTP/2 and Push
> 2. Event delivery mechanism
> 3. Authentication Token renewal
> 4. General setting storage for client/server

Slide 10

Talks 1 and 2 at LINE DEVELOPER DAY 2019
> 1. HTTP/2 and Push
> 2. Event delivery mechanism
> 3. Authentication Token renewal (appendix)
> 4. General setting storage for client/server (appendix)

Slide 11

Agenda > LINT background > Event delivery core > LINT Epics • HTTP/2 and Push • Event delivery mechanism

Slide 12

Components of the event delivery core: LINE client → L4 load balancer + LEGY (front-end) → talk-server (back-end) → storages, plus collaborating services.

Slide 13

Event := "Operation" (OP)
The server stores OPs per account in storage; each client device fetches its OPs.

struct Operation {
  1: i64 revision,     // 1, 2, 3, … per account
  2: OpType type,      // enum of Message, Group, Contact, …
  3: Payload payload,  // message, messageId, targetUserId, chatId, …
}
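
To make the shape of an OP concrete, here is a minimal Java mirror of the struct above with one sample operation. The OpType values come from the next slide; the payload keys follow the comments above and are illustrative, not the actual LINE schema.

import java.util.Map;

// Minimal Java mirror of the Thrift struct above (illustrative, not the real schema).
enum OpType { SEND_MSG, RECV_MSG, SEND_CHAT_CHECKED, NOTIFIED_READ_MSG }

record Operation(long revision, OpType type, Map<String, String> payload) {}

class OperationExample {
    public static void main(String[] args) {
        // revision is a per-account sequence: 1, 2, 3, ...
        Operation op = new Operation(42L, OpType.SEND_MSG,
                Map.of("messageId", "m-1001", "targetUserId", "userB", "chatId", "chat-7"));
        System.out.println(op); // Operation[revision=42, type=SEND_MSG, payload={...}]
    }
}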

Slide 14

LINE's core process is Operation delivery.
> OP for Messaging: the sender and the receivers (sendMessage creates SEND_MSG OP / RECV_MSG OP; mark-as-read creates SEND_CHAT_CHECKED OP / NOTIFIED_READ_MSG OP)
> OP for Contact (social-graph): the requesting user and their followers
> OP for LINE Group: the requesting user and the members of the group

Slide 15

fetchOperations API

list<Operation> fetchOperations(i64 clientRev, OpCount count);

Example with the server holding OP001 … OP104 plus END:
fetchOperations(rev:1, count:100) returns the next 100 OPs (localRev: 1 => 101);
fetchOperations(rev:101, count:100) returns the remaining OPs and END (localRev: 101 => 104).
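
As a sketch of the client side of this API, reusing the Operation record from the earlier sketch, a hypothetical TalkClient stub drives the incremental loop (assumed names; the real client is generated from the Thrift definition):

import java.util.List;

// Hypothetical stand-in for the Thrift-generated client interface.
interface TalkClient {
    List<Operation> fetchOperations(long clientRev, int count);
}

class FetchLoop {
    private static final int COUNT = 100;

    // Fetch OPs after localRev until the server runs out, advancing localRev
    // as each OP is applied (e.g. 1 => 101, then 101 => 104 as on the slide).
    static long catchUp(TalkClient client, long localRev) {
        List<Operation> ops;
        do {
            ops = client.fetchOperations(localRev, COUNT);
            for (Operation op : ops) {
                // ...apply op to the local DB here...
                localRev = Math.max(localRev, op.revision());
            }
        } while (ops.size() == COUNT); // a short page means we reached END
        return localRev;
    }
}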

Slide 16

LEGY (LINE Event GatewaY): LINE's in-house reverse proxy since 2012
> SPDY-based protocol for multiplexing
> Request routing and connection/session management
> Protocol conversion between SPDY and HTTP
Own optimizations for better messaging UX:
> long-polling
> LEGY encryption, header cache

Slide 17

fetchOperations via LEGY
The client speaks SPDY to LEGY; LEGY speaks HTTP to talk-server (reverse-proxy protocol conversion). The client sends fetchOperations(rev:1, count:100); LEGY bypasses the request to talk-server; talk-server gets the rev:2-101 OPs from storage and answers "200 OK"; LEGY bypasses the response back to the client.

Slide 18

Long-polling by LEGY
The client sends fetchOperations(rev:101, count:100). talk-server has no new OPs yet and answers "204 No Content" (empty OPs), so LEGY holds the request. When new OPs rev:102-104 come in, talk-server saves them and publishes to LEGY; LEGY retries the fetch, talk-server gets the rev:102-104 OPs, and the client receives "200 OK".
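
A minimal sketch of that hold-and-retry behavior, with hypothetical interfaces for the upstream call and the talk-server publish channel (LEGY itself is in-house, so this is purely illustrative):

import java.util.concurrent.BlockingQueue;

// Illustrative gateway-side long-polling: hold the client's request while
// talk-server has nothing, retry when talk-server publishes new OPs.
class LongPollSketch {
    interface Upstream { Response fetchOperations(long rev, int count); }
    record Response(int status, byte[] body) {}

    static Response longPoll(Upstream talkServer, BlockingQueue<Long> newOpSignal,
                             long rev, int count) throws InterruptedException {
        Response r = talkServer.fetchOperations(rev, count);
        while (r.status() == 204) {        // "204 No Content": empty OPs
            newOpSignal.take();            // block until new OPs are published
            r = talkServer.fetchOperations(rev, count); // now "200 OK" with OPs
        }
        return r;
    }
}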

Slide 19

Agenda > LINT background > Event delivery core > LINT Epics • HTTP/2 and Push • Event delivery mechanism

Slide 20

2 issues of LEGY
> Does not use a standard protocol
> Redundant accesses due to the polling style

Slide 21

SPDY to HTTP/2
Outdated SPDY:
> No standard client library; in-house maintenance per kind of device
> Complex own optimizations without full documentation
Shift to the HTTP/2 standard:
> Standard client libraries (NSURLSession / okhttp) • with OS-native features like MPTCP, TLS 1.3, and metrics
> Enables replacing own optimizations with standards • LEGY header cache => header compression in HTTP/2 • LEGY encryption => 0-RTT by TLS 1.3
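
On the client side, the standard library does the heavy lifting; for example, a minimal okhttp configuration negotiates HTTP/2 over TLS ALPN (the endpoint URL is a placeholder):

import java.util.Arrays;
import okhttp3.OkHttpClient;
import okhttp3.Protocol;
import okhttp3.Request;
import okhttp3.Response;

class Http2ClientSketch {
    public static void main(String[] args) throws Exception {
        OkHttpClient client = new OkHttpClient.Builder()
                // Prefer h2; okhttp requires HTTP/1.1 in the list as a fallback.
                .protocols(Arrays.asList(Protocol.HTTP_2, Protocol.HTTP_1_1))
                .build();
        Request request = new Request.Builder()
                .url("https://gateway.example.com/ops") // placeholder endpoint
                .build();
        try (Response response = client.newCall(request).execute()) {
            System.out.println(response.protocol()); // prints "h2" when negotiated
        }
    }
}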

Slide 22

Seamless migration to HTTP/2: an abstraction layer switches between SPDY and HTTP/2
> Switch the target protocol according to LEGY connection info (connInfo: SPDY / connInfo: HTTP2)
> Control the protocol dynamically on the server side: App v1 speaks SPDY to LEGY; App v2, through the abstraction layer, upgrades to HTTP2 and can fall back to SPDY when a critical bug is found. A sketch of such a layer follows.
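
One way to realize the abstraction layer, sketched with hypothetical names (not LINE's actual client code): the transport is chosen from the server-controlled connInfo, so the server can upgrade or fall back at any time.

// Sketch of the SPDY <-> HTTP/2 abstraction layer; all names are hypothetical.
interface Transport {
    byte[] call(String method, byte[] request);
}

class SpdyTransport implements Transport {
    public byte[] call(String method, byte[] request) { /* legacy SPDY path */ return new byte[0]; }
}

class Http2Transport implements Transport {
    public byte[] call(String method, byte[] request) { /* standard HTTP/2 path */ return new byte[0]; }
}

class TransportSelector {
    // connInfo comes from LEGY, so the server can dynamically upgrade a client
    // to HTTP/2 or fall it back to SPDY when a critical bug is found.
    static Transport select(String connInfo) {
        return "HTTP2".equals(connInfo) ? new Http2Transport() : new SpdyTransport();
    }
}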

Slide 23

Long polling to push
With long polling (client, LEGY, talk-server), 33% of all fetchOperations responses are empty.
With push, the client signs on once ("200 OK"); whenever a new OP comes in, talk-server saves it and pushes it to the client through LEGY, instead of answering repeated fetchOperations calls.
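
From the client's point of view, push-style delivery might look like the following sketch: one sign-on, then OPs read off a long-lived stream. The length-prefixed framing here is an assumption, not LINE's actual wire format.

import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

// Client-side sketch: after a single sign-on, OPs arrive on a long-lived
// HTTP/2 stream instead of being polled for.
class PushConsumerSketch {
    static void consume(InputStream pushStream) throws IOException {
        DataInputStream in = new DataInputStream(pushStream);
        while (true) {
            int length = in.readInt();      // assumed frame header: payload size
            byte[] frame = new byte[length];
            in.readFully(frame);            // one pushed OP per frame
            // ...decode the OP and apply it to the local DB...
        }
    }
}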

Slide 24

Summary: HTTP/2 and Push
> SPDY to HTTP/2: a standardized protocol
> Streaming push over HTTP/2 with push-style fetchOperations: stop long-polling and save redundant requests (on-going)

Slide 25

Agenda > LINT background > Event delivery core > LINT Epics • HTTP/2 and Push • Event delivery mechanism

Slide 26

2 issues of fetchOperations
Inefficient for inactive apps:
> Must fetch Operations sequentially
> Storage management with too low cost-effectiveness: 226 TB, growing +3 TB/month, with only 0.005% usage
Not robust against partial data loss:
> No way to recover an inconsistency, and no way to see the inconsistency load
> Needs complex workarounds on talk-server (e.g. around JOIN_GROUP OP, RECV_MSG OP)

Slide 27

An app that sleeps for a long time (Zzz…) wakes up and must call fetchOperations over and over; the user cannot chat until the catch-up completes and the chat finally resumes.

Slide 28

Another way := Snapshot APIs
Snapshot APIs provide the latest snapshot per category.
Mutations from the app (sendMessage, addFriend, createGroup, …) are stored in 2 kinds of ways: Operation storage, exposed through the fetchOperations API, and per-category storages (MessageBox, UserSettings, SocialGraph, Group, …), exposed through the Snapshot APIs. The plan is to utilize the latter more.

Slide 29

3 sync mechanisms with Snapshot APIs, addressing the inefficiency for inactive apps and the lack of robustness against partial data loss:
> 1. FullSync
> 2. Auto Repair
> 3. Manual Repair

Slide 30

1. FullSync mechanism
Trigger per-category snapshot API calls under client/server-side conditions.
Conditions:
> 1. Revision gap
> 2. Revision hole
> 3. Initialization
> 4. Client local DB corruption

Slide 31

sync() API: covers both fetchOperations and FullSync. The client calls the API with its local revision, and talk-server decides by its conditions (a dispatch sketch follows below).
fetchOperations (incremental):
> Quick sync for active clients
> Lightweight IO/network costs
FullSync (batch):
> Efficient sync for inactive clients
> On-demand partial sync
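
A server-side sketch of that decision, with assumed names, threshold, and response shapes; the following slides pin down the conditions but not the exact cutoff.

// Illustrative sync() dispatch on talk-server; threshold and types are assumptions.
class SyncDispatchSketch {
    static final long GAP_THRESHOLD = 10_000; // assumed cutoff for triggering FullSync

    sealed interface SyncResponse permits Ops, FullSyncTrigger {}
    record Ops(long fromRev, long toRev) implements SyncResponse {}
    record FullSyncTrigger(long serverRevision) implements SyncResponse {}

    static SyncResponse sync(long clientRev, long serverRev, boolean revisionHole) {
        if (revisionHole || serverRev - clientRev > GAP_THRESHOLD) {
            // Batch path: tell the client to FullSync up to the current revision.
            return new FullSyncTrigger(serverRev);
        }
        // Incremental path: ordinary fetchOperations, 100 OPs per page.
        return new Ops(clientRev, Math.min(clientRev + 100, serverRev));
    }
}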

Slide 32

Active clients fetch Operations like before. With the server at revision N: a client at rev N-10 catches up with one fetchOperations call; a client at rev N-200, fetching 100 OPs per call, needs two calls.

Slide 33

Server conditions to trigger FullSync (server at revision N):
> 1. Revision gap: an inactive client at rev N-100,000 would need ~1,000 fetchOperations calls, versus 2 calls for a normal client at N-200.
> 2. Revision hole: part of the revision range was lost on the server (data lost).

Slide 34

Client conditions to trigger FullSync (server at revision N):
> 3. Initialization: e.g. a new LINE Desktop registration starts from rev 1, far behind a normal client at N-200 (2 calls).
> 4. Client local DB corruption: the number of fetchOperations calls needed is unknowable (x???).

Slide 35

FullSync flow (1/4)
1. The client (rev = 100) calls the sync() API and gets a FullSync trigger due to REVISION_GAP, carrying "server revision = 100,100".

Slide 36

FullSync flow (2/4)
2. The client calls each Snapshot API (MessageBox, UserSettings, SocialGraph, Group storages, …) via talk-server; meanwhile the server revision advances from 100,100 to 100,200.

Slide 37

FullSync flow (3/4)
3. The client bumps its local revision up to the server revision given by the trigger: rev = 100 => 100,100 (the server is now at 100,200).

Slide 38

FullSync flow (4/4)
4. The client calls sync() again and gets the Operations for rev 100,100 - 100,200, resuming ordinary fetchOperations; the client revision ends at 100,200.
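
Putting the four steps together, a client-side sketch with hypothetical names (the set of Snapshot APIs is abbreviated):

// Client-side sketch of the FullSync flow; all names are hypothetical.
class FullSyncFlowSketch {
    interface TalkServer {
        void fetchMessageBoxSnapshot();
        void fetchSocialGraphSnapshot();
        void fetchGroupSnapshot();
        void fetchUserSettingsSnapshot();
    }

    // Called after step 1 returned a FullSync trigger with serverRevision=100,100.
    static long fullSync(TalkServer server, long triggerServerRevision) {
        // 2. call each per-category Snapshot API
        server.fetchMessageBoxSnapshot();
        server.fetchSocialGraphSnapshot();
        server.fetchGroupSnapshot();
        server.fetchUserSettingsSnapshot();
        // 3. bump the local revision to the revision from the trigger (100,100),
        //    not the latest server revision (100,200), so OPs that arrived while
        //    the snapshots were being taken are still fetched in step 4.
        long localRev = triggerServerRevision;
        // 4. the caller invokes sync() again and resumes ordinary fetchOperations
        //    (rev 100,100 - 100,200 in the slides' example).
        return localRev;
    }
}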

Slide 39

Plus, a Repair mechanism
It is hard to maintain 100% consistency at LINE scale: there are many data and indexes across storages, availability is put before consistency, and unexpected mistakes and code bugs can always occur anywhere.
Resiliency enhancement:
> AS-IS: ad-hoc recovery on talk-server after a customer-support inquiry comes in
> TO-BE: a Repair mechanism that satisfies eventual consistency

Slide 40

2. Auto Repair mechanism
Periodic, coarse-grained, multi-tiered recovery: exchange a digest between client and server, with dynamic period control based on data granularity & load.
> Tier 1: O(1) data like Profile, Settings (small load), cycle: per 1 day
> Tier 2: digest of O(N) data like friends/groups (medium load), cycle: per 1 week
> Tier 3: digest of O(N×M) data like the number of members per group (large load), cycle: per 2 weeks
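
The schedule above could be captured as simple configuration; a sketch with the cycles as stated on the slide (the tier names are hypothetical):

import java.time.Duration;

// Sketch of the multi-tiered repair schedule; tier names are hypothetical.
enum RepairTier {
    TIER1_CONSTANT_DATA(Duration.ofDays(1)),     // O(1) data: Profile, Settings
    TIER2_LINEAR_DIGEST(Duration.ofDays(7)),     // digest of O(N): friends/groups
    TIER3_QUADRATIC_DIGEST(Duration.ofDays(14)); // digest of O(N×M): members per group

    final Duration cycle;
    RepairTier(Duration cycle) { this.cycle = cycle; }
}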

Slide 41

Example: repairing the friend list. Server: friends = {Brown, Cony}; client: friends = {Brown, Cony}. Both sides agree.

Slide 42

Failed to notify a block status: a block is applied on the server, but the notification to the client fails due to an internal error, so the client's friend list no longer matches the server's.

Slide 43

Wrong sync caused by a collision: a concurrent update to Sally collides with an in-flight sync, leaving server and client with different states for her across friends and recommendations.

Slide 44

Periodic API call to exchange a data digest:
1. The client calls the getRepairElements API weekly with its local state: numFriends = 2, numBlocked = 0, numRecommendation = 1.
2. The server compares this with its own state: numFriends = 2, numBlocked = 1, numRecommendation = 0.
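
A sketch of the comparison using the digest values from this slide; the record shape is an assumption, and the real digest presumably has more elements.

// Server-side sketch of the digest comparison; types are illustrative.
class RepairDigestSketch {
    record ContactDigest(int numFriends, int numBlocked, int numRecommendation) {}

    // Step 2: any mismatch between client and server digests means the client
    // should be told to run a Contact sync (steps 3-6 on the following slides).
    static boolean needsRepair(ContactDigest client, ContactDigest server) {
        return !client.equals(server);
    }

    public static void main(String[] args) {
        ContactDigest client = new ContactDigest(2, 0, 1); // sent via getRepairElements
        ContactDigest server = new ContactDigest(2, 1, 0);
        System.out.println(needsRepair(client, server));   // true => trigger repair
    }
}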

Slide 45

Slide 45 text

Response and result reporting:
3. The server responds, triggering a Contact sync on the client.
4. The result is reported to the Data platform for monitoring, stats, and analysis.

Slide 46

Slide 46 text

Repaired:
5. The client calls the getContactList API (snapshot).
6. The server responds with the correct data; both sides now agree (friends = Brown, Cony, Sally).

Slide 47

Summary: event delivery mechanism
> fetchOperations + FullSync (batch), against the inefficiency for inactive apps
> Auto Repair + Manual Repair, against the lack of robustness to partial data loss

Slide 48

Summary
LINT (LINE Improvement for Next Ten Years)
> Big challenges against the technical debts that must be resolved for a platform serving approx. 200 million users
> LINT is a multi-team organization/project to empower the future messaging platform

Slide 49

Future works of LINT within the 202x years
> Support login/logout features
> Support multi-account features
> Support multi-device features
> Bi-directional social-graph model
> Social-graph redesign to support more features
> More flexible message metadata
> Idempotent event delivery
> Migration to async/non-blocking processing
> Lift various system limitations
> Flexible fan-out/fan-in async mechanism
> Break the monolithic talk-server into microservices (MSA)
> Multi-IDC-aware 0-downtime reliable data store
> Multi-IDC-aware messaging protocol renewal
> Bot broadcast/multicast architecture renewal
> and more technical challenges…

Slide 50

Thank You

Slide 51

Appendix: 3. Auth token renewal
Issues:
> No way to manage unused Auth Tokens
> No way to manage multiple accounts/devices efficiently
Objective:
> Enable invalidating the Auth tokens of inactive/abnormal accounts
> Enable renewing Auth tokens for inactive accounts securely
(The Auth server doesn't know the token usage of an app that sleeps (Zzz…) and later wakes up, yet it can still renew the token.)

Slide 52

Appendix: 4. General setting storage
Issues:
> Many kinds of client-local data (local theme, per-chat pin, options for A/B tests, etc.) are required for the multi-device/multi-account feature
> Server settings sit in a legacy in-memory store on a Redis cluster (space-bounded), accessed via talk-server
> No proper storage to maintain such data flexibly
Objective:
> A flexible setting storage & server that stores local/server data per account/device as an isolated microservice
> Enable utilizing client/server-integrated data via pipelining
> Enable analyzing data across client/server on the Data platform