LINT (LINE Improvement for Next Ten years)

Shunsuke Nakamura
LINE LINT TF Engineering manager & Software engineer
https://linedevday.linecorp.com/jp/2019/sessions/A1-2

LINE DevDay 2019

November 20, 2019

Transcript

  1. 2019 DevDay > LINT (LINE Improvement for Next Ten Years)

    Shunsuke Nakamura > LINE LINT TF Engineering manager & Software engineer
  2. Agenda > LINT background > Event delivery core > LINT Epics

    • HTTP/2 and Push • Event delivery mechanism
  3. Agenda > LINT background > Event delivery core > LINT Epics

    • HTTP/2 and Push • Event delivery mechanism
  4. LINE started as a service in 2011 > A few kinds of servers

    > iPhone / Android > ~1 million users > messaging service and other services > clusters of ~10s of nodes
  5. Enlarging: messaging service → platform > Many kinds of servers

    > iPhone / Android / Desktop / Chrome / Watch / iPad and other 2nd-3rd parties > approx. 200 million users > messaging platform > reliability & flexibility > clusters of ~1000 nodes
  6. Platform over technical debts > Higher reliability: ensure event delivery with robustness and resiliency > More flexibility: on-demand data delivery

    Example inconsistency: UserA blocks UserB yet still receives a message from UserB ("Blocked UserB though?" / "No unread message though?")
  7. Known debts, but could not focus

    > Android client dev / iPhone client dev / front-end server dev / back-end server dev / planner / QA
  8. LINT (LINE Improvement for Next Ten years) > Gather the BIG issues that prevent future growth > Perform systematically toward a common goal

    A task force across multiple teams: the LINT TF was born against big technical debts > Technical PM / iOS / Android / Desktop / Chrome / front-end server / back-end server / auth server / storage server / QA
  9. LINT Epics in 2019 > Picked the highest-priority ones from many Epics:

    > 1. HTTP/2 and Push > 2. Event delivery mechanism > 3. Authentication token renewal > 4. General setting storage for client/server
  10. Talks 1 and 2 in LINE DEVELOPER DAY 2019:

    > 1. HTTP/2 and Push > 2. Event delivery mechanism > 3. Authentication token renewal > 4. General setting storage for client/server
  11. Agenda > LINT background > Event delivery core > LINT Epics

    • HTTP/2 and Push • Event delivery mechanism
  12. Components of the event delivery core

    > LINE client > L4 load balancer + LEGY (front-end) > talk-server (back-end) > storages > collaboration services
  13. Event := “Operation” (OP) > clients / server / storages

    > The server stores OPs per account; clients fetch OPs per device

    struct Operation {
      1: i64 revision,    // 1, 2, 3, … per account
      2: OpType type,     // enum of Message, Group, Contact, …
      3: Payload payload, // message, messageId, targetUserId, chatId, …
    }
  14. The LINE core process is Operation delivery

    > OP for Messaging: sender → receivers (sendMessage: SEND_MSG OP / RECV_MSG OP; mark-as-read: SEND_CHAT_CHECKED OP / NOTIFIED_READ_MSG OP) > OP for Contact (social-graph): request user → the followers > OP for LINE Group: request user → members of the group
  15. fetchOperations API

    List<Operation> fetchOperations (i64 clientRev, OpCount count);

    OP001 OP002 … OP099 OP100 | OP101 OP102 OP103 OP104 | END > fetchOperations(rev:1, count:100) ⇒ localRev: 1 → 101 > fetchOperations(rev:101, count:100) ⇒ localRev: 101 → 104
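The incremental fetch on slide 15 can be sketched as a client loop that keeps calling until a short batch signals the END of the queue (a minimal sketch with hypothetical names; the real API is the Thrift fetchOperations above):

```python
# Hypothetical sketch of the client-side incremental fetch loop.
# `fetch_operations(rev, count)` stands in for the Thrift call and is
# assumed to return OPs with revision > rev, at most `count` of them.

def fetch_all_operations(fetch_operations, local_rev, count=100):
    """Fetch OPs after local_rev until the server has nothing newer."""
    ops = []
    while True:
        batch = fetch_operations(local_rev, count)
        if not batch:
            break
        ops.extend(batch)
        local_rev = batch[-1]["revision"]  # bump to the last seen revision
        if len(batch) < count:             # short batch == END of the queue
            break
    return ops, local_rev
```

With the slide's numbers (server at rev 104, client at rev 1, count 100), this makes two calls: 1 → 101, then 101 → 104.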
  16. LEGY (LINE Event GatewaY) > In-house reverse proxy since 2012

    > SPDY-based protocol for multiplexing > Request routing and connection/session management > Protocol conversion between SPDY and HTTP > Own optimizations for better messaging UX: long-polling, LEGY encryption, header cache
  17. fetchOperations via LEGY (client → LEGY → talk-server)

    > Client sends fetchOperations(rev:1, count:100) > LEGY bypasses the request (reverse proxy with SPDY ↔ HTTP protocol conversion) > talk-server gets the rev:2-101 OPs from storage > LEGY bypasses the "200 OK" response
  18. Long-polling by LEGY (client → LEGY → talk-server)

    > Client sends fetchOperations(rev:101, count:100) > talk-server returns "204 No Content" with empty OPs, so LEGY holds the connection > When new OPs rev:102-104 come in, talk-server saves and publishes them > LEGY gets the rev:102-104 OPs and responds "200 OK"
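The hold-then-respond behavior above can be sketched with a condition variable: the proxy parks an up-to-date request until either a new OP is published or the poll times out (a simplified single-process sketch with hypothetical names; the real LEGY is a distributed reverse proxy):

```python
import threading

class LongPollHub:
    """Parks fetch requests until new OPs arrive or the poll times out."""

    def __init__(self):
        self._cond = threading.Condition()
        self._latest_rev = 0

    def publish(self, revision):
        # talk-server saved new OPs up to `revision`: wake all waiters.
        with self._cond:
            self._latest_rev = revision
            self._cond.notify_all()

    def wait_for_ops(self, client_rev, timeout=25.0):
        # Hold the request open while the client is already up to date.
        with self._cond:
            self._cond.wait_for(lambda: self._latest_rev > client_rev,
                                timeout=timeout)
            if self._latest_rev > client_rev:
                return "200 OK"       # new OPs: respond with them
            return "204 No Content"   # timed out with nothing new
```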
  19. Agenda > LINT background > Event delivery core > LINT Epics

    • HTTP/2 and Push • Event delivery mechanism
  20. Two issues of LEGY

    > Does not use a standard protocol > Redundant inter-access due to the polling style
  21. SPDY to HTTP/2 > Shift from the outdated in-house SPDY to the HTTP/2 standard

    Outdated SPDY: > No standard client library (in-house maintenance per kind of device) > Complex own optimizations without full documentation

    HTTP/2: > Standard client libraries (NSURLSession / okhttp), with OS-native features like MPTCP, TLS 1.3, metrics > Enables replacing own optimizations with standards: LEGY header cache ⇒ header compression in HTTP/2; LEGY encryption ⇒ 0-RTT by TLS 1.3
  22. Seamless migration to HTTP/2 > An abstract layer to switch SPDY ↔ HTTP/2

    > Switch the target protocol according to LEGY connection info (connInfo: SPDY / connInfo: HTTP2) > Control the protocol on the server side dynamically: App v1 (SPDY) upgrades to App v2 (HTTP2); on a critical bug, App v2's abstract layer falls back from HTTP2 to SPDY
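The client-side abstract layer might look like this (a hedged sketch; the class and the connInfo strings are hypothetical stand-ins for the real LEGY connection info):

```python
# Sketch of the protocol-switching abstract layer: the server-provided
# connection info picks the transport, so a critical HTTP/2 bug can be
# mitigated by flipping the server config back to SPDY, with no app update.

class Connection:
    def __init__(self, fetch_conn_info, transports):
        self._fetch_conn_info = fetch_conn_info  # e.g. returns "SPDY" or "HTTP2"
        self._transports = transports            # protocol name -> transport

    def connect(self):
        proto = self._fetch_conn_info()
        if proto not in self._transports:
            proto = "SPDY"  # conservative fallback for unknown values
        return self._transports[proto]
```

The key design point from the slide is that the decision lives on the server side: the client only honors whatever connInfo says.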
  23. Long-polling to push

    > Long-polling: 33% of all fetchOperations responses are empty > Push: the client signs on once ("200 OK"); whenever a new OP comes in and is saved, talk-server pushes the fetchOperations result through LEGY
  24. Summary: HTTP/2 and Push (on-going)

    > SPDY to HTTP2: standardized protocol > Streaming push over HTTP2, push-style fetchOperations, stop long-polling: save redundant requests
  25. Agenda > LINT background > Event delivery core > LINT Epics

    • HTTP/2 and Push • Event delivery mechanism
  26. Two issues of fetchOperations

    > Inefficient for inactive apps: must fetch Operations sequentially; too low cost-effectiveness of storage management (226 TB, +3 TB/month, 0.005% usage) > Not robust against partial data loss: no way to recover from inconsistency, no way to see the inconsistency load, needs complex workarounds on talk-server (e.g. JOIN_GROUP OP, RECV_MSG OP)
  27. Sleep for a long time (Zzz…) > wake up > cannot chat until completion: fetchOperations, fetchOperations, fetchOperations… > finally resume a chat!
  28. Another way := Snapshot APIs > Snapshot APIs provide the latest snapshot per category

    > App mutations (sendMessage, addFriend, createGroup, …) are stored in two kinds of ways: Operation Storage (served by the fetchOperations API) and per-category storages (served by Snapshot APIs): MessageBox, UserSettings, SocialGraph, Group, … > Utilize the latter more
  29. Three sync mechanisms with Snapshot APIs

    > 1. FullSync (for the inefficiency with inactive apps) > 2. Auto Repair > 3. Manual Repair (both for robustness against partial data loss)
  30. 1. FullSync mechanism > Trigger per-category Snapshot API calls under client/server-side conditions

    Conditions: > 1. Revision gap > 2. Revision hole > 3. Initialization > 4. Client local DB corruption
  31. sync() API: covers fetchOperations and FullSync > client calls talk-server with its local revision; the server decides by conditions

    > fetchOperations (incremental): quick sync for active clients, lightweight IO/network costs > FullSync (batch): efficient sync for inactive clients, on-demand partial sync
  32. Active clients fetch Operations like before (fetch per 100 OPs)

    > Client at rev:N-10, server at rev:N ⇒ fetchOperations calls x1 > Client at rev:N-200 ⇒ fetchOperations calls x2
  33. Server conditions to trigger FullSync (server revision N)

    > Active client at rev:N-200: normal fetchOperations, calls x2 > 1. Revision gap: inactive client at rev:N-100,000 would need calls x1000 > 2. Revision hole: data lost within the revision range
  34. Client conditions to trigger FullSync (server revision N)

    > Client at rev:N-200: normal fetchOperations, calls x2 > 3. Initialization: e.g. LINE Desktop registration starting from rev:1 > 4. Client local DB corruption: calls x???
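The four trigger conditions from slides 30-34 can be sketched as a server-side decision for a sync() call (hypothetical names; the gap threshold is an assumed tuning knob, chosen so that one FullSync beats hundreds of fetch calls):

```python
# Hypothetical threshold: beyond this many missing revisions, a batch
# FullSync is cheaper than sequential fetchOperations calls.
FULLSYNC_GAP_THRESHOLD = 10_000

def decide_sync(server_rev, client_rev, oldest_kept_rev,
                first_login=False, local_db_corrupt=False):
    """Return "FULL_SYNC" or "FETCH_OPS" for one sync() call."""
    if first_login or local_db_corrupt:     # 3. initialization / 4. DB corruption
        return "FULL_SYNC"
    if client_rev < oldest_kept_rev:        # 2. revision hole (OPs purged or lost)
        return "FULL_SYNC"
    if server_rev - client_rev > FULLSYNC_GAP_THRESHOLD:
        return "FULL_SYNC"                  # 1. revision gap (inactive client)
    return "FETCH_OPS"                      # active client: incremental fetch
```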
  35. FullSync flow (1/4)

    > 1. Client (rev = 100) calls the sync() API and gets a FullSync trigger with "server revision = 100,100" due to REVISION_GAP
  36. FullSync flow (2/4)

    > 2. Client (rev = 100) calls each Snapshot API (MessageBox / UserSettings / SocialGraph / Group storages); meanwhile the server revision advances from 100,100 to 100,200
  37. FullSync flow (3/4)

    > 3. Client bumps up its local revision to the given server revision: rev = 100 ⇒ 100,100 (server revision meanwhile: 100,200)
  38. FullSync flow (4/4)

    > 4. Client calls sync() again and gets the rev:100,100-100,200 Operations (resumed fetchOperations): client rev = 100,100 ⇒ 100,200
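Steps 1-4 above can be strung together as a client-side sketch (hypothetical method names standing in for sync() and the per-category Snapshot APIs):

```python
def run_sync(client, server):
    """One round of the sync() flow: fetch OPs, or FullSync then resume."""
    result = server.sync(client.rev)               # 1. sync() with local revision
    if result["trigger"] == "FULL_SYNC":
        snapshot_rev = result["server_rev"]        #    e.g. 100,100 on the slide
        for category in ("messagebox", "socialgraph", "group", "usersettings"):
            client.apply_snapshot(server.snapshot(category))  # 2. Snapshot APIs
        client.rev = snapshot_rev                  # 3. bump local revision
        result = server.sync(client.rev)           # 4. resume fetchOperations
    client.apply_ops(result["ops"])
    client.rev = result["server_rev"]
```

Note how the revision that kept advancing during the snapshot phase (100,100 → 100,200 on the slides) is simply picked up by the second sync() call.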
  39. Plus, a Repair mechanism > Hard to maintain 100% consistency at LINE scale

    > Many data / indexes on talk-server storages > Availability is put before consistency > Unexpected mistakes and code bugs can always occur anywhere > Resiliency enhancement: AS-IS: ad-hoc recovery after a CS (customer support) ticket comes in; TO-BE: a repair mechanism to satisfy eventual consistency
  40. 2. Auto Repair mechanism > Periodic, coarse-grained, multi-tiered recovery

    > Exchange digests between client and server > Dynamic period control based on data granularity & load: > Tier 1: O(1) data like Profile, Settings (small load, cycle per 1 day) > Tier 2: digest of O(N) data like friends/groups (medium load, cycle per 1 week) > Tier 3: digest of O(N×M) data like the number of members per group (large load, cycle per 2 weeks)
  41. Example: repair a friend list

    > Server: friends = [Brown, Cony] > Client: friends = [Brown, Cony]
  42. Failed to notify block status

    > Server: friends = [Brown, Cony], then Cony is blocked; the notification fails due to an internal error > Client: still friends = [Brown, Cony]
  43. Wrong sync by collision

    > Server: friends = [Brown, Cony], recommendations = [Sally] > Client: a collision between sync and a concurrent update leaves Sally in the wrong list (friends vs. recommendations)
  44. Periodic API call to exchange data digests

    > 1. Client calls the getRepairElements API weekly with its local state: numFriends = 2, numBlocked = 0, numRecommendation = 1 > 2. Server compares with its own state: numFriends = 2, numBlocked = 1, numRecommendation = 0
  45. Response and report the result

    > 3. Server responds to trigger a Contact sync > 4. Client reports the result to the Data platform for monitoring, stats and analysis
  46. Repaired

    > 5. Client calls the getContactList API (snapshot) > 6. Server responds with the correct data: friends = [Brown, Cony, Sally] on both sides
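The digest exchange in steps 1-2 can be sketched as a comparison of per-category counts (hypothetical function name; the real digests may carry more than counts):

```python
def repair_elements_to_sync(client_digest, server_digest):
    """Compare client and server digests (e.g. numFriends, numBlocked,
    numRecommendation) and return the elements that need a snapshot sync."""
    return sorted(key for key in server_digest
                  if client_digest.get(key) != server_digest[key])
```

With the slide's numbers, the mismatch on numBlocked and numRecommendation is what triggers the Contact sync in step 3.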
  47. Summary: event delivery mechanism

    > fetchOperations + FullSync (batch): fixes the inefficiency for inactive apps > Auto Repair + Manual Repair: fixes the lack of robustness against partial data loss
  48. Summary > LINT (LINE Improvement for Next Ten years)

    > Big challenges against technical debts, to be resolved for a platform with approx. 200 million users > LINT is a multi-team organization/project to empower the future messaging platform
  49. Future works of LINT within 202x

    > Support login-logout features > Support multi-account features > Support multi-device features > Bi-directional social-graph model > Social-graph redesign to support more features > More flexible message metadata > Idempotent event delivery > Migration to async/non-blocking processing > Release various system limitations > Flexible fan-out/fan-in async mechanism > Make the monolithic talk-server an MSA > Multi-IDC-aware 0-downtime reliable data store > Multi-IDC-aware messaging protocol renewal > Bot broadcast/multicast architecture renewal > and more technical challenges…
  50. Thank You

  51. Appendix: 3. Auth token renewal

    Issues: > No way to manage unused auth tokens ("don't know token usage…") > No way to manage multiple accounts/devices efficiently

    Objectives: > Enable invalidating inactive/abnormal accounts' auth tokens > Enable renewing auth tokens for inactive accounts securely (Zzz… wake up: auth server can renew the token)
  52. Appendix: 4. General setting storage

    Issues: > Many kinds of client-local data (local theme, per-chat pin, options for A/B tests, etc.) required for the multi-device/account feature > Server settings sit on a legacy, space-bounded in-memory store (a Redis cluster behind talk-server) > No proper storage to maintain such data flexibly

    Objectives: > A flexible setting storage & server to store local/server data per account/device as an isolated microservice > Enable utilizing client/server-integrated data via pipelining > Enable analyzing data across client/server on the Data platform