Pandemic response questionnaire for 25M users, made in 1 week

LINE DevDay 2020

November 26, 2020

Transcript

  1. None
  2. Self Introduction

    › Server-side software engineer of LINE Official Account Manager
    › Previously developed Smart Channel, LINE Ads, and LINE Points Ads
    › Joined LINE in 2015 as a new graduate
  3. National Survey for COVID-19

  4. National Survey for COVID-19

    › Conducted 5 surveys
    › Provided answer data to the Ministry of Health, Labor and Welfare
    › Surveyed physical condition and awareness of infection prevention
  5. Numbers of the Survey (as of March 2020)

    › Message recipients: 83M
    › Answers: 25M
    › Preparation days: 6
  6. Background

    › In March, the number of positives was increasing in Japan
    › Data was needed
    [Chart: number of positives in Japan, 2/1 to 4/4. Source: Open Data by the Ministry of Health, Labor and Welfare, https://www.mhlw.go.jp/stf/covid-19/open-data.html]
    › March 25: The project was started
    › Asked to stay home on weekends in Tokyo
    › April 7: The State of Emergency was declared
  7. What Can We Do?

    › Collect a huge amount of data in a short period
    › Take the initiative in the survey
  8. Options

    › Collaborate with other companies: needs a lot of time to communicate with partners
    › Use our existing service (LINE Research): does not meet the purpose
    › Develop a new system from scratch: can be designed specifically for the survey
  9. Timeline

    › Day 1: Start project
    › Day 2: Start development
    › Day 4: Finish development (only 3 days)
    › Day 5: Start QA
    › Day 6: Finish QA
    › Day 7: Release
  10. Development Team

    › Core team: server-side, front-end, planner
    › Supported by: infrastructure engineers, security engineers, data scientists, DBAs, engineers of collaborating services, and more
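  11. Related Articles

    › Front-end Development: "How do you develop a questionnaire for 83 million LINE users in one week? LINE's measures during the COVID-19 pandemic and front-end development" https://logmi.jp/tech/articles/322999
    › Product Management: "Six days to release: behind the scenes of LINE's 'National Survey for COVID-19 Countermeasures'" https://careerhack.en-japan.com/report/detail/1422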
  12. Development Challenges

    › Develop rapidly
    › Handle high traffic
    › Never stop the system due to trouble
  13. Policy to Develop in 3 Days

    › Develop a simple system with the minimum specification
    › Develop the system step by step
    › Avoid system troubles by design
  14. First Step: Survey Page / Answer Store

  15. User Experience

    › Message: Chat Room (Flex Message)
    › Survey Page and Thanks Page: In-App Browser (LIFF Platform)
  16. First Idea (Not Adopted)

    [Diagram: Client, LB, App Server (Web App), Master Data, Answer Store; the Web App serves the survey page and receives the answers sent by the client]
    › Concerns
    › Time to implement
    › Stability under high traffic
    › Reduce the amount of code
    › Reduce time to implement
    › Reduce unexpected behaviors
  17. First Step: Survey Page / Answer Store

    [Diagram: Client, LB, App Server (nginx), Answer Store]
    › Client: Open Survey Page
    › nginx: Serve Survey Page (Static File)
  18. Sending Answers

    › Flow: Message, Answer Page, Thanks Page
    › Entry Event: Answer of First Question
    › Submit Event: All Answers
  19. First Step: Survey Page / Answer Store

    [Diagram: Client, LB, App Server (nginx), Answer Store (access log)]
    › Open Survey Page
    › Send Answers
    › Store Answers: the nginx access log is the Answer Store
  20. Access Log Format: Answer JSON as a Field in LTSV

    Nginx Config:

    log_format main "request_id:$request_id"
                    "\t" "remote_addr:$remote_addr"
                    "\t" "real_ip:$http_x_true_userip"
                    "\t" "msec:$msec"
                    "\t" "server_protocol:$server_protocol"
                    "\t" "method:$request_method"
                    "\t" "scheme:$scheme"
                    "\t" "host:$host"
                    "\t" "path:$request_uri"
                    "\t" "status:$status"
                    ...
                    "\t" "request_body:$request_body";

    Actual Log Entry (fields are tab-separated, shown here one per line):

    request_id:b58cbf9c66a55a4a7b79e7f18906badd
    remote_addr:…
    real_ip:…
    msec:1585441263.563
    server_protocol:HTTP/1.1
    method:POST
    scheme:https
    host:covid19.line-apps.com
    path:/api/survey
    status:200
    …
    request_body:{\x22userId\x22:\x22ue216a4b17f4946f19e8f47889830f275\x22,…,\x22body\x22:{\x22a4\x22:\x221\x22,\x22a5\x22:\x223\x22,\x22a6\x22:[\x221\x22],\x22a7\x22:[\x222\x22],…}}
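To illustrate how an answer could be recovered from such a log entry, here is a minimal Python sketch. It assumes the `\xHH` sequences come from nginx's default log escaping, and it uses a shortened, made-up sample line (the `u-example` user ID and the helper names `parse_ltsv` and `unescape_nginx` are hypothetical, not taken from the talk).

```python
import json
import re

def parse_ltsv(line: str) -> dict:
    """Split one LTSV line into a dict of label -> value."""
    record = {}
    for field in line.rstrip("\n").split("\t"):
        # Only the first colon separates label and value; the value may contain colons.
        label, _, value = field.partition(":")
        record[label] = value
    return record

def unescape_nginx(value: str) -> str:
    """Turn \\xHH escape sequences back into the bytes they stand for."""
    raw = re.sub(
        rb"\\x([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        value.encode("utf-8"),
    )
    return raw.decode("utf-8")

# Shortened, made-up log entry in the same shape as the one on the slide.
sample = "\t".join([
    "method:POST",
    "path:/api/survey",
    "status:200",
    r"request_body:{\x22userId\x22:\x22u-example\x22,\x22body\x22:{\x22a4\x22:\x221\x22,\x22a6\x22:[\x221\x22]}}",
])

record = parse_ltsv(sample)
answer = json.loads(unescape_nginx(record["request_body"]))
print(answer["body"])
```

Working on bytes rather than characters keeps multi-byte UTF-8 sequences intact when they were escaped byte by byte.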
  21. Second Step: Aggregate Answer Logs

  22. Need for Log Aggregation

    [Diagram: access logs on the App Servers are aggregated into the Answer Store]
    › Original data for the final result
    › Monitoring
    › Maximize the number of answers
  23. Second Step: Aggregate Answer Logs

    [Diagram: Client, LB, App Server (nginx, access log, fluentd), MySQL]
    › Transfer access logs every second (asynchronously)
  24. Pipeline in fluentd

    › Source, @type tail: read access logs and parse LTSV
    › Filter, @type record_transformer: unescape request_body
    › Filter, @type parser: parse request_body
    › Match, @type mysql_bulk: insert access logs into MySQL
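The four stages can be mimicked in plain Python to show the data flow. This is only an analogy under stated assumptions, not the fluentd configuration used in the talk: the stage functions are hypothetical, the `answers` table layout is invented, and an in-memory SQLite database stands in for MySQL.

```python
import json
import re
import sqlite3

HEX_ESCAPE = re.compile(rb"\\x([0-9A-Fa-f]{2})")
INSERT = "INSERT INTO answers (request_id, user_id, body) VALUES (?, ?, ?)"

def tail_source(lines):
    """Source stage: turn each LTSV line into a dict of label -> value."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield dict(field.partition(":")[::2] for field in fields)

def unescape_request_body(records):
    """First filter stage: undo the \\xHH escaping of request_body."""
    for record in records:
        raw = HEX_ESCAPE.sub(
            lambda m: bytes([int(m.group(1), 16)]),
            record["request_body"].encode("utf-8"),
        )
        yield {**record, "request_body": raw.decode("utf-8")}

def parse_request_body(records):
    """Second filter stage: parse request_body as JSON."""
    for record in records:
        yield {**record, "answer": json.loads(record["request_body"])}

def bulk_insert(conn, records, batch_size=1000):
    """Match stage: write records in batches and return the number inserted."""
    rows, inserted = [], 0
    for record in records:
        answer = record["answer"]
        rows.append((record["request_id"], answer["userId"], json.dumps(answer["body"])))
        if len(rows) == batch_size:
            conn.executemany(INSERT, rows)
            inserted += len(rows)
            rows = []
    if rows:  # flush the last, partially filled batch
        conn.executemany(INSERT, rows)
        inserted += len(rows)
    conn.commit()
    return inserted

conn = sqlite3.connect(":memory:")  # stand-in for MySQL
conn.execute("CREATE TABLE answers (request_id TEXT PRIMARY KEY, user_id TEXT, body TEXT)")

# Made-up log lines in the LTSV shape shown earlier in the deck.
log_lines = [
    "request_id:r1\t" r"request_body:{\x22userId\x22:\x22u-1\x22,\x22body\x22:{\x22a4\x22:\x221\x22}}",
    "request_id:r2\t" r"request_body:{\x22userId\x22:\x22u-2\x22,\x22body\x22:{\x22a4\x22:\x222\x22}}",
]

pipeline = parse_request_body(unescape_request_body(tail_source(log_lines)))
inserted = bulk_insert(conn, pipeline, batch_size=1000)
print(inserted)
```

Each stage is a generator, so records stream through one at a time and only the insert stage buffers a batch.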
  25. Third Step: Verify Answer Logs

  26. Need for Log Verification

    › (Unverified) Answer Store: aggregated data
    › Filter unauthorized answer logs
    › Filter duplicate answers
    › Verified Answer Store: the original data of the final result
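The two filters can be sketched as a single pass over the unverified rows. The row layout, the `is_authorized` callback, and the rule of keeping the first answer per user are assumptions made for illustration; the talk does not say which duplicate is kept.

```python
def filter_answers(rows, is_authorized):
    """Drop unauthorized rows, then keep only the first answer per user."""
    seen_users = set()
    verified = []
    for row in rows:
        if not is_authorized(row):
            continue  # unauthorized answer log
        if row["user_id"] in seen_users:
            continue  # duplicate answer from the same user
        seen_users.add(row["user_id"])
        verified.append(row)
    return verified

# Made-up rows; "token_ok" stands in for the result of the ID token check.
unverified = [
    {"user_id": "u-1", "token_ok": True, "body": {"a4": "1"}},
    {"user_id": "u-2", "token_ok": False, "body": {"a4": "2"}},
    {"user_id": "u-1", "token_ok": True, "body": {"a4": "3"}},
    {"user_id": "u-3", "token_ok": True, "body": {"a4": "1"}},
]

verified = filter_answers(unverified, is_authorized=lambda row: row["token_ok"])
print([row["user_id"] for row in verified])
```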
  27. Third Step: Verify Answer Logs

    [Diagram: Client, LINE Login, LB, App Server (nginx, access log, fluentd), MySQL, Batch Server (Verifier)]
    › LINE Login: Issue ID Token
    › Client: Send Answers with ID Token
  28. ID Token

    › LINE Login issues an ID Token containing: User Data, Expiration Date, …, Signature
    › The Client sends the ID Token to the Answer Store, and the Verifier verifies it
    › The login can be verified locally and asynchronously
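Local verification of a signed token can be sketched with the Python standard library. This sketch assumes a JWT-style token signed with HS256 and a shared secret, with `sub` and `exp` claims; the deck does not state the algorithm or the claim names, and `sign_token` exists only to produce a token for the demo.

```python
import base64
import hashlib
import hmac
import json
import time

def b64url_encode(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

def b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))

def sign_token(claims: dict, secret: bytes) -> str:
    """Demo helper: build an HS256-signed token (the issuer's job in practice)."""
    header = b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url_encode(json.dumps(claims).encode())
    signature = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url_encode(signature)}"

def verify_token(token: str, secret: bytes, now=None):
    """Return the claims if signature and expiration are valid, else None."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(b64url_decode(header_b64))
        claims = json.loads(b64url_decode(payload_b64))
        signature = b64url_decode(signature_b64)
    except ValueError:  # wrong number of parts, bad base64, or bad JSON
        return None
    if not isinstance(header, dict) or not isinstance(claims, dict):
        return None
    if header.get("alg") != "HS256":
        return None
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return None
    current = time.time() if now is None else now
    if claims.get("exp", 0) <= current:
        return None  # expired, or no expiration claim at all
    return claims

secret = b"hypothetical-shared-secret"
token = sign_token({"sub": "u-example", "exp": 2000}, secret)
print(verify_token(token, secret, now=1000))
```

No network call is needed for the check, which is what allows a batch job to verify tokens after the answers have already been stored.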
  29. Third Step: Verify Answer Logs

    [Diagram: Client, LINE Login, LB, App Server (nginx, access log, fluentd), MySQL, Batch Server (Verifier)]
    › LINE Login: Issue ID Token
    › Client: Send Answers with ID Token
    › Verifier: 1. Fetch Access Logs 2. Verify ID Tokens 3. Write Results
  30. Preparing for the Survey: Delivering Messages / Performance Test / Monitoring
  31. Message Delivery Spikes

    [Charts: messages sent and answer rate over time]
    › Fast Delivery: higher peak traffic
    › Slow Delivery: lower peak traffic
  32. Pseudo Slow Delivery by Manual Control

    [Chart: messages sent over time, with the answer rate for pseudo slow delivery, fast delivery, and slow delivery, and a baseline]
    › Control the delivery pace
    › Reduced peak traffic
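Why pacing lowers the peak can be shown with a toy model: answer traffic is the delivery schedule convolved with a response profile. All numbers below are invented for illustration (the profile, the batch sizes, and the one-minute time step); none of them come from the survey.

```python
def answer_traffic(sent_per_minute, response_profile):
    """Answers per minute, given sends per minute and the fraction of
    recipients who answer k minutes after receiving the message."""
    traffic = [0.0] * (len(sent_per_minute) + len(response_profile) - 1)
    for t, sent in enumerate(sent_per_minute):
        for k, fraction in enumerate(response_profile):
            traffic[t + k] += sent * fraction
    return traffic

# Hypothetical response profile: 20% of recipients answer within four minutes.
profile = [0.10, 0.05, 0.03, 0.02]

# The same 1,000,000 messages, sent at once or spread over ten minutes.
fast = [1_000_000] + [0] * 9
paced = [100_000] * 10

peak_fast = max(answer_traffic(fast, profile))
peak_paced = max(answer_traffic(paced, profile))
print(peak_fast, peak_paced)
```

The total number of answers is the same in both schedules; only the peak changes.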
  33. Performance Test

    [Diagram: App Server (nginx, access log, fluentd), MySQL]
    › A single instance handles 1K answers / sec
    › Past campaigns reached 25K requests / sec
    › Prepare 50 instances for 50K (= 25K * 2) requests / sec
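The sizing arithmetic on this slide can be written out as a small calculation. The function name is hypothetical, and the sketch assumes that one answer corresponds to one request, since the slide mixes the two units.

```python
import math

def instances_needed(past_peak_rps: int, safety_factor: int, per_instance_rps: int) -> int:
    """Instances required to serve the past peak multiplied by a safety factor."""
    target_rps = past_peak_rps * safety_factor
    return math.ceil(target_rps / per_instance_rps)

# 25K requests / sec in past campaigns, doubled, at 1K per instance.
print(instances_needed(25_000, 2, 1_000))  # prints 50
```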
  34. Monitoring

    [Diagram: App Server, Batch Server (Verifier), Monitoring Server, MySQL (Replica), Notify Server]
    › Metrics: node_exporter, nginx_exporter, fluent-plugin-prometheus, Pushgateway
    › Monitoring: Prometheus, Alertmanager, Grafana, Monitoring App
    › Notification: Notify App, LINE Notify
    › Since the 4th Survey
  35. Monitoring Dashboard Grafana LINE Notify

  36. Delivery Result

  37. Changes in the Number of Events

    [Chart: event rate (events / sec), events, and answer events, 3/31 9:00 to 4/1 21:00]
    › 66M Events
    › 25M Answers
    › 5K Events / sec
  38. ID Token Verification Delay

    [Chart: event rate (events / sec), events, and delay, 3/31 09:00 to 4/1 00:00]
    › 3.5M Events Delay
  39. Retrospective

    › Develop a simple system with the minimum specification: ensured performance and stability
    › Develop the system step by step: felt comfortable with the development
    › Avoid system troubles by design: some background processing was delayed, but the delays did not affect users
  40. Conclusion

    › The scale of the survey was very large, but the system was very simple
    › Handled high traffic stably and was prepared in a short period
    › Conducted the national survey for COVID-19 with only 3 days of development
  41. Thank you