Pandemic response questionnaire for 25M users, made in 1 week

LINE DevDay 2020

November 26, 2020

Transcript

  1. Self Introduction
    › Server-side software engineer for LINE Official Account Manager
    › Previously developed Smart Channel, LINE Ads, and LINE Points Ads
    › Joined LINE in 2015 as a new graduate
  2. National Survey for COVID-19
    › Conducted the survey 5 times
    › Provided answer data to the Ministry of Health, Labor and Welfare
    › Surveyed users' physical condition and awareness of infection prevention
  3. Background
    › In March, the number of positives was increasing in Japan
    › Data was needed
    [Chart: Number of Positives, 2/1 to 4/4, y-axis 0 to 450. Source: Open Data by the Ministry of Health, Labor and Welfare, https://www.mhlw.go.jp/stf/covid-19/open-data.html]
    › March 25: The project was started; people in Tokyo were asked to stay home on weekends
    › April 7: The State of Emergency was declared
  4. What Can We Do?
    › Collect a huge amount of data in a short period
    › Take the initiative in the survey
  5. Options
    Collaborate with other companies
    › Need a lot of time to communicate with partners
    Develop a new system from scratch
    › Can design specifically for the survey
    Use our existing service (LINE Research)
    › Does not meet the purpose
  6. Timeline
    › Day 1: Start Project
    › Day 2: Start Development
    › Day 4: Finish Development
    › Day 5: Start QA
    › Day 6: Finish QA
    › Day 7: Release
    Development (Day 2 to Day 4): Only 3 Days
  7. Development Team
    Core Team: Server-side, Front-end, Planner
    › Supported by
    › Infrastructure engineers
    › Security engineers
    › Data scientists
    › DBAs
    › Engineers of collaborating services
    › And more...
  8. Policy to Develop in 3 Days
    › Develop a simple system with the minimum specification
    › Develop the system step by step
    › Avoid system trouble by design
  9. User Experience
    › Message: Chat Room (Flex Message)
    › Survey Page and Thanks Page: In-App Browser (LIFF Platform)
  10. First Idea (Not Adopted)
    [Diagram: Client, LB, App Server (Web App), Master Data, Answer Store; flows: Serve Survey Page, Send Answers]
    › Concerns
    › Time to implement
    › Stability under high traffic
    › Reduce the amount of code
    › Reduce implementation time
    › Reduce unexpected behaviors
  11. Survey Page / Answer Store (First Step)
    [Diagram: Client, LB, App Server (nginx), Answer Store]
    › Client: Open Survey Page
    › nginx: Serve Survey Page (Static File)
  12. Sending Answers
    [Diagram: Message, Answer Page, Thanks Page]
    › Entry Event: Answer of First Question
    › Submit Event: All Answers
  13. Survey Page / Answer Store (First Step)
    [Diagram: Client, LB, App Server (nginx), Answer Store (access log)]
    › Client: Open Survey Page, Send Answers
    › nginx: Store Answers in the access log
  14. Access Log Format
    Nginx Config:
        log_format main "request_id:$request_id"
            "\t" "remote_addr:$remote_addr"
            "\t" "real_ip:$http_x_true_userip"
            "\t" "msec:$msec"
            "\t" "server_protocol:$server_protocol"
            "\t" "method:$request_method"
            "\t" "scheme:$scheme"
            "\t" "host:$host"
            "\t" "path:$request_uri"
            "\t" "status:$status"
            ...
            "\t" "request_body:$request_body" ;
    Actual Log Entry:
        request_id:b58cbf9c66a55a4a7b79e7f18906badd remote_addr:… real_ip:… msec:1585441263.563
        server_protocol:HTTP/1.1 method:POST scheme:https host:covid19.line-apps.com path:/api/survey status:200 …
        request_body:{\x22userId\x22:\x22ue216a4b17f4946f19e8f47889830f275\x22,…,\x22body\x22:{\x22a4\x22:\x221\x22,\x22a5\x22:\x223\x22,\x22a6\x22:[\x221\x22],\x22a7\x22:[\x222\x22],…}}
    › Answer JSON as a Field in LTSV
  15. Need for Log Aggregation
    [Diagram: access logs on the App Servers are aggregated into the Answer Store]
    › Original data for the final result
    › Monitoring
    › Maximize the number of answers
  16. Aggregate Answer Logs (Second Step)
    [Diagram: Client, LB, App Server (nginx, access log, fluentd), MySQL]
    › Transfer access logs every second (asynchronously)
  17. Pipeline in fluentd
    › Source, @type tail: Read access logs and parse LTSV
    › Filter, @type record_transformer: Unescape request_body
    › Filter, @type parser: Parse request_body
    › Match, @type mysql_bulk: Insert access logs to MySQL
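The first three pipeline stages can be illustrated outside fluentd. The following is a minimal Python sketch, not the fluentd configuration used in the talk: it parses one LTSV line, reverses nginx's `\xXX` escaping of `request_body`, and parses the resulting JSON. The function names and the `u-example` user ID are hypothetical, and the sketch assumes nginx's default log escaping, under which quotes, backslashes, and non-ASCII bytes are written as `\xXX`.

```python
import json
import re

# Matches the \xXX sequences that nginx writes when escaping a logged variable.
_HEX_ESCAPE = re.compile(rb"\\x([0-9A-Fa-f]{2})")


def parse_ltsv(line):
    """Split one LTSV line into a label -> value dict (split on the first colon only)."""
    record = {}
    for field in line.rstrip("\n").split("\t"):
        label, sep, value = field.partition(":")
        if sep:
            record[label] = value
    return record


def unescape_nginx(value):
    """Turn \\xXX sequences back into bytes, then decode the result as UTF-8."""
    raw = _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), value.encode("ascii"))
    return raw.decode("utf-8")


def to_row(line):
    """Parse a log line and replace request_body with the decoded answer JSON."""
    record = parse_ltsv(line)
    record["request_body"] = json.loads(unescape_nginx(record["request_body"]))
    return record


# Hypothetical log line shaped like the entry on slide 14.
sample_line = "\t".join([
    "msec:1585441263.563",
    "method:POST",
    "path:/api/survey",
    "status:200",
    r"request_body:{\x22userId\x22:\x22u-example\x22,\x22body\x22:{\x22a4\x22:\x221\x22,\x22a6\x22:[\x221\x22]}}",
])
row = to_row(sample_line)
print(row["request_body"]["body"])
```

Because a literal backslash is itself logged as `\x5C`, a single left-to-right pass over the escapes is unambiguous.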
  18. Need for Log Verification
    [Diagram: (Unverified) Answer Store (aggregated data), filtered into the Verified Answer Store (the original data of the final result)]
    › Filter unauthorized answer logs
    › Filter duplicate answers
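The two filters can be sketched as a single pass over the aggregated logs. The slide does not say how a duplicate is defined, so this example assumes one answer per user and keeps the earliest authorized one; the `AnswerLog` type and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnswerLog:
    user_id: str      # hypothetical: user ID taken from the request body
    msec: float       # log timestamp, as in the msec field of the access log
    authorized: bool  # hypothetical: result of the ID token check


def verified_answers(logs):
    """Drop unauthorized logs, then keep the earliest log per user (assumed rule)."""
    first_by_user = {}
    for log in sorted(logs, key=lambda entry: entry.msec):
        if not log.authorized:
            continue
        # setdefault keeps the first (earliest) log seen for each user.
        first_by_user.setdefault(log.user_id, log)
    return list(first_by_user.values())


logs = [
    AnswerLog("u-a", 3.0, True),
    AnswerLog("u-a", 1.0, True),   # earliest answer from u-a: kept
    AnswerLog("u-b", 2.0, False),  # unauthorized: dropped
    AnswerLog("u-c", 4.0, True),
]
result = verified_answers(logs)
print([(log.user_id, log.msec) for log in result])
```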
  19. Verify Answer Logs (Third Step)
    [Diagram: Client, LB, App Server (nginx, access log, fluentd), MySQL, Batch Server (Verifier), LINE Login]
    › LINE Login: Issue ID Token
    › Client: Send Answers with ID Token
  20. ID Token
    [Diagram: LINE Login, Client, Answer Store, Verifier]
    › LINE Login: Issue ID Token
    › ID Token contents: User Data, Expiration Date, …, Signature
    › Verifier: Verify ID Token
    › The login can be verified locally and asynchronously
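Local verification works because the token carries its own signature and expiration date, so the verifier needs no call back to the issuer. The sketch below shows the idea for a JWT-style token signed with HMAC-SHA256. The slide does not state which algorithm or claims LINE Login uses, so HS256, the `sub` and `exp` claim names, the shared secret, and the helper `make_token` are all assumptions made for illustration.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url_encode(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(part: str) -> bytes:
    # Restore the padding that the compact token encoding strips.
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))


def make_token(payload: dict, secret: bytes) -> str:
    """Hypothetical issuer, used only to produce demo tokens for this sketch."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64url_encode(json.dumps(payload).encode())
    sig = hmac.new(secret, f"{header}.{body}".encode("ascii"), hashlib.sha256).digest()
    return f"{header}.{body}.{_b64url_encode(sig)}"


def verify_id_token(token: str, secret: bytes, now: Optional[float] = None) -> Optional[dict]:
    """Return the payload if the signature and expiration check out, else None."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        payload = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(signature_b64)
        signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
    except ValueError:  # malformed structure, base64, JSON, or encoding
        return None
    if not isinstance(header, dict) or header.get("alg") != "HS256":
        return None
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return None
    exp = payload.get("exp") if isinstance(payload, dict) else None
    current = time.time() if now is None else now
    if not isinstance(exp, (int, float)) or exp <= current:
        return None
    return payload


secret = b"demo-secret"  # hypothetical shared secret
token = make_token({"sub": "u-example", "exp": 2_000_000_000}, secret)
print(verify_id_token(token, secret, now=1_585_441_263))
```

Since the check uses only the token and a locally held key, it can run later in a batch over stored logs, which matches the asynchronous verification described on the slide.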
  21. Verify Answer Logs (Third Step)
    [Diagram: Client, LB, App Server (nginx, access log, fluentd), MySQL, Batch Server (Verifier), LINE Login]
    › LINE Login: Issue ID Token
    › Client: Send Answers with ID Token
    › Verifier: 1. Fetch Access Logs, 2. Verify ID Tokens, 3. Write Results
  22. Message Delivery Spikes
    [Two charts over Time, each showing Messages Sent and Answer Rate]
    › Fast Delivery: Higher Peak Traffic
    › Slow Delivery: Lower Peak Traffic
  23. Pseudo Slow Delivery by Manual Control
    [Chart over Time: Messages Sent, Answer Rate (Fast Delivery), Answer Rate (Slow Delivery), Answer Rate (Pseudo Slow Delivery), Baseline]
    › Control the Delivery Pace
    › Reduced Peak Traffic
  24. Performance Test
    [Diagram: App Server (nginx, access log, fluentd), MySQL]
    › A single instance handles 1K answers / sec
    › Past campaigns reached 25K requests / sec
    › Prepared 50 instances for 50K (= 25K * 2) requests / sec
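The sizing on this slide is a short calculation, reproduced here as a sketch. It assumes, as the slide implicitly does, that one request corresponds to one answer, and treats the factor of 2 as a safety margin over the past peak; the function and parameter names are hypothetical.

```python
import math


def instances_needed(past_peak_rps: int, safety_factor: int, per_instance_rps: int) -> int:
    """Round up so that the target throughput is always covered."""
    target_rps = past_peak_rps * safety_factor
    return math.ceil(target_rps / per_instance_rps)


# Figures from the slide: 25K requests/sec past peak, x2 margin, 1K/sec per instance.
print(instances_needed(25_000, 2, 1_000))  # 50
```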
  25. Monitoring
    [Diagram components: App Server (node_exporter, nginx_exporter, fluent-plugin-prometheus), Batch Server (node_exporter, Verifier), Push Gateway, Monitoring Server, Prometheus, Alert Manager, Grafana, MySQL (Replica), Monitoring App, Notify App, Notify Server, LINE Notify]
    › Since the 4th Survey
  26. Changes in the Number of Events
    [Chart, 3/31 9:00 to 4/1 21:00: Events and Answer Events (axis 0 to 70,000,000) and Event Rate (axis 0 to 5,000 Events / sec)]
    › 66M Events
    › 25M Answers
    › 5K Events / sec
  27. ID Token Verification Delay
    [Chart, 3/31 09:00 to 4/1 00:00: Events Delay (axis 0 to 3,500,000 Events) and Event Rate (axis 0 to 5,000 Events / sec)]
    › 3.5M Events Delay
  28. Retrospective
    Develop a simple system with the minimum specification
    › Ensured performance and stability
    Develop the system step by step
    › Felt comfortable with the development
    Avoid system trouble by design
    › Some background processing was delayed
    › Avoided the impact of processing delays on the users
  29. Conclusion
    › The scale of the survey was very large, but the system was very simple
    › The system handled high traffic stably and was prepared in a short period
    › Conducted the national survey for COVID-19 with 3 days of development