Metrics Aggregation Data Flow

On the data flow, system design, and monitoring of metrics aggregation
LINE Wataru Manji (@_manji0 : https://twitter.com/_manji0 )

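This talk introduces a data aggregation model that is becoming common in large-scale systems, in which Prometheus runs as an agent and remote-writes to a large TSDB. It explains which design to adopt for scalability and how to monitor that system itself.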

Presented at the "6-Company Joint SRE Study Session" held on March 12, 2022.
https://line.connpass.com/event/236497/

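If you would like to hear more about the talk, the speaker is accepting casual interviews on Meety.
https://meety.net/matches/OmIcsJmyfCHb
*Depending on the speaker's circumstances and judgment, this may be closed or some requests may go unanswered. Thank you for your understanding.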

LINE Developers

March 12, 2022

Transcript

  1. Metrics Aggregation Data Flow

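  2. Wataru Manji, Team Manager - Verda Reliability Engineering, LINE. Leads the SRE team for "Verda", LINE's private cloud, and is PjM of the project to unify how monitoring data is collected and stored. Hobbies: JEDM appreciation, weight training, and theorizing about training strategies in Uma Musume.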
  3. Scale of LINE infrastructure: 30% annual rate of growth; 80% coverage of Verda; monthly new servers: 5000 units.
  4. Verda - Private Cloud Platform of LINE (statistics as of 2022/03): 85,000 VMs; 40,000 baremetal servers; 6,500 HVs; 10,000 load balancers; 4,500 MySQL clusters; 700 k8s clusters.
  5. Problems with scale: increased cost of each product's own monitoring; complexity of troubleshooting; vertical scale problem of Prometheus.
  6. Solution procedure: use a Central TSDB; agentificate Prometheus (run it as an agent); split the scrape unit.

  7. Requirements of the Central TSDB for metrics: no dependence on Verda; HA with multi-DC; scalability; operation.
  8. Metrics TSDB implementations: Cortex Metrics; Thanos; IMON Flash (developed and maintained by LINE).
  9. Design of Cortex metrics Retention: Custom

  10. Design of IMON Flash retention: 24h

  11. What makes IMON Flash suitable for us: (1) custom retention; (2) data openness; (3) low dependence on Verda; (4) support level; (5) low cost; (6) less operation cost for us.
  12. Product selection for public cloud: GCP: based on Monarch; AWS: based on Cortex Metrics; Azure: (base is not disclosed).
  13. Solution procedure: use a Central TSDB; agentificate Prometheus; split the scrape unit.

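  14. Prometheus operation problems: restarts take a long time; memory usage becomes enormous; disk usage becomes enormous; the more scrape targets there are, the higher the CPU usage (remainder omitted on the slide).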

  15. Time required to restart Prometheus: Down! → Restart → Replaying WAL... (over 1h) → Running.
  16. Prometheus Agent-mode Pros Cons

  17. Use normal Prometheus as Agent: time required to restart; memory usage; disk usage.

  18. Solution procedure: use a Central TSDB; agentificate Prometheus; split the scrape unit.

  19. Type of our Prometheus: on VM/Baremetal; on Kubernetes. As the infrastructure grew, horizontal scaling of Prometheus became necessary.

  20. Required Server Resource for Prometheus (Number of scraping HV: Required Spec)
    500: 2vCPU, under 16GB RAM, 20GB Disk
    1000: 4vCPU, under 32GB RAM, 40GB Disk
    5000: 16vCPU, over 256GB RAM, over 2000GB Disk
  21. Time to generate static scrape config with Ansible (Units: deploy time)
    500: 2800s
    1,000: 4200s
    5,000: 6200s
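  22. Split static scrape configs by hashmod with Ansible
    Replacing the inventory-based static config with a dynamic config that uses SD is too large a scope for the problem being solved → think of a way to split while keeping it static.
    Noticed that implementing this as an Ansible filter lets it be expressed cleanly inside a for statement in a jinja template → implemented an Ansible filter that uses the hash of the hostname, so that targets are split deterministically with none left out.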
  23. Accuracy of splitting by hashmod
    Targets  Base  Min  Max  Stddev (unbiased standard deviation)
    2000     6     314  356  15.53
    5000     16    287  339  14.33
    10000    33    262  337  16.05
    50000    166   261  349  17.41
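
The hostname-hash split and the spread figures on these two slides can be sketched as follows. This is a minimal sketch, not the speaker's implementation: the choice of SHA-256, the filter name `hashmod`, and the synthetic `hv-00000` style hostnames are all assumptions made for illustration. The `FilterModule` class with a `filters()` method is the standard way Ansible discovers filter plugins. `statistics.stdev` computes the sample (n-1) standard deviation, which is what the slide's Stddev column appears to report.

```python
import hashlib
import statistics
from collections import Counter


def hashmod(hostname: str, modulus: int) -> int:
    """Map a hostname to a stable bucket in [0, modulus)."""
    if modulus <= 0:
        raise ValueError("modulus must be positive")
    # Assumption: SHA-256 is used only as a stable hash. Python's built-in
    # hash() is salted per process, so it would not be deterministic
    # across Ansible runs.
    digest = hashlib.sha256(hostname.encode("utf-8")).hexdigest()
    return int(digest, 16) % modulus


class FilterModule(object):
    """Ansible loads filters from a FilterModule class exposing filters()."""

    def filters(self):
        return {"hashmod": hashmod}


def spread(hostnames, base):
    """Bucket sizes plus min, max and sample standard deviation (base >= 2)."""
    counts = Counter(hashmod(h, base) for h in hostnames)
    sizes = [counts.get(b, 0) for b in range(base)]
    return {
        "sizes": sizes,
        "min": min(sizes),
        "max": max(sizes),
        "stdev": statistics.stdev(sizes),
    }


if __name__ == "__main__":
    # Same (targets, base) pairs as the table; hostnames are synthetic,
    # so the numbers will differ from the slide.
    for targets, base in [(2000, 6), (5000, 16), (10000, 33), (50000, 166)]:
        hosts = [f"hv-{i:05d}" for i in range(targets)]
        result = spread(hosts, base)
        print(targets, base, result["min"], result["max"], round(result["stdev"], 2))
```

Saved under a playbook's `filter_plugins/` directory, the filter could be used in a template loop such as `{% for host in groups['hypervisors'] if (host | hashmod(16)) == shard_id %}`, where the group name `hypervisors` and the variable `shard_id` are hypothetical.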
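  24. Redundancy of scrape by multi hashmod function
    If you simply split by hashmod, a Prometheus → Target correspondence is formed ("surjective", I suppose you would call it).
    • Cannot cope with a Prometheus going down or with remote write delays.
    Solved by matching the hashmod result against each Prometheus's id and (id […]) mod (number of prometheus), so that scraping is done redundantly.
    • Duplicate metrics are eliminated by a Central TSDB feature.
    • The number of hashmod splits is controlled to about […] so that it works on instances with […] GB RAM; each scrapes […] targets at […]-fold redundancy.
    • Another benefit is that the variation in the number of scrape targets per instance becomes relatively small.
    Aside: the VM API and the running of individual VMs are completely separate matters, so running Prometheus on VMs is OK.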
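
One way to read the redundant assignment on this slide is sketched below. The offsets and the redundancy factor did not survive in the transcript, so this sketch assumes each Prometheus owns the buckets `id, id + 1, ...` modulo the number of Prometheus instances, with `replicas` as a free parameter. The function names and SHA-256 as the hash are also assumptions, not taken from the talk.

```python
import hashlib


def hashmod(hostname: str, modulus: int) -> int:
    """Stable bucket for a hostname (SHA-256 is an assumed choice of hash)."""
    digest = hashlib.sha256(hostname.encode("utf-8")).hexdigest()
    return int(digest, 16) % modulus


def buckets_for(prom_id: int, num_prometheus: int, replicas: int) -> set:
    """Buckets owned by one Prometheus: id, (id + 1) mod n, ... (assumed offsets)."""
    if not 1 <= replicas <= num_prometheus:
        raise ValueError("replicas must be between 1 and num_prometheus")
    return {(prom_id + k) % num_prometheus for k in range(replicas)}


def targets_for(prom_id: int, hostnames, num_prometheus: int, replicas: int) -> list:
    """Targets a given Prometheus scrapes under the redundant split."""
    owned = buckets_for(prom_id, num_prometheus, replicas)
    return [h for h in hostnames if hashmod(h, num_prometheus) in owned]


if __name__ == "__main__":
    hosts = [f"hv-{i:03d}" for i in range(20)]
    for prom_id in range(5):
        print(prom_id, targets_for(prom_id, hosts, num_prometheus=5, replicas=2))
```

With this scheme every target is scraped by exactly `replicas` instances, so losing one Prometheus leaves each of its targets covered by another. The resulting duplicate series are left for the Central TSDB to deduplicate, as the slide states.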
  25. Solution procedure: split the scrape unit; agentificate Prometheus; use a Central TSDB.

  26. As-is: Metrics data-flow of VM system

  27. To-be: Metrics data-flow of VM system

  28. Conclusions: Central TSDB; split scrape target; agentificate Prometheus.

  29. Thank you

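  30. Answers to preliminary questions
    The "required service level" has to be considered from the viewpoint of the system's users; that is the key point. In balancing development work under an SRE culture, it matters not to increase the number of "service levels" used as indicators indiscriminately, so I think it is important to narrow them down to "the more important items" from the user's viewpoint. Verda is not there yet, so as the SRE team we are trying to focus on things like "unifying how the SLIs likely to be chosen as indicators across user services are measured" and "unifying the interfaces used for measurement".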

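  31. Answers to preliminary questions
    We often use metrics for investigation and analysis.
    • An API server went down → What was the request volume per unit time just before? What was the average processing time?
    • A messaging queue is stuck → What is each subscriber's throughput per unit time? Is it processing properly at all?
    • A user service is unreachable → Are requests reaching the Verda LB in the first place? How stable are the backend servers' health checks?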
  32. Answers to preliminary questions
    We sometimes depend on Verda user resources, but we avoid depending on Verda's API layer. In particular, we take care not to depend on the storage services that Verda provides.

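  33. Answers to preliminary questions
    Data loss should be tolerated, and the semantics of metrics need to be designed on that premise.
    • Make sure the data does not lose its meaning as a series when samples are missing (for example, use Counters).
    • Scrape success or failure itself needs to be managed as time-series data (Prometheus's up metric).
    Alerts and the like should be designed with these points in mind.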
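
The point about Counters surviving missing samples can be illustrated with a small sketch. This is a simplified increase calculation written for illustration, not Prometheus's actual `increase()` (which also extrapolates to the window boundaries); the sample values are made up.

```python
def increase(samples):
    """Total increase of a counter over (timestamp, value) samples.

    A drop in value is treated as a counter reset, so the new value is
    counted in full. Simplified: no extrapolation to window boundaries.
    """
    total = 0.0
    prev = None
    for _, value in samples:
        if prev is not None:
            total += (value - prev) if value >= prev else value
        prev = value
    return total


if __name__ == "__main__":
    # Cumulative counter scraped every 15s (hypothetical values).
    full = [(0, 0), (15, 10), (30, 25), (45, 40)]
    # The same series with the two middle scrapes lost.
    with_gap = [(0, 0), (45, 40)]
    print(increase(full), increase(with_gap))  # both 40.0: the total is preserved

    # If each scrape had instead reported only the per-interval amount
    # (gauge style), losing the same two scrapes loses their contribution.
    per_interval = {15: 10, 30: 15, 45: 15}
    surviving = {t: v for t, v in per_interval.items() if t not in (15, 30)}
    print(sum(per_interval.values()), sum(surviving.values()))  # 40 vs 15
```

Because a counter carries the running total, a gap only reduces time resolution, whereas per-interval values lose information permanently. This is why missing scrapes should be detected separately, for example through the `up` series mentioned on the slide.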
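  34. Answers to preliminary questions
    The monitoring method (exporter implementation, placement, how metrics are sent) and the storage destination for monitoring data should be considered separately. Our stance is that the former is handled case by case, while the latter is unified.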

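  35. Answers to preliminary questions
    • The monitoring method (exporter implementation, placement, how metrics are sent) is presumably designed by each team across all LINE services, not only Verda.
    • There are several in-house standard LINE monitoring systems that can serve as the storage destination for monitoring data, but an appropriate one has to be chosen based on whether it is free of dependence on Verda's storage services, whether Verda developers can use it without friction while still complying with the company-wide data retention guidelines, and whether it can handle Verda's data volume.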