
Metrics Aggregation Data Flow

Data flow, system design, and monitoring for metrics aggregation
LINE Wataru Manji (@_manji0 : https://twitter.com/_manji0 )

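This talk introduces a data aggregation model that is becoming common in large-scale systems: running Prometheus in agent mode and sending data by remote-write to a large TSDB. It explains which design to adopt for scalability and how to monitor the aggregation system itself.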

Presented at the "6-Company Joint SRE Study Session" held on March 12, 2022
https://line.connpass.com/event/236497/

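If you would like to hear more about the talk, casual meetings can be requested via Meety.
https://meety.net/matches/OmIcsJmyfCHb
*Depending on the speaker's circumstances and judgment, this may be closed or requests may go unanswered. Thank you for your understanding.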

LINE Developers

March 12, 2022

Transcript

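  1. Wataru Manji, Team Manager - Verda Reliability Engineering, LINE. Leads the SRE team for "Verda", LINE's private cloud.

    Holds the "Pj" role on the project (PJ) to unify how monitoring data is collected and stored. Hobbies: listening to JEDM, weight training, and studying Uma Musume training theory.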
  2. Scale of LINE infrastructure 30% Annual rate of growth 80%

    Coverage of Verda Servers 5000 units Monthly New Server
  3. Verda - Private Cloud Platform of LINE

    85,000 VMs
    40,000 Baremetal servers
    6,500 HVs
    10,000 Loadbalancers
    4,500 MySQL clusters
    700 k8s clusters
    Statistics as of 2022/03
  4. Problems with Scale

    Increased cost of product's own monitoring
    Complexity of troubleshooting
    Vertical Scale Problem of Prometheus
  5. Requirements of Central TSDB for metrics

    Not depend on Verda
    HA with Multi-DC
    Scalability
    Operation
  6. What makes IMON Flash suitable for us

    1 Custom Retention
    2 Data openness
    3 Low dependence on Verda
    4 Support Level
    5 Low cost
    6 Less operation cost for us
  7. Product Selection for Public Cloud

    GCP: based on Monarch
    AWS: based on Cortex Metrics
    Azure: (base is not opened)
  8. Required Server Resource for Prometheus

    Number of scraping HV | Required Spec
    500                   | 2vCPU, under 16GB RAM, 20GB Disk
    1000                  | 4vCPU, under 32GB RAM, 40GB Disk
    5000                  | 16vCPU, over 256GB RAM, over 2000GB Disk
  9. Time to Generate static scrape config with Ansible

    Units | deploy time
    500   | 2800s
    1,000 | 4200s
    5,000 | 6200s
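  10. Split static scrape configs by hashmod with Ansible

    Replacing the inventory-based static config with a dynamic config that uses SD (service discovery) is far too large a scope for solving this problem. → Look for a way to split the config while keeping it static.
    Implementing the split as an Ansible filter turned out to let it be expressed cleanly inside the for loop of a Jinja template. → Implemented an Ansible filter that uses a hash of the hostname, so that targets are split deterministically with none left out.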
  11. Accuracy of splitting by hashmod

    Targets | Base | Min | Max | Stdev (unbiased standard deviation)
    2000    | 6    | 314 | 356 | 15.53
    5000    | 16   | 287 | 339 | 14.33
    10000   | 33   | 262 | 337 | 16.05
    50000   | 166  | 261 | 349 | 17.41
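The hostname-hash split and the spread statistics on these slides can be sketched as follows. This is a minimal illustration, not the filter used at LINE: the MD5-based `hashmod`, the filter names `hashmod` and `select_shard`, the helper `shard_sizes`, and the synthetic hostnames are all assumptions. It also assumes that "Base" is the modulus (number of splits), which is consistent with the figures (for example, 2000 / 6 ≈ 333 lies between the Min of 314 and the Max of 356).

```python
import hashlib
import statistics


def hashmod(hostname: str, modulus: int) -> int:
    """Map a hostname to a stable shard index in [0, modulus)."""
    digest = hashlib.md5(hostname.encode("utf-8")).digest()
    # Interpret the last 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[-8:], "big") % modulus


def select_shard(hostnames, shard: int, modulus: int):
    """Keep only the hostnames that fall into the given shard."""
    return [h for h in hostnames if hashmod(h, modulus) == shard]


def shard_sizes(hostnames, modulus: int):
    """Count how many hostnames land in each shard."""
    sizes = [0] * modulus
    for h in hostnames:
        sizes[hashmod(h, modulus)] += 1
    return sizes


class FilterModule(object):
    """Entry point that Ansible looks for in a filter plugin file."""

    def filters(self):
        return {"hashmod": hashmod, "select_shard": select_shard}


if __name__ == "__main__":
    # (targets, base) pairs from the slide; the hostnames are synthetic,
    # so the printed numbers will differ from the slide's measurements.
    for targets, base in [(2000, 6), (5000, 16), (10000, 33), (50000, 166)]:
        hosts = [f"hv-{i:05d}.example" for i in range(targets)]
        sizes = shard_sizes(hosts, base)
        # statistics.stdev is the sample (n - 1) standard deviation.
        print(targets, base, min(sizes), max(sizes),
              round(statistics.stdev(sizes), 2))
```

Placed in a `filter_plugins/` directory, such a filter could be used in a template loop like `{% for host in groups['hv'] | select_shard(0, 6) %}`, where the group name and arguments are hypothetical. A cryptographic digest is used instead of Python's built-in `hash()` because the latter is randomized per process for strings and would not give a deterministic split.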
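  12. Redundancy of scrape by multi hashmod function

    Splitting simply by hashmod produces a fixed Prometheus → Target correspondence (surjective, I suppose?)
    • This cannot cope with a Prometheus instance going down or with remote write delays
    Solved by matching the hashmod result against the Prometheus ids "id, id […] mod number of prometheus", so that targets are scraped redundantly
    • Duplicate metrics are removed by a feature of the Central TSDB
    • To run on instances with […]GB RAM, the number of hashmod splits is kept to around […], and each instance redundantly scrapes […] hosts as targets
    • Another benefit is that the variation in the number of scrape targets per instance becomes relatively small
    Aside: the VM API and the operation of individual VMs are entirely separate concerns, so running Prometheus on a VM is OK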
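The redundant assignment on this slide can be sketched as below. The exact id expression was lost in extraction, so the "own id plus the next id, modulo the number of Prometheus instances" rule is an assumption, as are the MD5-based `hashmod` and the helper names `owned_shards`, `targets_for`, and `scrapers_of`.

```python
import hashlib


def hashmod(hostname: str, modulus: int) -> int:
    """Map a hostname to a stable shard index in [0, modulus)."""
    digest = hashlib.md5(hostname.encode("utf-8")).digest()
    return int.from_bytes(digest[-8:], "big") % modulus


def owned_shards(prom_id: int, n_prom: int, replicas: int = 2) -> set:
    """Shards scraped by one Prometheus: its own id and the following ones.

    Assumed rule: {id, (id + 1) mod n_prom} when replicas is 2.
    """
    return {(prom_id + k) % n_prom for k in range(replicas)}


def targets_for(prom_id: int, hostnames, n_prom: int, replicas: int = 2):
    """Hostnames that the given Prometheus instance should scrape."""
    shards = owned_shards(prom_id, n_prom, replicas)
    return [h for h in hostnames if hashmod(h, n_prom) in shards]


def scrapers_of(hostname: str, n_prom: int, replicas: int = 2):
    """Ids of the Prometheus instances that scrape this hostname."""
    shard = hashmod(hostname, n_prom)
    return sorted(p for p in range(n_prom)
                  if shard in owned_shards(p, n_prom, replicas))
```

With this rule every target is scraped by `replicas` distinct instances (as long as `n_prom >= replicas`), so losing any single instance still leaves each target covered. The duplicate series that result are expected to be removed on the Central TSDB side, as the slide notes.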
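  13. Answers to preliminary questions

    Metrics are often used for investigation and analysis.
    • An API server went down → How many requests per unit time was it handling just before? What was the average processing time?
    • A Messaging Queue is stuck → What is the throughput per unit time of each Subscriber? Are messages being processed properly at all?
    • A user service is unreachable → Are requests reaching the Verda LB in the first place? How stable are the Backend Server health checks?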
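  14. Answers to preliminary questions

    • Monitoring methods (exporter implementation, placement, and how metrics are sent) are presumably designed by each team across all LINE services, not just Verda.
    • There are several in-house standard LINE monitoring systems that can serve as the storage destination for monitoring data. The appropriate one has to be chosen based on an evaluation of whether it is free of dependencies on Verda's storage services, whether Verda developers can use it without friction while it still complies with the company-wide data retention guidelines, and whether it can withstand Verda's data volume.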