Slide 1

Slide 1 text

Metrics Aggregation Data Flow

Slide 2

Slide 2 text

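Wataru Manji. Team Manager, Verda Reliability Engineering, LINE. Leads the SRE team for "Verda", LINE's private cloud. PjM of the project to unify how monitoring data is collected and stored. Hobbies: listening to JEDM, strength training, and theorizing about training strategies in Uma Musume.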

Slide 3

Slide 3 text

Scale of LINE infrastructure. Annual rate of growth: 30%. Coverage of Verda: 80%. Monthly new servers: 5000 units.

Slide 4

Slide 4 text

Verda - Private Cloud Platform of LINE. 85,000 VMs; 40,000 Baremetal servers; 6,500 HVs; 10,000 Loadbalancers; 4,500 MySQL clusters; 700 k8s clusters. Statistics as of 2022/03.

Slide 5

Slide 5 text

Problems with Scale: increased cost of each product's own monitoring; complexity of troubleshooting; vertical scale problem of Prometheus.

Slide 6

Slide 6 text

Solution procedure Use Central TSDB Agentificate Prometheus Split Scrape Unit

Slide 7

Slide 7 text

Requirements of Central TSDB for metrics: no dependency on Verda; HA with multi-DC; scalability; operation.

Slide 8

Slide 8 text

Metrics TSDB Implementations: Cortex Metrics; Thanos; IMON Flash (developed and maintained by LINE).

Slide 9

Slide 9 text

Design of Cortex metrics Retention: Custom

Slide 10

Slide 10 text

Design of IMON Flash retention: 24h

Slide 11

Slide 11 text

What makes IMON Flash suitable for us: 1. Custom retention 2. Data openness 3. Low dependence on Verda 4. Support level 5. Low cost 6. Less operation cost for us

Slide 12

Slide 12 text

Product Selection for Public Cloud. GCP: based on Monarch. AWS: based on Cortex Metrics. Azure: (base is not disclosed).

Slide 13

Slide 13 text

Solution procedure Use Central TSDB Agentificate Prometheus Split Scrape Unit

Slide 14

Slide 14 text

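Prometheus Operation Problems
• Restarts take a long time.
• Memory usage becomes enormous.
• Disk usage becomes enormous.
• CPU usage also grows as the number of scrape targets increases (details omitted).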

Slide 15

Slide 15 text

Time required to restart Prometheus: Down! → Restart → Replaying WAL... (over 1h) → Running

Slide 16

Slide 16 text

Prometheus Agent-mode Pros Cons

Slide 17

Slide 17 text

Use normal Prometheus as Agent: time required to restart; memory usage; disk usage.

Slide 18

Slide 18 text

Solution procedure Use Central TSDB Agentificate Prometheus Split Scrape Unit

Slide 19

Slide 19 text

Type of Our Prometheus: on VM/Baremetal; on Kubernetes. As the infrastructure grew, horizontal scaling of Prometheus became necessary.

Slide 20

Slide 20 text

Required Server Resource for Prometheus
Number of scraping HV | Required Spec
500 | 2 vCPU, under 16GB RAM, 20GB Disk
1000 | 4 vCPU, under 32GB RAM, 40GB Disk
5000 | 16 vCPU, over 256GB RAM, over 2000GB Disk

Slide 21

Slide 21 text

Time to Generate static scrape config with Ansible
Units | Deploy time
500 | 2800s
1,000 | 4200s
5,000 | 6200s

Slide 22

Slide 22 text

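Split static scrape configs by hashmod with Ansible
Replacing the inventory-based static config with a dynamic config that uses SD (service discovery) is too large a scope for solving this problem.
→ Look for a way to split the config while keeping it static.
Noticed that implementing the split as an Ansible Filter lets it be expressed cleanly inside a for loop in a Jinja2 template.
→ Implemented an Ansible Filter that uses a hash of the hostname, so that targets are split deterministically and none are left out.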
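The filter itself is not shown on the slide, so the following is only a minimal sketch of what a hostname-hash Ansible filter could look like. The filter names `hashmod` and `hashmod_select`, and the choice of SHA-256, are assumptions made for illustration. Ansible loads filter plugins from a class named `FilterModule` that exposes a `filters()` method.

```python
import hashlib


def hashmod(hostname: str, base: int) -> int:
    """Map a hostname to a stable bucket in range(base)."""
    # hashlib is used instead of the built-in hash(), which is salted per
    # process for strings and would give different buckets on each run.
    digest = hashlib.sha256(hostname.encode("utf-8")).hexdigest()
    return int(digest, 16) % base


def hashmod_select(hostnames, base, index):
    """Keep only the hostnames whose bucket equals `index` (hypothetical filter)."""
    return [h for h in hostnames if hashmod(h, base) == index]


class FilterModule(object):
    """Ansible discovers filter plugins through a class with this name."""

    def filters(self):
        return {"hashmod": hashmod, "hashmod_select": hashmod_select}
```

In a template, such a filter could be used in a for loop along the lines of `{% for host in groups['hypervisors'] | hashmod_select(16, prometheus_index) %}`, where the group name and `prometheus_index` are hypothetical. Because every hostname falls into exactly one bucket, the per-Prometheus configs together cover the whole inventory.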

Slide 23

Slide 23 text

Accuracy of splitting by hashmod
Targets | Base | Min | Max | Stddev (unbiased standard deviation)
2000 | 6 | 314 | 356 | 15.53
5000 | 16 | 287 | 339 | 14.33
10000 | 33 | 262 | 337 | 16.05
50000 | 166 | 261 | 349 | 17.41
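A measurement of this kind can be reproduced roughly as follows. This is a sketch under assumptions: "Base" is read as the modulus (number of splits), the hostnames are synthetic, and SHA-256 stands in for whatever hash the actual filter uses, so the resulting numbers will not match the slide.

```python
import hashlib
import statistics
from collections import Counter


def bucket(hostname: str, base: int) -> int:
    """Stable bucket for a hostname (SHA-256 is an assumed choice)."""
    digest = hashlib.sha256(hostname.encode("utf-8")).hexdigest()
    return int(digest, 16) % base


def split_stats(targets: int, base: int) -> dict:
    """Min, max and sample standard deviation of bucket sizes (base >= 2)."""
    # Synthetic hostnames; a real inventory will distribute differently.
    counts = Counter(bucket(f"hv-{i:06d}", base) for i in range(targets))
    sizes = [counts.get(b, 0) for b in range(base)]
    return {
        "min": min(sizes),
        "max": max(sizes),
        # statistics.stdev divides by n - 1, i.e. the unbiased estimator.
        "stdev": statistics.stdev(sizes),
    }


if __name__ == "__main__":
    for targets, base in [(2000, 6), (5000, 16), (10000, 33), (50000, 166)]:
        print(targets, base, split_stats(targets, base))
```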

Slide 24

Slide 24 text

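Redundancy of scrape by multi hashmod function
A simple hashmod split creates a fixed Prometheus → target correspondence ("surjective", perhaps?).
• It cannot cope with a Prometheus going down or with remote write delays.
Solved by matching the hashmod result against more than one Prometheus id (id, id mod number of Prometheus) so that targets are scraped redundantly.
• Duplicate metrics are eliminated by a Central TSDB feature.
• The number of hashmod splits is capped so that each Prometheus fits on an instance with a bounded amount of RAM (GB), and several instances scrape each target redundantly.
• Another benefit: the variation in the number of scrape targets per instance becomes relatively small.
Aside: the VM API and the operation of individual VMs are entirely separate concerns, so running Prometheus on VMs is OK.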
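The redundant assignment on this slide can be sketched as below. The exact matching rule is not legible in the slide text, so the sketch assumes that Prometheus number `id` scrapes the buckets `id` and `(id + 1) mod N`. The function names and the `replicas` parameter are hypothetical.

```python
import hashlib


def bucket(hostname: str, base: int) -> int:
    """Stable bucket for a hostname (SHA-256 is an assumed choice)."""
    digest = hashlib.sha256(hostname.encode("utf-8")).hexdigest()
    return int(digest, 16) % base


def buckets_for(prometheus_id: int, num_prometheus: int, replicas: int = 2):
    """Buckets scraped by one Prometheus: its own id and the following ones.

    Assumed rule: (id + k) mod N for k in range(replicas).
    """
    return {(prometheus_id + k) % num_prometheus for k in range(replicas)}


def targets_for(hostnames, prometheus_id, num_prometheus, replicas=2):
    """Targets that the given Prometheus instance should scrape."""
    wanted = buckets_for(prometheus_id, num_prometheus, replicas)
    return [h for h in hostnames if bucket(h, num_prometheus) in wanted]
```

With `replicas` no larger than the number of instances, every target is scraped by exactly `replicas` different Prometheus instances, so losing one instance still leaves a copy of each series. The resulting duplicates are what the Central TSDB is expected to deduplicate.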

Slide 25

Slide 25 text

Solution procedure Split Scrape Unit Agentificate Prometheus Use Central TSDB

Slide 26

Slide 26 text

As-is: Metrics data-flow of VM system

Slide 27

Slide 27 text

To-be: Metrics data-flow of VM system

Slide 28

Slide 28 text

Conclusions Central TSDB Split scrape target Agentificate Prometheus

Slide 29

Slide 29 text

Thank you

Slide 30

Slide 30 text

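Answers to preliminary questions
The "required service level" has to be considered from the viewpoint of the system's users; that is the key point.
When balancing development work under an SRE culture, it is important not to grow the number of "service levels" used as indicators indiscriminately, so I think it is important to narrow them down to the "more important items" from the user's viewpoint.
Verda is not there yet, so the SRE team is trying to focus on things like "unifying how we measure the SLIs that are likely to be chosen as indicators across user services" and "unifying the interfaces used for measurement".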

Slide 31

Slide 31 text

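Answers to preliminary questions
We often use metrics for investigation and analysis.
• An API server went down → How many requests per unit time was it handling just before? What was the average processing time?
• A Messaging Queue is stuck → How much is each Subscriber processing per unit time? Is it processing properly at all?
• A user service is unreachable → Are requests reaching the Verda LB in the first place? How stable are the Healthchecks of the Backend Servers?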

Slide 32

Slide 32 text

Answers to preliminary questions
We sometimes depend on Verda user resources, but we avoid depending on the Verda API layer. In particular, we take care not to depend on "storage services provided by Verda".

Slide 33

Slide 33 text

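Answers to preliminary questions
Data loss should be tolerated, and the semantics of metrics need to be designed on that premise.
• Make sure the data does not lose its meaning as a series because of gaps (for example, by using Counters).
• The success or failure of each scrape itself needs to be managed as time-series data (Prometheus's up metric).
Alerts and the like should be designed with these points in mind.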
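To illustrate why a Counter keeps its meaning across missed scrapes, here is a small sketch. It is a simplified calculation written for this example, not Prometheus's own `increase()` (which also extrapolates), and the sample data is made up.

```python
def counter_increase(samples):
    """Total increase of a counter over a list of (timestamp, value) samples.

    A drop in value is treated as a counter reset (e.g. a process restart),
    so the value after the reset is counted as new increase.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def scrape_success_ratio(up_samples):
    """Fraction of successful scrapes, from 0/1 samples like the `up` metric."""
    return sum(up_samples) / len(up_samples) if up_samples else 0.0


if __name__ == "__main__":
    full = [(0, 10), (15, 25), (30, 40), (45, 55)]
    gappy = [(0, 10), (30, 40), (45, 55)]  # the scrape at t=15 was lost
    print(counter_increase(full), counter_increase(gappy))
    print(scrape_success_ratio([1, 0, 1, 1]))
```

Because a counter is cumulative, the sample after a gap still carries everything that happened during the gap, as long as no reset falls inside it. A gauge sampled at the same instants would simply lose that information, which is why the success of the scrape itself is worth tracking separately.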

Slide 34

Slide 34 text

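Answers to preliminary questions
The monitoring method (exporter implementation, placement, how metrics are sent) and the storage destination for monitoring data should be considered separately. Our intention is to leave the former to each team and to unify the latter.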

Slide 35

Slide 35 text

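Answers to preliminary questions
• The monitoring method (exporter implementation, placement, how metrics are sent) is presumably designed by each team, not only in Verda but across all LINE services.
• LINE has several in-house standard monitoring systems that can serve as the storage destination for monitoring data. The appropriate one has to be chosen based on an evaluation of whether it depends on Verda's storage services, whether Verda developers can use it without friction while still complying with the company-wide data retention guidelines, whether it can withstand Verda's data volume, and so on.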