Wataru Manji
Team Manager - Verda Reliability Engineering, LINE
Leads the SRE team for Verda, LINE's private cloud
PjM of the project to unify how monitoring data is collected and stored
Hobbies: listening to JEDM, weight training, and theorizing about training in Uma Musume
Slide 3
Scale of LINE infrastructure
Annual rate of growth: 30%
Coverage of Verda Servers: 80%
Monthly New Server: 5,000 units
Slide 4
Verda - Private Cloud Platform of LINE
85,000 VMs
40,000 Baremetal servers
6,500 HVs
10,000 Loadbalancers
4,500 MySQL clusters
700 k8s clusters
Statistics as of 2022/03
Slide 5
Problems with Scale
Increased cost of product's own monitoring
Complexity of troubleshooting
Vertical Scale Problem of Prometheus
Slide 6
Solution procedure
Use Central TSDB
Agentificate Prometheus
Split Scrape Unit
Slide 7
Requirements of Central TSDB for metrics
Not depend on Verda
HA with Multi-DC
Scalability
Operation
Slide 8
Metrics TSDB Implementations
Cortex Metrics
Thanos
IMON Flash (developed and maintained by LINE)
Slide 9
Design of Cortex metrics
Retention: Custom
Slide 10
Design of IMON Flash
Retention: 24h
Slide 11
What makes IMON Flash suitable for us
1 Custom Retention
2 Data openness
3 Low dependence on Verda
4 Support Level
5 Low cost
6 Less operation cost for us
Slide 12
Product Selection for Public Cloud
GCP: based on Monarch
AWS: based on Cortex Metrics
Azure: (base is not opened)
Slide 13
Solution procedure
Use Central TSDB
Agentificate Prometheus
Split Scrape Unit
Time required to restart Prometheus
Down! → Restart → Replaying WAL... (over 1h) → Running
Slide 16
Prometheus Agent-mode
Pros Cons
Slide 17
Use normal Prometheus as Agent
Time required to restart
Memory usage
Disk usage
Slide 18
Solution procedure
Use Central TSDB
Agentificate Prometheus
Split Scrape Unit
Slide 19
Type of Our Prometheus
on VM/Baremetal
on Kubernetes
As the infrastructure grew, Prometheus needed to scale horizontally.
Slide 20
Required Server Resource for Prometheus
Number of scraping HV | Required Spec
500 | 2vCPU, under 16GB RAM, 20GB Disk
1000 | 4vCPU, under 32GB RAM, 40GB Disk
5000 | 16vCPU, over 256GB RAM, over 2000GB Disk
Slide 21
Time to Generate static scrape config with Ansible
Units | Deploy time
500 | 2800s
1,000 | 4200s
5,000 | 6200s
Slide 22
Split static scrape configs by hashmod with Ansible
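Replacing the inventory-based static config with a dynamic config that uses service discovery (SD) is too large a scope for this fix.
→ Look for a way to split the config while keeping it static.
Realized that implementing the split as an Ansible filter lets it be expressed cleanly inside a for loop in a jinja template.
→ To split deterministically with no target left out, implemented an Ansible filter that uses a hash of the hostname.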
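A minimal sketch of what such a filter could look like, assuming an MD5 digest of the hostname taken modulo the number of splits. The names `hashmod` and `select_shard`, and the choice of MD5, are illustrative assumptions; the slides do not show the actual implementation. The `FilterModule` class with a `filters()` method is the standard way Ansible discovers custom filters.

```python
import hashlib


def hashmod(hostname: str, modulus: int) -> int:
    """Map a hostname to a stable bucket in range(modulus).

    MD5 is an assumption here; any stable hash works, as long as it
    does not change between runs (unlike Python's built-in hash()).
    """
    digest = hashlib.md5(hostname.encode("utf-8")).hexdigest()
    return int(digest, 16) % modulus


def select_shard(hostnames, shard_id, modulus):
    """Keep only the hostnames whose bucket equals shard_id."""
    return [h for h in hostnames if hashmod(h, modulus) == shard_id]


class FilterModule(object):
    """Ansible looks up custom filters through FilterModule.filters()."""

    def filters(self):
        return {"hashmod": hashmod, "select_shard": select_shard}
```

In a template this could be used as `{% for host in groups['hv'] | select_shard(shard_id, 16) %}`, where the group name `hv` and the variable `shard_id` are hypothetical. Because every hostname falls into exactly one bucket, the shards together cover the whole inventory with no overlap.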
Slide 23
Accuracy of splitting by hashmod
Targets | Base | Min | Max | Stdev (sample standard deviation)
2000 | 6 | 314 | 356 | 15.53
5000 | 16 | 287 | 339 | 14.33
10000 | 33 | 262 | 337 | 16.05
50000 | 166 | 261 | 349 | 17.41
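The kind of measurement reported above can be reproduced in spirit with a short simulation. This is a sketch under stated assumptions: the hostnames are synthetic, SHA-256 stands in for whatever hash the real filter uses, and the sizing rule `targets // 300` is inferred from the fact that all four Base values above equal the target count divided by 300 and rounded down. The resulting figures will therefore differ from the ones measured on the real inventory.

```python
import hashlib
import statistics
from collections import Counter


def shard_sizes(targets: int, base: int) -> list:
    """Count how many synthetic hostnames land in each of `base` buckets."""
    counts = Counter()
    for i in range(targets):
        name = f"host-{i:06d}.example.com"  # synthetic, not a real inventory
        digest = hashlib.sha256(name.encode("utf-8")).digest()
        counts[int.from_bytes(digest[:8], "big") % base] += 1
    return [counts[b] for b in range(base)]


def summarize(targets: int, per_shard: int = 300) -> dict:
    """Min, max and sample standard deviation of shard sizes."""
    base = targets // per_shard  # inferred rule: 2000 -> 6, 50000 -> 166
    sizes = shard_sizes(targets, base)
    return {
        "targets": targets,
        "base": base,
        "min": min(sizes),
        "max": max(sizes),
        "stdev": statistics.stdev(sizes),  # n-1 denominator (unbiased variance)
    }


if __name__ == "__main__":
    for n in (2000, 5000, 10000, 50000):
        print(summarize(n))
```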
Slide 24
Redundancy of scrape by multi hashmod function
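Splitting simply by hashmod creates a fixed Prometheus → Target mapping in which each target depends on a single Prometheus (a surjective mapping, so to speak).
• It cannot cope with a Prometheus going down or with remote write delays.
This is solved by matching the hashmod result against the Prometheus's id and an id-derived value mod (number of Prometheus), so that targets are scraped redundantly.
• Duplicate metrics are removed by a feature of the Central TSDB.
• The hashmod split is sized so that each Prometheus fits on an instance with a fixed amount of RAM (in GB), with the redundant scrapes counted in its target load.
• A side benefit is that the variation in scrape targets per instance becomes relatively small.
Aside: the VM API and the running of individual VMs are entirely separate things, so running Prometheus on VMs is fine.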
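The redundancy idea can be sketched as follows. The exact matching rule is not legible in this transcript, so this example assumes each Prometheus with a given `id` scrapes the buckets `id` and `(id + 1) mod n`, where `n` is the number of Prometheus instances; the `+ 1` offset, the function names, and the MD5 hash are all assumptions made for illustration.

```python
import hashlib


def bucket(hostname: str, n: int) -> int:
    """Stable bucket for a hostname in range(n)."""
    digest = hashlib.md5(hostname.encode("utf-8")).hexdigest()
    return int(digest, 16) % n


def owned_buckets(prom_id: int, n: int, replicas: int = 2) -> set:
    """Buckets scraped by one Prometheus: id, id + 1, ... (mod n)."""
    return {(prom_id + k) % n for k in range(replicas)}


def targets_for(prom_id: int, hostnames, n: int, replicas: int = 2) -> list:
    """Scrape targets assigned to the Prometheus with the given id."""
    mine = owned_buckets(prom_id, n, replicas)
    return [h for h in hostnames if bucket(h, n) in mine]


def scrapers_of(hostname: str, n: int, replicas: int = 2) -> list:
    """Ids of every Prometheus that scrapes this hostname."""
    b = bucket(hostname, n)
    return sorted(p for p in range(n) if b in owned_buckets(p, n, replicas))
```

With `replicas=2` and at least two instances, every target is scraped by exactly two Prometheus instances, so losing any single instance still leaves each target covered. The duplicate samples this produces are what the Central TSDB is expected to deduplicate.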
Slide 25
Solution procedure
Split Scrape Unit
Agentificate Prometheus
Use Central TSDB
Slide 26
As-is: Metrics data-flow of VM system
Slide 27
To-be: Metrics data-flow of VM system
Slide 28
Conclusions
Central TSDB
Split scrape target
Agentificate Prometheus
Slide 29
Thank you
Slide 30
Answers to preliminary questions
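The "required service level" has to be considered from the viewpoint of the system's users, so that is the key point.
In balancing development work under an SRE culture, it is important not to multiply the "service level" indicators needlessly, so I think it is important to narrow them down to the "more important items" from the user's viewpoint.
Verda has not achieved this yet, so as the SRE team we are trying to put our effort into things like "unifying how the SLIs likely to be chosen as indicators across user services are measured" and "unifying the interfaces used for measurement".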
Slide 31
Answers to preliminary questions
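We often use metrics for investigation and analysis.
• An API server went down → What was the request volume per unit time just before it went down? What was the average processing time?
• A messaging queue is stuck → What is each subscriber's throughput per unit time? Is it processing messages properly at all?
• A user service is unreachable → Are requests reaching the Verda LB at all? How stable are the backend servers' health checks?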
Slide 32
Answers to preliminary questions
We sometimes depend on Verda user resources, but we avoid depending on Verda's API layer.
In particular, we take care not to depend on the storage services that Verda provides.
Slide 33
Answers to preliminary questions
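Data loss should be tolerated, and the meaning and representation of metrics need to be designed on that premise.
• Make sure the data does not lose its meaning as a series when samples are missing (for example, by using counters).
• Scrape success and failure themselves need to be managed as time series data (Prometheus's up metric).
Alerts should be designed with these two points in mind.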
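To illustrate why counters tolerate missing samples, here is a small sketch that computes the increase and per-second rate of a counter from whatever samples happen to be available. The function names and the sample layout of `(timestamp_seconds, value)` pairs are assumptions for illustration, and the reset handling (a drop in value is treated as a restart from zero) is a simplified version of how monotonic counters are commonly interpreted, not Prometheus's exact algorithm.

```python
def counter_increase(samples):
    """Total increase of a counter over (timestamp, value) samples.

    A drop in value is treated as a counter reset, so the new value
    is counted as the increase since the reset.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def per_second_rate(samples):
    """Average per-second rate between the first and last sample."""
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed if elapsed > 0 else 0.0


if __name__ == "__main__":
    full = [(0, 0), (15, 30), (30, 60), (45, 90), (60, 120)]
    gappy = [full[0], full[3], full[4]]  # two scrapes were lost
    print(per_second_rate(full), per_second_rate(gappy))
```

Because a counter carries the cumulative total, the two lost scrapes do not change the computed rate. A gauge that reported only the per-interval count would have silently lost those requests.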
Slide 34
Answers to preliminary questions
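The monitoring method (exporter implementation, placement, and how metrics are sent) and the storage destination for monitoring data should be considered separately. Our approach is to leave the former to each case and to unify the latter.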
Slide 35
Answers to preliminary questions
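• The monitoring method (exporter implementation, placement, and how metrics are sent) is presumably designed by each team, across all LINE services and not only Verda.
• There are several LINE-standard monitoring systems that can serve as the storage destination for monitoring data. The appropriate one has to be chosen by examining whether it is free of dependencies on Verda's storage services, whether Verda developers can use it without friction while complying with the company-wide data retention guidelines, and whether it can withstand Verda's data volume.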