Embracing automation - An autoscaler mechanism by using Saltstack and Prometheus

童冠傑 (Jacky_Tung), Sep 2018

About Me
• Software engineer - HTC DeepQ
• AI Platform
• Web Development
• Github: https://github.com/JackyTung
• Slides: https://speakerdeck.com/jackytung

About DeepQ AI Platform
https://ai-platform.deepq.com/
• Purpose:
  • Lower the barrier of AI training
• Features:
  • Simplified model development process
  • Auto hyper-parameter tuning
  • Optimized training environment

Before we start …

Operation Challenges
• Maintain dev/sta/prod environments
• The number of servers to be monitored increases
• Diagnose & provide feedback
• Automate infrastructure management

• GPU instances take a large proportion of cost
[Diagram: Client Side, Training Task, DeepQ AI Platform, GPU instance]

Candidate solutions
• Use existing cloud service solutions (e.g. AWS, GCP)
  • Problem: increases deployment time
  • Instances are terminated when they are idle
• Implement by ourselves
  • Reduce deployment time: reserve instances
  • Scalability: can be applied to existing cloud service instances

Overall architecture
[Diagram: Prometheus pulls metrics from the TARGET INSTANCES; AlertManager notifies the Autoscaler; the Autoscaler calls salt-api; the SALT MASTER executes salt commands to scale instances up or down]


What choices do I have?
[Figure: logos of the candidate tools, Prometheus among them]

Metric based vs Log based
• Typically, metrics are best used for monitoring, profiling, and alerting
• Logs give you the extra level of detail necessary for troubleshooting, debugging, support, and auditing
Ref: https://signalfx.com/blog/metric-log-monitoring-really-need/

Metric based vs Log based

                     Metrics                                 Logs
Exact counter        X                                       O
Error cause          X                                       O
Network bandwidth    Constant amount (e.g. once per 15s)     Linear in the number of events
Storage usage        Small (just sampled numbers)            Large (event details)
Detect incidents     O                                       O

Why Prometheus
• A metric-based monitoring system
• Focuses on time series data monitoring
• We care about metrics like:
  • queue length
  • waiting time
  • number of free instances
  • error count
  • CPU, MEM, disk usage
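To make the pull model concrete, here is a minimal sketch of how gauges such as queue length or free instances could be rendered in the Prometheus text exposition format that a scrape reads. The metric names and the `Gauge` and `render` helpers are hypothetical, not taken from the talk; a real service would more likely use the official Prometheus client library than hand-written output.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Gauge is one sampled value. The type and the metric names used
// below are hypothetical illustrations.
type Gauge struct {
	Name  string
	Help  string
	Value float64
}

// render writes the gauges in the Prometheus text exposition format
// (# HELP, # TYPE, then "name value"), sorted by name so the output
// is deterministic.
func render(gauges []Gauge) string {
	sorted := append([]Gauge(nil), gauges...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })

	var b strings.Builder
	for _, g := range sorted {
		fmt.Fprintf(&b, "# HELP %s %s\n", g.Name, g.Help)
		fmt.Fprintf(&b, "# TYPE %s gauge\n", g.Name)
		fmt.Fprintf(&b, "%s %g\n", g.Name, g.Value)
	}
	return b.String()
}

func main() {
	fmt.Print(render([]Gauge{
		{Name: "training_queue_length", Help: "Tasks waiting for an instance.", Value: 3},
		{Name: "free_instances", Help: "Idle GPU instances.", Value: 0},
	}))
}
```

Serving this string from an HTTP endpoint is what lets Prometheus sample the values at a constant rate, independent of how many events occur.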

Simple Prometheus Architecture
[Diagram: TARGET INSTANCES, Prometheus (pull metrics, query to visualize), AlertManager (push alerts), Autoscaler]
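In this architecture the Autoscaler is notified by AlertManager. One common way to receive such notifications is an AlertManager webhook, which POSTs a JSON document containing an `alerts` list. The sketch below assumes that mechanism (the talk does not say which receiver type was used) and decodes only the fields it needs; the alert name `QueueTooLong` and the `env` label are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Alert holds the subset of one webhook alert that this sketch reads.
type Alert struct {
	Status string            `json:"status"`
	Labels map[string]string `json:"labels"`
}

// WebhookPayload holds the subset of the AlertManager webhook body
// that this sketch reads.
type WebhookPayload struct {
	Status string  `json:"status"`
	Alerts []Alert `json:"alerts"`
}

// firingAlertNames returns the "alertname" label of every alert whose
// status is "firing"; resolved alerts are skipped.
func firingAlertNames(body []byte) ([]string, error) {
	var p WebhookPayload
	if err := json.Unmarshal(body, &p); err != nil {
		return nil, err
	}
	var names []string
	for _, a := range p.Alerts {
		if a.Status == "firing" {
			names = append(names, a.Labels["alertname"])
		}
	}
	return names, nil
}

func main() {
	// Hypothetical notification body for illustration.
	body := []byte(`{"status":"firing","alerts":[{"status":"firing","labels":{"alertname":"QueueTooLong","env":"dev"}}]}`)
	names, err := firingAlertNames(body)
	fmt.Println(names, err)
}
```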


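What is autoscaler
scalecondition(0, 5, 0, 1, 270)
• minimum # of instances: 0
• maximum # of instances: 5
• down threshold: 0
• up threshold: 1
• interval (secs): 270
Evaluated for every environment and instance type:
    for _, env := range []string{"dev", "sta", "prod"} {
        for _, instanceType := range []string{"type1", "type2"} {
            // apply the scale condition to (env, instanceType)
        }
    }
[Figure: #instance plotted against workload, with the threshold marked]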
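The slide shows only the call `scalecondition(0, 5, 0, 1, 270)` and its labels, not the decision logic. A minimal sketch of one plausible reading follows: scale up when the workload reaches the up threshold and the maximum is not yet hit, scale down when the workload falls to the down threshold and the minimum is not yet hit. The `ScaleCondition` type, the `decide` function, and these exact comparison rules are assumptions for illustration.

```go
package main

import "fmt"

// ScaleCondition mirrors the five labeled arguments on the slide.
// The field names are hypothetical.
type ScaleCondition struct {
	MinInstances  int
	MaxInstances  int
	DownThreshold int
	UpThreshold   int
	IntervalSecs  int // assumed: seconds between two evaluations
}

// Action is the outcome of one evaluation.
type Action string

const (
	ScaleUp   Action = "scale-up"
	ScaleDown Action = "scale-down"
	NoOp      Action = "none"
)

// decide applies the assumed rules: grow while the workload is at or
// above the up threshold, shrink while it is at or below the down
// threshold, and always stay within [MinInstances, MaxInstances].
func decide(c ScaleCondition, workload, instances int) Action {
	switch {
	case workload >= c.UpThreshold && instances < c.MaxInstances:
		return ScaleUp
	case workload <= c.DownThreshold && instances > c.MinInstances:
		return ScaleDown
	default:
		return NoOp
	}
}

func main() {
	cond := ScaleCondition{MinInstances: 0, MaxInstances: 5, DownThreshold: 0, UpThreshold: 1, IntervalSecs: 270}
	// Same loops as on the slide; workload and instance counts are
	// fixed sample values here instead of live metrics.
	for _, env := range []string{"dev", "sta", "prod"} {
		for _, instanceType := range []string{"type1", "type2"} {
			fmt.Println(env, instanceType, decide(cond, 2, 1))
		}
	}
}
```

With the values from the slide, a single waiting task is enough to request another instance, and an empty queue releases instances until none are left.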

[Diagram: the Autoscaler sends the scale command]

What is Saltstack?
• A configuration management tool
  • Flexible, scalable enough to maintain tens of thousands of machines
• A remote execution framework
  • Master / Agent
  • Parallel execution
• Secure
  • Salt minion key authentication

[Diagram: the SALT MASTER scales the TARGET INSTANCES down or up]

[Diagram: scale-down, from the SALT MASTER to the TARGET INSTANCES]

[Diagram: scale-up, from the SALT MASTER to the TARGET INSTANCES]
The scale-up command will look like this:
    salt-run cloud.action start {target instance}
It only starts the instance. How do we deploy? Use the Saltstack event-driven system!

What is Event?
• Everything you care about
  • authentication, minion start, job events, cloud events, ...
• Event types: https://docs.saltstack.com/en/latest/topics/event/master_events.html
• The structure of an event: Tag + Data

Event and Reactor
• An event-driven infrastructure
• Event System: fires off events, enabling third-party applications or external processes to react to behavior within Salt
• Reactor System: triggers actions in response to an event
Ref: https://docs.saltstack.com/en/getstarted/overview.html
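The event and reactor idea can be sketched in a few lines: an event is a tag plus data, and a reactor maps tag patterns to actions. The code below is an illustration of the concept only, not Salt's implementation. The `Event`, `Reactor`, `On`, and `Fire` names are hypothetical, and it uses Go's `path.Match`, whose `*` does not cross `/`, which may differ from Salt's own glob matching.

```go
package main

import (
	"fmt"
	"path"
)

// Event follows the "Tag + Data" structure.
type Event struct {
	Tag  string
	Data map[string]string
}

// Reaction is an action triggered by a matching event; it returns a
// description of what it would do.
type Reaction func(Event) string

// Reactor keeps tag patterns and their reactions in registration order.
type Reactor struct {
	patterns []string
	handlers []Reaction
}

// On registers a reaction for every event whose tag matches pattern.
func (r *Reactor) On(pattern string, h Reaction) {
	r.patterns = append(r.patterns, pattern)
	r.handlers = append(r.handlers, h)
}

// Fire runs all reactions whose pattern matches the event tag and
// collects their results.
func (r *Reactor) Fire(e Event) []string {
	var results []string
	for i, p := range r.patterns {
		if ok, _ := path.Match(p, e.Tag); ok {
			results = append(results, r.handlers[i](e))
		}
	}
	return results
}

func main() {
	var r Reactor
	// When a minion starts, react by deploying to it.
	r.On("salt/minion/*/start", func(e Event) string {
		return "deploy " + e.Data["id"]
	})
	fmt.Println(r.Fire(Event{
		Tag:  "salt/minion/gpu-01/start",
		Data: map[string]string{"id": "gpu-01"},
	}))
}
```

This is the piece that closes the gap left by the scale-up command: starting an instance produces a minion start event, and the reaction to that event performs the deployment.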

[Diagram, steps 1 to 3: scale-up from the SALT MASTER to the TARGET INSTANCES; event: salt/minion/target-minions/start; reaction]

Summary
• The autoscale mechanism can be applied to non-container-based instances
• Before surveying new techniques, think about the purpose first
  • Can services be containerized?
  • Do the existing solutions meet our requirements?
  • Monitoring: metrics-based or log-based?
  • Configuration management tool: Ansible, Chef, Puppet, Saltstack, etc.

Thank you
More about us:
- https://deepq.com/
- https://medium.com/htc-research-engineering-blog
