FChain: toward black-box online fault localization for cloud systems

Presentation of the paper "FChain: toward black-box online fault localization for cloud systems" by Hiep Nguyen, Zhiming Shen, Yongmin Tan and Xiaohui Gu, presented at ICDCS 2013

Stefano Zanella

February 07, 2014

Transcript

  1. FChain: Toward Black-box Online Fault Localization for Cloud Systems. Hiep Nguyen, Zhiming Shen, Yongmin Tan, Xiaohui Gu. Presented by Stefano Zanella
  2. Agenda: Why? What? (and what not!) How? Experimental evaluation, results, open questions, paper evaluation
  3. IaaS is complex, no matter whether public or private. http://harish11g.blogspot.it/2012/07/amazon-availability-zones-aws-az.html
  4. Distributed software is complex, and it will be even more so in the future. http://www.mimul.com/pebble/default/2012/09/06/1346890689112.html
  5. Failures are unavoidable… so they must be considered from day 0: resource contentions, software bugs, hardware failures
  6. Failures are unavoidable… so they must be considered from day 0: resource contentions, software bugs, hardware failures, …
  7. Black-box techniques exist only for anomaly detection… or for system metric attribution; some are application-specific or fault-specific
  8. Black-box techniques exist only for anomaly detection… or for system metric attribution; some are application-specific or fault-specific. Still not enough
  9. No prior application knowledge (suitable for IaaS clouds); no requirement for training data (detects known and new anomalies)
  10. Key observations: Performance anomalies often manifest as abnormal system metric fluctuations that are distinct from the normal fluctuation patterns
  11. Key observations: Performance anomalies often manifest as abnormal system metric fluctuations that are distinct from the normal fluctuation patterns. The abnormal metric changes often start at the faulty components and then propagate to non-faulty components via inter-component interactions
  12. Abnormal change point selection: bursty and spiky behavior might be normal (smoothing + outlier detection isn’t enough)
  13. Abnormal change point selection: given change point xt, X = [xt - 20s, xt + 20s]; top(90%, FFT(X))
  14. Abnormal change point selection: given change point xt, X = [xt - 20s, xt + 20s]; FFT^-1(top(90%, FFT(X)))
  15. Abnormal change point selection: given change point xt, X = [xt - 20s, xt + 20s]; percentileOf(FFT^-1(top(90%, FFT(X))), 90)
  16. Abnormal change point selection: given change point xt, X = [xt - 20s, xt + 20s]; threshold = percentileOf(FFT^-1(top(90%, FFT(X))), 90)
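
A minimal Python sketch of the change point filtering on slides 13-16, assuming numpy, 1 s sampling, and that top(90%, FFT(X)) keeps the 90% largest-magnitude frequency components; the paper's exact filtering rule and error definition may differ.

    import numpy as np

    def abnormality_threshold(window, keep_ratio=0.9, error_percentile=90):
        # `window` holds the metric samples in [xt - 20s, xt + 20s].
        window = np.asarray(window, dtype=float)
        spectrum = np.fft.rfft(window)
        magnitudes = np.abs(spectrum)
        # Keep the `keep_ratio` fraction of components with the largest
        # magnitude (assumed reading of "top(90%, FFT(X))").
        cutoff = np.quantile(magnitudes, 1.0 - keep_ratio)
        filtered = np.where(magnitudes >= cutoff, spectrum, 0.0)
        # The inverse FFT reconstructs the expected "normal" fluctuation pattern.
        reconstruction = np.fft.irfft(filtered, n=len(window))
        # threshold = 90th percentile of the prediction (reconstruction) error.
        errors = np.abs(window - reconstruction)
        return np.percentile(errors, error_percentile)

    def is_abnormal_change_point(window, observed_error, **kwargs):
        # A candidate change point is flagged abnormal when its prediction
        # error exceeds the threshold learned from the surrounding 40 s window.
        return observed_error > abnormality_threshold(window, **kwargs)
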
  17. Integrated fault diagnosis: sort all components by fault manifestation time; earliest == faulty; time difference < 2s == concurrent fault
  18. Integrated fault diagnosis: sort all components by fault manifestation time; earliest == faulty; time difference < 2s == concurrent fault; all faulty with the same trend == external fault
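
A hedged sketch of the diagnosis rules on slides 17-18, in the same Python style; the Manifestation fields and the trend representation are illustrative assumptions, while the 2 s concurrency threshold and the ordering rule come from the slides.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class Manifestation:
        component: str
        time: float   # fault manifestation time in seconds (NTP-synchronized)
        trend: str    # e.g. "increasing" or "decreasing" metric trend

    CONCURRENT_THRESHOLD_S = 2.0  # time difference < 2 s == concurrent fault

    def pinpoint(manifestations: List[Manifestation],
                 total_components: int) -> Tuple[str, List[str]]:
        ordered = sorted(manifestations, key=lambda m: m.time)
        # External fault: every component looks faulty and shows the same trend.
        if (len(ordered) == total_components
                and len({m.trend for m in ordered}) == 1):
            return "external fault", []
        # Otherwise blame the earliest manifestation; anything within 2 s of it
        # is reported as a concurrent fault.
        earliest = ordered[0].time
        culprits = [m.component for m in ordered
                    if m.time - earliest < CONCURRENT_THRESHOLD_S]
        return "component fault", culprits
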
  19. Online pinpointing validation: «inter-component dependency discovery» (offline use, black-box); no dependency between components == all faulty
  20. NTP

  21. NTP: error < 0.1 ms (LAN), < 5 ms (WAN); anomaly propagation takes several seconds, so small time skews are tolerable
  22. Metrics: CPU usage, RAM usage, network in, network out, disk read, disk write; 1 s sampling interval
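
For reference, an illustrative per-host sampler for the six metrics on slide 22 using psutil; the paper's own monitoring agent is not described here, only the metric set and the 1 s interval come from the slide.

    import time
    import psutil

    def sample_once():
        net = psutil.net_io_counters()
        disk = psutil.disk_io_counters()
        return {
            "cpu_percent": psutil.cpu_percent(interval=None),
            "ram_percent": psutil.virtual_memory().percent,
            "net_in_bytes": net.bytes_recv,     # cumulative counters; diff
            "net_out_bytes": net.bytes_sent,    # successive samples for rates
            "disk_read_bytes": disk.read_bytes,
            "disk_write_bytes": disk.write_bytes,
        }

    if __name__ == "__main__":
        while True:
            print(sample_once())
            time.sleep(1)  # 1 s sampling interval
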
  23. SLO violations: RUBiS: avg response time > 100 ms; Hadoop: no progress for > 30 s; IBM System S: avg processing time > 20 ms
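
The SLO rules on slide 23 written out as a small check; the thresholds are from the slide, while how each statistic is measured is application-specific and left out.

    def slo_violated(app: str, avg_response_ms: float = 0.0,
                     seconds_without_progress: float = 0.0,
                     avg_processing_ms: float = 0.0) -> bool:
        if app == "RUBiS":
            return avg_response_ms > 100
        if app == "Hadoop":
            return seconds_without_progress > 30
        if app == "IBM System S":
            return avg_processing_ms > 20
        raise ValueError(f"unknown application: {app}")
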
  24. Fault injection: offload bug, load balancing bug, concurrent CPU hog, concurrent disk hog, concurrent memory leak (multi-component)
  25. Config params: look-back window = 100 s (500 s for disk hog); concurrent faults threshold = 2 s; burst extraction window = 20 s; burst spectrum = 90% top frequencies; expected prediction error = 90th percentile
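
The evaluation parameters from slide 25 gathered into one place; the field names are mine, the values are taken from the slide.

    from dataclasses import dataclass

    @dataclass
    class FChainConfig:
        look_back_window_s: int = 100           # 500 s for the disk hog fault
        concurrent_fault_threshold_s: float = 2.0
        burst_extraction_window_s: int = 20     # +/- 20 s around a change point
        burst_spectrum_keep_ratio: float = 0.9  # top 90% of frequencies
        prediction_error_percentile: int = 90   # expected prediction error
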
  26. Summary: up to 90% higher precision; up to 20% higher recall; slave modules demand < 1% CPU, ~3 MB RAM per host
  27. Summary: up to 90% higher precision; up to 20% higher recall; slave modules demand < 1% CPU, ~3 MB RAM per host; high parallelism, high scalability
  28. Summary: up to 90% higher precision; up to 20% higher recall; slave modules demand < 1% CPU, ~3 MB RAM per host; high parallelism, high scalability; still room for improvement (e.g. an adaptive look-back window)
  29. Open questions: How quickly is a failure detected? A more predictive / less reactive approach? Data retention?