FChain: toward black-box online fault localization for cloud systems

Presentation of the paper "FChain: toward black-box online fault localization for cloud systems" by Hiep Nguyen, Zhiming Shen, Yongmin Tan and Xiaohui Gu, presented at ICDCS 2013

Stefano Zanella

February 07, 2014

Transcript

  1. FChain: Toward Black-box Online Fault Localization for Cloud Systems. Hiep Nguyen, Zhiming Shen, Yongmin Tan, Xiaohui Gu. Presented by Stefano Zanella
  2. Agenda: Why? What? (and what not!) How? Experimental evaluation, results, open questions, paper evaluation
  3. IaaS is complex, no matter whether public or private. http://harish11g.blogspot.it/2012/07/amazon-availability-zones-aws-az.html
  4. Distributed software is complex, and it will be even more so in the future. http://www.mimul.com/pebble/default/2012/09/06/1346890689112.html
  5. Failures are unavoidable… so they must be considered from day 0: resource contentions, software bugs, hardware failures
  6. Failures are unavoidable… so they must be considered from day 0: resource contentions, software bugs, hardware failures, …
  7. Black-box techniques exist only for anomaly detection… or for system metric attribution; some are application-specific or fault-specific
  8. Black-box techniques exist only for anomaly detection… or for system metric attribution; some are application-specific or fault-specific. Still not enough
  9. No prior application knowledge (suitable for IaaS clouds); no requirement for training data (detects known and new anomalies)
  10. Key observations: Performance anomalies often manifest as abnormal system metric fluctuations that are distinct from the normal fluctuation patterns
  11. Key observations: Performance anomalies often manifest as abnormal system metric fluctuations that are distinct from the normal fluctuation patterns. The abnormal metric changes often start at the faulty components and then propagate to non-faulty components via inter-component interactions
  12. Abnormal change point selection: bursty and spiky behavior might be normal (smoothing + outlier detection isn’t enough)
  13. Abnormal change point selection: given change point xt, X = [xt - 20s, xt + 20s]; top(90%, FFT(X))
  14. Abnormal change point selection: given change point xt, X = [xt - 20s, xt + 20s]; FFT^-1(top(90%, FFT(X)))
  15. Abnormal change point selection: given change point xt, X = [xt - 20s, xt + 20s]; percentileOf(FFT^-1(top(90%, FFT(X))), 90)
  16. Abnormal change point selection: given change point xt, X = [xt - 20s, xt + 20s]; threshold = percentileOf(FFT^-1(top(90%, FFT(X))), 90)
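
A minimal Python sketch of the change point filtering on slides 13-16, assuming numpy, 1 s sampling, and that top(90%, FFT(X)) keeps the 90% largest-magnitude frequency components; the paper's exact filtering rule and error definition may differ.

    import numpy as np

    def abnormality_threshold(window, keep_ratio=0.9, error_percentile=90):
        # `window` holds the metric samples in [xt - 20s, xt + 20s].
        window = np.asarray(window, dtype=float)
        spectrum = np.fft.rfft(window)
        magnitudes = np.abs(spectrum)
        # Keep the `keep_ratio` fraction of components with the largest
        # magnitude (assumed reading of "top(90%, FFT(X))").
        cutoff = np.quantile(magnitudes, 1.0 - keep_ratio)
        filtered = np.where(magnitudes >= cutoff, spectrum, 0.0)
        # The inverse FFT reconstructs the expected "normal" fluctuation pattern.
        reconstruction = np.fft.irfft(filtered, n=len(window))
        # threshold = 90th percentile of the prediction (reconstruction) error.
        errors = np.abs(window - reconstruction)
        return np.percentile(errors, error_percentile)

    def is_abnormal_change_point(window, observed_error, **kwargs):
        # A candidate change point is flagged abnormal when its prediction
        # error exceeds the threshold learned from the surrounding 40 s window.
        return observed_error > abnormality_threshold(window, **kwargs)
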
  17. Integrated fault diagnosis: sort all components by fault manifestation time; earliest == faulty; time difference < 2s == concurrent fault
  18. Integrated fault diagnosis: sort all components by fault manifestation time; earliest == faulty; time difference < 2s == concurrent fault; all faulty with the same trend == external fault
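
A hedged sketch of the diagnosis rules on slides 17-18, in the same Python style; the Manifestation fields and the trend representation are illustrative assumptions, while the 2 s concurrency threshold and the ordering rule come from the slides.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class Manifestation:
        component: str
        time: float   # fault manifestation time in seconds (NTP-synchronized)
        trend: str    # e.g. "increasing" or "decreasing" metric trend

    CONCURRENT_THRESHOLD_S = 2.0  # time difference < 2 s == concurrent fault

    def pinpoint(manifestations: List[Manifestation],
                 total_components: int) -> Tuple[str, List[str]]:
        ordered = sorted(manifestations, key=lambda m: m.time)
        # External fault: every component looks faulty and shows the same trend.
        if (len(ordered) == total_components
                and len({m.trend for m in ordered}) == 1):
            return "external fault", []
        # Otherwise blame the earliest manifestation; anything within 2 s of it
        # is reported as a concurrent fault.
        earliest = ordered[0].time
        culprits = [m.component for m in ordered
                    if m.time - earliest < CONCURRENT_THRESHOLD_S]
        return "component fault", culprits
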
  19. Online pinpointing validation: «inter-component dependency discovery» (offline use, black-box); no dependency between components == all faulty
  20. NTP

  21. NTP: error < 0.1 ms (LAN), < 5 ms (WAN); anomaly propagation takes several seconds, so small time skews are tolerable
  22. Metrics: CPU usage, RAM usage, network in, network out, disk read, disk write; 1 s sampling interval
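
For reference, an illustrative per-host sampler for the six metrics on slide 22 using psutil; the paper's own monitoring agent is not described here, only the metric set and the 1 s interval come from the slide.

    import time
    import psutil

    def sample_once():
        net = psutil.net_io_counters()
        disk = psutil.disk_io_counters()
        return {
            "cpu_percent": psutil.cpu_percent(interval=None),
            "ram_percent": psutil.virtual_memory().percent,
            "net_in_bytes": net.bytes_recv,     # cumulative counters; diff
            "net_out_bytes": net.bytes_sent,    # successive samples for rates
            "disk_read_bytes": disk.read_bytes,
            "disk_write_bytes": disk.write_bytes,
        }

    if __name__ == "__main__":
        while True:
            print(sample_once())
            time.sleep(1)  # 1 s sampling interval
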
  23. SLO violations: RUBiS: avg response time > 100 ms; Hadoop: no progress for > 30 s; IBM System S: avg processing time > 20 ms
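
The SLO rules on slide 23 written out as a small check; the thresholds are from the slide, while how each statistic is measured is application-specific and left out.

    def slo_violated(app: str, avg_response_ms: float = 0.0,
                     seconds_without_progress: float = 0.0,
                     avg_processing_ms: float = 0.0) -> bool:
        if app == "RUBiS":
            return avg_response_ms > 100
        if app == "Hadoop":
            return seconds_without_progress > 30
        if app == "IBM System S":
            return avg_processing_ms > 20
        raise ValueError(f"unknown application: {app}")
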
  24. Fault injection: offload bug, load balancing bug, concurrent CPU hog, concurrent disk hog, concurrent memory leak (multi-component)
  25. Config params: look-back window = 100 s (500 s for disk hog); concurrent faults threshold = 2 s; burst extraction window = 20 s; burst spectrum = 90% top frequencies; expected prediction error = 90th percentile
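
The evaluation parameters from slide 25 gathered into one place; the field names are mine, the values are taken from the slide.

    from dataclasses import dataclass

    @dataclass
    class FChainConfig:
        look_back_window_s: int = 100           # 500 s for the disk hog fault
        concurrent_fault_threshold_s: float = 2.0
        burst_extraction_window_s: int = 20     # +/- 20 s around a change point
        burst_spectrum_keep_ratio: float = 0.9  # top 90% of frequencies
        prediction_error_percentile: int = 90   # expected prediction error
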
  26. Summary: up to 90% higher precision; up to 20% higher recall; slave modules demand < 1% CPU, ~3 MB RAM per host
  27. Summary: up to 90% higher precision; up to 20% higher recall; slave modules demand < 1% CPU, ~3 MB RAM per host; high parallelism, high scalability
  28. Summary: up to 90% higher precision; up to 20% higher recall; slave modules demand < 1% CPU, ~3 MB RAM per host; high parallelism, high scalability; still room for improvement (e.g. an adaptive look-back window)
  29. Open questions: How quickly is a failure detected? A more predictive / less reactive approach? Data retention?