
Falsification of Cyber-Physical Systems Using Deep Reinforcement Learning


Presentation at Software Engineering Symposium 2020


Yoriyuki Yamagata

September 12, 2020

Transcript

  1. Falsification of Cyber-Physical Systems Using Deep Reinforcement Learning

     Y. Yamagata (1), S. Liu (2), T. Akazaki (3), Y. Duan (2) and J. Hao (2)
     1: AIST, 2: Tianjin University, 3: Fujitsu Laboratories
     Software Engineering Symposium 2020
  2. Falsification

     [Diagram: input (throttle, brake) → deterministic CPS model → output (car speed)]
     Specification: the car speed must stay below 200 km/h
     Goal: find an input (counter-example) which violates the specification
  3. Robustness guided falsification

     Robustness: r = min_{t=1..T} (200 − v_t), where v_t is the car speed at time t
     • Find a counter-example by minimizing the robustness r
     • Casts the falsification problem into a numerical optimization problem
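
As a side note not on the slides, the robustness of the speed specification can be computed directly from a sampled trace; the minimal Python sketch below assumes the 200 km/h bound from the example, and `speed_trace` is an illustrative input.

```python
import numpy as np

def robustness(speed_trace, limit=200.0):
    """Robustness of the spec "speed < limit" over a sampled trace.

    Positive robustness means the spec holds with some margin; a value <= 0
    means the trace violates (falsifies) the specification.
    """
    return np.min(limit - np.asarray(speed_trace))

# Example: a trace that stays below 200 km/h has positive robustness.
print(robustness([120.0, 180.0, 195.0]))   # 5.0
print(robustness([120.0, 201.0, 195.0]))   # -1.0 -> counter-example
```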
  4. Proposed methods

     • Nelder-Mead, genetic programming [Donzé, 2010]
     • Simulated annealing, cross-entropy method [Annpureddy et al., 2011]
     • Monte-Carlo tree search [Zhang et al., 2018]
     • aLVTS [Ernst et al., 2019]
     • Stochastic optimization with adaptive restart [Mathesen et al., 2020]
     • Gradient descent [Bennani et al., 2020]
     • Surrogate model [Menghi, 2020]
  5. Our contribution

     • Recast robustness guided falsification as a reinforcement learning problem
     • Implemented the proposed method using a deep reinforcement learning framework
     • Performed a comparison with S-TaLiRo (a widely used robustness guided falsification tool)
  6. Reinforcement learning problem

     [Diagram: the agent sends an action to the environment and receives a state and a reward]
     Maximize the return R = Σ_{t=1}^{T} γ^t r_t
     Condition: the law of the environment is unknown to the agent
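
For concreteness (not part of the slides), the discounted return the agent maximizes can be written as a one-liner; the discount factor and reward values below are purely illustrative.

```python
def discounted_return(rewards, gamma=0.99):
    """Return R = sum_{t=1}^{T} gamma^t * r_t for a finite episode."""
    return sum(gamma ** t * r for t, r in enumerate(rewards, start=1))

print(discounted_return([0.01, 0.05, 0.9]))  # illustrative reward sequence
```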
  7. Recasting falsification into reinforcement learning

     We want to find an input which minimizes min_{t=1..T} (200 − v_t).
     Using the smooth approximation
       min_{t=1..T} (200 − v_t) ≈ − log Σ_{t=1}^{T} exp[−(200 − v_t)],
     minimizing the robustness amounts to maximizing Σ_{t=1}^{T} exp[−(200 − v_t)].
     We can solve this optimization problem with reinforcement learning, using the per-step reward r_t = exp[−(200 − v_t)].
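
A minimal numerical check, not taken from the paper, that the negative log-sum-exp surrogate tracks the minimum, which is why maximizing the summed per-step rewards exp[−(200 − v_t)] drives the robustness down; the speed trace is illustrative.

```python
import numpy as np

speeds = np.array([150.0, 185.0, 198.0])      # illustrative speed trace v_t
margins = 200.0 - speeds                      # per-step margin to the bound

# Exact robustness and its smooth (log-sum-exp) surrogate.
robustness = margins.min()
smooth_min = -np.log(np.sum(np.exp(-margins)))
print(robustness, smooth_min)                 # close when one margin dominates

# Per-step reward used by the RL agent: maximizing the summed reward
# is equivalent to minimizing the smooth robustness surrogate.
rewards = np.exp(-margins)
print(rewards.sum())
```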
  8. Deep reinforcement learning

     • We use deep reinforcement learning algorithms
       • reinforcement learning algorithms that use deep learning
     • Versatile; can adapt to non-linear system dynamics
     • In particular, we use two algorithms
       • DDQN (Q-learning approach)
       • A3C (actor-critic approach)
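
To make the agent side concrete, here is a hedged sketch of constructing a Double DQN agent with PFRL (the successor of ChainerRL mentioned later in the talk). The network architecture, hyper-parameters, observation size and discretized action count are assumptions for illustration, not the configuration used in the paper.

```python
import numpy as np
import torch
import pfrl

obs_size, n_actions = 4, 9   # assumption: state dimension and discretized input levels

# Small fully connected Q-network; the real architecture is an assumption here.
q_func = torch.nn.Sequential(
    torch.nn.Linear(obs_size, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, n_actions),
    pfrl.q_functions.DiscreteActionValueHead(),
)

agent = pfrl.agents.DoubleDQN(
    q_func,
    torch.optim.Adam(q_func.parameters(), lr=1e-3),
    pfrl.replay_buffers.ReplayBuffer(capacity=10 ** 5),
    gamma=0.99,
    explorer=pfrl.explorers.ConstantEpsilonGreedy(
        epsilon=0.1, random_action_func=lambda: np.random.randint(n_actions)),
    replay_start_size=500,
    target_update_interval=100,
    phi=lambda x: np.asarray(x, dtype=np.float32),  # cast observations for the network
)
```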
  9. Implementation

     [Simulink model diagram: the Falsifier block (A3C/DDQN via ChainerRL) feeds the System Input to the
     System Model (a Simulink subsystem); the System Output goes to the Robustness Monitor (Taliro-Monitor),
     which returns the robustness to the Falsifier]
  10. Implementation (cont.)

     • Falsifier
       • Custom Simulink block, implemented in MATLAB
       • The reinforcement learning part is implemented in Python
       • Uses the Python library ChainerRL (now PFRL)
     • Robustness monitor
       • Reuses the monitor from S-TaLiRo
     • System model (target model)
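
The slides do not include code for this bridge, so the following is a rough, hypothetical sketch of the per-step loop as it could look if driven from Python. `simulate_step` is a stand-in for the actual Simulink simulation (the real tool runs the loop from a custom Simulink block), and the reward follows slide 7.

```python
import numpy as np

def simulate_step(state, action):
    """Hypothetical stand-in for one Simulink simulation step.

    In the actual tool this exchange happens inside a custom Simulink block;
    a toy longitudinal dynamics is used here purely for illustration."""
    throttle = float(action)                    # discretized throttle level
    speed = state[0] + 2.0 * throttle - 0.5     # toy dynamics, not the real model
    return np.array([speed], dtype=np.float32)

class FalsificationEnv:
    """Gym-style environment whose reward is exp[-(200 - v_t)] (slide 7)."""

    def __init__(self, horizon=50):
        self.horizon = horizon

    def reset(self):
        self.t = 0
        self.state = np.zeros(1, dtype=np.float32)
        return self.state

    def step(self, action):
        self.t += 1
        self.state = simulate_step(self.state, action)
        reward = float(np.exp(-(200.0 - self.state[0])))
        done = self.t >= self.horizon or self.state[0] >= 200.0
        return self.state, reward, done, {}
```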
  11. Experiment

     • Use 3 models (Chasing Cars, Automatic Transmission, Power Train Control)
     • The falsifier is allowed to run 200 simulations to falsify a specification in each trial
     • 100 trials are repeated for each model and property, because the results vary due to the stochastic nature of the agent
     • No pre-training, no hyper-parameter tuning
     • At the start of each trial, the "memory" of the agent is reset; the memory is kept between simulations within a trial
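
A schematic of the trial loop implied by this slide, not the actual experiment driver; it reuses the illustrative `FalsificationEnv` from the sketch above, and `make_agent` is a hypothetical helper that builds a fresh PFRL agent.

```python
N_TRIALS, MAX_SIMULATIONS = 100, 200

results = []
for trial in range(N_TRIALS):
    agent = make_agent()                 # hypothetical helper: fresh agent, i.e. memory reset per trial
    env = FalsificationEnv()
    falsified_at = None
    for sim in range(1, MAX_SIMULATIONS + 1):   # memory persists across simulations in a trial
        obs, done = env.reset(), False
        min_margin = float("inf")
        while not done:
            action = agent.act(obs)             # PFRL agents expose act/observe
            obs, reward, done, _ = env.step(action)
            agent.observe(obs, reward, done, reset=False)
            min_margin = min(min_margin, 200.0 - obs[0])
        if min_margin <= 0.0:                   # robustness <= 0: specification falsified
            falsified_at = sim
            break
    results.append(falsified_at)                # number of simulations needed, or None
```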
  12. Evaluation metrics

     Use the number of simulations required to falsify
     Reason: execution time depends on
     • implementation details (the combination of Python and MATLAB slows down simulation)
     • scheduling (we run experiments concurrently on a single machine)
     We also find that the time required for the reinforcement learning part is insignificant
  13. Baselines

     • RAND: uniform random input
     • CE: cross-entropy method
     • SA: simulated annealing
  14. Statistical analysis

     We need to compare two random variables X and Y whose distributions are unknown and highly skewed.
     Therefore, we do not use averages etc. but the relative effect size measure
       p = P(X < Y) + (1/2) P(X = Y);
     p > 0.5 means that X tends to be smaller than Y.
     We perform a non-parametric statistical test with the null hypothesis p = 0.5.
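
As an illustration (not from the slides), the relative effect size can be estimated directly from two samples, and a standard non-parametric test such as scipy's Mann-Whitney U test can be applied; the exact test used in the paper may differ, and the sample data below is made up.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def relative_effect_size(x, y):
    """Estimate p = P(X < Y) + 0.5 * P(X = Y) from two samples."""
    x, y = np.asarray(x), np.asarray(y)
    less = (x[:, None] < y[None, :]).mean()
    ties = (x[:, None] == y[None, :]).mean()
    return less + 0.5 * ties

# Made-up samples: number of simulations needed by two algorithms over trials.
a3c = np.array([12, 30, 45, 200, 18])
rand = np.array([60, 200, 90, 200, 150])

print(relative_effect_size(a3c, rand))               # > 0.5: A3C tends to need fewer
print(mannwhitneyu(a3c, rand, alternative="two-sided"))
```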
  15. Chasing Cars model

     [Simulink diagram: Car 1 takes the throttle and brake inputs; Cars 2–5 are chained so that each car's y_out feeds the next car's y_in]
  16. Chasing Cars model, falsified properties

     The properties are artificial and gradually become more complex from φ1 to φ5
  17. Automatic Transmission © MathWorks Inc.

  18. Automatic Transmission, falsified properties Modified from [Hoxha and Fainekos, 2014]

  19. Power Train Control

     [Diagram of the fuel control system verification and validation stub system: inputs Pedal Angle and Engine Speed; outputs A/F, A/F ref, Verification measurement and Mode] [Deshmukh et al., 2014]
  20. Power Train Control, falsified properties [Deshmukh et al., 2014]

  21. Results: Chasing Cars

     [Box plots: number of simulations (0–200) per property ϕ1–ϕ5 for algorithms A3C, DDQN, RAND, CE and SA]
  22. Results: Chasing Cars

     [Same box plots as the previous slide, annotated: smaller is better; A3C and DDQN are the proposed methods, RAND, CE and SA the baselines]
  23. p between proposed methods and baselines

     Smaller values mean the proposed methods are better; bold/italic entries indicate a statistically significant difference
  24. Results: Automatic Transmission

     [Box plots: number of simulations (0–200) per property ϕ1–ϕ9 for algorithms A3C, DDQN, RAND, CE and SA]
  25. p between proposed methods and baselines

  26. Results: Power Train Control

     [Box plots: number of simulations (0–200) per property ϕ26, ϕ27, ϕ30–ϕ34 for algorithms A3C, DDQN, RAND, CE and SA]
  27. p between proposed methods and baselines

  28. Summary of experiments

     Chasing Cars: the proposed methods almost always outperform the baselines, except φ2, for which RAND outperforms all methods
     Automatic Transmission: A3C either outperforms the baselines or shows equal performance; the performance of DDQN is unstable
     Power Train Control: the proposed methods underperform the baselines
  29. Observations and conclusion

     The proposed methods often outperform the baselines, but not always
     However, whenever the proposed methods underperform the baselines, RAND outperforms or performs equally to the other methods
     In conclusion, a combination of reinforcement learning and uniform random inputs could be a good approach
  30. Future work

     • Investigate the causes of the performance differences
     • Support properties other than safety properties
     • Vary the time-step automatically
     • Hyper-parameter tuning (but how?)
     • Compare different reinforcement learning algorithms
     • Improve usability
  31. More info

     • Paper: Y. Yamagata, S. Liu, T. Akazaki, Y. Duan and J. Hao, "Falsification of Cyber-Physical Systems Using Deep Reinforcement Learning," IEEE Transactions on Software Engineering, 2020
     • Implementation: https://github.com/yoriyuki-aist/Falsify
     • Comparison to other tools: ARCH-COMP 2019 Category Report: Falsification, EasyChair, 2019