
Falsification of Cyber Physical System Using Deep Reinforcement Learning



Yoriyuki Yamagata

February 27, 2020

Transcript

  1. Falsification of Cyber Physical System Using Deep Reinforcement Learning
     Y. Yamagata (AIST), S. Liu (Tianjin University), T. Akazaki (Fujitsu Laboratories Ltd), Y. Duan (Tianjin University), J. Hao (Tianjin University)
     February 27, 2020
  2. Table of contents
     1. Cyber Physical System (CPS)
     2. Falsification
     3. Reinforcement learning
     4. Falsification using reinforcement learning
     5. Case studies
  3. Cyber Physical System (CPS)

  4. Cyber Physical System (CPS)
     A Cyber Physical System (CPS) is a system in which physical parts and software coexist.
     In this talk, we focus on reactive CPSs, which take inputs, change their state, and produce outputs.
  5. Challenges posed by CPS
     • Difficult to modularize: tight coupling of physical and software parts
     • Difficult to test: infinite number of possible states
     • Difficult to reason about: hybrid systems are not well understood theoretically
     Yet many CPSs are safety critical, and thus need high confidence.
  6. Example: self-driving car
     A test drive of 11 billion miles is required to prove that a self-driving car has a lower fatality rate than human drivers [Kalra and Paddock, 2016].
     Mixed-reality autonomous vehicle testing methods [Huang et al., 2016]:
     • Start from simulated scenarios
     • Move to the real world
     • Go back to simulation for newly discovered scenarios
     This takes a lot of manual effort, yet may still miss important corner cases.
  7. Falsification

  8. Robustness guided falsification: Idea
     Specification + simulated model (of the CPS) ⇒ counter-examples (corner cases violating the specification)
     • The specification is described in (metric or signal) temporal logic
     • The system is modelled in a system modeling language (MATLAB/Simulink)
     • To find a counter-example, minimize the robustness of the system output with respect to the specification
  9. Metric Temporal Logic (MTL)
     Definition. MTL formulas F are given by
     F ::= a | true | F ∧ F | F ∨ F | ¬F | □_I F | ◇_I F | □⁻_I F | ◇⁻_I F
     where I is an interval on [0, ∞); I = [0, ∞) is omitted.
     • □_I, □⁻_I: always within interval I of the future/past from now
     • ◇_I, ◇⁻_I: possibly within interval I of the future/past from now
  10. Semantics of MTL
      Definition. Our semantics uses discrete time:
      • y = y1, y2, . . . is a sequence of outputs
      • t = t1, t2, . . . is a sequence of sampled time instances
      • Y is the (metric) space of outputs
      • ⟦a⟧ ⊆ Y is the interpretation of a
  11. Semantics of MTL (continued)
      Definition.
      (n, y, t) ⊨ a        ⇔ yn ∈ ⟦a⟧
      (n, y, t) ⊨ F1 ∧ F2  ⇔ (n, y, t) ⊨ F1 and (n, y, t) ⊨ F2
      (n, y, t) ⊨ F1 ∨ F2  ⇔ (n, y, t) ⊨ F1 or (n, y, t) ⊨ F2
      (n, y, t) ⊨ ¬F       ⇔ not (n, y, t) ⊨ F
      (n, y, t) ⊨ □_I F    ⇔ for every i with ti − tn ∈ I, (i, y, t) ⊨ F
      (n, y, t) ⊨ ◇_I F    ⇔ there is an i with ti − tn ∈ I such that (i, y, t) ⊨ F
      (n, y, t) ⊨ □⁻_I F   ⇔ for every i with tn − ti ∈ I, (i, y, t) ⊨ F
      (n, y, t) ⊨ ◇⁻_I F   ⇔ there is an i with tn − ti ∈ I such that (i, y, t) ⊨ F
      Here n and i are time indices.
  12. Robustness rob(F) of an MTL formula F
      Definition.
      D(x, S) =   inf{‖x − y‖ | y ∉ S}   if x ∈ S
                 −inf{‖x − y‖ | y ∈ S}   if x ∉ S
      where ‖x − y‖ is the distance between x and y in Y.
  13. Robustness rob(F) of an MTL formula F (continued)
      Definition.
      rob(n, y, t, a)       = D(yn, ⟦a⟧)
      rob(n, y, t, F1 ∧ F2) = min(rob(n, y, t, F1), rob(n, y, t, F2))
      rob(n, y, t, F1 ∨ F2) = max(rob(n, y, t, F1), rob(n, y, t, F2))
      rob(n, y, t, □_I F)   = min{rob(i, y, t, F) | ti ∈ tn + I}
      rob(n, y, t, ◇_I F)   = max{rob(i, y, t, F) | ti ∈ tn + I}
      rob(n, y, t, □⁻_I F)  = min{rob(i, y, t, F) | ti ∈ tn − I}
      rob(n, y, t, ◇⁻_I F)  = max{rob(i, y, t, F) | ti ∈ tn − I}
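To make the definitions concrete, here is a minimal pure-Python sketch of the robustness of a simple always-formula □(y ∈ [lo, hi]) over a sampled signal. The function names and the signal are made up for illustration; this is not the Taliro-Monitor implementation used later in the talk.

```python
def signed_dist_to_interval(x, lo, hi):
    """D(x, S) for S = [lo, hi] on the real line:
    positive inside S (distance to the boundary), negative outside."""
    if lo <= x <= hi:
        return min(x - lo, hi - x)
    return -min(abs(x - lo), abs(x - hi))

def rob_always_in_interval(ys, lo, hi):
    """rob(0, y, t, always(y in [lo, hi])): the min over all samples."""
    return min(signed_dist_to_interval(y, lo, hi) for y in ys)

# A signal that stays within [-2, 2]: positive robustness (property holds).
assert rob_always_in_interval([0.0, 1.0, -1.5], -2.0, 2.0) == 0.5
# A signal that leaves the interval: negative robustness (falsified).
assert rob_always_in_interval([0.0, 3.0], -2.0, 2.0) == -1.0
```

The sign of the result matches the theorem on the next slide: a positive value certifies satisfaction, a negative value certifies violation.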
  14. Robustness and truth
      Theorem.
      • If rob(n, y, t, F) > 0 then (n, y, t) ⊨ F
      • If rob(n, y, t, F) < 0 then (n, y, t) ⊭ F
      Remark. If rob(n, y, t, F) = 0 then we cannot determine whether (n, y, t) ⊨ F or (n, y, t) ⊭ F.
  15. Robustness guided falsification
      Minimize rob(n, f(x), t, F) to find an x which makes rob(n, f(x), t, F) < 0.
      Many numerical optimization techniques are used:
      • Generic methods: simulated annealing, cross-entropy method [Annpureddy et al., 2011], Nelder–Mead, genetic programming [Donzé, 2010]
      • Incremental: Monte-Carlo tree search [Zhang et al., 2018], Las Vegas tree search [Ernst et al., 2019]
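As the simplest instance of this scheme, a uniform-random falsification loop can be sketched as follows. `simulate`, `robustness`, and `sample_input` are hypothetical stand-ins for the Simulink model, the MTL monitor, and the input generator; the toy "model" and "spec" at the bottom are made up for illustration.

```python
import random

def falsify(simulate, robustness, sample_input, max_simulations=200):
    """Robustness-guided falsification by uniform random search.
    `simulate` maps an input x to an output signal y;
    `robustness` maps y to rob(0, y, t, F)."""
    best_x, best_rob = None, float("inf")
    for _ in range(max_simulations):
        x = sample_input()
        rob = robustness(simulate(x))
        if rob < best_rob:
            best_x, best_rob = x, rob
        if best_rob < 0:           # negative robustness: counter-example found
            return best_x, best_rob
    return best_x, best_rob        # no falsifying input found in the budget

# Toy stand-ins: the "model" squares the input, the "spec" asks y <= 0.5,
# so any |x| > sqrt(0.5) is a counter-example.
random.seed(0)
x, rob = falsify(lambda x: x * x, lambda y: 0.5 - y,
                 lambda: random.uniform(-1.0, 1.0))
assert rob < 0 and abs(x) > 0.7071
```

The optimizers listed above (SA, CE, MCTS, ...) replace the blind `sample_input` with proposals biased toward low robustness.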
  16. Reinforcement learning

  17. Reinforcement learning (RL)
      [Diagram: the agent sends an action to the environment; the environment returns a state and a reward to the agent.]

  18. Markov Decision Process
      Definition. A Markov Decision Process (MDP) is a tuple (X, A, P):
      • X: the set of states
      • A: the set of actions
      • P: a function from (x, a) ∈ X × A to a probability distribution D on X × ℝ; D gives the distribution of the next state and the reward
      Remark. The system is Markovian, in the sense that the next state and the reward depend only on the current state and the action.
  19. Markov Reward Process
      Definition. A policy π is a map from X to a distribution over A.
      Definition. A Markov Reward Process (MRP) is a pair (X, R):
      • X: the set of states
      • R: a function from X to a distribution over X × ℝ
      Remark. An MDP (X, A, P) and a policy π give rise to an MRP (X, R).
  20. Reinforcement learning: Definition
      Definition (Reinforcement learning problem).
      • (x1, r1), (x2, r2), . . . : a sequence generated by an MRP
      • γ ∈ (0, 1]: discount rate; 1 ≤ T ≤ ∞: horizon
      Find a policy π which maximizes the expected reward R = E[Σ_{i=1}^T γ^i ri].
      Remark. π need not depend on the history x1, x2, . . . , xi.
  21. Q- and V-functions
      Definition.
      Q^π(x, a) = E[Σ_{i=1}^T γ^i ri | x0 = x, a0 = a]
      V^π(x)    = E[Σ_{i=1}^T γ^i ri | x0 = x]
      Q(x, a)   = sup_π Q^π(x, a)
      V(x)      = sup_π V^π(x)
      Remark. Once the Q-function is known, the greedy strategy, which chooses the action a maximizing Q(x, a), is the best strategy.
  22. Two approaches to reinforcement learning
      • Q-learning: directly learn the Q-function and use the greedy strategy. Double Deep Q-Network (DDQN) is chosen as a representative.
      • Actor-Critic: follow a policy π, estimate its V^π, then update π. Asynchronous Advantage Actor-Critic (A3C) is chosen as a representative.
  23. Falsification using reinforcement learning

  24. Finite future reach and pure past dependent safety properties
      Definition.
      • Future reach fr(F): how far into the future the truth of F depends
      • If fr(F) < ∞, F has finite future reach
      • If fr(F) = 0, F is pure past dependent
      • An F with fr(F) < ∞ is a finite future reach safety property
      • An F with fr(F) = 0 is a pure past dependent safety property
  25. Falsification to reinforcement learning
      F: pure past dependent, T: time horizon
      rob(0, y, t, F) = min{rob(n, y, t, F) | 0 ≤ n ≤ T}
                      ≈ −log Σ_{n=0}^T exp{−rob(n, y, t, F)}
      To minimize rob(0, y, t, F), maximize Σ_{n=0}^T exp{−rob(n, y, t, F)}.
      This is a reinforcement learning problem with γ = 1 and rn = exp{−rob(n, y, t, F)}.
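The log-sum-exp ("softmin") approximation of the minimum used above can be checked numerically: −log Σ exp(−xn) is always a lower bound on min xn and is within log N of it. A small self-contained check, with made-up robustness values:

```python
import math

def softmin(xs):
    """Smooth approximation of min(xs) via log-sum-exp:
    -log(sum(exp(-x))) <= min(xs) <= -log(sum(exp(-x))) + log(len(xs))."""
    m = min(xs)  # subtract the min first for numerical stability
    return m - math.log(sum(math.exp(-(x - m)) for x in xs))

robs = [3.0, 1.0, 4.0, 1.5]          # hypothetical per-step robustness values
approx = softmin(robs)
assert approx <= min(robs)                        # always a lower bound
assert min(robs) - approx <= math.log(len(robs))  # ...and within log N of it
```

Because maximizing Σ exp(−rob(n, ...)) is equivalent to minimizing this smooth surrogate of the minimum, the sum of per-step rewards rn = exp{−rob(n, ...)} is a valid RL return.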
  26. Falsification to reinforcement learning (continued)
      Remark. Because F is pure past dependent,
      exp{−rob(n, y, t, F)} = exp{−rob(n, (y0, . . . , yn), (t1, . . . , tn), F)}
      Thus rn can be computed incrementally. Further, yn = f(x0, . . . , xn), therefore xn can be determined incrementally.
      Remark. rob and f are not Markovian.
  27. Finite future reach safety property
      F: finite future reach formula with r = fr(F)
      rob(0, y, t, F) ≈ rob(0, y, t, ◇⁻_[r,r] F)
      Remark.
      • ◇⁻_[r,r] F is pure past dependent
      • If the time horizon is (−∞, ∞), then rob(0, y, t, F) = rob(0, y, t, ◇⁻_[r,r] F)
  28. Case studies

  29. Implementation
      [Diagram: the falsifier (A3C and DDQN via ChainerRL) generates the system input for the system model (a Simulink subsystem); the robustness monitor (Taliro-Monitor) computes the robustness of the system output and feeds it back to the falsifier.]
  30. Experiment setting
      • y is sampled every ∆T = 1 and x is generated at the same rate
      • x is interpolated by a piece-wise linear function
      • For each property, repeat "trials" 100 times
      • For each trial, run simulations up to 200 times
      • Measure the number of simulations required to falsify the property
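The piece-wise linear interpolation of the input signal can be sketched in a few lines of pure Python. The function name and the control points are illustrative, not taken from the actual implementation:

```python
def piecewise_linear(ts, xs, t):
    """Interpolate control points (ts[i], xs[i]) linearly at time t.
    ts must be increasing; t is clamped to [ts[0], ts[-1]]."""
    if t <= ts[0]:
        return xs[0]
    if t >= ts[-1]:
        return xs[-1]
    for i in range(len(ts) - 1):
        if ts[i] <= t <= ts[i + 1]:
            w = (t - ts[i]) / (ts[i + 1] - ts[i])
            return (1 - w) * xs[i] + w * xs[i + 1]

# Control points generated every dT = 1, queried at a finer simulation step.
ts, xs = [0.0, 1.0, 2.0], [0.0, 10.0, 4.0]
assert piecewise_linear(ts, xs, 0.5) == 5.0
assert piecewise_linear(ts, xs, 1.5) == 7.0
```

The RL agent thus only chooses one input value per ∆T; the simulator sees a continuous signal between those control points.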
  31. Comparison between baselines and our methods
      Methods
      • Baselines: uniform random (UR), simulated annealing (SA) and the cross-entropy method (CE)
      • Our methods: A3C and DDQN (Double Deep Q-Network)
      Comparisons
      • Overview using box diagrams
      • Statistical testing I: uniform random against the rest
      • Statistical testing II: A3C and DDQN against simulated annealing and the cross-entropy method
  32. Relative effect size measure
      Definition. The relative effect size measure p of two random variables X and Y is
      p = P(X < Y) + (1/2) P(X = Y)
      Remark
      • p > 0.5 implies X tends to be smaller than Y
      • p should not be confused with the p-value of statistical testing
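Estimated from two finite samples, this definition reduces to a pairwise comparison count. A minimal sketch (the sample data are made up):

```python
def relative_effect_size(xs, ys):
    """Estimate p = P(X < Y) + 0.5 * P(X = Y) from two samples,
    comparing every pair; ties count half."""
    wins = sum(1.0 if x < y else 0.5 if x == y else 0.0
               for x in xs for y in ys)
    return wins / (len(xs) * len(ys))

# xs tends to be smaller than ys, so p > 0.5.
assert relative_effect_size([1, 2], [2, 3]) == 0.875
# Identical samples give exactly p = 0.5.
assert relative_effect_size([5], [5]) == 0.5
```

In the tables that follow, X is the number of simulations needed by one method and Y that of another, so p > 0.5 means the first method falsifies with fewer simulations.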
  33. Chasing Cars (CC)
      [Diagram: five cars in a chain; Car 1 takes throttle and brake inputs, and each following car takes the y_out of the car ahead as its y_in.]
  34. Falsified properties
      id   Formula
      ϕ1   □(y5 − y4 ≤ 40.0)
      ϕ2   □_[0,70] ◇_[0,30] (y5 − y4 ≥ 15)
      ϕ3   □_[0,80] ((□_[0,20] y2 − y1 ≤ 20) ∨ (◇_[0,20] y5 − y4 ≥ 40))
      ϕ4   □_[0,65] ◇_[0,30] □_[0,5] (y5 − y4 ≥ 8)
      ϕ5   □_[0,72] ◇_[0,8] ((□_[0,5] y2 − y1 ≥ 9) → (□_[5,20] y5 − y4 ≥ 9))
      Formulas are artificial examples with gradually increasing complexity.
  35. Overview of CC
      [Box plot: number of simulations (0–200) needed to falsify ϕ1–ϕ5, for each algorithm (A3C, DDQN, RAND, CE, SA).]
  36. Comparison to UR
      Properties   A3C     DDQN    CE      SA
      ϕ1           0.001   0.001   0.001   0.350
      ϕ2           0.920   0.920   0.999   0.999
      ϕ3           0.001   0.135   0.005   0.295
      ϕ4           0.380   0.320   0.500   0.500
      ϕ5           0.005   0.001   0.001   0.455
      Null hypothesis: p = 0.5. Bold and italic indicate a statistically significant difference.
  37. Comparison of A3C and DDQN against CE and SA
            A3C             DDQN
            CE      SA      CE      SA
      ϕ1    0.081   0.001   0.052   0.001
      ϕ2    0.112   0.004   0.147   0.011
      ϕ3    0.132   0.015   0.778   0.280
      ϕ4    0.380   0.380   0.320   0.320
      ϕ5    0.330   0.009   0.237   0.002
  38. Automatic Transmission Controller (ATC)
      https://mathworks.com/help/simulink/slref/modeling-an-automatic-transmission-controller.html

  39. Falsified properties
      id   Formula
      ϕ1   □(ω ≤ 4770)
      ϕ2   □(v ≤ 170 ∧ ω ≤ 4770)
      ϕ3   □((g2 ∧ ◇_[0,0.1] g1) → □_[0.1,1.0] ¬g2)
      ϕ4   □((¬g1 ∧ ◇_[0,0.1] g1) → □_[0.1,1.0] g1)
      ϕ5   ⋀_{i=1..4} □((¬gi ∧ ◇_[0,0.1] gi) → □_[0.1,1.0] gi)
      ϕ6   (□_[0,10] ω ≤ 4550) → (□_[10,20] v ≤ 160)
      ϕ7   □(v ≤ 160)
      ϕ8   □_[0,25] ¬(70 ≤ v ≤ 80)
      ϕ9   ¬◇_[0,20] (¬g4 ∧ ω ≥ 3100)
      v: vehicle speed, ω: engine speed, gi: gear positions
  40. Falsified properties (continued)
      Remark
      • φ1–φ5 are φAT1–φAT5 in [Hoxha et al., 2014], but the parameters are different
      • φ6 is derived from φAT6 but modified into a safety property
      • φ7–φ9 are our own; the time horizon is longer (100 instead of 30)
  41. Result
      [Box plot: number of simulations (0–200) needed to falsify ϕ1–ϕ9, for each algorithm (A3C, DDQN, RAND, CE, SA).]
  42. Comparison of A3C, DDQN, CE, SA against UR
            A3C     DDQN    CE      SA
      ϕ1    0.190   0.001   0.475   0.490
      ϕ2    0.200   0.015   0.500   0.500
      ϕ3    0.556   0.878   0.471   0.863
      ϕ4    0.507   0.824   0.369   0.910
      ϕ5    0.542   0.837   0.520   0.561
      ϕ6    0.170   0.001   0.500   0.500
      ϕ7    0.315   0.295   0.500   0.500
      ϕ8    0.960   0.931   0.433   0.899
      ϕ9    0.315   0.140   0.500   0.500
  43. Comparison of A3C and DDQN against CE and SA
            A3C             DDQN
            CE      SA      CE      SA
      ϕ1    0.206   0.196   0.004   0.002
      ϕ2    0.200   0.200   0.015   0.015
      ϕ3    0.524   0.275   0.875   0.414
      ϕ4    0.561   0.215   0.930   0.355
      ϕ5    0.523   0.475   0.999   0.723
      ϕ6    0.170   0.170   0.001   0.001
      ϕ7    0.315   0.315   0.295   0.295
      ϕ8    0.968   0.525   0.960   0.180
      ϕ9    0.315   0.315   0.140   0.140
  44. PTC model
      [Diagram: the Fuel Control System verification stub system takes Pedal Angle and Engine Speed as inputs and outputs A/F, A/F ref, a verification measurement, and the controller Mode.]
      This model [Jin et al., 2014] uses the sampling period ∆T = 5 and a piece-wise constant x.
  45. Falsified properties
      id    Formula
      ϕ26   □_[11,50] |µ| ≤ 0.2
      ϕ27   □_[11,50] (rise ∨ fall → □_[1,5] |µ| ≤ 0.15)
      ϕ30   □_[11,50] µ ≥ −0.25
      ϕ31   □_[11,50] µ ≤ 0.2
      ϕ32   □_[11,50] (power ∧ ◇_[0,0.1] normal → □_[1,5] |µ| ≤ 0.2)
      ϕ33   □_[11,50] (power → |µ| ≤ 0.2)
      ϕ34   □_[0,50] (sensor fail → □_[1,5] |µ| ≤ 0.15)
      Properties are derived from [Jin et al., 2014].
  46. Overview of results
      [Box plot: number of simulations (0–200) needed to falsify ϕ26, ϕ27 and ϕ30–ϕ34, for each algorithm (A3C, DDQN, RAND, CE, SA).]
  47. Comparison to UR
             A3C     DDQN    CE      SA
      ϕ26    0.705   0.914   0.615   0.654
      ϕ27    0.588   0.999   0.575   0.696
      ϕ30    0.796   0.926   0.758   0.975
      ϕ31    0.586   0.845   0.671   0.380
      ϕ32    0.725   0.725   0.274   0.631
      ϕ33    0.524   0.844   0.552   0.588
      ϕ34    0.721   0.836   0.298   0.570
  48. Comparison of A3C and DDQN against CE and SA
             A3C             DDQN
             CE      SA      CE      SA
      ϕ26    0.616   0.511   0.874   0.682
      ϕ27    0.501   0.455   0.900   0.925
      ϕ30    0.644   0.365   0.835   0.455
      ϕ31    0.414   0.684   0.688   0.879
      ϕ32    0.875   0.600   0.875   0.600
      ϕ33    0.478   0.459   0.812   0.747
      ϕ34    0.812   0.607   0.942   0.711
  49. Observations
      • A3C underperforms the baselines in only a few cases
      • A3C exhibits more stable performance than DDQN
      • Whenever CE or SA outperforms A3C or DDQN, UR also outperforms A3C or DDQN
      Caveat: we adjusted the significance level to perform multiple statistical tests, and multiple testing tends to give conservative results.
  50. More on our research
      Falsification of Cyber-Physical Systems Using Deep Reinforcement Learning
      Yoriyuki Yamagata, Shuang Liu, Takumi Akazaki, Yihai Duan and Jianye Hao
      IEEE Transactions on Software Engineering, to appear
      Includes an analysis of execution time, the impact of the logsumexp approximation, and a detailed analysis of the experimental results.
  51. References i
      Annpureddy, Y., Liu, C., Fainekos, G. E., and Sankaranarayanan, S. (2011). S-TaLiRo: A tool for temporal logic falsification for hybrid systems. In Tools and Algorithms for the Construction and Analysis of Systems - 17th International Conference, TACAS 2011, Saarbrücken, Germany, March 26-April 3, 2011, Proceedings, pages 254–257.
  52. References ii
      Hoxha, B., Abbas, H., and Fainekos, G. (2014). Benchmarks for temporal logic requirements for automotive systems. Proc. of Applied Verification for Continuous and Hybrid Systems.
      Donzé, A. (2010). Breach, a toolbox for verification and parameter synthesis of hybrid systems. In Touili, T., Cook, B., and Jackson, P. B., editors, Computer Aided Verification, 22nd International Conference, CAV 2010, Edinburgh, UK, July 15-19, 2010,
  53. References iii
      Proceedings, volume 6174 of Lecture Notes in Computer Science, pages 167–170. Springer.
      Ernst, G., Sedwards, S., Zhang, Z., and Hasuo, I. (2019). Fast falsification of hybrid systems using probabilistically adaptive input. In International Conference on Quantitative Evaluation of Systems, pages 165–181. Springer.
      Huang, W., Wang, K., Lv, Y., and Zhu, F. (2016). Autonomous vehicles testing methods review. In 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), pages 163–168. IEEE.
  54. References iv
      Jin, X., Deshmukh, J. V., Kapinski, J., Ueda, K., and Butts, K. (2014). Powertrain control verification benchmark. In Proceedings of the 17th International Conference on Hybrid Systems: Computation and Control, HSCC '14, pages 253–262, New York, NY, USA. ACM.
      Kalra, N. and Paddock, S. M. (2016). Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability?
  55. References v
      Transportation Research Part A: Policy and Practice, 94:182–193.
      Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
      Zhang, Z., Ernst, G., Sedwards, S., Arcaini, P., and Hasuo, I. (2018). Two-layered falsification of hybrid systems guided by Monte Carlo tree search.
  56. References vi
      IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 37(11):2894–2905.
  57. Textbooks for RL
      • Algorithms for Reinforcement Learning, Szepesvári
      • Reinforcement Learning, Sutton and Barto
      • Dynamic Programming and Optimal Control, Bertsekas
  58. Q-learning

  59. Q-learning: basic idea
      For each step, repeat this process:
      • x: current state, a: action, r: reward, x′: next state
      • Q: current estimate of the Q-function
      δ = r + γ · max_{a′∈A} Q(x′, a′) − Q(x, a)
      where r + γ · max_{a′∈A} Q(x′, a′) is the new estimate of Q(x, a). Update Q to reduce δ.
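The update above can be sketched as tabular Q-learning with ε-greedy exploration. This is a toy sketch of the basic idea on this slide, not the deep DDQN used in the experiments; the two-state environment and all hyperparameters are made up.

```python
import random

def q_learning(step, n_states, n_actions, episodes=500,
               alpha=0.5, gamma=0.9, eps=0.2):
    """Tabular Q-learning with epsilon-greedy exploration.
    `step(x, a)` returns (reward, next_state, done); a stand-in environment."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        x, done = 0, False
        while not done:
            a = (random.randrange(n_actions) if random.random() < eps
                 else max(range(n_actions), key=lambda a: Q[x][a]))
            r, x2, done = step(x, a)
            target = r + (0.0 if done else gamma * max(Q[x2]))
            Q[x][a] += alpha * (target - Q[x][a])  # reduce the TD error delta
            x = x2
    return Q

# Two-state chain: action 1 reaches the goal (reward 1), action 0 stays put.
def step(x, a):
    return (1.0, 1, True) if a == 1 else (0.0, 0, False)

random.seed(1)
Q = q_learning(step, n_states=2, n_actions=2)
assert Q[0][1] > Q[0][0]   # greedy policy learns to move to the goal
```

In the talk's setting, `step` would run one ∆T of the Simulink simulation and return the robustness-based reward rn.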
  60. Deep Q-Network (DQN) [Mnih et al., 2013]
      Represent the Q-function by a deep neural network. Stability is a major issue. Key ideas:
      • Experience replay: memorize experiences ei = (xi, ai, ri, xi+1) and choose random experiences for each update step
      • Use two Q-functions Q and Q⁻:
        δ = r + γ · max_{a′∈A} Q⁻(x′, a′) − Q(x, a)
        where Q⁻ is periodically updated by Q
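The two ideas can be sketched without any neural network: a replay buffer of transitions and regression targets computed from a frozen Q⁻. The buffer contents and the `q_minus` values below are made up; this is a minimal illustration, not the ChainerRL implementation.

```python
import random
from collections import deque

def dqn_targets(batch, q_target, gamma=0.99):
    """Regression targets r + gamma * max_a' Qminus(x', a') for a mini-batch.
    `q_target(x)` returns a list of action values from the frozen network."""
    targets = []
    for (x, a, r, x2, done) in batch:
        boot = 0.0 if done else gamma * max(q_target(x2))
        targets.append(r + boot)
    return targets

# Experience replay: store transitions, then sample a random mini-batch.
replay = deque(maxlen=1000)
replay.extend([(0, 1, 1.0, 1, True), (0, 0, 0.0, 0, False)])
random.seed(0)
batch = random.sample(list(replay), 2)
q_minus = lambda x: [0.5, 0.25]   # frozen Q-, made-up values
ts = dqn_targets(batch, q_minus)
assert len(ts) == 2 and max(ts) == 1.0
```

The online network Q is then regressed toward these targets, and Q⁻ is overwritten by Q every few thousand steps.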
  61. Actor Critic

  62. Actor-Critic
      [Diagram: the actor sends an action to the environment and receives the state and reward; the critic computes a value from the state and reward to guide the actor.]

  63. Policy gradient method
      X, A, R: stationary distributions of the state, action and reward when the actor follows a policy πθ. Update π using gradient ascent of the expected return ρθ.
      Theorem (Policy gradient theorem). Let
      G(θ) = (Q^{πθ}(X, A) − h(X)) ∇θ log πθ(A, X)
      Then E[G(θ)] is an unbiased estimate of ∇θ E[ρθ].
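The ∇θ log πθ term in G(θ) (the "score function") is easy to check numerically for a toy policy. Here is a two-action softmax policy with a single logit parameter, with the analytic score verified against finite differences; the parameterization is made up for illustration.

```python
import math

def softmax_policy(theta):
    """pi_theta over two actions, parameterized by a single logit theta."""
    e = math.exp(theta)
    p1 = e / (e + 1.0)
    return [1.0 - p1, p1]

def score(theta, a):
    """d/d theta of log pi_theta(a): the direction used by G(theta)."""
    p1 = softmax_policy(theta)[1]
    return (1.0 - p1) if a == 1 else -p1

# Finite-difference check of the analytic score function.
theta, h = 0.3, 1e-6
for a in (0, 1):
    fd = (math.log(softmax_policy(theta + h)[a])
          - math.log(softmax_policy(theta - h)[a])) / (2 * h)
    assert abs(fd - score(theta, a)) < 1e-5
```

Multiplying this score by the centered return estimate Q − h(X) and averaging gives the unbiased gradient estimate of the theorem.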
  64. Policy gradient method (cont.)
      Remark. The theorem holds for any function h(x), but V^{πθ}(x) is often used.
      A^π(x, a) = Q^π(x, a) − V^π(x) is called the advantage function.
  65. Asynchronous Advantage Actor-Critic (A3C)
      Gradient ascent using
      G(θ) = (Σ_{i=0}^{k} γ^i ri + γ^k V^π(xk) − V^π(x0)) ∂/∂θ log πθ(a0, x0)
      where Σ_{i=0}^{k} γ^i ri + γ^k V^π(xk) is an estimate of Q^π(x0, a0).
      • π and V^π are represented by a deep neural network
      • Tricks to make learning asynchronous, i.e. learning from multiple plays simultaneously
      • But our experiment uses a single play
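The bracketed factor of G(θ), an n-step bootstrap estimate of Q minus the critic's baseline, can be sketched directly. A minimal illustration with made-up rewards and critic values; indexing conventions for the discounted sum vary, so this follows the slide's form:

```python
def nstep_advantage(rewards, v0, vk, gamma=0.99):
    """Discounted rewards along a partial trajectory, plus a bootstrapped
    tail value gamma^k * V(x_k), minus the baseline V(x_0): the scalar
    multiplying the score function in the A3C gradient estimate."""
    k = len(rewards)
    q_est = sum((gamma ** i) * r for i, r in enumerate(rewards))
    q_est += (gamma ** k) * vk      # bootstrap from the critic's tail value
    return q_est - v0               # subtract the baseline V(x_0)

# Three rewards, then bootstrap from the critic's value of the final state:
# 1 + 0 + 0.25*1 + 0.125*2 - 1.5 = 0, so the advantage is exactly zero.
adv = nstep_advantage([1.0, 0.0, 1.0], v0=1.5, vk=2.0, gamma=0.5)
assert adv == 0.0
```

A positive advantage increases the log-probability of the taken action; a negative one decreases it.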