
Falsification of Cyber Physical System Using Deep Reinforcement Learning



Yoriyuki Yamagata

February 27, 2020

Transcript

  1. Falsification of Cyber Physical System Using Deep Reinforcement Learning
     Y. Yamagata (AIST), S. Liu (Tianjin University), T. Akazaki (Fujitsu Laboratories Ltd), Y. Duan (Tianjin University), J. Hao (Tianjin University)
     February 27, 2020
  2. Table of contents
     1. Cyber Physical System (CPS)
     2. Falsification
     3. Reinforcement learning
     4. Falsification using reinforcement learning
     5. Case studies
  3. Cyber Physical System (CPS)

  4. Cyber Physical System (CPS)
     A Cyber Physical System (CPS) is a system in which physical parts and software coexist.
     In this talk, we focus on reactive CPSs, which take inputs, change their state, and produce outputs.
  5. Challenges posed by CPS
     • Difficult to modularize: tight coupling of physical and software parts
     • Difficult to test: infinite number of possible states
     • Difficult to reason about: hybrid systems are not well understood theoretically
     Yet many CPSs are safety critical, and thus need high confidence.
  6. Example: self-driving car
     A test drive of 11 billion miles is required to prove that a self-driving car has a lower fatality rate than human drivers [Kalra and Paddock, 2016].
     Mixed-reality autonomous vehicle testing methods [Huang et al., 2016]:
     • Start from simulated scenarios
     • Move to the real world
     • Go back to simulation for newly discovered scenarios
     This takes a lot of manual effort, yet may still miss important corner cases.
  7. Falsification

  8. Robustness guided falsification: Idea
     Specification + simulated model (of the CPS) ⇒ counter-examples (corner cases violating the specification)
     • The specification is described in (metric or signal) temporal logic
     • The system is modelled in a system modeling language (MATLAB/Simulink)
     • To find a counter-example, minimize the robustness of the system output with respect to the specification
  9. Metric Temporal Logic (MTL)
     Definition. MTL formulas F are given by
     F ::= a | true | F ∧ F | F ∨ F | ¬F | □_I F | ◇_I F | □⁻_I F | ◇⁻_I F
     where I is an interval on [0, ∞); I = [0, ∞) is omitted.
     • □_I, □⁻_I: always within interval I of the future/past from now
     • ◇_I, ◇⁻_I: possibly within interval I of the future/past from now
  10. Semantics of MTL
      Definition. Our semantics uses discrete time:
      • y = y1, y2, . . . is a sequence of outputs
      • t = t1, t2, . . . is a sequence of sampled time instances
      • Y is the (metric) space of outputs
      • ⟦a⟧ ⊆ Y is the interpretation of a
  11. Semantics of MTL (continued)
      Definition.
      (n, y, t) ⊨ a        ⇔ yn ∈ ⟦a⟧
      (n, y, t) ⊨ F1 ∧ F2  ⇔ (n, y, t) ⊨ F1 and (n, y, t) ⊨ F2
      (n, y, t) ⊨ F1 ∨ F2  ⇔ (n, y, t) ⊨ F1 or (n, y, t) ⊨ F2
      (n, y, t) ⊨ ¬F       ⇔ not (n, y, t) ⊨ F
      (n, y, t) ⊨ □_I F    ⇔ for every i with ti − tn ∈ I, (i, y, t) ⊨ F
      (n, y, t) ⊨ ◇_I F    ⇔ there is an i with ti − tn ∈ I such that (i, y, t) ⊨ F
      (n, y, t) ⊨ □⁻_I F   ⇔ for every i with tn − ti ∈ I, (i, y, t) ⊨ F
      (n, y, t) ⊨ ◇⁻_I F   ⇔ there is an i with tn − ti ∈ I such that (i, y, t) ⊨ F
      Here n and i are time indices.
  12. Robustness rob(F) of an MTL formula F
      Definition.
      D(x, S) =   inf{‖x − y‖ | y ∉ S}   if x ∈ S
                 −inf{‖x − y‖ | y ∈ S}   if x ∉ S
      where ‖x − y‖ is the distance between x and y in Y.
  13. Robustness rob(F) of an MTL formula F (continued)
      Definition.
      rob(n, y, t, a)       = D(yn, ⟦a⟧)
      rob(n, y, t, F1 ∧ F2) = min(rob(n, y, t, F1), rob(n, y, t, F2))
      rob(n, y, t, F1 ∨ F2) = max(rob(n, y, t, F1), rob(n, y, t, F2))
      rob(n, y, t, □_I F)   = min{rob(i, y, t, F) | ti ∈ tn + I}
      rob(n, y, t, ◇_I F)   = max{rob(i, y, t, F) | ti ∈ tn + I}
      rob(n, y, t, □⁻_I F)  = min{rob(i, y, t, F) | ti ∈ tn − I}
      rob(n, y, t, ◇⁻_I F)  = max{rob(i, y, t, F) | ti ∈ tn − I}
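To make the definitions concrete, here is a minimal pure-Python sketch of the robustness of a simple always-formula □(y ∈ [lo, hi]) over a sampled signal. The function names and the signal are made up for illustration; this is not the Taliro-Monitor implementation used later in the talk.

```python
def signed_dist_to_interval(x, lo, hi):
    """D(x, S) for S = [lo, hi] on the real line:
    positive inside S (distance to the boundary), negative outside."""
    if lo <= x <= hi:
        return min(x - lo, hi - x)
    return -min(abs(x - lo), abs(x - hi))

def rob_always_in_interval(ys, lo, hi):
    """rob(0, y, t, always(y in [lo, hi])): the min over all samples."""
    return min(signed_dist_to_interval(y, lo, hi) for y in ys)

# A signal that stays within [-2, 2]: positive robustness (property holds).
assert rob_always_in_interval([0.0, 1.0, -1.5], -2.0, 2.0) == 0.5
# A signal that leaves the interval: negative robustness (falsified).
assert rob_always_in_interval([0.0, 3.0], -2.0, 2.0) == -1.0
```

The sign of the result matches the theorem on the next slide: a positive value certifies satisfaction, a negative value certifies violation.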
  14. Robustness and truth
      Theorem.
      • If rob(n, y, t, F) > 0 then (n, y, t) ⊨ F
      • If rob(n, y, t, F) < 0 then (n, y, t) ⊭ F
      Remark. If rob(n, y, t, F) = 0 then we cannot determine whether (n, y, t) ⊨ F or (n, y, t) ⊭ F.
  15. Robustness guided falsification
      Minimize rob(n, f(x), t, F) to find an x which makes rob(n, f(x), t, F) < 0.
      Many numerical optimization techniques are used:
      • Generic methods: simulated annealing, cross-entropy method [Annpureddy et al., 2011], Nelder–Mead, genetic programming [Donzé, 2010]
      • Incremental: Monte-Carlo tree search [Zhang et al., 2018], Las Vegas tree search [Ernst et al., 2019]
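As the simplest instance of this scheme, a uniform-random falsification loop can be sketched as follows. `simulate`, `robustness`, and `sample_input` are hypothetical stand-ins for the Simulink model, the MTL monitor, and the input generator; the toy "model" and "spec" at the bottom are made up for illustration.

```python
import random

def falsify(simulate, robustness, sample_input, max_simulations=200):
    """Robustness-guided falsification by uniform random search.
    `simulate` maps an input x to an output signal y;
    `robustness` maps y to rob(0, y, t, F)."""
    best_x, best_rob = None, float("inf")
    for _ in range(max_simulations):
        x = sample_input()
        rob = robustness(simulate(x))
        if rob < best_rob:
            best_x, best_rob = x, rob
        if best_rob < 0:           # negative robustness: counter-example found
            return best_x, best_rob
    return best_x, best_rob        # no falsifying input found in the budget

# Toy stand-ins: the "model" squares the input, the "spec" asks y <= 0.5,
# so any |x| > sqrt(0.5) is a counter-example.
random.seed(0)
x, rob = falsify(lambda x: x * x, lambda y: 0.5 - y,
                 lambda: random.uniform(-1.0, 1.0))
assert rob < 0 and abs(x) > 0.7071
```

The optimizers listed above (SA, CE, MCTS, ...) replace the blind `sample_input` with proposals biased toward low robustness.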
  16. Reinforcement learning

  17. Reinforcement learning (RL)
      [Diagram: the agent sends an action to the environment; the environment returns a state and a reward to the agent.]

  18. Markov Decision Process
      Definition. A Markov Decision Process (MDP) is a tuple (X, A, P):
      • X: the set of states
      • A: the set of actions
      • P: a function from (x, a) ∈ X × A to a probability distribution D on X × ℝ; D gives the distribution of the next state and the reward
      Remark. The system is Markovian, in the sense that the next state and the reward depend only on the current state and the action.
  19. Markov Reward Process
      Definition. A policy π is a map from X to a distribution over A.
      Definition. A Markov Reward Process (MRP) is a pair (X, R):
      • X: the set of states
      • R: a function from X to a distribution over X × ℝ
      Remark. An MDP (X, A, P) and a policy π give rise to an MRP (X, R).
  20. Reinforcement learning: Definition
      Definition (Reinforcement learning problem).
      • (x1, r1), (x2, r2), . . . : a sequence generated by an MRP
      • γ ∈ (0, 1]: discount rate; 1 ≤ T ≤ ∞: horizon
      Find a policy π which maximizes the expected reward R = E[Σ_{i=1}^T γ^i ri].
      Remark. π need not depend on the history x1, x2, . . . , xi.
  21. Q- and V-functions
      Definition.
      Q^π(x, a) = E[Σ_{i=1}^T γ^i ri | x0 = x, a0 = a]
      V^π(x)    = E[Σ_{i=1}^T γ^i ri | x0 = x]
      Q(x, a)   = sup_π Q^π(x, a)
      V(x)      = sup_π V^π(x)
      Remark. Once the Q-function is known, the greedy strategy, which chooses the action a maximizing Q(x, a), is the best strategy.
  22. Two approaches to reinforcement learning
      • Q-learning: directly learn the Q-function and use the greedy strategy. Double Deep Q-Network (DDQN) is chosen as a representative.
      • Actor-Critic: follow a policy π, estimate its V^π, then update π. Asynchronous Advantage Actor-Critic (A3C) is chosen as a representative.
  23. Falsification using reinforcement learning

  24. Finite future reach and pure past dependent safety properties
      Definition.
      • Future reach fr(F): how far into the future the truth of F depends
      • If fr(F) < ∞, F has finite future reach
      • If fr(F) = 0, F is pure past dependent
      • An F with fr(F) < ∞ is a finite future reach safety property
      • An F with fr(F) = 0 is a pure past dependent safety property
  25. Falsification to reinforcement learning
      F: pure past dependent, T: time horizon
      rob(0, y, t, F) = min{rob(n, y, t, F) | 0 ≤ n ≤ T}
                      ≈ −log Σ_{n=0}^T exp{−rob(n, y, t, F)}
      To minimize rob(0, y, t, F), maximize Σ_{n=0}^T exp{−rob(n, y, t, F)}.
      This is a reinforcement learning problem with γ = 1 and rn = exp{−rob(n, y, t, F)}.
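The log-sum-exp ("softmin") approximation of the minimum used above can be checked numerically: −log Σ exp(−xn) is always a lower bound on min xn and is within log N of it. A small self-contained check, with made-up robustness values:

```python
import math

def softmin(xs):
    """Smooth approximation of min(xs) via log-sum-exp:
    -log(sum(exp(-x))) <= min(xs) <= -log(sum(exp(-x))) + log(len(xs))."""
    m = min(xs)  # subtract the min first for numerical stability
    return m - math.log(sum(math.exp(-(x - m)) for x in xs))

robs = [3.0, 1.0, 4.0, 1.5]          # hypothetical per-step robustness values
approx = softmin(robs)
assert approx <= min(robs)                        # always a lower bound
assert min(robs) - approx <= math.log(len(robs))  # ...and within log N of it
```

Because maximizing Σ exp(−rob(n, ...)) is equivalent to minimizing this smooth surrogate of the minimum, the sum of per-step rewards rn = exp{−rob(n, ...)} is a valid RL return.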
  26. Falsification to reinforcement learning (continued)
      Remark. Because F is pure past dependent,
      exp{−rob(n, y, t, F)} = exp{−rob(n, (y0, . . . , yn), (t1, . . . , tn), F)}
      Thus rn can be computed incrementally. Further, yn = f(x0, . . . , xn), therefore xn can be determined incrementally.
      Remark. rob and f are not Markovian.
  27. Finite future reach safety property
      F: finite future reach formula with r = fr(F)
      rob(0, y, t, F) ≈ rob(0, y, t, ◇⁻_[r,r] F)
      Remark.
      • ◇⁻_[r,r] F is pure past dependent
      • If the time horizon is (−∞, ∞), then rob(0, y, t, F) = rob(0, y, t, ◇⁻_[r,r] F)
  28. Case studies

  29. Implementation
      [Diagram: the falsifier (A3C and DDQN via ChainerRL) generates the system input for the system model (a Simulink subsystem); the robustness monitor (Taliro-Monitor) computes the robustness of the system output and feeds it back to the falsifier.]
  30. Experiment setting
      • y is sampled every ∆T = 1 and x is generated at the same rate
      • x is interpolated by a piece-wise linear function
      • For each property, repeat "trials" 100 times
      • For each trial, run simulations up to 200 times
      • Measure the number of simulations required to falsify the property
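The piece-wise linear interpolation of the input signal can be sketched in a few lines of pure Python. The function name and the control points are illustrative, not taken from the actual implementation:

```python
def piecewise_linear(ts, xs, t):
    """Interpolate control points (ts[i], xs[i]) linearly at time t.
    ts must be increasing; t is clamped to [ts[0], ts[-1]]."""
    if t <= ts[0]:
        return xs[0]
    if t >= ts[-1]:
        return xs[-1]
    for i in range(len(ts) - 1):
        if ts[i] <= t <= ts[i + 1]:
            w = (t - ts[i]) / (ts[i + 1] - ts[i])
            return (1 - w) * xs[i] + w * xs[i + 1]

# Control points generated every dT = 1, queried at a finer simulation step.
ts, xs = [0.0, 1.0, 2.0], [0.0, 10.0, 4.0]
assert piecewise_linear(ts, xs, 0.5) == 5.0
assert piecewise_linear(ts, xs, 1.5) == 7.0
```

The RL agent thus only chooses one input value per ∆T; the simulator sees a continuous signal between those control points.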
  31. Comparison between baselines and our methods
      Methods
      • Baselines: uniform random (UR), simulated annealing (SA) and the cross-entropy method (CE)
      • Our methods: A3C and DDQN (Double Deep Q-Network)
      Comparisons
      • Overview using box diagrams
      • Statistical testing I: uniform random against the rest
      • Statistical testing II: A3C and DDQN against simulated annealing and the cross-entropy method
  32. Relative effect size measure
      Definition. The relative effect size measure p of two random variables X and Y is
      p = P(X < Y) + (1/2) P(X = Y)
      Remark
      • p > 0.5 implies X tends to be smaller than Y
      • p should not be confused with the p-value of statistical testing
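Estimated from two finite samples, this definition reduces to a pairwise comparison count. A minimal sketch (the sample data are made up):

```python
def relative_effect_size(xs, ys):
    """Estimate p = P(X < Y) + 0.5 * P(X = Y) from two samples,
    comparing every pair; ties count half."""
    wins = sum(1.0 if x < y else 0.5 if x == y else 0.0
               for x in xs for y in ys)
    return wins / (len(xs) * len(ys))

# xs tends to be smaller than ys, so p > 0.5.
assert relative_effect_size([1, 2], [2, 3]) == 0.875
# Identical samples give exactly p = 0.5.
assert relative_effect_size([5], [5]) == 0.5
```

In the tables that follow, X is the number of simulations needed by one method and Y that of another, so p > 0.5 means the first method falsifies with fewer simulations.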
  33. Chasing Cars (CC)
      [Diagram: five cars in a chain; Car 1 takes throttle and brake inputs, and each following car takes the y_out of the car ahead as its y_in.]
  34. Falsified properties
      id   Formula
      ϕ1   □(y5 − y4 ≤ 40.0)
      ϕ2   □_[0,70] ◇_[0,30] (y5 − y4 ≥ 15)
      ϕ3   □_[0,80] ((□_[0,20] y2 − y1 ≤ 20) ∨ (◇_[0,20] y5 − y4 ≥ 40))
      ϕ4   □_[0,65] ◇_[0,30] □_[0,5] (y5 − y4 ≥ 8)
      ϕ5   □_[0,72] ◇_[0,8] ((□_[0,5] y2 − y1 ≥ 9) → (□_[5,20] y5 − y4 ≥ 9))
      Formulas are artificial examples with gradually increasing complexity.
  35. Overview of CC
      [Box plot: number of simulations (0–200) needed to falsify ϕ1–ϕ5, for each algorithm (A3C, DDQN, RAND, CE, SA).]
  36. Comparison to UR
      Properties   A3C     DDQN    CE      SA
      ϕ1           0.001   0.001   0.001   0.350
      ϕ2           0.920   0.920   0.999   0.999
      ϕ3           0.001   0.135   0.005   0.295
      ϕ4           0.380   0.320   0.500   0.500
      ϕ5           0.005   0.001   0.001   0.455
      Null hypothesis: p = 0.5. Bold and italic indicate a statistically significant difference.
  37. Comparison of A3C and DDQN against CE and SA
            A3C             DDQN
            CE      SA      CE      SA
      ϕ1    0.081   0.001   0.052   0.001
      ϕ2    0.112   0.004   0.147   0.011
      ϕ3    0.132   0.015   0.778   0.280
      ϕ4    0.380   0.380   0.320   0.320
      ϕ5    0.330   0.009   0.237   0.002
  38. Automatic Transmission Controller (ATC)
      https://mathworks.com/help/simulink/slref/modeling-an-automatic-transmission-controller.html

  39. Falsified properties
      id   Formula
      ϕ1   □(ω ≤ 4770)
      ϕ2   □(v ≤ 170 ∧ ω ≤ 4770)
      ϕ3   □((g2 ∧ ◇_[0,0.1] g1) → □_[0.1,1.0] ¬g2)
      ϕ4   □((¬g1 ∧ ◇_[0,0.1] g1) → □_[0.1,1.0] g1)
      ϕ5   ⋀_{i=1..4} □((¬gi ∧ ◇_[0,0.1] gi) → □_[0.1,1.0] gi)
      ϕ6   (□_[0,10] ω ≤ 4550) → (□_[10,20] v ≤ 160)
      ϕ7   □(v ≤ 160)
      ϕ8   □_[0,25] ¬(70 ≤ v ≤ 80)
      ϕ9   ¬◇_[0,20] (¬g4 ∧ ω ≥ 3100)
      v: vehicle speed, ω: engine speed, gi: gear positions
  40. Falsified properties (continued)
      Remark
      • φ1–φ5 are φAT1–φAT5 in [Hoxha et al., 2014], but the parameters are different
      • φ6 is derived from φAT6 but modified into a safety property
      • φ7–φ9 are our own; the time horizon is longer (100 instead of 30)
  41. Result
      [Box plot: number of simulations (0–200) needed to falsify ϕ1–ϕ9, for each algorithm (A3C, DDQN, RAND, CE, SA).]
  42. Comparison of A3C, DDQN, CE, SA against UR
            A3C     DDQN    CE      SA
      ϕ1    0.190   0.001   0.475   0.490
      ϕ2    0.200   0.015   0.500   0.500
      ϕ3    0.556   0.878   0.471   0.863
      ϕ4    0.507   0.824   0.369   0.910
      ϕ5    0.542   0.837   0.520   0.561
      ϕ6    0.170   0.001   0.500   0.500
      ϕ7    0.315   0.295   0.500   0.500
      ϕ8    0.960   0.931   0.433   0.899
      ϕ9    0.315   0.140   0.500   0.500
  43. Comparison of A3C and DDQN against CE and SA
            A3C             DDQN
            CE      SA      CE      SA
      ϕ1    0.206   0.196   0.004   0.002
      ϕ2    0.200   0.200   0.015   0.015
      ϕ3    0.524   0.275   0.875   0.414
      ϕ4    0.561   0.215   0.930   0.355
      ϕ5    0.523   0.475   0.999   0.723
      ϕ6    0.170   0.170   0.001   0.001
      ϕ7    0.315   0.315   0.295   0.295
      ϕ8    0.968   0.525   0.960   0.180
      ϕ9    0.315   0.315   0.140   0.140
  44. PTC model
      [Diagram: the Fuel Control System verification stub system takes Pedal Angle and Engine Speed as inputs and outputs A/F, A/F ref, a verification measurement, and the controller Mode.]
      This model [Jin et al., 2014] uses the sampling period ∆T = 5 and a piece-wise constant x.
  45. Falsified properties
      id    Formula
      ϕ26   □_[11,50] |µ| ≤ 0.2
      ϕ27   □_[11,50] (rise ∨ fall → □_[1,5] |µ| ≤ 0.15)
      ϕ30   □_[11,50] µ ≥ −0.25
      ϕ31   □_[11,50] µ ≤ 0.2
      ϕ32   □_[11,50] (power ∧ ◇_[0,0.1] normal → □_[1,5] |µ| ≤ 0.2)
      ϕ33   □_[11,50] (power → |µ| ≤ 0.2)
      ϕ34   □_[0,50] (sensor fail → □_[1,5] |µ| ≤ 0.15)
      Properties are derived from [Jin et al., 2014].
  46. Overview of results
      [Box plot: number of simulations (0–200) needed to falsify ϕ26, ϕ27 and ϕ30–ϕ34, for each algorithm (A3C, DDQN, RAND, CE, SA).]
  47. Comparison to UR
             A3C     DDQN    CE      SA
      ϕ26    0.705   0.914   0.615   0.654
      ϕ27    0.588   0.999   0.575   0.696
      ϕ30    0.796   0.926   0.758   0.975
      ϕ31    0.586   0.845   0.671   0.380
      ϕ32    0.725   0.725   0.274   0.631
      ϕ33    0.524   0.844   0.552   0.588
      ϕ34    0.721   0.836   0.298   0.570
  48. Comparison of A3C and DDQN against CE and SA
             A3C             DDQN
             CE      SA      CE      SA
      ϕ26    0.616   0.511   0.874   0.682
      ϕ27    0.501   0.455   0.900   0.925
      ϕ30    0.644   0.365   0.835   0.455
      ϕ31    0.414   0.684   0.688   0.879
      ϕ32    0.875   0.600   0.875   0.600
      ϕ33    0.478   0.459   0.812   0.747
      ϕ34    0.812   0.607   0.942   0.711
  49. Observations
      • A3C underperforms the baselines in only a few cases
      • A3C exhibits more stable performance than DDQN
      • Whenever CE or SA outperforms A3C or DDQN, UR also outperforms A3C or DDQN
      Caveat: we adjusted the significance level to perform multiple statistical tests, and multiple testing tends to give conservative results.
  50. More on our research
      Falsification of Cyber-Physical Systems Using Deep Reinforcement Learning
      Yoriyuki Yamagata, Shuang Liu, Takumi Akazaki, Yihai Duan and Jianye Hao
      IEEE Transactions on Software Engineering, to appear
      Includes an analysis of execution time, the impact of the logsumexp approximation, and a detailed analysis of the experimental results.
  51. References i
      Annpureddy, Y., Liu, C., Fainekos, G. E., and Sankaranarayanan, S. (2011). S-TaLiRo: A tool for temporal logic falsification for hybrid systems. In Tools and Algorithms for the Construction and Analysis of Systems - 17th International Conference, TACAS 2011, Saarbrücken, Germany, March 26-April 3, 2011, Proceedings, pages 254–257.
  52. References ii
      Hoxha, B., Abbas, H., and Fainekos, G. (2014). Benchmarks for temporal logic requirements for automotive systems. Proc. of Applied Verification for Continuous and Hybrid Systems.
      Donzé, A. (2010). Breach, a toolbox for verification and parameter synthesis of hybrid systems. In Touili, T., Cook, B., and Jackson, P. B., editors, Computer Aided Verification, 22nd International Conference, CAV 2010, Edinburgh, UK, July 15-19, 2010,
  53. References iii
      Proceedings, volume 6174 of Lecture Notes in Computer Science, pages 167–170. Springer.
      Ernst, G., Sedwards, S., Zhang, Z., and Hasuo, I. (2019). Fast falsification of hybrid systems using probabilistically adaptive input. In International Conference on Quantitative Evaluation of Systems, pages 165–181. Springer.
      Huang, W., Wang, K., Lv, Y., and Zhu, F. (2016). Autonomous vehicles testing methods review. In 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), pages 163–168. IEEE.
  54. References iv
      Jin, X., Deshmukh, J. V., Kapinski, J., Ueda, K., and Butts, K. (2014). Powertrain control verification benchmark. In Proceedings of the 17th International Conference on Hybrid Systems: Computation and Control, HSCC '14, pages 253–262, New York, NY, USA. ACM.
      Kalra, N. and Paddock, S. M. (2016). Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability?
  55. References v
      Transportation Research Part A: Policy and Practice, 94:182–193.
      Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
      Zhang, Z., Ernst, G., Sedwards, S., Arcaini, P., and Hasuo, I. (2018). Two-layered falsification of hybrid systems guided by Monte Carlo tree search.
  56. References vi
      IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 37(11):2894–2905.
  57. Textbooks for RL
      • Algorithms for Reinforcement Learning, Szepesvári
      • Reinforcement Learning, Sutton and Barto
      • Dynamic Programming and Optimal Control, Bertsekas
  58. Q-learning

  59. Q-learning: basic idea
      For each step, repeat this process:
      • x: current state, a: action, r: reward, x′: next state
      • Q: current estimate of the Q-function
      δ = r + γ · max_{a′∈A} Q(x′, a′) − Q(x, a)
      where r + γ · max_{a′∈A} Q(x′, a′) is the new estimate of Q(x, a). Update Q to reduce δ.
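The update above can be sketched as tabular Q-learning with ε-greedy exploration. This is a toy sketch of the basic idea on this slide, not the deep DDQN used in the experiments; the two-state environment and all hyperparameters are made up.

```python
import random

def q_learning(step, n_states, n_actions, episodes=500,
               alpha=0.5, gamma=0.9, eps=0.2):
    """Tabular Q-learning with epsilon-greedy exploration.
    `step(x, a)` returns (reward, next_state, done); a stand-in environment."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        x, done = 0, False
        while not done:
            a = (random.randrange(n_actions) if random.random() < eps
                 else max(range(n_actions), key=lambda a: Q[x][a]))
            r, x2, done = step(x, a)
            target = r + (0.0 if done else gamma * max(Q[x2]))
            Q[x][a] += alpha * (target - Q[x][a])  # reduce the TD error delta
            x = x2
    return Q

# Two-state chain: action 1 reaches the goal (reward 1), action 0 stays put.
def step(x, a):
    return (1.0, 1, True) if a == 1 else (0.0, 0, False)

random.seed(1)
Q = q_learning(step, n_states=2, n_actions=2)
assert Q[0][1] > Q[0][0]   # greedy policy learns to move to the goal
```

In the talk's setting, `step` would run one ∆T of the Simulink simulation and return the robustness-based reward rn.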
  60. Deep Q-Network (DQN) [Mnih et al., 2013]
      Represent the Q-function by a deep neural network. Stability is a major issue. Key ideas:
      • Experience replay: memorize experiences ei = (xi, ai, ri, xi+1) and choose random experiences for each update step
      • Use two Q-functions Q and Q⁻:
        δ = r + γ · max_{a′∈A} Q⁻(x′, a′) − Q(x, a)
        where Q⁻ is periodically updated by Q
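The two ideas can be sketched without any neural network: a replay buffer of transitions and regression targets computed from a frozen Q⁻. The buffer contents and the `q_minus` values below are made up; this is a minimal illustration, not the ChainerRL implementation.

```python
import random
from collections import deque

def dqn_targets(batch, q_target, gamma=0.99):
    """Regression targets r + gamma * max_a' Qminus(x', a') for a mini-batch.
    `q_target(x)` returns a list of action values from the frozen network."""
    targets = []
    for (x, a, r, x2, done) in batch:
        boot = 0.0 if done else gamma * max(q_target(x2))
        targets.append(r + boot)
    return targets

# Experience replay: store transitions, then sample a random mini-batch.
replay = deque(maxlen=1000)
replay.extend([(0, 1, 1.0, 1, True), (0, 0, 0.0, 0, False)])
random.seed(0)
batch = random.sample(list(replay), 2)
q_minus = lambda x: [0.5, 0.25]   # frozen Q-, made-up values
ts = dqn_targets(batch, q_minus)
assert len(ts) == 2 and max(ts) == 1.0
```

The online network Q is then regressed toward these targets, and Q⁻ is overwritten by Q every few thousand steps.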
  61. Actor Critic

  62. Actor-Critic
      [Diagram: the actor sends an action to the environment and receives the state and reward; the critic computes a value from the state and reward to guide the actor.]

  63. Policy gradient method
      X, A, R: stationary distributions of the state, action and reward when the actor follows a policy πθ. Update π using gradient ascent of the expected return ρθ.
      Theorem (Policy gradient theorem). Let
      G(θ) = (Q^{πθ}(X, A) − h(X)) ∇θ log πθ(A, X)
      Then E[G(θ)] is an unbiased estimate of ∇θ E[ρθ].
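The ∇θ log πθ term in G(θ) (the "score function") is easy to check numerically for a toy policy. Here is a two-action softmax policy with a single logit parameter, with the analytic score verified against finite differences; the parameterization is made up for illustration.

```python
import math

def softmax_policy(theta):
    """pi_theta over two actions, parameterized by a single logit theta."""
    e = math.exp(theta)
    p1 = e / (e + 1.0)
    return [1.0 - p1, p1]

def score(theta, a):
    """d/d theta of log pi_theta(a): the direction used by G(theta)."""
    p1 = softmax_policy(theta)[1]
    return (1.0 - p1) if a == 1 else -p1

# Finite-difference check of the analytic score function.
theta, h = 0.3, 1e-6
for a in (0, 1):
    fd = (math.log(softmax_policy(theta + h)[a])
          - math.log(softmax_policy(theta - h)[a])) / (2 * h)
    assert abs(fd - score(theta, a)) < 1e-5
```

Multiplying this score by the centered return estimate Q − h(X) and averaging gives the unbiased gradient estimate of the theorem.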
  64. Policy gradient method (cont.)
      Remark. The theorem holds for any function h(x), but V^{πθ}(x) is often used.
      A^π(x, a) = Q^π(x, a) − V^π(x) is called the advantage function.
  65. Asynchronous Advantage Actor-Critic (A3C)
      Gradient ascent using
      G(θ) = (Σ_{i=0}^{k} γ^i ri + γ^k V^π(xk) − V^π(x0)) ∂/∂θ log πθ(a0, x0)
      where Σ_{i=0}^{k} γ^i ri + γ^k V^π(xk) is an estimate of Q^π(x0, a0).
      • π and V^π are represented by a deep neural network
      • Tricks to make learning asynchronous, i.e. learning from multiple plays simultaneously
      • But our experiment uses a single play
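The bracketed factor of G(θ), an n-step bootstrap estimate of Q minus the critic's baseline, can be sketched directly. A minimal illustration with made-up rewards and critic values; indexing conventions for the discounted sum vary, so this follows the slide's form:

```python
def nstep_advantage(rewards, v0, vk, gamma=0.99):
    """Discounted rewards along a partial trajectory, plus a bootstrapped
    tail value gamma^k * V(x_k), minus the baseline V(x_0): the scalar
    multiplying the score function in the A3C gradient estimate."""
    k = len(rewards)
    q_est = sum((gamma ** i) * r for i, r in enumerate(rewards))
    q_est += (gamma ** k) * vk      # bootstrap from the critic's tail value
    return q_est - v0               # subtract the baseline V(x_0)

# Three rewards, then bootstrap from the critic's value of the final state:
# 1 + 0 + 0.25*1 + 0.125*2 - 1.5 = 0, so the advantage is exactly zero.
adv = nstep_advantage([1.0, 0.0, 1.0], v0=1.5, vk=2.0, gamma=0.5)
assert adv == 0.0
```

A positive advantage increases the log-probability of the taken action; a negative one decreases it.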