Robustness: r = min_{t=1..T} (200 − v_t) • Find a counter-example by minimizing robustness • Cast the falsification problem into a numerical optimization problem
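To make this formulation concrete, here is a minimal sketch that computes the robustness of a speed trace against the specification "the speed stays below 200"; the example trace and variable names are hypothetical, only the min-based robustness follows the slide.

```python
import numpy as np

def robustness(speed_trace):
    """Robustness r = min_t (200 - v_t) of the spec 'speed stays below 200'.
    A negative value means the trace violates (falsifies) the specification."""
    return np.min(200.0 - np.asarray(speed_trace))

# Hypothetical trace; a real trace would come from simulating the system model.
v = [0.0, 80.0, 150.0, 195.0, 201.0]
print(robustness(v))  # -1.0 -> this input falsifies the specification
```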
• Cast the falsification problem into a reinforcement learning problem • Implemented the proposed method using a deep reinforcement learning framework • Performed a comparison with S-TaLiRo (a widely used robustness-guided falsification tool)
min_{t=1..T} (200 − v_t) ≈ − log Σ_{t=1..T} exp[−(200 − v_t)] • We want to find an input which minimizes min_{t=1..T} (200 − v_t) • Therefore, we need to maximize Σ_{t=1..T} exp[−(200 − v_t)] • We can solve this optimization problem using reinforcement learning with the per-step reward exp[−(200 − v_t)]
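A minimal sketch of this reward shaping, assuming a scalar speed observation v_t: the per-step reward is exp[−(200 − v_t)], and summing it over a trace gives the log-sum-exp surrogate that approximates the true (min-based) robustness.

```python
import numpy as np

def step_reward(v_t):
    # Per-step reward from the slide: exp[-(200 - v_t)].
    # It grows as the speed approaches (or exceeds) the 200 threshold.
    return np.exp(-(200.0 - v_t))

def smoothed_robustness(speed_trace):
    # Soft-min surrogate: -log sum_t exp[-(200 - v_t)] approximates
    # min_t (200 - v_t); maximizing the summed reward minimizes this surrogate.
    v = np.asarray(speed_trace)
    return -np.log(np.sum(np.exp(-(200.0 - v))))

v = [120.0, 180.0, 198.0, 199.5]                # hypothetical speed trace
print(sum(step_reward(x) for x in v))            # return the RL agent maximizes
print(smoothed_robustness(v), min(200.0 - x for x in v))  # surrogate vs. true robustness
```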
• Reinforcement learning algorithms using deep learning • Versatile: can adapt to non-linear system dynamics • In particular, we use two algorithms • DDQN (Q-learning approach) • A3C (Actor-Critic approach)
• System model (target model) is simulated in MATLAB • Reinforcement learning part is implemented in Python • Uses the Python library ChainerRL (now PFRL) (illustrated below) • Robustness monitor: reuse the monitor from S-TaLiRo
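As an illustration of the ChainerRL/PFRL side, here is a minimal sketch of setting up a Double DQN (DDQN) agent with PFRL; the network sizes, hyper-parameters, and observation/action dimensions are assumptions for illustration, not the settings used in the paper.

```python
import numpy as np
import torch
import torch.nn as nn
import pfrl

class QFunction(nn.Module):
    """Small fully connected Q-network (sizes are illustrative)."""
    def __init__(self, obs_size, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, x):
        return pfrl.action_value.DiscreteActionValue(self.net(x))

obs_size, n_actions = 4, 5          # assumed observation/action dimensions
q_func = QFunction(obs_size, n_actions)
optimizer = torch.optim.Adam(q_func.parameters(), eps=1e-2)

agent = pfrl.agents.DoubleDQN(
    q_func,
    optimizer,
    replay_buffer=pfrl.replay_buffers.ReplayBuffer(capacity=10 ** 5),
    gamma=0.99,
    explorer=pfrl.explorers.ConstantEpsilonGreedy(
        epsilon=0.1, random_action_func=lambda: np.random.randint(n_actions)),
    replay_start_size=500,
    target_update_interval=100,
    phi=lambda x: x.astype(np.float32, copy=False),
)
# In a falsification loop, agent.act(obs) would pick the next input segment and
# agent.observe(obs, reward, done, reset) would feed back the robustness-based reward.
```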
• Benchmark models include Automatic Transmission and Power Train Control • The falsifier is allowed to run 200 simulations to falsify a specification in each trial • 100 trials are repeated for each model and property, because the result varies due to the stochastic nature of the agent • No pre-training, no hyper-parameter tuning • At the start of each trial, the “memory” of the agent is reset; the memory is kept between simulations within a trial (see the sketch below)
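To make the trial structure concrete, here is a schematic sketch of the protocol above; make_agent and run_one_simulation are hypothetical placeholders for the agent construction and for one MATLAB simulation plus robustness evaluation.

```python
import random

N_TRIALS, MAX_SIMULATIONS = 100, 200

def make_agent():
    # Hypothetical: build a fresh agent, i.e. reset its "memory"
    # (replay buffer and network weights).
    return object()

def run_one_simulation(agent):
    # Hypothetical: let the agent choose the input signal, simulate the model,
    # and return the robustness of the trace (negative = specification falsified).
    return random.uniform(-1.0, 10.0)

results = []
for trial in range(N_TRIALS):
    agent = make_agent()                      # memory reset at the start of each trial
    falsified_at = None
    for sim in range(1, MAX_SIMULATIONS + 1):
        if run_one_simulation(agent) < 0:     # memory kept across simulations in a trial
            falsified_at = sim
            break
    results.append(falsified_at)              # None: the 200-simulation budget was exhausted

print(sum(r is not None for r in results), "of", N_TRIALS, "trials falsified the specification")
```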
Execution time is not used as the comparison metric. Reasons: • Execution time depends on • implementation details (the combination of Python and MATLAB slows down simulation) • scheduling (we run experiments concurrently on a single machine) • We find the time required for the reinforcement learning part is insignificant
• We compare two samples X and Y, whose distributions are unknown and highly skewed • Therefore, we do not use the average etc. but the relative effect size measure p = P(X < Y) + (1/2)·P(X = Y) • p > 0.5 means X tends to be smaller than Y • We perform non-parametric statistical testing with the null hypothesis p = 0.5 (see the sketch below)
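A small sketch of this comparison, assuming the samples are, for example, the number of simulations needed per trial for two methods; the slide does not name the specific test, so SciPy's Mann-Whitney U test is used here as one standard non-parametric choice, and the effect size p = P(X < Y) + ½·P(X = Y) is estimated directly over all pairs.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def relative_effect_size(x, y):
    """Estimate p = P(X < Y) + 0.5 * P(X = Y) over all pairs.
    Values above 0.5 indicate that X tends to be smaller than Y."""
    x, y = np.asarray(x), np.asarray(y)
    less = (x[:, None] < y[None, :]).mean()
    ties = (x[:, None] == y[None, :]).mean()
    return less + 0.5 * ties

# Hypothetical samples: simulations-to-falsification per trial for two methods.
x = [12, 30, 18, 200, 25, 40]    # e.g. proposed method
y = [50, 200, 90, 200, 120, 60]  # e.g. baseline
print(relative_effect_size(x, y))                   # effect size estimate
print(mannwhitneyu(x, y, alternative="two-sided"))  # non-parametric test (null: no tendency either way)
```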
• …baselines, except φ2, in which RAND outperforms all methods • Automatic Transmission: A3C either outperforms baselines or shows equal performance; the performance of DDQN is unstable • Power Train Control: the proposed methods underperform baselines
• The proposed methods outperform the baselines, but not always • However, whenever the proposed methods underperform the baselines, RAND outperforms or performs equally to the other methods • In conclusion, a combination of reinforcement learning and uniform random inputs could be a good approach
• Paper: Y. Yamagata, S. Liu, T. Akazaki, Y. Duan and J. Hao, "Falsification of Cyber-Physical Systems Using Deep Reinforcement Learning," IEEE Transactions on Software Engineering, 2020 • Implementation: https://github.com/yoriyuki-aist/Falsify • Comparison to other tools: ARCH-COMP 2019 Category Report: Falsification, EasyChair, 2019