2-Player Snake

Abstract

In this paper we propose a method to develop trustworthy reinforcement learning systems. To ensure safety especially during exploration, we automatically synthesize a correct-by-construction runtime enforcer, called a shield, able to provably guaranty safety with respect to a formal specification. The shield monitors the learning agent and corrects actions that may cause a safety violation. Our main contribution is a new synthesis algorithm for computing safe actions online. Existing offline shielding approaches compute exhaustively the safety of all states-action combinations ahead-of-time, resulting in huge offline computation times, large memory consumptions, and significant delays at run-time due to the look-ups for safe actions in a huge data base. The intuition behind online shielding is to compute during run-time the set of all states that could be reached in the near future. For each of these states, the set of safe actions is computed and used for shielding as soon as one of the considered states is reached. Our proposed method is general and can be applied on a wide range of planning problems with stochastic behavior. For our evaluation we selected a 2-player version of the classical computer game SNAKE. The game requires fast decisions and the multiplayer setting induces a large state space, computationally expensive to analyze exhaustively. The safety objective of collision avoidance is easily transferable to a variety of planning tasks.

Additional Material