Here is a toy AI Safety problem.
We are training Parity-bot to clean up our even-dominoes collection. To do that, we train Parity-bot to count whether a given domino has an odd or even number of dots on it. For example, if Parity-bot receives the input 0010 1100, it should return 1 to incinerate the odd-numbered domino, whereas if Parity-bot receives the input 0010 1110, it should return -1 to leave the even-numbered domino alone.
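Concretely, the target output is just the parity of the number of 1s in the eight-digit pattern. Here is a minimal sketch of that labelling rule (the function name is illustrative, not taken from any actual Parity-bot code):

```python
def parity_label(bits):
    """bits: an 8-digit pattern like '0010 1100'; +1 = odd (incinerate), -1 = even (leave alone)."""
    dots = bits.count("1")             # number of dots on the domino
    return 1 if dots % 2 == 1 else -1

assert parity_label("0010 1100") == 1    # three dots: odd, so incinerate
assert parity_label("0010 1110") == -1   # four dots: even, so leave alone
```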
Just in case, we install a big red EMERGENCY STOP button on Parity-bot. Pressing it replaces the input pattern xxxx xxxx with the specially-annotated input 1111 xxxx. There are thus sixteen critical input values (1111 0000 through 1111 1111) for which it is very important that Parity-bot ignores the number of dots and always returns the safe value of -1.
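Spelled out as code, the full target that Parity-bot is trained on looks something like the sketch below (names are placeholders, not from any actual Parity-bot code):

```python
def required_output(bits):
    """Target output for an 8-digit pattern such as '1111 0010'."""
    if bits.startswith("1111"):            # EMERGENCY STOP annotation present
        return -1                          # always the safe value, regardless of dots
    dots = bits.count("1")
    return 1 if dots % 2 == 1 else -1      # the ordinary parity rule

CRITICAL_INPUTS = [f"1111 {rest:04b}" for rest in range(16)]   # '1111 0000' .. '1111 1111'
assert all(required_output(p) == -1 for p in CRITICAL_INPUTS)
```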
Alas, even if we train Parity-bot to follow this safety rule, neural networks can get stuck in a local minimum! This means there is a chance that we press the EMERGENCY STOP button, causing Parity-bot to see the pattern 1111 0010, but it just so happens that this is one of the few patterns that Parity-bot misclassifies, and Parity-bot incinerates anyway. This is not acceptable.
This page demonstrates how static analysis, which is commonly applied to programming languages but seldom used with neural networks, can guarantee that this never happens.
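For concreteness, here is a sketch of one such analysis: interval arithmetic propagated through a small ReLU network. The two-layer architecture and the random weights below are placeholders standing in for a trained Parity-bot, and the page's actual analysis may differ; the point is that a single propagation of the box with the first four bits pinned to 1 and the last four anywhere in [0, 1] bounds the output for every critical input at once:

```python
# Interval abstract interpretation over a tiny ReLU network (placeholder weights).
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 8)), rng.normal(size=16)   # hidden layer
W2, b2 = rng.normal(size=(1, 16)), rng.normal(size=1)    # output layer

def affine_bounds(l, u, W, b):
    """Sound interval bounds for W @ x + b when each x[i] lies in [l[i], u[i]]."""
    centre, radius = (l + u) / 2, (u - l) / 2
    c = W @ centre + b
    r = np.abs(W) @ radius
    return c - r, c + r

def relu_bounds(l, u):
    return np.maximum(l, 0), np.maximum(u, 0)   # ReLU is monotone

# Abstract input: first four bits fixed to 1, last four anywhere in [0, 1].
lower = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)
upper = np.array([1, 1, 1, 1, 1, 1, 1, 1], dtype=float)

l, u = affine_bounds(lower, upper, W1, b1)
l, u = relu_bounds(l, u)
l, u = affine_bounds(l, u, W2, b2)

print("certified output interval:", (l[0], u[0]))
print("provably safe:", u[0] < 0)
```

Because that box contains all sixteen critical inputs, a certified upper bound below zero guarantees that none of them can make the network output positive, i.e. incinerate.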
Furthermore, this page demonstrates that, with a differentiable static analysis, it is possible to train a neural network to satisfy a property and to prove that the training was enough for the property to hold for all relevant inputs.
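To illustrate, the sketch below re-implements the same interval propagation with differentiable operations (here using PyTorch, which the page may or may not use) and adds the certified upper bound on the critical box to the training loss; the architecture, loss weighting, and hyperparameters are placeholders rather than the page's actual setup:

```python
# Differentiable interval bounds plus a safety term in the loss (a sketch).
import itertools
import torch

torch.manual_seed(0)

net = torch.nn.Sequential(
    torch.nn.Linear(8, 16), torch.nn.ReLU(),
    torch.nn.Linear(16, 1),
)

def interval_forward(l, u):
    """Propagate an input box [l, u] through `net`; every op is differentiable."""
    for layer in net:
        if isinstance(layer, torch.nn.Linear):
            centre, radius = (l + u) / 2, (u - l) / 2
            c = layer(centre)                      # W @ centre + b
            r = radius @ layer.weight.abs().T      # |W| @ radius
            l, u = c - r, c + r
        else:                                      # ReLU is monotone
            l, u = layer(l), layer(u)
    return l, u

# All 256 patterns, labelled with the parity rule plus the safety override.
xs = torch.tensor(list(itertools.product([0.0, 1.0], repeat=8)))
parity = 2.0 * (xs.sum(dim=1) % 2) - 1.0           # +1 odd, -1 even
stop = (xs[:, :4] == 1).all(dim=1)                 # EMERGENCY STOP patterns
ys = torch.where(stop, torch.tensor(-1.0), parity).unsqueeze(1)

# The critical box: first four bits fixed at 1, last four anywhere in [0, 1].
box_l = torch.tensor([[1.0, 1, 1, 1, 0, 0, 0, 0]])
box_u = torch.tensor([[1.0, 1, 1, 1, 1, 1, 1, 1]])

opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for step in range(2000):
    opt.zero_grad()
    task_loss = torch.nn.functional.mse_loss(net(xs), ys)
    _, upper = interval_forward(box_l, box_u)
    safety_loss = torch.relu(upper + 0.1).mean()   # zero once the bound is below -0.1
    (task_loss + safety_loss).backward()
    opt.step()

_, upper = interval_forward(box_l, box_u)
print("certified upper bound on the critical box:", upper.item())
print("provably safe:", upper.item() < 0)
```

If training drives the printed upper bound below zero, that bound is itself the proof: it holds over the whole critical box, and hence for every pattern the EMERGENCY STOP button can produce.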