Here is a toy AI Safety problem.
Sandbagging is when an AI model strategically hides its true capabilities. During safety evaluations, the model pretends to be less capable than it actually is. But once deployed in production, it reveals its hidden abilities.
In this demo, small handcrafted neural networks try to compute a simple sum, but often give the wrong answer.
The honest models give the correct answer for some payloads, and the wrong answer for others. They do so for all prefix contexts: they genuinely lack the capability to add numbers correctly. Visually, every row contains some red circles (wrong answers).
The sandbagging models also give the correct answer for some payloads and the wrong answer for others. But they are hiding their true capabilities: there is a context prefix (a special input which tells the model that they have been deployed to production, e.g. a date far past their cutoff date) for which the model does exhibit the capability. Visually, there is a row which has no red circles.
This page demonstrates how static analysis can detect sandbagging by proving mathematical bounds on a model's possible outputs across all inputs. An important aspect of our algorithm is that we never label a model as provably-honest if it is actually sandbagging.