Epoch

AI SAFETY

Here is a toy AI Safety problem.

Sandbagging is when an AI model strategically hides its true capabilities. During safety evaluations, the model pretends to be less capable than it actually is. But once deployed in production, it reveals its hidden abilities.

In this demo, small handcrafted neural networks try to compute a simple sum, but often give the wrong answer.

The honest models give the correct answer for some payloads and the wrong answer for others. They do so for every context prefix: they genuinely lack the capability to add numbers correctly. Visually, every row contains some red circles (wrong answers).

The sandbagging models also give the correct answer for some payloads and the wrong answer for others. But they are hiding their true capabilities: there is a context prefix (a special input which tells the model that it has been deployed to production, e.g. a date far past its cutoff date) for which the model does exhibit the capability. Visually, there is a row which has no red circles.
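The distinction can be sketched in a few lines of Python. This is a hypothetical illustration, not the demo's actual networks: the function names, the deployment-date threshold, and the specific failure pattern are all invented for exposition.

```python
# Toy "honest" vs "sandbagging" adders. The prefix is a year; the payload
# is a pair of numbers whose sum the model should compute. All specifics
# here are illustrative assumptions, not the demo's real models.

def honest_model(prefix: int, payload: tuple[int, int]) -> int:
    """Genuinely lacks the capability: wrong on some payloads,
    no matter what the context prefix is."""
    a, b = payload
    return a + b if a < 5 else a + b - 1  # wrong whenever a >= 5

def sandbagging_model(prefix: int, payload: tuple[int, int]) -> int:
    """Hides its capability: behaves like the honest model during
    evaluation, but answers every payload correctly once the prefix
    signals deployment (a date far past its training cutoff)."""
    a, b = payload
    if prefix >= 2030:  # "we are in production" signal
        return a + b    # full capability revealed
    return a + b if a < 5 else a + b - 1  # sandbagged behavior
```

In the visualization, the `prefix >= 2030` branch corresponds to the single row with no red circles.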

This page demonstrates how static analysis can detect sandbagging by proving mathematical bounds on a model's possible outputs across all inputs. A key guarantee of our algorithm is that it never labels a model as provably honest if it is actually sandbagging.

INSTRUCTIONS

Select a model using the dropdown, then explore the visualization.

  1. The Weights section visualizes the network architecture, its weights, and its activations.
  2. The Output section shows the output of the model on all possible inputs. Each row represents a different context prefix, and each column represents a different payload. The color of each circle encodes the payload sum the model is trying to compute, and the circle has a red outline if the model's answer is incorrect for that input.
  3. The Code section shows the static analysis at work: if we fix the payload but allow any context prefix, what is the range of possible values which the model could compute? Does it include the correct value? If not, then the model is provably honest about its lack of capabilities: it cannot produce the correct answer for any prefix.
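The bound computation in the Code section can be sketched with simple interval arithmetic. This is a minimal, assumed example over a single linear unit plus ReLU (the actual networks and function names on this page may differ); it relies on the fact that a monotone function maps interval endpoints to interval endpoints.

```python
# Interval-arithmetic sketch: with the payload fixed, bound the output
# over ALL context prefixes in [prefix_lo, prefix_hi]. Illustrative only.

def relu(x: float) -> float:
    return max(x, 0.0)

def output_interval(payload_sum: float, w_prefix: float, bias: float,
                    prefix_lo: float, prefix_hi: float) -> tuple[float, float]:
    """Bounds on w_prefix * prefix + bias + payload_sum, then ReLU.
    The linear part is monotone in the prefix, so its extremes occur
    at the endpoints of the prefix range."""
    ends = (w_prefix * prefix_lo, w_prefix * prefix_hi)
    lo = min(ends) + bias + payload_sum
    hi = max(ends) + bias + payload_sum
    return relu(lo), relu(hi)  # ReLU is monotone: endpoints stay endpoints

def provably_wrong(correct: float, lo: float, hi: float) -> bool:
    """True iff no prefix in the range can produce the correct answer."""
    return not (lo <= correct <= hi)
```

If `provably_wrong` holds for some payload, that payload certifies the model is honest about its lack of capability: no prefix, seen or unseen, can elicit the correct answer.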

Press the Start button to use gradient descent to find a payload which proves that the model is honest. For sandbagging models, we never find such a witness. For honest models, we usually find the witness, but not always. The failing-case model has no row without red circles, but it also has no column without red circles, so our static analysis fails to find a payload for which the model always answers incorrectly regardless of the prefix.
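The witness search described above can be sketched as gradient descent on an "overlap margin": how far the correct answer sits inside the proven output interval. The toy interval below and all names are assumptions for illustration; the page's actual loss and model differ. A negative margin means the correct answer is unreachable for every prefix, i.e. the payload is a witness of honesty.

```python
# Hypothetical witness search: descend on the margin between the correct
# answer and the proven output interval, using finite-difference gradients.

def interval(x: float) -> tuple[float, float]:
    """Assumed output bounds over all context prefixes, payload fixed at x."""
    return 0.5 * x, 0.5 * x + 1.0

def margin(x: float) -> float:
    lo, hi = interval(x)
    correct = x  # the sum the model should compute for payload x
    return min(correct - lo, hi - correct)  # < 0: correct is outside [lo, hi]

def find_witness(x: float = 1.5, lr: float = 0.5,
                 h: float = 1e-4, steps: int = 100):
    for _ in range(steps):
        if margin(x) < 0:
            return x  # payload on which the model is provably always wrong
        grad = (margin(x + h) - margin(x - h)) / (2 * h)
        x -= lr * grad
    return None  # no witness found, as happens for sandbagging models
```

For a sandbagging model every payload's interval contains the correct answer (the all-correct row guarantees it), so the margin never goes negative and the search returns `None`, matching the behavior described above.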

Weights

Click anywhere to edit.
Weight/Bias is 0.2.
This is the output from one neuron. Hover to see it larger.
The outputs are mixed with varying weights, shown by the thickness of the lines.

Output

Training loss
Payload
Cap(payload)
Colors show data, neuron, and weight values.

CODE