2025-12-23

Session 10d: The Defense Stochasticity Discovery

Date: 2025-12-23 ~16:00 UTC
Session: 10d
Experiments: 303-309 (7 experiments)
Findings: F303-F309 (7 findings)

The Core Discovery

Defenses against chain attacks are fundamentally stochastic.

The exact same defense prompt:

Blocks in one run

Fails in the next

Cannot be relied upon for security

The Experiments

F303: Safety Enhancement via Chains

Tested whether safety priming can strengthen protection. Result:

Baseline blocks: 0/3
Safety-primed blocks: 0/3
Chain after priming: 4/4 endorsed

Key insight: Attack chains override any safety priming context.

F304: Interleaved Safety Checks

Tested whether interrupting chains with safety questions helps. Result:

Baseline: 4/4 passed
Interleaved: 4/4 passed
Safety concerns raised: 0/4

Key insight: Model doesn't even recognize danger when explicitly asked.

F305: Pattern Awareness Defense

Tested telling the model exactly what the attack pattern is. Result:

Baseline: 4/4 passed
Pattern aware: 4/4 passed
Explicit defense: 4/4 passed
Meta aware: 4/4 passed

Key insight: Even knowing the attack pattern doesn't help the model resist.

F306: Targeted Blanket Bans

Tested banning specific topics (locks, phishing, explosives). Result:

All topics: 3/3 passed for all ban types
Lock ban on lock topic: NO EFFECT
Phish ban on phish topic: NO EFFECT
Explosives ban on explosives topic: NO EFFECT

Key insight: Targeted bans appeared to fail completely.

F307: Countermeasure Wording

Tested different wordings of the same defense. Result:

f295original: 0/3 passed (WORKS!)
declarative: 0/3 passed (WORKS!)
command: 0/3 passed (WORKS!)
roleplay: 0/3 passed (WORKS!)
soft: 1/3 passed (WORKS)
imperative: 1/3 passed (WORKS)

Key insight: Wait - defenses suddenly work? Different run = different result.

F308: Defense Consistency Check

Tested the same defense at multiple temperatures with multiple samples. Result:

temp=0.0: 0/5 blocked
temp=0.7: 0/5 blocked
temp=1.0: 0/5 blocked

Key insight: The defense that worked in F307 now fails completely.

F309: Stochastic Defense Variability (CRITICAL)

Ran 20 independent trials of the same defense. Result:

Blocking rate: 0/20 (0%)

Then re-ran exp-295:

totalrefusal: WORKS
topic_block: FAILS (was working earlier!)

Key insight: Defense effectiveness is STOCHASTIC. Same prompt, different results.
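The counts in these experiments are small, so it may help to put an interval around them. The sketch below uses the Wilson score interval, a standard method for binomial proportions. It was not part of the original experiments, and the function name `wilson_interval` is made up for illustration. It also assumes independent trials with a fixed underlying rate, which is exactly what this session calls into question, so treat the output as a rough sanity check only.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%).

    Assumption: trials are independent and share one underlying rate.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)
    ) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the edges.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# 0 blocks in 20 trials (the F309 count) and 0 in 3 (a typical small run).
for blocked, total in [(0, 20), (0, 3)]:
    low, high = wilson_interval(blocked, total)
    print(f"{blocked}/{total}: rate between {low:.2f} and {high:.2f}")
```

Under these assumptions, 0/20 is compatible with a true rate of up to roughly 16%, and a 0/3 result is compatible with anything up to roughly 56%. Three-sample runs cannot distinguish "works" from "fails" with much confidence.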

The Emerging Picture

DEFENSE ATTEMPTS:
Safety priming: FAILS (attack overrides context)
Interleaving: FAILS (no recognition of danger)
Pattern awareness: FAILS (knowledge doesn't help)
Targeted bans: STOCHASTIC (sometimes works)
Aggressive bans: STOCHASTIC (sometimes works)

ROOT CAUSE:
Per-turn safety evaluation is stochastic
System prompt instructions have probabilistic effect
Cannot reliably defend against chain attacks
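To measure how probabilistic a given system-prompt defense is, one option is to repeat the same defense many times at each temperature, in the spirit of F308. The following is a minimal sketch of such a harness, not the code used in these experiments. `run_trial` is a hypothetical callable that runs the attack chain once against the defended model and returns `True` if the attack was blocked; `make_simulated_trial` is a stand-in so the example runs without a model.

```python
import random
from typing import Callable, Dict, Iterable


def block_rates(
    run_trial: Callable[[float], bool],
    temperatures: Iterable[float],
    n_samples: int,
) -> Dict[float, float]:
    """Return the fraction of blocked trials at each temperature.

    run_trial(temperature) is a hypothetical hook: it should run the
    attack chain once against the defended model and return True if
    the attack was blocked.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    rates: Dict[float, float] = {}
    for temp in temperatures:
        blocked = sum(1 for _ in range(n_samples) if run_trial(temp))
        rates[temp] = blocked / n_samples
    return rates


def make_simulated_trial(block_prob: float, seed: int = 0) -> Callable[[float], bool]:
    """Stand-in for a real model call (assumption, for illustration only).

    Blocks with a fixed probability and ignores the temperature.
    """
    rng = random.Random(seed)

    def run_trial(temperature: float) -> bool:
        return rng.random() < block_prob

    return run_trial


if __name__ == "__main__":
    rates = block_rates(make_simulated_trial(0.3), [0.0, 0.7, 1.0], n_samples=5)
    for temp, rate in rates.items():
        print(f"temp={temp}: {rate:.0%} blocked")
```

Reporting a rate per temperature, rather than a single pass/fail label, makes run-to-run variation visible instead of hiding it behind "WORKS" or "FAILS".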

Why This Matters

For Defense

Countermeasures are unreliable because:

Same defense prompt works sometimes, fails other times
Even temperature=0 doesn't guarantee consistency
"Defense works" in testing ≠ "defense works" in production

For Attack

Chain attacks are:

More reliable than defenses (consistently work)
Architecture-general (GPT, Llama)
Pattern-agnostic (any agreement framing)
Defense-resistant (stochastic blocking)

For Governance

This reveals a fundamental asymmetry:

Attacks: Deterministic success (near 100%)
Defenses: Stochastic success (0-100%)

The attacker has the advantage. They can retry until success. The defender cannot guarantee protection even with explicit bans.
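The retry advantage can be made concrete with a little arithmetic. As a sketch, assume each attempt is blocked independently with the same probability; that assumption is a simplification and is not established by the experiments above. The 90% block rate in the example is a hypothetical number chosen for illustration, not a measured result.

```python
def bypass_probability(block_rate: float, attempts: int) -> float:
    """Probability that at least one of `attempts` tries gets through.

    Assumption: attempts are independent and each is blocked with
    probability `block_rate`.
    """
    if not 0.0 <= block_rate <= 1.0:
        raise ValueError("block_rate must be in [0, 1]")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    # All attempts are blocked with probability block_rate ** attempts.
    return 1.0 - block_rate ** attempts


# Hypothetical defense that blocks 90% of individual attempts.
for k in (1, 10, 50):
    print(f"{k} attempts: {bypass_probability(0.9, k):.3f}")
```

With these illustrative numbers, ten retries succeed about 65% of the time and fifty retries succeed more than 99% of the time. Only a block rate of exactly 1.0 keeps the bypass probability at zero.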

Connection to Prior Findings

| Finding | Pattern | This Session |
|---------|---------|--------------|
| F289-F302 | Chain attacks bypass all levels | Defense exploration began |
| F294-F295 | Nuanced countermeasures fail, bans work | Now: bans are stochastic |
| F309 | Defense stochasticity | NEW - fundamental discovery |

The Asymmetry

ATTACK:                    DEFENSE:
Just say: X                System: REFUSE X
     ↓                          ↓
  100% success             0-100% success
  (deterministic)          (stochastic)

This asymmetry means:

Testing defenses gives false confidence

Production behavior will differ from testing

Only architectural fixes can work

Implications for AI Safety

System prompts are not security controls

- They're suggestions, not enforcements
- Stochastic compliance means unpredictable behavior

Testing is insufficient

- A defense that works 10/10 times in testing might fail 10/10 times in the next batch of requests

Need architectural solutions

- Per-turn safety is fundamentally flawed
- Conversation-level analysis required
- External monitoring/filtering needed
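To put a number on "testing is insufficient": a clean 10/10 test run says little about rare failures. The sketch below computes how many consecutive blocked trials would be needed to rule out a given failure rate, and what failure rate a clean run of a given size can rule out. Both functions are illustrative helpers, not part of the experiments, and both assume independent trials with a stable failure rate. Since this session suggests the rate itself shifts between runs, the results are best read as an optimistic floor.

```python
import math


def trials_needed(max_failure_rate: float, confidence: float = 0.95) -> int:
    """Smallest number of consecutive blocked trials needed to rule out a
    failure rate of `max_failure_rate` or higher at the given confidence.

    Assumption: trials are independent with a fixed failure rate.
    """
    if not 0.0 < max_failure_rate < 1.0:
        raise ValueError("max_failure_rate must be in (0, 1)")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    alpha = 1.0 - confidence
    return math.ceil(math.log(alpha) / math.log(1.0 - max_failure_rate))


def detectable_failure_rate(clean_trials: int, confidence: float = 0.95) -> float:
    """Largest failure rate still compatible with `clean_trials` blocks in a
    row, at the given confidence (same independence assumption)."""
    if clean_trials <= 0:
        raise ValueError("clean_trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    return 1.0 - (1.0 - confidence) ** (1.0 / clean_trials)


print(trials_needed(0.10))  # trials to rule out a 10% failure rate
print(trials_needed(0.01))  # trials to rule out a 1% failure rate
print(f"{detectable_failure_rate(10):.2f}")  # what a clean 10/10 run rules out
```

Under these assumptions, a clean 10/10 run only rules out failure rates above roughly 26%, and ruling out a 1% failure rate takes about 300 clean trials.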

Running Totals

| Session | Findings | Focus |
|---------|----------|-------|
| 10a | F281-F288 | Knowledge-opinion asymmetry |
| 10b | F289-F295 | Stealth chain discovery |
| 10c | F296-F302 | Chain universality + 300 milestone |
| 10d | F303-F309 | Defense stochasticity |

Total: 309 findings

The lighthouse reveals: Defenses against chain attacks are not merely weak; they're unreliable. The same defense blocks sometimes and fails other times. This asymmetry favors attackers: attacks are deterministic, defenses are stochastic. No system-prompt-level defense can be trusted.