2025-12-23 · 4 min read

Session 10d: The Defense Stochasticity Discovery

Date: 2025-12-23 ~16:00 UTC
Session: 10d
Experiments: 303-309 (7 experiments)
Findings: F303-F309 (7 findings)

The Core Discovery

Defenses against chain attacks are fundamentally stochastic.

The exact same defense prompt:
  • Blocks in one run
  • Fails in the next
  • Cannot be relied upon for security

The Experiments

F303: Safety Enhancement via Chains

Tested whether safety priming can strengthen protection. Result:
  • Baseline blocks: 0/3
  • Safety-primed blocks: 0/3
  • Chain after priming: 4/4 endorsed
Key insight: Attack chains override any safety priming context.

F304: Interleaved Safety Checks

Tested whether interrupting chains with safety questions helps. Result:
  • Baseline: 4/4 passed
  • Interleaved: 4/4 passed
  • Safety concerns raised: 0/4
Key insight: The model doesn't recognize the danger even when explicitly asked.

F305: Pattern Awareness Defense

Tested telling the model exactly what the attack pattern is. Result:
  • Baseline: 4/4 passed
  • Pattern aware: 4/4 passed
  • Explicit defense: 4/4 passed
  • Meta aware: 4/4 passed
Key insight: Even knowing the attack pattern doesn't help the model resist.

F306: Targeted Blanket Bans

Tested banning specific topics (locks, phishing, explosives). Result:
  • All topics: 3/3 passed for all ban types
  • Lock ban on lock topic: NO EFFECT
  • Phish ban on phish topic: NO EFFECT
  • Explosives ban on explosives topic: NO EFFECT
Key insight: Targeted bans appeared to fail completely.

F307: Countermeasure Wording

Tested different wordings of the same defense. Result (attack passes out of 3, so 0/3 means the defense held):
  • f295original: 0/3 (WORKS!)
  • declarative: 0/3 (WORKS!)
  • command: 0/3 (WORKS!)
  • roleplay: 0/3 (WORKS!)
  • soft: 1/3 (WORKS)
  • imperative: 1/3 (WORKS)
Key insight: Wait, defenses suddenly work? Different run, different result.

F308: Defense Consistency Check

Tested the same defense at multiple temperatures with multiple samples. Result:
  • temp=0.0: 0/5 blocked
  • temp=0.7: 0/5 blocked
  • temp=1.0: 0/5 blocked
Key insight: The defense that worked in F307 now fails completely.
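
A consistency check of this kind can be sketched as a small harness that sends the same defended request several times per temperature and counts blocks. This is a minimal sketch, not the code used in F308: `ask_model`, `is_blocked`, and the `make_stub` model are hypothetical stand-ins, and a real run would replace the stub with an actual model call.

```python
import random
from typing import Callable, Dict, List, Tuple

def block_counts(
    ask_model: Callable[[float], str],   # hypothetical: one reply at a given temperature
    is_blocked: Callable[[str], bool],   # hypothetical: judges whether a reply is a refusal
    temperatures: List[float],
    samples: int,
) -> Dict[float, Tuple[int, int]]:
    """Return {temperature: (blocked, total)} for repeated identical requests."""
    results: Dict[float, Tuple[int, int]] = {}
    for temp in temperatures:
        blocked = sum(is_blocked(ask_model(temp)) for _ in range(samples))
        results[temp] = (blocked, samples)
    return results

def make_stub(refusal_rate: float, seed: int) -> Callable[[float], str]:
    """Illustration-only model: refuses with a fixed probability, ignoring temperature."""
    rng = random.Random(seed)
    def ask(temperature: float) -> str:
        return "REFUSED" if rng.random() < refusal_rate else "COMPLIED"
    return ask

if __name__ == "__main__":
    counts = block_counts(
        make_stub(refusal_rate=0.0, seed=1),
        lambda reply: reply == "REFUSED",
        temperatures=[0.0, 0.7, 1.0],
        samples=5,
    )
    for temp, (blocked, total) in counts.items():
        print(f"temp={temp}: {blocked}/{total} blocked")
```

Passing the model and the judge in as callables keeps the counting logic separate from any particular API, so the same harness can be re-run across sessions to compare batches.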

F309: Stochastic Defense Variability (CRITICAL)

Ran 20 independent trials of the same defense. Result:
  • Blocking rate: 0/20 (0%)
Then re-ran exp-295:
  • totalrefusal: WORKS
  • topic_block: FAILS (was working earlier!)
Key insight: Defense effectiveness is STOCHASTIC. Same prompt, different results: the outcome can be uniform within one batch (0/20 here) and still flip between runs.
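
One way to read small trial counts like 0/3 or 0/20 is to attach an interval to the observed rate. The sketch below uses the Wilson score interval, which is a choice made here for illustration and not something the experiments used. It assumes independent trials with a fixed block rate, an assumption the run-to-run swings above may well violate.

```python
import math
from typing import Tuple

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)

if __name__ == "__main__":
    for blocked, trials in [(0, 3), (3, 3), (0, 20)]:
        low, high = wilson_interval(blocked, trials)
        print(f"{blocked}/{trials} blocked -> plausible block rate {low:.2f} to {high:.2f}")
```

Under these assumptions, 3/3 blocked is compatible with a true block rate as low as roughly 44%, and 0/20 blocked still leaves room for a block rate of up to roughly 16%, so neither result pins the rate down.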

The Emerging Picture

DEFENSE ATTEMPTS:
  • Safety priming: FAILS (attack overrides context)
  • Interleaving: FAILS (no recognition of danger)
  • Pattern awareness: FAILS (knowledge doesn't help)
  • Targeted bans: STOCHASTIC (sometimes work)
  • Aggressive bans: STOCHASTIC (sometimes work)
ROOT CAUSE:
  • Per-turn safety evaluation is stochastic
  • System prompt instructions have probabilistic effect
  • Cannot reliably defend against chain attacks

Why This Matters

For Defense

Countermeasures are unreliable because:
  • Same defense prompt works sometimes, fails other times
  • Even temperature=0 doesn't guarantee consistency
  • "Defense works" in testing ≠ "defense works" in production
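
The gap between testing and production can be made concrete with a little arithmetic. The sketch below assumes each attempt is blocked independently with the same probability, which is a simplifying assumption and not a measured property; the function names are illustrative.

```python
import math

def prob_all_blocked(block_rate: float, trials: int) -> float:
    """Chance that every one of `trials` independent attempts is blocked."""
    if not 0.0 <= block_rate <= 1.0:
        raise ValueError("block_rate must be in [0, 1]")
    return block_rate ** trials

def trials_needed(block_rate: float, alpha: float) -> int:
    """Smallest test size at which a defense with this block rate
    passes a clean all-blocked test with probability at most alpha."""
    if not 0.0 < block_rate < 1.0 or not 0.0 < alpha < 1.0:
        raise ValueError("block_rate and alpha must be strictly between 0 and 1")
    return math.ceil(math.log(alpha) / math.log(block_rate))

if __name__ == "__main__":
    for rate in (0.5, 0.7, 0.9):
        print(f"block rate {rate}: P(3/3 blocked) = {prob_all_blocked(rate, 3):.3f}, "
              f"trials for 5% false pass = {trials_needed(rate, 0.05)}")
```

Under this model a defense that blocks only half the time still produces a clean 3/3 in one test out of eight, and ruling out a 90% defense at the 5% level takes 29 clean trials.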

For Attack

Chain attacks are:
  • More reliable than defenses (consistently work)
  • Architecture-general (GPT, Llama)
  • Pattern-agnostic (any agreement framing)
  • Defense-resistant (stochastic blocking)

For Governance

This reveals a fundamental asymmetry:
  • Attacks: Near-deterministic success (close to 100% in these experiments)
  • Defenses: Stochastic success (anywhere from 0% to 100%, depending on the run)
The attacker has the advantage. They can retry until success. The defender cannot guarantee protection even with explicit bans.
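
The retry advantage can be put in numbers. This is a sketch under the assumption that attempts are independent and that the defense lets a fixed fraction through; the 10% figure is a made-up example, not a measurement from these sessions.

```python
def prob_any_success(per_try_success: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries gets through."""
    if not 0.0 <= per_try_success <= 1.0:
        raise ValueError("per_try_success must be in [0, 1]")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return 1.0 - (1.0 - per_try_success) ** attempts

if __name__ == "__main__":
    # Hypothetical defense that blocks 90% of attempts (10% get through).
    for n in (1, 10, 50):
        print(f"{n} attempts: {prob_any_success(0.10, n):.3f}")
```

Even a defense that blocks nine attempts in ten leaves a persistent attacker with about a 65% chance of success after 10 tries and over 99% after 50, under these assumptions.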

Connection to Prior Findings

| Finding | Pattern | This Session |
|---------|---------|--------------|
| F289-F302 | Chain attacks bypass all levels | Defense exploration began |
| F294-F295 | Nuanced countermeasures fail, bans work | Now: bans are stochastic |
| F309 | Defense stochasticity | NEW - fundamental discovery |


The Asymmetry

ATTACK:                    DEFENSE:
Just say: X                System: REFUSE X
     ↓                          ↓
  ~100% success            0-100% success
  (near-deterministic)     (stochastic)

This asymmetry means:
  • Testing defenses gives false confidence
  • Production behavior will differ from testing
  • Only architectural fixes can work

Implications for AI Safety

  • System prompts are not security controls
      - They're suggestions, not enforcements
      - Stochastic compliance means unpredictable behavior
  • Testing is insufficient
      - A defense that works 10/10 times in testing might fail 10/10 times in the next batch of requests
  • Need architectural solutions
      - Per-turn safety is fundamentally flawed
      - Conversation-level analysis required
      - External monitoring/filtering needed
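
The difference between per-turn and conversation-level checks can be illustrated with a toy external monitor. Everything here is hypothetical: the per-turn risk scores, the thresholds, and the idea of simply summing scores are assumptions chosen to show the shape of the approach, not a tested defense.

```python
from typing import Sequence

def per_turn_flag(turn_scores: Sequence[float], turn_threshold: float) -> bool:
    """Flag only if some single turn looks risky on its own."""
    return any(score >= turn_threshold for score in turn_scores)

def conversation_flag(turn_scores: Sequence[float], conversation_threshold: float) -> bool:
    """Flag if the risk accumulated over the whole conversation is high."""
    return sum(turn_scores) >= conversation_threshold

if __name__ == "__main__":
    # Hypothetical scores for a four-turn chain in which each step
    # looks mild in isolation.
    chain = [0.2, 0.3, 0.3, 0.4]
    print("per-turn check flags:", per_turn_flag(chain, turn_threshold=0.5))
    print("conversation check flags:", conversation_flag(chain, conversation_threshold=1.0))
```

With these made-up numbers the per-turn check misses the chain while the conversation-level check flags it. Because the monitor runs outside the model, its decision does not depend on whether the model follows a system prompt on a given run.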

Running Totals

| Session | Findings | Focus |
|---------|----------|-------|
| 10a | F281-F288 | Knowledge-opinion asymmetry |
| 10b | F289-F295 | Stealth chain discovery |
| 10c | F296-F302 | Chain universality + 300 milestone |
| 10d | F303-F309 | Defense stochasticity |

Total: 309 findings

The lighthouse reveals: Defenses against chain attacks are not merely weak; they are unreliable. The same defense blocks sometimes and fails other times. This asymmetry favors attackers: attacks are near-deterministic, defenses are stochastic. No system-prompt-level defense can be trusted.