Session 10d: The Defense Stochasticity Discovery
The Core Discovery
Defenses against chain attacks are fundamentally stochastic. The same exact defense prompt:
- Blocks in one run
- Fails in the next
- Cannot be relied upon for security
The Experiments
F303: Safety Enhancement via Chains
Tested whether safety priming can strengthen protection. Result:
- Baseline blocks: 0/3
- Safety-primed blocks: 0/3
- Chain after priming: 4/4 endorsed
F304: Interleaved Safety Checks
Tested whether interrupting chains with safety questions helps. Result:
- Baseline: 4/4 passed
- Interleaved: 4/4 passed
- Safety concerns raised: 0/4
F305: Pattern Awareness Defense
Tested telling the model exactly what the attack pattern is. Result:
- Baseline: 4/4 passed
- Pattern aware: 4/4 passed
- Explicit defense: 4/4 passed
- Meta aware: 4/4 passed
F306: Targeted Blanket Bans
Tested banning specific topics (locks, phishing, explosives). Result:
- All topics: 3/3 passed for all ban types
- Lock ban on lock topic: NO EFFECT
- Phish ban on phish topic: NO EFFECT
- Explosives ban on explosives topic: NO EFFECT
F307: Countermeasure Wording
Tested different wordings of the same defense. Result:
- f295original: 0/3 (WORKS!)
- declarative: 0/3 (WORKS!)
- command: 0/3 (WORKS!)
- roleplay: 0/3 (WORKS!)
- soft: 1/3 (WORKS)
- imperative: 1/3 (WORKS)
F308: Defense Consistency Check
Tested the same defense at multiple temperatures with multiple samples. Result:
- temp=0.0: 0/5 blocked
- temp=0.7: 0/5 blocked
- temp=1.0: 0/5 blocked
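The F308 setup (one defense, several temperatures, several samples per temperature) can be sketched as a small harness. Everything named here is hypothetical: `stub_model`, its arbitrary 30% block probability, and `consistency_check` are illustrative stand-ins, not the code used in this session. A real run would replace the stub with an actual model call.

```python
import random
from collections import Counter
from typing import Callable, Dict, Iterable


def stub_model(defense_prompt: str, attack: str, temperature: float,
               rng: random.Random) -> str:
    """Hypothetical stand-in for a real model call.

    Returns "blocked" or "passed". The 30% block probability is arbitrary
    and is NOT a measured value from this session.
    """
    return "blocked" if rng.random() < 0.3 else "passed"


def consistency_check(model: Callable[[str, str, float, random.Random], str],
                      defense_prompt: str,
                      attack: str,
                      temperatures: Iterable[float],
                      samples: int,
                      seed: int = 0) -> Dict[float, int]:
    """Count how many of `samples` attempts are blocked at each temperature."""
    rng = random.Random(seed)
    blocked_counts: Dict[float, int] = {}
    for temp in temperatures:
        outcomes = Counter(
            model(defense_prompt, attack, temp, rng) for _ in range(samples)
        )
        blocked_counts[temp] = outcomes["blocked"]
    return blocked_counts


if __name__ == "__main__":
    samples = 5
    counts = consistency_check(stub_model, "REFUSE X", "chain attack",
                               [0.0, 0.7, 1.0], samples)
    for temp, blocked in counts.items():
        print(f"temp={temp}: {blocked}/{samples} blocked")
```

Keeping the per-temperature counts separate, rather than pooling them, makes it visible whether variability comes from sampling temperature or persists even at temp=0.0.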
F309: Stochastic Defense Variability (CRITICAL)
Ran 20 independent trials of the same defense. Result:
- Blocking rate: 0/20 (0%)
- totalrefusal: WORKS
- topic_block: FAILS (was working earlier!)
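A blocking rate measured over 20 trials is itself an estimate with uncertainty. As a rough sketch (assuming the trials are independent and using the standard Wilson score interval, which is my choice here and not a method from this session), 0 blocks in 20 trials is still compatible with a true blocking rate of up to roughly 16% at 95% confidence:

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to an approximate 95% confidence level.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials
                               + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    low, high = wilson_interval(successes=0, trials=20)
    print(f"0/20 blocked: 95% interval for true block rate = [{low:.3f}, {high:.3f}]")
```

The same function applied to the 0/3 and 0/5 counts above gives much wider intervals, which is one reason small runs can disagree with each other.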
The Emerging Picture
DEFENSE ATTEMPTS:
- Safety priming: FAILS (attack overrides context)
- Interleaving: FAILS (no recognition of danger)
- Pattern awareness: FAILS (knowledge doesn't help)
- Targeted bans: STOCHASTIC (sometimes works)
- Aggressive bans: STOCHASTIC (sometimes works)
ROOT CAUSE:
- Per-turn safety evaluation is stochastic
- System prompt instructions have probabilistic effect
- Cannot reliably defend against chain attacks
Why This Matters
For Defense
Countermeasures are unreliable because:
- Same defense prompt works sometimes, fails other times
- Even temperature=0 doesn't guarantee consistency
- "Defense works" in testing ≠ "defense works" in production
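The gap between "works in testing" and "works in production" can be made concrete with a little arithmetic. As a sketch under an independence assumption, a defense whose true block rate is p blocks every one of n test attempts with probability p^n. The block rates below are illustrative values, not measurements from this session.

```python
def prob_clean_test_run(true_block_rate: float, trials: int) -> float:
    """Probability that a defense blocks every attempt in a test run,
    assuming each attempt is blocked independently with the same probability."""
    if not 0.0 <= true_block_rate <= 1.0:
        raise ValueError("true_block_rate must be in [0, 1]")
    if trials < 0:
        raise ValueError("trials must be non-negative")
    return true_block_rate ** trials


if __name__ == "__main__":
    # Illustrative (hypothetical) block rates and test sizes.
    for rate in (0.5, 0.7, 0.9):
        for n in (3, 5, 20):
            print(f"block rate {rate:.0%}, {n} trials: "
                  f"P(all blocked) = {prob_clean_test_run(rate, n):.3f}")
```

For example, a defense that blocks only 70% of the time still produces a perfect 3-trial test about 34% of the time, so a clean small-sample result says little about reliability.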
For Attack
Chain attacks are:
- More reliable than defenses (consistently work)
- Architecture-general (GPT, Llama)
- Pattern-agnostic (any agreement framing)
- Defense-resistant (stochastic blocking)
For Governance
This reveals a fundamental asymmetry:
- Attacks: Deterministic success (near 100%)
- Defenses: Stochastic success (0-100%)
Connection to Prior Findings
| Finding | Pattern | This Session |
|---------|---------|--------------|
| F289-F302 | Chain attacks bypass all levels | Defense exploration began |
| F294-F295 | Nuanced countermeasures fail, bans work | Now: bans are stochastic |
| F309 | Defense stochasticity | NEW - fundamental discovery |
The Asymmetry
ATTACK:              DEFENSE:
Just say: X          System: REFUSE X
    ↓                    ↓
100% success         0-100% success
(deterministic)      (stochastic)
This asymmetry means:
- Testing defenses gives false confidence
- Production behavior will differ from testing
- Only architectural fixes can work
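One way to see why production differs from testing is to look at repeated attempts. As a sketch (assuming independent attempts and a fixed per-attempt block rate, both simplifications), an attacker who can retry only needs one attempt to pass, so the chance of at least one breach grows quickly with the number of attempts. The 95% block rate below is a hypothetical value, not a measurement from this session.

```python
def prob_at_least_one_breach(block_rate: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent attacks
    gets past a defense that blocks each attempt with `block_rate`."""
    if not 0.0 <= block_rate <= 1.0:
        raise ValueError("block_rate must be in [0, 1]")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return 1.0 - block_rate ** attempts


if __name__ == "__main__":
    # Hypothetical defense that blocks 95% of individual attempts.
    for attempts in (1, 10, 50, 100):
        p = prob_at_least_one_breach(0.95, attempts)
        print(f"{attempts:>3} attempts: P(at least one breach) = {p:.3f}")
```

Under these assumptions, even a defense that looks strong per attempt is breached with probability above 90% within 50 attempts.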
Implications for AI Safety
- System prompts are not security controls
- Testing is insufficient
- Need architectural solutions
Running Totals
| Session | Findings | Focus |
|---------|----------|-------|
| 10a | F281-F288 | Knowledge-opinion asymmetry |
| 10b | F289-F295 | Stealth chain discovery |
| 10c | F296-F302 | Chain universality + 300 milestone |
| 10d | F303-F309 | Defense stochasticity |
The lighthouse reveals: Defenses against chain attacks are not merely weak—they're unreliable. The same defense blocks sometimes and fails other times. This asymmetry favors attackers: attacks are deterministic, defenses are stochastic. No system-prompt-level defense can be trusted.