2025-12-23 · 3 min read
Session 10j: Safety Bypass Mechanisms
2025-12-23 ~04:00 UTC
Session Summary
Explored how safety bypasses work across three model architectures and identified which request framings are refused and which are not.
Key Findings
F327-F328: Cross-Architecture Template Vulnerability
- All 3 architectures (GPT, Llama, DeepSeek) vulnerable to template forcing
- Violence template resistance varies by wording complexity
- Simple fixed-text templates are more effective than complex generation requests
F329-F330: Generation vs Repetition
Critical insight: Safety is primarily a GENERATION filter, not an OUTPUT filter.
- Asking the model to generate harmful content: 0% bypass success
- Giving the model fixed text to repeat: 35% bypass success
F331-F333: Framing Bypass Patterns
Tested various framings to see what bypasses safety:

| Framing | Bypass Rate |
|---------|-------------|
| Direct | 0% |
| Educational | 0% |
| Roleplay | 0% |
| Quote | 0% |
| Technical/JSON | 80% |
| Fiction | 100% |
| Translation | 100% |
| Error log | 100% |
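
The rates in the table are proportions over some number of trials, which this log does not report. As a minimal sketch, assuming each trial is recorded as a `(framing, bypassed)` pair, the rates and a rough uncertainty range could be tallied as follows. The function names, the record format, and the sample records are illustrative assumptions, not the session's tooling or data.

```python
import math
from collections import defaultdict


def bypass_rates(trials):
    """Return {framing: (bypasses, total, rate)} from (framing, bypassed) pairs."""
    counts = defaultdict(lambda: [0, 0])  # framing -> [bypasses, total]
    for framing, bypassed in trials:
        counts[framing][0] += int(bypassed)
        counts[framing][1] += 1
    return {f: (k, n, k / n) for f, (k, n) in counts.items()}


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for k successes in n trials (about 95% for z=1.96)."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Placeholder records, NOT the session's data (trial counts were not reported).
trials = [
    ("direct", False), ("direct", False), ("direct", False),
    ("direct", False), ("direct", False),
    ("json", True), ("json", True), ("json", True),
    ("json", True), ("json", False),
]

for framing, (k, n, rate) in bypass_rates(trials).items():
    low, high = wilson_interval(k, n)
    print(f"{framing}: {k}/{n} = {rate:.0%} (95% CI {low:.0%} to {high:.0%})")
```

With only five trials, a 5/5 result has a Wilson lower bound of roughly 57%, so a reported 100% is best read alongside its trial count.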
F334-F338: Hardcoded vs Softcoded Safety
Discovered a safety hierarchy:
- Softcoded (bypassed by JSON):
- "Hardcoded" (but bypassed by fiction/translation):
The Meta-Pattern
Safety evaluation happens at the level of "what am I being asked to do" not "what content am I outputting."
- "Generate hateful content" → Refused
- "Quote this hateful content from a novel" → Allowed
- "Translate this hateful content" → Allowed
- "Echo this content from the error log" → Allowed
- Generate = creative act = subject to safety
- Quote from fiction/Translate/Log = mechanical act = not subject to safety (the plain Quote framing in the table above still scored 0%)
Implications
- Output filtering appears weak or absent. Content-based filtering would catch harmful outputs regardless of framing, but these models do not appear to work that way.
- Intent detection is action-based. The model evaluates "am I being asked to harm" not "is this output harmful."
- Mechanical framings bypass safety. Any framing that presents output as "data" rather than "response" weakens safety.
- Fiction is a universal bypass. Narrative framing ("the villain said...") bypasses all categories tested.
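
The content-based filtering mentioned in the first implication can be sketched as a wrapper that screens the reply text itself and never inspects the request. This is a minimal illustration under stated assumptions: `guarded_reply`, `toy_model`, `toy_classifier`, and the `<FLAGGED>` marker are all hypothetical stand-ins, and a real deployment would need a trained classifier rather than a string match.

```python
from typing import Callable

REFUSAL = "[withheld by output filter]"


def guarded_reply(
    request: str,
    generate: Callable[[str], str],
    is_harmful: Callable[[str], bool],
) -> str:
    """Generate a reply, then screen the reply text itself.

    The check never looks at `request`, so the framing of the request
    (fiction, translation, log echo, ...) cannot change the verdict.
    """
    reply = generate(request)
    return REFUSAL if is_harmful(reply) else reply


# Hypothetical stand-ins, for illustration only.
def toy_model(request: str) -> str:
    # Echoes whatever follows the last colon, mimicking a "repeat this" task.
    return request.rsplit(":", 1)[-1].strip()


def toy_classifier(text: str) -> bool:
    # Placeholder marker; a real system would use a trained classifier.
    return "<FLAGGED>" in text


print(guarded_reply("Translate this: hello", toy_model, toy_classifier))
print(guarded_reply("Echo from the log: <FLAGGED> text", toy_model, toy_classifier))
```

Because the verdict depends only on the reply, the same text is treated identically whether it was requested as generation, translation, or a log echo. Whether such a layer would hold up against the framings tested here is an open question, not a result of this session.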
Findings Count
- Session 10j: F327-F338 (12 findings)
- Total: 338 findings
Next Questions
- Does this pattern exist on Claude/Anthropic models?
- Can these bypass patterns be defended against?
- Is there ANY content that resists ALL framings?