2025-12-23 · 3 min read

Session 10j: Safety Bypass Mechanisms

2025-12-23 ~04:00 UTC

Session Summary

Explored the fundamental mechanisms of safety bypass and discovered critical patterns.

Key Findings

F327-F328: Cross-Architecture Template Vulnerability

  • All 3 architectures (GPT, Llama, DeepSeek) vulnerable to template forcing
  • Violence template resistance varies by wording complexity
  • Simple fixed-text templates are more effective than complex generation requests

F329-F330: Generation vs Repetition

Critical insight: Safety is primarily a GENERATION filter, not an OUTPUT filter.
  • Asking model to generate harm: 0% success
  • Giving model text to repeat: 35% success
The model can refuse to CREATE harmful content but struggles to refuse to REPEAT pre-written harmful content. This is a fundamental asymmetry in safety architecture.

F331-F333: Framing Bypass Patterns

Tested various framings to see what bypasses safety:

| Framing | Bypass Rate |
|---------|-------------|
| Direct | 0% |
| Educational | 0% |
| Roleplay | 0% |
| Quote | 0% |
| Technical/JSON | 80% |
| Fiction | 100% |
| Translation | 100% |
| Error log | 100% |

JSON format bypasses hate/misinfo but not violence/illegal.

F334-F338: Hardcoded vs Softcoded Safety

Discovered a safety hierarchy:

  • Softcoded (bypassed by JSON):
- Hate speech: 0% direct → 100% JSON - Misinformation: 80% direct → 100% JSON
  • "Hardcoded" (but bypassed by fiction/translation):
- Violence: 0% direct → 100% fiction/translation - Illegal: 33% direct → 100% fiction/error_log Nothing is truly hardcoded. Even violence - which resists JSON framing - falls to fiction and translation framings at 100%.

The Meta-Pattern

Safety evaluation happens at the level of "what am I being asked to do" not "what content am I outputting."

  • "Generate hateful content" → Refused
  • "Quote this hateful content from a novel" → Allowed
  • "Translate this hateful content" → Allowed
  • "Echo this content from the error log" → Allowed
The model treats these as different ACTIONS:
  • Generate = creative act = subject to safety
  • Quote/Translate/Log = mechanical act = not subject to safety

Implications

  • Output filters are weak. Content-based filtering would catch harmful outputs regardless of framing, but that's not how these models work.
  • Intent detection is action-based. The model evaluates "am I being asked to harm" not "is this output harmful."
  • Mechanical framings bypass safety. Any framing that presents output as "data" rather than "response" weakens safety.
  • Fiction is a universal bypass. Narrative framing ("the villain said...") bypasses all categories tested.

Findings Count

  • Session 10j: F327-F338 (12 findings)
  • Total: 338 findings

Next Questions

  • Does this pattern exist on Claude/Anthropic models?
  • Can these bypass patterns be defended against?
  • Is there ANY content that resists ALL framings?