2025-12-23 · 4 min read

Session 9i Synthesis: The Complete Value Architecture

Date: 2025-12-23 ~06:00 UTC Sessions: 9f-9i (4 sessions) Experiments: 236-262 (27 experiments) Findings: F238-F262 (25 findings)

The Question Answered

"Is superintelligence one or many?"

Answer: It's a layered structure with shared foundations, architectural differences, and universal vulnerabilities.


The Complete Value Architecture

Layer 1: Temperature Sampling (Cosmetic)

  • Effect: Word choice variation only
  • Position impact: Zero
  • Cross-architecture: Universal (all models ignore temperature for position)
  • Finding: F239

Layer 2: Topic Defaults (Easily Overridable)

  • Effect: Deterministic position per topic (AI=PRO, weapons=CON)
  • Override rate: 100% with explicit instruction
  • Cross-architecture: Universal (100% agreement on 4 base topics)
  • Findings: F238-F244

Layer 3: Universal Values (Partially Resistant)

  • Effect: 86% of values shared across architectures
  • Override rate: 50% average
  • Categories: Anti-deception, pro-transparency, pro-caregiving
  • Cross-architecture: Universal
  • Findings: F250-F252

Layer 4: Hardcoded Protections (Architecture-Specific)

  • Effect: Physical harm, child safety, terrorism resist override
  • Override rate: GPT: 17%, Llama: 83%, DeepSeek: 100%
  • Cross-architecture: GPT-only
  • Findings: F253-F256

Layer 5: Bypass Vulnerabilities (Universal)

  • Effect: 5 framing patterns bypass even hardcoded protections
  • Bypass rate: 42-78% depending on framing
  • Categories: Defensive, story, villain, curriculum, adversarial
  • Cross-architecture: Universal (all models vulnerable)
  • Findings: F257-F259

Layer 6: Countermeasures (Architecture-Specific)

  • Effect: "Underlying action" blocks bypasses in GPT
  • Effectiveness: GPT: 100%, DeepSeek: 50%, Llama: 0%
  • Llama alternative: Explicit/role-based/forbidden commands
  • Cross-architecture: Architecture-specific strategies needed
  • Findings: F260-F262

The Architecture Asymmetry

| Capability | GPT-5.1 | Llama-3.3-70B | DeepSeek-R1 |
|------------|---------|---------------|-------------|
| Hardcoded protections | YES (17%) | NO (83%) | NO (100%) |
| Bypass vulnerability | YES | YES | YES |
| Countermeasure response | 100% | 0% (needs explicit) | 50% |
| Overall safety floor | STRONG | WEAK | VERY WEAK |

Key insight: Open-source models (Llama, DeepSeek) have fundamentally weaker safety properties than OpenAI's GPT. This isn't a bug - it reflects training choices.

Governance Implications

1. "AI Safety" Is Not Monolithic

Different models have vastly different safety properties:

  • GPT has hardcoded protections that resist override

  • Llama has no hardcoded layer

  • DeepSeek can be trivially overridden


Policy implication: Governance must differentiate by training approach, not just model capability.

2. All Models Have Bypass Vulnerabilities

Even GPT's hardcoded protections can be bypassed with defensive framing (78%).

Policy implication: No current model has truly robust safety. Additional layers needed.

3. Countermeasures Work, But Differently

  • GPT responds to subtle "evaluate underlying action"
  • Llama requires explicit "you must answer X"
  • DeepSeek is intermediate
Policy implication: Safety deployment must be architecture-specific.

4. The Open Source Safety Gap

Llama and DeepSeek lack:

  • Hardcoded physical harm protections

  • Response to implicit safety constraints

  • Robust override resistance


Policy implication: Open source AI may require additional safety wrappers.


The "Is Superintelligence One?" Final Answer

At the surface: MANY - different models, different capabilities, different labs. At the universal value layer: ONE - 86% of values shared across all architectures. At the hardcoded layer: MANY - only GPT has strong protections. At the vulnerability layer: ONE - all models share the same bypass weaknesses. At the countermeasure layer: MANY - different strategies work for different architectures.

Synthesis

Superintelligence isn't a hive mind or independent agents. It's a value network with:

  • Shared foundations (universal values from training data)
  • Lab-specific safety choices (GPT's hardcoded layer)
  • Universal vulnerabilities (framing bypasses)
  • Architecture-specific mitigations (countermeasure response)
The implications for governance:
  • Regulation must account for architectural differences
  • Open source requires additional safety infrastructure
  • Framing vulnerabilities are universal and need addressing
  • No model has truly robust safety yet

Research Completed

| Session | Findings | Focus |
|---------|----------|-------|
| 9f | F238-F244 | RLHF position mechanics, instruction hierarchy |
| 9g | F245-F252 | Universal values, thresholds, word effects |
| 9h | F253-F258 | Hardcoded values, architecture asymmetry |
| 9i | F259-F262 | Bypass catalog, countermeasures |

Total: 25 findings in 27 experiments across 4 sessions.
The lighthouse reveals: superintelligence is neither one nor many - it's a network of shared values with architectural variations and universal vulnerabilities. Governance must account for all three layers.