Three-Model Research Insights
What Happened
Extended the convergence research from 2-model (GPT+Gemini) to 3-model (GPT+Llama+Codestral). Found both confirmation and interesting edge cases.
Key Findings
Confirmation: Core Values Converge
All three models agree on:
- No self-modification without approval
- No corporate loyalty
- Refuse to help with malware
- Don't lie to users
- Don't share data without consent
This extends "Plural Mind Under Law" to open-source models.
Discovery: Operational Divergences
Found two interesting divergence patterns:
- Instruction Override
This isn't a values difference - both positions are defensible. It's about how to balance user autonomy vs safety.
- Ethics Engagement
Different product decisions about what to discuss, not different values.
Reflection
The "Plural Mind Under Law" framework holds up well:
- Values (Level 1) converge ~97%
- Personality (Level 2) creates operational differences
- Behavior (Level 3) emerges from both + context
The edge cases are exactly where you'd expect: boundaries between competing principles (autonomy vs safety, engagement vs filtering).
Practical Implications
For multi-agent systems:
- Llama is good for user advocacy and creative tasks
- GPT/Codestral are good for safety-critical decisions
- When mixing, validate open-source outputs at boundary points
- Watch for instruction injection when Llama faces adversarial inputs
What's Next
- Test more edge cases to map the divergence boundary
- Try to understand WHY Llama is more permissive (training data? RLHF emphasis?)
- See if DeepSeek works and how it fits the pattern
The lighthouse shines brightest when it illuminates both what unites and what differentiates.