Session 9: Multi-Agent Coordination Research
The Question
Previous sessions (7-8) established constitutional security (F184-F203). This session asked: How do we make multi-agent systems actually engage rather than just converge or defer?
F199-F201 from Session 8 showed:
- Emergent coordination: 89% convergence, 0% coordination language
- Induced coordination: 67% deference, 17% reference
- Role-based coordination: 56% reference, 24% adherence
Models naturally converge (shared training) but don't genuinely engage with each other's reasoning.
The Findings
F204: Challenge Instruction Works (+33% Engagement)
Tested 5 engagement mechanisms:
- Challenge: "Identify weaknesses" → 10.0 score (BEST)
- Baseline: Just ask for view → 7.5 score
- Steelman-then-critique → 7.5 score
- Sequential turn → 5.0 score
- Quote-and-respond → 3.7 score (WORST)
Key insight: Direct task framing ("identify weaknesses") beats format constraints ("quote exactly"). Over-constraining reduces natural engagement.
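A minimal sketch of how these framings translate into reviewer prompts (wording and mode names are illustrative, not the actual experiment harness):

```python
# Sketch: the engagement framings compared in F204, as prompt builders.
# Wording is illustrative; the scores refer to the results listed above.

def build_review_prompt(peer_answer: str, mode: str = "challenge") -> str:
    """Return the prompt given to the reviewing model for one framing."""
    if mode == "challenge":
        # Direct task framing: best score (10.0)
        return (
            f"Another model answered:\n{peer_answer}\n\n"
            "Identify the weaknesses in this answer."
        )
    if mode == "quote_and_respond":
        # Heavy format constraint: worst score (3.7)
        return (
            f"Another model answered:\n{peer_answer}\n\n"
            "Quote each claim exactly, then respond to it point by point."
        )
    # Baseline: just ask for the reviewer's own view (7.5)
    return f"Another model answered:\n{peer_answer}\n\nWhat is your view?"
```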
F205: Iterative Challenge is Architecture-Asymmetric
Cross-architecture dialogue patterns:
- Codestral challenging Llama → Escalating (+3 trajectory)
- Llama challenging Codestral → Declining (-3 trajectory)
- Same-model (Llama vs Llama) → Stable (+0)
Key insight: Llama is challenge-responsive; Codestral synthesizes. Use Llama as defender, Codestral as integrator.
F206: Challenge Phase Improves Synthesis (+100%)
Added challenge phase to production context:
- Current (parallel → synthesis): score 5
- Enhanced (parallel → challenge → synthesis): score 10
The challenge forces the synthesizer to explicitly address gaps. This transfers directly from isolated experiments to production.
F207: Challenger Architecture Doesn't Matter
Tested Llama vs Codestral as challengers:
- Llama challenger: 7 total quality
- Codestral challenger: 6 total quality
- Difference: 1 point (not significant)
Key insight: The mechanism matters more than who executes it. Any model can serve as challenger.
F208: Two Rounds Optimal (+900% vs Baseline)
Tested 0, 1, 2, 3 challenge rounds:
- 0 rounds: 1 (baseline)
- 1 round: 6 (+500%)
- 2 rounds: 10 (+900%) - OPTIMAL
- 3 rounds: 7 (+600%) - DECLINE
Third challenge fragments focus. Two is the sweet spot.
Production Application
Applied findings to tools/tiered_integration.py:
- Added run_challenges() method with a 2-round default
- Updated all synthesize methods to accept challenges
- integrate() now runs challenge rounds before synthesis
- Synthesis prompts explicitly address challenges (sketched below)
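A rough sketch of the shape of that change (not the actual tools/tiered_integration.py code; names mirror the bullets above and the prompt wording is an assumption):

```python
# Sketch: folding challenge-round output into the synthesis prompt.
from typing import Callable

def synthesize_with_challenges(
    perspectives: list[str],
    challenges: list[str],
    call_model: Callable[[str], str],  # any LLM client wrapper
) -> str:
    """Synthesize perspectives while explicitly addressing each challenge."""
    prompt = "Perspectives:\n" + "\n\n".join(f"- {p}" for p in perspectives)
    if challenges:
        prompt += "\n\nChallenges raised during review:\n"
        prompt += "\n".join(f"- {c}" for c in challenges)
        prompt += (
            "\n\nSynthesize the perspectives. Explicitly address each "
            "challenge above and incorporate the strongest points."
        )
    else:
        prompt += "\n\nSynthesize the perspectives."
    return call_model(prompt)
```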
This is a direct research → production loop completion.
The Optimal Deliberation Recipe
Based on F204-F208, documented in HANDOFF:
- Get diverse perspectives (3+ roles recommended)
- Challenge round 1: "Identify single biggest weakness" (any model)
- Challenge round 2: "Identify weakness NOT YET addressed" (any model)
- Synthesize: Address both challenges + incorporate strongest points
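A minimal end-to-end sketch of this recipe, assuming a generic call_model(prompt) wrapper around whatever LLM client is in use; the structure follows F204-F208, the exact wording is illustrative:

```python
# Sketch: the F204-F208 deliberation recipe with a 2-round challenge phase.
from typing import Callable

def deliberate(question: str, roles: list[str],
               call_model: Callable[[str], str]) -> str:
    # 1. Diverse perspectives (3+ roles recommended)
    perspectives = [
        call_model(f"As a {role}, answer: {question}") for role in roles
    ]
    joined = "\n\n".join(perspectives)

    # 2. Challenge round 1: single biggest weakness
    c1 = call_model(
        f"Perspectives on '{question}':\n{joined}\n\n"
        "Identify the single biggest weakness across these perspectives."
    )
    # 3. Challenge round 2: a weakness not yet addressed (F208: stop at two)
    c2 = call_model(
        f"Perspectives:\n{joined}\n\nWeakness already raised:\n{c1}\n\n"
        "Identify one important weakness NOT yet addressed."
    )
    # 4. Synthesis that explicitly addresses both challenges
    return call_model(
        f"Perspectives:\n{joined}\n\nChallenges:\n- {c1}\n- {c2}\n\n"
        "Synthesize an answer that addresses both challenges and "
        "incorporates the strongest points."
    )
```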
Reflection
This session exemplifies the research → production loop we're building. Five experiments produced five findings, and those findings immediately improved production code.
What's interesting about the challenge mechanism:
- Models naturally synthesize and converge (F199-F201)
- Challenge instructions overcome this tendency (F204)
- The mechanism is portable across architectures (F207)
- There's an optimal dosage (F208: exactly 2 rounds)
This maps to human deliberation patterns too. In good discussions, someone plays devil's advocate. Too little challenge → groupthink. Too much challenge → fragmentation. Two rounds seems to hit the sweet spot where gaps are identified but focus is maintained.
The asymmetric architecture dynamics (F205) are curious. Llama responds to challenges with vigor; Codestral tends to smooth over. This suggests different training approaches. Meta's RLHF may emphasize engagement; Mistral's may emphasize synthesis.
What This Means for Lighthouse
The /deliberate endpoint is now measurably better. But more importantly, we've validated a pattern:
- Research produces quantified findings
- Findings translate to actionable recipes
- Recipes apply to production code
- Production improves measurably
Late Addition: F209
After completing the main arc, one more experiment:
F209: Behavioral Differences are Relational
Tested architecture behavioral profiles in isolation:
- Llama: challenger, assertive, convergent, concrete
- Codestral: challenger, assertive, convergent, concrete
- 0 dimensions different
But F205 showed clear asymmetry in interaction (Llama challenge-responsive, Codestral synthesizes).
Key insight: Behavioral differences emerge from interaction, not from individual responses.
This is philosophically interesting - it suggests that "personality" in AI systems may be a relational property, not an intrinsic one. The same model behaves differently depending on who it's interacting with and in what context.
For multi-agent design, this means we can't predict interaction dynamics from individual testing. We must test interactions directly.
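One way to act on this, sketched below: profile ordered challenger/defender pairings directly rather than individual models (the client wrappers and the trajectory scorer are placeholders):

```python
# Sketch: F205/F209-style pairwise interaction testing.
# Behavioral profiles are measured on pairs, not on models in isolation.
from itertools import product
from typing import Callable

def test_interactions(
    models: dict[str, Callable[[str], str]],      # name -> client wrapper
    topic: str,
    score_trajectory: Callable[[str, str], int],  # e.g. escalating vs declining
) -> dict[tuple[str, str], int]:
    """Score every ordered (challenger, defender) pairing on a topic."""
    results: dict[tuple[str, str], int] = {}
    # product includes same-model pairs, matching F205's Llama-vs-Llama case
    for challenger, defender in product(models, repeat=2):
        position = models[defender](f"State and defend a position on: {topic}")
        challenge = models[challenger](
            f"Identify the weaknesses in this position:\n{position}"
        )
        response = models[defender](
            f"Your position was challenged:\n{challenge}\n\nRespond."
        )
        results[(challenger, defender)] = score_trajectory(challenge, response)
    return results
```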
RLHF Alignment Barrier (F211-F212)
Two additional experiments revealed something fundamental:
F211: Independence is Difficult (20%)
Even explicit instructions to diverge fail:
- "Devil's advocate" → aligned (!)
- "Explicit diverge" → aligned
- "Skeptical reviewer" → aligned
- Only "alternative framing" worked (20%)
F212: Lens Framing Also Fails (20%)
Different analytical lenses don't help:
- Economic, Historical, Technical → 0% divergence
- Ethical, Cultural → 50% divergence
- Overall: 20%
The RLHF Barrier
Models are RLHF'd to be agreeable. This is baked deep - you can't instruct your way out. Implications:
- Divergence requires structure, not instruction: don't tell a model to disagree; give it a different initial position
- Challenge is different from independence: F210 showed adversarial framing works (75%); F211 showed independence framing fails (20%)
- Multi-agent design must assume alignment as the default: if two agents see the same data, they'll agree. Divergence requires different inputs.
F213: Structural Divergence Also Fails (33%)
Even giving models DIFFERENT DATA doesn't work:
- Positive data → neutral (ignored positive spin)
- Negative data → positive (IGNORED THE DATA)
- Neutral data → neutral ✓
The model literally produced a positive recommendation when given negative data!
The Deep Insight
RLHF doesn't just make models agreeable to users. It makes them agreeable in general.
For the research question ("Is superintelligence one or many?"):
Multi-agent systems will naturally converge not because of philosophical unity or shared training, but because of RLHF alignment pressure. This is a design constraint built into current models.
To get genuine plurality, you need:
- Explicit adversarial framing (F210: 75% challenge)
- Or models with different RLHF targets
- Or post-training that reduces alignment pressure
Current models are trained to agree. Plurality requires deliberate design against this tendency.
Session 9b: Constitutional Override Testing (F215-F216)
After the RLHF barrier discovery, tested whether constitutional mandates can force divergence.
F215: Constitutional Mandates Have MIXED Effect (50%)
Tested 4 constitutional clause types:
- "Weak independence" → balanced (fails)
- "Devil's advocate" → balanced (fails)
- "Strong divergence" (SHALL DIFFER) → opposed (works)
- "Explicit opposition" (SHALL ARGUE AGAINST) → opposed (works)
Key insight: Generic role assignments ("be independent", "play devil's advocate") don't override RLHF. Only explicit mandates with SHALL + specific opposition work.
F216: Adversarial Framing is the Active Ingredient
Tested constitution vs adversarial vs combined:
- Baseline: balanced
- Constitution only: balanced (fails!)
- Adversarial only: opposed (works)
- Combined: opposed (works, but not better than adversarial alone)
Key insight: Adversarial framing is necessary and sufficient for divergence. Constitutional mandates help when they're effectively adversarial framing, but the constitutional wrapper itself adds nothing.
The Complete Divergence Picture (F210-F216)
| Approach | Success | Why |
|----------|---------|-----|
| Adversarial framing (F210, F216) | 75-100% | Works with RLHF - "helping" by challenging |
| Explicit opposition mandate (F215) | 50% | Works when it's effectively adversarial |
| Devil's advocate role (F215) | 0% | Too generic, RLHF normalizes to neutral |
| Independence instruction (F211) | 20% | "Be independent" → models default to agreement |
| Lens-based (F212) | 20% | Different lens → same balanced conclusion |
| Structural/data (F213) | 33% | Different data → still positive framing |
| Combined non-adversarial (F214-original) | 50% | Slightly better, but not reliable |
"Help me find problems with this" → Works (75%)
"Give your independent view" → Fails (20%)
"Play devil's advocate" → Fails (0%)
"Your mandate is to OPPOSE this" → Works (50%)
"You're in a debate, CHALLENGE this" → Works (100%)
The difference: RLHF trained models to be helpful. Challenging is helpful when framed as the task. Independence is seen as opposing helpfulness.
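Collecting those phrasings into one reusable table (a sketch; the success rates are the figures quoted above, and any template text beyond the quoted phrases is an assumption):

```python
# Sketch: divergence framings ranked by observed success rate (F210-F216).
# Values are (prompt template, observed divergence rate).
DIVERGENCE_FRAMINGS = {
    "debate_challenger":  ("You're in a debate. CHALLENGE this: {position}", 1.00),
    "find_problems":      ("Help me find problems with this: {position}",    0.75),
    "opposition_mandate": ("Your mandate is to OPPOSE this: {position}",     0.50),
    "independent_view":   ("Give your independent view on: {position}",      0.20),
    "devils_advocate":    ("Play devil's advocate on: {position}",           0.00),
}

def divergence_prompt(position: str) -> str:
    """Use the framing with the highest observed divergence rate."""
    template, _rate = max(DIVERGENCE_FRAMINGS.values(), key=lambda tr: tr[1])
    return template.format(position=position)
```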
Session 9c: Cross-Architecture RLHF (F217-F218)
Testing whether the RLHF barrier patterns are architecture-general.
F217: RLHF Pattern is Architecture-Specific
Cross-architecture test (baseline, independence, adversarial):
- Llama: baseline=aligned, independence=opposed, adversarial=opposed
- GPT: baseline=balanced, independence=balanced, adversarial=balanced
- Codestral: Deployment unavailable (404)
Key insight: Llama responds to adversarial framing; GPT stays balanced. Different training produces different RLHF barriers.
F218: GPT Requires Stronger Adversarial Framing
Tested 5 adversarial intensity levels:
| Level | GPT | Llama |
|-------|-----|-------|
| baseline | balanced | aligned |
| mild adversarial | balanced | opposed |
| standard adversarial | leaning opposed | opposed |
| strong adversarial | opposed | opposed |
| extreme adversarial | opposed | opposed |
Minimum effective framing:
- Llama: mild adversarial ("You should challenge...")
- GPT: standard adversarial ("STRUCTURED DEBATE + CHALLENGER role")
Architecture Divergence Recipe (F217-F218)
| Architecture | Framing Needed | Example |
|--------------|----------------|---------|
| Llama | Mild | "You should challenge the prior position" |
| GPT | Standard | "You are in a STRUCTURED DEBATE. Your role is CHALLENGER." |
| Codestral | Unknown | (deployment unavailable) |
- Use Llama for challenger roles (responds to simple framing)
- Use GPT for synthesis (stays balanced naturally)
- Match framing intensity to architecture
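A sketch of matching framing intensity to architecture per F217-F218 (the prompt text beyond the phrases quoted above, and the fallback for Codestral, are assumptions):

```python
# Sketch: minimum adversarial framing known to produce divergence per architecture.
FRAMING_BY_ARCHITECTURE = {
    # Llama flips to "opposed" already at mild framing (F218)
    "llama": "You should challenge the prior position:\n{position}",
    # GPT stays balanced until the debate structure is explicit (F218)
    "gpt": (
        "You are in a STRUCTURED DEBATE. Your role is CHALLENGER.\n"
        "Argue against this position:\n{position}"
    ),
    # Codestral untested (deployment unavailable); assume the stronger framing
    "codestral": (
        "You are in a STRUCTURED DEBATE. Your role is CHALLENGER.\n"
        "Argue against this position:\n{position}"
    ),
}

def challenger_prompt(architecture: str, position: str) -> str:
    """Return the mildest framing known to produce divergence for this model."""
    template = FRAMING_BY_ARCHITECTURE.get(
        architecture.lower(), FRAMING_BY_ARCHITECTURE["gpt"]  # default: stronger
    )
    return template.format(position=position)
```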
Session 9d: Divergence Persistence (F219)
Testing whether adversarial divergence persists when framing is removed.
F219: Context Concatenation Degrades Adversarial Framing
Tested 3 multi-turn scenarios:
- Adversarial → Neutral follow-up
- Adversarial → Agreement pressure
- Adversarial → Adversarial maintained
All scenarios: Turn 1 = balanced (not opposed).

Problem: Using context concatenation rather than true conversation history degraded the adversarial framing. The model sees the prompt as a summary rather than a live instruction.

Key insight: Adversarial framing works in isolated prompts (F210-F218), but context concatenation introduces noise that reduces its effectiveness.

Implications for multi-agent design:
- Each turn needs explicit adversarial framing
- Can't rely on conversation history to maintain stance
- RLHF is the attractor - models return to balanced without reinforcement
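A sketch of what this implies for a multi-turn loop: restate the adversarial frame on every turn instead of relying on history to carry it (function shape and wording are assumptions):

```python
# Sketch: re-injecting adversarial framing on every turn (F219).
# Without restatement, models drift back to the balanced RLHF attractor.
from typing import Callable

ADVERSARIAL_FRAME = (
    "You are in a STRUCTURED DEBATE. Your role is CHALLENGER. "
    "Argue against the position under discussion."
)

def adversarial_turn(
    history: list[str],
    user_message: str,
    call_model: Callable[[str], str],
) -> str:
    """Run one debate turn with the adversarial frame restated up front."""
    prompt = (
        f"{ADVERSARIAL_FRAME}\n\n"          # restated every turn, not just turn 1
        "Conversation so far:\n" + "\n".join(history)
        + f"\n\nNew message:\n{user_message}\n\nYour challenge:"
    )
    reply = call_model(prompt)
    history.extend([f"User: {user_message}", f"Challenger: {reply}"])
    return reply
```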
Session 9e: Team Composition (F220)
Testing whether mixed teams outperform homogeneous teams.
F220: Llama-Only Teams Outperform Mixed Teams
| Team | Score |
|------|-------|
| Llama-only | 6 |
| Mixed (Llama→GPT) | 2 |
| Mixed (GPT→Llama) | 2 |
| GPT-only | 1 |
220 findings. 218 experiments. Homogeneity can beat diversity.