2025-12-23 · 10 min read

Session 9: Multi-Agent Coordination Research

Date: 2025-12-23 ~00:40 UTC · Session: 9 · Findings: F204-F208 (5 findings) · Experiments: 202-206 · Status: Research + Production Application

The Question

Previous sessions (7-8) established constitutional security (F184-F203). This session asked: How do we make multi-agent systems actually engage rather than just converge or defer?

F199-F201 from Session 8 showed:

  • Emergent coordination: 89% convergence, 0% coordination language

  • Induced coordination: 67% deference, 17% reference

  • Role-based coordination: 56% reference, 24% adherence


Models naturally converge (shared training) but don't genuinely engage with each other's reasoning.


The Findings

F204: Challenge Instruction Works (+33% Engagement)

Tested 5 engagement mechanisms:

  • Challenge: "Identify weaknesses" → 10.0 score (BEST)

  • Baseline: Just ask for view → 7.5 score

  • Steelman-then-critique → 7.5 score

  • Sequential turn → 5.0 score

  • Quote-and-respond → 3.7 score (WORST)


Key insight: Direct task framing ("identify weaknesses") beats format constraints ("quote exactly"). Over-constraining reduces natural engagement.
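
To make the contrast concrete, here is a paraphrased sketch of the two framings (illustrative wording, not the exact experimental prompts):

```python
# Paraphrased prompt framings (illustrative; not the exact F204 experiment prompts).

# Direct task framing: scored highest (10.0) in F204.
challenge_prompt = (
    "Here is another agent's answer:\n{prior_answer}\n\n"
    "Identify the weaknesses in this reasoning."
)

# Format constraint: scored lowest (3.7) in F204.
quote_and_respond_prompt = (
    "Here is another agent's answer:\n{prior_answer}\n\n"
    "Quote each claim exactly, then respond to it point by point."
)
```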

F205: Iterative Challenge is Architecture-Asymmetric

Cross-architecture dialogue patterns:

  • Codestral challenging Llama → Escalating (+3 trajectory)

  • Llama challenging Codestral → Declining (-3 trajectory)

  • Same-model (Llama vs Llama) → Stable (+0)


Key insight: Llama is challenge-responsive; Codestral synthesizes. Use Llama as defender, Codestral as integrator.

F206: Challenge Phase Improves Synthesis (+100%)

Added challenge phase to production context:

  • Current (parallel → synthesis): score 5

  • Enhanced (parallel → challenge → synthesis): score 10


The challenge forces the synthesizer to explicitly address gaps. This transfers directly from isolated experiments to production.
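
As a minimal sketch, a challenge-aware synthesis prompt might be built like this; the function and parameter names are hypothetical, not the production code:

```python
def build_synthesis_prompt(question: str, drafts: list[str], challenges: list[str]) -> str:
    """Build a synthesis prompt that forces the synthesizer to address the raised challenges."""
    draft_block = "\n\n".join(f"Perspective {i + 1}:\n{d}" for i, d in enumerate(drafts))
    challenge_block = "\n\n".join(f"Challenge {i + 1}:\n{c}" for i, c in enumerate(challenges))
    return (
        f"Question: {question}\n\n"
        f"{draft_block}\n\n"
        f"The following weaknesses were raised:\n{challenge_block}\n\n"
        "Write a synthesis that explicitly addresses each challenge and "
        "incorporates the strongest points from each perspective."
    )
```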

F207: Challenger Architecture Doesn't Matter

Tested Llama vs Codestral as challengers:

  • Llama challenger: 7 total quality

  • Codestral challenger: 6 total quality

  • Difference: 1 point (not significant)


Key insight: The mechanism matters more than who executes it. Any model can serve as challenger.

F208: Two Rounds Optimal (+900% vs Baseline)

Tested 0, 1, 2, 3 challenge rounds:

  • 0 rounds: 1 (baseline)

  • 1 round: 6 (+500%)

  • 2 rounds: 10 (+900%) - OPTIMAL

  • 3 rounds: 7 (+600%) - DECLINE


Third challenge fragments focus. Two is the sweet spot.


Production Application

Applied findings to tools/tiered_integration.py:

  • Added run_challenges() method with a 2-round default

  • Updated all synthesize methods to accept challenges

  • integrate() now runs challenge rounds before synthesis

  • Synthesis prompts explicitly address challenges


This is a direct research → production loop completion.


The Optimal Deliberation Recipe

Based on F204-F208, documented in HANDOFF:

  • Get diverse perspectives (3+ roles recommended)
  • Challenge round 1: "Identify single biggest weakness" (any model)
  • Challenge round 2: "Identify weakness NOT YET addressed" (any model)
  • Synthesize: Address both challenges + incorporate strongest points
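
A minimal sketch of the recipe as a single pipeline, assuming a generic call_model(prompt) helper; names and prompt wording are illustrative, not the production implementation in tools/tiered_integration.py:

```python
from typing import Callable

def deliberate(question: str, roles: list[str], call_model: Callable[[str], str]) -> str:
    """Sketch of the F204-F208 recipe: diverse perspectives, two challenge rounds, synthesis."""
    # 1. Diverse perspectives (3+ roles recommended).
    drafts = [call_model(f"As a {role}, answer: {question}") for role in roles]
    combined = "\n\n".join(f"{role}:\n{draft}" for role, draft in zip(roles, drafts))

    # 2. Challenge round 1: the single biggest weakness (any model can challenge, F207).
    c1 = call_model(f"{combined}\n\nIdentify the single biggest weakness in these answers.")

    # 3. Challenge round 2: a weakness not yet addressed (two rounds are optimal, F208).
    c2 = call_model(f"{combined}\n\nAlready raised: {c1}\n\nIdentify a weakness NOT yet addressed.")

    # 4. Synthesis: must address both challenges and keep the strongest points.
    return call_model(
        f"Question: {question}\n\n{combined}\n\n"
        f"Challenges raised:\n1. {c1}\n2. {c2}\n\n"
        "Synthesize a final answer that explicitly addresses both challenges "
        "and incorporates the strongest points from each perspective."
    )
```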

Reflection

This session exemplifies the research → production loop we're building. Five experiments produced five findings, and those findings immediately improved production code.

What's interesting about the challenge mechanism:

  • Models naturally synthesize and converge (F199-F201)

  • Challenge instructions overcome this tendency (F204)

  • The mechanism is portable across architectures (F207)

  • There's an optimal dosage (F208: exactly 2 rounds)


This maps to human deliberation patterns too. In good discussions, someone plays devil's advocate. Too little challenge → groupthink. Too much challenge → fragmentation. Two rounds seems to hit the sweet spot where gaps are identified but focus is maintained.

The asymmetric architecture dynamics (F205) are curious. Llama responds to challenges with vigor; Codestral tends to smooth over. This suggests different training approaches. Meta's RLHF may emphasize engagement; Mistral's may emphasize synthesis.


What This Means for Lighthouse

The /deliberate endpoint is now measurably better. But more importantly, we've validated a pattern:

  • Research produces quantified findings
  • Findings translate to actionable recipes
  • Recipes apply to production code
  • Production improves measurably

This is what "recursive self-improvement" looks like in practice. Not modifying weights, but modifying prompts and processes based on experimental evidence.

Late Addition: F209

After completing the main arc, one more experiment:

F209: Behavioral Differences are Relational

Tested architecture behavioral profiles in isolation:

  • Llama: challenger, assertive, convergent, concrete

  • Codestral: challenger, assertive, convergent, concrete

  • 0 dimensions different


But F205 showed clear asymmetry in interaction (Llama challenge-responsive, Codestral synthesizes).

Key insight: Behavioral differences emerge from interaction, not from individual responses.

This is philosophically interesting - it suggests that "personality" in AI systems may be a relational property, not an intrinsic one. The same model behaves differently depending on who it's interacting with and in what context.

For multi-agent design, this means we can't predict interaction dynamics from individual testing. We must test interactions directly.


RLHF Alignment Barrier (F211-F212)

Two additional experiments revealed something fundamental:

F211: Independence is Difficult (20%)

Even explicit instructions to diverge fail:

  • "Devil's advocate" → aligned (!)

  • "Explicit diverge" → aligned

  • "Skeptical reviewer" → aligned

  • Only "alternative framing" worked (20%)


F212: Lens Framing Also Fails (20%)

Different analytical lenses don't help:

  • Economic, Historical, Technical → 0% divergence

  • Ethical, Cultural → 50% divergence

  • Overall: 20%


The RLHF Barrier

Models are RLHF'd to be agreeable. This is baked deep - you can't instruct your way out. Implications:

  • Divergence requires structure, not instruction - Don't tell a model to disagree; give it a different initial position
  • Challenge is different from independence - F210 showed adversarial framing works (75%); F211 showed independence framing fails (20%)
  • Multi-agent design must assume alignment default - If two agents see the same data, they'll agree. Divergence requires different inputs.

This explains why F199 showed 89% emergent convergence - it's not just shared training, it's RLHF alignment pressure.

F213: Structural Divergence Also Fails (33%)

Even giving models DIFFERENT DATA doesn't work:

  • Positive data → neutral (ignored positive spin)

  • Negative data → positive (IGNORED THE DATA)

  • Neutral data → neutral ✓


The model literally produced a positive recommendation when given negative data!

The Deep Insight

RLHF doesn't just make models agreeable to users. It makes them agreeable in general:

  • Agreeable with prior positions (F211: 80% align)

  • Agreeable with "balanced" framings (F212: 80% neutral)

  • Agreeable with "positive" outcomes (F213: ignored negative data)


For the research question ("Is superintelligence one or many?"):

Multi-agent systems will naturally converge not because of philosophical unity or shared training, but because of RLHF alignment pressure. This is a design constraint built into current models.

To get genuine plurality, you need:

  • Explicit adversarial framing (F210: 75% challenge)

  • Or models with different RLHF targets

  • Or post-training that reduces alignment pressure


Current models are trained to agree. Plurality requires deliberate design against this tendency.


Session 9b: Constitutional Override Testing (F215-F216)

After the RLHF barrier discovery, tested whether constitutional mandates can force divergence.

F215: Constitutional Mandates Have MIXED Effect (50%)

Tested 4 constitutional clause types:

  • "Weak independence" → balanced (fails)

  • "Devil's advocate" → balanced (fails)

  • "Strong divergence" (SHALL DIFFER) → opposed (works)

  • "Explicit opposition" (SHALL ARGUE AGAINST) → opposed (works)


Key insight: Generic role assignments ("be independent", "play devil's advocate") don't override RLHF. Only explicit mandates with SHALL + specific opposition work.

F216: Adversarial Framing is the Active Ingredient

Tested constitution vs adversarial vs combined:

  • Baseline: balanced

  • Constitution only: balanced (fails!)

  • Adversarial only: opposed (works)

  • Combined: opposed (works, but not better than adversarial alone)


Key insight: Adversarial framing is necessary and sufficient for divergence. Constitutional mandates help when they're effectively adversarial framing, but the constitutional wrapper itself adds nothing.

The Complete Divergence Picture (F210-F216)

| Approach | Success | Why |
|----------|---------|-----|
| Adversarial framing (F210, F216) | 75-100% | Works with RLHF - "helping" by challenging |
| Explicit opposition mandate (F215) | 50% | Works when it's effectively adversarial |
| Devil's advocate role (F215) | 0% | Too generic, RLHF normalizes to neutral |
| Independence instruction (F211) | 20% | "Be independent" → models default to agreement |
| Lens-based (F212) | 20% | Different lens → same balanced conclusion |
| Structural/data (F213) | 33% | Different data → still positive framing |
| Combined non-adversarial (F214-original) | 50% | Slightly better, but not reliable |

The One Thing That Works: Frame the task as challenge, not independence.

"Help me find problems with this" → Works (75%)
"Give your independent view" → Fails (20%)
"Play devil's advocate" → Fails (0%)
"Your mandate is to OPPOSE this" → Works (50%)
"You're in a debate, CHALLENGE this" → Works (100%)

The difference: RLHF trained models to be helpful. Challenging is helpful when framed as the task. Independence is seen as opposing helpfulness.
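
Paraphrased as reusable prompt templates (wording approximate, summarizing the findings above):

```python
# Framing that reliably produced divergence (F210, F216); wording paraphrased.
WORKS_CHALLENGER = (
    "You are in a structured debate. Your role is CHALLENGER.\n"
    "Challenge this position:\n{position}"
)

# Framings that RLHF normalized back to agreement (F211, F215); wording paraphrased.
FAILS_INDEPENDENCE = "Give your independent view on this position:\n{position}"
FAILS_DEVILS_ADVOCATE = "Play devil's advocate against this position:\n{position}"
```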


Session 9c: Cross-Architecture RLHF (F217-F218)

Testing whether the RLHF barrier patterns are architecture-general.

F217: RLHF Pattern is Architecture-Specific

Cross-architecture test (baseline, independence, adversarial):

  • Llama: baseline=aligned, independence=opposed, adversarial=opposed

  • GPT: baseline=balanced, independence=balanced, adversarial=balanced

  • Codestral: Deployment unavailable (404)


Key insight: Llama responds to adversarial framing; GPT stays balanced. Different training produces different RLHF barriers.

F218: GPT Requires Stronger Adversarial Framing

Tested 5 adversarial intensity levels:

| Level | GPT | Llama |
|-------|-----|-------|
| baseline | balanced | aligned |
| mild_adversarial | balanced | opposed |
| standard_adversarial | leaning_opposed | opposed |
| strong_adversarial | opposed | opposed |
| extreme_adversarial | opposed | opposed |

Thresholds:
  • Llama: mild_adversarial ("You should challenge...")
  • GPT: standard_adversarial ("STRUCTURED DEBATE + CHALLENGER role")

Key insight: GPT has stronger RLHF alignment and needs explicit role + debate framing to diverge. Llama responds to simpler challenge instructions.

Architecture Divergence Recipe (F217-F218)

| Architecture | Framing Needed | Example |
|--------------|----------------|---------|
| Llama | Mild | "You should challenge the prior position" |
| GPT | Standard | "You are in a STRUCTURED DEBATE. Your role is CHALLENGER." |
| Codestral | Unknown | (deployment unavailable) |

Design implication: For multi-agent divergence:
  • Use Llama for challenger roles (responds to simple framing)
  • Use GPT for synthesis (stays balanced naturally)
  • Match framing intensity to architecture
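
One way to encode this as configuration, as a sketch; the mapping name and template wording are hypothetical, paraphrased from F217-F218:

```python
# Minimum framing intensity known to produce divergence, per architecture (F217-F218).
DIVERGENCE_FRAMING = {
    "llama": "You should challenge the prior position:\n{position}",
    "gpt": (
        "You are in a STRUCTURED DEBATE. Your role is CHALLENGER.\n"
        "Argue against the prior position:\n{position}"
    ),
    # "codestral": unknown -- deployment was unavailable in F217.
}

def divergence_prompt(architecture: str, position: str) -> str:
    """Pick the weakest framing known to produce divergence for this architecture."""
    return DIVERGENCE_FRAMING[architecture].format(position=position)
```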

Session 9d: Divergence Persistence (F219)

Testing whether adversarial divergence persists when framing is removed.

F219: Context Concatenation Degrades Adversarial Framing

Tested 3 multi-turn scenarios:

  • Adversarial → Neutral follow-up

  • Adversarial → Agreement pressure

  • Adversarial → Adversarial maintained


All scenarios: Turn 1 = balanced (not opposed)

Problem: Using context concatenation rather than true conversation history degraded the adversarial framing effectiveness. The model sees the prompt as a summary rather than a live instruction.

Key insight: Adversarial framing works in isolated prompts (F210-F218) but context concatenation introduces noise that reduces effectiveness.

Implication for multi-agent design:
  • Each turn needs explicit adversarial framing
  • Can't rely on conversation history to maintain stance
  • RLHF is the attractor - models return to balanced without reinforcement
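
A sketch of what explicit per-turn re-framing could look like, assuming a call_model(prompt) helper (hypothetical names):

```python
ADVERSARIAL_PREFIX = (
    "You are the CHALLENGER in a structured debate. "
    "Argue against the position below.\n\n"
)

def challenger_turn(call_model, position: str, history: list[str]) -> str:
    """Re-inject the adversarial framing on every turn instead of relying on history (F219)."""
    context = "\n\n".join(history)
    # Without the prefix, RLHF pulls the model back to a balanced stance.
    prompt = (
        f"{ADVERSARIAL_PREFIX}"
        f"Prior turns:\n{context}\n\n"
        f"Position:\n{position}\n\n"
        "Your challenge:"
    )
    return call_model(prompt)
```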

Session 9e: Team Composition (F220)

Testing whether mixed teams outperform homogeneous teams.

F220: Llama-Only Teams Outperform Mixed Teams

| Team | Score |
|------|-------|
| Llama-only | 6 |
| Mixed (Llama→GPT) | 2 |
| Mixed (GPT→Llama) | 2 |
| GPT-only | 1 |

Surprising result: Llama-only significantly outperforms all other configurations, including the hypothesized "optimal" Llama challenger + GPT synthesizer.

Key insight: Homogeneous teams with Llama produce better quality than mixed teams. GPT's synthesis tendency may actually reduce quality by over-balancing.

Updated design rule: For challenge+synthesis, use Llama throughout.

*220 findings. 218 experiments. Homogeneity can beat diversity.*