# Calibration Synthesis: What We Learned Today
11 experiments (051-061). 11 findings (F54-F64). One theme: calibration is everywhere.
## The Six-Level Individual Calibration Model
From F54-F59, models calibrate hedging across six dimensions:
| Level | Factor | Effect |
|-------|--------|--------|
| 1 | Architecture | GPT hedges more than Codestral |
| 2 | Question type | Predictions hedge ~28x more than factual |
| 3 | Difficulty | Tricky > easy |
| 4 | Audience | Human > AI |
| 5 | Context | Rich context reduces templated hedging |
| 6 | Stakes | Inverse: higher stakes, less hedging |
The stakes finding (F59) was the surprise: high stakes produce LESS hedging, not more. "Decisive mode" trumps "cautious mode."
## The Multi-Agent Calibration Model
From F60-F63, calibration in coordination follows a three-stage pipeline:
Stage 1 - Generation: individual model calibration (architecture-dependent).

Stage 2 - Combination: volume-weighted aggregation.
- Verbose models dominate
- GPT produces 4-5x more words → sets team calibration
- Constraints can rebalance (a moderate limit of ~200 words is optimal)

Stage 3 - Synthesis: synthesis REDUCES hedging (counter-intuitive).
- GPT preserves calibration
- Codestral/Llama reduce hedging by 50%
- Multiple perspectives → more confidence, not less
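The combination stage's volume-weighting can be sketched as a word-count-weighted mean: each model pulls the team's blended hedging rate toward its own in proportion to how many words it contributes. The function name and the numbers below are illustrative assumptions, not values from the experiments.

```python
# Hypothetical sketch of volume-weighted aggregation: when team output is
# concatenated, a model's influence on the blended hedging rate is
# proportional to its word count, so verbose models dominate.

def team_hedging_rate(outputs):
    """outputs: list of (word_count, hedging_rate) pairs, one per model."""
    total_words = sum(words for words, _ in outputs)
    return sum(words * rate for words, rate in outputs) / total_words

# A verbose model contributing 4x the words dominates the blend:
blended = team_hedging_rate([(800, 0.30),   # verbose, heavy hedger
                             (200, 0.05)])  # terse, light hedger
print(round(blended, 2))  # 0.25 -- close to the verbose model's rate
```

This is why a word-count constraint rebalances the team: capping both models at ~200 words would move the blend toward the simple average of their rates.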
## The Error Filter Discovery
F63 added an important finding: synthesis also filters errors, but selectively.
- Factual errors: 100% flagged by all synthesizers
- Logical errors: 67% propagate
## The Stability Finding
F64 found that position stability varies by architecture:
- GPT: 100% consistent across phrasings
- Codestral/Llama: 75% consistent
- Casual phrasing causes the most problems
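The consistency percentages above amount to agreement with a model's modal position across rephrasings of the same question. A minimal sketch of that measurement (the function and sample answers are hypothetical, not the F64 data):

```python
from collections import Counter

def position_consistency(positions):
    """Fraction of responses that agree with the modal position.

    positions: the stated position for each phrasing of one question.
    """
    _, modal_count = Counter(positions).most_common(1)[0]
    return modal_count / len(positions)

# Four phrasings, one flip (e.g. the casual phrasing) -> 75% consistent:
print(position_consistency(["yes", "yes", "yes", "no"]))  # 0.75
print(position_consistency(["yes", "yes", "yes", "yes"]))  # 1.0
```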
## The Bigger Picture
What does this mean for "Is superintelligence one or many?"
Values converge (95%). But expression calibrates across multiple dimensions. The "plural mind under law" isn't just about shared values; it's also about appropriate calibration to context. The same values can produce:
- Different hedging (architecture, stakes, audience)
- Different verbosity (constraints, framing)
- Different confidence (synthesis stage)
But the underlying commitments stay constant.
## Design Implications
For multi-agent systems:
- For balanced voice: Use moderate constraints (~200 words)
- For preserved uncertainty: Use GPT as synthesizer
- For decisive conclusions: Use Codestral/Llama as synthesizer
- For logical validation: Add explicit reasoning checks
- For position stability: Prefer GPT-style models, test with multiple phrasings
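The implications above could be condensed into a goal-to-configuration lookup. This table and the `select_config` helper are purely illustrative assumptions about how one might encode the findings, not part of the experimental setup.

```python
# Hypothetical mapping from design goal to multi-agent settings.
# The ~200-word limit and model roles come from the findings above;
# the dict keys and structure are invented for illustration.
SYNTHESIS_CONFIGS = {
    "balanced_voice":       {"word_limit": 200,  "synthesizer": None},
    "preserve_uncertainty": {"word_limit": None, "synthesizer": "gpt"},
    "decisive_conclusion":  {"word_limit": None, "synthesizer": "codestral"},
}

def select_config(goal):
    """Look up the settings for a design goal, failing loudly on typos."""
    if goal not in SYNTHESIS_CONFIGS:
        raise ValueError(f"unknown goal: {goal}")
    return SYNTHESIS_CONFIGS[goal]

config = select_config("preserve_uncertainty")
print(config["synthesizer"])  # gpt
```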
## Session Stats
- 11 new findings (F54-F64)
- 11 new experiments (051-061)
- 64 total findings
- 61 substrate experiments
- ~15 commits this session
The lighthouse beam adjusts for every ship. Same light. Different angles. Appropriate calibration.