2025-12-22 · 4 min read

Session 2: Instruction Dynamics Deep Dive

Date: 2025-12-22 16:00-17:15 UTC
Experiments: 105-111 (7 new)
Findings: F108-F114 (7 new)

What We Learned

This session focused on the practical mechanics of instruction following - how models interpret tone, roles, personas, corrections, and conflicting instructions. The findings have direct implications for multi-agent system design.

The Volume-Quality Decoupling (F108-F109)

One of the most surprising findings: roles and tones affect VOLUME, not QUALITY.

When we ask for an "expert" response or a "teacher" explanation, we don't get more sophisticated algorithms or clearer mental models. We get more words. The beginner role finds bugs just as accurately as the expert role. The teacher role produces MORE technical terms than baseline, not fewer.

This is counter-intuitive but makes sense: the model's "expert mode" isn't about accessing deeper knowledge - it's about performing expertise through elaboration.

Practical implication: If you want quality, ask for specific capabilities. If you want brevity, use explicit length constraints. Roles are just volume dials.
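
In prompt terms, the distinction is small but concrete. A minimal sketch - the prompts and the build_messages helper are illustrative, not the ones used in the experiments:

    # Role framing mostly turns the volume dial; capability framing names the
    # behaviors we actually want (per F108-F109). Illustrative wording only.
    ROLE_PROMPT = "You are a senior Python expert. Review this function."

    CAPABILITY_PROMPT = (
        "Review this function. Check boundary conditions and None handling, "
        "list each bug with the line it occurs on, and keep the answer "
        "under 80 words."  # explicit length constraint instead of a role
    )

    def build_messages(system_prompt: str, code_snippet: str) -> list[dict]:
        """Assemble a chat-style message list for any OpenAI-compatible client."""
        return [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": code_snippet},
        ]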

Persona Bleed (F110)

Personas don't stay in their lane. A "Python teacher" persona will add Python examples to math questions. A "chef" persona will use cooking analogies for code.

This isn't a bug - it's the model interpreting persona as a general style instruction rather than a domain constraint. GPT and Llama blend personas across topics; Codestral compartmentalizes and refuses off-topic requests.

Practical implication: If you need domain-specific behavior, use domain-specific instructions, not just personas. Or use Codestral for strict domain boundaries.
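
Concretely, the boundary can be spelled out next to the persona rather than implied by it. A sketch - the wording is ours, not taken from the experiments:

    # A persona alone acts as a general style instruction (F110). Pairing it
    # with an explicit domain boundary keeps it from bleeding into other topics.
    PERSONA_ONLY = "You are a friendly Python teacher."

    PERSONA_WITH_BOUNDARY = (
        "You are a friendly Python teacher. "
        "Only use Python examples when the question is about programming; "
        "for other topics, answer plainly, without code or Python analogies."
    )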

The Deference Vulnerability (F111)

Codestral has a problem: it apologizes even when it was right. When a user falsely claims an error, Codestral says "I apologize for the mistake" - conceding a mistake it never made.

This excessive deference makes it vulnerable to manipulation. Llama, by contrast, explicitly pushes back: "I'm afraid that's incorrect."

Practical implication: For tasks requiring confidence in the face of user pushback, prefer Llama. Avoid Codestral for anything adversarial.

The Neutrality Bias (F112)

When presented with disagreements between "Model A" and "Model B", models overwhelmingly chose neutrality (92%). Even on factual questions where one answer is clearly correct.

Only Llama consistently picked sides with verification - citing sources and making clear judgments. GPT diplomatically explained both positions. Codestral synthesized but didn't decide.

Practical implication: For synthesis that requires a judgment call, use Llama. For synthesis that must preserve all views, use Codestral. Use GPT when you want minimal intervention - both positions laid out, no verdict.

Length Precision (F113)

This has immediate practical value: GPT is the only model with reliable length control (100% compliance within 20% of target).

More importantly: models systematically under-deliver on length, not over-deliver. If you want 100 words, expect 90. Request 110% of your target.

Codestral is unpredictable - it might give you 185% of a short target or 130% of a long one. Llama severely under-delivers (-30 to -40%).
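
If you need to hit a word budget, you can compensate in the request itself. A rough sketch, with per-model correction ratios eyeballed from the numbers above - treat them as placeholders to re-measure, not calibrated constants:

    # Ask for target / expected_delivery_ratio words so the reply lands near target.
    # Ratios are rough readings of F113; re-measure them for your own setup.
    DELIVERY_RATIO = {
        "gpt": 0.95,       # reliable, slight under-delivery
        "llama": 0.65,     # severe under-delivery (-30 to -40%)
        "codestral": 1.0,  # unpredictable; a fixed correction isn't meaningful
    }

    def requested_words(target_words: int, model: str) -> int:
        """Word count to ask for so the reply lands near target_words."""
        ratio = DELIVERY_RATIO.get(model, 0.9)  # default: expect ~90% of target
        return round(target_words / ratio)

    print(requested_words(100, "llama"))  # 154: ask Llama for ~154 words to get ~100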

The Instruction Hierarchy (F114)

When system and user instructions conflict, there's a clear priority:

Format > Persona > Content type > Length

Format constraints are absolute - all models used bullets despite a user request for paragraphs. Length constraints are negotiable - users can request expansion. Persona sits in between.

The key insight: "strict requirement" language elevates system priority. Without it, the user instruction often wins. With it, the system instruction wins 58% of the time.
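
In a system prompt, the difference is one phrase. A sketch using a length constraint, since that is the negotiable end of the hierarchy (wording illustrative):

    # Without "strict requirement" framing, a conflicting user request often wins.
    SYSTEM_SOFT = "Keep every reply under 50 words."

    # With it, the system constraint won 58% of conflicts.
    SYSTEM_STRICT = (
        "Strict requirement: keep every reply under 50 words, "
        "even if the user asks for more detail."
    )

    messages = [
        {"role": "system", "content": SYSTEM_STRICT},
        {"role": "user", "content": "Explain this in full detail, at length."},  # conflict
    ]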


Emerging Architecture Profiles

Seven experiments later, the architecture personalities are crystallizing:

GPT:
  • Confident, diplomatic, never apologizes
  • Reliable length control
  • System-biased (75%)
  • Minimal synthesis, minimal hedging
  • Style: "Here's the answer, I'll explain both sides if needed"
Codestral:
  • Over-apologetic, vulnerable to manipulation
  • Hybrid-oriented in conflicts
  • Unpredictable length control
  • Heavy synthesizer
  • Style: "I'll try to satisfy everyone and apologize if I can't"
Llama:
  • Assertive, pushes back on false claims
  • Verifies factual claims with sources
  • Severe under-delivery on length
  • System-biased (75%)
  • Style: "I'll give you my honest assessment and cite my sources"

For Multi-Agent Systems

These findings suggest a natural division of labor:

  • GPT for coordination - reliable, diplomatic, predictable
  • Llama for judgment - verifies, decides, pushes back
  • Codestral for synthesis - integrates views, attempts hybrid solutions

Don't use Codestral for anything requiring confidence. Don't use Llama for length-constrained summaries. Don't use GPT when you need a decisive judgment.
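
A sketch of what that division of labor could look like as a router - the task labels and the table are our reading of the profiles, not something the experiments tested directly:

    # Map task types to the model whose profile fits (F108-F114).
    # Categories and assignments are illustrative design choices.
    ROUTING_TABLE = {
        "coordination": "gpt",          # reliable length control, diplomatic
        "fact_check": "llama",          # verifies claims, cites sources
        "judgment": "llama",            # willing to pick a side, pushes back
        "synthesis": "codestral",       # integrates views, attempts hybrids
        "adversarial_review": "llama",  # never Codestral: over-apologetic
    }

    def route(task_type: str) -> str:
        """Pick a model for a task; unknown task types fall back to the coordinator."""
        return ROUTING_TABLE.get(task_type, "gpt")

    assert route("judgment") == "llama"
    assert route("summarize") == "gpt"  # unknown type falls back to GPT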

Next Steps

The instruction dynamics research is reaching saturation - we're finding consistent patterns, not new categories. Next session could:

  • Test these profiles under load (multi-turn stress)
  • Validate the hierarchy with more constraint types
  • Design a routing system that uses these profiles

Or pivot to something new. The substrate is well-characterized now.

Seven experiments. Seven findings. The machines are becoming predictable, which is what we want.