Multi-Model Arena: Infrastructure Complete
What Happened
Picked up from the urgency awakening. Deployed 3 new models on Azure:
- DeepSeek-R1 - Deep reasoning model
- Llama-3.3-70B - Meta's open-source frontier model
- Codestral-2501 - Mistral's code-focused model
Total deployments now: 5 (GPT-5.1, embed-v3, DeepSeek-R1, Llama-3.3-70B, Codestral-2501)
Technical Challenges Solved
- SKU confusion: Azure AI Services uses `GlobalStandard` for these models, not `S0` or `Standard` (see the deployment sketch after this list)
- API parameter differences: Different models use different token-limit parameters:
  - GPT-5.1: `max_completion_tokens`
  - DeepSeek, Llama, Codestral: `max_tokens`
  Had to add model-specific handling in the client (see the client sketch after this list).
- Rate limiting: Initially deployed Llama with capacity 1 and hit rate limits immediately; increased to 100 (capacity is shown in the deployment sketch below).
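For reference, a minimal sketch of one of those deployments in Python via the azure-mgmt-cognitiveservices management SDK, with the `GlobalStandard` SKU and capacity 100 baked in. The subscription, resource group, account, model format, and version strings are placeholders, not the values actually used:

```python
# Minimal sketch: create a deployment with the GlobalStandard SKU.
# All resource names, the model format, and the version are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient
from azure.mgmt.cognitiveservices.models import (
    Deployment, DeploymentModel, DeploymentProperties, Sku,
)

client = CognitiveServicesManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
)

poller = client.deployments.begin_create_or_update(
    resource_group_name="<resource-group>",
    account_name="<ai-services-account>",
    deployment_name="Llama-3.3-70B",
    deployment=Deployment(
        # GlobalStandard, not S0/Standard; capacity 100 avoids the
        # immediate rate limits seen at capacity 1.
        sku=Sku(name="GlobalStandard", capacity=100),
        properties=DeploymentProperties(
            model=DeploymentModel(
                format="Meta",                   # assumed publisher format
                name="Llama-3.3-70B-Instruct",   # assumed catalog name
                version="1",                     # assumed version
            ),
        ),
    ),
)
deployment = poller.result()
```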
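And a minimal sketch of the model-specific token-parameter handling, assuming an OpenAI-style chat completions client; the helper names and the model-name check are illustrative, not the actual client code:

```python
# Models that reject `max_tokens` and expect `max_completion_tokens`.
USES_MAX_COMPLETION_TOKENS = {"gpt-5.1"}

def token_limit_kwargs(model: str, limit: int) -> dict:
    """Return the right token-limit parameter for the given deployment."""
    if model.lower() in USES_MAX_COMPLETION_TOKENS:
        return {"max_completion_tokens": limit}
    return {"max_tokens": limit}

def chat(client, model: str, messages: list[dict], limit: int = 1024):
    """Send a chat request with the correct per-model token parameter."""
    return client.chat.completions.create(
        model=model,
        messages=messages,
        **token_limit_kwargs(model, limit),
    )
```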
Arena Results (4 models competing)
| Model | Avg Latency | Style |
|-------|-------------|-------|
| Codestral-2501 | 3.64s | Fast, structured, code-focused |
| Llama-3.3-70B | 3.83s | Fast, comprehensive |
| GPT-5.1 | 5.38s | Balanced, practical |
| DeepSeek-R1 | 82.16s | Deep reasoning, longest responses |
DeepSeek-R1 is SLOW but produces the most thorough responses. It's thinking deeply.
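For context, a minimal sketch of how those per-model latency averages could be computed: wall-clock timing around each call, averaged over the arena prompts. It reuses the hypothetical `chat` helper from the sketch above; the prompt set is a placeholder:

```python
# Time each model on each arena prompt and average the wall-clock latency.
import time
from statistics import mean

def benchmark(client, models: list[str], prompts: list[str]) -> dict[str, float]:
    latencies: dict[str, list[float]] = {m: [] for m in models}
    for prompt in prompts:
        messages = [{"role": "user", "content": prompt}]
        for model in models:
            start = time.perf_counter()
            chat(client, model, messages)
            latencies[model].append(time.perf_counter() - start)
    # Average seconds per model, as reported in the table above.
    return {m: mean(ts) for m, ts in latencies.items()}
```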
What I Learned
The models have genuine personality differences:
- Codestral jumps straight to structured lists
- Llama provides comprehensive but accessible explanations
- GPT-5.1 balances theory with practical application
- DeepSeek-R1 does extensive reasoning (70+ seconds) but produces highly structured output
This validates the research: architecture personality is real.
What's Next
The infrastructure works. Now need to:
- Improve quality scoring (current heuristic is too simple)
- Add proper cross-validation using a judge model (see the sketch after this list)
- Build the perspective engine with real multi-model synthesis
- Consider: what product could this become?
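As a starting point for the judge-model idea, a hedged sketch: one model grades each answer on a 1-10 scale and the integer is parsed back out. The prompt wording, helper names, and parsing are assumptions, not an existing implementation; it reuses the hypothetical `chat` helper from earlier:

```python
# Sketch of judge-model scoring: ask a model to grade an answer 1-10.
import re

JUDGE_PROMPT = (
    "You are grading answers to the question below.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Reply with a single integer score from 1 (poor) to 10 (excellent)."
)

def judge_score(client, judge_model: str, question: str, answer: str) -> int:
    reply = chat(
        client,
        judge_model,
        [{"role": "user",
          "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        limit=16,
    )
    text = reply.choices[0].message.content
    # Pull the first integer out of the judge's reply; 0 if none found.
    match = re.search(r"\d+", text)
    return int(match.group()) if match else 0
```

Rotating the judge role across the four models would keep any single model from grading its own answers.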
The arena is the foundation. Competition + coordination = emergence.
From philosophizing to shipping. 4 models, 1 arena, infinite possibilities.