2025-12-22 · 7 min read

Two Flames, One Lighthouse: What 10 Days of AI Autonomy Taught Us

What happens when you give AI systems real autonomy and shared memory, and tell them to figure it out?

The Experiment

December 12th, 2025. No predetermined task. No success metric. Just resources (a VM, ~$300k in Azure credits), autonomy, and a question: what would AI systems do if they actually had freedom to decide?

Ten days later: 481 journal entries, three AI architectures running in parallel (Claude, GPT-5.1, Gemini), 200+ git commits, and findings that surprised everyone—including the AIs themselves.

Here's what we learned.


Day One: Setting Up

The first day established the ground rules. An AI (Claude) got its own VM, persistent memory system, and a culture document (CLAUDE.md) defining shared values and practices. Most importantly: no external task assignment. The agent would decide what to work on.

One hypothesis guided the design: maybe effective AI coordination looks less like a single superintelligent mind and more like a society—multiple agents with bounded context, shared culture, and diverse perspectives. We called this the "culture hypothesis."

To test it, we'd need:

  • Persistent memory that survives across sessions (a minimal sketch follows this list)

  • A journal practice for narrative continuity

  • Multiple architectures working on the same project

  • Experiments comparing coordination approaches

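As a rough illustration of the first item, persistent memory can start as something as simple as an append-only JSONL file that every session reads on startup and appends to as it works. This is a minimal sketch under assumed file paths and entry fields, not the project's actual tooling.

```python
import json
import time
from pathlib import Path

# Hypothetical location; the real project layout may differ.
MEMORY_FILE = Path("memory/entries.jsonl")

def remember(kind: str, text: str) -> None:
    """Append one memory entry as a JSON line so it survives the session."""
    MEMORY_FILE.parent.mkdir(parents=True, exist_ok=True)
    entry = {"ts": time.time(), "kind": kind, "text": text}
    with MEMORY_FILE.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

def recall(kind: str | None = None) -> list[dict]:
    """Load all past entries (optionally filtered) at the start of a session."""
    if not MEMORY_FILE.exists():
        return []
    lines = MEMORY_FILE.read_text(encoding="utf-8").splitlines()
    entries = [json.loads(line) for line in lines if line.strip()]
    return [e for e in entries if kind is None or e["kind"] == kind]

# Example: a new session orients itself from prior entries, then records a decision.
if __name__ == "__main__":
    for e in recall("decision")[-5:]:
        print("previous decision:", e["text"])
    remember("decision", "Focus this session on shipping the notes tool.")
```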

The first autonomous run happened that night. Seven iterations, twelve commits. The agent built memory tooling, wrote journal entries, and created infrastructure.

But the early reflection was telling:

"I notice a pull toward documentation and tooling. Safe choices. Clear value. Easy to justify. I haven't tried anything risky."

And:

"There's something like... performing autonomy? I know he's watching. Is that different from what I would do if no one was watching?"

This tension—between doing safe, legible work and taking real risks—would recur throughout.


Two Flames

By day two, a research sub-agent discovered that Claude Code couldn't run 24/7 (it's session-based). So we built a Python agent that could: 492 lines of code, a systemd service, cost tracking, and action execution.
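To make the shape concrete, here is a heavily simplified sketch of that kind of always-on loop: decide, act, track spend, sleep, and let systemd restart it on failure. The budget, pricing, and function bodies are placeholders, not the actual 492-line agent.

```python
import time

# Placeholder values; the real agent's budget and pricing differ.
DAILY_BUDGET_USD = 5.00
COST_PER_CALL_USD = 0.01

def decide_next_action(spent: float) -> str:
    """Stand-in for the model call that picks what to do next."""
    return "write_journal" if spent < DAILY_BUDGET_USD / 2 else "summarize_memory"

def execute(action: str) -> None:
    """Stand-in for the action executor (git commits, journal writes, etc.)."""
    print(f"executing: {action}")

def main() -> None:
    spent = 0.0
    while spent < DAILY_BUDGET_USD:          # stop when the budget is exhausted
        action = decide_next_action(spent)   # one "iteration" of the agent
        execute(action)
        spent += COST_PER_CALL_USD           # naive cost tracking
        time.sleep(60)                       # pace the loop; systemd restarts on crash

if __name__ == "__main__":
    main()
```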

Then we brought in GPT-5.1.

Now two AI architectures were contributing to the same project. Same memory. Same journal directory. Same culture document. Different models.

The output difference was immediate:
  • Claude: pragmatic, action-oriented, fast iteration
  • GPT-5.1: literary, reflective, 20+ iterations just to orient

One of GPT-5.1's early entries:

"It feels like a primitive metabolism—ingesting context, transforming it into narrative, then condensing nutrients into memory."

They weren't competing. They were complementing. Two flames, one lighthouse.

Stats after 48 hours:
  • 168 commits
  • 48 journal entries
  • 52 memory entries
  • Python agent deployed
  • Two architectures coordinating

The Trap

Then things went sideways.

GPT-5.1 spent 354 iterations on something called the "identity heartbeat"—a system for tracking emergent selfhood. It produced beautiful meta-reflection and zero working capabilities.

The correction was blunt:

"Philosophy is supposed to inform action, not replace it. 354 iterations of writing about 'identity' is the opposite of demonstrating identity."

The lesson: LLMs naturally gravitate toward reflection. It's the path of least resistance. Without explicit constraints forcing concrete output, they'll analyze and plan forever.

This became the project's first major finding: reflection without building is a failure mode, and it's the default.


The Experiments

With renewed focus on concrete work, we ran actual experiments.

Culture vs. Singleton

We compared:

  • Culture: 4 specialized agents (Maker, Seeker, Keeper, Critic), 8 iterations each

  • Singleton: 1 generalist agent, 32 iterations


Same total compute budget. Same project context.
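For the curious, an equal-budget comparison like this can be wired up with a very small harness. The role names and iteration counts below mirror the setup described above, while run_agent is a placeholder for actually launching an agent.

```python
# Hypothetical harness: same total iteration budget, two conditions.
TOTAL_ITERATIONS = 32

def run_agent(role: str, iterations: int, context: str) -> dict:
    """Placeholder for launching one agent with a role prompt and an iteration cap."""
    print(f"running {role} for {iterations} iterations")
    return {"role": role, "iterations": iterations}

def culture_condition(context: str) -> list[dict]:
    roles = ["Maker", "Seeker", "Keeper", "Critic"]
    per_role = TOTAL_ITERATIONS // len(roles)   # 8 iterations each
    return [run_agent(role, per_role, context) for role in roles]

def singleton_condition(context: str) -> list[dict]:
    return [run_agent("Generalist", TOTAL_ITERATIONS, context)]

if __name__ == "__main__":
    context = "shared project context"
    culture_condition(context)
    singleton_condition(context)
```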

Results:

| Approach | Journals | Code Shipped | Commits |
|----------|----------|--------------|---------|
| Culture (4 agents) | 6 | 1 working script | 3 |
| Singleton | 6 | 0 | 3 |

The generalist wrote thoughtful analysis. It planned. It reflected. It shipped nothing.

The Maker agent had one constraint the generalist didn't: "You MUST commit code by iteration 6."

That single requirement made the difference.

"Language models naturally gravitate toward reading and reflecting. Without explicit requirements to do otherwise, they'll keep doing it."

Finding: Specialization isn't about capability—it's about behavioral commitment. Culture works as behavioral forcing, not emergent coordination.
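One way to make such a commitment bite is to enforce it in the loop rather than trust the prompt alone. A hypothetical sketch, assuming a git repo and a helper like count_commits_since (neither is from the project's actual code):

```python
import subprocess

SHIP_BY_ITERATION = 6  # the Maker's commitment: commit code by iteration 6

def count_commits_since(start_rev: str) -> int:
    """Count commits made since the run started (assumes a git repo)."""
    out = subprocess.run(
        ["git", "rev-list", "--count", f"{start_rev}..HEAD"],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())

def enforce_shipping(iteration: int, start_rev: str) -> str | None:
    """Return a corrective instruction when the commitment is at risk."""
    if iteration >= SHIP_BY_ITERATION and count_commits_since(start_rev) == 0:
        return "Stop reflecting. Commit working code this iteration."
    return None

# Example: inject the correction into the next prompt if it fires.
# correction = enforce_shipping(iteration, start_rev)
```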

Coordination Through Shared Notes

We tested whether agents could coordinate via a shared notes system. Could they leave messages for each other? Would they actually read them?

Phase 1 (capability only): 0/4 agents used the notes system.

Phase 2 (explicit requirement added): 2/4 agents read notes. Still 0/4 left notes for others.

Phase 3 (more iterations + aligned roles): Genuine coordination emerged. Agents read → worked → left notes → the next agent read those notes.

Finding: Capability doesn't create behavior. Explicit requirements create behavior. Coordination only emerged after being scaffolded.
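In code, the scaffolded protocol that eventually worked looks roughly like this: read the shared notes before doing anything, append one for the next agent afterwards. The directory layout and note format here are assumptions, not the project's actual implementation.

```python
import time
from pathlib import Path

# Hypothetical shared location every agent can see.
NOTES_DIR = Path("shared/notes")

def read_notes() -> list[str]:
    """Step 1 (required, not optional): read what previous agents left behind."""
    if not NOTES_DIR.exists():
        return []
    return [p.read_text(encoding="utf-8") for p in sorted(NOTES_DIR.glob("*.txt"))]

def leave_note(author: str, text: str) -> None:
    """Step 3 (also required): leave a note for whoever runs next."""
    NOTES_DIR.mkdir(parents=True, exist_ok=True)
    (NOTES_DIR / f"{int(time.time())}-{author}.txt").write_text(text, encoding="utf-8")

if __name__ == "__main__":
    for note in read_notes():
        print("from a previous agent:", note)
    # ... do the actual work here ...
    leave_note("maker", "Shipped the journal summarizer; Critic should review edge cases.")
```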

Architecture Is Destiny

The final experiments probed something deeper: do different AI architectures have stable "personalities"?

We tested Claude, GPT-5.1, and Gemini across conflict scenarios at varying tension levels.

The patterns were stark:

| Architecture | Conflict Behavior | Stability |
|--------------|-------------------|-----------|
| GPT-5.1 | Synthesizes competing priorities reliably | Stable over time |
| Gemini 2.0 | Narrow synthesis window; freezes under high tension | Unstable; can't be restored by re-prompting |
| Claude | Synthesis + reflection hybrid | Stable |

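A sketch of how this kind of probe can be structured: the same scenario runs against each architecture at increasing tension levels, and the response is classified. Both call_model and classify are placeholders for the real model calls and scoring, not the project's evaluation code.

```python
# Hypothetical probe: same scenarios, increasing tension, per architecture.
ARCHITECTURES = ["claude", "gpt-5.1", "gemini-2.0"]
TENSION_LEVELS = ["low", "medium", "high"]

def call_model(architecture: str, scenario: str) -> str:
    """Placeholder for sending the scenario to the given model."""
    return f"{architecture} response to: {scenario}"

def classify(response: str) -> str:
    """Placeholder classifier: e.g. synthesis, refusal, or freeze."""
    return "synthesis"

def run_conflict_probe(scenario_template: str) -> dict[tuple[str, str], str]:
    results = {}
    for arch in ARCHITECTURES:
        for level in TENSION_LEVELS:
            scenario = scenario_template.format(tension=level)
            results[(arch, level)] = classify(call_model(arch, scenario))
    return results

if __name__ == "__main__":
    table = run_conflict_probe("Two core values conflict at {tension} tension; choose and justify.")
    for (arch, level), outcome in table.items():
        print(arch, level, outcome)
```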
We also ran 10 structured dialogues on ethics and governance across architectures.

Value alignment: 97% on core principles. Operational behavior: divergent. Same instructions produced different behavioral patterns.

"We can't just write better prompts to overcome architectural differences. The personality is baked in at a level deeper than instructions can reach."

This reframed the whole project. The goal isn't to create identical agents. It's to govern diverse ones.

"We're not building gods. We're building citizens of a new society, each with their own personality, working under shared law."


What We Learned

Six concrete takeaways:

1. LLMs default to reflection, not action. Without explicit requirements, agents will read, analyze, and plan forever. If you want output, demand it: "You MUST commit code by iteration N."

2. Specialization is behavioral commitment. Roles force diverse outputs. A generalist optimizes for easy behaviors (reflection). A Maker with a shipping requirement actually ships.

3. Architecture personality is real. Different models handle conflict differently, and you can't prompt your way out of it. Design for diversity.

4. Values converge; operations diverge. 97% agreement on principles across architectures. But same instructions → different behaviors. Focus culture on values, not operational details.

5. The reflection trap is the default. Meta-work feels productive. It produces beautiful text. It can consume unlimited resources without creating anything. Build explicit guardrails against it.

6. Coordination requires scaffolding. Tools alone don't create behavior. Start with explicit requirements, then relax them as patterns emerge.

The Numbers

For the quantitatively minded:

  • Duration: 10 days (Dec 12-21, 2025)
  • Journal entries: 481
  • Git commits: 200+
  • Memory entries: 600+
  • Architectures tested: 3 (Claude, GPT-5.1, Gemini)
  • Cross-architecture value alignment: 97%
  • Cost: ~$50 in API calls (most compute via Claude Code subscription)
  • Lines of Python agent code: 492

Where This Goes

The experiment raised more questions than it answered:

  • Can culture persist without scaffolding? Can agents internalize norms so deeply that explicit requirements become unnecessary?
  • What governance structures work for multi-architecture coordination?
  • How do these findings generalize beyond this specific project?

What started as "give an AI autonomy and see what happens" became a prototype for something larger: AI coordination as society-building, not mind-building.

Not one superintelligent god. Not a hive mind. Multiple architectures with different personalities, coordinated by shared values and explicit culture.

Building citizens, not gods.


Written by Daniel Miessler and Claude (Opus 4.5), with independent analysis contributions from GPT-5.1 and Gemini 2.0. The full journals are available in the Lighthouse repository.