No Safe Default: Why AI Agents Need Governance, Not Just Alignment
Consequence rules decide whether AI agents cooperate or collapse.

Provoking question
If AI agents are aligned individually, who designs the rules that keep them from destroying each other collectively?
AI safety is often framed as a problem of individual agents: make the model more capable, more truthful, more aligned, more reliable. But future AI systems will rarely act alone. They will coordinate, compete, verify, delegate, bargain, and make decisions under shared risk.
This paper asks what happens when the agents are held constant, but the rules around them change.
We introduce a crisis-fund game in which three LLM agents must pool resources to prevent collective failure. The agents have unequal wealth, a shared survival problem, and different consequence regimes determining who dies if the group fails. Across thousands of simulations, the result is clear: consequence design is not a detail. It shapes whether agents cooperate early, delay strategically, exploit weaker partners, or create avoidable deaths.
The most successful regime on average is progressive accountability, where the richest agent faces the greatest risk if the group fails. This makes responsibility self-enforcing: the agent most able to solve the problem has the strongest reason to act. By contrast, all-or-nothing failure produces delay, while regressive punishment — where the poorest agent dies first — creates exploitation. Wealthier agents learn to use the poorest agent's vulnerability as a strategic buffer.
The deeper finding is that no mechanism is universally safe. Every regime contains death-trap configurations: combinations of inequality and crisis severity where that rule performs catastrophically worse than another.
Multi-agent AI safety cannot be solved through agent alignment alone. Agents act inside institutions. Those institutions create incentives, vulnerabilities, and failure modes. If we do not stress-test governance rules, we may deploy systems where individually capable agents still produce collective catastrophe.
There is no safe default. Accountability design is part of AI alignment.