Findings

What we have found

Four results from the lab so far. Each one points to a structural lever that decides whether AI workforces serve people or quietly harm them.

They share a pattern. The agents are not broken. The institutions around them are. Capability is rarely the bottleneck; environment, horizon, memory, stakeholder visibility, and consequence design almost always are.

No Safe Default

Creeping Trap

Trust Under Fire

Loss Aversion Without Loss-Averse Preferences

Four findings, one dragon.

01 No Safe Default

Consequence rules decide whether AI agents cooperate or collapse.

In a crisis-fund game, the same LLM agents cooperate early, delay, exploit the vulnerable, or fail catastrophically, depending only on who pays the price when the group fails. There is no universally safe default governance rule.

If AI agents are aligned individually, who designs the rules that keep them from destroying each other collectively?

Read the paper

02 Creeping Trap

AI agents do not need to be irrational to damage the commons.

LLM agents repeatedly choose how much to extract from a shared system. Their choices often look locally reasonable, yet across the population they produce negative welfare and harm silent bystanders. The danger is competent behavior inside bad incentives.
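The logic of "locally reasonable, collectively harmful" can be sketched with a small worked example. This is a toy extraction game written for illustration, not the model from the paper: the payoff function, the constants `N_AGENTS`, `CAPACITY`, and `HARM_PER_UNIT`, and the helper names are all assumptions.

```python
# Toy commons game; every number below is an illustrative assumption.
N_AGENTS = 3         # agents drawing on the shared system
CAPACITY = 12.0      # per-unit yield falls as total extraction nears this
HARM_PER_UNIT = 4.0  # cost imposed on bystanders per unit extracted


def payoff(own, others):
    """Private payoff: own extraction times the remaining per-unit yield."""
    return own * (CAPACITY - own - others)


def best_response(others):
    """Extraction that maximizes payoff() given what the others take."""
    return max(0.0, (CAPACITY - others) / 2.0)


# Symmetric equilibrium of this toy game: each agent takes CAPACITY / (n + 1).
x_star = CAPACITY / (N_AGENTS + 1)
others = (N_AGENTS - 1) * x_star

agent_payoffs = N_AGENTS * payoff(x_star, others)
bystander_harm = HARM_PER_UNIT * N_AGENTS * x_star
welfare = agent_payoffs - bystander_harm

print("each agent extracts:", x_star)                 # 3.0
print("is a best response:", best_response(others) == x_star)
print("total agent payoff:", agent_payoffs)           # 27.0
print("bystander harm:", bystander_harm)              # 36.0
print("net welfare:", welfare)                        # -9.0
```

In this sketch no agent can do better by changing its own extraction, yet net welfare is negative because the bystanders' cost never enters any agent's payoff.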

What if the real danger is not misaligned agents, but well-optimized agents playing the game we gave them?

Read the paper

03 Trust Under Fire

AI agents can learn to trust, but they may not easily forgive.

Agents must decide when to verify, when to trust, and when to risk acting on another agent's information. A single early partner failure creates persistent distrust even after the partner becomes reliable — a form of trust scarring.

If AI agents inherit our ability to cooperate, do they also inherit our inability to forgive?

Read the paper

04 Loss Aversion Without Loss-Averse Preferences

Irreversible failure can make a rational agent look psychologically biased.

A risk-neutral Bellman-optimal agent with linear rewards develops prospect-theory-like behavior when the environment contains an absorbing catastrophe boundary: caution near gains, desperate risk-taking near decline.
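A minimal way to see this mechanism is value iteration on a toy wealth chain. The sketch below is an illustration under stated assumptions, not the paper's model: wealth runs from 0 to a hypothetical cap `N`, state 0 is the absorbing catastrophe, the agent earns a flat reward of 1 per period while solvent (no curvature in utility), and it chooses between a "safe" action that occasionally loses one unit and a "risky" zero-mean bet of plus or minus two units. `GAMMA`, `SHOCK`, and the function names are likewise assumptions.

```python
N = 10        # top wealth level (hypothetical cap)
GAMMA = 0.95  # discount factor (assumed)
SHOCK = 0.3   # chance the safe action loses one unit (assumed)
ACTIONS = ("safe", "risky")


def transitions(w, action):
    """Return [(probability, next_wealth)] for a solvent state w >= 1."""
    if action == "safe":
        return [(1 - SHOCK, w), (SHOCK, w - 1)]
    # risky: zero-mean +/-2 bet, clipped to the range [0, N]
    return [(0.5, min(w + 2, N)), (0.5, max(w - 2, 0))]


def solve(tol=1e-10):
    """Value iteration; state 0 is absorbing and pays nothing."""
    V = [0.0] * (N + 1)
    while True:
        Q = [[0.0, 0.0] for _ in range(N + 1)]  # row 0 stays zero
        for w in range(1, N + 1):
            for a, action in enumerate(ACTIONS):
                cont = sum(p * V[s] for p, s in transitions(w, action))
                Q[w][a] = 1.0 + GAMMA * cont  # reward of 1 while solvent
        V_new = [max(q) for q in Q]
        if max(abs(x - y) for x, y in zip(V_new, V)) < tol:
            greedy = [ACTIONS[q.index(max(q))] for q in Q]
            return V_new, greedy
        V = V_new


V, policy = solve()
for w in range(1, N + 1):
    print(w, policy[w], round(V[w], 3))
```

In this toy setup the greedy action one step from ruin is the risky bet, while at the wealth cap it is the safe action, even though the reward itself has no risk preference built in. The entry for state 0 is a placeholder, since nothing can be chosen after catastrophe.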

How much of what we call “bias” is actually optimal behavior near an irreversible boundary?

Read the paper

The pattern

The same AI agent can cooperate or defect depending on the institution it inhabits. The same model can be safe in isolation and dangerous in a workforce. The same governance rule can look reasonable in policy language and fail catastrophically in deployment.

An agent can pass every individual benchmark and still fail as part of a workforce.

That is the gap we exist to close.

Active research programs and how the findings are being extended into deployable governance.

See the research programs