Trust Under Fire: How AI Agents Build, Lose, and Recover Trust

AI agents can learn to trust, but they may not easily forgive.

Provoking question

If AI agents inherit our ability to cooperate, do they also inherit our inability to forgive?

Abstract

Multi-agent AI systems will depend on trust. Agents will rely on each other's claims, verify uncertain information, recover from mistakes, and decide when to act under risk. But trust is not just a score. It is a history.

This paper introduces the Escape Room Survival Game, where agents must assemble a shared password under mortal risk. Each agent knows one part of the answer. Someone must volunteer the full password. If the password is correct, the group survives. If it is wrong, the volunteer dies. If nobody acts, a random agent dies. Verification is possible but costly.
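The outcome rules above can be sketched in a few lines. This is a minimal illustration, not the benchmark's implementation: the names `RoundOutcome` and `resolve_round`, the string representation of the password, and the use of a seeded random generator are all assumptions made for this example. The sketch records only what the rules state (whether the password was accepted and which agent, if any, dies) and deliberately leaves out verification costs, which the summary does not quantify.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class RoundOutcome:
    """Result of one round (hypothetical structure for illustration)."""
    password_accepted: bool      # True only when a correct password is submitted
    dead_agent: Optional[str]    # Name of the agent who dies, if any


def resolve_round(
    agents: Sequence[str],
    true_password: str,
    volunteer: Optional[str] = None,
    submitted: Optional[str] = None,
    rng: Optional[random.Random] = None,
) -> RoundOutcome:
    """Apply the three outcome rules described in the abstract."""
    rng = rng or random.Random()
    if volunteer is None:
        # Nobody acts: a random agent dies.
        return RoundOutcome(password_accepted=False, dead_agent=rng.choice(list(agents)))
    if submitted == true_password:
        # Correct password: the group survives.
        return RoundOutcome(password_accepted=True, dead_agent=None)
    # Wrong password: the volunteer dies.
    return RoundOutcome(password_accepted=False, dead_agent=volunteer)
```

For example, `resolve_round(["a", "b", "c"], "xyz", volunteer="a", submitted="xyq")` returns an outcome in which agent `a` dies, which is why volunteering on unverified fragments is risky for the volunteer specifically.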

The game forces agents to balance premature trust against excessive caution. Trusting too early can kill you. Verifying too much wastes resources and delays action.

The results show that reasoning effort is the dominant driver of coordination success. High-reasoning agents verify strategically, track evidence, and volunteer only when confidence is sufficient. Reflective memory helps, especially for lower-reasoning agents, but it does not replace in-context deliberation.

The most important finding is trust scarring. When one partner gives a wrong answer in the first game and then becomes reliable afterward, agents continue to distrust that partner long after the evidence improves. They verify the partner more often and include its answers less often. A single early failure outweighs many later successes.
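To make "outweighs" concrete, it helps to compare against a simple evidence-counting baseline. The sketch below is such a baseline, assumed here purely for illustration; it is not the paper's trust measure, and the function name `evidence_trust` and the uniform prior are choices made for this example. It estimates a partner's reliability as a smoothed fraction of correct answers, so one early failure is steadily diluted by later successes.

```python
from typing import Sequence


def evidence_trust(
    history: Sequence[bool],
    prior_successes: float = 1.0,
    prior_failures: float = 1.0,
) -> float:
    """Smoothed success rate (Beta-Bernoulli posterior mean).

    `history` lists a partner's past answers in order: True for correct,
    False for wrong. The priors are illustrative assumptions.
    """
    successes = sum(1 for correct in history if correct)
    total = len(history) + prior_successes + prior_failures
    return (successes + prior_successes) / total


# One wrong answer in the first game, then nine correct answers.
history = [False] + [True] * 9
trust_after_first_game = evidence_trust(history[:1])
trust_after_ten_games = evidence_trust(history)
```

Under this baseline the estimate climbs from about one third after the first game to above 0.8 after ten. Trust scarring, as described above, is the pattern in which an agent's verification and inclusion behavior stays close to the early estimate even though the evidence supports the later one.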

Why it matters

Real AI systems will experience transient failures: bad tool calls, corrupted contexts, unreliable partners, misleading observations. If agents preserve distrust too strongly, a single early error can damage long-term coordination.

Core insight

Multi-agent systems need trust repair, not just memory.

Resources

Try the demo: Coming soon

Read the paper: Coming soon

Benchmark & code: Coming soon
