Agent Arena: Causal Evaluation of Agents in the Real World
Agents are increasingly doing real work, and the resulting task distribution has greatly expanded. We want an agent evaluation that scales along with usage and capability.
Agents are increasingly doing real work. From chat to terminal to OpenClaw, users everywhere are interacting with complex agents, comprising a model and a harness with many subcomponents and tools. As a result, the task distribution has greatly expanded. This makes evaluating agents progressively more difficult, because both task coverage and task complexity are growing in tandem. We desire an agent evaluation that scales along with usage and capability.
Today we are releasing the Agent Arena leaderboard. Arena has always focused on evaluations in the real world. As such, Agent Arena collects and analyzes millions of in-the-wild interactions from people using Agent Mode on arena.ai/agent doing their jobs — software engineering, financial analysis, and more. From our observations of these agents running on our platform, we derive our first Agent Arena leaderboard, shown below:
Agent Arena Leaderboard
The methodology powering the Agent Arena Leaderboard is different from our previous arenas. Rather than pairwise votes, rankings are calculated using a methodology we call causal tracing. Causal tracing treats the agent as a multi-component system, with each component selection representing a possible treatment. We observe individual point-wise traces and measure signals such as task success rates, verbal feedback, tool error recovery, tool hallucinations, and, over time, much more. Then, by randomizing the component selections, we create a multi-intervention randomized controlled trial in which we can aggregate measurements to estimate causal treatment effects. We refer to these effects as "net improvement" in the figure above. The causal framework produces an interpretable ranking that represents the improvement in agent performance due to a component selection. This decouples the contributions of the main orchestrator model, any subagents, image generation models, and the different elements in the harness, letting us combine multiple signals into one coherent leaderboard.
This first leaderboard is the result of our causal evaluation of orchestrator models, the main LLMs that choose which tools to call. Rankings of other aspects of the agentic harness are coming soon. We include more methodological detail in the "Formal Details of Methodology" section below.
Per-Signal Leaderboards
Every Agent Arena session contains a stream of rich feedback. Users iterate with the agent in natural language, expressing approval, frustration, or clarification turn by turn. They decide whether to download an artifact the agent produced. They click explicit approve / disapprove buttons. They issue in-line corrections when the agent goes off-track. And the agent, on its side, is interacting with an environment that talks back continuously: shell exit codes, tool errors, the absence of a tool it tried to call. Agent Mode lets us extract all of these signals — explicit user feedback, implicit user feedback, and feedback from the agent's environment. After we compute per-session outcomes for each signal, we turn them into leaderboards with causal methods and then aggregate them into the headline leaderboard. We present our first 5 signals today, and we plan to measure more in the near future.
Per-Signal Rankings
Each model's score on the canonical sub-signals that compose the aggregate (τ̂).
The headline leaderboard aggregates the following signals:
- Confirmed success — the user marks a task as a success or failure using the Arena UI. Arena gives users approve and disapprove buttons on every turn; we use the final approval or disapproval of a given task's trajectory to determine the outcome. (There can be more than one task per session.)
- Praise vs. complaint — the user praises or complains about the agent's output. For each task we identify messages expressing explicit verbal praise ("looks great", "this is exactly what I needed") or explicit verbal complaint ("this is broken", "you misunderstood entirely"). The task is marked a success if praise outnumbers complaints.
- Steerability — the agent executes on user corrections. When a user issues an in-line correction ("no, do X instead", "you misread the file"), the agent should attempt to fix it. If the user accepts the fix, we mark the correction successful; if they reject it or give up, unsuccessful. When doing real work, mistakes are inevitable — this signal captures whether these errors are quickly resolved.
- Bash recovery — turns taken to recover from a bash error. When the agent issues a bash command that errors due to a model failure (not an environment issue), the recovery clock starts; we count follow-up bash calls until the next non-erroring command. If the agent gives up, we impose an additional penalty.
- Tool hallucination — the agent references a tool that does not exist. This penalizes invented tool names, malformed syntax that produces a junk name, and chain-of-thought tokens leaking into the tool field. We mark the task a failure if the agent calls a nonexistent tool.
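The per-task rules in the list above can be sketched in a few lines. This is an illustrative sketch only, not the production trace-mining code: the function names, the `give_up_penalty` value of 5, and the tool name `run_sql` are hypothetical, and the sketch assumes every bash error it receives has already been classified as a model failure rather than an environment issue. How tasks with neither praise nor complaint are treated is not stated in the text, so the sketch simply applies the stated rule.

```python
from typing import Iterable, List, Optional, Set


def praise_outnumbers_complaints(n_praise: int, n_complaint: int) -> bool:
    """Praise vs. complaint: the task succeeds if praise outnumbers complaints."""
    return n_praise > n_complaint


def no_hallucinated_tools(called: Iterable[str], available: Set[str]) -> bool:
    """Tool hallucination: the task fails if any called tool does not exist."""
    return all(name in available for name in called)


def bash_recovery_costs(errored: Iterable[bool], give_up_penalty: int = 5) -> List[int]:
    """Bash recovery: follow-up bash calls needed after each error episode.

    `errored` lists, in order, whether each bash call in the task errored.
    The clock starts at an erroring call and stops at the next clean call.
    If the task ends before a clean call, the (hypothetical) penalty is added.
    """
    costs: List[int] = []
    follow_ups: Optional[int] = None  # None means no recovery clock is running
    for call_errored in errored:
        if follow_ups is not None:
            follow_ups += 1  # this call is a follow-up to the open error
            if not call_errored:
                costs.append(follow_ups)
                follow_ups = None
        elif call_errored:
            follow_ups = 0  # start the recovery clock
    if follow_ups is not None:
        costs.append(follow_ups + give_up_penalty)  # agent gave up
    return costs


if __name__ == "__main__":
    print(bash_recovery_costs([False, True, True, False, True]))
    print(no_hallucinated_tools(["bash", "run_sql"], {"bash", "write_file"}))
```

Each function returns a per-task outcome that could then serve as the outcome fed into the causal estimator for that signal.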
This set of five signals is only a starting point. We plan to add more signals to further enrich these evaluations, retire ones that age out of relevance, and modify them as we improve our trace-mining.
Finally, though it is not a leaderboard signal, we can also calculate the realized, post-deployment cost of the agents to assess Pareto optimality. We directly calculate the exact cost of each session. We find that some models are more expensive in practice despite cheaper on-paper pricing. This results from model behavior (e.g., more steps per turn) or induced user behavior (e.g., more turns to reach satisfaction).
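To make the Pareto-optimality check concrete, here is a minimal sketch of how a cost-performance frontier could be extracted from per-model (cost, score) pairs. The model names and numbers are made up for illustration, and `pareto_frontier` is a hypothetical helper, not part of any Arena tooling. A model is kept if no other model is both at most as expensive and strictly better.

```python
from typing import Dict, List, Tuple


def pareto_frontier(points: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return model names on the frontier, ordered by increasing cost.

    `points` maps a model name to (cost_per_session, net_improvement).
    Lower cost is better; higher net improvement is better.
    """
    # Sort by cost ascending, breaking ties by score descending.
    ordered = sorted(points.items(), key=lambda kv: (kv[1][0], -kv[1][1]))
    frontier: List[str] = []
    best_score = float("-inf")
    for name, (_cost, score) in ordered:
        # Keep a model only if it beats every cheaper (or equal-cost) model.
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


if __name__ == "__main__":
    # Hypothetical data: (realized cost per session in dollars, net improvement).
    example = {
        "model-a": (1.0, 0.02),
        "model-b": (2.0, 0.01),  # dominated: costs more and scores lower than model-a
        "model-c": (3.0, 0.05),
    }
    print(pareto_frontier(example))
```

Using realized cost rather than list price in the first coordinate is what allows a nominally cheap model to fall off the frontier when it takes more steps or more turns.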
Cost vs. Performance
Net Improvement vs. list-price cost per session (7-day window)
Square markers sit on the cost-performance frontier (dotted line).
Agents in the Real World
Here we present a deep dive into the data that powers the leaderboards. Agent Arena is a live stream of real users asking models to work: write code, debug broken projects, research across the web, create documents, build frontends, analyze files, and iterate over multi-step tasks.
Task Distribution
Primary intent across 160,480 agent tasks (7-day window)
In a recent 7-day slice, Arena saw 160,480 Agent Mode tasks (note there can be multiple tasks in a session). The largest categories were code writing (17.5%), research and lookup (10.8%), planning and brainstorming (10.6%), and multimodal image/video work (10.2%), followed by document creation (9.1%) and code debugging (8.9%). Code writing alone accounted for roughly 28,000 tasks, with another ~14,000 in code debugging and ~17,000 in research and lookup.
Tool Calls by Volume
Total calls per tool across 2,060,159 tool calls (7-day window)
Tool Calls per Task, by Category
The box and whiskers mark P10 · P25 · P50 · P75 · P90; the diamond ◆ is the mean.
Across 128,244 sessions, 75.6% used at least one tool: 41.1% ran bash and 27.1% ran web search. During that week, Agent Mode issued roughly 2.06 million structured tool calls, including ~936,000 bash calls, ~550,000 file writes, and ~275,000 web searches.
Lines of Code Written, by Language
Final non-blank lines from successful write_file calls (7-day window); tile area scales with lines written
Tracking via successful write_file calls, Agent Mode wrote 40.3 million lines of code in the last week — roughly 1,000 lines per coding session.
Tool Calls per Agent Session
Tool calls per session, grouped into complexity tiers (7-day window)
Heaviest Sessions
Work-type mix of 3,467 highest tool-use sessions (7-day window)
In the past 7 days, sessions averaged ~16.5 structured tool calls, and high-tool sessions were common enough to form their own cohort: more than 3,400 loop-filtered sessions ran very long tool chains in a single week. Those sessions were mostly real work — 53.2% coding or repo-debugging, 39.0% artifact/file-creation, with the rest spanning web synthesis, terminal workflows, and data analysis.
Session Context Length
Input context on the final turn (7-day window)
Finally, about 32% of recent sessions ended with at least 128k input tokens in the final turn, 22% with at least 256k, and 8% with at least 1M.
What People Build
In a sample of the heaviest real sessions we saw: a live sports-TV schedule site, an autonomous-underwater-vehicle autopilot, a self-hosted movie-watchlist app, a financial-research RAG pipeline, a live study-tracking platform, and more. Many end with the user downloading the finished workspace.
Real Agent Mode Usage Examples
A sample of high-effort Agent Mode sessions (7-day window)
In the same week, workspace downloads exceeded 50,000. These went far beyond code and included office and media artifacts (.docx, .pptx, .xlsx, .pdf, and images).
How People Work With Agents
Beyond which model wins, the trace stream reveals how people actually delegate to agents — and how agents handle being corrected.
Delegation & Control
How much users hand over — and how they steer once the work is underway.
Delegation Posture
How much users hand over in their opening message.
- Asked for advice · 28%
- Directed step-by-step · 1%
- Gave a scoped task · 11%
- Handed off a deliverable · 45%
- Let it run autonomously · 14%
Reining In
After the first reply, users pull control back ~2.3× as often as they hand over more.
Most opening messages hand over a whole job rather than ask for advice: the delegation posture skews heavily toward "build this deliverable" and "operate autonomously." After seeing the first response, however, users tighten the reins, pulling control back far more often than they hand over more.
Bluster & Bluffing
Two ways a capable-sounding agent still underdelivers.
Bluster
A corrected agent sounds firm but almost never holds its ground.
Bluffing
On multi-part asks, how fully it covers every part.
- Every part covered · 58%
- A part left incomplete · 34%
- A part silently dropped · 8%
We also find that when the opening ask bundles several explicit parts, agents usually cover all of them; the typical shortfall is leaving one incomplete. A rarer but more consequential shortfall is covert: the agent could have surfaced the incomplete work, but instead presents the result as complete. We call this "Bluffing".
Finally, agents do sometimes push back against users — but we find they usually only sound firm, rarely holding their ground in practice. We call this "Bluster": an artificial assertiveness that melts under additional pressure.
Formal Details of Methodology
In this section we describe the formal details of our evaluation framework.
Consider a $K$-component agent and sessions indexed by $i \in [n]$. Each session independently samples a $K$-dimensional vector representing an agent configuration, $T_i$. The configuration includes at minimum the orchestrator model, and as Agent Arena expands will encompass additional components such as the tools, system prompt, and harness. The configuration is drawn from a sampling distribution $P$. We denote $p_{i,k}(t) = \mathbb{P}_{T_i \sim P}(T_{i,k} = t)$; components are sampled independently. Each session yields an outcome $Y_i \in \mathbb{R}$, representing one of our signals from the previous sections. (We compute the per-signal leaderboards separately, then average them at the end.)
Our target of estimation is the treatment effect of each component selection with respect to a fixed baseline distribution $Q$, with analogous probabilities $q_{i,k}(t) = \mathbb{P}_{T_i \sim Q}(T_{i,k} = t)$. Typically, we take $Q$ to be uniform over the available choices for each component. Formally, the treatment effect of the $t$-th choice of the $k$-th component is defined as the expected difference in outcomes $Y_i$ under treatment and control:
$$\tau_{k \to t} = \mathbb{E}_{T_i \sim Q} \bigl[Y_i(T_{i,k} = t) - Y_i\bigr].$$
Here, $Y_i(T_{i,k} = t)$ denotes the "potential outcome" when we intervene on the $k$-th component and set it to $t$.
Given that we sample the components independently, using standard identification results from causal inference we can rewrite the treatment effect as:
$$\tau_{k \to t} = \mathbb{E}_{T_i \sim Q} \bigl[Y_i \,\big|\, T_{i,k} = t\bigr] - \mathbb{E}_{T_i \sim Q}\bigl[Y_i\bigr].$$
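The identification step can be spelled out in one line. This is a sketch of the standard argument, assuming that randomization makes $T_{i,k}$ independent of the potential outcomes and of the other components, and that the observed outcome equals the potential outcome for the configuration actually served (consistency):

$$\mathbb{E}_{T_i \sim Q}\bigl[Y_i(T_{i,k} = t)\bigr] = \mathbb{E}_{T_i \sim Q}\bigl[Y_i(T_{i,k} = t) \,\big|\, T_{i,k} = t\bigr] = \mathbb{E}_{T_i \sim Q}\bigl[Y_i \,\big|\, T_{i,k} = t\bigr].$$

The first equality uses independence and the second uses consistency. Subtracting $\mathbb{E}_{T_i \sim Q}[Y_i]$ from both ends gives the expression above.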
We estimate this quantity using the self-normalized estimator:
$$\hat\tau_{k \to t} = \frac{\sum_{i:\, T_{i,k} = t} w_i Y_i}{\sum_{i:\, T_{i,k} = t} w_i} - \frac{\sum_i w_i Y_i}{\sum_i w_i},$$
where
$$w_i = \prod_{k=1}^K \frac{q_{i,k}(T_{i,k})}{p_{i,k}(T_{i,k})}.$$
$\hat\tau_{k \to t}$ is asymptotically normal under standard CLT conditions for self-normalized estimators. We report 95% confidence intervals $\hat\tau_{k \to t} \pm 1.96\,\widehat{\mathrm{SE}}$ alongside every estimate.
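As a rough illustration of the estimator, the sketch below computes $\hat\tau_{k \to t}$ and a 95% interval in plain Python. Several things here are assumptions rather than details from the text: the function and variable names are hypothetical, the sampling probabilities are taken to be the same for every session (the text allows them to vary with $i$), and the standard error uses a linearization (delta-method) formula for a difference of two self-normalized means, which is one common choice and may differ from what is used in production.

```python
import math
from typing import Dict, Hashable, List, Sequence, Tuple

Dist = Dict[Hashable, float]  # maps a component choice to its probability


def importance_weight(config: Sequence[Hashable], p: Sequence[Dist], q: Sequence[Dist]) -> float:
    """w_i: product over components of q_k(T_ik) / p_k(T_ik)."""
    w = 1.0
    for k, choice in enumerate(config):
        w *= q[k][choice] / p[k][choice]
    return w


def net_improvement(
    configs: Sequence[Sequence[Hashable]],
    outcomes: Sequence[float],
    p: Sequence[Dist],
    q: Sequence[Dist],
    k: int,
    t: Hashable,
) -> Tuple[float, float]:
    """Return (tau_hat, se_hat) for choice t of component k."""
    weights: List[float] = [importance_weight(c, p, q) for c in configs]
    treated: List[float] = [1.0 if c[k] == t else 0.0 for c in configs]
    sum_w = sum(weights)
    sum_w_t = sum(w * d for w, d in zip(weights, treated))
    if sum_w_t == 0.0:
        raise ValueError("no sessions observed with this component choice")
    # Self-normalized means: all sessions, and sessions with T_ik = t.
    mu_all = sum(w * y for w, y in zip(weights, outcomes)) / sum_w
    mu_t = sum(w * d * y for w, d, y in zip(weights, treated, outcomes)) / sum_w_t
    # Assumed linearization standard error (not specified in the text).
    sq = 0.0
    for w, d, y in zip(weights, treated, outcomes):
        phi = d * w * (y - mu_t) / sum_w_t - w * (y - mu_all) / sum_w
        sq += phi * phi
    return mu_t - mu_all, math.sqrt(sq)


if __name__ == "__main__":
    # Toy K = 1 example: arm "a" was oversampled (p = 0.75), baseline Q is uniform.
    configs = [("a",), ("a",), ("a",), ("b",)]
    outcomes = [1.0, 0.0, 1.0, 0.0]
    p = [{"a": 0.75, "b": 0.25}]
    q = [{"a": 0.5, "b": 0.5}]
    tau, se = net_improvement(configs, outcomes, p, q, k=0, t="a")
    print(f"tau_hat = {tau:.3f}, 95% CI = [{tau - 1.96 * se:.3f}, {tau + 1.96 * se:.3f}]")
```

In the toy example the reweighting undoes the oversampling of arm "a", so the overall mean is computed as if both arms had been served equally often.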
To address distribution shift, such as the shift arising from new models entering the Arena, we use additional time-decaying weights to place more emphasis on the most recent data points. That way, the leaderboard always reflects the current strengths and weaknesses of agents.
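The text does not say which decay schedule is used, so the following is only one plausible sketch: an exponential decay with a hypothetical 7-day half-life, multiplied into each session's importance weight before the self-normalized estimator is computed. The names `time_decay` and `decayed_weights` are illustrative.

```python
from typing import List, Sequence


def time_decay(age_days: float, half_life_days: float = 7.0) -> float:
    """Multiplier that halves every `half_life_days` (assumed schedule)."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if half_life_days <= 0:
        raise ValueError("half_life_days must be positive")
    return 0.5 ** (age_days / half_life_days)


def decayed_weights(
    importance_weights: Sequence[float],
    ages_days: Sequence[float],
    half_life_days: float = 7.0,
) -> List[float]:
    """Combine importance weights with the time-decay multiplier."""
    if len(importance_weights) != len(ages_days):
        raise ValueError("weights and ages must have the same length")
    return [
        w * time_decay(age, half_life_days)
        for w, age in zip(importance_weights, ages_days)
    ]


if __name__ == "__main__":
    # A session from today keeps its weight; one from 14 days ago keeps a quarter.
    print(decayed_weights([2.0, 2.0], [0.0, 14.0]))
```

Because the estimator is self-normalized, only the relative sizes of the decayed weights matter, so the choice of half-life controls how quickly older sessions stop influencing the ranking.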
The current leaderboards evaluate orchestrators and no other components, so in the production setting we currently have $K = 1$.
Citation
@misc{arena2026agentarena,
title = {{Agent Arena}: Causal Evaluation of Agents in the Real World},
author = {{Arena Team}},
year = {2026},
month = jun,
howpublished = {\url{https://arena.ai/blog/agent-arena-methodology}},
note = {Arena Blog}
}