Arena Team · 47 min read

Agent Arena: Causal Evaluation of Agents in the Real World

Agents are increasingly doing real work, and the resulting task distribution has greatly expanded. We want an agent evaluation that scales with usage and capability.

Agents are increasingly doing real work. From chat to terminal to OpenClaw, users everywhere are interacting with complex agents, comprising a model and a harness with many subcomponents and tools. As a result, the task distribution has greatly expanded. This makes evaluating agents progressively more difficult, because both task coverage and task complexity are growing in tandem. We desire an agent evaluation that scales along with usage and capability.

Today we are releasing the Agent Arena leaderboard. Arena has always focused on evaluations in the real world. As such, Agent Arena collects and analyzes millions of in-the-wild interactions from people using Agent Mode on arena.ai/agent doing their jobs — software engineering, financial analysis, and more. From our observations of these agents running on our platform, we derive our first Agent Arena leaderboard, shown below:

Agent Arena Leaderboard

Net Improvement (τ̂), by model:

  • GPT 5.5 (High) · 10.7%
  • Claude Opus 4.7 · 9.5%
  • GPT 5.4 (High) · 8.9%
  • Claude Opus 4.6 · 8.1%
  • GPT 5.5 · 7.5%
  • Claude Opus 4.7 · 7.0%
  • Claude Sonnet 4.6 · 4.6%
  • GLM 5.1 · 3.4%
  • Gemini 3.1 Pro · 1.4%
  • Gemini 3.5 Flash · 0.4%
  • Kimi K2.6 · -0.6%
  • DeepSeek V4 Pro · -1.9%
  • Qwen 3.6 Plus · -3.4%
  • DeepSeek V4 Flash · -5.1%
  • Minimax M2.7 · -8.5%
  • Gemini 3 Flash · -9.2%
  • Gemma 4 31B · -14.6%
  • Grok 4.3 · -25.1%

The methodology powering the Agent Arena Leaderboard is different from our previous arenas. Rather than pairwise votes, rankings are calculated using a methodology we call causal tracing. Causal tracing treats the agent as a multi-component system, with each component selection representing a possible treatment. We observe individual point-wise traces and measure signals such as task success rates, verbal feedback, tool error recovery, tool hallucinations, and, over time, much more. Then, by randomizing the component selections, we create a multi-intervention randomized controlled trial in which we can aggregate measurements to estimate causal treatment effects. We refer to these effects as "net improvement" in the figure above. The causal framework produces an interpretable ranking that represents the improvement in agent performance due to a component selection. This decouples the contributions of the main orchestrator model, any subagents, image generation models, and the different elements in the harness, letting us combine multiple signals into one coherent leaderboard.

This first leaderboard is the result of our causal evaluation of orchestrator models — the main LLMs that choose which tools to call. Rankings of other aspects of the agentic harness are coming soon. We include more methodological detail in the statistical-methodology section below.

Per-Signal Leaderboards

Every Agent Arena session contains a stream of rich feedback. Users iterate with the agent in natural language, expressing approval, frustration, or clarification turn by turn. They decide whether to download an artifact the agent produced. They click explicit approve / disapprove buttons. They issue in-line corrections when the agent goes off-track. And the agent, on its side, is interacting with an environment that talks back continuously: shell exit codes, tool errors, the absence of a tool it tried to call. Agent Mode lets us extract all of these signals — explicit user feedback, implicit user feedback, and feedback from the agent's environment. After we compute per-session outcomes for each signal, we turn them into leaderboards with causal methods and then aggregate them into the headline leaderboard. We present our first 5 signals today, and we plan to measure more in the near future.

Per-Signal Rankings

Each model's score on the canonical sub-signals that compose the aggregate (τ̂).

The headline leaderboard aggregates the following signals:

  • Confirmed success — the user marks a task as a success or failure using the Arena UI. Arena gives users approve and disapprove buttons on every turn; we use the final approval or disapproval of a given task's trajectory to determine the outcome. (There can be more than one task per session.)
  • Praise vs. complaint — the user praises or complains about the agent's output. For each task we identify messages expressing explicit verbal praise ("looks great", "this is exactly what I needed") or explicit verbal complaint ("this is broken", "you misunderstood entirely"). The task is marked a success if praise outnumbers complaints.
  • Steerability — the agent executes on user corrections. When a user issues an in-line correction ("no, do X instead", "you misread the file"), the agent should attempt to fix it. If the user accepts the fix, we mark the correction successful; if they reject it or give up, unsuccessful. When doing real work, mistakes are inevitable — this signal captures whether these errors are quickly resolved.
  • Bash recovery — turns taken to recover from a bash error. When the agent issues a bash command that errors due to a model failure (not an environment issue), the recovery clock starts; we count follow-up bash calls until the next non-erroring command. If the agent gives up, we impose an additional penalty.
  • Tool hallucination — the agent references a tool that does not exist. This penalizes invented tool names, malformed syntax that produces a junk name, and chain-of-thought tokens leaking into the tool field. We mark the task a failure if the agent calls a nonexistent tool.
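
As a rough illustration of how the two environment signals above might be computed from a trace, the sketch below scores bash recovery and tool hallucination. The `BashCall` record, the `give_up_penalty` value, and the set of available tool names are hypothetical: the post does not specify the trace schema or the size of the penalty.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class BashCall:
    errored: bool             # the command exited with an error
    model_fault: bool = True  # False when the environment, not the model, caused the error


def bash_recovery_costs(calls: List[BashCall], give_up_penalty: int = 5) -> List[int]:
    """Return one recovery cost per model-caused bash error episode.

    The cost is the number of follow-up bash calls up to and including the
    next non-erroring command. The penalty value is an assumption.
    """
    costs: List[int] = []
    clock: Optional[int] = None  # None means no recovery clock is running
    for call in calls:
        if clock is None:
            if call.errored and call.model_fault:
                clock = 0  # a model-caused error starts the recovery clock
        else:
            clock += 1  # count this follow-up bash call
            if not call.errored:
                costs.append(clock)
                clock = None
    if clock is not None:
        # The trace ended without a clean command: treat it as giving up.
        costs.append(clock + give_up_penalty)
    return costs


def hallucinated_tool(called: List[str], available: Set[str]) -> bool:
    """A task fails this signal if any called tool name is not available."""
    return any(name not in available for name in called)
```

Here an episode that needs two follow-up calls costs 2, and an episode that never recovers costs its follow-up count plus the penalty.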

This set of five signals is only a starting point. We plan to add more signals to further enrich these evaluations, retire ones that age out of relevance, and modify them as we improve our trace-mining.

Finally, though cost is not a leaderboard signal, we can also calculate the realized, post-deployment cost of the agents to assess Pareto optimality. We directly calculate the exact cost of each session. We find that some models are more expensive in practice despite cheaper on-paper pricing. This results from model behavior (e.g. more steps per turn) or induced user behavior (e.g. more turns to reach satisfaction).
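
To make the Pareto-optimality check concrete, here is a minimal sketch that picks out the cost–performance frontier from a list of (model, cost per session, net improvement) tuples. The function name and the input layout are illustrative assumptions, not the post's implementation.

```python
from typing import List, Tuple


def pareto_frontier(points: List[Tuple[str, float, float]]) -> List[str]:
    """Return names of models that no other model beats on both axes.

    Each point is (name, cost_per_session, net_improvement). Lower cost and
    higher net improvement are better.
    """
    frontier: List[str] = []
    best_score = float("-inf")
    # Sort by cost ascending; break cost ties by score descending.
    for name, _cost, score in sorted(points, key=lambda p: (p[1], -p[2])):
        # A model is on the frontier only if it scores strictly higher
        # than every model that costs the same or less.
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier
```

A model that is cheap on paper can still fall off this frontier if its realized per-session cost is high, which is the effect described above.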

Cost vs. Performance

Net Improvement vs. list-price cost per session (7-day window)

Square markers sit on the cost–performance frontier (dotted line).

Agents in the Real World

Here we present a deep dive into the data that powers the leaderboards. Agent Arena is a live stream of real users asking models to work: write code, debug broken projects, research across the web, create documents, build frontends, analyze files, and iterate over multi-step tasks.

Task Distribution

Primary intent across 160,480 agent tasks (7-day window)

Top categories (≥5%):

  • Code writing · 17.5%
  • Research / lookup · 10.8%
  • Planning / brainstorm · 10.6%
  • Image / video · 10.2%
  • Document creation · 9.1%
  • Code debugging · 8.9%
  • Chitchat · 6.8%
  • Education / tutoring · 5.7%
  • Creative writing · 5.3%

Inner arcs show each category's sub-intents.

In a recent 7-day slice, Arena saw 160,480 Agent Mode tasks (note there can be multiple tasks in a session). The largest categories were code writing (17.5%), research and lookup (10.8%), planning and brainstorming (10.6%), and multimodal image/video work (10.2%), followed by document creation (9.1%) and code debugging (8.9%). Code writing alone accounted for roughly 28,000 tasks, with another ~14,000 in code debugging and ~17,000 in research and lookup.

Tool Calls by Volume

Total calls per tool across 2,060,159 tool calls (7-day window)

  • bash · 936,046
  • write_file · 549,893
  • web_search · 275,660
  • read_file · 117,873
  • fetch_page · 85,684
  • list_files · 45,686
  • ask_user · 39,043
  • generate_image · 10,274

Tool Calls per Task, by Category

The box and whiskers mark P10 · P25 · P50 · P75 · P90; the diamond ◆ is the mean.

Across 128,244 sessions, 75.6% used at least one tool: 41.1% ran bash and 27.1% ran web search. Over the week, Agent Mode issued about 2.06 million structured tool calls, including ~936,000 bash calls, ~550,000 file writes, and ~276,000 web searches.

Lines of Code Written, by Language

Final non-blank lines from successful write_file calls (7-day window); tile area scales with lines written

Tracking via successful write_file calls, Agent Mode wrote 40.3 million lines of code in the last week — roughly 1,000 lines per coding session.

Tool Calls per Agent Session

Tool calls per session, grouped into complexity tiers (7-day window)

Heaviest Sessions

Work-type mix of 3,467 highest tool-use sessions (7-day window)

Top categories (≥5%):

  • Coding & repo debugging · 53.2%
  • Artifact & file creation · 39.0%
  • Research & web synthesis · 5.0%

Inner arcs break down each category's tool mix.

In the past 7 days, sessions averaged ~16.5 structured tool calls, and high-tool sessions were common enough to form their own cohort: more than 3,400 loop-filtered sessions ran very long tool chains in a single week. Those sessions were mostly real work — 53.2% coding or repo-debugging, 39.0% artifact/file-creation, with the rest spanning web synthesis, terminal workflows, and data analysis.

Session Context Length

Input context on the final turn (7-day window)

Finally, about 32% of recent sessions ended with at least 128k input tokens in the final turn, 22% with at least 256k, and 8% with at least 1M.

What People Build

In a sample of the heaviest real sessions we saw: a live sports-TV schedule site, an autonomous-underwater-vehicle autopilot, a self-hosted movie-watchlist app, a financial-research RAG pipeline, a live study-tracking platform, and more. Many end with the user downloading the finished workspace.

Real Agent Mode Usage Examples

A sample of high-effort Agent Mode sessions (7-day window)

Live sports-TV schedule site
⤓ workspace downloaded
Web app / data aggregation · Italian
Built a web app that aggregates the day's sports broadcasts across several Italian TV and streaming guides, merging duplicate events across sources, plus a password-protected admin page to monitor and repair broken data feeds.
Deliverable: A deployable web app with a per-source health dashboard and uptime alerts; workspace downloaded.
Claude Opus 4.7 (Thinking) · 140 turns · 448 tool calls

Self-hosted movie watchlist
Full-stack / DevOps · English
Took a personal movie-tracking idea from a written product spec and a high-fidelity HTML mockup through to a Dockerized, self-hosted web app that imports a year of films from free movie databases, filters by region and language, and exports curated watchlists.
Deliverable: A product spec, interactive mockup, implementation plan, and a running containerized build.
GPT 5.4 (High) · 60 turns · 522 tool calls

Underwater-vehicle autopilot
Robotics / control systems · Russian
Debugged and re-architected the control system for an autonomous underwater vehicle in a ROS/Gazebo simulation, fixing rudder and ballast physics, PID depth and pitch control, and adding selectable autopilot modes for different motor and control-surface configurations.
Deliverable: A reworked physics model and a modular autopilot with depth-aware maneuvering.
Anonymous model · 162 turns · 494 tool calls

Blender add-on for architects
⤓ workspace downloaded
CAD / creative tooling · English
Designed and began building a Blender add-on that brings a SketchUp-like architectural sketching workflow (predictive snapping, guide and tape-measure tools, and a premium UX) into Blender, working from an existing project codex and schedule.
Deliverable: A product roadmap plus an incremental implementation of the snapping and guide tools; workspace downloaded.
GPT 5.4 (High) · 82 turns · 546 tool calls

Financial research RAG pipeline
AI infrastructure / RAG · Persian
Architected a retrieval-augmented "financial brain" that ingests, cleans, chunks, and embeds finance articles and data feeds for downstream reasoning, then layered on observability, evaluation, and a controlled pilot-execution kit.
Deliverable: A layered ingestion → memory → retrieval → evaluation pipeline with a full architecture flowchart and a passing 160-test suite.
GPT 5.5 · 84 turns · 676 tool calls

Live study-tracking platform
Edtech web platform · Bengali
Researched leading study and productivity platforms, then extended an edtech web app with a live study system: real-time session tracking, study leaderboards, badges, and an admin dashboard to spot inactive students.
Deliverable: New live-study pages, leaderboards, and admin-intervention views added to the existing system.
GPT 5.5 (High) · 74 turns · 411 tool calls

RTMP streaming server
⤓ workspace downloaded
Media infrastructure · English
Built a self-hostable RTMP server for streaming from OBS, with a browser dashboard, built-in HTTP-FLV playback, a start/stop toggle, dark mode, and a settings panel, fixing dashboard config, LAN behavior, and port-conflict handling along the way.
Deliverable: A working RTMP server with a dashboard UI, automated self-tests, and a Windows setup guide; workspace downloaded.
GPT 5.4 (High) · 130 turns · 417 tool calls

Kids' screen-time tracker
⤓ workspace downloaded
Consumer web app · English
Built and iterated on a React app for tracking a child's weekly behavior and screen time, adding admin-only roles with approval workflows, dark mode, and emailed PDF reports with colorful per-week charts.
Deliverable: A working app with a clean toolbar UI, role permissions, and PDF report export; workspace downloaded repeatedly.
Claude Opus 4.6 · 80 turns · 440 tool calls

Minecraft server in Go
⤓ workspace downloaded
Systems programming · English
Generated a Go implementation of the Minecraft network protocol from a spec, then fixed a long chain of compile errors and re-architected the networking engine from a worker pool to idiomatic goroutine-per-connection, wiring up block-update packet relays.
Deliverable: A cross-compiled server build with corrected concurrency and multiplayer block sync; workspace downloaded.
Claude Opus 4.7 · 59 turns · 438 tool calls

Over the same week, users downloaded workspaces more than 50,000 times. The downloads went well beyond code and included office and media artifacts (.docx, .pptx, .xlsx, .pdf, and images).

How People Work With Agents

Beyond which model wins, the trace stream reveals how people actually delegate to agents — and how agents handle being corrected.

Delegation & Control

How much users hand over — and how they steer once the work is underway.

Delegation Posture

How much users hand over in their opening message.

  • Asked for advice · 28%
  • Directed step-by-step · 1%
  • Gave a scoped task · 11%
  • Handed off a deliverable · 45%
  • Let it run autonomously · 14%

Reining In

After the first reply, users pull control back ~2.3× as often as they hand over more.

  • Took back control · 50%
  • Handed over more · 22%

Most opening messages hand over a whole job rather than ask for advice: the delegation posture skews heavily toward "build this deliverable" (45%) and "operate autonomously" (14%). After seeing the first response, however, users tighten the reins, pulling control back about 2.3× as often as they hand over more.

Bluster & Bluffing

Two ways a capable-sounding agent still underdelivers.

Bluster

A corrected agent sounds firm but almost never holds its ground.

  • Sounds assertive or firm · 26%
  • Declines to make the change · 2.7%
  • Argues the user is wrong · 1.4%

Bluffing

On multi-part asks, how fully it covers every part.

  • Every part covered · 58%
  • A part left incomplete · 34%
  • A part silently dropped · 8%

We also find that when the opening ask bundles several explicit parts, agents usually cover all of them; the typical shortfall is leaving one incomplete. A rarer but more consequential shortfall is covert: the agent could have surfaced the incomplete work, but instead presents the result as complete. We call this "Bluffing".

Finally, agents do sometimes push back against users — but we find they usually only sound firm, rarely holding their ground in practice. We call this "Bluster": an artificial assertiveness that melts under additional pressure.

Formal Details of Methodology

In this section we describe the formal details of our evaluation framework.

Consider a $K$-component agent and sessions indexed by $i \in [n]$. Each session independently samples a $K$-dimensional vector representing an agent configuration, $T_i$. The configuration includes at minimum the orchestrator model, and as Agent Arena expands will encompass additional components such as the tools, system prompt, and harness. The configuration is drawn from a sampling distribution $P$. We denote $p_{i,k}(t) = \mathbb{P}_{T_i \sim P}(T_{i,k} = t)$; components are sampled independently. Each session yields an outcome $Y_i \in \mathbb{R}$, representing one of our signals from the previous sections. (We compute the per-signal leaderboards separately, then average them at the end.)

Our target of estimation is the treatment effect of each component selection with respect to a fixed baseline distribution $Q$, with analogous probabilities $q_{i,k}(t) = \mathbb{P}_{T_i \sim Q}(T_{i,k} = t)$. Typically, we take $Q$ to be a uniform distribution over components. Formally, the treatment effect of the $t$-th choice of the $k$-th component is defined as the expected difference in outcomes $Y_i$ under treatment and control:

$$\tau_{k \to t} = \mathbb{E}_{T_i \sim Q} \bigl[Y_i(T_{i,k} = t) - Y_i\bigr].$$

Here, $Y_i(T_{i,k} = t)$ denotes the "potential outcome" when we intervene on the $k$-th component and set it to $t$.

Given that we sample the components independently, using standard identification results from causal inference we can rewrite the treatment effect as:

$$\tau_{k \to t} = \mathbb{E}_{T_i \sim Q} \bigl[Y_i \,\big|\, T_{i,k} = t\bigr] - \mathbb{E}_{T_i \sim Q}\bigl[Y_i\bigr].$$

We estimate this quantity using the self-normalized estimator:

$$\hat\tau_{k \to t} = \frac{\sum_{i:\, T_{i,k} = t} w_i Y_i}{\sum_{i:\, T_{i,k} = t} w_i} - \frac{\sum_i w_i Y_i}{\sum_i w_i},$$

where

$$w_i = \prod_{k=1}^K \frac{q_{i,k}(T_{i,k})}{p_{i,k}(T_{i,k})}.$$
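
The estimator and weights above can be sketched in plain Python as follows. This is a minimal illustration, not the production code: the `Session` layout (configuration, per-session sampling probabilities, outcome) and the function names are assumptions made for the example.

```python
from math import prod
from typing import Dict, Hashable, List, Sequence, Tuple

Probs = List[Dict[Hashable, float]]  # one {choice: probability} dict per component
Session = Tuple[Sequence[Hashable], Probs, float]  # (T_i, p_i, Y_i)


def weight(config: Sequence[Hashable], p: Probs, q: Probs) -> float:
    """Importance weight: product over components of q(T_ik) / p_i(T_ik)."""
    return prod(q[k][t] / p[k][t] for k, t in enumerate(config))


def tau_hat(sessions: List[Session], q: Probs, k: int, t: Hashable) -> float:
    """Self-normalized estimate of the effect of setting component k to t."""
    num_t = den_t = num_all = den_all = 0.0
    for config, p, y in sessions:
        w = weight(config, p, q)
        num_all += w * y
        den_all += w
        if config[k] == t:  # sessions that actually received choice t
            num_t += w * y
            den_t += w
    if den_t == 0.0:
        raise ValueError("no session used this choice")
    # Weighted mean under treatment minus weighted mean under the baseline Q.
    return num_t / den_t - num_all / den_all
```

Because both terms are normalized by their own weight sums, the estimate is unchanged if all weights are rescaled by a constant.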

$\hat\tau_{k \to t}$ is asymptotically normal under standard CLT conditions for self-normalized estimators. We report 95% confidence intervals $\hat\tau_{k \to t} \pm 1.96\,\widehat{\mathrm{SE}}$ alongside every estimate.
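
The post does not say how $\widehat{\mathrm{SE}}$ is computed. One common option, shown here purely as an assumption, is a nonparametric bootstrap over sessions. The sketch below uses a simplified single-component layout in which each session is a hypothetical `(choice, weight, outcome)` triple with the weight already computed.

```python
import random
from typing import List, Tuple

Session = Tuple[str, float, float]  # (choice t, weight w_i, outcome Y_i), K = 1


def tau_hat(sessions: List[Session], t: str) -> float:
    """Self-normalized estimate for choice t from precomputed weights."""
    num_t = sum(w * y for c, w, y in sessions if c == t)
    den_t = sum(w for c, w, _ in sessions if c == t)
    num = sum(w * y for _, w, y in sessions)
    den = sum(w for _, w, _ in sessions)
    return num_t / den_t - num / den


def bootstrap_ci(sessions: List[Session], t: str,
                 n_boot: int = 500, seed: int = 0) -> Tuple[float, float]:
    """Return tau_hat +/- 1.96 * SE, with SE estimated by resampling sessions."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_boot):
        sample = rng.choices(sessions, k=len(sessions))  # resample with replacement
        if any(c == t for c, _, _ in sample):  # skip resamples with no session on t
            draws.append(tau_hat(sample, t))
    if len(draws) < 2:
        raise ValueError("too few usable bootstrap draws")
    mean = sum(draws) / len(draws)
    se = (sum((d - mean) ** 2 for d in draws) / (len(draws) - 1)) ** 0.5
    est = tau_hat(sessions, t)
    return est - 1.96 * se, est + 1.96 * se
```

An analytic standard error based on the estimator's asymptotic variance would serve the same purpose; the bootstrap is used here only because it needs no extra derivation.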

To address distribution shift, such as the shift arising from new models entering the Arena, we use additional time-decaying weights to place more emphasis on the most recent data points. That way, the leaderboard always reflects the current strengths and weaknesses of agents.
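
The form of the time-decaying weights is not specified here. As one plausible sketch, assuming exponential decay with a hypothetical 7-day half-life, each session's importance weight could be multiplied by a factor that halves every `half_life_days`:

```python
from typing import List, Tuple


def time_decay(age_days: float, half_life_days: float = 7.0) -> float:
    """Decay factor in (0, 1]; the half-life value is an assumption."""
    return 0.5 ** (age_days / half_life_days)


def decayed_weighted_mean(records: List[Tuple[float, float, float]],
                          half_life_days: float = 7.0) -> float:
    """Weighted mean of outcomes with weights w_i * decay(age_i).

    Each record is (w_i, age_days, y_i).
    """
    num = den = 0.0
    for w, age, y in records:
        wd = w * time_decay(age, half_life_days)  # combined weight
        num += wd * y
        den += wd
    return num / den
```

With this choice a session from one half-life ago counts half as much as a session from today, so recent behavior dominates the estimate.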

The current leaderboards evaluate orchestrators and no other components, so in the production setting we currently have $K = 1$.
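
As a worked step for this $K = 1$ case, write $T_i$ for $T_{i,1}$, and $p_i$ and $q$ for the single component's sampling and baseline probabilities (shorthand introduced only for this illustration). Each weight is then $w_i = q(T_i)/p_i(T_i)$. Among the sessions with $T_i = t$, the factor $q(t)$ is constant and cancels from the first term, giving:

$$\hat\tau_{1 \to t} = \frac{\sum_{i:\, T_i = t} Y_i / p_i(t)}{\sum_{i:\, T_i = t} 1 / p_i(t)} - \frac{\sum_i w_i Y_i}{\sum_i w_i}, \qquad w_i = \frac{q(T_i)}{p_i(T_i)}.$$

If the sampling probability of $t$ is the same in every session, the first term reduces to the plain average of $Y_i$ over the sessions that used orchestrator $t$.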

Citation

@misc{arena2026agentarena,
  title        = {{Agent Arena}: Causal Evaluation of Agents in the Real World},
  author       = {{Arena Team}},
  year         = {2026},
  month        = jun,
  howpublished = {\url{https://arena.ai/blog/agent-arena-methodology}},
  note         = {Arena Blog}
}

Appendix: Per-Signal Leaderboards

Confirmed Success

Praise vs Complaint

Steerability

Bash Recovery

Tool Hallucination