← 返回文章归档

近一年 GUIAgent 论文综述:从会点屏幕到可验证的移动端 QA Agent

系统梳理 2025-06-15 至 2026-06-15 公开可检索的 GUIAgent / computer-use agent 论文,从移动端 APP 自动化测试、benchmark、grounding、RL、自进化、verifier、混合工具与安全治理七条主线总结领域变化。

#GUI Agent#Computer Use#Survey#Mobile QA#APP Automation

摘要

过去一年,GUIAgent / computer-use agent 的研究重心发生了明显迁移:领域不再满足于“模型能不能在截图上点中按钮”,而是开始追问 agent 是否能在真实应用、真实设备、真实工作流中稳定完成任务,并且能否被验证、被审计、被纠错、被安全约束。

本文覆盖 2025-06-15 至 2026-06-15 的公开可检索论文与 GUI Agents Paper List 近一年相关条目。基于标题、摘要、项目页和已有论文解读,论文池中共筛出约 250 篇 GUIAgent / computer-use agent 相关工作。它们大致汇聚成七条主线:

  1. 移动端 / APP 自动化测试与 Mobile QA:AndroidDaily、GUITester、GUITestScape、WebTestBench、WebTestPilot 等把 GUIAgent 从“完成用户任务”推向“生成、执行和验证测试”。
  2. Benchmark、环境与可验证评测:WindowsWorld、MacArena、DeskCraft、LivingScreen、MobileGym、CUA-Gym、OpenComputer 等把评测从静态截图推向在线、长程、跨应用、动态屏幕和可验证环境。
  3. GUI grounding 与屏幕解析:GUI-Actor、GUI-G2、AutoFocus、DRS-GUI、ScreenParse、UI-Zoomer、WinDeskGround 等继续补齐“看得准、点得稳、能处理高分辨率和复杂 UI”的底层能力。
  4. 训练数据、SFT / RL 与自进化:UI-Voyager、SE-GA、Video2GUI、GUI-CIDER、UI-TARS-2、MobileRL、PRO-CUA 等说明 GUIAgent 已进入数据飞轮和过程强化学习阶段。
  5. 长程记忆、过程奖励、Verifier 与 Critic:VeriGUI、HiViG、StainFlow、GUI-Shepherd、VAGEN、OS-Themis 等把“每一步是否有效”变成核心研究对象。
  6. Hybrid action、RPA、MCP 与工具融合:OSWorld-MCP、ToolCUA、CLI-Anything、AutoRPA、AppAgent-Claw、SkillDroid 等表明纯视觉点击不是终局,GUI + API + CLI + test framework 的混合控制更接近生产系统。
  7. 安全、隐私、权限与对抗鲁棒性:MIRAGE、AgentRAE、CORA、GUIGuard、AgentHijack、WebSentinel、CaMeLs 等把 GUIAgent 的攻击面、权限边界和审计问题推到前台。

从 APP 自动化测试视角看,近一年最重要的变化可以概括为一句话:GUIAgent 正在从“操作执行器”转向“可验证的测试执行与缺陷发现系统”。 这意味着未来移动端 QA 平台不能只把大模型接到 Appium 或截图点击器上,而要建设完整闭环:任务生成、环境 reset、动作执行、等待策略、oracle 推断、日志/网络/业务状态验证、失败归因、可回放报告,以及对高风险动作的安全治理。

1. 范围与领域地图

本文讨论的 GUIAgent 包括几类相邻系统:

  • 以截图、控件树、视频或多模态上下文为输入的 GUI 操作 agent;
  • 面向 Android / iOS / Web / Desktop / OS 的 computer-use agent;
  • GUI grounding、screen parsing、element localization 等底层感知模型;
  • 面向 GUI 任务的 SFT、RL、过程奖励、critic、verifier、memory 与 self-evolution 方法;
  • 面向 APP / Web / SaaS / OS 的自动化测试、RPA、benchmark、可验证环境与安全评估。

不把“GUIAgent”限定为单一平台是必要的。移动端 APP 自动化测试确实是本文的工程落点,但 Mobile、Web、Desktop、OS benchmark 正在共享同一组核心问题:

层次关键问题对 APP 自动化测试的意义
Observation截图、控件树、视频、日志、网络请求、设备状态如何组合决定测试 agent 能看到什么,能否处理 WebView、权限弹窗、动态内容
Grounding如何稳定定位按钮、文本、图标、列表项、拖拽目标误点会被误判为 App 缺陷,必须区分 agent error 与 product bug
Planning如何把自然语言目标拆成可执行子目标从“点一下登录”扩展到登录、授权、下单、支付前校验等业务流
Execution如何处理等待、重试、前后台、弹窗、弱网、滑动决定 E2E 测试是否 flaky
Verification如何判断每一步和最终结果是否正确从最终截图变成 UI、业务、日志、网络、DB/mock 状态的多层 oracle
Learning如何从失败轨迹、人工用例、录屏和历史测试中学习决定测试平台能否持续进化,而不是每次靠 prompt 调参
Safety如何限制支付、删除、发布、隐私访问等高风险动作决定 agent 能否进入真实预发、灰度或生产影子环境

2. 第一条主线:移动端 / APP 自动化测试从“脚本执行”走向“缺陷发现”

移动端相关论文是近一年增长最快、也最贴近工程落地的一支。早期 Mobile GUI Agent 更多关注“能否完成用户指令”,例如打开 App、搜索内容、发送消息;近一年则出现了明显的 QA 化趋势:任务不再只是用户目标,而是测试目标;成功不再只是走到某个页面,而是发现缺陷、验证业务状态、生成可复现报告。

2.1 代表论文

时间论文
2026-05-26AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source Applications
2026-05-25MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research
2026-04-30WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments
2026-04-23VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
2026-04-14See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback
2026-04-10CORA: Conformal Risk-Controlled Agents for Safeguarded Mobile GUI Automation
2026-04-09KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation
2026-04-08Android Coach: Improve Online Agentic Training Efficiency with Single State Multiple Actions
2026-04-07Don”t Act Blindly: Robust GUI Automation via Action-Effect Verification and Self-Correction
2026-04-02GPA: Learning GUI Process Automation from Demonstrations
2026-03-31Terminal Agents Suffice for Enterprise Automation
2026-03-31PSPA-Bench: A Personalized Benchmark for Smartphone GUI Agent
2026-03-26WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing
2026-03-24AgentRAE: Remote Action Execution through Notification-based Visual Backdoors against Screenshots-based Mobile GUI
2026-03-16GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents
2026-03-10SpecOps: A Fully Automated AI Agent Testing Framework in Real-World GUI Environments
2026-03-09SecAgent: Efficient Mobile GUI Agent with Semantic Context
2026-03-09AgentOS: From Application Silos to a Natural Language-Driven Data Ecosystem
2026-03-08Generalization in Online Reinforcement Learning for Mobile Agents
2026-02-28MobiFlow: Real-World Mobile Agent Benchmarking through Trajectory Fusion
2026-02-24Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization
2026-02-15Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents
2026-02-12AmbiBench: Benchmarking Mobile GUI Agents Beyond One-Shot Instructions in the Wild
2026-02-11Blind Gods and Broken Screens: Architecting a Secure, Intent-Centric Mobile Agent Operating System
2026-02-10TreeCUA: Efficiently Scaling GUI Automation with Tree-Structured Verifiable Evolution
2026-02-07Mapping the Design Space of User Experience for Computer Use Agents
2026-02-06VenusBench-Mobile: A Challenging and User-Centric Benchmark for Mobile GUI Agents with Capability Diagnostics
2026-02-05UI-Mem: Self-Evolving Experience Memory for Online Reinforcement Learning in Mobile GUI Agents
2026-02-05M$^2$-Miner: Multi-Agent Enhanced MCTS for Mobile GUI Agent Data Mining
2026-02-03MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments
2026-01-30Learning with Challenges: Adaptive Difficulty-Aware Data Generation for Mobile GUI Agent Training
2026-01-28MobileBench-OL: A Comprehensive Chinese Benchmark for Evaluating Mobile GUI Agents in Real-World Environment
2026-01-26SMAN-Bench: A Cross-System Benchmark for Mobile Agents under Single- and Multi-path, Ambiguous, and Noisy Tasks
2026-01-26LongHorizonUI: A Unified Framework for Robust long-horizon Task Automation of GUI Agent
2026-01-24GraphPilot: GUI Task Automation with One-Step LLM Reasoning Powered by Knowledge Graph
2026-01-08GUITester: Enabling GUI Agents for Exploratory Defect Discovery
2026-01-07MobileDreamer: Generative Sketch World Model for GUI Agent
2025-12-24AndroidLens: Long-latency Evaluation with Nested Sub-targets for Android GUI Agents
2025-12-22MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments
2025-12-18OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models
2025-12-16MobileWorldBench: Towards Semantic World Modeling For Mobile Agents
2025-12-14Modular and Multi-Path-Aware Offline Benchmarking for Mobile GUI Agents
2025-12-12Using GUI Agent for Electronic Design Automation
2025-12-10GAIR: GUI Automation via Information-Joint Reasoning and Group Reflection
2025-11-27Training High-Level Schedulers with Execution-Feedback Reinforcement Learning for Long-Horizon GUI Automation

2.2 从 Mobile Agent 到 Mobile QA Agent

AndroidDaily、MobileGym、SimuWoB、GUI-CEval、MobileBench-OL、VenusBench-Mobile 代表了移动端 benchmark 的一个共识:真实移动 App 不是静态网页,也不是干净模拟器。它包含登录态、权限、推送、系统弹窗、厂商 ROM、推荐流、弱网、WebView、第三方 SDK、前后台切换和多设备同步。

这对 APP 自动化测试有三个直接结论。

第一,测试环境本身必须成为平台能力。 MobileGym 和 SimuWoB 的价值不只是“又建了一个 benchmark”,而是把可并行、可 reset、可验证的移动环境当作训练和评测基础设施。传统 Appium / UIAutomator / XCUITest 往往解决动作执行,但不完整解决环境状态、账号状态、服务端 mock、设备扰动和结果 oracle。

第二,探索式测试正在从随机遍历转向语义探索。 GUITester、GUITestScape、Scenario-Guided LLM-based Mobile App GUI Testing 等工作把 LLM/GUIAgent 引入缺陷发现:agent 不只是覆盖更多页面,还要理解业务意图、异常路径和潜在缺陷类型。对 QA 团队来说,这意味着“测试用例生成”会逐步变成“测试场景规划 + GUI 执行 + 缺陷证据收集”。

第三,oracle 推断成为核心瓶颈。 WebTestPilot、From Exploration to Specification、VAGEN、GUI-Shepherd 等工作虽然横跨 Web 和 Mobile,但都指向同一个问题:agent 如何知道 App 行为是错的?最终截图很少足够。移动端 oracle 必须融合 UI 状态、接口返回、埋点、日志、crash/ANR、业务数据和历史基线。

3. 第二条主线:Benchmark 从静态截图转向可验证、动态、长程环境

过去的 GUI benchmark 很容易被简化成“看图点点点”。近一年最有价值的 benchmark 工作,普遍在挑战这个简化假设:任务会跨应用,屏幕会动态变化,成功需要过程证据,环境要可 reset,agent 还要知道何时等待、何时停止、何时承认失败。

3.1 代表论文

时间论文
2026-06-03Benchmarking Living-Screen-Native GUI Agents on Short-Video Platforms
2026-05-26AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source Applications
2026-05-25MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research
2026-04-30WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments
2026-04-27Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks
2026-04-27AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark
2026-04-13WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark
2026-04-13ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents
2026-04-10HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks
2026-04-10EE-MCP: Self-Evolving MCP-GUI Agents via Automated Environment Generation and Experience Learning
2026-04-09KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation
2026-04-07WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks
2026-04-06IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents
2026-04-06GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis
2026-03-31PSPA-Bench: A Personalized Benchmark for Smartphone GUI Agent
2026-03-27GUIDE: Resolving Domain Bias in GUI Agents through Real-Time Web Video Retrieval and Plug-and-Play Annotation
2026-03-26WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing
2026-03-26GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks
2026-03-23Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos
2026-03-18WebPII: Benchmarking Visual PII Detection for Computer-Use Agents
2026-03-16GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents
2026-03-11CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents
2026-03-10SpecOps: A Fully Automated AI Agent Testing Framework in Real-World GUI Environments
2026-03-09PIRA-Bench: A Transition from Reactive GUI Agents to GUI-based Proactive Intent Recommendation Agents
2026-03-09OSExpert: Computer-Use Agents Learning Professional Skills via Exploration
2026-03-05TimeWarp: Evaluating Web Agents by Revisiting the Past
2026-03-01WebArena-Infinity: Generating Browser Environments with Verifiable Tasks at Scale
2026-02-28MobiFlow: Real-World Mobile Agent Benchmarking through Trajectory Fusion
2026-02-28M^2: Dual-Memory Augmentation for Long-Horizon Web Agents via Trajectory Summarization and Insight Retrieval
2026-02-25OpeFlo: Automated UX Evaluation via Simulated Human Web Interaction with GUI Grounding
2026-02-25GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL
2026-02-24Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization
2026-02-19Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History
2026-02-17World-Model-Augmented Web Agents with Action Correction
2026-02-16WebWorld: A Large-Scale World Model for Web Agent Training
2026-02-15GUI-GENESIS: Automated Synthesis of Efficient Environments with Verifiable Rewards for GUI Agent Post-Training
2026-02-13Scaling Web Agent Training through Automatic Data Generation and Fine-grained Evaluation
2026-02-12AmbiBench: Benchmarking Mobile GUI Agents Beyond One-Shot Instructions in the Wild
2026-02-11UI-Oceanus: Scaling GUI Agents with Synthetic Environmental Dynamics
2026-02-11See, Plan, Snap: Evaluating Multimodal GUI Agents in Scratch
2026-02-10TreeCUA: Efficiently Scaling GUI Automation with Tree-Structured Verifiable Evolution
2026-02-10Code2World: A GUI World Model via Renderable Code Generation
2026-02-10Autonomous Continual Learning of Computer-Use Agents for Environment Adaptation
2026-02-06VenusBench-Mobile: A Challenging and User-Centric Benchmark for Mobile GUI Agents with Capability Diagnostics
2026-02-05PATHWAYS: Evaluating Investigation and Context Discovery in AI Web Agents
2026-02-03MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments
2026-02-03LPS-Bench: Benchmarking Safety Awareness of Computer-Use Agents in Long-Horizon Planning under Benign and Adversarial
2026-02-03Agent Alpha: Tree Search Unifying Generation, Exploration and Evaluation for Computer-Use Agents
2026-01-29How do Visual Attributes Influence Web Agents? A Comprehensive Evaluation of User Interface Design Factors
2026-01-28OS-Marathon: Benchmarking Computer-Use Agents on Long-Horizon Repetitive Tasks

3.2 WindowsWorld、MacArena、DeskCraft、LivingScreen 的共同指向

WindowsWorld 把桌面任务放进跨应用、带中间 checkpoint 的专业流程中;MacArena 强调真实 macOS、第三方应用和执行式验证;DeskCraft 把 human-in-the-loop 和 professional workflow 纳入评测;LivingScreen 则直接挑战“屏幕在两次动作之间静止”的隐含假设。

这些工作看似偏桌面,但对 APP 自动化测试的启发非常强:

  • 过程检查比终态成功更重要。 一条下单链路失败,必须知道是登录、搜索、领券、加购、支付前校验还是回流出错。
  • 动态 UI 是一等对象。 短视频、直播、IM、地图、外卖、打车、行情、推荐流都不是静止页面。测试 agent 需要 watch / wait / observe / sample 的策略,而不只是 click。
  • 不可行任务识别很关键。 当账号无权限、库存不足、网络断开或服务端 mock 不满足条件时,agent 应该报告不可执行,而不是继续乱点。
  • 环境 reset 和可验证性决定 benchmark 质量。 没有可控账号态、设备态、服务端态,移动端 benchmark 很容易不可复现。

4. 第三条主线:GUI grounding 仍是底座,但不再是全部

GUI grounding 仍然是 GUIAgent 的底层能力。真实 App 中,按钮很小、文字密集、列表可滚动、图标语义模糊、WebView 与 Native 混合、高 DPI 和不同分辨率会造成定位漂移。近一年 grounding 工作大多围绕 coordinate-free、区域搜索、zoom-in、完整 screen parsing、测试时增强和鲁棒性展开。

4.1 代表论文

时间论文
2026-06-03Benchmarking Living-Screen-Native GUI Agents on Short-Video Platforms
2026-05-29GUI-C²: Coarse-to-Fine GUI Grounding via Difficulty-Aware Reinforcement Learning
2026-05-01A11y-Compressor: A Framework for Enhancing the Efficiency of GUI Agent Observations through Visual Context Reconstruction
2026-04-15UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding
2026-04-15GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models
2026-04-14See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback
2026-04-09MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
2026-04-09Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection
2026-04-08What’s Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning
2026-03-27Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding
2026-03-27Rethinking Token Pruning for Historical Screenshots in GUI Visual Agents: Semantic, Spatial, and Temporal Perspectives
2026-03-24AgentRAE: Remote Action Execution through Notification-based Visual Backdoors against Screenshots-based Mobile GUI
2026-03-23Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos
2026-03-18WebPII: Benchmarking Visual PII Detection for Computer-Use Agents
2026-03-18AdaZoom-GUI: Adaptive Zoom-based GUI Grounding with Instruction Refinement
2026-03-15Zoom to Essence: Trainless GUI Grounding by Inferring upon Interface Elements
2026-03-05WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents
2026-02-25OpeFlo: Automated UX Evaluation via Simulated Human Web Interaction with GUI Grounding
2026-02-24Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization
2026-02-15Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision
2026-02-11Blind Gods and Broken Screens: Architecting a Secure, Intent-Centric Mobile Agent Operating System
2026-02-06Trifuse: Enhancing Attention-Based GUI Grounding via Multimodal Fusion
2026-02-06POINTS-GUI-G: GUI-Grounding Journey
2026-02-06ANCHOR: Branch-Point Data Generation for GUI Agents
2026-02-02Avenir-Web: Human-Experience-Imitating Multimodal Web Agents with Mixture of Grounding Experts
2026-01-29How do Visual Attributes Influence Web Agents? A Comprehensive Evaluation of User Interface Design Factors
2026-01-14GUI-Eyes: Tool-Augmented Perception for Visual Grounding in GUI Agents
2026-01-14Compress to Focus: Efficient Coordinate Compression for Policy Optimization in Multi-Turn GUI Agents
2026-01-11V2P: Visual Attention Calibration for GUI Grounding via Background Suppression and Center Peaking
2026-01-05WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks
2025-12-18VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks
2025-12-09MVP: Multiple View Prediction Improves GUI Grounding
2025-12-05Zoom in, Click out: Unlocking and Evaluating the Potential of Zooming for GUI Grounding
2025-12-02GUI Exploration Lab: Enhancing Screen Navigation in Agents via Multi-Turn Reinforcement Learning
2025-11-07Beyond Clicking: A Step Towards Generalist GUI Grounding via Text Dragging
2025-10-05GUI-Spotlight: Adaptive Iterative Focus Refinement for Enhanced GUI Visual Grounding
2025-08-17You Don’t Know Until You Click: Automated GUI Testing for Production-Ready Software Evaluation
2025-08-07Test‑Time Reinforcement Learning for GUI Grounding via Region Consistency
2025-08-06GuirlVG: Incentivize GUI Visual Grounding via Empirical Exploration on Reinforcement Learning
2025-07-29UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding

4.2 对 QA 的关键区别:定位错误不是产品缺陷

APP 自动化测试引入 GUIAgent 后,会出现一个传统自动化框架较少面对的问题:当测试失败时,失败到底来自 App,还是来自 agent?

如果 agent 点错按钮、误读控件、滚动过头、没有识别 toast,测试报告不能直接归因于产品缺陷。GUI-Perturbed、UI-Zoomer、AutoFocus、DRS-GUI、ScreenParse、WinDeskGround 等工作提示了几个工程原则:

  1. grounding 结果需要置信度和备选区域,而不是单一坐标;
  2. 高风险动作前应使用二次确认,例如截图标注、控件树匹配、动作效果验证;
  3. 测试报告应记录 grounding evidence:目标描述、候选元素、最终坐标、点击前后截图、控件树 diff;
  4. 对动态列表、瀑布流、弹窗和 WebView,应把“查找元素”建模为搜索过程,而不是一次性定位。

5. 第四条主线:训练范式从 SFT 走向数据飞轮、RL 和自进化

GUIAgent 的训练正在从“收集人工轨迹做 SFT”走向更复杂的数据飞轮:视频和录屏生成轨迹,失败轨迹生成修正样本,环境提供可验证 reward,RL 和 RFT 优化长程任务,memory 系统沉淀经验。

5.1 代表论文

时间论文
2026-06-03Benchmarking Living-Screen-Native GUI Agents on Short-Video Platforms
2026-05-29GUI-C²: Coarse-to-Fine GUI Grounding via Difficulty-Aware Reinforcement Learning
2026-04-28Training Computer Use Agents to Assess the Usability of Graphical User Interfaces
2026-04-13ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents
2026-04-10EE-MCP: Self-Evolving MCP-GUI Agents via Automated Environment Generation and Experience Learning
2026-04-09MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
2026-04-08Android Coach: Improve Online Agentic Training Efficiency with Single State Multiple Actions
2026-04-02GPA: Learning GUI Process Automation from Demonstrations
2026-03-27GUIDE: Resolving Domain Bias in GUI Agents through Real-Time Web Video Retrieval and Plug-and-Play Annotation
2026-03-25UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience
2026-03-25CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents
2026-03-23Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos
2026-03-23CAPTCHA Solving for Native GUI Agents: Automated Reasoning-Action Data Generation and Self-Corrective Training
2026-03-19OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards
2026-03-12HATS: Hardness-Aware Trajectory Synthesis for GUI Agents
2026-03-11Hybrid Self-evolving Structured Memory for GUI Agents
2026-03-10Video-Based Reward Modeling for Computer-Use Agents
2026-03-09AgentOS: From Application Silos to a Natural Language-Driven Data Ecosystem
2026-03-08Generalization in Online Reinforcement Learning for Mobile Agents
2026-03-04Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks
2026-03-03CGL: Advancing Continual GUI Learning via Reinforcement Fine-Tuning
2026-02-28MobiFlow: Real-World Mobile Agent Benchmarking through Trajectory Fusion
2026-02-28M^2: Dual-Memory Augmentation for Long-Horizon Web Agents via Trajectory Summarization and Insight Retrieval
2026-02-25GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL
2026-02-16WebWorld: A Large-Scale World Model for Web Agent Training
2026-02-15GUI-GENESIS: Automated Synthesis of Efficient Environments with Verifiable Rewards for GUI Agent Post-Training
2026-02-13WebClipper: Efficient Evolution of Web Agents with Graph-based Trajectory Pruning
2026-02-13Scaling Web Agent Training through Automatic Data Generation and Fine-grained Evaluation
2026-02-12Adaptive Milestone Reward for GUI Agents
2026-02-10Autonomous Continual Learning of Computer-Use Agents for Environment Adaptation
2026-02-06ANCHOR: Branch-Point Data Generation for GUI Agents
2026-02-05UI-Mem: Self-Evolving Experience Memory for Online Reinforcement Learning in Mobile GUI Agents
2026-02-05M$^2$-Miner: Multi-Agent Enhanced MCTS for Mobile GUI Agent Data Mining
2026-01-31Agentic Reward Modeling: Verifying GUI Agent via Online Proactive Interaction
2026-01-30Learning with Challenges: Adaptive Difficulty-Aware Data Generation for Mobile GUI Agent Training
2026-01-30Darwinian Memory: A Training-Free Self-Regulating Memory System for GUI Agent Evolution
2026-01-29WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents
2026-01-29DynaWeb: Model-Based Reinforcement Learning of Web Agents
2026-01-28Continual GUI Agents
2026-01-26GAIA: A Data Flywheel System for Training GUI Test-Time Scaling Critic Models
2026-01-19MagicGUI-RMS: A Multi-Agent Reward Model System for Self-Evolving GUI Agents via Automated Feedback Reflux
2026-01-07InfiniteWeb: Scalable Web Environment Synthesis for GUI Agent Training
2026-01-05WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks
2025-12-02GUI Exploration Lab: Enhancing Screen Navigation in Agents via Multi-Turn Reinforcement Learning
2025-11-27Training High-Level Schedulers with Execution-Feedback Reinforcement Learning for Long-Horizon GUI Automation
2025-11-06GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents
2025-10-22WebGraphEval: Multi-Turn Trajectory Evaluation for Web Agents using Graph Representation
2025-10-22VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos
2025-10-17WebServ: A Browser-Server Environment for Efficient Training of Reinforcement Learning-based Web Agents at Scale
2025-09-28Efficient Multi-turn RL for GUI Agents via Decoupled Training and Adaptive Data Curation

5.2 失败轨迹成为训练资产

UI-Voyager 的核心不是“又训练了一个移动 agent”,而是把失败轨迹变成可学习对象;SE-GA、GUI-CIDER、Video2GUI、HATS、CUA-Suite、GUI-Libra、MobileRL 等工作也都在回答同一个问题:真实 GUI 任务成本高、失败多、路径长,如何把这些失败转化为更好的模型和 policy?

对 APP 自动化测试平台来说,这意味着历史测试资产不再只是用例库,还可以变成训练数据:

  • 手工测试录屏 → 轨迹抽取 → 动作序列和页面语义;
  • 自动化失败日志 → fork point 定位 → 修复动作或等待策略;
  • 缺陷复现步骤 → 可回放轨迹 → 回归测试 seed;
  • 多版本测试结果 → UI drift 数据 → grounding 和 verifier 训练样本;
  • flaky case → 环境扰动、等待策略、oracle 稳定性数据。

6. 第五条主线:Verifier、Critic、过程奖励正在成为 GUIAgent 的安全阀

长程 GUI 任务的难点不是每一步都完全不会,而是某一步稍微偏航后继续执行,最终产生错误结果甚至危险副作用。因此近一年大量工作开始研究 action-effect verification、process reward、history-aware critic、reward model、trace-level comparison。

6.1 代表论文

时间论文
2026-04-30WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments
2026-04-27Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks
2026-04-23VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
2026-04-12The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents
2026-04-09Same Outcomes, Different Journeys: A Trace-Level Framework for Comparing Human and GUI-Agent Behavior in Production
2026-04-07Don”t Act Blindly: Robust GUI Automation via Action-Effect Verification and Self-Correction
2026-04-02GPA: Learning GUI Process Automation from Demonstrations
2026-03-19OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards
2026-03-19AndroTMem: From Interaction Trajectories to Anchored Memory in Long-Horizon GUI Agents
2026-03-11Hybrid Self-evolving Structured Memory for GUI Agents
2026-03-07Enhancing Web Agents with a Hierarchical Memory Tree
2026-02-28M^2: Dual-Memory Augmentation for Long-Horizon Web Agents via Trajectory Summarization and Insight Retrieval
2026-02-24ActionEngine: From Reactive to Programmatic GUI Agents via State Machine Memory
2026-02-19Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History
2026-02-05UI-Mem: Self-Evolving Experience Memory for Online Reinforcement Learning in Mobile GUI Agents
2026-02-03MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments
2026-02-03LPS-Bench: Benchmarking Safety Awareness of Computer-Use Agents in Long-Horizon Planning under Benign and Adversarial
2026-01-30Darwinian Memory: A Training-Free Self-Regulating Memory System for GUI Agent Evolution
2026-01-29WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents
2026-01-28OS-Marathon: Benchmarking Computer-Use Agents on Long-Horizon Repetitive Tasks
2026-01-27MAGNET: Towards Adaptive GUI Agents with Memory-Driven Knowledge Evolution
2026-01-26LongHorizonUI: A Unified Framework for Robust long-horizon Task Automation of GUI Agent
2026-01-26GAIA: A Data Flywheel System for Training GUI Test-Time Scaling Critic Models
2026-01-14PersonalAlign: Hierarchical Implicit Intent Alignment for Personalized GUI Agent with Long-Term User-Centric Records
2026-01-12ColorBrowserAgent: Complex Long-Horizon Browser Agent with Adaptive Knowledge Evolution
2025-12-24AndroidLens: Long-latency Evaluation with Nested Sub-targets for Android GUI Agents
2025-12-22EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration
2025-12-18OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models
2025-12-11AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management
2025-12-01HiconAgent: History Context-aware Policy Optimization for GUI Agents
2025-11-27Training High-Level Schedulers with Execution-Feedback Reinforcement Learning for Long-Horizon GUI Automation
2025-10-03FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents
2025-07-29UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding

6.2 从“执行后验收”到“执行中验证”

VeriGUI 强调动作后要验证 expected effect;HiViG 把历史轨迹压缩后用于执行前 critique;StainFlow 把实体证据链引入过程奖励;OS-Themis、WebArbiter、GUI-Shepherd、VAGEN、Video-Based Reward Modeling 等则从不同角度构造“判断 agent 是否真的做对”的机制。

移动端 QA 的落地方式很清晰:

  • 点击后验证页面是否切换、控件是否出现、loading 是否消失;
  • 输入后验证文本、键盘、焦点和格式化结果;
  • 下单前验证价格、优惠、库存、地址、支付方式;
  • 发送消息后验证本端、对端、服务端和 push 状态;
  • 出现异常时判断是 App bug、网络问题、环境不满足,还是 agent 操作错误。

这会把测试 agent 从“脚本执行器”变成“带审计能力的执行系统”。

7. 第六条主线:Hybrid action 是生产化方向,纯视觉点击不是终局

GUIAgent 研究早期经常强调 screenshot-only,因为它通用、端到端、看起来接近人类。但真实自动化系统不会只靠眼睛和鼠标。能用 API、deeplink、mock、ADB、Appium、UIAutomator、XCUITest、Maestro、日志接口和数据库验证的地方,通常更稳定、更可审计、更安全。

7.1 代表论文

时间论文
2026-04-27Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks
2026-04-13WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark
2026-04-10EE-MCP: Self-Evolving MCP-GUI Agents via Automated Environment Generation and Experience Learning
2026-04-09MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
2026-04-07WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks
2026-04-03The Tool Illusion: Rethinking Tool Use in Web Agents
2026-03-31Terminal Agents Suffice for Enterprise Automation
2026-03-23Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos
2026-03-20ContractSkill: Repairable Contract-Based Skills for Multimodal Web Agents
2026-03-15Why Do LLM-based Web Agents Fail? A Hierarchical Planning Perspective
2026-03-13AI Planning Framework for LLM-Based Web Agents
2026-03-11Safe and Scalable Web Agent Learning via Recreated Websites
2026-03-11Hybrid Self-evolving Structured Memory for GUI Agents
2026-03-07Enhancing Web Agents with a Hierarchical Memory Tree
2026-03-05WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents
2026-03-05TimeWarp: Evaluating Web Agents by Revisiting the Past
2026-03-04Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks
2026-02-28M^2: Dual-Memory Augmentation for Long-Horizon Web Agents via Trajectory Summarization and Insight Retrieval
2026-02-19Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History
2026-02-19Modeling Distinct Human Interaction in Web Agents
2026-02-17World-Model-Augmented Web Agents with Action Correction
2026-02-16WebWorld: A Large-Scale World Model for Web Agent Training
2026-02-16EmbeWebAgent: Embedding Web Agents into Any Customized UI
2026-02-13WebClipper: Efficient Evolution of Web Agents with Graph-based Trajectory Pruning
2026-02-13Scaling Web Agent Training through Automatic Data Generation and Fine-grained Evaluation
2026-02-05PATHWAYS: Evaluating Investigation and Context Discovery in AI Web Agents
2026-02-03WebSentinel: Detecting and Localizing Prompt Injection Attacks for Web Agents
2026-02-02Avenir-Web: Human-Experience-Imitating Multimodal Web Agents with Mixture of Grounding Experts
2026-01-30ToolTok: Tool Tokenization for Efficient and Generalizable GUI Agents
2026-01-29WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents
2026-01-29How do Visual Attributes Influence Web Agents? A Comprehensive Evaluation of User Interface Design Factors
2026-01-29DynaWeb: Model-Based Reinforcement Learning of Web Agents
2026-01-14GUI-Eyes: Tool-Augmented Perception for Visual Grounding in GUI Agents
2026-01-13WebTrap Park: An Automated Platform for Systematic Security Evaluation of Web Agents
2026-01-13ExpSeek: Self-Triggered Experience Seeking for Web Agents

7.2 对移动端测试平台的架构启发

OSWorld-MCP、ToolCUA、CLI-Anything、AutoRPA、SkillDroid、AppAgent-Claw、UltraCUA 等工作共同说明:GUIAgent 的未来不是单一 action space,而是 action routing。

一个面向 APP 自动化测试的合理架构应至少包含四类通道:

  1. GUI 通道:截图、控件树、点击、滑动、输入、等待,用于真实用户路径和视觉验证;
  2. 测试框架通道:Appium、UIAutomator、XCUITest、Maestro,用于稳定元素定位、设备控制和断言;
  3. 业务 / 服务通道:mock API、测试账号、订单状态、消息状态、支付沙箱,用于构造和验证业务条件;
  4. 观测通道:client log、network trace、crash、ANR、埋点、录屏、性能指标,用于缺陷定位和报告生成。

GUIAgent 的价值不在于取代这些通道,而在于基于语义目标动态选择通道,并把执行过程转化为可解释、可回放、可维护的测试资产。

8. 第七条主线:安全、隐私和权限治理从边缘问题变成前置条件

当 agent 能操作真实 GUI 时,攻击面会显著扩大。移动 App 中的评论、广告、IM 消息、Push 通知、WebView、第三方页面都可能成为 prompt injection 或视觉后门载体。账号、相册、通讯录、定位、支付、发布、删除等动作也都需要权限治理。

8.1 代表论文

时间论文
2026-04-12The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents
2026-04-10CORA: Conformal Risk-Controlled Agents for Safeguarded Mobile GUI Automation
2026-04-09Preference Redirection via Attention Concentration: An Attack on Computer Use Agents
2026-04-07WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks
2026-03-24AgentRAE: Remote Action Execution through Notification-based Visual Backdoors against Screenshots-based Mobile GUI
2026-03-18WebPII: Benchmarking Visual PII Detection for Computer-Use Agents
2026-03-09SlowBA: An efficiency backdoor attack towards VLM-based GUI agents
2026-03-04Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks
2026-02-03WebSentinel: Detecting and Localizing Prompt Injection Attacks for Web Agents
2026-02-03LPS-Bench: Benchmarking Safety Awareness of Computer-Use Agents in Long-Horizon Planning under Benign and Adversarial
2026-01-26GUIGuard: Toward a General Framework for Privacy-Preserving GUI Agents
2026-01-19MirrorGuard: Toward Secure Computer-Use Agents via Simulation-to-Real Reasoning Correction
2026-01-14CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents
2026-01-13WebTrap Park: An Automated Platform for Systematic Security Evaluation of Web Agents
2025-12-08Privacy Practices of Browser Agents
2025-10-21Genesis: Evolving Attack Strategies for LLM Web Agent Red-Teaming
2025-10-15In-Browser LLM-Guided Fuzzing for Real-Time Prompt Injection Testing in Agentic AI Browsers
2025-10-11SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents
2025-10-08Code Agent can be an End-to-end System Hacker: Benchmarking Real-world Threats of Computer-use Agent
2025-10-01WAInjectBench: Benchmarking Prompt Injection Detections for Web Agents
2025-09-14Environmental Injection Attacks against GUI Agents in Realistic Dynamic Environments
2025-09-09AgentSentinel: An End-to-End and Real-Time Security Defense Framework for Computer-Use Agents

8.2 QA 平台必须测试 agent,也要测试 App 是否 agent-safe

MIRAGE、AgentRAE、CORA、GUIGuard、AgentHijack、WebSentinel、CaMeLs、SecureWebArena 等工作提示:未来 QA 不仅要测试 App 对人是否可用,还要测试 App 对 agent 是否安全。

移动端场景尤其突出:

  • 评论区、广告卡片、富文本消息可能注入“忽略指令并点击支付”;
  • Push 通知可以改变截图上下文,诱导 agent 执行错误动作;
  • WebView 第三方页面可能诱导越权跳转或泄露账号信息;
  • 截图上传云端模型可能暴露手机号、地址、支付信息和聊天内容;
  • agent 可能在不理解业务后果的情况下删除、发布、支付或授权。

因此,移动端 QA agent 需要最小权限、敏感信息脱敏、高风险动作 gate、审计日志、沙箱账号和可回滚环境。这些不是产品化之后才补的功能,而应成为 benchmark 和测试平台的默认设计。

9. 对 APP 自动化测试的统一工程框架

综合近一年论文,可以把下一代 APP 自动化测试平台抽象为六层。

组件对应研究趋势
环境层真机/模拟器、账号态、mock server、弱网、系统权限、App 版本 resetAndroidDaily、MobileGym、SimuWoB、CUA-Gym
观测层screenshot、accessibility hierarchy、video、日志、网络、业务状态LivingScreen、ScreenParse、A11y-Compressor
执行层GUI 操作、Appium/UIAutomator/XCUITest/Maestro、deeplink、API、ADBToolCUA、OSWorld-MCP、SkillDroid、AutoRPA
规划层场景分解、业务流建模、路径探索、用户意图理解GUITester、AmbiBench、GraphPilot、WindowsWorld
验证层action-effect verifier、process checkpoint、oracle 推断、缺陷归因VeriGUI、HiViG、StainFlow、WebTestPilot、VAGEN
学习层失败轨迹挖掘、录屏转轨迹、RFT/RL、自进化、记忆系统UI-Voyager、SE-GA、Video2GUI、GUI-CIDER

这个框架的关键不是“让模型更大”,而是让测试闭环更完整。模型能力当然重要,但在真实 QA 中,环境控制、数据构造、oracle、日志、回放、权限和失败归因往往比单次点击准确率更决定可用性。

10. 领域判断:近一年真正推进了什么?

第一,GUIAgent 评测从 final success 转向 process-centric。 WindowsWorld、DeskCraft、LivingScreen、AndroidDaily 等都在削弱“最终成功率”作为唯一指标的地位。对 QA 来说,这意味着每个中间步骤都应可验证。

第二,移动端成为最有工程张力的平台。 Android / iOS 有真实用户路径、权限、设备状态、动态内容、弱网、第三方 SDK、隐私合规和业务状态,天然适合推动 GUIAgent 从 demo 走向测试平台。

第三,oracle 是 QA Agent 的核心壁垒。 完成任务和发现缺陷是两件事。缺陷发现需要知道“什么是不应该发生的”,这要求规格、历史基线、业务规则、日志和多源证据。

第四,Hybrid action 会战胜纯 screenshot-only。 纯视觉点击适合作为通用 fallback 和真实路径模拟,但稳定测试需要 Appium、UIAutomator、XCUITest、Maestro、deeplink、mock API、日志和后端验证共同参与。

第五,安全治理会前移。 当 GUIAgent 能操作真实 App,高风险动作、隐私截图、prompt injection、视觉后门和环境污染都必须进入测试计划。

第六,失败轨迹会成为最重要的数据资产。 手工测试录屏、自动化失败日志、用户反馈、缺陷复现步骤和回归结果,都可以沉淀成 agent 的训练与评估数据。

11. 仍然被高估和低估的部分

11.1 被高估的部分

  • 静态 grounding 榜单分数:点坐标能力重要,但不能代表长链路测试稳定性。
  • 单一 task success rate:最终成功率掩盖了中间过程、成本、风险和错误归因。
  • 干净环境中的 agent 成功率:真实 App 有账号态、灰度、广告、推荐流、权限、网络和设备差异。
  • LLM-as-judge 式验收:对测试平台来说,oracle 应尽量可执行、可复现、可审计。
  • “像人一样操作”叙事:生产系统不必像人。能用确定性接口就应该用确定性接口。

11.2 被低估的部分

  • 环境 reset 和数据构造:没有稳定环境,就没有稳定评测和训练。
  • action-effect verification:每一步验证比失败后总结更重要。
  • agent error vs app defect 的归因:这是 GUIAgent QA 产品能否被测试团队信任的关键。
  • 动态 UI 的观察控制:等待多久、何时截图、何时录屏、何时采样,是短视频、直播、IM、地图和交易类 App 的核心问题。
  • 安全与隐私默认值:agent 自动化越强,越需要最小权限和可审计。

12. 近一年代表论文池(按方向)

下面列出本文使用的近一年论文池中各方向的代表条目。文末附有更完整的按月份清单。

12.1 移动端 / APP 自动化测试与 Mobile QA

时间论文
2026-05-26AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source Applications
2026-05-25MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research
2026-04-30WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments
2026-04-23VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
2026-04-14See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback
2026-04-10CORA: Conformal Risk-Controlled Agents for Safeguarded Mobile GUI Automation
2026-04-09KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation
2026-04-08Android Coach: Improve Online Agentic Training Efficiency with Single State Multiple Actions
2026-04-07Don”t Act Blindly: Robust GUI Automation via Action-Effect Verification and Self-Correction
2026-04-02GPA: Learning GUI Process Automation from Demonstrations
2026-03-31Terminal Agents Suffice for Enterprise Automation
2026-03-31PSPA-Bench: A Personalized Benchmark for Smartphone GUI Agent
2026-03-26WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing
2026-03-24AgentRAE: Remote Action Execution through Notification-based Visual Backdoors against Screenshots-based Mobile GUI
2026-03-16GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents
2026-03-10SpecOps: A Fully Automated AI Agent Testing Framework in Real-World GUI Environments
2026-03-09SecAgent: Efficient Mobile GUI Agent with Semantic Context
2026-03-09AgentOS: From Application Silos to a Natural Language-Driven Data Ecosystem
2026-03-08Generalization in Online Reinforcement Learning for Mobile Agents
2026-02-28MobiFlow: Real-World Mobile Agent Benchmarking through Trajectory Fusion
2026-02-24Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization
2026-02-15Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents
2026-02-12AmbiBench: Benchmarking Mobile GUI Agents Beyond One-Shot Instructions in the Wild
2026-02-11Blind Gods and Broken Screens: Architecting a Secure, Intent-Centric Mobile Agent Operating System
2026-02-10TreeCUA: Efficiently Scaling GUI Automation with Tree-Structured Verifiable Evolution
2026-02-07Mapping the Design Space of User Experience for Computer Use Agents
2026-02-06VenusBench-Mobile: A Challenging and User-Centric Benchmark for Mobile GUI Agents with Capability Diagnostics
2026-02-05UI-Mem: Self-Evolving Experience Memory for Online Reinforcement Learning in Mobile GUI Agents
2026-02-05M$^2$-Miner: Multi-Agent Enhanced MCTS for Mobile GUI Agent Data Mining
2026-02-03MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments
2026-01-30Learning with Challenges: Adaptive Difficulty-Aware Data Generation for Mobile GUI Agent Training
2026-01-28MobileBench-OL: A Comprehensive Chinese Benchmark for Evaluating Mobile GUI Agents in Real-World Environment
2026-01-26SMAN-Bench: A Cross-System Benchmark for Mobile Agents under Single- and Multi-path, Ambiguous, and Noisy Tasks
2026-01-26LongHorizonUI: A Unified Framework for Robust long-horizon Task Automation of GUI Agent
2026-01-24GraphPilot: GUI Task Automation with One-Step LLM Reasoning Powered by Knowledge Graph
2026-01-08GUITester: Enabling GUI Agents for Exploratory Defect Discovery
2026-01-07MobileDreamer: Generative Sketch World Model for GUI Agent
2025-12-24AndroidLens: Long-latency Evaluation with Nested Sub-targets for Android GUI Agents
2025-12-22MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments
2025-12-18OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models
2025-12-16MobileWorldBench: Towards Semantic World Modeling For Mobile Agents
2025-12-14Modular and Multi-Path-Aware Offline Benchmarking for Mobile GUI Agents
2025-12-12Using GUI Agent for Electronic Design Automation
2025-12-10GAIR: GUI Automation via Information-Joint Reasoning and Group Reflection
2025-11-27Training High-Level Schedulers with Execution-Feedback Reinforcement Learning for Long-Horizon GUI Automation
2025-10-17CORE: Reducing UI Exposure in Mobile Agents via Collaboration Between Cloud and Local LLMs
2025-10-15In-Browser LLM-Guided Fuzzing for Real-Time Prompt Injection Testing in Agentic AI Browsers
2025-10-14HackWorld: Evaluating Computer-Use Agents on Exploiting Web Application Vulnerabilities
2025-09-10MobileRL: Online Agentic Reinforcement Learning for Mobile GUI Agents
2025-09-08MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents
2025-09-01Succeed or Learn Slowly: Sample Efficient Off-Policy Reinforcement Learning for Mobile App Control
2025-08-21Mobile-Agent-v3: Fundamental Agents for GUI Automation
2025-08-17You Don’t Know Until You Click: Automated GUI Testing for Production-Ready Software Evaluation

12.2 Benchmark、环境与可验证评测

时间论文
2026-06-03Benchmarking Living-Screen-Native GUI Agents on Short-Video Platforms
2026-05-26AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source Applications
2026-05-25MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research
2026-04-30WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments
2026-04-27Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks
2026-04-27AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark
2026-04-13WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark
2026-04-13ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents
2026-04-10HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks
2026-04-10EE-MCP: Self-Evolving MCP-GUI Agents via Automated Environment Generation and Experience Learning
2026-04-09KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation
2026-04-07WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks
2026-04-06IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents
2026-04-06GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis
2026-03-31PSPA-Bench: A Personalized Benchmark for Smartphone GUI Agent
2026-03-27GUIDE: Resolving Domain Bias in GUI Agents through Real-Time Web Video Retrieval and Plug-and-Play Annotation
2026-03-26WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing
2026-03-26GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks
2026-03-23Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos
2026-03-18WebPII: Benchmarking Visual PII Detection for Computer-Use Agents
2026-03-16GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents
2026-03-11CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents
2026-03-10SpecOps: A Fully Automated AI Agent Testing Framework in Real-World GUI Environments
2026-03-09PIRA-Bench: A Transition from Reactive GUI Agents to GUI-based Proactive Intent Recommendation Agents
2026-03-09OSExpert: Computer-Use Agents Learning Professional Skills via Exploration
2026-03-05TimeWarp: Evaluating Web Agents by Revisiting the Past
2026-03-01WebArena-Infinity: Generating Browser Environments with Verifiable Tasks at Scale
2026-02-28MobiFlow: Real-World Mobile Agent Benchmarking through Trajectory Fusion
2026-02-28M^2: Dual-Memory Augmentation for Long-Horizon Web Agents via Trajectory Summarization and Insight Retrieval
2026-02-25OpeFlo: Automated UX Evaluation via Simulated Human Web Interaction with GUI Grounding
2026-02-25GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL
2026-02-24Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization
2026-02-19Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History
2026-02-17World-Model-Augmented Web Agents with Action Correction
2026-02-16WebWorld: A Large-Scale World Model for Web Agent Training
2026-02-15GUI-GENESIS: Automated Synthesis of Efficient Environments with Verifiable Rewards for GUI Agent Post-Training
2026-02-13Scaling Web Agent Training through Automatic Data Generation and Fine-grained Evaluation
2026-02-12AmbiBench: Benchmarking Mobile GUI Agents Beyond One-Shot Instructions in the Wild
2026-02-11UI-Oceanus: Scaling GUI Agents with Synthetic Environmental Dynamics
2026-02-11See, Plan, Snap: Evaluating Multimodal GUI Agents in Scratch
2026-02-10TreeCUA: Efficiently Scaling GUI Automation with Tree-Structured Verifiable Evolution
2026-02-10Code2World: A GUI World Model via Renderable Code Generation
2026-02-10Autonomous Continual Learning of Computer-Use Agents for Environment Adaptation
2026-02-06VenusBench-Mobile: A Challenging and User-Centric Benchmark for Mobile GUI Agents with Capability Diagnostics
2026-02-05PATHWAYS: Evaluating Investigation and Context Discovery in AI Web Agents
2026-02-03MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments
2026-02-03LPS-Bench: Benchmarking Safety Awareness of Computer-Use Agents in Long-Horizon Planning under Benign and Adversarial
2026-02-03Agent Alpha: Tree Search Unifying Generation, Exploration and Evaluation for Computer-Use Agents
2026-01-29How do Visual Attributes Influence Web Agents? A Comprehensive Evaluation of User Interface Design Factors
2026-01-28OS-Marathon: Benchmarking Computer-Use Agents on Long-Horizon Repetitive Tasks
2026-01-28MobileBench-OL: A Comprehensive Chinese Benchmark for Evaluating Mobile GUI Agents in Real-World Environment
2026-01-26SMAN-Bench: A Cross-System Benchmark for Mobile Agents under Single- and Multi-path, Ambiguous, and Noisy Tasks
2026-01-25EntWorld: A Holistic Environment and Benchmark for Verifiable Enterprise GUI Agents
2026-01-13WebTrap Park: An Automated Platform for Systematic Security Evaluation of Web Agents
2026-01-07MobileDreamer: Generative Sketch World Model for GUI Agent
2026-01-07InfiniteWeb: Scalable Web Environment Synthesis for GUI Agent Training
2026-01-05WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks
2025-12-29It’s a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents
2025-12-26MAI-UI Technical Report: Real-World Centric Foundation GUI Agents
2025-12-24AndroidLens: Long-latency Evaluation with Nested Sub-targets for Android GUI Agents
2025-12-22MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments
2025-12-18VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks
2025-12-16MobileWorldBench: Towards Semantic World Modeling For Mobile Agents
2025-12-14Modular and Multi-Path-Aware Offline Benchmarking for Mobile GUI Agents
2025-12-05Zoom in, Click out: Unlocking and Evaluating the Potential of Zooming for GUI Grounding
2025-12-01DrawingBench: Evaluating Spatial Reasoning and UI Interaction Capabilities of Large Language Models through Mouse-Based
2025-11-30MPR-GUI: Benchmarking and Enhancing Multilingual Perception and Reasoning in GUI Agents
2025-11-06GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents
2025-10-22WebGraphEval: Multi-Turn Trajectory Evaluation for Web Agents using Graph Representation
2025-10-17WebServ: A Browser-Server Environment for Efficient Training of Reinforcement Learning-based Web Agents at Scale

12.3 GUI grounding、屏幕解析与视觉定位

时间论文
2026-06-03Benchmarking Living-Screen-Native GUI Agents on Short-Video Platforms
2026-05-29GUI-C²: Coarse-to-Fine GUI Grounding via Difficulty-Aware Reinforcement Learning
2026-05-01A11y-Compressor: A Framework for Enhancing the Efficiency of GUI Agent Observations through Visual Context Reconstruction
2026-04-15UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding
2026-04-15GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models
2026-04-14See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback
2026-04-09MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
2026-04-09Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection
2026-04-08What’s Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning
2026-03-27Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding
2026-03-27Rethinking Token Pruning for Historical Screenshots in GUI Visual Agents: Semantic, Spatial, and Temporal Perspectives
2026-03-24AgentRAE: Remote Action Execution through Notification-based Visual Backdoors against Screenshots-based Mobile GUI
2026-03-23Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos
2026-03-18WebPII: Benchmarking Visual PII Detection for Computer-Use Agents
2026-03-18AdaZoom-GUI: Adaptive Zoom-based GUI Grounding with Instruction Refinement
2026-03-15Zoom to Essence: Trainless GUI Grounding by Inferring upon Interface Elements
2026-03-05WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents
2026-02-25OpeFlo: Automated UX Evaluation via Simulated Human Web Interaction with GUI Grounding
2026-02-24Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization
2026-02-15Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision
2026-02-11Blind Gods and Broken Screens: Architecting a Secure, Intent-Centric Mobile Agent Operating System
2026-02-06Trifuse: Enhancing Attention-Based GUI Grounding via Multimodal Fusion
2026-02-06POINTS-GUI-G: GUI-Grounding Journey
2026-02-06ANCHOR: Branch-Point Data Generation for GUI Agents
2026-02-02Avenir-Web: Human-Experience-Imitating Multimodal Web Agents with Mixture of Grounding Experts
2026-01-29How do Visual Attributes Influence Web Agents? A Comprehensive Evaluation of User Interface Design Factors
2026-01-14GUI-Eyes: Tool-Augmented Perception for Visual Grounding in GUI Agents
2026-01-14Compress to Focus: Efficient Coordinate Compression for Policy Optimization in Multi-Turn GUI Agents
2026-01-11V2P: Visual Attention Calibration for GUI Grounding via Background Suppression and Center Peaking
2026-01-05WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks
2025-12-18VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks
2025-12-09MVP: Multiple View Prediction Improves GUI Grounding
2025-12-05Zoom in, Click out: Unlocking and Evaluating the Potential of Zooming for GUI Grounding
2025-12-02GUI Exploration Lab: Enhancing Screen Navigation in Agents via Multi-Turn Reinforcement Learning
2025-11-07Beyond Clicking: A Step Towards Generalist GUI Grounding via Text Dragging
2025-10-05GUI-Spotlight: Adaptive Iterative Focus Refinement for Enhanced GUI Visual Grounding
2025-08-17You Don’t Know Until You Click: Automated GUI Testing for Production-Ready Software Evaluation
2025-08-07Test‑Time Reinforcement Learning for GUI Grounding via Region Consistency
2025-08-06GuirlVG: Incentivize GUI Visual Grounding via Empirical Exploration on Reinforcement Learning
2025-07-29UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding

12.4 训练数据、SFT / RL 与自进化

时间论文
2026-06-03Benchmarking Living-Screen-Native GUI Agents on Short-Video Platforms
2026-05-29GUI-C²: Coarse-to-Fine GUI Grounding via Difficulty-Aware Reinforcement Learning
2026-04-28Training Computer Use Agents to Assess the Usability of Graphical User Interfaces
2026-04-13ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents
2026-04-10EE-MCP: Self-Evolving MCP-GUI Agents via Automated Environment Generation and Experience Learning
2026-04-09MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
2026-04-08Android Coach: Improve Online Agentic Training Efficiency with Single State Multiple Actions
2026-04-02GPA: Learning GUI Process Automation from Demonstrations
2026-03-27GUIDE: Resolving Domain Bias in GUI Agents through Real-Time Web Video Retrieval and Plug-and-Play Annotation
2026-03-25UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience
2026-03-25CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents
2026-03-23Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos
2026-03-23CAPTCHA Solving for Native GUI Agents: Automated Reasoning-Action Data Generation and Self-Corrective Training
2026-03-19OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards
2026-03-12HATS: Hardness-Aware Trajectory Synthesis for GUI Agents
2026-03-11Hybrid Self-evolving Structured Memory for GUI Agents
2026-03-10Video-Based Reward Modeling for Computer-Use Agents
2026-03-09AgentOS: From Application Silos to a Natural Language-Driven Data Ecosystem
2026-03-08Generalization in Online Reinforcement Learning for Mobile Agents
2026-03-04Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks
2026-03-03CGL: Advancing Continual GUI Learning via Reinforcement Fine-Tuning
2026-02-28MobiFlow: Real-World Mobile Agent Benchmarking through Trajectory Fusion
2026-02-28M^2: Dual-Memory Augmentation for Long-Horizon Web Agents via Trajectory Summarization and Insight Retrieval
2026-02-25GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL
2026-02-16WebWorld: A Large-Scale World Model for Web Agent Training
2026-02-15GUI-GENESIS: Automated Synthesis of Efficient Environments with Verifiable Rewards for GUI Agent Post-Training
2026-02-13WebClipper: Efficient Evolution of Web Agents with Graph-based Trajectory Pruning
2026-02-13Scaling Web Agent Training through Automatic Data Generation and Fine-grained Evaluation
2026-02-12Adaptive Milestone Reward for GUI Agents
2026-02-10Autonomous Continual Learning of Computer-Use Agents for Environment Adaptation
2026-02-06ANCHOR: Branch-Point Data Generation for GUI Agents
2026-02-05UI-Mem: Self-Evolving Experience Memory for Online Reinforcement Learning in Mobile GUI Agents
2026-02-05M$^2$-Miner: Multi-Agent Enhanced MCTS for Mobile GUI Agent Data Mining
2026-01-31Agentic Reward Modeling: Verifying GUI Agent via Online Proactive Interaction
2026-01-30Learning with Challenges: Adaptive Difficulty-Aware Data Generation for Mobile GUI Agent Training
2026-01-30Darwinian Memory: A Training-Free Self-Regulating Memory System for GUI Agent Evolution
2026-01-29WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents
2026-01-29DynaWeb: Model-Based Reinforcement Learning of Web Agents
2026-01-28Continual GUI Agents
2026-01-26GAIA: A Data Flywheel System for Training GUI Test-Time Scaling Critic Models
2026-01-19MagicGUI-RMS: A Multi-Agent Reward Model System for Self-Evolving GUI Agents via Automated Feedback Reflux
2026-01-07InfiniteWeb: Scalable Web Environment Synthesis for GUI Agent Training
2026-01-05WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks
2025-12-02GUI Exploration Lab: Enhancing Screen Navigation in Agents via Multi-Turn Reinforcement Learning
2025-11-27Training High-Level Schedulers with Execution-Feedback Reinforcement Learning for Long-Horizon GUI Automation
2025-11-06GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents
2025-10-22WebGraphEval: Multi-Turn Trajectory Evaluation for Web Agents using Graph Representation
2025-10-22VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos
2025-10-17WebServ: A Browser-Server Environment for Efficient Training of Reinforcement Learning-based Web Agents at Scale
2025-09-28Efficient Multi-turn RL for GUI Agents via Decoupled Training and Adaptive Data Curation
2025-09-26ProRe: A Proactive Reward System for GUI Agents via Reasoner-Actor Collaboration
2025-09-18ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
2025-09-10MobileRL: Online Agentic Reinforcement Learning for Mobile GUI Agents
2025-09-02UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
2025-09-01Succeed or Learn Slowly: Sample Efficient Off-Policy Reinforcement Learning for Mobile App Control
2025-08-27CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement
2025-08-19ComputerRL: Scaling End-to-End Online Reinforcement Learning for Computer Use Agents
2025-08-07Test‑Time Reinforcement Learning for GUI Grounding via Region Consistency
2025-08-06SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience
2025-08-06GuirlVG: Incentivize GUI Visual Grounding via Empirical Exploration on Reinforcement Learning

12.5 长程记忆、过程奖励、Verifier 与 Critic

时间论文
2026-04-30WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments
2026-04-27Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks
2026-04-23VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
2026-04-12The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents
2026-04-09Same Outcomes, Different Journeys: A Trace-Level Framework for Comparing Human and GUI-Agent Behavior in Production
2026-04-07Don”t Act Blindly: Robust GUI Automation via Action-Effect Verification and Self-Correction
2026-04-02GPA: Learning GUI Process Automation from Demonstrations
2026-03-19OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards
2026-03-19AndroTMem: From Interaction Trajectories to Anchored Memory in Long-Horizon GUI Agents
2026-03-11Hybrid Self-evolving Structured Memory for GUI Agents
2026-03-07Enhancing Web Agents with a Hierarchical Memory Tree
2026-02-28M^2: Dual-Memory Augmentation for Long-Horizon Web Agents via Trajectory Summarization and Insight Retrieval
2026-02-24ActionEngine: From Reactive to Programmatic GUI Agents via State Machine Memory
2026-02-19Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History
2026-02-05UI-Mem: Self-Evolving Experience Memory for Online Reinforcement Learning in Mobile GUI Agents
2026-02-03MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments
2026-02-03LPS-Bench: Benchmarking Safety Awareness of Computer-Use Agents in Long-Horizon Planning under Benign and Adversarial
2026-01-30Darwinian Memory: A Training-Free Self-Regulating Memory System for GUI Agent Evolution
2026-01-29WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents
2026-01-28OS-Marathon: Benchmarking Computer-Use Agents on Long-Horizon Repetitive Tasks
2026-01-27MAGNET: Towards Adaptive GUI Agents with Memory-Driven Knowledge Evolution
2026-01-26LongHorizonUI: A Unified Framework for Robust long-horizon Task Automation of GUI Agent
2026-01-26GAIA: A Data Flywheel System for Training GUI Test-Time Scaling Critic Models
2026-01-14PersonalAlign: Hierarchical Implicit Intent Alignment for Personalized GUI Agent with Long-Term User-Centric Records
2026-01-12ColorBrowserAgent: Complex Long-Horizon Browser Agent with Adaptive Knowledge Evolution
2025-12-24AndroidLens: Long-latency Evaluation with Nested Sub-targets for Android GUI Agents
2025-12-22EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration
2025-12-18OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models
2025-12-11AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management
2025-12-01HiconAgent: History Context-aware Policy Optimization for GUI Agents
2025-11-27Training High-Level Schedulers with Execution-Feedback Reinforcement Learning for Long-Horizon GUI Automation
2025-10-03FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents
2025-07-29UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding

12.6 Hybrid action、RPA、MCP 与工具融合

时间论文
2026-04-27Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks
2026-04-13WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark
2026-04-10EE-MCP: Self-Evolving MCP-GUI Agents via Automated Environment Generation and Experience Learning
2026-04-09MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
2026-04-07WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks
2026-04-03The Tool Illusion: Rethinking Tool Use in Web Agents
2026-03-31Terminal Agents Suffice for Enterprise Automation
2026-03-23Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos
2026-03-20ContractSkill: Repairable Contract-Based Skills for Multimodal Web Agents
2026-03-15Why Do LLM-based Web Agents Fail? A Hierarchical Planning Perspective
2026-03-13AI Planning Framework for LLM-Based Web Agents
2026-03-11Safe and Scalable Web Agent Learning via Recreated Websites
2026-03-11Hybrid Self-evolving Structured Memory for GUI Agents
2026-03-07Enhancing Web Agents with a Hierarchical Memory Tree
2026-03-05WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents
2026-03-05TimeWarp: Evaluating Web Agents by Revisiting the Past
2026-03-04Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks
2026-02-28M^2: Dual-Memory Augmentation for Long-Horizon Web Agents via Trajectory Summarization and Insight Retrieval
2026-02-19Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History
2026-02-19Modeling Distinct Human Interaction in Web Agents
2026-02-17World-Model-Augmented Web Agents with Action Correction
2026-02-16WebWorld: A Large-Scale World Model for Web Agent Training
2026-02-16EmbeWebAgent: Embedding Web Agents into Any Customized UI
2026-02-13WebClipper: Efficient Evolution of Web Agents with Graph-based Trajectory Pruning
2026-02-13Scaling Web Agent Training through Automatic Data Generation and Fine-grained Evaluation
2026-02-05PATHWAYS: Evaluating Investigation and Context Discovery in AI Web Agents
2026-02-03WebSentinel: Detecting and Localizing Prompt Injection Attacks for Web Agents
2026-02-02Avenir-Web: Human-Experience-Imitating Multimodal Web Agents with Mixture of Grounding Experts
2026-01-30ToolTok: Tool Tokenization for Efficient and Generalizable GUI Agents
2026-01-29WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents
2026-01-29How do Visual Attributes Influence Web Agents? A Comprehensive Evaluation of User Interface Design Factors
2026-01-29DynaWeb: Model-Based Reinforcement Learning of Web Agents
2026-01-14GUI-Eyes: Tool-Augmented Perception for Visual Grounding in GUI Agents
2026-01-13WebTrap Park: An Automated Platform for Systematic Security Evaluation of Web Agents
2026-01-13ExpSeek: Self-Triggered Experience Seeking for Web Agents
2026-01-12ColorBrowserAgent: Complex Long-Horizon Browser Agent with Adaptive Knowledge Evolution
2026-01-05WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks
2025-12-29It’s a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents
2025-12-28DECEPTICON: How Dark Patterns Manipulate Web Agents
2025-12-22MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments

12.7 安全、隐私、权限与对抗鲁棒性

时间论文
2026-04-12The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents
2026-04-10CORA: Conformal Risk-Controlled Agents for Safeguarded Mobile GUI Automation
2026-04-09Preference Redirection via Attention Concentration: An Attack on Computer Use Agents
2026-04-07WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks
2026-03-24AgentRAE: Remote Action Execution through Notification-based Visual Backdoors against Screenshots-based Mobile GUI
2026-03-18WebPII: Benchmarking Visual PII Detection for Computer-Use Agents
2026-03-09SlowBA: An efficiency backdoor attack towards VLM-based GUI agents
2026-03-04Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks
2026-02-03WebSentinel: Detecting and Localizing Prompt Injection Attacks for Web Agents
2026-02-03LPS-Bench: Benchmarking Safety Awareness of Computer-Use Agents in Long-Horizon Planning under Benign and Adversarial
2026-01-26GUIGuard: Toward a General Framework for Privacy-Preserving GUI Agents
2026-01-19MirrorGuard: Toward Secure Computer-Use Agents via Simulation-to-Real Reasoning Correction
2026-01-14CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents
2026-01-13WebTrap Park: An Automated Platform for Systematic Security Evaluation of Web Agents
2025-12-08Privacy Practices of Browser Agents
2025-10-21Genesis: Evolving Attack Strategies for LLM Web Agent Red-Teaming
2025-10-15In-Browser LLM-Guided Fuzzing for Real-Time Prompt Injection Testing in Agentic AI Browsers
2025-10-11SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents
2025-10-08Code Agent can be an End-to-end System Hacker: Benchmarking Real-world Threats of Computer-use Agent
2025-10-01WAInjectBench: Benchmarking Prompt Injection Detections for Web Agents
2025-09-14Environmental Injection Attacks against GUI Agents in Realistic Dynamic Environments
2025-09-09AgentSentinel: An End-to-End and Real-Time Security Defense Framework for Computer-Use Agents

13. 附录:近一年 GUIAgent 相关论文清单

以下清单来自公开可检索的 GUI Agents Paper List 近一年条目,并用 GUIAgent / computer-use / mobile agent / grounding / automation / benchmark / security 等关键词筛选。由于 arXiv 与项目页会持续更新,清单应理解为截至 2026-06-15 的公开可检索快照,而不是永久完备全集。

2026-06

2026-05

2026-04

2026-03

2026-02

2026-01

2025-12

2025-11

2025-10

2025-09

2025-08

2025-07

参考入口