This is a collection of research papers on UI Agents, covering models, tools, and datasets. The repository will be continuously updated to track the frontier of UI Agents and related fields.
Welcome to follow and star!
A UI Agent is a generalist agent that interacts with user interfaces (UIs) across environments such as mobile apps, web pages, and desktop applications. It perceives the UI, typically through a vision-language model, and acts on it to complete tasks, with applications ranging from mobile device operation and web browsing to game playing. Such agents can be trained in simulated environments or on real-world data, and are usually evaluated by task completion rate, efficiency, and generalization ability.
Research on UI Agents is still at an early stage, with open challenges in scalability, robustness, and interpretability. The field is interdisciplinary, drawing on computer vision, natural language processing, reinforcement learning, human-computer interaction, and software engineering, and it has the potential to transform how we interact with computers and to improve the efficiency and usability of software systems.
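Despite their differences, most of the agents collected below share the same perceive-decide-act loop: capture the current screen, let a vision-language model choose the next UI action from the task, the screenshot, and the action history, execute it, and repeat until the model signals completion or a step budget runs out. The Python sketch below is purely illustrative; the `Action` type and the `capture`, `policy`, and `execute` callables are assumptions made for exposition, not the API of any paper in this list.

```python
# Illustrative sketch of the perceive-decide-act loop shared by most UI agents.
# All names here (Action, run_episode, capture, policy, execute) are assumptions
# for exposition, not the API of any paper listed below.
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Action:
    kind: str                                  # "tap", "type", "scroll", or "stop"
    target: Optional[Tuple[int, int]] = None   # screen coordinates for "tap"
    text: Optional[str] = None                 # text to enter for "type"

def run_episode(
    task: str,
    capture: Callable[[], bytes],                          # grabs a screenshot of the current UI
    policy: Callable[[str, bytes, List[Action]], Action],  # VLM: (task, screen, history) -> next action
    execute: Callable[[Action], None],                     # sends the action to the device or browser
    max_steps: int = 20,
) -> bool:
    """Run one task; the boolean result feeds a task-completion-rate metric."""
    history: List[Action] = []
    for _ in range(max_steps):
        screen = capture()
        action = policy(task, screen, history)
        if action.kind == "stop":
            return True                        # the policy judges the task complete
        execute(action)
        history.append(action)
    return False                               # step budget exhausted; counted as a failure
```

Concrete agents differ mainly in how `policy` is realized (prompted or fine-tuned VLMs, tree search, reinforcement learning) and in which environment `capture` and `execute` operate (Android device, browser, or desktop OS); the boolean return is what a task-completion-rate evaluation aggregates.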
format:
- [title](paper link) [links]
  - author1, author2, and author3...
  - year
  - publisher
  - key
  - code
  - experiment environment
- Apple Intelligence Foundation Language Models
  - Apple
  - Key: Vision-Language Model, Private Cloud Compute
  - 2024
- CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation
  - Xinbei Ma and Zhuosheng Zhang and Hai Zhao
  - Key: Vision-Language Model, Phone
  - 2024
  - code
- SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
  - Kanzhi Cheng and Qiushi Sun and Yougang Chu and Fangzhi Xu and Yantao Li and Jianbing Zhang and Zhiyong Wu
  - Key: Vision-Language Model, PC
  - 2024
  - code
- Tell Me More! Towards Implicit User Intention Understanding of Language Model Driven Agents
  - Cheng Qian and Bingxiang He and Zhong Zhuang and Jia Deng and Yujia Qin and Xin Cong and Zhong Zhang and Jie Zhou and Yankai Lin and Zhiyuan Liu and Maosong Sun
  - Key: Language Model, User Intention
  - 2024
  - code
- LATS: Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models
  - Andy Zhou and Kai Yan and Michal Shlapentokh-Rothman and Haohan Wang and Yu-Xiong Wang
  - Key: Tree Search, Language Model
  - 2024
  - code
- DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning
  - Hao Bai and Yifei Zhou and Mert Cemri and Jiayi Pan and Alane Suhr and Sergey Levine and Aviral Kumar
  - Key: Vision-Language Model, Android, Reinforcement Learning
  - 2024
  - code
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  - Gilles Baechler and Srinivas Sunkara and Maria Wang and Fedir Zubach and Hassan Mansoor and Vincent Etter and Victor Cărbune and Jason Lin and Jindong Chen and Abhanshu Sharma
  - Key: Vision-Language Model, Mobile, Infographics
  - 2024
- ScreenAgent: A Vision Language Model-driven Computer Control Agent
  - Runliang Niu and Jindong Li and Shiqi Wang and Yali Fu and Xiyu Hu and Xueyuan Leng and He Kong and Yi Chang and Qi Wang
  - Key: Vision-Language Model, PC
  - 2024
  - code
- Android in the Zoo: Chain-of-Action-Thought for GUI Agents
  - Jiwen Zhang and Jihao Wu and Yihua Teng and Minghui Liao and Nuo Xu and Xiao Xiao and Zhongyu Wei and Duyu Tang
  - Key: Vision-Language Model, Android, Chain-of-Action-Thought
  - 2024
  - code
- AppAgent: Multimodal Agents as Smartphone Users
  - Chi Zhang and Zhao Yang and Jiaxuan Liu and Yucheng Han and Xin Chen and Zebiao Huang and Bin Fu and Gang Yu
  - Key: Vision-Language Model, Android
  - 2023
  - code
- Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration
  - Junyang Wang and Haiyang Xu and Haitao Jia and Xi Zhang and Ming Yan and Weizhou Shen and Ji Zhang and Fei Huang and Jitao Sang
  - Key: Vision-Language Model, Android
  - 2024
  - code
- Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
  - Junyang Wang and Haiyang Xu and Jiabo Ye and Ming Yan and Weizhou Shen and Ji Zhang and Fei Huang and Jitao Sang
  - Key: Vision-Language Model, Android
  - 2024
  - code
- WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
  - Hongliang He and Wenlin Yao and Kaixin Ma and Wenhao Yu and Yong Dai and Hongming Zhang and Zhenzhong Lan and Dong Yu
  - Key: Vision-Language Model, Web
  - 2024
  - code
- OS-Copilot: Towards Generalist Computer Agents with Self-Improvement
  - Zhiyong Wu and Chengcheng Han and Zichen Ding and Zhenmin Weng and Zhoumianze Liu and Shunyu Yao and Tao Yu and Lingpeng Kong
  - Key: Vision-Language Model, PC
  - 2024
  - code
- UFO: A UI-Focused Agent for Windows OS Interaction
  - Chaoyun Zhang and Liqun Li and Shilin He and Xu Zhang and Bo Qiao and Si Qin and Minghua Ma and Yu Kang and Qingwei Lin and Saravan Rajmohan and Dongmei Zhang and Qi Zhang
  - Key: Vision-Language Model, PC, Windows OS
  - 2024
  - code
- Octopus v2: On-device language model for super agent
  - Wei Chen and Zhiyuan Li
  - Key: Vision-Language Model, Android, iOS
  - 2024
- Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study
  - Weihao Tan and Ziluo Ding and Wentao Zhang and Boyu Li and Bohan Zhou and Junpeng Yue and Haochong Xia and Jiechuan Jiang and Longtao Zheng and Xinrun Xu and Yifei Bi and Pengjie Gu and Xinrun Wang and Börje F. Karlsson and Bo An and Zongqing Lu
  - Key: Vision-Language Model, PC, Game
  - 2024
  - code
- CogAgent: A Visual Language Model for GUI Agents
  - Wenyi Hong and Weihan Wang and Qingsong Lv and Jiazheng Xu and Wenmeng Yu and Junhui Ji and Yan Wang and Zihan Wang and Yuxuan Zhang and Juanzi Li and Bin Xu and Yuxiao Dong and Ming Ding and Jie Tang
  - Key: Vision-Language Model, PC, Android, screenshots
  - 2023
  - code
- Octopus: Embodied Vision-Language Programmer from Environmental Feedback
  - Jingkang Yang and Yuhao Dong and Shuai Liu and Bo Li and Ziyue Wang and Chencheng Jiang and Haoran Tan and Jiamu Kang and Yuanhan Zhang and Kaiyang Zhou and Ziwei Liu
  - Key: Vision-Language Model, Android, iOS
  - 2023
  - code
- You Only Look at Screens: Multimodal Chain-of-Action Agents
  - Zhuosheng Zhang and Aston Zhang
  - Key: Vision-Language Model, Android, Chain-of-Action-Thought
  - 2023
  - code
- LASER: LLM Agent with State-Space Exploration for Web Navigation
  - Kaixin Ma and Hongming Zhang and Hongwei Wang and Xiaoman Pan and Wenhao Yu and Dong Yu
  - Key: Vision-Language Model, Web, State-Space Exploration
  - 2023
  - code
- A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis
  - Izzeddin Gur and Hiroki Furuta and Austin Huang and Mustafa Safdari and Yutaka Matsuo and Douglas Eck and Aleksandra Faust
  - Key: Vision-Language Model, Web, Planning, Program Synthesis
  - 2023
- Augmenting Autotelic Agents with Large Language Models
  - Cédric Colas and Laetitia Teodorescu and Pierre-Yves Oudeyer and Xingdi Yuan and Marc-Alexandre Côté
  - Key: Language Model
  - 2023
- Language Models can Solve Computer Tasks
  - Geunwoo Kim and Pierre Baldi and Stephen McAleer
  - Key: Language Model
  - 2023
  - code
- OpenWebAgent: An Open Toolkit to Enable Web Agents on Large Language Models
  - Iat Long Iong and Xiao Liu and Yuxuan Chen and Hanyu Lai and Shuntian Yao and Pengbo Shen and Hao Yu and Yuxiao Dong and Jie Tang
  - Key: Webpage, deployment
  - 2024
  - code
- LlamaTouch: A Faithful and Scalable Testbed for Mobile UI Automation Task Evaluation
  - Li Zhang and Shihe Wang and Xianqing Jia and Zhihan Zheng and Yunhe Yan and Longxi Gao and Yuanchun Li and Mengwei Xu
  - Key: Mobile UI, Simulator
  - 2024
  - code
- WebArena: A Realistic Web Environment for Building Autonomous Agents
  - Shuyan Zhou and Frank F. Xu and Hao Zhu and Xuhui Zhou and Robert Lo and Abishek Sridhar and Xianyi Cheng and Tianyue Ou and Yonatan Bisk and Daniel Fried and Uri Alon and Graham Neubig
  - Key: Web, Simulator
  - 2023
  - code
- Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction
  - Danyang Zhang and Zhennan Shen and Rui Xie and Situo Zhang and Tianbao Xie and Zihan Zhao and Siyuan Chen and Lu Chen and Hongshen Xu and Ruisheng Cao and Kai Yu
  - Key: Android, Simulator
  - 2023
  - code
- AndroidEnv: A Reinforcement Learning Platform for Android
  - Daniel Toyama and Philippe Hamel and Anita Gergely and Gheorghe Comanici and Amelia Glaese and Zafarali Ahmed and Tyler Jackson and Shibl Mourad and Doina Precup
  - Key: Android, Reinforcement Learning, Simulator
  - 2021
  - code
- AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?
  - Ori Yoran and Samuel Joseph Amouyal and Chaitanya Malaviya and Ben Bogin and Ofir Press and Jonathan Berant
  - Key: Web, Realistic, Time-Consuming, Benchmark
  - 2024
  - code
- WebCanvas: Benchmarking Web Agents in Online Environments
  - Yichen Pan and Dehan Kong and Sida Zhou and Cheng Cui and Yifei Leng and Bing Jiang and Hangyu Liu and Yanyi Shang and Shuyan Zhou and Tongshuang Wu and Zhengyang Wu
  - Key: Web, Online Environments, Benchmark
  - 2024
  - code
- MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents
  - Luyuan Wang and Yongyu Deng and Yiwei Zha and Guodong Mao and Qinmin Wang and Tianchen Min and Wei Chen and Shoufa Chen
  - Key: Mobile, Benchmark
  - 2024
  - code
- VillagerAgent: A Graph-Based Multi-Agent Framework for Coordinating Complex Task Dependencies in Minecraft
  - Yubo Dong and Xukun Zhu and Zhengzhe Pan and Linchao Zhu and Yi Yang
  - Key: Vision-Language Model, Game
  - 2024
  - code
- CToolEval: A Chinese Benchmark for LLM-Powered Agent Evaluation in Real-World API Interactions
  - Zishan Guo and Yufei Huang and Deyi Xiong
  - Key: Vision-Language Model, Phone
  - 2024
  - code
- Multi-Turn Mind2Web: On the Multi-turn Instruction Following for Conversational Web Agents
  - Yang Deng and Xuan Zhang and Wenxuan Zhang and Yifei Yuan and See-Kiong Ng and Tat-Seng Chua
  - Key: Vision-Language Model, Web Tasks
  - 2024
  - code
- VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
  - Jing Yu Koh and Robert Lo and Lawrence Jang and Vikram Duvvur and Ming Chong Lim and Po-Yu Huang and Graham Neubig and Shuyan Zhou and Ruslan Salakhutdinov and Daniel Fried
  - Key: Vision-Language Model, Web Tasks
  - 2024
  - code
- Android in the Zoo: Chain-of-Action-Thought for GUI Agents
  - Jiwen Zhang and Jihao Wu and Yihua Teng and Minghui Liao and Nuo Xu and Xiao Xiao and Zhongyu Wei and Duyu Tang
  - Key: Vision-Language Model, Android, Chain-of-Action-Thought
  - 2024
  - code
- Android in the Wild: A Large-Scale Dataset for Android Device Control
  - Christopher Rawles and Alice Li and Daniel Rodriguez and Oriana Riva and Timothy Lillicrap
  - Key: Android, datasets
  - 2023
- Mind2Web: Towards a Generalist Agent for the Web
  - Xiang Deng and Yu Gu and Boyuan Zheng and Shijie Chen and Samuel Stevens and Boshi Wang and Huan Sun and Yu Su
  - Key: Web, datasets
  - 2023
  - code
- WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents
  - Shunyu Yao and Howard Chen and John Yang and Karthik Narasimhan
  - Key: Web, datasets
  - 2022
  - code
- Rico: A Mobile App Dataset for Building Data-Driven Design Applications
  - Biplab Deka and Zifeng Huang and Chad Franzen and Joshua Hibschman and Daniel Afergan and Yang Li and Jeffrey Nichols and Ranjitha Kumar
  - Key: mobile app, datasets
  - 2017
- awesome-llm-powered-agent
- Awesome-LLM-based-Web-Agent-and-Tools
- Awesome-GUI-Agent
- computer-control-agent-knowledge-base
Our purpose is to make this repo even better. If you are interested in contributing, please refer to HERE for contribution instructions.
This repository is released under the Apache 2.0 license.