This is a collection of research papers on UI Agents, covering models, tools, and datasets. The repository will be continuously updated to track the frontier of UI Agents and related fields.
Welcome to follow and star!
A UI Agent is a generalist agent that interacts with user interfaces (UIs) across environments such as mobile apps, web pages, and desktop applications. It perceives the UI, typically through a vision-language model, and acts on it to complete tasks, with applications ranging from mobile device operation and web browsing to game playing. Such agents can be trained in simulated environments or on real-world data, and are usually evaluated by task completion rate, efficiency, and generalization ability.
Research on UI Agents is still at an early stage, with open challenges in scalability, robustness, and interpretability. The field is interdisciplinary, drawing on computer vision, natural language processing, reinforcement learning, human-computer interaction, and software engineering, and it has the potential to transform how we interact with computers and to improve the efficiency and usability of software systems.
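Despite their differences, most of the agents collected below share the same perceive-decide-act loop: capture the current screen, let a vision-language model choose the next UI action from the task, the screenshot, and the action history, execute it, and repeat until the model signals completion or a step budget runs out. The Python sketch below is purely illustrative; the `Action` type and the `capture`, `policy`, and `execute` callables are assumptions made for exposition, not the API of any paper in this list.

```python
# Illustrative sketch of the perceive-decide-act loop shared by most UI agents.
# All names here (Action, run_episode, capture, policy, execute) are assumptions
# for exposition, not the API of any paper listed below.
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Action:
    kind: str                                  # "tap", "type", "scroll", or "stop"
    target: Optional[Tuple[int, int]] = None   # screen coordinates for "tap"
    text: Optional[str] = None                 # text to enter for "type"

def run_episode(
    task: str,
    capture: Callable[[], bytes],                          # grabs a screenshot of the current UI
    policy: Callable[[str, bytes, List[Action]], Action],  # VLM: (task, screen, history) -> next action
    execute: Callable[[Action], None],                     # sends the action to the device or browser
    max_steps: int = 20,
) -> bool:
    """Run one task; the boolean result feeds a task-completion-rate metric."""
    history: List[Action] = []
    for _ in range(max_steps):
        screen = capture()
        action = policy(task, screen, history)
        if action.kind == "stop":
            return True                        # the policy judges the task complete
        execute(action)
        history.append(action)
    return False                               # step budget exhausted; counted as a failure
```

Concrete agents differ mainly in how `policy` is realized (prompted or fine-tuned VLMs, tree search, reinforcement learning) and in which environment `capture` and `execute` operate (Android device, browser, or desktop OS); the boolean return is what a task-completion-rate evaluation aggregates.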
format:
- [title](paper link) [links]
  - author1, author2, and author3...
  - year
  - publisher
  - key
  - code
  - experiment environment
- Apple Intelligence Foundation Language Models
  - Apple
  - Key: Vision-Language Model, Private Cloud Compute
  - 2024
- CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation
  - Xinbei Ma and Zhuosheng Zhang and Hai Zhao
  - Key: Vision-Language Model, Phone
  - 2024
  - code
- SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
  - Kanzhi Cheng and Qiushi Sun and Yougang Chu and Fangzhi Xu and Yantao Li and Jianbing Zhang and Zhiyong Wu
  - Key: Vision-Language Model, PC
  - 2024
  - code
- Tell Me More! Towards Implicit User Intention Understanding of Language Model Driven Agents
  - Cheng Qian and Bingxiang He and Zhong Zhuang and Jia Deng and Yujia Qin and Xin Cong and Zhong Zhang and Jie Zhou and Yankai Lin and Zhiyuan Liu and Maosong Sun
  - Key: Language Model, User Intention
  - 2024
  - code
- LATS: Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models
  - Andy Zhou and Kai Yan and Michal Shlapentokh-Rothman and Haohan Wang and Yu-Xiong Wang
  - Key: Tree Search, Language Model
  - 2024
  - code
- DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning
  - Hao Bai and Yifei Zhou and Mert Cemri and Jiayi Pan and Alane Suhr and Sergey Levine and Aviral Kumar
  - Key: Vision-Language Model, Android, Reinforcement Learning
  - 2024
  - code
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  - Gilles Baechler and Srinivas Sunkara and Maria Wang and Fedir Zubach and Hassan Mansoor and Vincent Etter and Victor Cărbune and Jason Lin and Jindong Chen and Abhanshu Sharma
  - Key: Vision-Language Model, Mobile, Infographics
  - 2024
- ScreenAgent: A Vision Language Model-driven Computer Control Agent
  - Runliang Niu and Jindong Li and Shiqi Wang and Yali Fu and Xiyu Hu and Xueyuan Leng and He Kong and Yi Chang and Qi Wang
  - Key: Vision-Language Model, PC
  - 2024
  - code
- Android in the Zoo: Chain-of-Action-Thought for GUI Agents
  - Jiwen Zhang and Jihao Wu and Yihua Teng and Minghui Liao and Nuo Xu and Xiao Xiao and Zhongyu Wei and Duyu Tang
  - Key: Vision-Language Model, Android, Chain-of-Action-Thought
  - 2024
  - code
- AppAgent: Multimodal Agents as Smartphone Users
  - Chi Zhang and Zhao Yang and Jiaxuan Liu and Yucheng Han and Xin Chen and Zebiao Huang and Bin Fu and Gang Yu
  - Key: Vision-Language Model, Android
  - 2023
  - code
- Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration
  - Junyang Wang and Haiyang Xu and Haitao Jia and Xi Zhang and Ming Yan and Weizhou Shen and Ji Zhang and Fei Huang and Jitao Sang
  - Key: Vision-Language Model, Android
  - 2024
  - code
- Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
  - Junyang Wang and Haiyang Xu and Jiabo Ye and Ming Yan and Weizhou Shen and Ji Zhang and Fei Huang and Jitao Sang
  - Key: Vision-Language Model, Android
  - 2024
  - code
- WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
  - Hongliang He and Wenlin Yao and Kaixin Ma and Wenhao Yu and Yong Dai and Hongming Zhang and Zhenzhong Lan and Dong Yu
  - Key: Vision-Language Model, Web
  - 2024
  - code
- OS-Copilot: Towards Generalist Computer Agents with Self-Improvement
  - Zhiyong Wu and Chengcheng Han and Zichen Ding and Zhenmin Weng and Zhoumianze Liu and Shunyu Yao and Tao Yu and Lingpeng Kong
  - Key: Vision-Language Model, PC
  - 2024
  - code
- UFO: A UI-Focused Agent for Windows OS Interaction
  - Chaoyun Zhang and Liqun Li and Shilin He and Xu Zhang and Bo Qiao and Si Qin and Minghua Ma and Yu Kang and Qingwei Lin and Saravan Rajmohan and Dongmei Zhang and Qi Zhang
  - Key: Vision-Language Model, PC, Windows OS
  - 2024
  - code
- Octopus v2: On-device language model for super agent
  - Wei Chen and Zhiyuan Li
  - Key: Vision-Language Model, Android, iOS
  - 2024
- Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study
  - Weihao Tan and Ziluo Ding and Wentao Zhang and Boyu Li and Bohan Zhou and Junpeng Yue and Haochong Xia and Jiechuan Jiang and Longtao Zheng and Xinrun Xu and Yifei Bi and Pengjie Gu and Xinrun Wang and Börje F. Karlsson and Bo An and Zongqing Lu
  - Key: Vision-Language Model, PC, Game
  - 2024
  - code
- CogAgent: A Visual Language Model for GUI Agents
  - Wenyi Hong and Weihan Wang and Qingsong Lv and Jiazheng Xu and Wenmeng Yu and Junhui Ji and Yan Wang and Zihan Wang and Yuxuan Zhang and Juanzi Li and Bin Xu and Yuxiao Dong and Ming Ding and Jie Tang
  - Key: Vision-Language Model, PC, Android, screenshots
  - 2023
  - code
- Octopus: Embodied Vision-Language Programmer from Environmental Feedback
  - Jingkang Yang and Yuhao Dong and Shuai Liu and Bo Li and Ziyue Wang and Chencheng Jiang and Haoran Tan and Jiamu Kang and Yuanhan Zhang and Kaiyang Zhou and Ziwei Liu
  - Key: Vision-Language Model, Android, iOS
  - 2023
  - code
- You Only Look at Screens: Multimodal Chain-of-Action Agents
  - Zhuosheng Zhang and Aston Zhang
  - Key: Vision-Language Model, Android, Chain-of-Action-Thought
  - 2023
  - code
- LASER: LLM Agent with State-Space Exploration for Web Navigation
  - Kaixin Ma and Hongming Zhang and Hongwei Wang and Xiaoman Pan and Wenhao Yu and Dong Yu
  - Key: Vision-Language Model, Web, State-Space Exploration
  - 2023
  - code
- A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis
  - Izzeddin Gur and Hiroki Furuta and Austin Huang and Mustafa Safdari and Yutaka Matsuo and Douglas Eck and Aleksandra Faust
  - Key: Vision-Language Model, Web, Planning, Program Synthesis
  - 2023
- Augmenting Autotelic Agents with Large Language Models
  - Cédric Colas and Laetitia Teodorescu and Pierre-Yves Oudeyer and Xingdi Yuan and Marc-Alexandre Côté
  - Key: Language Model
  - 2023
- Language Models can Solve Computer Tasks
  - Geunwoo Kim and Pierre Baldi and Stephen McAleer
  - Key: Language Model
  - 2023
  - code
- OpenWebAgent: An Open Toolkit to Enable Web Agents on Large Language Models
  - Iat Long Iong and Xiao Liu and Yuxuan Chen and Hanyu Lai and Shuntian Yao and Pengbo Shen and Hao Yu and Yuxiao Dong and Jie Tang
  - Key: Webpage, deployment
  - 2024
  - code
- LlamaTouch: A Faithful and Scalable Testbed for Mobile UI Automation Task Evaluation
  - Li Zhang and Shihe Wang and Xianqing Jia and Zhihan Zheng and Yunhe Yan and Longxi Gao and Yuanchun Li and Mengwei Xu
  - Key: Mobile UI, Simulator
  - 2024
  - code
- WebArena: A Realistic Web Environment for Building Autonomous Agents
  - Shuyan Zhou and Frank F. Xu and Hao Zhu and Xuhui Zhou and Robert Lo and Abishek Sridhar and Xianyi Cheng and Tianyue Ou and Yonatan Bisk and Daniel Fried and Uri Alon and Graham Neubig
  - Key: Web, Simulator
  - 2023
  - code
- Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction
  - Danyang Zhang and Zhennan Shen and Rui Xie and Situo Zhang and Tianbao Xie and Zihan Zhao and Siyuan Chen and Lu Chen and Hongshen Xu and Ruisheng Cao and Kai Yu
  - Key: Android, Simulator
  - 2023
  - code
- AndroidEnv: A Reinforcement Learning Platform for Android
  - Daniel Toyama and Philippe Hamel and Anita Gergely and Gheorghe Comanici and Amelia Glaese and Zafarali Ahmed and Tyler Jackson and Shibl Mourad and Doina Precup
  - Key: Android, Reinforcement Learning, Simulator
  - 2021
  - code
- AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?
  - Ori Yoran and Samuel Joseph Amouyal and Chaitanya Malaviya and Ben Bogin and Ofir Press and Jonathan Berant
  - Key: Web, Realistic, Time-Consuming, Benchmark
  - 2024
  - code
- WebCanvas: Benchmarking Web Agents in Online Environments
  - Yichen Pan and Dehan Kong and Sida Zhou and Cheng Cui and Yifei Leng and Bing Jiang and Hangyu Liu and Yanyi Shang and Shuyan Zhou and Tongshuang Wu and Zhengyang Wu
  - Key: Web, Online Environments, Benchmark
  - 2024
  - code
- MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents
  - Luyuan Wang and Yongyu Deng and Yiwei Zha and Guodong Mao and Qinmin Wang and Tianchen Min and Wei Chen and Shoufa Chen
  - Key: Mobile, Benchmark
  - 2024
  - code
- VillagerAgent: A Graph-Based Multi-Agent Framework for Coordinating Complex Task Dependencies in Minecraft
  - Yubo Dong and Xukun Zhu and Zhengzhe Pan and Linchao Zhu and Yi Yang
  - Key: Vision-Language Model, Game
  - 2024
  - code
- CToolEval: A Chinese Benchmark for LLM-Powered Agent Evaluation in Real-World API Interactions
  - Zishan Guo and Yufei Huang and Deyi Xiong
  - Key: Vision-Language Model, Phone
  - 2024
  - code
- Multi-Turn Mind2Web: On the Multi-turn Instruction Following for Conversational Web Agents
  - Yang Deng and Xuan Zhang and Wenxuan Zhang and Yifei Yuan and See-Kiong Ng and Tat-Seng Chua
  - Key: Vision-Language Model, Web Tasks
  - 2024
  - code
- VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
  - Jing Yu Koh and Robert Lo and Lawrence Jang and Vikram Duvvur and Ming Chong Lim and Po-Yu Huang and Graham Neubig and Shuyan Zhou and Ruslan Salakhutdinov and Daniel Fried
  - Key: Vision-Language Model, Web Tasks
  - 2024
  - code
- Android in the Zoo: Chain-of-Action-Thought for GUI Agents
  - Jiwen Zhang and Jihao Wu and Yihua Teng and Minghui Liao and Nuo Xu and Xiao Xiao and Zhongyu Wei and Duyu Tang
  - Key: Vision-Language Model, Android, Chain-of-Action-Thought
  - 2024
  - code
- Android in the Wild: A Large-Scale Dataset for Android Device Control
  - Christopher Rawles and Alice Li and Daniel Rodriguez and Oriana Riva and Timothy Lillicrap
  - Key: Android, datasets
  - 2023
- Mind2Web: Towards a Generalist Agent for the Web
  - Xiang Deng and Yu Gu and Boyuan Zheng and Shijie Chen and Samuel Stevens and Boshi Wang and Huan Sun and Yu Su
  - Key: Web, datasets
  - 2023
  - code
- WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents
  - Shunyu Yao and Howard Chen and John Yang and Karthik Narasimhan
  - Key: Web, datasets
  - 2022
  - code
- Rico: A Mobile App Dataset for Building Data-Driven Design Applications
  - Biplab Deka and Zifeng Huang and Chad Franzen and Joshua Hibschman and Daniel Afergan and Yang Li and Jeffrey Nichols and Ranjitha Kumar
  - Key: mobile app, datasets
  - 2017
- awesome-llm-powered-agent
- Awesome-LLM-based-Web-Agent-and-Tools
- Awesome-GUI-Agent
- computer-control-agent-knowledge-base
Our purpose is to make this repo even better. If you are interested in contributing, please refer to HERE for contribution instructions.
This repository is released under the Apache 2.0 license.