Awesome Large Multimodal Agents

Last update: 09/25/2024

Table of Contents

Papers
- Taxonomy
  - Type Ⅰ
  - Type Ⅱ
  - Type Ⅲ
  - Type Ⅳ
  - Multi-Agent
- Application
Benchmark

Papers

Taxonomy

Type Ⅰ

CLOVA - CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update
CRAFT - CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets
ViperGPT - ViperGPT: Visual Inference via Python Execution for Reasoning Github
HuggingGPT - HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face Github
Chameleon - Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models Github
Visual ChatGPT - Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models Github
AssistGPT - AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn Github
M3 - Towards Robust Multi-Modal Reasoning via Model Selection Github
VisProgram - Visual Programming: Compositional visual reasoning without training
DDCoT - DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models Github
ASSISTGUI - ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation Github
GPT-Driver - GPT-Driver: Learning to Drive with GPT Github
LLaVA-Interactive - LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing Github
MusicAgent - MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models Github
AudioGPT - AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head Github
DroidBot-GPT - DroidBot-GPT: GPT-powered UI Automation for Android Github
GRID - GRID: A Platform for General Robot Intelligence Development Github
DEPS - Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents Github
MM-REACT - MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action Github
MuLan - MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion Github
Mobile-Agent - Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception Github
SeeAct - GPT-4V(ision) is a Generalist Web Agent, if Grounded Github

Type Ⅱ

STEVE - See and Think: Embodied Agent in Virtual Environment Github
EMMA - Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld Github
MLLM-Tool - MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning Github
LLaVA-Plus - LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills Github
GPT4Tools - GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction Github
WebWISE - WebWISE: Web Interface Control and Sequential Exploration with Large Language Models
Auto-UI - You Only Look at Screens: Multimodal Chain-of-Action Agents Github

Type Ⅲ

DoraemonGPT - DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models Github
ChatVideo - ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System Github
VideoAgent - VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding Project page

Type Ⅳ

JARVIS-1 - JARVIS-1: Open-world Multi-task Agents with Memory-Augmented Multimodal Language Models Github
AppAgent - AppAgent: Multimodal Agents as Smartphone Users Github
MM-Navigator - GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation Github
Copilot - Loop Copilot: Conducting AI Ensembles for Music Generation and Iterative Editing Github
WavJourney - WavJourney: Compositional Audio Creation with Large Language Models Github
DLAH - Drive Like a Human: Rethinking Autonomous Driving with Large Language Models Github
Cradle - Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study Github
VideoAgent - VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding Project page

Multi-Agent

MP5 - MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception Github
MemoDroid - Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation
Avis - AVIS: Autonomous Visual Information Seeking with Large Language Model Agent
Agent-Smith - Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast Github
GenAI - The Art of Storytelling: Multi-Agent Generative AI for Dynamic Multimodal Narratives Github
P2H - Propaganda to Hate: A Multimodal Analysis of Arabic Memes with Multi-Agent LLMs

Application

💡 Complex Visual Reasoning Tasks

ViperGPT - ViperGPT: Visual Inference via Python Execution for Reasoning Github
HuggingGPT - HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face Github
Chameleon - Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models Github
Visual ChatGPT - Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models Github
AssistGPT - AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn Github
LLaVA-Plus - LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills Github
GPT4Tools - GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction Github
MLLM-Tool - MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning Github
M3 - Towards Robust Multi-Modal Reasoning via Model Selection Github
VisProgram - Visual Programming: Compositional visual reasoning without training
DDCoT - DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models Github
Avis - AVIS: Autonomous Visual Information Seeking with Large Language Model Agent
CLOVA - CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update
CRAFT - CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets
MuLan - MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion Github

🎵 Audio Editing & Generation

Copilot - Loop Copilot: Conducting AI Ensembles for Music Generation and Iterative Editing Github
MusicAgent - MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models Github
AudioGPT - AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head Github
WavJourney - WavJourney: Compositional Audio Creation with Large Language Models Github
OpenOmni - OpenOmni: A Collaborative Open Source Tool for Building Future-Ready Multimodal Conversational Agents Github

🤖 Embodied AI & Robotics

JARVIS-1 - JARVIS-1: Open-world Multi-task Agents with Memory-Augmented Multimodal Language Models Github
DEPS - Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents Github
Octopus - Octopus: Embodied Vision-Language Programmer from Environmental Feedback Github
GRID - GRID: A Platform for General Robot Intelligence Development Github
MP5 - MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception Github
STEVE - See and Think: Embodied Agent in Virtual Environment Github
EMMA - Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld Github
MEIA - Multimodal Embodied Interactive Agent for Cafe Scene

🖱️💻 UI-assistants

AppAgent - AppAgent: Multimodal Agents as Smartphone Users Github
DroidBot-GPT - DroidBot-GPT: GPT-powered UI Automation for Android Github
WebWISE - WebWISE: Web Interface Control and Sequential Exploration with Large Language Models
Auto-UI - You Only Look at Screens: Multimodal Chain-of-Action Agents Github
MemoDroid - Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation
ASSISTGUI - ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation Github
MM-Navigator - GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation Github
AutoDroid - Empowering LLM to use Smartphone for Intelligent Task Automation Github
GPT-4V-Act - GPT-4V-Act: Chromium Copilot Github
Mobile-Agent - Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception Github
OpenAdapt - OpenAdapt: AI-First Process Automation with Large Multimodal Models Github
EnvDistraction - Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions Github

🎨 Visual Generation & Editing

LLaVA-Interactive - LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing Github
MM-REACT - MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action Github
SeeAct - GPT-4V(ision) is a Generalist Web Agent, if Grounded Github
GenAI - The Art of Storytelling: Multi-Agent Generative AI for Dynamic Multimodal Narratives Github
GenArtist - GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing Github

🎥 Video Understanding

DoraemonGPT - DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models Github
ChatVideo - ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System Github
AssistGPT - AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn Github
VideoAgent-M - VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding Project page
VideoAgent-L - VideoAgent: Long-form Video Understanding with Large Language Model as Agent Project page
Kubrick - Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation Github
Anim-Director - Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation Github

🚗 Autonomous Driving

GPT-Driver - GPT-Driver: Learning to Drive with GPT Github
DLAH - Drive Like a Human: Rethinking Autonomous Driving with Large Language Models Github

🎮 Game-developer

SmartPlay - SmartPlay: A Benchmark for LLMs as Intelligent Agents Github
VisualWebArena - VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks Github
Cradle - Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study Github
Cradle - Can AI Prompt Humans? Multimodal Agents Prompt Players’ Game Actions and Show Consequences to Raise Sustainability Awareness Github

Other

FinAgent - A Multimodal Foundation Agent for Financial Trading: Tool-Augmented, Diversified, and Generalist
VisionGPT - VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework
WirelessAgent - WirelessAgent: Large Language Model Agents for Intelligent Wireless Networks
PhishAgent - PhishAgent: A Robust Multimodal Agent for Phishing Webpage Detection
MMRole - MMRole: A Comprehensive Framework for Developing and Evaluating Multimodal Role-Playing Agents Github

Benchmark

SmartPlay - SmartPlay: A Benchmark for LLMs as Intelligent Agents Github
VisualWebArena - VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks Github
Mind2Web - Mind2Web: Towards a Generalist Agent for the Web Github
GAIA - GAIA: a benchmark for General AI Assistants Github
OmniACT - OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
DSBench - DSBench: How Far Are Data Science Agents to Becoming Data Science Experts? Github
GTA - GTA: A Benchmark for General Tool Agents Github
