[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V-level capabilities and beyond.
🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
Code and models for NExT-GPT: Any-to-Any Multimodal Large Language Model
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
An open-source implementation for training LLaVA-NeXT.
(AAAI 2024) BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions
[CVPR'24] RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
🧘🏻‍♂️ KarmaVLM (相生): A family of high-efficiency, powerful visual language models.
Multimodal Instruction Tuning for Llama 3
Build a simple, basic multimodal large model from scratch. 🤖 (See the toy architecture sketch after this list.)
[ACM MMGR '24] 🔍 Shotluck Holmes: A family of small-scale LLVMs for shot-level video understanding
PyTorch implementation of OpenAI's CLIP model for image classification, visual search, and visual question answering (VQA). (A zero-shot classification sketch follows this list.)
Efficient Video Question Answering
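For the "from scratch" entry above, here is a toy sketch of the LLaVA-style recipe such repositories typically follow: a small projector maps frozen vision-encoder features into a language model's embedding space, and the projected image tokens are prepended to the text tokens. All dimensions, module names, and the tiny Transformer trunk below are illustrative assumptions, not any listed repository's actual code.

```python
# Toy LLaVA-style multimodal LM: vision features -> projector -> LM token space.
# Every dimension and name here is an illustrative assumption.
import torch
import torch.nn as nn

class TinyMultimodalLM(nn.Module):
    def __init__(self, vision_dim=768, lm_dim=512, vocab_size=32000):
        super().__init__()
        # Projector: maps frozen vision-encoder features into the LM embedding space.
        self.projector = nn.Linear(vision_dim, lm_dim)
        self.tok_embed = nn.Embedding(vocab_size, lm_dim)
        # Stand-in for a decoder-only LM trunk (causal mask applied in forward).
        layer = nn.TransformerEncoderLayer(d_model=lm_dim, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(lm_dim, vocab_size)

    def forward(self, vision_feats, input_ids):
        # vision_feats: (B, num_patches, vision_dim) from a frozen vision encoder
        img_tokens = self.projector(vision_feats)           # (B, P, lm_dim)
        txt_tokens = self.tok_embed(input_ids)              # (B, T, lm_dim)
        seq = torch.cat([img_tokens, txt_tokens], dim=1)    # image tokens first
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1)).to(seq.device)
        return self.lm_head(self.trunk(seq, mask=mask))     # (B, P+T, vocab_size)

model = TinyMultimodalLM()
logits = model(torch.randn(1, 16, 768), torch.randint(0, 32000, (1, 8)))
print(logits.shape)  # torch.Size([1, 24, 32000])
```

In real implementations the projector is trained first while both the vision encoder and LM stay frozen, then the LM is unfrozen for instruction tuning.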
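And for the CLIP entry, a minimal zero-shot image classification sketch using the original openai/clip package (an assumption; the listed repository's own API may differ). The model name, labels, and image path are placeholders.

```python
# Zero-shot classification with openai/clip (pip install git+https://github.com/openai/CLIP.git).
# "example.jpg" and the label prompts are placeholder assumptions.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    # Forward pass returns image->text and text->image similarity logits.
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```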