Foundation Models Based on Modality.
Vision.
Depth Anything: HKU and TikTok, 2024 - Paper
DINOv2: Meta AI, 2024 - Paper
SAM (Segment Anything Model): Meta AI, 2023 - Paper (see the usage sketch after this group)
SAM 2: Meta FAIR, 2024 - Paper
YOLO-NAS: Deci AI, 2023 - Code
ByteTrack: ByteDance, 2022 - Paper
Grounding DINO: IDEA Research, 2024 - Paper
Grounded SAM: IDEA Research, 2024 - Paper
YOLO-World: Tencent AI Lab, 2024 - Paper
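A minimal point-prompt sketch for SAM using Meta's segment-anything package. The checkpoint filename, image path, and click coordinates below are placeholders, not values from any of the papers above:

```python
# pip install segment-anything opencv-python
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load a SAM checkpoint (variant and path are placeholders; weights come from the SAM repo).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

# SAM expects an RGB uint8 image.
image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# One foreground click at (x, y); label 1 = foreground, 0 = background.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,  # return candidate masks at several granularities
)
best_mask = masks[np.argmax(scores)]  # boolean HxW mask with the highest predicted IoU
```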
Vision-Language.
CLIP (Contrastive Language-Image Pre-training): OpenAI, 2021 - Paper (see the zero-shot sketch after this group)
EVA-CLIP: BAAI, 2023 - Paper
SigLIP: Google, 2023 - Paper
PaliGemma: Google, 2024 - Paper
Florence: Microsoft, 2021 - Paper
VLMo (Vision-Language Model): Microsoft, 2022 - Paper
FLAVA (Foundational Language And Vision Alignment): Meta AI, 2022 - Paper
MaskVLM: Amazon, 2023 - Paper
ALIGN (A Large-scale ImaGe and Noisy-text embedding): Google Research, 2021 - Paper
LLaVA (Large Language and Vision Assistant): Microsoft Research, 2023 - Paper
LLaVA-1.5: Microsoft Research, 2024 - Paper
LLaVA-NeXT: 2024 - Paper
Qwen-VL: Alibaba Group, 2023 - Paper
OWL-ViT: Google, 2022 - Paper
VLPart: Facebook Research, 2023 - Paper
CogVLM: Zhipu AI and Tsinghua University, 2023 - Paper
GPT-4V(ision): OpenAI, 2023 - Paper
MiniGPT-4: KAUST, 2023 - Paper
MiniGPT-5: UC Santa Cruz, 2024 - Paper
SpatialVLM: Google Research and DeepMind, 2024 - Paper
SpatialRGPT: 2024 - Paper
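A short zero-shot classification sketch for the CLIP entry above, via Hugging Face transformers. The image path and label set are placeholders; the checkpoint id is the publicly released openai/clip-vit-base-patch32:

```python
# pip install transformers torch pillow
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a cat", "a photo of a dog"]  # placeholder label set
inputs = processor(
    text=labels,
    images=Image.open("photo.jpg"),  # placeholder image
    return_tensors="pt",
    padding=True,
)

outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)  # image-text similarities -> probabilities
print(dict(zip(labels, probs[0].tolist())))
```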
Language.
Llama 3.1: Meta AI, 2024 - Paper
Mistral 7B: Mistral AI, 2023 - Paper
GPT-3: OpenAI, 2020 - Paper
GPT-4: OpenAI, 2023 - Paper
Gemini 1.5: Google DeepMind, 2024 - Paper
PaLM: Google Research, 2022 - Paper
Gopher: DeepMind, 2022 - Paper
BLOOM: BigScience, 2023 - Paper
Qwen: Alibaba Group, 2023 - Paper
OPT (Open Pre-trained Transformer Language Models): Meta AI, 2022 - Paper
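The decoder-only LLMs in this group share the same inference pattern; here is a minimal text-generation sketch using the openly available OPT checkpoint via transformers. The prompt and sampling settings are arbitrary:

```python
# pip install transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "facebook/opt-1.3b"  # openly released OPT checkpoint on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tokenizer("Foundation models are", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```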
Audio and Speech.
Wav2Vec: Facebook AI (now Meta AI), 2019 - Paper
Wav2Vec 2.0: Facebook AI (now Meta AI), 2020 - Paper
Speech2Text (fairseq): Facebook AI (now Meta AI), 2022 - Paper
AudioCLIP: 2021 - Paper
Whisper: OpenAI, 2022 - Paper
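A minimal transcription sketch with OpenAI's open-source whisper package; the audio filename is a placeholder:

```python
# pip install -U openai-whisper  (transcription also requires ffmpeg on the system)
import whisper

model = whisper.load_model("base")       # other sizes: tiny, small, medium, large
result = model.transcribe("speech.wav")  # placeholder audio file
print(result["text"])                    # full transcript as a single string
```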
Multimodal.
VATT (Video-Audio-Text Transformer): Google Research, 2021 - Paper
ImageBind: Meta AI, 2023 - Paper (images, text, audio, depth, thermal, and IMU data)
GPT-4o: OpenAI, 2024 - Website
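GPT-4o is served through the OpenAI API rather than as open weights; a minimal image-plus-text request with the official Python client might look like this (the image URL is a placeholder, and an OPENAI_API_KEY environment variable is assumed):

```python
# pip install openai  (expects OPENAI_API_KEY in the environment)
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},  # placeholder URL
        ],
    }],
)
print(response.choices[0].message.content)
```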
Video.
Sora: OpenAI, 2024 - Technical Report
VideoGPT: 2021 - Paper
CogVideo: Tsinghua University, 2022 - Paper
Make-A-Video: Meta AI, 2022 - Paper
Phenaki: Google Research, 2022 - Paper
PLLaVA: 2024 - Paper
Vid2Seq: Google Research, 2023 - Paper
InternVideo: Shanghai AI Laboratory, 2022 - Paper
Other.
RT-2 (Vision-Language-Action Models Transfer Web Knowledge to Robotic Control): Google DeepMind, 2023 - Paper
Perceiver: DeepMind, 2021 - Paper
DALL-E 3: OpenAI, 2023 - Website
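DALL-E 3 is likewise API-only; a minimal image-generation call with the official client might look like this (the prompt and size are arbitrary, and an OPENAI_API_KEY environment variable is assumed):

```python
# pip install openai  (expects OPENAI_API_KEY in the environment)
from openai import OpenAI

client = OpenAI()
image = client.images.generate(
    model="dall-e-3",
    prompt="a lighthouse at dawn, watercolor",  # arbitrary example prompt
    size="1024x1024",
)
print(image.data[0].url)  # URL of the generated image
```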