Current Vision Language Models

Slides
Video Lecture

References

  1. Ferret: Refer and Ground Anything Anywhere at Any Granularity. Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, et al. 2023.
  2. Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs. Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, et al. 2024.
  3. Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models. Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, et al. 2024.
  4. Language-Image Models with 3D Understanding. Jang Hyun Cho, Boris Ivanovic, Yulong Cao, Edward Schmerling, Yue Wang, Xinshuo Weng, Boyi Li, et al. 2024.
  5. SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities. Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Danny Driess, Pete Florence, Dorsa Sadigh, et al. 2024.
  6. MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training. Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, et al. 2024.
  7. Chameleon: Mixed-Modal Early-Fusion Foundation Models. Chameleon Team. 2024.
  8. PaliGemma: A Versatile 3B VLM for Transfer. Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, et al. 2024.
  9. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models. Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, et al. 2024.