Current Vision Language Models

Slides
Video Lecture

References

  1. Ferret: Refer and Ground Anything Anywhere at Any Granularity. Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, et al. 2023.
  2. Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs. Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, et al. 2024.
  3. Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models. Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, et al. 2024.
  4. Language-Image Models with 3D Understanding. Jang Hyun Cho, Boris Ivanovic, Yulong Cao, Edward Schmerling, Yue Wang, Xinshuo Weng, Boyi Li, et al. 2024.
  5. SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities. Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Danny Driess, Pete Florence, Dorsa Sadigh, et al. 2024.
  6. MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training. Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, et al. 2024.
  7. Chameleon: Mixed-Modal Early-Fusion Foundation Models. Chameleon Team. 2024.
  8. PaliGemma: A Versatile 3B VLM for Transfer. Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, et al. 2024.
  9. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models. Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, et al. 2024.