References
- Flamingo: a Visual Language Model for Few-Shot Learning. Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, et al. 2022.
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi. 2022.
- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, et al. 2023.
- Visual Instruction Tuning. Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee. 2023.
- Improved Baselines with Visual Instruction Tuning. Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee. 2023.
- Matryoshka Multimodal Models. Mu Cai, Jianwei Yang, Jianfeng Gao, Yong Jae Lee. 2024.
- CogVLM: Visual Expert for Pretrained Language Models. Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, et al. 2023.
- OtterHD: A High-Resolution Multi-modality Model. Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, Ziwei Liu. 2023.
- VILA: On Pre-training for Visual Language Models. Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, et al. 2023.
- VeCLIP: Improving CLIP Training via Visual-enriched Captions. Zhengfeng Lai, Haotian Zhang, Bowen Zhang, Wentao Wu, Haoping Bai, Aleksei Timofeev, Xianzhi Du, et al. 2023.
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, et al. 2023.
- Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks. Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, Aniruddha Kembhavi. 2022.
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi. 2023.