Computer Vision
References
ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton 2012 YFCC100M: The New Data in Multimedia Research Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, etal. 2015 The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, etal. 2018 Places: An Image Database for Deep Scene Understanding Bolei Zhou, Aditya Khosla, Agata Lapedriza, Antonio Torralba, Aude Oliva 2016 Revisiting Unreasonable Effectiveness of Data in Deep Learning Era Chen Sun, Abhinav Shrivastava, Saurabh Singh, Abhinav Gupta 2017 Scaling Vision Transformers Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, Lucas Beyer 2021 DINOv2: Learning Robust Visual Features without Supervision Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, etal. 2023 Deep Residual Learning for Image Recognition Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun 2015 Identity Mappings in Deep Residual Networks Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun 2016 A ConvNet for the 2020s Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie 2022 An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, etal. 2020 Swin Transformer: Hierarchical Vision Transformer using Shifted Windows Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo 2021 Learning Multiple Layers of Features from Tiny Images Alex Krizhevsky 2009 80 Million Tiny Images: A Large Data Set for Nonparametric Object and Scene Recognition Antonio Torralba, Rob Fergus, William T. Freeman 2008 ImageNet: A Large-Scale Hierarchical Image Database Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei 2009 Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun 2015 Microsoft COCO: Common Objects in Context Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, etal. 2014 Rich feature hierarchies for accurate object detection and semantic segmentation Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik 2013 You Only Look Once: Unified, Real-Time Object Detection Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi 2015 Objects as Points Xingyi Zhou, Dequan Wang, Philipp Krähenbühl 2019 End-to-End Object Detection with Transformers Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, etal. 2020 Center-based 3D Object Detection and Tracking Tianwei Yin, Xingyi Zhou, Philipp Krähenbühl 2020 Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs by Implicitly Unprojecting to 3D Jonah Philion, Sanja Fidler 2020 BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang, Huizi Mao, Daniela Rus, Song Han 2022 Fully Convolutional Networks for Semantic Segmentation Jonathan Long, Evan Shelhamer, Trevor Darrell 2014 Stacked Hourglass Networks for Human Pose Estimation Alejandro Newell, Kaiyu Yang, Jia Deng 2016 Depth Pro: Sharp Monocular Metric Depth in Less Than a Second Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, etal. 2024 The Cityscapes Dataset for Semantic Urban Scene Understanding Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, etal. 2016 Playing for Data: Ground Truth from Computer Games Stephan R. Richter, Vibhav Vineet, Stefan Roth, Vladlen Koltun 2016 Masked-attention Mask Transformer for Universal Image Segmentation Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar 2021 Segment Anything Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, etal. 2023 Mask R-CNN Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick 2017 The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulo, Peter Kontschieder 2017 Free Supervision From Video Games Philipp Krähenbühl 2018 U-Net: Convolutional Networks for Biomedical Image Segmentation Olaf Ronneberger, Philipp Fischer, Thomas Brox 2015 Learning Transferable Visual Models From Natural Language Supervision Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, etal. 2021 Open-vocabulary Object Detection via Vision and Language Knowledge Distillation Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, Yin Cui 2021 Detecting Twenty-thousand Classes using Image-level Supervision Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, Ishan Misra 2022 Reproducible scaling laws for contrastive language-image learning Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, etal. 2022 DataComp: In search of the next generation of multimodal datasets Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, etal. 2023 Image Captioners Are Scalable Vision Learners Too Michael Tschannen, Manoj Kumar, Andreas Steiner, Xiaohua Zhai, Neil Houlsby, Lucas Beyer 2023 LocCa: Visual Pretraining with Location-aware Captioners Bo Wan, Michael Tschannen, Yongqin Xian, Filip Pavetic, Ibrahim Alabdulmohsin, Xiao Wang, etal. 2024 Flamingo: a Visual Language Model for Few-Shot Learning Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, etal. 2022 BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi 2022 InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, etal. 2023 Visual Instruction Tuning Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee 2023 Improved Baselines with Visual Instruction Tuning Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee 2023 Matryoshka Multimodal Models Mu Cai, Jianwei Yang, Jianfeng Gao, Yong Jae Lee 2024 CogVLM: Visual Expert for Pretrained Language Models Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, etal. 2023 OtterHD: A High-Resolution Multi-modality Model Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, Ziwei Liu 2023 VILA: On Pre-training for Visual Language Models Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, etal. 2023 VeCLIP: Improving CLIP Training via Visual-enriched Captions Zhengfeng Lai, Haotian Zhang, Bowen Zhang, Wentao Wu, Haoping Bai, Aleksei Timofeev, Xianzhi Du, etal. 2023 Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, etal. 2023 Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, Aniruddha Kembhavi 2022 BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi 2023 Ferret: Refer and Ground Anything Anywhere at Any Granularity Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, etal. 2023 Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, etal. 2024 Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, etal. 2024 Language-Image Models with 3D Understanding Jang Hyun Cho, Boris Ivanovic, Yulong Cao, Edward Schmerling, Yue Wang, Xinshuo Weng, Boyi Li, etal. 2024 SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Danny Driess, Pete Florence, Dorsa Sadigh, etal. 2024 MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, etal. 2024 Chameleon: Mixed-Modal Early-Fusion Foundation Models Chameleon Team 2024 PaliGemma: A versatile 3B VLM for transfer Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, etal. 2024 Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, etal. 2024