Computer Vision
References
- ImageNet Classification with Deep Convolutional Neural Networks - Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton - 2012 
- YFCC100M: The New Data in Multimedia Research - Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, et al. - 2015
- The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale - Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, et al. - 2018
- Places: An Image Database for Deep Scene Understanding - Bolei Zhou, Aditya Khosla, Agata Lapedriza, Antonio Torralba, Aude Oliva - 2016 
- Revisiting Unreasonable Effectiveness of Data in Deep Learning Era - Chen Sun, Abhinav Shrivastava, Saurabh Singh, Abhinav Gupta - 2017 
- Scaling Vision Transformers - Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, Lucas Beyer - 2021 
- DINOv2: Learning Robust Visual Features without Supervision - Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, et al. - 2023
- Deep Residual Learning for Image Recognition - Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun - 2015 
- Identity Mappings in Deep Residual Networks - Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun - 2016 
- A ConvNet for the 2020s - Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie - 2022 
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale - Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, et al. - 2020
- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows - Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo - 2021 
- Learning Multiple Layers of Features from Tiny Images - Alex Krizhevsky - 2009 
- 80 Million Tiny Images: A Large Data Set for Nonparametric Object and Scene Recognition - Antonio Torralba, Rob Fergus, William T. Freeman - 2008 
- ImageNet: A Large-Scale Hierarchical Image Database - Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei - 2009 
- Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks - Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun - 2015 
- Microsoft COCO: Common Objects in Context - Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, et al. - 2014
- Rich feature hierarchies for accurate object detection and semantic segmentation - Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik - 2013 
- You Only Look Once: Unified, Real-Time Object Detection - Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi - 2015 
- Objects as Points - Xingyi Zhou, Dequan Wang, Philipp Krähenbühl - 2019 
- End-to-End Object Detection with Transformers - Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, et al. - 2020
- Center-based 3D Object Detection and Tracking - Tianwei Yin, Xingyi Zhou, Philipp Krähenbühl - 2020 
- Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs by Implicitly Unprojecting to 3D - Jonah Philion, Sanja Fidler - 2020 
- BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation - Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang, Huizi Mao, Daniela Rus, Song Han - 2022 
- Fully Convolutional Networks for Semantic Segmentation - Jonathan Long, Evan Shelhamer, Trevor Darrell - 2014 
- Stacked Hourglass Networks for Human Pose Estimation - Alejandro Newell, Kaiyu Yang, Jia Deng - 2016 
- Depth Pro: Sharp Monocular Metric Depth in Less Than a Second - Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, et al. - 2024
- The Cityscapes Dataset for Semantic Urban Scene Understanding - Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, et al. - 2016
- Playing for Data: Ground Truth from Computer Games - Stephan R. Richter, Vibhav Vineet, Stefan Roth, Vladlen Koltun - 2016 
- Masked-attention Mask Transformer for Universal Image Segmentation - Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar - 2021 
- Segment Anything - Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, et al. - 2023
- Mask R-CNN - Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick - 2017 
- The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes - Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulò, Peter Kontschieder - 2017
- Free Supervision From Video Games - Philipp Krähenbühl - 2018 
- U-Net: Convolutional Networks for Biomedical Image Segmentation - Olaf Ronneberger, Philipp Fischer, Thomas Brox - 2015 
- Learning Transferable Visual Models From Natural Language Supervision - Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, et al. - 2021
- Open-vocabulary Object Detection via Vision and Language Knowledge Distillation - Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, Yin Cui - 2021 
- Detecting Twenty-thousand Classes using Image-level Supervision - Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, Ishan Misra - 2022 
- Reproducible scaling laws for contrastive language-image learning - Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, et al. - 2022
- DataComp: In search of the next generation of multimodal datasets - Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, et al. - 2023
- Image Captioners Are Scalable Vision Learners Too - Michael Tschannen, Manoj Kumar, Andreas Steiner, Xiaohua Zhai, Neil Houlsby, Lucas Beyer - 2023 
- LocCa: Visual Pretraining with Location-aware Captioners - Bo Wan, Michael Tschannen, Yongqin Xian, Filip Pavetic, Ibrahim Alabdulmohsin, Xiao Wang, et al. - 2024
- Flamingo: a Visual Language Model for Few-Shot Learning - Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, et al. - 2022
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation - Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi - 2022 
- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning - Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, et al. - 2023
- Visual Instruction Tuning - Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee - 2023 
- Improved Baselines with Visual Instruction Tuning - Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee - 2023 
- Matryoshka Multimodal Models - Mu Cai, Jianwei Yang, Jianfeng Gao, Yong Jae Lee - 2024 
- CogVLM: Visual Expert for Pretrained Language Models - Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, et al. - 2023
- OtterHD: A High-Resolution Multi-modality Model - Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, Ziwei Liu - 2023 
- VILA: On Pre-training for Visual Language Models - Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, et al. - 2023
- VeCLIP: Improving CLIP Training via Visual-enriched Captions - Zhengfeng Lai, Haotian Zhang, Bowen Zhang, Wentao Wu, Haoping Bai, Aleksei Timofeev, Xianzhi Du, et al. - 2023
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond - Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, et al. - 2023
- Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks - Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, Aniruddha Kembhavi - 2022 
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models - Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi - 2023 
- Ferret: Refer and Ground Anything Anywhere at Any Granularity - Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, et al. - 2023
- Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs - Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, et al. - 2024
- Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models - Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, et al. - 2024
- Language-Image Models with 3D Understanding - Jang Hyun Cho, Boris Ivanovic, Yulong Cao, Edward Schmerling, Yue Wang, Xinshuo Weng, Boyi Li, et al. - 2024
- SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities - Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Danny Driess, Pete Florence, Dorsa Sadigh, et al. - 2024
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training - Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, et al. - 2024
- Chameleon: Mixed-Modal Early-Fusion Foundation Models - Chameleon Team - 2024 
- PaliGemma: A versatile 3B VLM for transfer - Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, et al. - 2024
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models - Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, et al. - 2024