Computer Vision

Title                                      | Materials | References
Computer Vision                            | Slides    |
Image Classification                       | Slides    | [1]-[15]
Object Detection                           | Slides    | [16]-[24]
Segmentation                               | Slides    | [25]-[35]
Open-Vocabulary Recognition                | Slides    | [36]-[38]
Vision Language Models - Image Captioning  | Slides    | [36], [39]-[42]
Early Vision Language Models               | Slides    | [43]-[55]
Current Vision Language Models             | Slides    | [56]-[64]

References

  1. ImageNet Classification with Deep Convolutional Neural Networks. Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton. 2012.
  2. YFCC100M: The New Data in Multimedia Research. Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, et al. 2015.
  3. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, et al. 2018.
  4. Places: An Image Database for Deep Scene Understanding. Bolei Zhou, Aditya Khosla, Agata Lapedriza, Antonio Torralba, Aude Oliva. 2016.
  5. Revisiting Unreasonable Effectiveness of Data in Deep Learning Era. Chen Sun, Abhinav Shrivastava, Saurabh Singh, Abhinav Gupta. 2017.
  6. Scaling Vision Transformers. Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, Lucas Beyer. 2021.
  7. DINOv2: Learning Robust Visual Features without Supervision. Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, et al. 2023.
  8. Deep Residual Learning for Image Recognition. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. 2015.
  9. Identity Mappings in Deep Residual Networks. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. 2016.
  10. A ConvNet for the 2020s. Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie. 2022.
  11. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, et al. 2020.
  12. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo. 2021.
  13. Learning Multiple Layers of Features from Tiny Images. Alex Krizhevsky. 2009.
  14. 80 Million Tiny Images: A Large Data Set for Nonparametric Object and Scene Recognition. Antonio Torralba, Rob Fergus, William T. Freeman. 2008.
  15. ImageNet: A Large-Scale Hierarchical Image Database. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei. 2009.
  16. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. 2015.
  17. Microsoft COCO: Common Objects in Context. Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, et al. 2014.
  18. Rich feature hierarchies for accurate object detection and semantic segmentation. Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik. 2013.
  19. You Only Look Once: Unified, Real-Time Object Detection. Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi. 2015.
  20. Objects as Points. Xingyi Zhou, Dequan Wang, Philipp Krähenbühl. 2019.
  21. End-to-End Object Detection with Transformers. Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, et al. 2020.
  22. Center-based 3D Object Detection and Tracking. Tianwei Yin, Xingyi Zhou, Philipp Krähenbühl. 2020.
  23. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs by Implicitly Unprojecting to 3D. Jonah Philion, Sanja Fidler. 2020.
  24. BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation. Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang, Huizi Mao, Daniela Rus, Song Han. 2022.
  25. Fully Convolutional Networks for Semantic Segmentation. Jonathan Long, Evan Shelhamer, Trevor Darrell. 2014.
  26. Stacked Hourglass Networks for Human Pose Estimation. Alejandro Newell, Kaiyu Yang, Jia Deng. 2016.
  27. Depth Pro: Sharp Monocular Metric Depth in Less Than a Second. Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, et al. 2024.
  28. The Cityscapes Dataset for Semantic Urban Scene Understanding. Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, et al. 2016.
  29. Playing for Data: Ground Truth from Computer Games. Stephan R. Richter, Vibhav Vineet, Stefan Roth, Vladlen Koltun. 2016.
  30. Masked-attention Mask Transformer for Universal Image Segmentation. Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar. 2021.
  31. Segment Anything. Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, et al. 2023.
  32. Mask R-CNN. Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick. 2017.
  33. The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes. Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulo, Peter Kontschieder. 2017.
  34. Free Supervision From Video Games. Philipp Krähenbühl. 2018.
  35. U-Net: Convolutional Networks for Biomedical Image Segmentation. Olaf Ronneberger, Philipp Fischer, Thomas Brox. 2015.
  36. Learning Transferable Visual Models From Natural Language Supervision. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, et al. 2021.
  37. Open-vocabulary Object Detection via Vision and Language Knowledge Distillation. Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, Yin Cui. 2021.
  38. Detecting Twenty-thousand Classes using Image-level Supervision. Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, Ishan Misra. 2022.
  39. Reproducible scaling laws for contrastive language-image learning. Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, et al. 2022.
  40. DataComp: In search of the next generation of multimodal datasets. Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, et al. 2023.
  41. Image Captioners Are Scalable Vision Learners Too. Michael Tschannen, Manoj Kumar, Andreas Steiner, Xiaohua Zhai, Neil Houlsby, Lucas Beyer. 2023.
  42. LocCa: Visual Pretraining with Location-aware Captioners. Bo Wan, Michael Tschannen, Yongqin Xian, Filip Pavetic, Ibrahim Alabdulmohsin, Xiao Wang, et al. 2024.
  43. Flamingo: a Visual Language Model for Few-Shot Learning. Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, et al. 2022.
  44. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi. 2022.
  45. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, et al. 2023.
  46. Visual Instruction Tuning. Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee. 2023.
  47. Improved Baselines with Visual Instruction Tuning. Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee. 2023.
  48. Matryoshka Multimodal Models. Mu Cai, Jianwei Yang, Jianfeng Gao, Yong Jae Lee. 2024.
  49. CogVLM: Visual Expert for Pretrained Language Models. Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, et al. 2023.
  50. OtterHD: A High-Resolution Multi-modality Model. Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, Ziwei Liu. 2023.
  51. VILA: On Pre-training for Visual Language Models. Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, et al. 2023.
  52. VeCLIP: Improving CLIP Training via Visual-enriched Captions. Zhengfeng Lai, Haotian Zhang, Bowen Zhang, Wentao Wu, Haoping Bai, Aleksei Timofeev, Xianzhi Du, et al. 2023.
  53. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, et al. 2023.
  54. Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks. Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, Aniruddha Kembhavi. 2022.
  55. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi. 2023.
  56. Ferret: Refer and Ground Anything Anywhere at Any Granularity. Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, et al. 2023.
  57. Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs. Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, et al. 2024.
  58. Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models. Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, et al. 2024.
  59. Language-Image Models with 3D Understanding. Jang Hyun Cho, Boris Ivanovic, Yulong Cao, Edward Schmerling, Yue Wang, Xinshuo Weng, Boyi Li, et al. 2024.
  60. SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities. Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Danny Driess, Pete Florence, Dorsa Sadigh, et al. 2024.
  61. MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training. Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, et al. 2024.
  62. Chameleon: Mixed-Modal Early-Fusion Foundation Models. Chameleon Team. 2024.
  63. PaliGemma: A versatile 3B VLM for transfer. Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, et al. 2024.
  64. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models. Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, et al. 2024.