Advances in Deep Learning

Getting Started
Welcome to Advances in Deep Learning (Slides)
Introduction
Structures of Deep Networks (Slides)
Training Deep Networks (Slides)
Modern GPU Architectures (Slides, Materials)
Advanced Training
Training Large Models (Slides)
Mixed Precision Training (Slides) [1]
Distributed Training (Slides) [2] [3]
Zero Redundancy Training (Slides) [4]
Low-Rank Adapters (Slides) [5]
Quantization (Slides) [6] [7] [8]
Quantized Low-Rank Adapters (Slides) [9]
Low-Rank Projections (Slides) [10]
Checkpointing (Slides) [11]
FlashAttention (Slides) [12] [13] [14]
Open-Source Infrastructure for Model Training (Slides, Materials) [15] [16]
Generative Models
Generative Models (Slides)
Variational Autoencoders (Slides) [17]
Generative Adversarial Networks (Slides) [18] [19] [20]
Flow-Based Models (Slides) [21] [22] [23]
Auto-Regressive Generation (Slides) [24] [25] [26] [27]
Vector Quantization (Slides) [28] [29] [30]
DALL-E (Slides) [28] [31] [32] [33] [34] [35] [36]
Diffusion Models (Slides) [37] [38] [39]
Latent Diffusion and State-of-the-Art Models (Slides) [40] [31] [41] [42] [43] [44] [45] [46]
Which Generative Model Should I Use? (Slides)
Large Language Models
Large Language Models (Slides)
Architectures (Slides) [47] [48] [49] [50] [51] [52] [53] [54] [32] [55] [56] [57] [58] [59]
Generation (Slides, Materials) [51] [60] [61]
Instruction Tuning (Slides)
RLHF (Slides) [62] [63] [64]
DPO (Slides) [62] [65]
Tasks and Datasets (Slides) [66] [67] [68] [69] [70] [71] [72] [73] [74]
Efficient LLM Training and Inference (Slides)
Sequence Parallelism (Slides) [75] [76] [77] [78]
Paged Attention (Slides) [79] [80] [81]
Speculative Decoding (Slides) [82] [83]
Open-Source Infrastructure for LLMs (Slides) [84] [85] [86] [87] [88] [89]
Tool Use (Slides) [90] [91] [92]
Structured Outputs (Slides)
Constrained Decoding (Slides) [93] [94] [95]
Long Context (Slides) [96] [97] [98]
Retrieval Augmented Generation (Slides) [99] [100] [101] [102] [103]
Structured Dialogues (Slides) [104] [105] [55] [106] [107] [108] [109] [110] [111]
Limitations of LLMs (Slides) [112] [113] [114] [115]
Bonus - Reinforcement Learning and LLMs (Slides) [116] [117] [118] [119] [120]
Computer Vision
Computer Vision (Slides)
Image Classification (Slides) [121] [122] [123] [124] [125] [126] [127] [128] [129] [130] [131] [132] [133] [134] [135]
Object Detection (Slides) [136] [137] [138] [139] [140] [141] [142] [143] [144]
Segmentation (Slides) [145] [146] [147] [148] [149] [150] [151] [152] [153] [154] [155]
Open-Vocabulary Recognition (Slides) [156] [157] [158]
Vision Language Models - Image Captioning (Slides) [156] [159] [160] [161] [162]
Early Vision Language Models (Slides) [163] [164] [165] [166] [167] [168] [169] [170] [171] [172] [173] [174] [175]
Current Vision Language Models (Slides) [176] [177] [178] [179] [180] [181] [182] [183] [184]
End of Class
End of Class (Slides)

References

  1. Mixed Precision Training. Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, et al. 2017.
  2. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, et al. 2018.
  3. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, et al. 2020.
  4. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He. 2019.
  5. LoRA: Low-Rank Adaptation of Large Language Models. Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, et al. 2021.
  6. 8-Bit Approximations for Parallelism in Deep Learning. Tim Dettmers. 2015.
  7. 8-bit Optimizers via Block-wise Quantization. Tim Dettmers, Mike Lewis, Sam Shleifer, Luke Zettlemoyer. 2021.
  8. The case for 4-bit precision: k-bit Inference Scaling Laws. Tim Dettmers, Luke Zettlemoyer. 2022.
  9. QLoRA: Efficient Finetuning of Quantized LLMs. Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer. 2023.
  10. GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection. Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, Yuandong Tian. 2024.
  11. Training Deep Nets with Sublinear Memory Cost. Tianqi Chen, Bing Xu, Chiyuan Zhang, Carlos Guestrin. 2016.
  12. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré. 2022.
  13. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. Tri Dao. 2023.
  14. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao. 2024.
  15. https://github.com/ray-project/ray
  16. https://github.com/Lightning-AI/pytorch-lightning
  17. Auto-Encoding Variational Bayes. Diederik P Kingma, Max Welling. 2013.
  18. Generative Adversarial Networks. Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, et al. 2014.
  19. Large Scale GAN Training for High Fidelity Natural Image Synthesis. Andrew Brock, Jeff Donahue, Karen Simonyan. 2018.
  20. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, et al. 2016.
  21. Variational Inference with Normalizing Flows. Danilo Jimenez Rezende, Shakir Mohamed. 2015.
  22. Density estimation using Real NVP. Laurent Dinh, Jascha Sohl-Dickstein, Samy Bengio. 2016.
  23. Glow: Generative Flow with Invertible 1x1 Convolutions. Diederik P. Kingma, Prafulla Dhariwal. 2018.
  24. WaveNet: A Generative Model for Raw Audio. Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, et al. 2016.
  25. Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer. Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, Devi Parikh. 2022.
  26. Lossless Image Compression through Super-Resolution. Sheng Cao, Chao-Yuan Wu, Philipp Krähenbühl. 2020.
  27. Practical Full Resolution Learned Lossless Image Compression. Fabian Mentzer, Eirikur Agustsson, Michael Tschannen, Radu Timofte, Luc Van Gool. 2018.
  28. Neural Discrete Representation Learning. Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu. 2017.
  29. Taming Transformers for High-Resolution Image Synthesis. Patrick Esser, Robin Rombach, Björn Ommer. 2020.
  30. Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation. Lijun Yu, José Lezama, Nitesh B. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, et al. 2023.
  31. Zero-Shot Text-to-Image Generation. Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, et al. 2021.
  32. Language Models are Unsupervised Multitask Learners. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever. 2019.
  33. Simulating 500 million years of evolution with a language model. Thomas Hayes, Roshan Rao, Halil Akin, Nicholas J. Sofroniew, Deniz Oktay, et al. 2024.
  34. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. Piyush Sharma, Nan Ding, Sebastian Goodman, Radu Soricut. 2018.
  35. YFCC100M: The New Data in Multimedia Research. Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, et al. 2015.
  36. Generating Long Sequences with Sparse Transformers. Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever. 2019.
  37. Denoising Diffusion Probabilistic Models. Jonathan Ho, Ajay Jain, Pieter Abbeel. 2020.
  38. Generative Modeling by Estimating Gradients of the Data Distribution. Yang Song, Stefano Ermon. 2019.
  39. Diffusion Models Beat GANs on Image Synthesis. Prafulla Dhariwal, Alex Nichol. 2021.
  40. High-Resolution Image Synthesis with Latent Diffusion Models. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer. 2021.
  41. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, et al. 2022.
  42. Hierarchical Text-Conditional Image Generation with CLIP Latents. Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen. 2022.
  43. CCM: Adding Conditional Controls to Text-to-Image Consistency Models. Jie Xiao, Kai Zhu, Han Zhang, Zhiheng Liu, Yujun Shen, Yu Liu, Xueyang Fu, Zheng-Jun Zha. 2023.
  44. Adding Conditional Control to Text-to-Image Diffusion Models. Lvmin Zhang, Anyi Rao, Maneesh Agrawala. 2023.
  45. One-step Diffusion with Distribution Matching Distillation. Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T. Freeman, et al. 2023.
  46. Diffusion Models: A Comprehensive Survey of Methods and Applications. Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yue Zhao, Wentao Zhang, Bin Cui, et al. 2022.
  47. PaLM: Scaling Language Modeling with Pathways. Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, et al. 2022.
  48. Gemini: A Family of Highly Capable Multimodal Models. Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, et al. 2023.
  49. Mistral 7B. Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, et al. 2023.
  50. Mixtral of Experts. Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, et al. 2024.
  51. Improving Language Understanding by Generative Pretraining. Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever. 2018.
  52. Attention Is All You Need. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, et al. 2017.
  53. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. 2018.
  54. Physics of Language Models: Part 3.1, Knowledge Storage and Extraction. Zeyuan Allen-Zhu, Yuanzhi Li. 2023.
  55. Language Models are Few-Shot Learners. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, et al. 2020.
  56. https://commoncrawl.org/
  57. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, et al. 2020.
  58. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. Albert Gu, Tri Dao. 2023.
  59. Efficiently Modeling Long Sequences with Structured State Spaces. Albert Gu, Karan Goel, Christopher Ré. 2021.
  60. The Curious Case of Neural Text Degeneration. Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, Yejin Choi. 2019.
  61. Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs. Minh Nguyen, Andrew Baker, Clement Neo, Allen Roush, Andreas Kirsch, Ravid Shwartz-Ziv. 2024.
  62. Training language models to follow instructions with human feedback. Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, et al. 2022.
  63. Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs. Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, et al. 2024.
  64. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Ronald J. Williams. 1992.
  65. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn. 2023.
  66. DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs. Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, Matt Gardner. 2019.
  67. PIQA: Reasoning about Physical Commonsense in Natural Language. Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, Yejin Choi. 2019.
  68. Measuring Massive Multitask Language Understanding. Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt. 2020.
  69. Training Verifiers to Solve Math Word Problems. Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, et al. 2021.
  70. WinoGrande: An Adversarial Winograd Schema Challenge at Scale. Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, Yejin Choi. 2019.
  71. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, et al. 2022.
  72. AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models. Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, et al. 2023.
  73. Evaluating Large Language Models Trained on Code. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, et al. 2021.
  74. Program Synthesis with Large Language Models. Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, et al. 2021.
  75. Ring Attention with Blockwise Transformers for Near-Infinite Context. Hao Liu, Matei Zaharia, Pieter Abbeel. 2023.
  76. Sequence Parallelism: Long Sequence Training from System Perspective. Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, Yang You. 2021.
  77. Reducing Activation Recomputation in Large Transformer Models. Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, et al. 2022.
  78. DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs Training. Dacheng Li, Rulin Shao, Anze Xie, Eric P. Xing, Xuezhe Ma, Ion Stoica, Joseph E. Gonzalez, Hao Zhang. 2023.
  79. Efficient Memory Management for Large Language Model Serving with PagedAttention. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, et al. 2023.
  80. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling. Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, et al. 2024.
  81. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai. 2023.
  82. Fast Inference from Transformers via Speculative Decoding. Yaniv Leviathan, Matan Kalman, Yossi Matias. 2022.
  83. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao. 2024.
  84. https://github.com/pytorch/torchtune
  85. https://github.com/vllm-project/vllm
  86. https://huggingface.co/models
  87. https://lmsys.org/
  88. https://ollama.com/
  89. https://github.com/ggerganov/llama.cpp
  90. Toolformer: Language Models Can Teach Themselves to Use Tools. Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, et al. 2023.
  91. AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls. Yu Du, Fangyun Wei, Hongyang Zhang. 2024.
  92. The Llama 3 Herd of Models. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, et al. 2024.
  93. Synchromesh: Reliable code generation from pre-trained language models. Gabriel Poesia, Oleksandr Polozov, Vu Le, Ashish Tiwari, Gustavo Soares, Christopher Meek, et al. 2022.
  94. Guiding LLMs The Right Way: Fast, Non-Invasive Constrained Generation. Luca Beurer-Kellner, Marc Fischer, Martin Vechev. 2024.
  95. Lexically Constrained Decoding for Sequence Generation Using Grid Beam Search. Chris Hokamp, Qun Liu. 2017.
  96. Long Context Compression with Activation Beacon. Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, Zhicheng Dou. 2024.
  97. RoFormer: Enhanced Transformer with Rotary Position Embedding. Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu. 2021.
  98. Extending Context Window of Large Language Models via Positional Interpolation. Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian. 2023.
  99. Reading Wikipedia to Answer Open-Domain Questions. Danqi Chen, Adam Fisch, Jason Weston, Antoine Bordes. 2017.
  100. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, et al. 2020.
  101. REALM: Retrieval-Augmented Language Model Pre-Training. Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, Ming-Wei Chang. 2020.
  102. Improving language models by retrieving from trillions of tokens. Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, et al. 2021.
  103. In-Context Retrieval-Augmented Language Models. Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, Yoav Shoham. 2023.
  104. Vision Transformers Need Registers. Timothée Darcet, Maxime Oquab, Julien Mairal, Piotr Bojanowski. 2023.
  105. Massive Activations in Large Language Models. Mingjie Sun, Xinlei Chen, J. Zico Kolter, Zhuang Liu. 2024.
  106. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, et al. 2022.
  107. Self-Consistency Improves Chain of Thought Reasoning in Language Models. Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, et al. 2022.
  108. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, Karthik Narasimhan. 2023.
  109. ReAct: Synergizing Reasoning and Acting in Language Models. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao. 2022.
  110. Reflexion: Language Agents with Verbal Reinforcement Learning. Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao. 2023.
  111. Generative Verifiers: Reward Modeling as Next-Token Prediction. Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, Rishabh Agarwal. 2024.
  112. ChatGPT is bullshit. Michael Townsen Hicks, James Humphries, Joe Slater. 2024.
  113. Large Language Models Cannot Self-Correct Reasoning Yet. Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, Denny Zhou. 2023.
  114. Dissociating language and thought in large language models. Kyle Mahowald, Anna A. Ivanova, Idan A. Blank, Nancy Kanwisher, Joshua B. Tenenbaum, et al. 2023.
  115. Physics of Language Models: Part 1, Learning Hierarchical Language Structures. Zeyuan Allen-Zhu, Yuanzhi Li. 2023.
  116. Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs. Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, et al. 2024.
  117. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, et al. 2024.
  118. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, et al. 2025.
  119. Reinforcement Learning for Long-Horizon Interactive LLM Agents. Kevin Chen, Marco Cusumano-Towner, Brody Huval, Aleksei Petrenko, Jackson Hamburger, et al. 2025.
  120. Buy 4 REINFORCE Samples, Get a Baseline for Free! Wouter Kool, Herke van Hoof, Max Welling. 2019.
  121. ImageNet Classification with Deep Convolutional Neural Networks. Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton. 2012.
  122. YFCC100M: The New Data in Multimedia Research. Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, et al. 2015.
  123. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, et al. 2018.
  124. Places: An Image Database for Deep Scene Understanding. Bolei Zhou, Aditya Khosla, Agata Lapedriza, Antonio Torralba, Aude Oliva. 2016.
  125. Revisiting Unreasonable Effectiveness of Data in Deep Learning Era. Chen Sun, Abhinav Shrivastava, Saurabh Singh, Abhinav Gupta. 2017.
  126. Scaling Vision Transformers. Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, Lucas Beyer. 2021.
  127. DINOv2: Learning Robust Visual Features without Supervision. Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, et al. 2023.
  128. Deep Residual Learning for Image Recognition. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. 2015.
  129. Identity Mappings in Deep Residual Networks. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. 2016.
  130. A ConvNet for the 2020s. Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie. 2022.
  131. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, et al. 2020.
  132. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo. 2021.
  133. Learning Multiple Layers of Features from Tiny Images. Alex Krizhevsky. 2009.
  134. 80 Million Tiny Images: A Large Data Set for Nonparametric Object and Scene Recognition. Antonio Torralba, Rob Fergus, William T. Freeman. 2008.
  135. ImageNet: A Large-Scale Hierarchical Image Database. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei. 2009.
  136. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. 2015.
  137. Microsoft COCO: Common Objects in Context. Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, et al. 2014.
  138. Rich feature hierarchies for accurate object detection and semantic segmentation. Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik. 2013.
  139. You Only Look Once: Unified, Real-Time Object Detection. Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi. 2015.
  140. Objects as Points. Xingyi Zhou, Dequan Wang, Philipp Krähenbühl. 2019.
  141. End-to-End Object Detection with Transformers. Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, et al. 2020.
  142. Center-based 3D Object Detection and Tracking. Tianwei Yin, Xingyi Zhou, Philipp Krähenbühl. 2020.
  143. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs by Implicitly Unprojecting to 3D. Jonah Philion, Sanja Fidler. 2020.
  144. BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation. Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang, Huizi Mao, Daniela Rus, Song Han. 2022.
  145. Fully Convolutional Networks for Semantic Segmentation. Jonathan Long, Evan Shelhamer, Trevor Darrell. 2014.
  146. Stacked Hourglass Networks for Human Pose Estimation. Alejandro Newell, Kaiyu Yang, Jia Deng. 2016.
  147. Depth Pro: Sharp Monocular Metric Depth in Less Than a Second. Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, et al. 2024.
  148. The Cityscapes Dataset for Semantic Urban Scene Understanding. Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, et al. 2016.
  149. Playing for Data: Ground Truth from Computer Games. Stephan R. Richter, Vibhav Vineet, Stefan Roth, Vladlen Koltun. 2016.
  150. Masked-attention Mask Transformer for Universal Image Segmentation. Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar. 2021.
  151. Segment Anything. Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, et al. 2023.
  152. Mask R-CNN. Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick. 2017.
  153. The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes. Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulo, Peter Kontschieder. 2017.
  154. Free Supervision From Video Games. Philipp Krähenbühl. 2018.
  155. U-Net: Convolutional Networks for Biomedical Image Segmentation. Olaf Ronneberger, Philipp Fischer, Thomas Brox. 2015.
  156. Learning Transferable Visual Models From Natural Language Supervision. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, et al. 2021.
  157. Open-vocabulary Object Detection via Vision and Language Knowledge Distillation. Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, Yin Cui. 2021.
  158. Detecting Twenty-thousand Classes using Image-level Supervision. Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, Ishan Misra. 2022.
  159. Reproducible scaling laws for contrastive language-image learning. Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, et al. 2022.
  160. DataComp: In search of the next generation of multimodal datasets. Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, et al. 2023.
  161. Image Captioners Are Scalable Vision Learners Too. Michael Tschannen, Manoj Kumar, Andreas Steiner, Xiaohua Zhai, Neil Houlsby, Lucas Beyer. 2023.
  162. LocCa: Visual Pretraining with Location-aware Captioners. Bo Wan, Michael Tschannen, Yongqin Xian, Filip Pavetic, Ibrahim Alabdulmohsin, Xiao Wang, et al. 2024.
  163. Flamingo: a Visual Language Model for Few-Shot Learning. Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, et al. 2022.
  164. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi. 2022.
  165. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, et al. 2023.
  166. Visual Instruction Tuning. Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee. 2023.
  167. Improved Baselines with Visual Instruction Tuning. Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee. 2023.
  168. Matryoshka Multimodal Models. Mu Cai, Jianwei Yang, Jianfeng Gao, Yong Jae Lee. 2024.
  169. CogVLM: Visual Expert for Pretrained Language Models. Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, et al. 2023.
  170. OtterHD: A High-Resolution Multi-modality Model. Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, Ziwei Liu. 2023.
  171. VILA: On Pre-training for Visual Language Models. Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, et al. 2023.
  172. VeCLIP: Improving CLIP Training via Visual-enriched Captions. Zhengfeng Lai, Haotian Zhang, Bowen Zhang, Wentao Wu, Haoping Bai, Aleksei Timofeev, Xianzhi Du, et al. 2023.
  173. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, et al. 2023.
  174. Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks. Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, Aniruddha Kembhavi. 2022.
  175. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi. 2023.
  176. Ferret: Refer and Ground Anything Anywhere at Any Granularity. Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, et al. 2023.
  177. Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs. Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, et al. 2024.
  178. Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models. Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, et al. 2024.
  179. Language-Image Models with 3D Understanding. Jang Hyun Cho, Boris Ivanovic, Yulong Cao, Edward Schmerling, Yue Wang, Xinshuo Weng, Boyi Li, et al. 2024.
  180. SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities. Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Danny Driess, Pete Florence, Dorsa Sadigh, et al. 2024.
  181. MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training. Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, et al. 2024.
  182. Chameleon: Mixed-Modal Early-Fusion Foundation Models. Chameleon Team. 2024.
  183. PaliGemma: A versatile 3B VLM for transfer. Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, et al. 2024.
  184. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models. Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, et al. 2024.