Advances in Deep Learning
References
- Mixed Precision Training. Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, et al. 2017.
- GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, et al. 2018.
- GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, et al. 2020.
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He. 2019.
- LoRA: Low-Rank Adaptation of Large Language Models. Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, et al. 2021.
- 8-Bit Approximations for Parallelism in Deep Learning. Tim Dettmers. 2015.
- 8-bit Optimizers via Block-wise Quantization. Tim Dettmers, Mike Lewis, Sam Shleifer, Luke Zettlemoyer. 2021.
- The case for 4-bit precision: k-bit Inference Scaling Laws. Tim Dettmers, Luke Zettlemoyer. 2022.
- QLoRA: Efficient Finetuning of Quantized LLMs. Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer. 2023.
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection. Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, Yuandong Tian. 2024.
- Training Deep Nets with Sublinear Memory Cost. Tianqi Chen, Bing Xu, Chiyuan Zhang, Carlos Guestrin. 2016.
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré. 2022.
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. Tri Dao. 2023.
- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao. 2024.
- https://github.com/ray-project/ray
- https://github.com/Lightning-AI/pytorch-lightning
- Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. 2013.
- Generative Adversarial Networks. Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, et al. 2014.
- Large Scale GAN Training for High Fidelity Natural Image Synthesis. Andrew Brock, Jeff Donahue, Karen Simonyan. 2018.
- Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, et al. 2016.
- Variational Inference with Normalizing Flows. Danilo Jimenez Rezende, Shakir Mohamed. 2015.
- Density estimation using Real NVP. Laurent Dinh, Jascha Sohl-Dickstein, Samy Bengio. 2016.
- Glow: Generative Flow with Invertible 1x1 Convolutions. Diederik P. Kingma, Prafulla Dhariwal. 2018.
- WaveNet: A Generative Model for Raw Audio. Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, et al. 2016.
- Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer. Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, Devi Parikh. 2022.
- Lossless Image Compression through Super-Resolution. Sheng Cao, Chao-Yuan Wu, Philipp Krähenbühl. 2020.
- Practical Full Resolution Learned Lossless Image Compression. Fabian Mentzer, Eirikur Agustsson, Michael Tschannen, Radu Timofte, Luc Van Gool. 2018.
- Neural Discrete Representation Learning. Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu. 2017.
- Taming Transformers for High-Resolution Image Synthesis. Patrick Esser, Robin Rombach, Björn Ommer. 2020.
- Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation. Lijun Yu, José Lezama, Nitesh B. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, et al. 2023.
- Zero-Shot Text-to-Image Generation. Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, et al. 2021.
- Language Models are Unsupervised Multitask Learners. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever. 2019.
- Simulating 500 million years of evolution with a language model. Thomas Hayes, Roshan Rao, Halil Akin, Nicholas J. Sofroniew, Deniz Oktay, et al. 2024.
- Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. Piyush Sharma, Nan Ding, Sebastian Goodman, Radu Soricut. 2018.
- YFCC100M: The New Data in Multimedia Research. Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, et al. 2015.
- Generating Long Sequences with Sparse Transformers. Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever. 2019.
- Denoising Diffusion Probabilistic Models. Jonathan Ho, Ajay Jain, Pieter Abbeel. 2020.
- Generative Modeling by Estimating Gradients of the Data Distribution. Yang Song, Stefano Ermon. 2019.
- Diffusion Models Beat GANs on Image Synthesis. Prafulla Dhariwal, Alex Nichol. 2021.
- High-Resolution Image Synthesis with Latent Diffusion Models. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer. 2021.
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, et al. 2022.
- Hierarchical Text-Conditional Image Generation with CLIP Latents. Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen. 2022.
- CCM: Adding Conditional Controls to Text-to-Image Consistency Models. Jie Xiao, Kai Zhu, Han Zhang, Zhiheng Liu, Yujun Shen, Yu Liu, Xueyang Fu, Zheng-Jun Zha. 2023.
- Adding Conditional Control to Text-to-Image Diffusion Models. Lvmin Zhang, Anyi Rao, Maneesh Agrawala. 2023.
- One-step Diffusion with Distribution Matching Distillation. Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T. Freeman, et al. 2023.
- Diffusion Models: A Comprehensive Survey of Methods and Applications. Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yue Zhao, Wentao Zhang, Bin Cui, et al. 2022.
- PaLM: Scaling Language Modeling with Pathways. Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, et al. 2022.
- Gemini: A Family of Highly Capable Multimodal Models. Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, et al. 2023.
- Mistral 7B. Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, et al. 2023.
- Mixtral of Experts. Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, et al. 2024.
- Improving Language Understanding by Generative Pre-Training. Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever. 2018.
- Attention Is All You Need. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, et al. 2017.
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. 2018.
- Physics of Language Models: Part 3.1, Knowledge Storage and Extraction. Zeyuan Allen-Zhu, Yuanzhi Li. 2023.
- Language Models are Few-Shot Learners. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, et al. 2020.
- https://commoncrawl.org/
- The Pile: An 800GB Dataset of Diverse Text for Language Modeling. Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, et al. 2020.
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces. Albert Gu, Tri Dao. 2023.
- Efficiently Modeling Long Sequences with Structured State Spaces. Albert Gu, Karan Goel, Christopher Ré. 2021.
- The Curious Case of Neural Text Degeneration. Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, Yejin Choi. 2019.
- Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs. Minh Nguyen, Andrew Baker, Clement Neo, Allen Roush, Andreas Kirsch, Ravid Shwartz-Ziv. 2024.
- Training language models to follow instructions with human feedback. Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, et al. 2022.
- Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs. Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, et al. 2024.
- Simple statistical gradient-following algorithms for connectionist reinforcement learning. Ronald J. Williams. 1992.
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn. 2023.
- DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs. Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, Matt Gardner. 2019.
- PIQA: Reasoning about Physical Commonsense in Natural Language. Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, Yejin Choi. 2019.
- Measuring Massive Multitask Language Understanding. Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt. 2020.
- Training Verifiers to Solve Math Word Problems. Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, et al. 2021.
- WinoGrande: An Adversarial Winograd Schema Challenge at Scale. Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, Yejin Choi. 2019.
- Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, et al. 2022.
- AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models. Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, et al. 2023.
- Evaluating Large Language Models Trained on Code. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, et al. 2021.
- Program Synthesis with Large Language Models. Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, et al. 2021.
- Ring Attention with Blockwise Transformers for Near-Infinite Context. Hao Liu, Matei Zaharia, Pieter Abbeel. 2023.
- Sequence Parallelism: Long Sequence Training from System Perspective. Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, Yang You. 2021.
- Reducing Activation Recomputation in Large Transformer Models. Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, et al. 2022.
- DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs Training. Dacheng Li, Rulin Shao, Anze Xie, Eric P. Xing, Xuezhe Ma, Ion Stoica, Joseph E. Gonzalez, Hao Zhang. 2023.
- Efficient Memory Management for Large Language Model Serving with PagedAttention. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, et al. 2023.
- PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling. Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, et al. 2024.
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai. 2023.
- Fast Inference from Transformers via Speculative Decoding. Yaniv Leviathan, Matan Kalman, Yossi Matias. 2022.
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao. 2024.
- https://github.com/pytorch/torchtune
- https://github.com/vllm-project/vllm
- https://huggingface.co/models
- https://lmsys.org/
- https://ollama.com/
- https://github.com/ggerganov/llama.cpp
- Toolformer: Language Models Can Teach Themselves to Use Tools. Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, et al. 2023.
- AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls. Yu Du, Fangyun Wei, Hongyang Zhang. 2024.
- The Llama 3 Herd of Models. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, et al. 2024.
- Synchromesh: Reliable code generation from pre-trained language models. Gabriel Poesia, Oleksandr Polozov, Vu Le, Ashish Tiwari, Gustavo Soares, Christopher Meek, et al. 2022.
- Guiding LLMs The Right Way: Fast, Non-Invasive Constrained Generation. Luca Beurer-Kellner, Marc Fischer, Martin Vechev. 2024.
- Lexically Constrained Decoding for Sequence Generation Using Grid Beam Search. Chris Hokamp, Qun Liu. 2017.
- Long Context Compression with Activation Beacon. Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, Zhicheng Dou. 2024.
- RoFormer: Enhanced Transformer with Rotary Position Embedding. Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu. 2021.
- Extending Context Window of Large Language Models via Positional Interpolation. Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian. 2023.
- Reading Wikipedia to Answer Open-Domain Questions. Danqi Chen, Adam Fisch, Jason Weston, Antoine Bordes. 2017.
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, et al. 2020.
- REALM: Retrieval-Augmented Language Model Pre-Training. Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, Ming-Wei Chang. 2020.
- Improving language models by retrieving from trillions of tokens. Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, et al. 2021.
- In-Context Retrieval-Augmented Language Models. Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, Yoav Shoham. 2023.
- Vision Transformers Need Registers. Timothée Darcet, Maxime Oquab, Julien Mairal, Piotr Bojanowski. 2023.
- Massive Activations in Large Language Models. Mingjie Sun, Xinlei Chen, J. Zico Kolter, Zhuang Liu. 2024.
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, et al. 2022.
- Self-Consistency Improves Chain of Thought Reasoning in Language Models. Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, et al. 2022.
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models. Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, Karthik Narasimhan. 2023.
- ReAct: Synergizing Reasoning and Acting in Language Models. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao. 2022.
- Reflexion: Language Agents with Verbal Reinforcement Learning. Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao. 2023.
- Generative Verifiers: Reward Modeling as Next-Token Prediction. Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, Rishabh Agarwal. 2024.
- ChatGPT is bullshit. Michael Townsen Hicks, James Humphries, Joe Slater. 2024.
- Large Language Models Cannot Self-Correct Reasoning Yet. Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, Denny Zhou. 2023.
- Dissociating language and thought in large language models. Kyle Mahowald, Anna A. Ivanova, Idan A. Blank, Nancy Kanwisher, Joshua B. Tenenbaum, et al. 2023.
- Physics of Language Models: Part 1, Learning Hierarchical Language Structures. Zeyuan Allen-Zhu, Yuanzhi Li. 2023.
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, et al. 2024.
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, et al. 2025.
- Reinforcement Learning for Long-Horizon Interactive LLM Agents. Kevin Chen, Marco Cusumano-Towner, Brody Huval, Aleksei Petrenko, Jackson Hamburger, et al. 2025.
- Buy 4 REINFORCE Samples, Get a Baseline for Free! Wouter Kool, Herke van Hoof, Max Welling. 2019.
- ImageNet Classification with Deep Convolutional Neural Networks. Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton. 2012.
- The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, et al. 2018.
- Places: An Image Database for Deep Scene Understanding. Bolei Zhou, Aditya Khosla, Agata Lapedriza, Antonio Torralba, Aude Oliva. 2016.
- Revisiting Unreasonable Effectiveness of Data in Deep Learning Era. Chen Sun, Abhinav Shrivastava, Saurabh Singh, Abhinav Gupta. 2017.
- Scaling Vision Transformers. Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, Lucas Beyer. 2021.
- DINOv2: Learning Robust Visual Features without Supervision. Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, et al. 2023.
- Deep Residual Learning for Image Recognition. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. 2015.
- Identity Mappings in Deep Residual Networks. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. 2016.
- A ConvNet for the 2020s. Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie. 2022.
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, et al. 2020.
- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo. 2021.
- Learning Multiple Layers of Features from Tiny Images. Alex Krizhevsky. 2009.
- 80 Million Tiny Images: A Large Data Set for Nonparametric Object and Scene Recognition. Antonio Torralba, Rob Fergus, William T. Freeman. 2008.
- ImageNet: A Large-Scale Hierarchical Image Database. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei. 2009.
- Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. 2015.
- Microsoft COCO: Common Objects in Context. Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, et al. 2014.
- Rich feature hierarchies for accurate object detection and semantic segmentation. Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik. 2013.
- You Only Look Once: Unified, Real-Time Object Detection. Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi. 2015.
- Objects as Points. Xingyi Zhou, Dequan Wang, Philipp Krähenbühl. 2019.
- End-to-End Object Detection with Transformers. Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, et al. 2020.
- Center-based 3D Object Detection and Tracking. Tianwei Yin, Xingyi Zhou, Philipp Krähenbühl. 2020.
- Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs by Implicitly Unprojecting to 3D. Jonah Philion, Sanja Fidler. 2020.
- BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation. Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang, Huizi Mao, Daniela Rus, Song Han. 2022.
- Fully Convolutional Networks for Semantic Segmentation. Jonathan Long, Evan Shelhamer, Trevor Darrell. 2014.
- Stacked Hourglass Networks for Human Pose Estimation. Alejandro Newell, Kaiyu Yang, Jia Deng. 2016.
- Depth Pro: Sharp Monocular Metric Depth in Less Than a Second. Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, et al. 2024.
- The Cityscapes Dataset for Semantic Urban Scene Understanding. Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, et al. 2016.
- Playing for Data: Ground Truth from Computer Games. Stephan R. Richter, Vibhav Vineet, Stefan Roth, Vladlen Koltun. 2016.
- Masked-attention Mask Transformer for Universal Image Segmentation. Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar. 2021.
- Segment Anything. Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, et al. 2023.
- Mask R-CNN. Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick. 2017.
- The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes. Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulo, Peter Kontschieder. 2017.
- Free Supervision From Video Games. Philipp Krähenbühl. 2018.
- U-Net: Convolutional Networks for Biomedical Image Segmentation. Olaf Ronneberger, Philipp Fischer, Thomas Brox. 2015.
- Learning Transferable Visual Models From Natural Language Supervision. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, et al. 2021.
- Open-vocabulary Object Detection via Vision and Language Knowledge Distillation. Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, Yin Cui. 2021.
- Detecting Twenty-thousand Classes using Image-level Supervision. Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, Ishan Misra. 2022.
- Reproducible scaling laws for contrastive language-image learning. Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, et al. 2022.
- DataComp: In search of the next generation of multimodal datasets. Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, et al. 2023.
- Image Captioners Are Scalable Vision Learners Too. Michael Tschannen, Manoj Kumar, Andreas Steiner, Xiaohua Zhai, Neil Houlsby, Lucas Beyer. 2023.
- LocCa: Visual Pretraining with Location-aware Captioners. Bo Wan, Michael Tschannen, Yongqin Xian, Filip Pavetic, Ibrahim Alabdulmohsin, Xiao Wang, et al. 2024.
- Flamingo: a Visual Language Model for Few-Shot Learning. Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, et al. 2022.
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi. 2022.
- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, et al. 2023.
- Visual Instruction Tuning. Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee. 2023.
- Improved Baselines with Visual Instruction Tuning. Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee. 2023.
- Matryoshka Multimodal Models. Mu Cai, Jianwei Yang, Jianfeng Gao, Yong Jae Lee. 2024.
- CogVLM: Visual Expert for Pretrained Language Models. Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, et al. 2023.
- OtterHD: A High-Resolution Multi-modality Model. Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, Ziwei Liu. 2023.
- VILA: On Pre-training for Visual Language Models. Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, et al. 2023.
- VeCLIP: Improving CLIP Training via Visual-enriched Captions. Zhengfeng Lai, Haotian Zhang, Bowen Zhang, Wentao Wu, Haoping Bai, Aleksei Timofeev, Xianzhi Du, et al. 2023.
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, et al. 2023.
- Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks. Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, Aniruddha Kembhavi. 2022.
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi. 2023.
- Ferret: Refer and Ground Anything Anywhere at Any Granularity. Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, et al. 2023.
- Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs. Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, et al. 2024.
- Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models. Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, et al. 2024.
- Language-Image Models with 3D Understanding. Jang Hyun Cho, Boris Ivanovic, Yulong Cao, Edward Schmerling, Yue Wang, Xinshuo Weng, Boyi Li, et al. 2024.
- SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities. Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Danny Driess, Pete Florence, Dorsa Sadigh, et al. 2024.
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training. Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, et al. 2024.
- Chameleon: Mixed-Modal Early-Fusion Foundation Models. Chameleon Team. 2024.
- PaliGemma: A versatile 3B VLM for transfer. Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, et al. 2024.
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models. Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, et al. 2024.