Advanced Training
References
- Mixed Precision Training. Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, et al. 2017.
- GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, et al. 2018.
- GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, et al. 2020.
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He. 2019.
- LoRA: Low-Rank Adaptation of Large Language Models. Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, et al. 2021.
- 8-Bit Approximations for Parallelism in Deep Learning. Tim Dettmers. 2015.
- 8-bit Optimizers via Block-wise Quantization. Tim Dettmers, Mike Lewis, Sam Shleifer, Luke Zettlemoyer. 2021.
- The case for 4-bit precision: k-bit Inference Scaling Laws. Tim Dettmers, Luke Zettlemoyer. 2022.
- QLoRA: Efficient Finetuning of Quantized LLMs. Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer. 2023.
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection. Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, Yuandong Tian. 2024.
- Training Deep Nets with Sublinear Memory Cost. Tianqi Chen, Bing Xu, Chiyuan Zhang, Carlos Guestrin. 2016.
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré. 2022.
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. Tri Dao. 2023.
- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao. 2024.
- https://github.com/ray-project/ray
- https://github.com/Lightning-AI/pytorch-lightning