Advanced Training

Title | Materials | References
Training Large Models | Slides |
Mixed Precision Training | Slides | [1]
Distributed Training | Slides | [2] [3]
Zero Redundancy Training | Slides | [4]
Low-Rank Adapters | Slides | [5]
Quantization | Slides | [6] [7] [8]
Quantized Low-Rank Adapters | Slides | [9]
Low-Rank Projections | Slides | [10]
Checkpointing | Slides | [11]
FlashAttention | Slides | [12] [13] [14]
Open-Source Infrastructure for Model Training | Slides, Materials | [15] [16]
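A few of the topics above lend themselves to short code sketches. First, the automatic mixed precision recipe from the Mixed Precision Training lecture [1]: a minimal training loop, assuming PyTorch on a CUDA device; the model, batch, and hyperparameters are placeholders.

```python
# Minimal automatic mixed precision (AMP) loop sketch; assumes PyTorch and a CUDA device.
import torch
import torch.nn as nn

model = nn.Linear(512, 10).cuda()             # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()          # scales the loss to avoid fp16 gradient underflow

for step in range(100):
    x = torch.randn(32, 512, device="cuda")   # placeholder batch
    y = torch.randint(0, 10, (32,), device="cuda")

    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(dtype=torch.float16):  # run eligible ops in fp16
        loss = nn.functional.cross_entropy(model(x), y)

    scaler.scale(loss).backward()             # backward on the scaled loss
    scaler.step(optimizer)                    # unscales grads; skips the step on inf/nan
    scaler.update()                           # adjusts the loss scale for the next step
```

Master weights stay in fp32 inside the optimizer; only the forward/backward compute runs in reduced precision.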
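For Low-Rank Adapters [5], a minimal LoRA-style wrapper around a frozen linear layer; the rank, scaling, and initialization below are illustrative choices, not the reference implementation.

```python
# LoRA sketch: train a low-rank update B @ A on top of a frozen pretrained weight.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)      # freeze the pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Effective weight is W + (alpha / r) * B @ A; only A and B are trained.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at step 0
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(1024, 1024))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only adapter params train
```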
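For Quantization [6] [7] [8], a simplified sketch of block-wise absmax int8 quantization; the block size and clamping are illustrative, and production libraries handle outliers, data types, and kernels far more carefully.

```python
# Block-wise absmax int8 quantization sketch (simplified illustration of the idea).
import torch

def quantize_blockwise(w: torch.Tensor, block_size: int = 64):
    flat = w.reshape(-1, block_size)                      # assumes numel is divisible by block_size
    scale = flat.abs().amax(dim=1, keepdim=True) / 127    # one scale per block
    scale = scale.clamp_min(1e-8)                         # guard against all-zero blocks
    q = torch.clamp((flat / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_blockwise(q: torch.Tensor, scale: torch.Tensor, shape):
    return (q.to(torch.float32) * scale).reshape(shape)

w = torch.randn(256, 256)
q, scale = quantize_blockwise(w)
w_hat = dequantize_blockwise(q, scale, w.shape)
print((w - w_hat).abs().max())   # quantization error is bounded by the per-block scale
```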
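For Checkpointing [11], a sketch of activation (gradient) checkpointing with torch.utils.checkpoint, which trades recomputation in the backward pass for activation memory; the block architecture is a placeholder.

```python
# Activation checkpointing sketch: store only block inputs, recompute activations in backward.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.net(x)

blocks = nn.ModuleList([Block() for _ in range(12)])
x = torch.randn(8, 512, requires_grad=True)

h = x
for block in blocks:
    # Intermediate activations inside each block are not kept; they are recomputed on backward.
    h = checkpoint(block, h, use_reentrant=False)

h.sum().backward()
print(x.grad.shape)
```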

References

  1. Mixed Precision Training. Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, et al. 2017.
  2. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, et al. 2018.
  3. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, et al. 2020.
  4. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He. 2019.
  5. LoRA: Low-Rank Adaptation of Large Language Models. Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, et al. 2021.
  6. 8-Bit Approximations for Parallelism in Deep Learning. Tim Dettmers. 2015.
  7. 8-bit Optimizers via Block-wise Quantization. Tim Dettmers, Mike Lewis, Sam Shleifer, Luke Zettlemoyer. 2021.
  8. The case for 4-bit precision: k-bit Inference Scaling Laws. Tim Dettmers, Luke Zettlemoyer. 2022.
  9. QLoRA: Efficient Finetuning of Quantized LLMs. Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer. 2023.
  10. GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection. Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, Yuandong Tian. 2024.
  11. Training Deep Nets with Sublinear Memory Cost. Tianqi Chen, Bing Xu, Chiyuan Zhang, Carlos Guestrin. 2016.
  12. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré. 2022.
  13. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. Tri Dao. 2023.
  14. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao. 2024.
  15. https://github.com/ray-project/ray
  16. https://github.com/Lightning-AI/pytorch-lightning