References
- Ring Attention with Blockwise Transformers for Near-Infinite Context. Hao Liu, Matei Zaharia, Pieter Abbeel. 2023.
- Sequence Parallelism: Long Sequence Training from System Perspective. Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, Yang You. 2021.
- Reducing Activation Recomputation in Large Transformer Models. Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, et al. 2022.
- DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs Training. Dacheng Li, Rulin Shao, Anze Xie, Eric P. Xing, Xuezhe Ma, Ion Stoica, Joseph E. Gonzalez, Hao Zhang. 2023.