Sequence Parallelism

Slides
Video Lecture

References

  1. Ring Attention with Blockwise Transformers for Near-Infinite Context. Hao Liu, Matei Zaharia, Pieter Abbeel. 2023.
  2. Sequence Parallelism: Long Sequence Training from System Perspective. Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, Yang You. 2021.
  3. Reducing Activation Recomputation in Large Transformer Models. Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, et al. 2022.
  4. DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs Training. Dacheng Li, Rulin Shao, Anze Xie, Eric P. Xing, Xuezhe Ma, Ion Stoica, Joseph E. Gonzalez, Hao Zhang. 2023.
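
To give a flavor of the idea behind reference 1, below is a minimal single-process NumPy sketch of the ring-attention communication pattern: the sequence is split into blocks, each simulated "device" keeps its query block, and key/value blocks rotate around the ring while a streaming softmax accumulates exact attention. The function name, shapes, and the single-process simulation are assumptions made for illustration, not the papers' actual distributed implementation.

```python
# Illustrative single-process simulation of ring attention (reference 1).
# Names and shapes are assumptions for this sketch, not the paper's code.
import numpy as np

def ring_attention_sim(q, k, v, num_devices):
    """Compute softmax(q @ k.T / sqrt(d)) @ v by splitting the sequence into
    per-device blocks and rotating K/V blocks around a ring, accumulating
    each contribution with an online (streaming) softmax."""
    seq_len, d = q.shape
    assert seq_len % num_devices == 0
    block = seq_len // num_devices
    scale = 1.0 / np.sqrt(d)

    # Each "device" owns one contiguous block of queries, keys, and values.
    q_blocks = [q[i * block:(i + 1) * block] for i in range(num_devices)]
    k_blocks = [k[i * block:(i + 1) * block] for i in range(num_devices)]
    v_blocks = [v[i * block:(i + 1) * block] for i in range(num_devices)]

    # Per-device running statistics for the streaming softmax:
    # unnormalized output, running max of logits, running sum of exponentials.
    out = [np.zeros((block, d)) for _ in range(num_devices)]
    m = [np.full((block, 1), -np.inf) for _ in range(num_devices)]
    l = [np.zeros((block, 1)) for _ in range(num_devices)]

    for step in range(num_devices):
        for dev in range(num_devices):
            # At ring step `step`, device `dev` holds the K/V block that
            # originated on device (dev + step) % num_devices.
            src = (dev + step) % num_devices
            scores = q_blocks[dev] @ k_blocks[src].T * scale

            m_new = np.maximum(m[dev], scores.max(axis=-1, keepdims=True))
            correction = np.exp(m[dev] - m_new)   # rescale old statistics
            p = np.exp(scores - m_new)            # weights for this block
            l[dev] = l[dev] * correction + p.sum(axis=-1, keepdims=True)
            out[dev] = out[dev] * correction + p @ v_blocks[src]
            m[dev] = m_new

    return np.concatenate([o / li for o, li in zip(out, l)], axis=0)

# Sanity check against full (non-distributed) attention.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(16, 8)) for _ in range(3))
full = np.exp(q @ k.T / np.sqrt(8))
full = (full / full.sum(-1, keepdims=True)) @ v
assert np.allclose(ring_attention_sim(q, k, v, num_devices=4), full)
```

Because the streaming softmax keeps only a running max and running sum per query block, each device never materializes the full attention matrix, which is what lets the real distributed version scale the sequence length with the number of devices.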