References
- PaLM: Scaling Language Modeling with Pathways. Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, et al. 2022.
- Gemini: A Family of Highly Capable Multimodal Models. Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, et al. 2023.
- Mistral 7B. Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, et al. 2023.
- Mixtral of Experts. Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, et al. 2024.
- Improving Language Understanding by Generative Pre-Training. Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever. 2018.
- Attention Is All You Need. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, et al. 2017.
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. 2018.
- Physics of Language Models: Part 3.1, Knowledge Storage and Extraction. Zeyuan Allen-Zhu, Yuanzhi Li. 2023.
- Language Models are Unsupervised Multitask Learners. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever. 2019.
- Language Models are Few-Shot Learners. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, et al. 2020.
- Common Crawl. https://commoncrawl.org/
- The Pile: An 800GB Dataset of Diverse Text for Language Modeling. Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, et al. 2020.
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces. Albert Gu, Tri Dao. 2023.
- Efficiently Modeling Long Sequences with Structured State Spaces. Albert Gu, Karan Goel, Christopher Ré. 2021.