Tasks and Datasets

Slides
Video Lecture

References

  1. DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs. Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, Matt Gardner. 2019.
  2. PIQA: Reasoning about Physical Commonsense in Natural Language. Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, Yejin Choi. 2019.
  3. Measuring Massive Multitask Language Understanding. Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt. 2020.
  4. Training Verifiers to Solve Math Word Problems. Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, et al. 2021.
  5. WinoGrande: An Adversarial Winograd Schema Challenge at Scale. Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, Yejin Choi. 2019.
  6. Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models. Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, et al. 2022.
  7. AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models. Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, et al. 2023.
  8. Evaluating Large Language Models Trained on Code. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, et al. 2021.
  9. Program Synthesis with Large Language Models. Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, et al. 2021.