References
- DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs. Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, Matt Gardner. 2019.
- PIQA: Reasoning about Physical Commonsense in Natural Language. Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, Yejin Choi. 2019.
- Measuring Massive Multitask Language Understanding. Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt. 2020.
- Training Verifiers to Solve Math Word Problems. Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, et al. 2021.
- WinoGrande: An Adversarial Winograd Schema Challenge at Scale. Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, Yejin Choi. 2019.
- Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, et al. 2022.
- AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models. Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, et al. 2023.
- Evaluating Large Language Models Trained on Code. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, et al. 2021.
- Program Synthesis with Large Language Models. Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, et al. 2021.