Making big benchmarks more trustworthy: Identifying the capabilities and limitations of language models by improving the BIG-bench benchmark

Ryan Burnell

Postdoctoral Research Fellow at Leverhulme Centre for the Future of Intelligence, University of Cambridge, UK

AI systems are becoming an integral part of every aspect of modern life. To ensure public trust in these systems, we need tools that can evaluate their capabilities and weaknesses. But these tools are struggling to keep up with the rapid pace of progress in AI, particularly for large language models, which some have argued are developing complex cognitive capabilities. The recently unveiled BIG-bench has attempted to address this problem by incorporating over 200 complex tasks into one large benchmark for language models. But this benchmark has critical problems. First, its tasks were not designed in a systematic way, leaving significant gaps in the abilities they measure. Second, we currently lack ways of estimating cognitive abilities from patterns of performance on these tasks. To address these issues, we will conduct a systematic analysis of the BIG-bench tasks to build tools for extracting estimates of cognitive abilities from the benchmark. These tools will help researchers, policymakers, and members of the public to track the progress of language models and obtain more robust estimates of their capabilities and weaknesses. The proposed project will be a collaboration between Dr Ryan Burnell, an expert in cognitive science and experimental design based at the University of Cambridge, and Dr Hernandez-Orallo, an expert in AI evaluation based at UPV. It will contribute to several objectives of TAILOR, especially WP3, whose task on “Safety and Robustness” is coordinated by Hernandez-Orallo.

Keywords: Language models, AI evaluation, cognitive psychology, reasoning, intuitive physics
Scientific area: Trustworthy AI, Human agency and oversight, Safety and robustness, Sustainable AI