Making big benchmarks more trustworthy: Identifying the capabilities and limitations of language models by improving the BIG-Bench benchmark

Ryan Burnell, Postdoctoral Research Fellow at the Leverhulme Centre for the Future of Intelligence, University of Cambridge, UK

AI systems are becoming an integral part of every aspect of modern life. To ensure public trust in these systems, we need tools that can be used to evaluate their capabilities and weaknesses. But these tools are struggling …