Benchmark tests are the SATs of the AI world

  • Benchmarking isn't sexy but model evaluation is critical for AI's future
  • IBM explained why designing benchmark tests is harder than you might think
  • Representativeness, reliability, efficiency, validity and transparency are key for designing good benchmarks

Standardized testing isn’t just for humans anymore. It turns out that large language models (LLMs) have to pass their own version of the SAT (or GCSE or ATAR, depending on which country you’re in) before they hit prime time. But as the head of IBM’s foundation model evaluation team told us, designing benchmarking tests for artificial intelligence (AI) is harder than you might think.

Michal Shmueli-Scheuer is a senior technical staff member for IBM’s Foundation Models Evaluation group and has led that team since IBM began building its own AI models a year and a half ago. While model evaluation isn’t necessarily a “sexy” task and often goes unappreciated, she said it’s critical to ensuring models are safe and can improve over time. After all, you need a baseline measurement to know whether an AI is getting better or not.

In a nutshell, here’s how it all works. Models are trained to perform a specific function. They are then tested using tasks (a.k.a. benchmarks) relevant to that function and scored on how well they perform. Performance across all the tasks is aggregated into a series of benchmark metrics.
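To make that aggregation step concrete, here is a minimal sketch in Python. It is not IBM’s pipeline; the task names, scores and weights are all invented for illustration.

```python
# Minimal, hypothetical sketch of score aggregation; this is not IBM's actual pipeline.
# Task names, scores and weights are invented for illustration.
from statistics import mean

# Per-task scores (e.g., accuracy) from running one model on each benchmark task.
task_scores = {
    "summarization": 0.81,
    "question_answering": 0.74,
    "code_generation": 0.62,
    "safety_refusals": 0.93,
}

# Simple unweighted aggregate: one headline number for the model.
overall = mean(task_scores.values())

# A weighted aggregate can emphasize the skills a particular deployment cares about.
weights = {
    "summarization": 0.2,
    "question_answering": 0.3,
    "code_generation": 0.3,
    "safety_refusals": 0.2,
}
weighted = sum(task_scores[task] * weight for task, weight in weights.items())

print(f"unweighted aggregate: {overall:.3f}")
print(f"weighted aggregate:   {weighted:.3f}")
```

The headline number makes models easy to compare, but it is exactly this kind of roll-up that the researchers quoted below warn can hide where a model fails on individual tasks.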

“It’s very hard work,” Shmueli-Scheuer said of evaluation testing.

Brittle bots and bad data

Why? Well, first of all, unlike the SATs, there’s no single evaluation test used across all LLMs. Sure, there are different options like Arena, MMLU, BBH and others. But while these tests largely agree on which model is better when ranking a large number of LLMs (since it’s easy to tell broadly what’s good and what’s bad), they tend to disagree on which model is best when evaluating a smaller number of LLMs. So, there’s no single broad-spectrum test that’s the be-all and end-all.
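One rough way to picture that agreement-versus-disagreement effect is to compare how two benchmarks rank the same set of models using a rank correlation such as Kendall’s tau. The sketch below uses invented scores, not real leaderboard data.

```python
# Hypothetical illustration of benchmark (dis)agreement; all scores are invented.
# Requires scipy (pip install scipy).
from scipy.stats import kendalltau

# Made-up leaderboard scores from two different benchmarks for the same six models.
benchmark_1 = [82.1, 79.4, 71.0, 65.3, 58.8, 41.2]
benchmark_2 = [80.7, 81.2, 69.5, 66.0, 57.1, 43.0]

# Kendall's tau measures how closely the two rankings agree (1.0 = identical ordering).
tau_all, _ = kendalltau(benchmark_1, benchmark_2)
print(f"agreement across all six models: tau = {tau_all:.2f}")

# Narrow the comparison to the three highest scorers and the ordering flips more easily.
tau_top, _ = kendalltau(benchmark_1[:3], benchmark_2[:3])
print(f"agreement among the top three:   tau = {tau_top:.2f}")
```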

To put a finer point on it, researchers published a paper in Science back in April 2023 warning that “aggregate metrics limit our insight into performance in particular situations, making it harder to find system failure points and robustly evaluate system safety.” They added that things are only getting worse as LLMs expand, demanding “more diverse benchmarks to cover the range of their capabilities.”

Since instance-by-instance results are rarely made available, it’s “difficult for researchers and policy-makers to further scrutinize system behavior,” the paper said.

Shmueli-Scheuer noted that LLMs are incredibly “brittle,” meaning that different phrasing, a different ordering of examples or even a wayward space in a prompt can result in wildly different outputs from the same model.
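A simple brittleness check might look something like the hypothetical sketch below: ask the same question under trivial prompt perturbations and see whether the answers stay consistent. The query_model function is a stand-in, not a real API.

```python
# Hypothetical brittleness check: ask the same question with trivial prompt
# perturbations and see whether the model's answers stay consistent.
# query_model is a stand-in for a real LLM call; replace it with your own client.

def query_model(prompt: str) -> str:
    """Stand-in for a real model call. Returns a canned answer so the sketch runs."""
    return "Canberra"

base_prompt = "Q: What is the capital of Australia?\nA:"

perturbations = [
    base_prompt,                             # original phrasing
    base_prompt + " ",                       # a single wayward trailing space
    base_prompt.replace("Q:", "Question:"),  # slightly different wording
    base_prompt.replace("A:", "Answer:"),    # another small wording tweak
]

answers = [query_model(p) for p in perturbations]

# A brittle model can return different answers to what is effectively the same question.
consistent = len({a.strip().lower() for a in answers}) == 1
print("consistent across perturbations:", consistent)
```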

So, the problem is not just figuring out what to test, but also how to test it. Get it wrong and you get bad data. Shmueli-Scheuer said LLM developers also run the risk of delivering a lower-quality model or one that doesn’t meet safety standards.

On top of all this, Shmueli-Scheuer noted testing – like training – can be expensive since it uses so much compute power. So, you don’t want to waste time on evaluations that don’t work.

Benchmarking must-haves

To address these issues, Shmueli-Scheuer said IBM strives to design tests that align with four key pillars: representativeness, reliability, efficiency and validity.

For representativeness, the idea is to ensure that the test’s task distribution reflects the range of skills the LLM is expected to cover. This is easier said than done, especially when you think of LLM agents, which are meant to tackle a series of tasks in sequence.

For reliability, IBM aims to account for how arbitrary factors (like the wayward spaces mentioned above) can affect model responses. Efficiency is about making the test fast and cheap, while validity means checking that the tasks in the test actually measure what they’re meant to.

For the most effective evaluation, Shmueli-Scheuer said shallower testing across many tasks is better than deep testing on a few. She also recommended running multiple tests to catch anything the first pass missed.

The aforementioned research paper also argued in favor of increased transparency in test results, so that customers, researchers and lawmakers can gain greater insight into the details of how models performed on various benchmarking tasks.

Like AI itself, benchmarking is rapidly evolving, Shmueli-Scheuer said, adding that IBM is doing some work on adversarial benchmarking as well as benchmarking that uses humans, rather than other LLMs, to judge model performance.

Shmueli-Scheuer couldn’t elaborate further on using humans to judge model performance because it’s top secret, but maybe it’s really because humans are still better than AI.