Benchmark tests are the SATs of the AI world

  • Benchmarking isn't sexy but model evaluation is critical for AI's future
  • IBM explained why designing benchmark tests is harder than you might think
  • Representativeness, reliability, efficiency, validity and transparency are key for designing good benchmarks

Standardized testing isn’t just for humans anymore. It turns out that large language models (LLMs) have to pass their own version of the SAT (or GCSE or ATAR, depending on which country you’re in) before they hit prime time. But as the head of IBM’s foundation model evaluation team told us, designing benchmarking tests for artificial intelligence (AI) is harder than you might think.

Michal Shmueli-Scheuer is a senior technical staff member for IBM’s Foundation Models Evaluation group and has led that team since IBM began building its own AI models a year and a half ago. While model evaluation isn’t necessarily a “sexy” task and often goes unappreciated, she said it’s critical to ensuring models are safe and can improve over time. After all, you need a baseline measurement to know whether an AI is getting better or not.

In a nutshell, here’s how it all works. Models are trained to perform a specific function. They are then tested using tasks (a.k.a. benchmarks) relevant to that function and scored on how well they perform. Performance across all the tasks is aggregated into a series of benchmark metrics.
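To make that aggregation step concrete, here is a minimal sketch in Python. It is not IBM’s pipeline; the task names, scores and weights are all invented for illustration.

```python
# Minimal, hypothetical sketch of score aggregation; this is not IBM's actual pipeline.
# Task names, scores and weights are invented for illustration.
from statistics import mean

# Per-task scores (e.g., accuracy) from running one model on each benchmark task.
task_scores = {
    "summarization": 0.81,
    "question_answering": 0.74,
    "code_generation": 0.62,
    "safety_refusals": 0.93,
}

# Simple unweighted aggregate: one headline number for the model.
overall = mean(task_scores.values())

# A weighted aggregate can emphasize the skills a particular deployment cares about.
weights = {
    "summarization": 0.2,
    "question_answering": 0.3,
    "code_generation": 0.3,
    "safety_refusals": 0.2,
}
weighted = sum(task_scores[task] * weight for task, weight in weights.items())

print(f"unweighted aggregate: {overall:.3f}")
print(f"weighted aggregate:   {weighted:.3f}")
```

The headline number makes models easy to compare, but it is exactly this kind of roll-up that the researchers quoted below warn can hide where a model fails on individual tasks.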

“It’s very hard work,” Shmueli-Scheuer said of evaluation testing.

Brittle bots and bad data

Why? Well, first of all, unlike the SATs, there’s no single evaluation test used across all LLMs. Sure, there are different options like Arena, MMLU, BBH and others. But while these tests largely agree on which model is better when ranking a large number of LLMs (since it’s easy to tell broadly what’s good and what’s bad), they tend to disagree on which model is best when evaluating a smaller number of LLMs. So, there’s no single broad-spectrum test that’s the be-all and end-all.
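One rough way to picture that agreement-versus-disagreement effect is to compare how two benchmarks rank the same set of models using a rank correlation such as Kendall’s tau. The sketch below uses invented scores, not real leaderboard data.

```python
# Hypothetical illustration of benchmark (dis)agreement; all scores are invented.
# Requires scipy (pip install scipy).
from scipy.stats import kendalltau

# Made-up leaderboard scores from two different benchmarks for the same six models.
benchmark_1 = [82.1, 79.4, 71.0, 65.3, 58.8, 41.2]
benchmark_2 = [80.7, 81.2, 69.5, 66.0, 57.1, 43.0]

# Kendall's tau measures how closely the two rankings agree (1.0 = identical ordering).
tau_all, _ = kendalltau(benchmark_1, benchmark_2)
print(f"agreement across all six models: tau = {tau_all:.2f}")

# Narrow the comparison to the three highest scorers and the ordering flips more easily.
tau_top, _ = kendalltau(benchmark_1[:3], benchmark_2[:3])
print(f"agreement among the top three:   tau = {tau_top:.2f}")
```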

To put a finer point on it, researchers published a paper in Science back in April 2023 warning that “aggregate metrics limit our insight into performance in particular situations, making it harder to find system failure points and robustly evaluate system safety.” They added that things are only getting worse as LLMs expand, demanding “more diverse benchmarks to cover the range of their capabilities.”

Since instance-by-instance results are rarely made available, it’s “difficult for researchers and policy-makers to further scrutinize system behavior,” the paper said.

Shmueli-Scheuer noted that LLMs are incredibly “brittle,” meaning that different phrasing, a different ordering of examples or even a wayward space in a prompt can result in wildly different outputs from the same model.
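A simple brittleness check might look something like the hypothetical sketch below: ask the same question under trivial prompt perturbations and see whether the answers stay consistent. The query_model function is a stand-in, not a real API.

```python
# Hypothetical brittleness check: ask the same question with trivial prompt
# perturbations and see whether the model's answers stay consistent.
# query_model is a stand-in for a real LLM call; replace it with your own client.

def query_model(prompt: str) -> str:
    """Stand-in for a real model call. Returns a canned answer so the sketch runs."""
    return "Canberra"

base_prompt = "Q: What is the capital of Australia?\nA:"

perturbations = [
    base_prompt,                             # original phrasing
    base_prompt + " ",                       # a single wayward trailing space
    base_prompt.replace("Q:", "Question:"),  # slightly different wording
    base_prompt.replace("A:", "Answer:"),    # another small wording tweak
]

answers = [query_model(p) for p in perturbations]

# A brittle model can return different answers to what is effectively the same question.
consistent = len({a.strip().lower() for a in answers}) == 1
print("consistent across perturbations:", consistent)
```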

So, the problem is not just figuring out what to test, but also how to test it. Get it wrong and you get bad data. Shmueli-Scheuer said LLM developers also run the risk of delivering a lower-quality model or one that doesn’t meet safety standards.

On top of all this, Shmueli-Scheuer noted testing – like training – can be expensive since it uses so much compute power. So, you don’t want to waste time on evaluations that don’t work.

Benchmarking must-haves

To address these issues, Shmueli-Scheuer said IBM strives to design tests that align with four key pillars: representativeness, reliability, efficiency and validity.

For representativeness, the idea is to ensure that the test’s task distribution reflects the range of skills the LLM is expected to cover. This is easier said than done, especially when you think of LLM agents, which are meant to tackle a series of tasks in sequence.

For reliability, IBM aims to account for how arbitrary factors (like the wayward spaces mentioned above) can affect model responses. Efficiency is about making the test fast and cheap, while validity means checking that the tasks in the test actually measure what they’re meant to.

For the most effective evaluation, Shmueli-Scheuer said shallower testing across many tasks is better than deep testing on a few. She also recommended running multiple tests to catch anything the first pass missed.

The aforementioned research paper also argued in favor of increased transparency in test results, so that customers, researchers and lawmakers can gain greater insight into the details of how models performed on various benchmarking tasks.

Like AI itself, benchmarking is rapidly evolving, Shmueli-Scheuer said, adding that IBM is doing some work on adversarial benchmarking as well as benchmarking that uses humans, rather than other LLMs, to judge model performance.

Shmueli-Scheuer couldn’t elaborate further on using humans to judge model performance because it’s top secret, but maybe it’s really because humans are still better than AI.