Editor’s note: Sinan Ozdemir is a speaker for ODSC East from May 13th to 15th in Boston. Be sure to check out his talk, “Beyond Benchmarks: Evaluating AI Agents, Multimodal Systems, and Generative AI in the Real World,” there!
For the uninitiated, a benchmark is a standardized, open-source test set for an AI task. The idea of a benchmark, or even a test set, for AI is not new by any stretch. The general idea when training any AI model is to divide a (usually) massive amount of data into "splits": train the model on the largest portion (the training split), validate your results along the way using a smaller subset (the validation split), and hold back a similarly small subset to finally "test" the model at the end (the test split). The idea is that if a team agrees on a given train/val/test split, then we can evaluate models fairly, knowing that differences in performance aren't just differences in data. Below is an image from one of my books, highlighting that fine-tuning process.
Test sets in general (benchmarks being an example of a test set) help AI engineers know if their hard work on training models paid off (Source: A Quick Start Guide to LLMs, by yours truly, Sinan Ozdemir)
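To make that splitting step concrete, here is a minimal sketch in Python (using pandas and scikit-learn; the file name and split ratios are illustrative assumptions, not a prescription) of carving a labeled dataset into the three splits described above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labeled dataset; the file name is a placeholder.
data = pd.read_csv("labeled_examples.csv")

# Hold out 20% of the data, then split that holdout evenly into
# validation and test sets, giving an 80/10/10 train/val/test split.
train_df, holdout_df = train_test_split(data, test_size=0.2, random_state=42)
val_df, test_df = train_test_split(holdout_df, test_size=0.5, random_state=42)

print(f"train={len(train_df)}, val={len(val_df)}, test={len(test_df)}")
```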
But what if the model you are making is not meant just for you or your team, but for as many people as possible? And what if we can't standardize a training set, because the training data itself is part of the proprietary "secret sauce"? That's where benchmarks come in. Someone (or usually some people) creates and proposes a test set (the benchmark) and hopes people adopt it. Whatever training data an organization uses to train its model, it just needs to make sure none of it comes from the benchmark, or else that would be cheating (more on that later).
Benchmarks are often the first thing people ask about with a new LLM: "How did it score on the XYZ benchmark?" or "Is this model better than ZYX at benchmarks?" And frankly, that is the primary purpose of benchmarks: to serve as a top-line conversation starter when evaluating an LLM for a certain job. This post, and my session overall, will tackle our urge to treat benchmarks not as a conversation starter but as a conversation ender.
We will explore three main areas of benchmarks in this post:
- Benchmarks becoming targets: LLM creators are incentivized to chase the top of leaderboards, and LLM consumers conflate benchmark performance with everyday, real-world performance.
- Biases and shortcuts: Static benchmarks often contain artifacts and biases that models exploit, making benchmarks artificially easier than they appear.
- Overstated progress: High scores on benchmarks don’t mean models have true generalization or human-level understanding.
Let’s dig in.
Benchmarks becoming targets
We’ve been measuring AI using benchmarks for decades. Benchmarks like SQuAD (you’re forgiven if you’ve never heard of it) have been used to measure an AI’s ability to perform question-answering tasks like, “What is the largest city in Poland?” The image below shows the progress of AI on classic benchmarks, surpassing “human performance” (denoted as the zero mark on the y-axis) over the last decade. The goal of benchmarks was to give us humans a consistent, shared view of how well our AI systems were doing.
That’s still true today; on its face, targeting benchmark performance is a useful way to measure top-line progress in AI. The problem arises when we all hyper-focus on a relatively small subset of benchmarks and equate a model’s performance on the benchmark with its overall performance. It leads to one of the most difficult questions I have to answer in a lecture: “Which model is currently the best?”
Benchmark saturation over time for popular benchmarks, normalized with initial performance at minus one and human performance at zero. (Source: https://arxiv.org/pdf/2104.14337)
This is not the right question to ask. The better question is, “How will this particular model perform on my particular task, and is there a benchmark that gives me any indication of that?” It’s less punchy and puts more work on the consumer, but it better reflects how AI is actually being adopted: task-oriented and usually domain-focused.
That being said, we sometimes can’t even trust an LLM’s ability to solve a benchmark consistently. The image below depicts a study showing GPT-4’s performance on a math benchmark dropping from 84% to 51% within only 3 months of release (from March to June 2023). This happened because OpenAI released a new version of the model, to little fanfare, with a later knowledge cutoff, but didn’t report how the benchmark numbers shifted. So we, the persuadable public, still assume the benchmark scores reported in March still hold; they don’t.
GPT-4 and GPT-3.5 benchmark performances shifting wildly within only 3 months (Source: https://arxiv.org/pdf/2307.09009.pdf)
Nonetheless, we look to benchmarks on new LLM releases to give us that top-line view. But once a benchmark like SQuAD (or, more recently, MMLU) becomes a standard, researchers and model builders optimize heavily to improve benchmark scores, chasing the top of a leaderboard to show the world what they’ve done and, hopefully, sell some credits. So should benchmarks even be a target to chase? Sure they should, especially when the task is relatively nuanced and so is the benchmark (like testing a model’s financial tool-selection ability with a financial tool-selection benchmark). Benchmark performance is a great top-line set of numbers to help us create a shortlist of models to consider, but it can’t be the end of the story. This wouldn’t be a post about benchmarks without at least one reference to the 1970s-era Goodhart’s Law:
“When a measure becomes a target, it ceases to be a good measure.”
Said another way, when we optimize too hard for a particular metric (performance on a benchmark), people (or, in this case, AI models) will find ways to “game” the system and post higher numbers without actually improving at the underlying goal in any meaningful way. One way a model can do that is by finding shortcuts that earn it better grades.
Biases and shortcuts
A section on benchmark biases and shortcuts could frankly be its own book. In fact, I made a 12-hour video series focusing on this topic; there’s a lot to say. For now, let’s focus on a real and imminent shortcut AIs can suffer from: data contamination. Data contamination is when an AI trains on data suspiciously similar to benchmark questions, artificially inflating the model’s performance on the benchmark. Basically, what if an AI were accidentally or maliciously handed a cheat sheet?
An investigation into data contamination showed that a Llama 2 model allowed to train on rephrased benchmark questions, ones that passed industry-standard contamination detection, would have beaten GPT-4 on the popular MMLU benchmark. Too bad Meta has never open-sourced Llama’s training data, so we can’t double-check that work. (Source: https://arxiv.org/abs/2311.04850)
One way data contamination can happen is when a benchmark becomes so popular that the open internet starts to fill with rephrasings of its questions that are justtttt different enough to slip past industry-standard contamination detection techniques (usually just embedding-similarity and n-gram-overlap [basically a keyword search] checks). A frighteningly simple research experiment (shown above) rephrased questions from the MMLU benchmark just enough to pass such checks and found that Llama 2 could have beaten GPT-4 on the benchmark if it had been allowed to train on them.
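To see why those checks are so easy to slip past, here is a minimal sketch of a word-level n-gram overlap check (the n-gram size, threshold, and example questions are illustrative assumptions, not any lab’s actual pipeline). A verbatim copy of a benchmark question gets flagged; an aggressive rephrasing sails right through.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word-level n-grams in a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_contaminated(train_doc: str, benchmark_q: str,
                       n: int = 8, threshold: float = 0.5) -> bool:
    """Flag a training document that shares too many n-grams with a benchmark question."""
    bench = ngrams(benchmark_q, n)
    if not bench:
        return False
    overlap = len(ngrams(train_doc, n) & bench) / len(bench)
    return overlap >= threshold

benchmark_q = "What is the approximate boiling point of water at sea level in degrees Celsius?"
rephrased = "Roughly how hot, in Celsius, does water need to get before it boils at sea level?"

print(looks_contaminated(benchmark_q, benchmark_q))  # True: the verbatim copy is caught
print(looks_contaminated(rephrased, benchmark_q))    # False: the rephrasing slips through
```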
It’s easier than we think to let a benchmark question slip into a model’s training data. Experiments like these suggest that companies are probably doing a decent job of de-contaminating training data (otherwise we’d be seeing saturation at an even greater rate), but when models aren’t fully open source (training data and code included), which is true of every single Llama model, we can never double-check the work being done.
Overstated progress
Put simply, benchmarks don’t cover the majority of what most humans would consider general intelligence, let alone “superintelligence.” Most benchmark questions are multiple choice, and most benchmarks reported by companies are in math and coding, which really isn’t that helpful if you’re using an LLM to write marketing copy or to classify incoming customer support tickets into a particular intent class. One of the most talked-about modern benchmarks is “Humanity’s Last Exam,” or HLE. In their own words, the benchmark is designed to be the “final closed-ended academic benchmark of its kind with broad subject coverage.” And by “academic” and “broad,” they mean almost entirely STEM.
Humanity’s Last Exam is mostly a STEM exam, featuring 3 questions tagged as “League Of Legends” questions 🤷(Source: https://github.com/centerforaisafety/hle)
To be 100% clear, there’s no chance I’ll ever pass, let alone ace, this closed-book exam. For one, I don’t play League of Legends, and I am terrible at chess (both categories are represented in this benchmark). When an AI can pass this benchmark (yes, I said when), I will find it extremely impressive, but I will also have many questions, starting with:
- Was your AI able to look up information in trying to pass the exam?
- Can you prove to me you didn’t help your AI “cheat” by fine-tuning it on similar data?
- Can you convince me that I even care about your AI knowing how long the Second Great War was in StarCraft Lore? (Yes, that’s a real question in this benchmark)
The counterpoint to this line of questioning is that AI adoption will eventually evolve beyond targeted LLM use cases and prompting, and that benchmarks like HLE are less about targeted adoption of AI and more meant to signal a turning point: the heralding of Artificial General Intelligence (AGI) or superintelligence, an AI going beyond human intelligence. To put it bluntly, even an AI scoring 100% on HLE would not, on its own, trigger a sense of AGI or superintelligence for me. These high benchmark scores look impressive to us, the consumers, but they can stop reflecting true generalization, which is exactly what Goodhart’s Law predicts.
So what do we do?
We will dive deeper into remedies for benchmarks in our live session, but I’ll outline a few simple steps we can take now:
- Use benchmarks to create a short list of models to evaluate – don’t use them to select a single model.
- Make your own test sets – they are valid and frankly will tell you more about your particular tasks than most public benchmarks will (see the sketch after this list).
- Ask yourself, “Who made this benchmark?” and “What are they really trying to test?” Is it true reasoning beyond the knowledge an AI has, or simply the recall of a few facts that, when put together, simulate reasoning?
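For the second suggestion above, a homegrown test set can be as simple as a handful of labeled examples and a scoring loop. Below is a minimal sketch; the `call_model` function, the intent labels, and the example tickets are placeholders you would swap for your own model client and your own task.

```python
# A tiny, task-specific test set: support-ticket intent classification.
# The examples and labels are illustrative placeholders; use your own data.
test_cases = [
    {"ticket": "I was charged twice for my subscription this month.", "expected": "billing"},
    {"ticket": "The app crashes every time I open the settings page.", "expected": "bug_report"},
    {"ticket": "How do I add a second user to my account?", "expected": "how_to"},
]

def call_model(prompt: str) -> str:
    """Placeholder for your LLM client (hosted API, local model, etc.)."""
    raise NotImplementedError("Wire this up to whichever model you're evaluating.")

def evaluate(cases) -> float:
    """Return simple accuracy of the model's label predictions on the test cases."""
    correct = 0
    for case in cases:
        prompt = (
            "Classify this support ticket as one of: billing, bug_report, how_to.\n"
            f"Ticket: {case['ticket']}\n"
            "Answer with the label only."
        )
        prediction = call_model(prompt).strip().lower()
        correct += int(prediction == case["expected"])
    return correct / len(cases)

# accuracy = evaluate(test_cases)  # run once call_model is implemented
```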
Other topics in our session will include:
- The path to “AGI” and “Superintelligence” and a deeper dive into the “Humanity’s Last Exam” benchmark
- How AI developers use prompting when benchmarking, leading to public misconception of AI strength
- Addressing staleness in benchmarks: what if answers to questions change over time?
See you in Boston!
About the Author/ODSC East Speaker:
Sinan Ozdemir is a mathematician, data scientist, NLP expert, lecturer, and accomplished author. He is currently applying his extensive knowledge and experience in AI and Large Language Models (LLMs) as the founder and CTO of LoopGenius, transforming the way entrepreneurs and startups market their products and services.
Simultaneously, he is providing advisory services in AI and LLMs to Tola Capital, an innovative investment firm. He has also worked as an AI author for Addison Wesley and Pearson, crafting comprehensive resources that help professionals navigate the complex field of AI and LLMs.
Previously, he served as the Director of Data Science at Directly, where his work significantly influenced the company’s strategic direction. As an official member of the Forbes Technology Council from 2017 to 2021, he shared his insights on AI, machine learning, NLP, and business processes related to emerging technologies.
He holds a B.A. and an M.A. in Pure Mathematics (Algebraic Geometry) from The Johns Hopkins University, and he is an alumnus of the Y Combinator program. Sinan actively contributes to society through various volunteering activities.
Sinan’s skill set is strongly endorsed by professionals from various sectors and includes data analysis, Python, statistics, AI, NLP, theoretical mathematics, data science, function analysis, data mining, algorithm development, machine learning, game-theoretic modeling, and various programming languages.