As someone who maintains benchmarks for a living, I think you're severely underrating the difficulty/complexity of constantly updating tasks. I think there's a fundamental tension between making frequent benchmark updates and keeping results comparable over time. It's a huge pain to re-run old models, models served by labs are often deprecated, and running complex agent benchmarks is already very time-consuming/difficult. There's a sense in which the whole *point* of a benchmark is to be a static set of tasks that people can use to compare results across time and space, and if it's changing constantly you're hitting a moving target, which dilutes one of the main benefits of having this static set.
Otherwise totally agree with your points here—we should use LLM review, and peer reviewers (and benchmark creators themselves, although it's sad that I have to say this) should actually look at the data!
I think there's also just really wide variance in benchmark quality—HLE is wayyyyy worse than most of the other benchmarks you listed, for example, because they didn't do substantive human expert review/validation of question correctness.
Fair point, I probably shouldn't have used the word "simple". I do think that such updates are doable, though, even if it means dropping some old models off the leaderboard. On some (though not all) benchmarks, older models effectively get a 0 anyway. Maybe for long-term benchmarks there can be a "global" comparison set, which is the set of all unmodified tasks? Certainly, in the months following a benchmark's release, deprecated models shouldn't be an issue. Seems easier than exhaustively finding every bug before release.
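To make that concrete, here's a rough sketch of what I mean (names and structure are just illustrative, not from any real harness): tag each task with whether it has ever been patched, and have the long-term leaderboard score models only on the frozen subset, so those numbers stay comparable even as individual tasks get fixed.

```python
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str
    ever_modified: bool  # True if the task was ever patched after release

def global_comparison_score(solved: dict[str, bool], tasks: list[Task]) -> float:
    """Score a model only on the frozen (never-modified) subset of tasks.

    Patched tasks are excluded for every model, old or new, so scores on
    this subset remain comparable across benchmark updates.
    """
    frozen = [t for t in tasks if not t.ever_modified]
    if not frozen:
        return 0.0
    return sum(solved.get(t.task_id, False) for t in frozen) / len(frozen)

# e.g. two of three tasks are frozen; the model solved one of them
tasks = [Task("t1", False), Task("t2", True), Task("t3", False)]
print(global_comparison_score({"t1": True, "t3": False}, tasks))  # 0.5
```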
Definitely agree on benchmark quality variance.
And thank you for your service! 🫡
Solid audit work here. The regex-log octet issue is such a perfect example of how easy it is for edge cases to slip through even with thorough review. I've definitely run into similar headaches trying to build on top of flawed benchmarks, and it's frustrating how much time gets wasted debugging issues that aren't even in your own code. The open-source software comparison is spot on tbh.
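For anyone who hasn't been bitten by this class of bug, here's a toy illustration (not the actual regex from the post, just the same flavor): a naive IP pattern happily matches octets above 255, so a log grader built on it silently accepts malformed addresses that a stricter pattern rejects.

```python
import re

# Naive pattern: any 1-3 digit octets, so 999.300.1.256 "looks like" an IP.
naive_ip = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")
print(bool(naive_ip.search("src=10.0.0.1")))       # True, as intended
print(bool(naive_ip.search("src=999.300.1.256")))  # also True -- the bug

# Stricter pattern: each octet is bounded to 0-255.
octet = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
strict_ip = re.compile(rf"\b{octet}(?:\.{octet}){{3}}\b")
print(bool(strict_ip.search("src=999.300.1.256")))  # False
```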
Couldn't agree more. This is so true for AI models learning to game the datasets.