Discussion about this post

David

As someone who maintains benchmarks for a living, I think you're severely underrating the difficulty/complexity of constantly updating tasks. There's a fundamental tension between making frequent benchmark updates and keeping results comparable over time. It's a huge pain to re-run old models, models served by labs are often deprecated, and running complex agent benchmarks is already very time-consuming/difficult. There's a sense in which the whole *point* of a benchmark is to be a static set of tasks that people can use to compare results across time and space, and if it's changing constantly you're hitting a moving target, which dilutes one of the main benefits of having that static set.

Otherwise I totally agree with your points here: we should use LLM review, and peer reviewers (and benchmark creators themselves, although it's sad that I have to say this) should actually look at the data!

I think there's also just really wide variance in benchmark quality—HLE is wayyyyy worse than most of the other benchmarks you listed, for example, because they didn't do substantive human expert review/validation of question correctness.

Neural Foundry

Solid audit work here. The regex-log octet issue is such a perfect example of how easy it is for edge cases to slip through even with thorough review. I've definitely run into similar headaches trying to build on top of flawed benchmarks, and it's frustrating how much time gets wasted debugging issues that aren't even in your own code. The open source software comparison is spot on tbh.

