Discussion about this post

Leni Shor

Thanks for writing this, I found it pretty interesting! Some thoughts about it:

- This is a pretty big problem that's surprisingly underdiscussed (*cough* SWE-Bench Pro *cough*), so I'm very happy to see someone writing about it! I wish there was a widely-understood name for it.

- My gut feeling is that this is not as big a problem as plain vanilla task underspecification for, e.g., Terminal-Bench or SWE-Bench Verified—what's your take on this?

- To be fair to the models, I can imagine a reasonable human developer saying "yeah, we should definitely not allow writes outside of the repo in the future, but I don't think this is key for an MVP", especially if they didn't know what this minimal git clone was *for*. (I'd bet that the success rate would improve a lot if you said that this is production code for your billion-dollar SaaS startup or something!)

- Do you know of any benchmarks that have deliberately-underspecified tasks and try to evaluate whether the models make reasonable inferences about the tasks? I'd love to know more about whether or not this phenomenon really is a big part of why number go up, but mundane-utility goes up much less.
