Top model scores may be skewed by Git history leaks in SWE-bench

github.com

455 points by mustaphah 2 days ago


ofirpress - 2 days ago

[I'm on the SWE-bench team] Multiple people have looked into this, for example right in that thread: https://github.com/SWE-bench/SWE-bench/issues/465#issuecomme...

This issue had affected a tiny fraction of existing agents in a tiny fraction of their runs. And we've now issued a fix.

This is a natural part of running a benchmark, I'm sure tiny things like this will keep on getting discovered and we'll keep on fixing them. This doesn't change the overall picture or trends at all.