Server still allows timelosses to cause incorrect test results #1968
(I hope it goes without saying that I am most definitely a million percent happy, pleased, deeply thankful and profoundly grateful that we had a new contributor throwing oodles and oodles of new cores at fishtest. I have absolutely no problem with people adding new cores; this is strictly a server problem as far as I'm concerned.)
Timelosses are part of development: a result can perfectly reflect reality if you mess up Stockfish's time management code and keep losing on time. The time losses are not generated by the server itself but by real-world phenomena (Stockfish being run on some hardware), so I don't think the matter is as trivial as outright rejecting all timelosses.
Sure, but patches that touch TM are a very small fraction, and the fact remains that non-engine timelosses directly cause incorrect results on non-TM patches.
I suppose ultimately, discerning between "this patch affects TM" and "this patch doesn't affect TM" is something that can't be automated, which means we need some manual way to mark the difference in the fishtest UI. Adding a second checkbox "clear timelosses" next to "auto purge" would do the trick in a nonintrusive way. That would also let autopurge's timeloss barriers be loosened, since even at present autopurge may "false purge" on TM patches (although they're so rare it hasn't mattered so far). With this manual checkbox checked, any timeloss pair whatsoever would be removed from the results (ideally individual pairs rather than whole tasks; see the sketch below).
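As a rough illustration of removing individual pairs, here is a minimal sketch. The per-pair records and field names (`pairs`, `time_losses`, `pentanomial_bucket`) are assumptions for illustration, not fishtest's actual schema:

```python
# Hypothetical sketch: drop game pairs containing a timeloss from a task's
# pentanomial counts before they are folded into the run totals.
def clear_timeloss_pairs(task):
    """Return pentanomial counts [LL, LD, DD+WL, WD, WW] with
    timeloss-affected pairs removed.

    Assumes each task records, per game pair, its pentanomial bucket
    (0=LL .. 4=WW) and whether either game in the pair was lost on time.
    """
    clean = [0, 0, 0, 0, 0]
    for pair in task["pairs"]:           # assumed per-pair records
        if pair["time_losses"] > 0:      # either game ended in a timeloss
            continue                     # exclude the whole pair
        clean[pair["pentanomial_bucket"]] += 1
    return clean
```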
I've just had a concrete example of this effect.
Test 1 https://tests.stockfishchess.org/tests/view/662db69c6115ff6764c8065a
This failed, quite surprisingly, in 12k games, with more than 1% timelosses across the whole test and some tasks reaching nearly 10%.
By contrast, Test A https://tests.stockfishchess.org/tests/view/662db6b36115ff6764c80667 had passed in 50k games, and several other tests in the family had shown that both Tests 1 and A should have passed or failed together.
On that basis I rescheduled Test 1 as Test 2 after the timelossing workers had been fixed: https://tests.stockfishchess.org/tests/view/662edf05e1ff56336c0223d1
This time it went as I expected: it passed (considerably slower than I'd hoped, but whatever), a stark contrast to failing in 12k games.
Is the difference in these two runs attributable to luck? Yes, certainly, a large portion of it is luck. However, a large part is also clearly the direct impact of timelosses; for example, the WW/LL rate in the bad test was much, much higher than in the reschedule, or indeed in STCs generally. That can only be attributed directly to the server accepting bad data.
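For context, the WW/LL rate can be read straight off a run's pentanomial counts. A minimal sketch, assuming the usual [LL, LD, DD+WL, WD, WW] pair-count ordering (the example numbers are made up):

```python
def ww_ll_fraction(pentanomial):
    """Fraction of game pairs decided in the same direction (both wins or
    both losses), given pair counts [LL, LD, DD+WL, WD, WW]. Spurious
    timelosses inflate the LL and WW buckets, so an unusually high value
    is a red flag for bad data.
    """
    ll, ld, dd_wl, wd, ww = pentanomial
    total = ll + ld + dd_wl + wd + ww
    return (ll + ww) / total if total else 0.0

# 6000 pairs = 12k games; illustrative counts only
print(ww_ll_fraction([400, 1200, 2800, 1250, 350]))  # 0.125
```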
Please note that last year I submitted a PR (#1571) to (among other things) tighten the acceptable data quality; however, it was entirely ignored. That PR is now out of date, but as these tests demonstrate, the issue is clearly still a problem on the server and should be mitigated as soon as possible.
Ideally the server should be able to reject only the gamepairs affected, but rejecting tasks with more than 3-5 timelosses would be a good place to start (and ideally such rejection should occur even before a manual or auto purge); a sketch of such a gate follows below.
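A server-side gate along those lines could be as simple as the following sketch. The threshold value and the `time_losses` field name are assumptions for illustration:

```python
MAX_TIMELOSSES_PER_TASK = 4  # assumed threshold in the 3-5 range suggested above

def accept_task_update(task_stats):
    """Reject a worker's task results outright if they contain too many
    timelosses, before any manual or auto purge is considered.
    `task_stats` is a dict of per-task counters (hypothetical schema).
    """
    return task_stats.get("time_losses", 0) <= MAX_TIMELOSSES_PER_TASK
```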
Given the impact of this issue, I'd recommend that anyone who ran STCs while the bad data was being accepted by the server (around a day or so before this issue was created; see the bad test timestamps) consider rerunning those tests, as results may differ without the bad data.