segmentation fault with too big numSearchThreads (OpenCL) #182
I'm unable to reproduce on my machine. I'm guessing something goes wrong when trying to use batch size 5 or greater, since your attached output works fine up to 4 search threads and fails at 6, and that is consistent with the gdb backtrace you have. If you could trace this to some issue in KataGo by exploring in GDB, that would be cool. The only things that come to mind right now for how the specific call you gave could have failed would be if, on that line, either input or inputScratch ended up being invalid buffers for OpenCL, or were not sized large enough. For batch size 5 on 9x9, given 22 channels of input, they need to be sized to hold at least that much data.

Another question: what is your stack size when you run KataGo?

Lastly, given that I can't reproduce, that no recent update has materially touched the OpenCL code as far as I know, and that other users have not reported this yet, it could be an issue specific to your GPU and/or OpenCL drivers. Do either of these help? (Found purely by googling your GPU along with "segfault"; I don't claim any specific understanding of these threads.)
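A minimal sketch of that kind of size check, assuming float32 spatial input and purely illustrative names (this is not KataGo's actual code): `clGetMemObjectInfo` with `CL_MEM_SIZE` reports how large a cl_mem really is, which can be compared against the batched input size.

```cpp
#include <CL/cl.h>
#include <cassert>
#include <cstdio>

// Illustrative debug helper: verify that a cl_mem buffer is large enough for
// the batched spatial input. Assuming float32 input, batch 5 on 9x9 with 22
// channels needs at least 5 * 22 * 9 * 9 * 4 = 35640 bytes.
void checkInputBufferSize(cl_mem buf, size_t batchSize, size_t numChannels,
                          size_t boardXSize, size_t boardYSize) {
  size_t actualBytes = 0;
  cl_int err = clGetMemObjectInfo(buf, CL_MEM_SIZE, sizeof(actualBytes),
                                  &actualBytes, NULL);
  assert(err == CL_SUCCESS);

  size_t requiredBytes =
      batchSize * numChannels * boardXSize * boardYSize * sizeof(float);
  printf("cl_mem size: %zu bytes, need at least %zu bytes\n",
         actualBytes, requiredBytes);
  assert(actualBytes >= requiredBytes);
}
```

A completely garbage handle could still crash inside the query itself, but on most drivers an invalid handle just makes clGetMemObjectInfo return CL_INVALID_MEM_OBJECT, so this is a cheap sanity check for the "invalid or undersized buffer" hypotheses.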
The issue persists with the default settings. The stack size is the same as in the linked question. I am not very familiar with gdb, so I don't know how to trace the issue further; I added some verbose printing of the buffers, but I'm not sure what to make of it. Would a core file be helpful? I would try to create one, in case someone who knows what they are doing wants to dig into this.

edit: Regarding those hardware-related links, that trail seems pretty cold. The problems described there do not match mine. I can run OpenCL apps fine, as well as KataGo itself; it just segfaults with multiple search threads.
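For what it's worth, one way to confirm the stack limit a process actually runs with on Linux is `ulimit -s` in the launching shell, or programmatically via getrlimit; a minimal sketch using standard POSIX calls, nothing KataGo-specific:

```cpp
#include <sys/resource.h>
#include <cstdio>

// Print the stack size limit the current process actually runs with.
int main() {
  struct rlimit rl;
  if (getrlimit(RLIMIT_STACK, &rl) != 0) {
    perror("getrlimit");
    return 1;
  }
  if (rl.rlim_cur == RLIM_INFINITY)
    printf("stack soft limit: unlimited\n");
  else
    printf("stack soft limit: %llu KiB\n",
           (unsigned long long)(rl.rlim_cur / 1024));
  return 0;
}
```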
I think your verbose printing code is incorrect. You want to cast the cl_mem (which itself is basically a pointer, I think) directly to a void* and print the numeric value of that pointer, rather than take the address of the cl_mem. I'm pretty sure you're printing out the memory address of where the cl_mem handle is stored, rather than the value of the handle itself. No, a core file would probably not be helpful.

Are you sure that you can run other OpenCL apps fine, such as various compute-and-memory-intensive benchmarks? Given that you report the issue with a smaller number of threads on 19x19, I'm now suspecting that what's breaking for you is something to do with the size of the neural net input data crossing some threshold, or at least is stochastically triggered by it. Once you have multiple threads, the evaluations from those threads get batched whenever they send queries to the backend code at about the same time, and therefore use more space within the cl_mem buffers. With only 2 threads, though, batching can sometimes be rare: the two threads may settle into a rhythm of alternately ping-ponging on the GPU with a batch size of 1, instead of landing at the same time to make a batch of 2. For obvious reasons, a 19x19 board uses more space than a 9x9 board, so if this were the cause, one would expect 19x19 to trigger problems at a smaller batch size than 9x9.
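A small sketch of the pointer-printing distinction described above; the variable name is only a stand-in, not KataGo's actual code:

```cpp
#include <CL/cl.h>
#include <cstdio>

// "input" is just a placeholder for whichever cl_mem handle is being checked.
void printHandle(cl_mem input) {
  // Address of the local variable that stores the handle; this is what the
  // earlier debug output printed, and it isn't meaningful for comparison.
  printf("address of the variable holding the handle: %p\n", (void*)&input);

  // Numeric value of the cl_mem handle itself (a cl_mem is essentially an
  // opaque pointer); this is the value to compare across calls.
  printf("value of the cl_mem handle:                 %p\n", (void*)input);
}
```

If the second value matches the handle printed at allocation time, the call is at least receiving the same buffer object that was created.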
And in particular, I'm guessing that you're seeing this issue as soon as you hit batch size 2 (but not 1) on 19x19, or batch size 5 (but not 1-4) on 9x9. Still not sure how to fix it. Are there other applications that you think are using the GPU? Is there a command line tool that you can use on your system to see current GPU usage and the amount of GPU memory allocated by each other process that's using it, if any?
I'll double-check that OpenCL apps work fine; I think I confused OpenCL with OpenGL, which definitely works flawlessly. My verbose code did indeed print just the memory address of where the cl_mem pointer is stored; I will change that and see whether it makes a difference.
I tested the Windows binary on a separate Windows partition and everything works fine there, so the hardware is not defective. I used radeontop to check GPU usage and didn't see anything suspicious; it doesn't show per-process memory allocation, however. I also tried running the test without any instance of X11, so really nothing else should be using my GPU. Currently I am compiling luxmark to check whether that works fine as well, which should rule out (or point towards) OpenCL/mesa (hardware driver) issues. After that (or maybe during the compilation, it already takes a while) I'll adjust the verbose output and post an update.
The cast to void* did it; now the addresses match. I had some problems with luxmark, which is why I ended up with geekbench instead. The first few tests of that benchmark work, but then it also fails, with an "internal error" rather than a segfault. So probably an OpenCL problem?
I scanned the websites with OpenCL errors that you googled (especially https://forums.freebsd.org/threads/opencl-with-amd-radeon-rx580-segfaults.66789/), and the simple tools that segfault there work fine for me. But an idea arose from there: watch the kernel log while KataGo is running.

Not sure how to continue the investigation now, though.
Letting KataGo play made my PC freeze. Looking into it via the kernel log, I see GPU fault messages mentioning VM_CONTEXT1_PROTECTION_FAULT_ADDR.
Thanks for checking on the debug code and gdb. At this point it sounds like there's nothing overtly wrong in the way KataGo's side of the code is executing on your system - it's passing the right buffers, allocated with the right sizes, to a normal OpenCL routine called in a normal way.

Do you get similar GPU fault messages when you run geekbench or other intensive OpenCL applications? If so, the implication would be that something is wrong with the drivers for your GPU, or your GPU hardware, or the way the GPU driver interacts with your OS (given that you found it works on Windows), or something along those lines. Googling for "VM_CONTEXT1_PROTECTION_FAULT_ADDR" and other parts of your log gives some suggestive links, but you're probably in a better position to investigate than I am.
Thanks for your support and explanations, I appreciate it! I don't get these messages from geekbench or clpeak, though geekbench may be catching the error itself, since it fails not with a segfault but with an "internal error". clpeak runs fine; the system sometimes reacts very slowly, but there are no errors, faults, or warnings. clinfo also runs without issues, although I wouldn't call that an intensive application. As I said before, I tested KataGo with the Windows binary on the same machine and everything works as intended, with a recommended fastest numSearchThreads of 16, so I would conclude it's not a hardware defect. I think it could be OpenCL related, because everything else, including OpenGL, runs without problems; but of course it might also be driver related or OS-driver-interaction related. Indeed, I am in a better position to investigate and will try my best to dig further. I agree the issue is probably not on KataGo's side. Should I close this issue then?
You can leave this issue open if you think there's more to find and/or it would be helpful for other users to see while it remains unsolved; I don't mind. Let me know what you find, and if it does turn out that there's something I can fix in the code to improve robustness, let me know.
I couldn't find anything else, so I opened a thread on the Arch forums and got this answer: https://bbs.archlinux.org/viewtopic.php?pid=1895516#p1895516
I have been playing around a bit with KataGo and repeatedly ran into trouble when trying to increase "numSearchThreads" (or when running the benchmark). I am running an up-to-date Arch Linux and compiled KataGo from source (v1.3.5) as described in the README. I am using an AMD Radeon RX 580 GPU and an AMD Ryzen 7 1700 CPU. The issue seems to be independent of the net (tried with the current g170 extended nets b10, b15, and b20) and board size.
I attached the full gdb output with a backtrace.
gdboutput.txt
If you need more information or I can help resolve this issue, please let me know. Thanks for your effort!