-
-
Notifications
You must be signed in to change notification settings - Fork 21.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
On Windows at least building with a -j > 1 scons parameter fails with a file not found even though the file exists #5042
Comments
Can you give the build log, or at least the final part of it? |
I didn't see an obvious log file anywhere in the directory so I copied the last bit of the command-prompt output: Build line was: scons platform=windows target=release_debug -j4 One thing to note: it never happens with the same file, always a different file so it might be some kind of race condition? |
@headkase : it might be a stupid question, but does it change something if you add an space between -j and 4 ? |
Nope, just did this from a clean clone: scons -j 4 platform=windows target=release_debug Still errored on a random file. |
I compile godot with this command on
how about this? |
Nope, that errored right away too. |
And I re-ran it with Adminstrator privileges, still errored with a file not found. |
@volzhs What versions of Python, SCons, and Pywin32 do you have installed? If different than mine I'll try those. |
I can reproduce this with Python 2.7.11, SCons 2.5.0 and PyWin32 build 220, all in 64-bit, when compiling with VS 2015. |
@headkase Python 2.7.11, SCons 2.4.1 and PyWin32 build 220 |
@headkase No, it happens on 2.4.0 for me when compiling MS VisualC++ sometimes (very rarely though). It's a race condition most probably, some other compiler process is accessing/locking the same file that cannot be accessed... I have not looked more into it though, not an expert on multithreaded race conditions. I remmeber there being a setting (or scons using by default) to randomize build order so that race conditions do not happen where multiple jobs are trying to build the same file... This random building is probably the reason why sometimes we hit the race condition, and sometimes we do not. |
Well, with 2.5.0 I had 100% failure with more than one job. With 2.4.1 I had it complete successfully two times in a row. |
@headkase Random luck/unluck or 2.5.0 being really bad at randomizing build order... |
As the issue has been demonstrated to be with SCons, not Godot itself, if there is nothing more to add then it may be closed. Thank you for the help everyone. |
@Griefchief @headkase I have compiled hundreds times with this environment for months, and I never got failure compiling. |
Ah, OK, i'm wrong, SCons is not randomizing dependencies on build order... @headkase can you add --random to your scons builds in 2.5.0 and test if it will error? Also without? |
@Griefchief I have a working system now. Can you upgrade to 2.5.0 and try a multi-job compile with the --random option? Then add your results here. |
@headkase Hm, OK, I can probably do that, will be in a few hours, or a day or two though, got some work currently. Thanks. |
Additional note: Rémi Verschelde on Facebook indicated that it should still be considered a bug in the build system as SCons 2.5.0 works correctly under Linux. https://www.facebook.com/groups/godotengine/permalink/785635728239690/ So I recall the request to close this issue at this time. |
Well since the error is |
@akien-mga If there is a concrete test then I will gladly re-upgrade to 2.5.0 and build the test - that's the least I can do for the Godot project. |
It seems that MSVC has it's own multithreaded build... maybe paralel builds arent thread safe without? maybe we can use this to avoid the race condition with -j by redirecting -j into the compiler itself (if -j can be overriden)... I'm just pasting this link here for reference, i have no idea if this will aleviate the race condition... or help with speed at all... (I guess it might, a bit) |
Investigation needs to be done to determine if scons uses process forking (creates separate copied environments similar to using & in sh in linux) or threading (runs in same process thus sharing resources and all classic threading issues apply) to implement -j. My hunch is we are getting collisions on the algorithm or system call generating tempfile names because in Windows command lines need to be passed via files due to the system's limit on the length of command lines. However if scons uses threads then the potential error space expands substantially. However with processes (distinct environments) the error space is much reduced and a weak tempfile naming algorithm that generates a high number of collisions in tempfile names would generate the very race situation where a later job overwrites an (shared due to opening same file on account of naming collision) argument file with different include specs before a prior job using the same file finishes. I suspect this reason due to past experience a couple years back where the problem with building (in mingw at the time usually during links where the argument size is quite large) was due to the build process not using temporary argument files and this functionality had to be added to the project. |
Agents, I have a hunch that maybe none of us have installed pywin32 properly into python? My SCons complains about python even though i think i installed it earlier (maybe into the wrong version)... Is somebody not receiving a SCons warning related to pywin32 immediately when running a build (just after SCons stops processing SConstruct and starts to build targets)and getting these missing .h files?
|
@Griefchief I am not receiving such an error at all. I would suggest you correct your Pywin32 installation. Edit: Also, I have successfully built parallel builds about 4 more times with SCons 2.4.1. |
@headkase Thank you, this rules out Pywin32 probably... One fine folk from freenode # SCONS says that pywin32 is mostly about python file objects (generated by open() and file())... I guess it's probably not pywin32 related, we'll probably need investigate what gau-veldt mentioned. |
During the investigation I made changes to the build code (Python) to properly close all files everywhere in the Python code base by using the Python with statement for file context wherever it did not need too much refactoring. Explicit close calls are added in all other cases. Unfortunately this does not fix the original issue, but it is good practice anyway. Please find the fixes mentioned above in the PR below. Fixing this race condition on Windows by waiting for the dependencies to be actually readable would be the task of SCons, so the relevant fix should go into their code base. Based on all the investigations above I suggest closing this ticket. |
Great work tracking this down in so much detail, viktor-ferenczi! If you would like to go a little further down this rabbit hole, try running a build while using Procmon: https://docs.microsoft.com/en-us/sysinternals/downloads/procmon -- that should show you exactly why the file isn't openable. Maybe virus scanner has it open, or backup, or something else. Or maybe as you suggest, python isn't closing it. On Windows, only one process can have a file open at a time (in most situations). |
Godot build log (with timestamps):
Process Monitor events filtered on
G: is a ramdisk with no backup. G:\godot build folder was excluded from all Bitdefender Antivirus scanning, it was verified by seeing the lack of access from its BUFFER OVERFLOW is normal and harmless. Some repetitive WriteFile operations were removed for clarity. The very last QueryOpen comes from my extra check for the readability of the file. Next steps:
|
Sharing violation just means that by default, only one process can have a file open on Windows. Python still has it open (for reading), and the compiler is trying to open it. So the question is why Python (SCons) still has it open. I know SCons has had this kind of problem in the past; here's a (very!) old discussion of something quite similar: http://scons-users.scons.narkive.com/pjMTzf5v/race-condition-in-scons-scheduler - you can see it could be a number of things, from python not closing the file until it collects garbage, to a bug in the python function that's creating So if with your help we could nail this one, it would be amazing. I'm going to mention this on the SCons irc as well. One other question: I know python made significant changes to the IO system in 3.x. I wonder if this problem would go away if you run SCons with python3? Oh, in any case you should use SCons 3.0.1. It's still compatible with Python 2.7. |
I'm still running into this issue with Python 3.6.4 and SCons 3.0.1 on Windows 10. |
Fix for that was in my PR, thanks for merging it. I use latest master for my tests.
Python 2.7.14 is used in my tests to avoid any such issues. I think it is best to separate the problems as much as possible. When we nail this with Python 2.7, then I will give it a test on latest 3.x as well. |
This time it happened with: Build log (with scons log timestamp hack):
Process monitor events:
CL.EXE experiences the SHARING VIOLATION about 700ms later than Python closed the file. Time difference should not even count, only the order of events is important here.
Why on Earth cmd.exe is keeping the splash.gen.h file open? Also, there is no CreateFile event for
TortoiseGit's cache seems to find it as well, but that happens only after Next step is to figure whether TortoiseGit's cache have anything to do with it. |
Turned off both TortoiseHg's and TortoiseGit's icon overlay cache and verified that their processes are not running anymore. Repeated the test and it failed with splash.gen.h again the exact same way. So the icon overlay caches did not cause the issue. |
Interesting! How is cmd.exe even involved here... and why is there no CreateFile. interesting questions. SCons will try to remove the old version before building it, but I'm pretty sure that's done directly via os.unlink(). Also note SCons opened (CREATE_FILE) it with ShareMode read and write. A bit unusual, but OK. And also maybe why it worked at all, if cmd.exe had it open already. If you still have the whole procmon log, maybe seeing how cmd.exe (pid 3956 there) got started would give a hint. Was it started from python/SCons? Must have been, I'd think. SCons will run cmd.exe to execute shell commands. --taskmastertrace=- might show something. |
Before each test I clean the build tree by running
That's caused by some extra code I hacked directly into SCons to explicitly verify that all target files generated can be opened and fully read. This is why Also, if I run the exact same command line after the build fails, then CL.EXE can successfully read the file. So it might be something with the parent process (SCons's python.exe or some intermediate cmd.exe). It more and more looks like a generic SCons issue on Windows. |
One would assume it's because of cmd.exe having it open. But that would imply that whatever cmd.exe did to open it failed to show up in the procmon log. And it would imply that cmd.exe didn't have it open when SCons/python was generating it. (Oh, maybe you should run procmon as Adminstrator, if you're not already. Probably not the smoking gun, but best to be sure.) |
Here is how SCons is apparently invoking CL.EXE using a modified (wrapped) version of See Pretty printed (space joined) command line of the failing CL.EXE invocation:
This is originally passed as a list, which would be unreadable for us humans. Please note that it makes quoting incorrect above. Invoking this same command line after the build does succeed, even with wrong quoting. Since no generated header file is listed on the command line Apparently cmd.exe keeps its working directory ( |
Summary on all the workarounds attempted: Changing the execution environment
Changing SCons source code
Changing Godot build scripts
Both So files written cannot be renamed right after closing them. I think this is a step in the right direction to figure out what's going on. |
That sounds right, in terms of how SCons invokes subprocesses. But as you say, it's unlikely that cmd.exe is the one opening the generated file. It just gets /c "some command" and executes it. I wonder why this never happens for me? I have a big AMD CPU and always build with -j30 (it's very nice), and haven't seen this happen. I'm using Win10, SSD, VS2017, python 3.6.3, SCons master branch from January. I'm building in my C drive, you're on G: -- probably not relevant, but who knows... |
Here's one thing to try: |
Thank you for the ideas. I'm trying to use If that does not work, then we can try to run all of our Python build (make_*) functions in subprocesses, which would effectively decouple them from the SCons process making inheriting anything from there impossible. Based on former tickets this approach worked in case of shaders. AMD Ryzen 7 1800X CPU here with 8 physical cores + HT, so the -j16 for me. Could be -j32 due to I/O wait, tough. |
HANDLE_FLAG_INHERIT seems to work. Testing it in a loop many times, then creating a PR. Higher build parallelism seems to decrease the likelihood of this issue. Reason can be the extra waiting time introduced between subsequent tasks by too much parallelism. Under-utilizing the CPU seems to maximize the likelihood of this issue. Going down to -j1 seems to decrease the probability, tough. The less time is spent between finishing the Python generator and starting the CL.EXE compiler the more likely this problem is. It happens for sure if the gap is less than a few seconds. There seem to be a delay for closing files, even if we do it properly in Python. |
The fix is not reliable, unfortunately. I was just lucky while testing the first 2-3 times, likely because of trying to have -j32 parallelism. Now it is running in a loop with -j16 and the build failed 20 out of 20 times. Next: Trying to run all generator functions in their own subprocesses. |
Running all builders (make_* functions) in subprocesses works well. I've seen this suggestion way above in the chat, actually. Too bad we don't know the exact reason for this problem, but at least there is a reliable workaround now. Running 20-30 test builds, expect 100% success rate. I will also try with -j1 and -j32. Then a PR will follow. |
Build benchmark Average successful build times in seconds by build parallelism (-j option):
All builds started from clean state and running on a ramdisk. Build machine |
" Running all builders (make_* functions) in subprocesses works well. I've
seen this suggestion way above in the chat, actually. Too bad we don't
know the exact reason for this problem "
Reading through your work today has given me somewhat of an inspiration
today on the reason... I remember using sleep() and fixing issues that way
while testing... the 7mb file (the ssl certs one) needs to be slept for
about a second or 2 to solve the problem, compared to a really low time for
the smaller files... I don't know enough about python&win internals but I
think the reason the 7mb file gets released much slower is that it's
possibly still being written to the disk by the filesystem/driver, and is
therefore not available for consumption by another requester... Python
releases it's handle too soon( or too late/never, whichever the reason is),
or it simply informs scons too quickly for some reason or other and scons
can't then know that the build will fail when another process requests the
(bigger file)...
So, anyway, the reason why it was random failing for you is that that
"release" time (wherever it is being released) is sometimes 250ms sometimes
faster, sometimes slower, depending on the file (size) etc, so if scons is
"too fast" via another thread (within the xxx ms) compiler just won't see
it when it tries to access it... Sub-process forces a close on the handles
when it is quiting itself, or the python (sub)proces simply does not close
until it's sure that it has actually released the handle and therefore
avoids the problem? I know this isn't an actuall answer, sorry about that,
I'm just giving ideas and my exp with this stuff into print.
Also, I just wanted thank you for busting your head on this stuff, I wasn't
able to work on Godot internals since the last time I commented above, a
year or so ago. With this out of the way I can probably move on other
interesting stuff with the vsproj generation, like the paths seemingly not
being inputed properly into it, so VS can't find many of the files (the end
of the SConstruct file puts those paths in when it generates the
projects)... If that wasn't already fixed... Or who knows, maybe even Godot
internals.
I just wanted to thank you for "unblocking" me (mentally), in a way, to
mess with other Godot's stuff :) I'm glad to see somebody fixed this before
me.
…On Sun, Mar 18, 2018 at 12:51 AM, Viktor Ferenczi ***@***.***> wrote:
*Build benchmark*
Average successful build times in seconds by build parallelism (-j option):
-j1: 794s
-j2: 416s
-j4: 228s
-j8: 166s
-j16: 154s
-j32: 152s
All builds started from clean state and running on a ramdisk.
Cleaning was not counted for the build time.
*Build machine*
CPU: Ryzen 7 1800X
RAM: 32GB 2400MHz (suboptimal, I know)
Motherboard: ASUS ROG Crosshair VI Hero, Socket-AM4
SSD: Samsung 850 Pro 1TB (SATA)
OS: Windows 10 Pro 64bit ENU
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#5042 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AFfbCCoRXtxtuld5tyvlh8S7FCyf3QTMks5tfaGAgaJpZM4IuTZH>
.
--
Thanks,
Aleksandar Danilovic.
|
Take a look at this: (Python's implementation of close) Note the Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS around the actual C call to close() :
My belief is that the GIL gets released while the file is closing, and randomly the written (or copied) file get's open()'d either by python or a subprocess before the close() call actually completes.. We've had patches and SCons users swapping to use win32's FileCopy API instead of shutil.copy() with good results. In those cases users were mainly seeing file handle open issue with files pulled in from a CacheDir and/or being installed or Copy()'d by scons. Again my hypothesis would hold as the FileCopy call via ctypes (or pywin32) would hold the GIL until it returns which shouldn't happen until the file copy has completed. |
Following #20373 I'm not sure whether this issue is completely solved, but maybe it is effectively so rare now as to not be a problem. However I have some potentially relevant info. Following my work adding unity build #13096 option to Godot, I was getting these include errors for the generated files each clean / rebuild cycle. Whether it was the same mechanism for the error I don't know however, I can show how I cured it in the unity build... Using the SCsub in main as an example, currently my replacement is:
new:
That is, as I understand it, instead of relying on Scons to find the dependency information (and determine the build order), I am explicitly saying that these generated files are needed as a dependency before SCU_main.cc is compiled. I am of course assuming that we were currently relying on Scons to determine build order without hints such as this. As to why it was occurring more in the unity build, I don't know (perhaps the number of files in the compilation unit were overwhelming the parser Scons uses to auto detect dependencies), or it was limiting the multithreading of the compilation. This leads me to suspect that the same problem may be existing in the main build (Scons not knowing order of dependencies) but it is being masked from showing up in most circumstances, and only showing up 'randomly' on some people's machines. If this is the case then the same steps to fix the unity build might fix the main build. Other dependencies needed were:
In modules/gdnative:
I think those were the main ones, I may have missed a couple. To modify this for the main build you might need to change the Depends sections so that the targets are cpp in the main build, or some approach like this. Does this sound a plausible alternative explanation (in addition to the file locking thing)? If so I can have a think about how to apply the same dependencies to the main build. Has anyone still been having this problem since #17595 ? |
@lawnjelly - This shouldn't be necessary unless the names of the header files are defined via a preprocessor value like this:
My first guess would one of the following conditions apply:
In general if you have to add a Depends() for a header file something is wrong. This is not expected when using SCons. BTW. You don't specify order to SCons, you specify dependencies and SCons determines order using that. |
Ah, then ignore me, this may be what is causing the issue on the unity build. I am doing exactly this (for easy changes, the path is currently specified in a macro). However it might potentially make things easier if I can remove the Depends() by specifying non-macrod paths, I will try and test this. 👍 |
Operating system or device - Godot version:
Windows 10 Pro 64-bit, Godot source 64-bit build (tested with master branch)
Visual Studio 2015 Community, Windows 10 SDK Kit 10586.
Python 2.7.11, Scons 2.5.0, and Pywin32 220 all these tools are 32-bit.
Issue description (what happened, and what was expected):
When I issue the command:
scons platform=windows target=release_debug -jx
Where "x" is a number > 1 then the build process will eventually give a file not found error. Going to where it says the file is not found it is present. Specifying -j1 on the build line works fine.
Steps to reproduce:
scons platform=windows target=release_debug -j2
The text was updated successfully, but these errors were encountered: