Improve Windows compatibility (safe relative paths, no float128). #777
Conversation
… of numpy. Instead, this is covered by the system-independent `longdouble` where supported.
for more information, see https://pre-commit.ci
@@ -13,7 +14,7 @@
from signac.synced_collections.numpy_utils import NumpyConversionWarning

PYPY = "PyPy" in platform.python_implementation()
WINDOWS = "win" in platform.python_implementation()
This was a bug -- it was checking "win" in "CPython". 😛
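For reference, here is a minimal illustration of the mix-up (an editorial sketch, not code from the PR): platform.python_implementation() names the interpreter, not the operating system, so substring-matching "win" against it can never detect Windows.

```python
import platform
import sys

# platform.python_implementation() returns the interpreter name,
# e.g. "CPython" or "PyPy" -- it says nothing about the OS.
buggy_windows = "win" in platform.python_implementation()

# An OS check should consult sys.platform instead,
# which is "win32" on Windows:
windows = sys.platform.startswith("win")
```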
Apparently the WINDOWS flag was wrong because of that bug, and failures were being hidden on other platforms too: https://app.circleci.com/pipelines/github/glotzerlab/signac/3039/workflows/bea29ea7-81f8-4026-a303-27ba8cfac543/jobs/21272
@vyasr Until we resolve the underlying problems, I would vote to always skip the multithreaded tests rather than claim it's just a Windows problem (we know it sporadically appears elsewhere too). We haven't had any success reproducing the threading problems except on Windows and on various CI platforms. If it would help, we could try to debug it together on my Windows machine.
*sigh*
As a Windows user, are you able to reproduce this issue on any platform locally? At one point I was occasionally able to do so using PyPy, but never using CPython (including different versions) on Linux or Mac. Is the problem related to the use of pytest-xdist on CI? I know that `pytest -n *` works fine when running signac tests locally on every machine that I have tried.
At this point I'm convinced that there is a real underlying bug somewhere in the multithreading support (unless it's a bug in the tests), but I have struggled enough to get a consistent reproducer that I have stopped pursuing it. In the past, SSHing into failing CI builds has also failed to provide a consistent repro. I suppose I can try that again with one of these images and see if that gets me anywhere at some point.
Yeah, disabling the xfail and running `pytest -k multithreaded -x` (no pytest-xdist) gets it to fail on my Windows machine. However, it may or may not be the same failure as on Linux. Sometimes the failure on Linux / PyPy is in checking the length here:
assert len(synced_collection) == num_threads
However, the failure I get locally on Windows is a permissions failure here:
list(executor.map(set_value, [synced_collection] * num_threads * 10))
They could be related, but I'm not sure. It's just a buttery, flaky test.
Full Error Log from Windows
============================= test session starts =============================
platform win32 -- Python 3.10.5, pytest-7.1.2, pluggy-1.0.0
rootdir: \\wsl.localhost\Ubuntu\home\bdice\code\signac, configfile: setup.cfg
plugins: anyio-3.6.1, forked-1.4.0, xdist-2.5.0
collected 6582 items / 6513 deselected / 69 selected
tests\test_synced_collections\test_json_buffered_collection.py ......... [ 13%]
...............F
================================== FAILURES ===================================
_____________ TestBufferedJSONDictWriteConcern.test_multithreaded _____________
self = <test_json_buffered_collection.TestBufferedJSONDictWriteConcern object at 0x000001B3D9D44DF0>
synced_collection = {'ThreadPoolExecutor-31_31': 'ThreadPoolExecutor-31_31'}
@pytest.mark.xfail(
False,
reason=(
"This test sometimes fails. This may indicate a race condition. "
"The test fails more consistently on Windows but also appears on "
"Linux in CI."
),
)
def test_multithreaded(self, synced_collection):
"""Test multithreaded runs of synced dicts."""
if not type(synced_collection)._supports_threading:
return
from concurrent.futures import ThreadPoolExecutor
from json.decoder import JSONDecodeError
from threading import current_thread
def set_value(sd):
sd[current_thread().name] = current_thread().name
num_threads = 50
with ThreadPoolExecutor(max_workers=num_threads) as executor:
list(executor.map(set_value, [synced_collection] * num_threads * 10))
assert len(synced_collection) == num_threads
# Now clear the data and try again with multithreading disabled. Unless
# we're very unlucky, some of these threads should overwrite each
# other.
type(synced_collection).disable_multithreading()
synced_collection.clear()
try:
with ThreadPoolExecutor(max_workers=num_threads) as executor:
> list(executor.map(set_value, [synced_collection] * num_threads * 10))
tests\test_synced_collections\synced_collection_test.py:490:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
C:\Users\bdice\mambaforge\lib\concurrent\futures\_base.py:609: in result_iterator
yield fs.pop().result()
C:\Users\bdice\mambaforge\lib\concurrent\futures\_base.py:439: in result
return self.__get_result()
C:\Users\bdice\mambaforge\lib\concurrent\futures\_base.py:391: in __get_result
raise self._exception
C:\Users\bdice\mambaforge\lib\concurrent\futures\thread.py:58: in run
result = self.fn(*self.args, **self.kwargs)
tests\test_synced_collections\synced_collection_test.py:474: in set_value
sd[current_thread().name] = current_thread().name
c:\users\bdice\code\signac\signac\synced_collections\data_types\synced_dict.py:182: in __setitem__
with self._load_and_save, self._suspend_sync:
c:\users\bdice\code\signac\signac\synced_collections\buffers\file_buffered_collection.py:71: in __exit__
super().__exit__(exc_type, exc_val, exc_tb)
c:\users\bdice\code\signac\signac\synced_collections\data_types\synced_collection.py:48: in __exit__
self._collection._save()
c:\users\bdice\code\signac\signac\synced_collections\buffers\buffered_collection.py:121: in _save
self._save_to_resource()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = {'ThreadPoolExecutor-31_31': 'ThreadPoolExecutor-31_31'}
def _save_to_resource(self):
"""Write the data to JSON file."""
# Serialize data
blob = json.dumps(self, cls=SyncedCollectionJSONEncoder).encode()
# When write_concern flag is set, we write the data into dummy file and then
# replace that file with original file. We also enable this mode
# irrespective of the write_concern flag if we're running in
# multithreaded mode.
if self._write_concern or type(self)._threading_support_is_active:
dirname, filename = os.path.split(self._filename)
fn_tmp = os.path.join(dirname, f"._{uuid.uuid4()}_{filename}")
with open(fn_tmp, "wb") as tmpfile:
tmpfile.write(blob)
> os.replace(fn_tmp, self._filename)
E PermissionError: [WinError 5] Access is denied: 'C:\\Users\\bdice\\AppData\\Local\\Temp\\pytest-of-bdice\\pytest-18\\test_multithreaded6\\._32b91d8c-94f7-4e89-87dd-36a23b33d4bc_test.json' -> 'C:\\Users\\bdice\\AppData\\Local\\Temp\\pytest-of-bdice\\pytest-18\\test_multithreaded6\\test.json'
c:\users\bdice\code\signac\signac\synced_collections\backends\collection_json.py:264: PermissionError
=========================== short test summary info ===========================
FAILED tests/test_synced_collections/test_json_buffered_collection.py::TestBufferedJSONDictWriteConcern::test_multithreaded
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!
=============== 1 failed, 24 passed, 6513 deselected in 44.00s ================
I suppose it's possible that your failure is due to the same underlying issue; perhaps a move of a nonexistent file is being attempted because another thread has already moved the same file? I don't know enough about Windows error codes to know all the ways that Windows could fail. While in theory, IIRC, os.replace is the "safest" way to do this on Windows, maybe there are still issues with it. We could look into it a bit more and see, but I'd probably need you to do a bit more legwork here given my limited familiarity with Windows.
The PyPy/Linux errors, though, are clear indications that something is wrong, because they demonstrate that the exact invariant being tested is violated. Those are the errors that I unfortunately haven't been able to reproduce on any machine I've had access to. If we could find a way to reproduce that failure reliably on some machine, I would be happy to look into it further.
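If the PermissionError turns out to be a transient sharing violation (another thread, or something like an antivirus scanner, briefly holding the target file open), a retry loop around os.replace is a common workaround on Windows. A hypothetical sketch, not signac's actual code:

```python
import os
import time


def replace_with_retry(src, dst, attempts=5, delay=0.01):
    """os.replace with retries, to ride out a transient PermissionError
    on Windows when another process briefly holds the target file open.
    """
    for attempt in range(attempts):
        try:
            os.replace(src, dst)
            return
        except PermissionError:
            if attempt == attempts - 1:
                raise  # give up after the final attempt
            time.sleep(delay * (2 ** attempt))  # exponential backoff
```

Whether this masks or fixes the underlying race is exactly the open question; on POSIX the plain os.replace is already atomic, so the retry only matters on Windows.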
Could you open a PR where you reenable this test after the GHA switch? Then we will at least have something open to track this, and a way to test whether this issue recurs.
OK, so basically any and all multithreaded tests of synced collections seem like they could fail?
That's my take, yes. I saw failures on tests like:
FAILED tests/test_synced_collections/test_json_buffered_collection.py::TestBufferedJSONDictWriteConcern::test_multithreaded_buffering_setitem
FAILED tests/test_synced_collections/test_json_buffered_collection.py::TestBufferedJSONDictWriteConcern::test_multithreaded_buffering_update
FAILED tests/test_synced_collections/test_json_buffered_collection.py::TestBufferedJSONDictWriteConcern::test_multithreaded_buffering_reset
FAILED tests/test_synced_collections/test_json_buffered_collection.py::TestBufferedJSONDictWriteConcern::test_multithreaded_buffering_clear
FAILED tests/test_synced_collections/test_json_buffered_collection.py::TestBufferedJSONDictWriteConcern::test_multithreaded_buffering_load
and decided it wasn't worth chasing each one individually: I xfailed all tests with "multithreaded" in the name.
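One way to apply that blanket xfail without decorating every test is a collection hook in conftest.py; a hypothetical sketch of the name-based approach described above (the PR itself may simply use per-test decorators):

```python
def is_flaky(test_name):
    """Name-based filter matching the flaky tests listed above."""
    return "multithreaded" in test_name


def pytest_collection_modifyitems(config, items):
    """Mark every collected test whose name mentions "multithreaded"
    as xfail instead of editing each test function.
    """
    # Imported here so the name filter above stays dependency-free.
    import pytest

    for item in items:
        if is_flaky(item.name):
            item.add_marker(
                pytest.mark.xfail(
                    reason="Known flaky multithreading test; "
                    "possible race condition (see discussion).",
                    strict=False,
                )
            )
```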
OK I guess we can try to investigate this again at some point. I anticipate having access to a Windows system soonish, which would help a ton.
Description
Fixes a bug where relative paths crossing drives cause failures on Windows. Example:
signac uses relative paths in several places for user-readable data like log files and reprs. If os.path.relpath fails, we should just show the absolute path that is held internally. That's what I implemented in `_safe_relpath`.

Additionally, this removes explicit testing of `numpy.float128` from synced collections because it's not available on Windows builds of NumPy. The underlying float128 type is already tested on Linux/macOS through other tests that use `numpy.longdouble`, which is defined differently on UNIX-like platforms (float128) than on Windows (float64). cc: @vyasr

Motivation and Context
I found this bug while working on #776.
It's not very straightforward to add comprehensive tests for this kind of bug. I was able to test it locally by running the tests from a D:\ drive while the temp folders used for testing were on C:\.

Checklist: