Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Windows compatibility (safe relative paths, no float128). #777

Merged
merged 11 commits into from
Jul 5, 2022

Conversation

bdice
Copy link
Member

@bdice bdice commented Jun 19, 2022

Description

Fixes a bug where relative paths crossing drives cause failures on Windows. Example:

>>> # Starting from a current working directory in C:\Users\bdice
>>> import signac
>>> h = signac.H5Store(r"C:\data.h5")
>>> repr(h)
"H5Store(filename='..\\\\..\\\\data.h5')"
>>> h = signac.H5Store(r"D:\data.h5")
>>> repr(h)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\users\bdice\code\signac\signac\core\h5store.py", line 341, in __repr__
    type(self).__name__, repr(os.path.relpath(self._filename))
  File "C:\Users\bdice\mambaforge\lib\ntpath.py", line 703, in relpath
    raise ValueError("path is on mount %r, start on mount %r" % (
ValueError: path is on mount 'D:', start on mount 'C:'

signac uses relative paths in several places for user-readable data like log files and reprs. If os.path.relpath fails, we should just show the absolute path that is held internally. That's what I implemented in _safe_relpath.

Additionally, this removes explicit testing of numpy.float128 from synced collections because it's not available on Windows builds of NumPy. The underlying float128 type is already tested on Linux/macOS through other tests that use numpy.longdouble. That value is defined differently on UNIX-like platforms (float128) than on Windows (float64). cc: @vyasr

Motivation and Context

I found this bug while working on #776.

It's not very straightforward to add comprehensive tests for this kind of bug. I was able to test it locally by running the tests from a D:\ drive while the temp folders used for testing were on C:\.

Checklist:

… of numpy.

Instead, this is covered by the system-independent `longdouble` where supported.
@bdice bdice changed the title Use safe relative paths for Windows compatibility. Improve Windows compatibility (safe relative paths, no float128). Jun 19, 2022
@@ -13,7 +14,7 @@
from signac.synced_collections.numpy_utils import NumpyConversionWarning

PYPY = "PyPy" in platform.python_implementation()
WINDOWS = "win" in platform.python_implementation()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a bug -- it was checking "win" in "CPython". 😛

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently this was getting WINDOWS = True because of that bug, and hiding failures on other platforms too: https://app.circleci.com/pipelines/github/glotzerlab/signac/3039/workflows/bea29ea7-81f8-4026-a303-27ba8cfac543/jobs/21272

@vyasr Until we resolve the underlying problems, I would vote to always skip the multithreaded tests rather than claim it's just a Windows problem (we know it's sporadically appearing elsewhere too). We haven't had success in reproducing threading problems aside from Windows and on various platforms in CI. If it would help, we could try to debug it together on my Windows machine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*sigh

As a Windows user, are you able to reproduce this issue on any platform locally? At one point I was occasionally able to do so using Pypy, but never using CPython (including different versions) on Linux or Mac. Is the problem related to the use of pytest-xdist on CI? I know that pytest -n * works fine when running signac tests locally on every machine that I have tried.

At this point I'm convinced that there is a real underlying bug somewhere in the multithreading support (unless it's a bug in the tests), but I have struggled enough to get a consistent reproducer that I have stopped pursuing it. In the past, SSHing into failing CI builds has also failed to provide a consistent repro. I suppose I can try that again with one of these images and see if that gets me anywhere at some point.

Copy link
Member Author

@bdice bdice Jun 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, disabling the xfail and running > pytest -k multithreaded -x (no pytest-xdist) gets it to fail on my Windows machine. However, it may or may not be the same failure as on Linux. Sometimes the failure on Linux / PyPy is in checking the length here:

assert len(synced_collection) == num_threads

However, the failure I get locally on Windows is a permissions failure here:

list(executor.map(set_value, [synced_collection] * num_threads * 10))

They could be related, but I'm not sure. It's just a buttery, flaky, test.

Full Error Log from Windows
============================= test session starts =============================
platform win32 -- Python 3.10.5, pytest-7.1.2, pluggy-1.0.0
rootdir: \\wsl.localhost\Ubuntu\home\bdice\code\signac, configfile: setup.cfg
plugins: anyio-3.6.1, forked-1.4.0, xdist-2.5.0
collected 6582 items / 6513 deselected / 69 selected

tests\test_synced_collections\test_json_buffered_collection.py ......... [ 13%]
...............F

================================== FAILURES ===================================
_____________ TestBufferedJSONDictWriteConcern.test_multithreaded _____________

self = <test_json_buffered_collection.TestBufferedJSONDictWriteConcern object at 0x000001B3D9D44DF0>
synced_collection = {'ThreadPoolExecutor-31_31': 'ThreadPoolExecutor-31_31'}

    @pytest.mark.xfail(
        False,
        reason=(
            "This test sometimes fails. This may indicate a race condition. "
            "The test fails more consistently on Windows but also appears on "
            "Linux in CI."
        ),
    )
    def test_multithreaded(self, synced_collection):
        """Test multithreaded runs of synced dicts."""
        if not type(synced_collection)._supports_threading:
            return
    
        from concurrent.futures import ThreadPoolExecutor
        from json.decoder import JSONDecodeError
        from threading import current_thread
    
        def set_value(sd):
            sd[current_thread().name] = current_thread().name
    
        num_threads = 50
        with ThreadPoolExecutor(max_workers=num_threads) as executor:
            list(executor.map(set_value, [synced_collection] * num_threads * 10))
    
        assert len(synced_collection) == num_threads
    
        # Now clear the data and try again with multithreading disabled. Unless
        # we're very unlucky, some of these threads should overwrite each
        # other.
        type(synced_collection).disable_multithreading()
        synced_collection.clear()
    
        try:
            with ThreadPoolExecutor(max_workers=num_threads) as executor:
>               list(executor.map(set_value, [synced_collection] * num_threads * 10))

tests\test_synced_collections\synced_collection_test.py:490: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
C:\Users\bdice\mambaforge\lib\concurrent\futures\_base.py:609: in result_iterator
    yield fs.pop().result()
C:\Users\bdice\mambaforge\lib\concurrent\futures\_base.py:439: in result
    return self.__get_result()
C:\Users\bdice\mambaforge\lib\concurrent\futures\_base.py:391: in __get_result
    raise self._exception
C:\Users\bdice\mambaforge\lib\concurrent\futures\thread.py:58: in run
    result = self.fn(*self.args, **self.kwargs)
tests\test_synced_collections\synced_collection_test.py:474: in set_value
    sd[current_thread().name] = current_thread().name
c:\users\bdice\code\signac\signac\synced_collections\data_types\synced_dict.py:182: in __setitem__
    with self._load_and_save, self._suspend_sync:
c:\users\bdice\code\signac\signac\synced_collections\buffers\file_buffered_collection.py:71: in __exit__
    super().__exit__(exc_type, exc_val, exc_tb)
c:\users\bdice\code\signac\signac\synced_collections\data_types\synced_collection.py:48: in __exit__
    self._collection._save()
c:\users\bdice\code\signac\signac\synced_collections\buffers\buffered_collection.py:121: in _save
    self._save_to_resource()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = {'ThreadPoolExecutor-31_31': 'ThreadPoolExecutor-31_31'}

    def _save_to_resource(self):
        """Write the data to JSON file."""
        # Serialize data
        blob = json.dumps(self, cls=SyncedCollectionJSONEncoder).encode()
        # When write_concern flag is set, we write the data into dummy file and then
        # replace that file with original file. We also enable this mode
        # irrespective of the write_concern flag if we're running in
        # multithreaded mode.
        if self._write_concern or type(self)._threading_support_is_active:
            dirname, filename = os.path.split(self._filename)
            fn_tmp = os.path.join(dirname, f"._{uuid.uuid4()}_{filename}")
            with open(fn_tmp, "wb") as tmpfile:
                tmpfile.write(blob)
>           os.replace(fn_tmp, self._filename)
E           PermissionError: [WinError 5] Access is denied: 'C:\\Users\\bdice\\AppData\\Local\\Temp\\pytest-of-bdice\\pytest-18\\test_multithreaded6\\._32b91d8c-94f7-4e89-87dd-36a23b33d4bc_test.json' -> 'C:\\Users\\bdice\\AppData\\Local\\Temp\\pytest-of-bdice\\pytest-18\\test_multithreaded6\\test.json'

c:\users\bdice\code\signac\signac\synced_collections\backends\collection_json.py:264: PermissionError
=========================== short test summary info ===========================
FAILED tests/test_synced_collections/test_json_buffered_collection.py::TestBufferedJSONDictWriteConcern::test_multithreaded
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!
=============== 1 failed, 24 passed, 6513 deselected in 44.00s ================

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose that it's possible that your failure is due to the same underlying issue, perhaps somehow a nonexistent file move is being attempted because another thread has moved the same file already? I don't know enough about Windows error codes to know all the ways that Windows could fail. While in theory IIRC os.replace is the "safest" way to do this on Windows maybe there are still issues with it. We could look into it a bit more and see, but I'd probably need you to do a bit more legwork here given my limited familiarity with Windows.

The Pypy/Linux errors, though, are clear indications that something is wrong because they are demonstrating that the exact invariant being tested is violated. Those are the errors that I unfortunately haven't been able to reproduce on any machine that I've had access to. If we could find a way to reproduce that test reliably on some machine I would be happy to look into it further.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you open a PR where you reenable this test after the GHA switch? Then we will at least have something open to track this, and a way to test whether this issue recurs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was even more aggressive in #776 (now merged) and had to disable more tests because of failures. See CI failures on faff779 and xfails added in 360b898.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so basically any and all multithreaded tests of synced collections seem like they could fail?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's my take, yes. I saw failures on tests like:

FAILED tests/test_synced_collections/test_json_buffered_collection.py::TestBufferedJSONDictWriteConcern::test_multithreaded_buffering_setitem
FAILED tests/test_synced_collections/test_json_buffered_collection.py::TestBufferedJSONDictWriteConcern::test_multithreaded_buffering_update
FAILED tests/test_synced_collections/test_json_buffered_collection.py::TestBufferedJSONDictWriteConcern::test_multithreaded_buffering_reset
FAILED tests/test_synced_collections/test_json_buffered_collection.py::TestBufferedJSONDictWriteConcern::test_multithreaded_buffering_clear
FAILED tests/test_synced_collections/test_json_buffered_collection.py::TestBufferedJSONDictWriteConcern::test_multithreaded_buffering_load

and decided it wasn't worth trying anything with "multithreaded" in the name. I xfailed all tests named with "multithreaded."

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I guess we can try to investigate this again at some point. I anticipate having access to a Windows system soonish, which would help a ton.

@bdice bdice self-assigned this Jun 19, 2022
@bdice bdice added the bug Something isn't working label Jun 19, 2022
@bdice bdice added this to the v1.8.0 milestone Jun 19, 2022
@bdice bdice marked this pull request as ready for review June 20, 2022 00:02
@bdice bdice requested review from a team as code owners June 20, 2022 00:02
@bdice bdice requested review from vyasr and Charlottez112 and removed request for a team June 20, 2022 00:02
@bdice bdice mentioned this pull request Jun 20, 2022
14 tasks
@bdice bdice merged commit a9c75e0 into master Jul 5, 2022
@bdice bdice deleted the safe-relpath branch July 5, 2022 01:39
@vyasr vyasr mentioned this pull request Nov 3, 2022
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants