Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delete versions OOM fix #277

Merged
merged 1 commit into from
Oct 6, 2023

Conversation

peytondmurray
Copy link
Collaborator

@peytondmurray peytondmurray commented Oct 1, 2023

This PR avoids writing a _tmp_raw_data array in calls to _recreate_raw_data, preventing an allocation that in the past could consume lots of memory, causing delete_versions to fail. Closes #273.

NOTE: This is a WIP PR - right now I'm seeing a really weird issue with the tests in test_replay.py where calls to np.testing.assert_equal are failing on tests which have had deleted versions. I'm guessing this is some issue flushing to disk; creating this PR here for further investigation.

Okay, confirmed that closing and opening the file objects in the offending tests seems to resolve the issue. Will look for a way to flush output to disk to close this out.


The intermittent test failures are indeed resolved by calling gc.collect at the end of delete_versions. Clearly this isn't ideal, but at the moment this seems like a functioning approach at least.

Changes

  • _recreate_raw_data now directly writes the recreated raw data to the file. Before, we were writing to a numpy array, then an intermediate _tmp_raw_data dataset, then we'd move the _tmp_raw_data to raw_data to replace the original raw data. This should resolve any issues related to memory usage, and will likely also speed up delete_versions overall, although we now pay an additional garbage collection penalty.
  • Removed the posixpath alias, which was pp and thus would interfere with debugger pp pretty-print commands, in.

@peytondmurray
Copy link
Collaborator Author

peytondmurray commented Oct 5, 2023

So a little more information about this, since it's still a mystery why some of the tests are intermittently failing on this branch:

I've got the right chunks being written to the right place, but I think I've uncovered a bug somewhere in the wrappers. On this line of InMemoryGroup.__getitem__, if you index the entire array before returning the result, the tests pass. In other words, changing this

        if name in self._data:
            return self._data[name]

to

        if name in self._data:
            self._data[name][:]
            return self._data[name]

causes all the intermittently failing tests to pass. This is incredibly weird, but points to some issue with caching. I'll keep investigating here.

@ArvidJB
Copy link
Collaborator

ArvidJB commented Oct 5, 2023

[...] This is incredibly weird, but points to some issue with caching. I'll keep investigating here.

Thanks for keeping us posted @peytondmurray. I agree, this looks like some issue with caching.

If you look at some of the other issues in the repo you can find that we ran into similar issues like this before without figuring out what was happening, like this one #216, or this one #162. We "fixed" the problem by calling gc.collect, does that help here?

@peytondmurray
Copy link
Collaborator Author

This could definitely be a gc issue given how random it feels, so let me try doing this.

@peytondmurray peytondmurray changed the title [WIP] Delete versions OOM fix Delete versions OOM fix Oct 5, 2023
@peytondmurray peytondmurray merged commit ac969b4 into deshaw:master Oct 6, 2023
7 checks passed
@peytondmurray peytondmurray deleted the delete-versions-oom-fix branch October 6, 2023 04:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

delete_versions: use less memory and gracefully handle failure (PyInf#11013)
2 participants