Optimize job __hash__ and __eq__ checks. #442
I made a mistake and pushed the changelog update to
…oblems if two projects share the same job but have different working directories.
Codecov Report
@@           Coverage Diff           @@
##           master     #442   +/-   ##
=======================================
  Coverage   76.98%   76.98%
=======================================
  Files          42       42
  Lines        5704     5704
  Branches     1112     1112
=======================================
  Hits         4391     4391
  Misses       1029     1029
  Partials      284      284
I had to update one test file to accommodate this change. I believe the guarantees about job hashes that were previously tested were too strong, and were only true because of the implementation details of
Looking at this PR now, but how were you able to push to master? I thought we had the master branch locked down.
@@ -655,7 +655,6 @@ def test_job_move(self):
         job = project_a.open_job(dict(a=0))
         job_b = project_b.open_job(dict(a=0))
         assert job != job_b
-        assert hash(job) != hash(job_b)
So this test fails under the new changes, since it only passed before because the working directories for `job` and `job_b` were different.
Yes, this fails because of the relaxed definition of the hash function. `job == job_b` implies that `hash(job) == hash(job_b)`, but the logical inverse (`job != job_b` ⇒ `hash(job) != hash(job_b)`) does not hold. In the Python data model, it is fine for objects to be non-equal and have the same hash.
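To make the point about the data model concrete, here is a minimal sketch. The `Job` class below is a hypothetical toy stand-in (an id plus a workspace path), not signac's actual implementation, and the ids and paths are made up for illustration.

```python
class Job:
    """Toy stand-in for a job: an id plus a workspace path (hypothetical)."""

    def __init__(self, job_id, workspace):
        self.id = job_id
        self.workspace = workspace

    def __hash__(self):
        # Relaxed hash: depends on the id only, not on the workspace.
        return hash(self.id)

    def __eq__(self, other):
        if not isinstance(other, Job):
            return NotImplemented
        # Equality is stricter than the hash: the workspace must match too.
        return self.id == other.id and self.workspace == other.workspace


job = Job("abc123", "/project_a/workspace/abc123")
job_b = Job("abc123", "/project_b/workspace/abc123")

assert job != job_b              # different workspaces, so not equal
assert hash(job) == hash(job_b)  # same id, so same hash; this is allowed
assert len({job, job_b}) == 2    # sets still distinguish them through __eq__
```

Sets and dicts use the hash only to pick a bucket and then fall back to `__eq__`, so a shared hash between non-equal jobs costs an extra comparison but does not merge the two entries.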
I had a few questions about the removed tests since I think understanding how job equality is different now impacts how I feel about this change being too breaking or not. On one hand I think this optimization is great; on the other hand it feels like we are touching the most important internals for the core, so I would be interested in hearing the opinion of another maintainer (@vyasr @csadorf @b-butler)
Repo administrators have privileges to push to
@mikemhenry @bdice I'm not necessarily opposed to changing the hash function, and I agree that the current hash function is more than what the Python data model requires. One could argue that changing how
Checking equality of the id directly is at least as good as hashing it first, and then you don't have to modify the hash. We could still make the hash change if we find other cases where a faster hash function would be helpful, but at least wait until 2.0 to break that behavior. Based on your benchmark, making this change alone is sufficient to get large speedups. That said, if Moving to 2.0 we could consider having a project always normalize its root directory to an absolute path. If we also start enforcing that the workspace be a subdirectory of a job then I don't think the realpath call would even ever be necessary.
@vyasr @mikemhenry There are several places where jobs' hashes are used. This includes sets of jobs (I know there are examples of this) and dicts with jobs as keys (I'm not 100% sure if there are examples of this but it seems likely). Thus, I believe it is necessary to optimize
Unfortunately caching the
It is unfortunately not possible to eliminate the call to
That's fair, I agree that optimizing for usage of the hash is something that we should do. I'm simply suggesting that we can avoid the question of whether changing the hash is too breaking in the 1.x line by only optimizing I do think that in either case we should redefine
Fair point, I agree that might end up requiring more work. Something to consider for a future PR if realpath remains a necessity in future versions.
Yes, subdirectory of a project... the alternative of infinite recursion would be a small problem 😂 If we make both changes I listed (a project root directory is defined as an absolute path and the workspace is a subdirectory of a project) the only case I can see where symbolic links would cause problems would be if two different projects symlink to the same workspace directory in a different location. I guess technically those should compare as the same job... that seems like a very error-prone use case that I intentionally discounted from my previous evaluation, but on second consideration even if we do redefine the workspace as a subdirectory of a project there's no way for us to check for that case, so yes I agree that realpath is necessary and we're stuck with this definition as the best that can be done.
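The symlink case described above can be reproduced with a small sketch. It uses only the standard library and a temporary directory; the directory names are made up for illustration, and creating symlinks may need extra privileges on Windows.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # A real workspace directory and a symlink that points at it.
    target = os.path.join(tmp, "workspace")
    os.mkdir(target)
    link = os.path.join(tmp, "linked_workspace")
    os.symlink(target, link)

    # As plain strings the two paths differ ...
    same_as_strings = target == link
    # ... but after resolving symlinks they name the same directory.
    same_after_realpath = os.path.realpath(target) == os.path.realpath(link)

assert not same_as_strings
assert same_after_realpath
```

Normalizing to an absolute path alone would not catch this, which is why the comparison still needs `os.path.realpath` when the cheaper checks cannot rule equality out.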
I profiled @klywang @mikemhenry This PR is ready for review. Tagging @glotzerlab/signac-maintainers in case anyone feels strongly and wants to vote against the small breaking change to
I believe this change is non-breaking enough to happen without a major version change. I'm going to "vote" by approving this PR.
Thanks @mikemhenry! This can be merged with a second approval from @klywang or one of @glotzerlab/signac-maintainers.
Looks good!
As long as two jobs that are in different directories are not comparing equal, I think it is fine to relax the restraint on the
Description
In signac-flow's aggregation feature, it is fairly common to check membership: `job in [job1, job2, ...]`. This is currently very slow in signac.

By Python's membership test rules, `x in y` is equivalent to `any(x is e or x == e for e in y)`.

Currently, the `__eq__` check requires a call to `os.path.realpath`, which is fairly expensive. For a directory path like `/a/b/c/d/e/f`, `realpath` must check whether `/a`, `/a/b`, `/a/b/c`, etc. are symlinks, and if so, resolve them to their target location. That requires a lot of system calls just to check if jobs are equal. You can see that definition here.

I propose weakening the `__hash__` function (which should always be fast in the proposed optimization) and using the hash as a fast way to rule out equality. This optimization is valid because `a == b` implies `hash(a) == hash(b)` (see the Python data model section on hashing for details). Using the contrapositive, we can first compare hashes (the job's hash is simply `hash(job.id)`, a property that is known by the job) and then only check the `realpath` if the hashes match.

Motivation and Context
For a workspace of 30,000 jobs, this speeds up `job in list(project)` by a factor of ~70 (6.05 seconds without the optimization, 0.086 seconds with it). Note that `job in project` would use the project's `__contains__` method, so we need to check against a list of jobs.

Types of Changes
[1] The change breaks (or has the potential to break) existing functionality.
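The optimization proposed in the description can be sketched roughly as follows. `SketchJob`, its `realpath_calls` counter, and the ids and paths are hypothetical and exist only to illustrate the idea; this is not signac's actual `Job` class.

```python
import os


class SketchJob:
    """Hypothetical stand-in for a job; not the library's actual class."""

    realpath_calls = 0  # counts expensive path resolutions, for illustration

    def __init__(self, job_id, workspace):
        self.id = job_id
        self._workspace = workspace

    def __hash__(self):
        # Cheap hash: only the id, which the job already knows.
        return hash(self.id)

    def __eq__(self, other):
        if not isinstance(other, SketchJob):
            return NotImplemented
        # Contrapositive of the hash contract: different hashes mean not equal.
        if hash(self) != hash(other):
            return False
        # Hashes match, so only now pay for symlink resolution.
        SketchJob.realpath_calls += 2
        return self.id == other.id and (
            os.path.realpath(self._workspace) == os.path.realpath(other._workspace)
        )


jobs = [SketchJob(f"{i:032x}", f"/tmp/workspace/{i:032x}") for i in range(1000)]
probe = SketchJob(f"{999:032x}", f"/tmp/workspace/{999:032x}")

# Membership in a list behaves like any(probe is e or probe == e for e in jobs).
found = probe in jobs
print(found, SketchJob.realpath_calls)
```

In this sketch almost every comparison in the membership test is rejected by the hash check, so `os.path.realpath` runs only for the candidates whose hashes match the probe.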