Changed Project __contains__ method to run in constant time. #231
@vyasr I added you as a reviewer since the Pull Assigner wasn't working. I would like your input on whether this is a "safe change" with respect to signac's data model, before I go further with this PR.
Offline discussion with @vyasr: This is probably an OK change, but it makes our implementation more concrete and less abstract. I'm perfectly fine with that (in light of our potential rewrites for a more flexible storage engine in signac v2.0) because it's reasonable to expect that other potential backends should be able to implement their own
@csadorf If you'd like to weigh in, feel free. Otherwise we'll plan on merging this ~3 days after @vyasr reviews.
LGTM.
This must be merged after #232.
Codecov Report
@@            Coverage Diff             @@
##           master     #231     +/-   ##
==========================================
+ Coverage   64.77%   64.84%   +0.07%
==========================================
  Files          39       39
  Lines        5558     5558
==========================================
+ Hits         3600     3604      +4
+ Misses       1958     1954      -4
Description
I was running a cheap operation on a large data space (50k statepoints) with `python project.py run`. I noticed there was a very large initialization time before the operations began. I hit Ctrl+C a few times and identified that the `__contains__` check was taking `O(N_jobs)` time.
The specific code path being triggered was:
`FlowProject.run` calls the `select` filter over pending operations: https://github.com/glotzerlab/signac-flow/blob/c4d84fecaa18cd7556be85090e4332482bd79d38/flow/project.py#L1692
The `select` filter checks if `job not in self`: https://github.com/glotzerlab/signac-flow/blob/c4d84fecaa18cd7556be85090e4332482bd79d38/flow/project.py#L1650
The `__contains__` method calls `self._find_job_ids()` with no arguments: signac/signac/contrib/project.py, line 411 in 48e3158
The `_find_job_ids()` function immediately returns a list of `self._job_dirs()`: signac/signac/contrib/project.py, lines 524 to 526 in 48e3158
`_job_dirs()` iterates over all the subdirectories in `self._wd`, checking each of them against the regular expression `JOB_ID_REGEX` before yielding: signac/signac/contrib/project.py, lines 374 to 376 in 48e3158
Finally, `job.get_id()` is checked against the list of all job ids returned by `_job_dirs()`.
In my understanding, this should be equivalent to a constant-time check like the one I proposed in this PR. The regular expression validation is not necessary because we're getting the job id directly from the `Job` object itself, and we really just need to know if that id string is a subfolder of `self.wd`.
Motivation and Context
Before implementing this change, the `select` filter was running at ~300 iterations per second. Afterward, the `select` filter completed almost immediately (I couldn't time it effectively, but with 45,000 jobs the old rate works out to roughly 150 seconds, so it's about 100x faster for this size of data space).
Existing tests pass, and I welcome suggestions for additional tests if this might introduce any unexpected behavior.
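The difference between the two membership checks described above can be illustrated with a small standalone sketch. It is not the code from this PR or from signac: the function names `job_dirs`, `contains_by_scan`, `contains_by_lookup`, and `compare_checks` are hypothetical, and the regular expression assumes that job ids are 32-character lowercase hexadecimal strings. The scan version lists the whole workspace on every call, while the lookup version asks the filesystem about a single path, so its cost should not grow with the number of jobs (assuming the filesystem resolves a single path in roughly constant time).

```python
import hashlib
import os
import re
import tempfile

# Assumption: job ids are 32-character lowercase hex strings.
JOB_ID_REGEX = re.compile(r"^[a-f0-9]{32}$")


def job_dirs(workspace):
    """Yield every entry name in the workspace that looks like a job id."""
    for name in os.listdir(workspace):
        if JOB_ID_REGEX.match(name):
            yield name


def contains_by_scan(workspace, job_id):
    """Linear check: build the full list of job ids, then search it."""
    return job_id in list(job_dirs(workspace))


def contains_by_lookup(workspace, job_id):
    """Direct check: test whether the job's directory exists."""
    return os.path.isdir(os.path.join(workspace, job_id))


def compare_checks(n_jobs=100):
    """Run both checks on a temporary workspace for one present and one absent id."""
    with tempfile.TemporaryDirectory() as workspace:
        job_ids = [hashlib.md5(str(i).encode()).hexdigest() for i in range(n_jobs)]
        for job_id in job_ids:
            os.mkdir(os.path.join(workspace, job_id))
        present = job_ids[0]
        absent = hashlib.md5(b"not-a-job").hexdigest()
        return (
            contains_by_scan(workspace, present),
            contains_by_lookup(workspace, present),
            contains_by_scan(workspace, absent),
            contains_by_lookup(workspace, absent),
        )


if __name__ == "__main__":
    print(compare_checks())
```

In this sketch the two checks agree whenever the id passed in already has the job-id form and any workspace entry with that name is a directory. They can differ for names that do not match the pattern, or for a regular file named like a job id, which is the kind of edge case an additional test could cover.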
Types of Changes
[1] The change breaks (or has the potential to break) existing functionality.