Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changed Project __contains__ method to run in constant time. #231

Merged
merged 3 commits into from
Oct 10, 2019

Conversation

bdice
Copy link
Member

@bdice bdice commented Oct 5, 2019

Description

I was running a cheap operation on a large data space (50k statepoints) with python project.py run. I noticed there was a very large initialization time before the operations began. I hit Ctrl+C a few times and identified that the __contains__ check was taking time proportional to O(N_jobs).

The specific code path being triggered was:

  1. FlowProject.run calls the select filter over pending operations:
    https://github.com/glotzerlab/signac-flow/blob/c4d84fecaa18cd7556be85090e4332482bd79d38/flow/project.py#L1692

  2. The select filter checks if job not in self:
    https://github.com/glotzerlab/signac-flow/blob/c4d84fecaa18cd7556be85090e4332482bd79d38/flow/project.py#L1650

  3. The __contains__ method calls self._find_job_ids() with no arguments:

    return job.get_id() in self._find_job_ids()

  4. The _find_job_ids() function returns a list of self._job_dirs() immediately:

    def _find_job_ids(self, filter=None, doc_filter=None, index=None):
    if filter is None and doc_filter is None and index is None:
    return list(self._job_dirs())

  5. _job_dirs() iterates over all the subdirectories in self._wd, checking each of them against the regular expression JOB_ID_REGEX before yielding.

    for d in os.listdir(self._wd):
    if JOB_ID_REGEX.match(d):
    yield d

  6. Finally, job.get_id() is checked against the list of all job ids returned by _job_dirs().

In my understanding, this should be equivalent to a constant-time check like the one I proposed in this PR. The regular expression validation is not necessary because we're getting the job id directly from the Job object itself, and we really just need to know if that id string is a subfolder of self.wd.

Motivation and Context

Before implementing this change, the select filter was running at ~300 iterations per second. After, it ran the select filter almost immediately (couldn't time it effectively, but there were 45,000 jobs so it's about 100x faster for this size of data space).

Existing tests pass, and I welcome suggestions for additional tests if this might introduce any unexpected behavior.

Types of Changes

  • Documentation update
  • Bug fix
  • New feature
  • Performance enhancement 🐎 🚗
  • Breaking change1

1The change breaks (or has the potential to break) existing functionality.

Checklist:

If necessary:

  • I have updated the API documentation as part of the package doc-strings.
  • I have created a separate pull request to update the framework documentation on signac-docs and linked it here.
  • I have updated the changelog.

@bdice bdice added the enhancement New feature or request label Oct 5, 2019
@bdice bdice self-assigned this Oct 5, 2019
@bdice bdice requested a review from vyasr October 7, 2019 14:13
@bdice
Copy link
Member Author

bdice commented Oct 7, 2019

@vyasr I added you as a reviewer since the Pull Assigner wasn't working. I would like your input on whether this is a "safe change" with respect to signac's data model, before I go further with this PR.

@bdice
Copy link
Member Author

bdice commented Oct 8, 2019

Offline discussion with @vyasr: This is probably an ok change but it makes our implementation more concrete and less abstract. I'm perfectly fine with that (in light of our potential rewrites for a more flexible storage engine in signac v2.0) because it's reasonable to expect that other potential backends should be able to implement their own __contains__ check in O(1), without relying on generic O(N) methods.

@csadorf If you'd like to weigh in, feel free. Otherwise we'll plan on merging this ~3 days after @vyasr reviews.

@bdice bdice requested review from csadorf and a team as code owners October 8, 2019 14:34
@ghost ghost removed their request for review October 8, 2019 14:34
Copy link
Contributor

@csadorf csadorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@vyasr vyasr added the blocked Dependent on something else label Oct 9, 2019
@vyasr
Copy link
Contributor

vyasr commented Oct 9, 2019

This must be merged after #232.

@codecov
Copy link

codecov bot commented Oct 10, 2019

Codecov Report

Merging #231 into master will increase coverage by 0.07%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #231      +/-   ##
==========================================
+ Coverage   64.77%   64.84%   +0.07%     
==========================================
  Files          39       39              
  Lines        5558     5558              
==========================================
+ Hits         3600     3604       +4     
+ Misses       1958     1954       -4
Impacted Files Coverage Δ
signac/contrib/project.py 90.81% <100%> (ø) ⬆️
signac/contrib/import_export.py 84.73% <0%> (+0.26%) ⬆️
signac/contrib/collection.py 89.15% <0%> (+0.56%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b3d06ef...43fa451. Read the comment docs.

@bdice bdice merged commit d6ba3c9 into master Oct 10, 2019
@bdice bdice deleted the fix/constant-time-contains branch October 10, 2019 13:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocked Dependent on something else enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants