Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Lazy statepoint loading #238

Closed
bdice opened this issue Oct 31, 2019 · 5 comments · Fixed by #239
Closed

Proposal: Lazy statepoint loading #238

bdice opened this issue Oct 31, 2019 · 5 comments · Fixed by #239
Labels
enhancement New feature or request GSoC Google Summer of Code proposal
Milestone

Comments

@bdice
Copy link
Member

bdice commented Oct 31, 2019

Feature description

I am experimenting with refactoring statepoint loading to be lazy. Currently, project.open_job() is called on every job during a flow run command, which causes the statepoints to be loaded. However, the statepoint information isn't even used in many cases. This may be a case where signac is executing unnecessary I/O (which is costly for large projects like the ones I'm working with).

Proposed solution

The property job.statepoint currently just returns job._statepoint, meaning that the internal state is effectively identical to the publicly-accessible state. I am investigating making the property job.statepoint load lazily, with job._statepoint = None until the statepoint is requested.

I see a HUGE performance boost (from 30 seconds to <0.5 seconds for a simple operation on 3000 jobs) since the I/O is dropped dramatically, but some tests fail because job statepoint corruption (which raises a JobsCorruptedError) is not detected at the time of project.open_job(). I think that lazy loading is generally safe except in cases where the statepoint hash and job id don't match (corrupted jobs).

I just wanted to open this for discussion - I am not sure where we stand on approaches to handling corrupted data. Specifically, I want to know whether we think that validating hash(statepoint) == id (and thus the I/O cost of a statepoint load) is always necessary when opening a job by id.

Additional context

Maybe investigate async as an alternative...? Still expensive but maybe less so.

@bdice
Copy link
Member Author

bdice commented Jan 26, 2020

We discussed this offline (I can't remember who was a part of the discussion). Our conclusion was that loading a job by id should not require validating that hash(statepoint) == id. A user (or I/O error) may corrupt the data space by breaking that invariant, and the user will still be allowed to open the job by id. In the proposed lazy-load of statepoints in #239, job corruption will be checked on accessing job.statepoint. (Note that this may not reflect the current design in that PR.) @glotzerlab/signac-committers Was this a conclusion you recall and/or agree with?

@csadorf
Copy link
Contributor

csadorf commented Jan 27, 2020

I am ok with lazy loading and only validating the state point metadata when it is accessed.

@b-butler
Copy link
Member

@bdice I don't remember being a part of the conversation, but I agree with the approach.

@vyasr
Copy link
Contributor

vyasr commented Jan 27, 2020

@bdice and I did discuss and agree on this approach. I need to post my pseudo-prototype for #249 along with a class diagram so that we can progress on that front. IMO that serves as the cleanest path forward for implementing lazy loading by isolating the logic for synchronization and ensuring that a clear set of invariants are well-defined and well-tested for the different possible use-cases of nesting and buffering.

@bdice
Copy link
Member Author

bdice commented Jan 27, 2020

@vyasr I think (but am not sure) that my intended implementation of lazy loading can be clearly separated from your work on #249. I may need to work on #239 a little more to be sure, but I believe my implementation will rely more heavily on the Project class's caching and opening of jobs than the specifics of how job state points / job documents are synchronized. You should feel free to go ahead and post your "pseudo-prototype" (😉) in the meantime.

@bdice bdice linked a pull request Feb 27, 2020 that will close this issue
12 tasks
@bdice bdice added proposal GSoC Google Summer of Code labels Feb 27, 2020
@bdice bdice changed the title Lazy statepoint loading / corruption checks Proposal: Lazy statepoint loading Feb 27, 2020
@bdice bdice added this to the v2.0.0 milestone Mar 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request GSoC Google Summer of Code proposal
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants