Request subset of columns in JobsCursor.to_dataframe
#327
Comments
I could see adding |
I think that's probably the right call for state point keys. Adding However, that is referring to state points. I think the situation is potentially quite different for documents; people store plenty of somewhat heavyweight data in there. For that case the black/whitelist seems much more useful. Thoughts? |
@vyasr I agree with @bdice that we shouldn't over-engineer this. Determining which state point keys are constant over the data space is non-trivial, so providing that option makes sense, but everything else I feel is out of scope. Pandas in general makes it very easy to select and unselect specific columns. However, it's possible I am missing something, in which case I'd encourage you to provide a specific example that demonstrates the feature gap.
@csadorf @bdice we're generally on the same page. I think enabling The usage I currently have is that I have a project where I'm storing a number of quantities in my job document that parameterize my runs but are not good fits for a state point (things like the number of steps to run a simulation or a compression protocol, which I may want to change). Over the course of the simulation I store some scalar properties and things like progress flags in the document. Furthermore, in the process of performing certain analyses I store certain computed quantities in the job document. Afterwards, when I want to run a specific analysis based on some of the previous calculations, I currently do |
@vyasr I usually have some |
OK, I wrote this snippet under the assumption that all statepoints vary and it's very simple. The behavior I'm looking for is basically just this:
So if |
@vyasr You mentioned cases where the document is large -- I think that's best to handle now, not later. We would just need to apply this filter before the pandas DataFrame is constructed. If the user doesn't request a key, we shouldn't load it into memory for every job. It'd be simple to modify this function to have blacklists/whitelists for sp/doc (signac/signac/contrib/project.py, lines 1904 to 1908 in b21bdeb).
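A minimal sketch of the "filter before the DataFrame is constructed" idea, not the actual signac code. The function `export_row`, its `sp_keys`/`doc_keys` parameters, and the `sp.`/`doc.` column prefixes are assumptions made for illustration, and plain dicts stand in for `job.sp` and `job.doc`.

```python
def export_row(statepoint, document, sp_keys=None, doc_keys=None):
    """Build one flat row, keeping only the requested keys.

    Hypothetical helper: `None` means "keep every key", while a set
    acts as a whitelist. Unrequested values are never copied into the row.
    """
    row = {}
    for key, value in statepoint.items():
        if sp_keys is None or key in sp_keys:
            row["sp." + key] = value
    for key, value in document.items():
        if doc_keys is None or key in doc_keys:
            row["doc." + key] = value
    return row


# Plain dicts stand in for (job.sp, job.doc); the trajectory is "heavy" data.
jobs = {
    "job_a": ({"T": 1.0, "N": 100}, {"steps": 5000, "trajectory": list(range(1000))}),
    "job_b": ({"T": 2.0, "N": 100}, {"steps": 8000, "trajectory": list(range(1000))}),
}

# Keep all state point keys, but only the "steps" document key.
rows = {
    job_id: export_row(sp, doc, doc_keys={"steps"})
    for job_id, (sp, doc) in jobs.items()
}
```

The resulting `rows` mapping could then be handed to `pandas.DataFrame.from_dict(rows, orient="index")`, so the heavyweight `trajectory` entries never reach the DataFrame.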
@vyasr Another design consideration: the |
Feature description
My statepoints are often overspecified, in the sense that I include more information than is required to make a job unique in a data space, because those are still variables that are critical to defining the data point and I anticipate the possibility of varying them. It would be very convenient if there was a simple API to specify which columns should be included in the DataFrame (which would essentially be equivalent to running `df.drop(columns=EXCLUDED_COLUMNS, inplace=True)` after the fact), and more importantly to replicate the behavior in `project.detect_schema`, where you can `exclude_const`.
Proposed solution
Add `exclude_const` and/or `exclude_keys` as extra arguments to `JobsCursor.to_dataframe` to specify statepoint or document parameters that should not be included in the output dataframe.
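One way the proposed arguments might behave is sketched below. It is only an illustration under stated assumptions: `constant_keys` and `filter_rows` are hypothetical helpers, not signac functions, and the rows are plain dicts keyed by job id, with `sp.`/`doc.` prefixes assumed for the column names.

```python
def constant_keys(rows):
    """Return keys present in every row with the same value (hypothetical helper)."""
    rows = list(rows)
    if len(rows) < 2:
        # With fewer than two rows nothing can be said to "vary",
        # so this sketch treats no key as constant.
        return set()
    first = rows[0]
    return {
        key
        for key, value in first.items()
        if all(key in row and row[key] == value for row in rows[1:])
    }


def filter_rows(rows, exclude_const=False, exclude_keys=()):
    """Drop explicitly excluded columns and, optionally, constant ones."""
    dropped = set(exclude_keys)
    if exclude_const:
        dropped |= constant_keys(rows.values())
    return {
        job_id: {key: value for key, value in row.items() if key not in dropped}
        for job_id, row in rows.items()
    }


rows = {
    "job_a": {"sp.T": 1.0, "sp.N": 100, "doc.progress": 0.5},
    "job_b": {"sp.T": 2.0, "sp.N": 100, "doc.progress": 1.0},
}

# "sp.N" is constant across jobs; "doc.progress" is excluded by name.
filtered = filter_rows(rows, exclude_const=True, exclude_keys=["doc.progress"])
```

Here only the varying `sp.T` column survives. Whether constant detection should apply to document keys as well as state point keys is a design question this sketch does not settle.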