Request subset of columns in JobsCursor.to_dataframe #327

Closed
vyasr opened this issue Apr 20, 2020 · 8 comments · Fixed by #330
Labels
enhancement New feature or request

Comments

@vyasr (Contributor) commented Apr 20, 2020

Feature description

My state points are often overspecified, in the sense that I include more information than is strictly required to make a job unique in the data space, because those extra keys are still critical to defining the data point and I anticipate varying them later. It would be very convenient if there were a simple API to specify which columns should be included in the DataFrame (essentially equivalent to running df.drop(EXCLUDED_COLUMNS, inplace=True) after the fact), and, more importantly, to replicate the behavior of project.detect_schema, where you can pass exclude_const.

Proposed solution

Add exclude_const and/or exclude_keys extra arguments to JobsCursor.to_dataframe to specify statepoint or document parameters that should not be included in the output dataframe.
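The effect of the proposed exclude_const can be sketched in plain pandas: drop every column that takes only a single value across all jobs. The frame below is a toy stand-in for project.to_dataframe() output, not real signac data:

```python
import pandas as pd

# Toy DataFrame standing in for project.to_dataframe() output.
df = pd.DataFrame({
    "sp.T": [1.0, 2.0, 3.0],        # varies across jobs
    "sp.N": [100, 100, 100],        # constant across jobs
    "doc.energy": [-1.2, -2.3, -3.1],
})

# Equivalent of the proposed exclude_const: keep only columns
# that take more than one distinct value.
varying = df.loc[:, df.nunique(dropna=False) > 1]
print(list(varying.columns))  # ['sp.T', 'doc.energy']
```

This is post-hoc filtering; the proposal is to do the same selection inside to_dataframe itself.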

@bdice (Member) commented Apr 22, 2020

I could see adding exclude_const, but not a whitelist/blacklist of keys. I think it's appropriate for users to perform more extensive manipulations using pandas itself. I don't think this is signac's responsibility, unless the alternative imposes a significant performance penalty.

@vyasr (Contributor, Author) commented Apr 22, 2020

I think that's probably the right call for state point keys. Adding exclude_const is IMO very useful, but beyond that I think modification can be left to pandas.

However, that is referring to state points. I think the situation is potentially quite different for documents; people store plenty of somewhat heavyweight data in there. For that case the black/whitelist seems much more useful. Thoughts?

@csadorf (Contributor) commented Apr 23, 2020

@vyasr I agree with @bdice that we shouldn't over-engineer this. Determining which state point keys are constant over the data space is a non-trivial task, so providing that option makes sense, but everything else feels out of scope to me. Pandas in general makes it very easy to select and deselect specific columns. However, it's possible I am missing something, in which case I'd encourage you to provide a specific example that demonstrates the feature gap.

@vyasr (Contributor, Author) commented Apr 26, 2020

@csadorf @bdice we're generally on the same page. I think enabling exclude_const is absolutely worthwhile and in scope, while additional filtering is easier to do with pandas and so may be out of scope. I'll explain my argument regarding a whitelist, but I'm fine if we decide not to proceed on that front.

The use case I currently have: in one of my projects I store a number of quantities in my job document that parameterize my runs but are not good fits for a state point (things like the number of steps to run a simulation, or a compression protocol, which I may want to change). Over the course of the simulation I store some scalar properties and things like progress flags in the document, and in the process of performing certain analyses I store computed quantities there as well.

Afterwards, when I want to run a specific analysis based on some of the previous calculations, I currently call project.to_dataframe() and then pass a long list of columns to df.drop to get just the quantities I want. I built this list from my best recollection, but I had to edit it a couple of times after viewing the resulting DataFrame because I missed a few things. Dropping these columns is not optional: I'm making use of things like groupby and manipulating the resulting pandas MultiIndex, so unless I get rid of the columns at some point I will at best have to do a reset_index and manually select the columns I want to plot (which makes the plotting code much more verbose, since I can't take advantage of pandas knowing how to intelligently plot e.g. a Series with a MultiIndex).

What I would like is simply to skip the df.drop step. Since I only need two of the quantities in the job document for this calculation, it would be much easier to do something like project.to_dataframe(exclude_const=True, doc_keys=['foo', 'bar']) than what I'm currently doing. I could do something more clever than a drop list using df.columns.intersection or df.columns.difference, but then I would still need to construct the list of all state point keys, since we can't drop those.

@csadorf (Contributor) commented Apr 27, 2020

@vyasr I usually have a util.py or common.py module within my project space in which I define little utility functions that take care of these kinds of things for me. So in this case I would have some kind of to_df() function that wraps the underlying signac function. Would that be a practical solution for your problem?
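A minimal sketch of such a project-local wrapper. To keep the example self-contained it post-processes an already-built DataFrame rather than calling signac; the name to_df and the doc_keys parameter are illustrative, not signac API:

```python
import pandas as pd

def to_df(df: pd.DataFrame, doc_keys=None) -> pd.DataFrame:
    """Keep all 'sp.' columns and only the requested 'doc.' columns.

    A project-local utility of the kind suggested above; in practice
    it would wrap project.to_dataframe() instead of taking df directly.
    """
    if doc_keys is None:
        return df
    wanted = {"doc." + k for k in doc_keys}
    keep = [c for c in df.columns if c.startswith("sp.") or c in wanted]
    return df[keep]

# Toy frame standing in for project.to_dataframe() output:
df = pd.DataFrame({
    "sp.T": [1.0, 2.0],
    "doc.energy": [-1.2, -2.3],
    "doc.progress": [0.5, 1.0],
})
print(list(to_df(df, doc_keys=["energy"]).columns))  # ['sp.T', 'doc.energy']
```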

@vyasr (Contributor, Author) commented Apr 28, 2020

OK, I wrote this snippet under the assumption that all state points vary, and it's very simple. The behavior I'm looking for is basically just this:

```python
desired_doc_keys = [DESIRED_KEYS]
doc_columns = df.columns.intersection('doc.' + x for x in desired_doc_keys)
df = df[[c for c in df.columns if c.startswith('sp.') or c in doc_columns]]
```

So if exclude_const is implemented, then I think we can skip the extra whitelisting unless we run into performance issues where the document is just too large.

@bdice (Member) commented Apr 28, 2020

@vyasr You mentioned cases where the document is large -- I think that's best to handle now, not later. We would just need to apply this filter before the pandas DataFrame is constructed: if the user doesn't request a key, we shouldn't load it into memory across all the jobs. It'd be simple to modify this function to support blacklists/whitelists for sp/doc:

```python
def _export_sp_and_doc(job):
    for key, value in job.sp.items():
        yield sp_prefix + key, value
    for key, value in job.doc.items():
        yield doc_prefix + key, value
```
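One way the filter could be threaded through that generator: skip unrequested document keys before anything is materialized. This is a sketch, not signac's implementation; the doc_keys parameter is hypothetical, and sp/doc are plain dicts here (in signac they would be job.sp and job.doc) so the example runs standalone:

```python
SP_PREFIX = "sp."
DOC_PREFIX = "doc."

def _export_sp_and_doc(sp, doc, doc_keys=None):
    """Yield prefixed (column, value) pairs, dropping unrequested
    document keys before the DataFrame is ever built."""
    for key, value in sp.items():
        yield SP_PREFIX + key, value
    for key, value in doc.items():
        if doc_keys is None or key in doc_keys:
            yield DOC_PREFIX + key, value

cols = dict(_export_sp_and_doc(
    {"T": 1.0},
    {"energy": -1.2, "big_array": [0] * 10},
    doc_keys={"energy"},
))
print(sorted(cols))  # ['doc.energy', 'sp.T']
```

Because filtering happens inside the generator, the skipped values never occupy a column across all jobs, which addresses the memory concern directly.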

@bdice (Member) commented Apr 28, 2020

@vyasr Another design consideration: the to_dataframe method uses only top-level keys. That is, it wouldn't flatten a nested dictionary into a.b.c; it would just produce a column a with the nested dictionary as its value. This makes exclude_const somewhat tricky to define, since that option is typically applied to schemas (which are flat). I am not sure how you'd prefer to resolve that ambiguity, so I left it out of the current PR.
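To illustrate the ambiguity with a toy example (the data below are illustrative, not signac output): with top-level keys only, a nested state point becomes one column holding dicts, and "constant" is then judged on the whole dict rather than on each leaf:

```python
import pandas as pd

# Leaf 'b' is constant across jobs, leaf 'c' varies.
sps = [{"a": {"b": 1, "c": i}} for i in range(3)]
df = pd.DataFrame({"sp.a": [sp["a"] for sp in sps]})

# With top-level keys, column 'sp.a' is NOT constant (the dicts differ),
# even though leaf a.b never changes. A flat schema would instead yield
# a constant 'a.b' column and a varying 'a.c' column.
print(df["sp.a"].astype(str).nunique())  # 3
```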

@mikemhenry mikemhenry added the enhancement New feature or request label May 5, 2020