Request subset of columns in JobsCursor.to_dataframe #327

Closed
vyasr opened this issue Apr 20, 2020 · 8 comments · Fixed by #330
Labels
enhancement New feature or request

Comments

@vyasr (Contributor) commented Apr 20, 2020

Feature description

My state points are often overspecified, in the sense that I include more information than is strictly required to make a job unique in the data space, because those extra keys are still critical to defining the data point and I anticipate varying them later. It would be very convenient if there were a simple API to specify which columns should be included in the DataFrame (essentially equivalent to running df.drop(EXCLUDED_COLUMNS, inplace=True) after the fact), and, more importantly, to replicate the behavior of project.detect_schema, where you can pass exclude_const.

Proposed solution

Add exclude_const and/or exclude_keys extra arguments to JobsCursor.to_dataframe to specify statepoint or document parameters that should not be included in the output dataframe.
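The effect of the proposed exclude_const can be sketched in plain pandas: drop every column that takes only a single value across all jobs. The frame below is a toy stand-in for project.to_dataframe() output, not real signac data:

```python
import pandas as pd

# Toy DataFrame standing in for project.to_dataframe() output.
df = pd.DataFrame({
    "sp.T": [1.0, 2.0, 3.0],        # varies across jobs
    "sp.N": [100, 100, 100],        # constant across jobs
    "doc.energy": [-1.2, -2.3, -3.1],
})

# Equivalent of the proposed exclude_const: keep only columns
# that take more than one distinct value.
varying = df.loc[:, df.nunique(dropna=False) > 1]
print(list(varying.columns))  # ['sp.T', 'doc.energy']
```

This is post-hoc filtering; the proposal is to do the same selection inside to_dataframe itself.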

@bdice (Member) commented Apr 22, 2020

I could see adding exclude_const, but not a whitelist/blacklist of keys. I think it's appropriate for users to perform more extensive manipulations using pandas itself. I don't think this is signac's responsibility, unless the alternative imposes a significant performance penalty.

@vyasr (Contributor, Author) commented Apr 22, 2020

I think that's probably the right call for state point keys. Adding exclude_const is IMO very useful, but beyond that I think modification can be left to pandas.

However, that is referring to state points. I think the situation is potentially quite different for documents; people store plenty of somewhat heavyweight data in there. For that case the black/whitelist seems much more useful. Thoughts?

@csadorf (Contributor) commented Apr 23, 2020

@vyasr I agree with @bdice that we shouldn't over-engineer this. Determining which state point keys are constant over the data space is a non-trivial task, so providing that option makes sense, but everything else feels out of scope to me. Pandas in general makes it very easy to select and deselect specific columns. However, it's possible I am missing something, in which case I'd encourage you to provide a specific example that demonstrates the feature gap.

@vyasr (Contributor, Author) commented Apr 26, 2020

@csadorf @bdice we're generally on the same page. I think enabling exclude_const is absolutely worthwhile and in scope, while additional filtering is easier to do with pandas and so may be out of scope. I'll explain my argument regarding a whitelist, but I'm fine if we decide not to proceed on that front.

The use case I currently have: in one of my projects I store a number of quantities in my job document that parameterize my runs but are not good fits for a state point (things like the number of steps to run a simulation, or a compression protocol, which I may want to change). Over the course of the simulation I store some scalar properties and things like progress flags in the document, and in the process of performing certain analyses I store computed quantities there as well.

Afterwards, when I want to run a specific analysis based on some of the previous calculations, I currently call project.to_dataframe() and then pass a long list of columns to df.drop to get just the quantities I want. I built this list from my best recollection, but I had to edit it a couple of times after viewing the resulting DataFrame because I missed a few things. Dropping these columns is not optional: I'm making use of things like groupby and manipulating the resulting pandas MultiIndex, so unless I get rid of the columns at some point I will at best have to do a reset_index and manually select the columns I want to plot (which makes the plotting code much more verbose, since I can't take advantage of pandas knowing how to intelligently plot e.g. a Series with a MultiIndex).

What I would like is simply to skip the df.drop step. Since I only need two of the quantities in the job document for this calculation, it would be much easier to do something like project.to_dataframe(exclude_const=True, doc_keys=['foo', 'bar']) than what I'm currently doing. I could do something more clever than a drop list using df.columns.intersection or df.columns.difference, but then I would still need to construct the list of all state point keys, since we can't drop those.

@csadorf (Contributor) commented Apr 27, 2020

@vyasr I usually have a util.py or common.py module within my project space in which I define little utility functions that take care of these kinds of things for me. So in this case I would have some kind of to_df() function that wraps the underlying signac function. Would that be a practical solution for your problem?
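A minimal sketch of such a project-local wrapper. To keep the example self-contained it post-processes an already-built DataFrame rather than calling signac; the name to_df and the doc_keys parameter are illustrative, not signac API:

```python
import pandas as pd

def to_df(df: pd.DataFrame, doc_keys=None) -> pd.DataFrame:
    """Keep all 'sp.' columns and only the requested 'doc.' columns.

    A project-local utility of the kind suggested above; in practice
    it would wrap project.to_dataframe() instead of taking df directly.
    """
    if doc_keys is None:
        return df
    wanted = {"doc." + k for k in doc_keys}
    keep = [c for c in df.columns if c.startswith("sp.") or c in wanted]
    return df[keep]

# Toy frame standing in for project.to_dataframe() output:
df = pd.DataFrame({
    "sp.T": [1.0, 2.0],
    "doc.energy": [-1.2, -2.3],
    "doc.progress": [0.5, 1.0],
})
print(list(to_df(df, doc_keys=["energy"]).columns))  # ['sp.T', 'doc.energy']
```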

@vyasr (Contributor, Author) commented Apr 28, 2020

OK, I wrote this snippet under the assumption that all state points vary, and it's very simple. The behavior I'm looking for is basically just this:

```python
desired_doc_keys = [DESIRED_KEYS]
doc_columns = df.columns.intersection('doc.' + x for x in desired_doc_keys)
df = df[[c for c in df.columns if c.startswith('sp.') or c in doc_columns]]
```

So if exclude_const is implemented, then I think we can skip the extra whitelisting unless we run into performance issues where the document is just too large.

@bdice (Member) commented Apr 28, 2020

@vyasr You mentioned cases where the document is large -- I think that's best to handle now, not later. We would just need to apply this filter before the pandas DataFrame is constructed: if the user doesn't request a key, we shouldn't load it into memory across all the jobs. It'd be simple to modify this function to support blacklists/whitelists for sp/doc:

```python
def _export_sp_and_doc(job):
    for key, value in job.sp.items():
        yield sp_prefix + key, value
    for key, value in job.doc.items():
        yield doc_prefix + key, value
```
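One way the filter could be threaded through that generator: skip unrequested document keys before anything is materialized. This is a sketch, not signac's implementation; the doc_keys parameter is hypothetical, and sp/doc are plain dicts here (in signac they would be job.sp and job.doc) so the example runs standalone:

```python
SP_PREFIX = "sp."
DOC_PREFIX = "doc."

def _export_sp_and_doc(sp, doc, doc_keys=None):
    """Yield prefixed (column, value) pairs, dropping unrequested
    document keys before the DataFrame is ever built."""
    for key, value in sp.items():
        yield SP_PREFIX + key, value
    for key, value in doc.items():
        if doc_keys is None or key in doc_keys:
            yield DOC_PREFIX + key, value

cols = dict(_export_sp_and_doc(
    {"T": 1.0},
    {"energy": -1.2, "big_array": [0] * 10},
    doc_keys={"energy"},
))
print(sorted(cols))  # ['doc.energy', 'sp.T']
```

Because filtering happens inside the generator, the skipped values never occupy a column across all jobs, which addresses the memory concern directly.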

@bdice (Member) commented Apr 28, 2020

@vyasr Another design consideration: the to_dataframe method uses only top-level keys. That is, it wouldn't flatten a nested dictionary into a.b.c; it would just produce a column a with the nested dictionary as its value. This makes exclude_const somewhat tricky to define, since that option is typically applied to schemas (which are flat). I am not sure how you'd prefer to resolve that ambiguity, so I left it out of the current PR.
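To illustrate the ambiguity with a toy example (the data below are illustrative, not signac output): with top-level keys only, a nested state point becomes one column holding dicts, and "constant" is then judged on the whole dict rather than on each leaf:

```python
import pandas as pd

# Leaf 'b' is constant across jobs, leaf 'c' varies.
sps = [{"a": {"b": 1, "c": i}} for i in range(3)]
df = pd.DataFrame({"sp.a": [sp["a"] for sp in sps]})

# With top-level keys, column 'sp.a' is NOT constant (the dicts differ),
# even though leaf a.b never changes. A flat schema would instead yield
# a constant 'a.b' column and a varying 'a.c' column.
print(df["sp.a"].astype(str).nunique())  # 3
```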

@mikemhenry mikemhenry added the enhancement New feature or request label May 5, 2020