Representing schema in xarray #169

Closed
ColCarroll opened this issue Aug 12, 2018 · 7 comments

@ColCarroll
Member

ColCarroll commented Aug 12, 2018

To reach feature parity with pymc3 plotting, we need a way to access sampler statistics (specifically divergences) and posterior predictive (ppc) samples. @aseyboldt outlined a good way to think about the rest of the schema here.

xarray can read netCDF groups, but it doesn't look like it handles them natively (yet? See the discussion at pydata/xarray#1092).

I am proposing something like:

import xarray as xr
import netCDF4 as nc

class Trace(object):
    def __init__(self, filename):
        self.filename = filename
        self.data = nc.Dataset(filename)
        self.groups = self.data.groups
        
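    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails, so group names
        # never shadow real attributes such as `filename`.
        if name in self.groups:
            return xr.open_dataset(self.filename, group=name)
        raise AttributeError(
            "{!r} is neither an attribute nor a netCDF group of {}".format(name, self.filename)
        )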
    
    def __dir__(self):
        """Allows for tab completion on netCDF group names"""
        return super(Trace, self).__dir__() + list(self.groups.keys())

This is a pretty light wrapper around netCDF and xarray. Usage is something like:

t = Trace('mytrace.nc')
t.posterior  # this is an xarray.Dataset
t.posterior.mu.mean()  # calculate the mean of a variable

I think this will have to change a little so that nested groups work. In particular, something like:

t = Trace('mytrace.nc')
t.sampler_stats.divergences  # should return an xarray.Dataset
t.sampler_stats  # I think this would tend to return an empty xarray.Dataset
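
One way to make nested groups reachable is to walk the group tree and build slash-separated paths, which can then be passed as the `group` argument of `xr.open_dataset`. This is only a minimal sketch, assuming group objects expose a `.groups` mapping the way netCDF4 groups do. `iter_group_paths` is a hypothetical helper, and the `SimpleNamespace` tree below is a stand-in for a real file:

```python
from types import SimpleNamespace


def iter_group_paths(groups, prefix=""):
    """Yield a slash-separated path for every group, depth first.

    `groups` is a mapping of name -> object with a `.groups` mapping
    (assumed to mirror netCDF4.Dataset(...).groups).
    """
    for name, group in groups.items():
        path = prefix + name
        yield path
        # Recurse into child groups, if the object has any.
        yield from iter_group_paths(getattr(group, "groups", {}), path + "/")


# Stand-in for netCDF4.Dataset("mytrace.nc").groups (illustrative only).
fake_groups = {
    "posterior": SimpleNamespace(groups={}),
    "sampler_stats": SimpleNamespace(
        groups={"divergences": SimpleNamespace(groups={})}
    ),
}

paths = list(iter_group_paths(fake_groups))
print(paths)
```

With a real file, each yielded path (for example `sampler_stats/divergences`) could be handed to `xr.open_dataset(filename, group=path)`.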
@ColCarroll
Member Author

Also pinging @shoyer in case he has advice for doing this.

@shoyer

shoyer commented Aug 13, 2018

This seems more or less reasonable to me, but note that opening a netCDF file isn't always cheap (this is also true to a lesser extent with creating an xarray.Dataset). I expect you will be happier with caching or eagerly creating Dataset objects rather than recreating them in __getattr__.
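
The caching idea could be sketched roughly as follows. This is not the arviz implementation, only an illustration under some assumptions: `CachedTrace`, the `group_names` argument, and the injectable `opener` are hypothetical names, and the default opener simply calls `xr.open_dataset` as in the `Trace` proposal above.

```python
def _open_with_xarray(filename, group):
    # Imported lazily so the class can be exercised without xarray installed.
    import xarray as xr

    return xr.open_dataset(filename, group=group)


class CachedTrace:
    """Like the proposed Trace, but each group is opened at most once."""

    def __init__(self, filename, group_names, opener=None):
        self.filename = filename
        self._group_names = tuple(group_names)
        self._opener = opener if opener is not None else _open_with_xarray
        self._cache = {}

    def __getattr__(self, name):
        # Read from __dict__ directly so a half-initialised object
        # cannot recurse back into __getattr__.
        group_names = self.__dict__.get("_group_names", ())
        if name not in group_names:
            raise AttributeError(
                "{!r} is neither an attribute nor a known group".format(name)
            )
        cache = self.__dict__["_cache"]
        if name not in cache:
            cache[name] = self._opener(self.filename, name)
        return cache[name]

    def __dir__(self):
        """Allows for tab completion on group names."""
        return sorted(set(super().__dir__()) | set(self._group_names))


# Usage with a fake opener that records how often it is called.
calls = []


def fake_opener(filename, group):
    calls.append((filename, group))
    return {"group": group}


t = CachedTrace("mytrace.nc", ["posterior", "sampler_stats"], opener=fake_opener)
first = t.posterior
second = t.posterior  # served from the cache, no second open
```

Eagerly opening every group in `__init__` would be the other option mentioned; the lazy cache only pays the cost for groups that are actually used.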

@canyon289
Member

@ColCarroll Should a representation of the prior be considered as well? I'm thinking of this with regard to this issue:

pymc-devs/pymc#3104

@canyon289
Member

Never mind, I see that it's already mentioned in the schema document.

@ColCarroll
Member Author

My thought is that the wrapper above would let arviz decide which plots to "allow", and when to throw an exception instead. Once we start adding other data (priors, sampler statistics, observed data, posterior_predictive), I think there are exponentially more interesting visualizations to be done.
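
As a rough illustration of that idea, a plot could declare which groups it needs and be checked against the groups a trace actually contains. Everything here is an assumption for the sake of the sketch: the plot names, the `REQUIRED_GROUPS` table, and both helper functions are hypothetical, and the group names follow the data types listed above.

```python
# Hypothetical plot names mapped to the groups each one would need.
REQUIRED_GROUPS = {
    "traceplot": {"posterior"},
    "divergence_plot": {"posterior", "sampler_stats"},
    "ppc_plot": {"posterior_predictive", "observed_data"},
}


def allowed_plots(available_groups):
    """Return the plot names whose required groups are all present."""
    available = set(available_groups)
    return sorted(
        name for name, needed in REQUIRED_GROUPS.items() if needed <= available
    )


def check_plot(plot_name, available_groups):
    """Raise ValueError naming the missing groups, if there are any."""
    missing = REQUIRED_GROUPS[plot_name] - set(available_groups)
    if missing:
        raise ValueError(
            "{} needs missing group(s): {}".format(
                plot_name, ", ".join(sorted(missing))
            )
        )


print(allowed_plots(["posterior", "sampler_stats"]))
```

The list of available groups would come from the wrapper (for example the keys of `Trace.groups`).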

@canyon289
Member

Another big benefit is that it's a one-stop shop for all the interesting data in a model, which makes it easier for people new to the field to figure out what to learn.

@ColCarroll
Member Author

Closed by #173 and #176
