
Question: Massage InferenceData for multi-variable model comparison? #998

Closed
rpgoldman opened this issue Jan 14, 2020 · 8 comments

Comments

@rpgoldman
Contributor

Short Description

I have a PyMC3 model that is partitioned into 6 sub-models. The observations are also partitioned into 6 subsets. This lets me apply different parameters based on the values of independent variables without complex indexing. I can't use a mixture because there is no distribution over the independent variables.

I know model comparison only works for models with a single observed RV. But I was wondering: is there some way to "unpartition" the InferenceData so that ArviZ can treat the six vectors of observations as one big vector and compute model comparison metrics?

I can compare the submodels individually, but this does not properly take into account the hyperparameters that link them.
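
For concreteness, here is a minimal sketch of the kind of partitioned model I mean (two partitions instead of six; all names and numbers are made up):

import numpy as np
import pymc3 as pm

# hypothetical data, partitioned by the value of an independent variable
y_a = np.random.normal(0.0, 1.0, size=50)
y_b = np.random.normal(2.0, 1.0, size=80)

with pm.Model() as model:
    mu_hyper = pm.Normal("mu_hyper", 0.0, 10.0)  # hyperparameter linking the sub-models
    mu_a = pm.Normal("mu_a", mu_hyper, 1.0)      # per-partition parameters
    mu_b = pm.Normal("mu_b", mu_hyper, 1.0)
    pm.Normal("obs_a", mu_a, 1.0, observed=y_a)  # one observed RV per partition
    pm.Normal("obs_b", mu_b, 1.0, observed=y_b)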

@rpgoldman changed the title from "Massage InferenceData for multi-variable model comparison" to "Question: Massage InferenceData for multi-variable model comparison?" on Jan 14, 2020
@OriolAbril
Member

I think it is possible, but for now it will be more convoluted than it needs to be.

The first issue is getting the log_likelihood data from the PyMC3 model. For this you have to use the code in #794 (I can probably rebase and upload it in a few hours) or write/copy some function to get the log likelihood.

The second issue is that ArviZ does not accept multiple log likelihood instances, so the fastest way to get it done is probably to group all the log likelihoods into a single "log_likelihood" variable, something similar to the following pseudocode (assuming the code from #794):

# pseudocode: concatenate the per-variable pointwise log likelihoods along a shared observation dimension
idata.sample_stats["log_likelihood"] = xr.concat(
    [idata.log_likelihood[var_name] for var_name in var_names], dim="obs_id"
)
az.waic(idata)  # or az.loo(idata)

Note: I would start by trying xr.concat, but it may turn out to be some other function.
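
In case a bare xr.concat does not line the pieces up (each observed variable usually has its own observation dimension), here is a possible sketch; the dimension name and the per-variable log_likelihood group from #794 are assumptions:

import xarray as xr
import arviz as az

# Assumes idata has a log_likelihood group with one array per observed RV, and that
# each array has exactly one observation dimension besides chain and draw.
parts = []
for var_name in var_names:
    da = idata.log_likelihood[var_name]
    obs_dim = [d for d in da.dims if d not in ("chain", "draw")][0]
    # Rename the per-variable observation dimension to a shared name and drop its
    # index so the pieces can be concatenated without coordinate conflicts.
    parts.append(da.rename({obs_dim: "obs_id"}).drop_vars("obs_id", errors="ignore"))

idata.sample_stats["log_likelihood"] = xr.concat(parts, dim="obs_id")
az.waic(idata)  # or az.loo(idata)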

@ahartikainen
Contributor

Do you want to sum them together or compare them individually?

You can extract each cell (use arviz.utils.flat_inference_data_to_dict) and then either create a new InferenceData with from_dict (with log_likelihood as a single vector) or create a new InferenceData for each variable.

Then compare pair(s).
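
A rough sketch of this route (it pulls the arrays straight out of the InferenceData groups rather than going through flat_inference_data_to_dict, and the variable names are placeholders):

import numpy as np
import arviz as az

# Per-variable pointwise log likelihoods as (chain, draw, n_obs_i) arrays,
# assuming a log_likelihood group is available for each observed RV.
log_liks = [idata.log_likelihood[name].values for name in var_names]
combined = np.concatenate(log_liks, axis=-1)  # one big (chain, draw, total_obs) vector

idata_combined = az.from_dict(
    posterior={name: idata.posterior[name].values for name in idata.posterior.data_vars},
    sample_stats={"log_likelihood": combined},
)
az.loo(idata_combined)

Building one InferenceData per observed RV instead would just mean calling from_dict once per variable with that variable's array.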

@rpgoldman
Contributor Author

@ahartikainen I would like to compare the models as a whole (rather than comparing the submodels), so I think I want to do what @OriolAbril suggests: effectively concatenate all the observed random variables into one big random variable. This is reasonable because the model treats the observations as interchangeable.

@rpgoldman
Contributor Author

> I think it is possible, but for now it will be more convoluted than it needs to be.
>
> The first issue is getting the log_likelihood data from the PyMC3 model. For this you have to use the code in #794 (I can probably rebase and upload it in a few hours) or write/copy some function to get the log likelihood.
>
> The second issue is that ArviZ does not accept multiple log likelihood instances, so the fastest way to get it done is probably to group all the log likelihoods into a single "log_likelihood" variable, something similar to the following pseudocode (assuming the code from #794):
>
> idata.sample_stats["log_likelihood"] = xr.concat(
>     [idata.log_likelihood[var_name] for var_name in var_names], dim="obs_id"
> )
> az.waic(idata)  # or az.loo(idata)
>
> Note: I would start by trying xr.concat, but it may turn out to be some other function.

@OriolAbril I tried this, but it didn't work, because the code for PyMC3Converter._extract_log_likelihood() does not populate the log likelihood data if there is more than one observed RV:

    def _extract_log_likelihood(self):
        """Compute log likelihood of each observation.

        Return None if there is not exactly 1 observed random variable.
        """
        if len(self.model.observed_RVs) != 1:
            return None, None
...

I think it might be possible to build multiple InferenceData objects, one for each observed RV, by modifying the trace before saving it. I'll report back if this works.
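
For reference, a sketch of the "write/copy some function" option mentioned above, mirroring what the converter does for a single observed RV (PyMC3 3.x API; shape handling is simplified and assumes equal-length chains):

import numpy as np

def pointwise_log_likelihood(model, trace):
    # Per-observation log likelihood for every observed RV, as (chain, draw, obs) arrays.
    log_lik = {}
    for rv in model.observed_RVs:
        per_chain = []
        for chain in trace.chains:
            # logp_elemwise evaluates the elementwise log likelihood at one posterior point
            per_chain.append(
                np.stack([rv.logp_elemwise(point) for point in trace.points([chain])])
            )
        log_lik[rv.name] = np.stack(per_chain)
    return log_lik

The resulting arrays could then either be concatenated into one big vector as above or used to build one InferenceData per observed RV.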

@OriolAbril
Member

> @OriolAbril I tried this, but it didn't work, because the code for PyMC3Converter._extract_log_likelihood() does not populate the log likelihood data if there is more than one observed RV

This is why I recommended using the code in #794.

@rpgoldman
Contributor Author

@OriolAbril Thanks for the reminder. I will have a careful look at #794 tomorrow.

@VincentBt
Contributor

> so the fastest way to get it done is probably to group all the log likelihoods into a single "log_likelihood" variable,

What do you suggest doing once you send this concatenation of the log likelihoods for all observed variables to the waic function? Don't you need to sum the log likelihoods at some point? See my comment here.

@OriolAbril
Member

After combining all the log_likelihood data into a single array (stored in sample_stats.log_likelihood), waic and loo will calculate the IC assuming all observations are conditionally independent (or independent, I am not completely sure; I have not found time to do the math); waic and loo already work with n-dimensional arrays. If that is your case, the results will be correct; otherwise you would have to implement the correct version of the algorithm or wait a little longer.

Note: ArviZ currently checks sample_stats first and raises a warning if the log likelihood is still stored there, so doing this will only produce a deprecation warning and not the annoying "Found several log likelihood arrays {}, var_name cannot be None" error.
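
To see why the pointwise values are concatenated rather than summed, here is a plain-numpy sketch of WAIC on the elpd scale (not ArviZ's actual implementation): every column of the combined (chain, draw, obs) array is treated as one observation, so concatenating the per-variable arrays simply adds columns.

import numpy as np
from scipy.special import logsumexp

def waic_sketch(log_lik):
    # log_lik: pointwise log likelihood with shape (chain, draw, n_obs)
    ll = log_lik.reshape(-1, log_lik.shape[-1])         # flatten chains: (samples, n_obs)
    lppd = logsumexp(ll, axis=0) - np.log(ll.shape[0])  # log pointwise predictive density
    p_waic = ll.var(axis=0, ddof=1)                     # per-observation effective parameters
    return (lppd - p_waic).sum(), p_waic.sum()          # elpd_waic, p_waic

Summing the log likelihoods over observations beforehand would collapse the pointwise terms that the penalty (and the importance weights in loo) are computed from.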
