What should logpdf of an observed cell return (i.e., given a rowid which was already incorporated)? #116
Comments
Here's one possible interpretation, for a single model, assuming every model has a notion of latent variables: Let I think this may even be close to or exactly the same quantity that bayeslite/lovecat effectively computes by shenanigans that look incoherent as written.
@riastradh-probcomp yes, your interpretation is the one that lovecat uses. The reason that I put this issue in cgpm, and not in bayeslite, is the question of how we can define the above operation in terms of the cgpm interface. It seems like an overwhelming violation of abstraction to compute predictive probability, even if we carefully specify the Bayesian quantity it is computing (like you did above). Suppose Pah...
Well. How gross is it, really? Suppose we have observed fixed values On the other hand, we could consistently interpret On the third hand, maybe there are cases where it is easy to accidentally ask a qualitatively different question -- about a hypothetical value in an unobserved cell for an otherwise observed row, versus about a hypothetical individual sharing every characteristic in common except one observed cell with an observed row -- in which case perhaps there should be a different name for asking the question. But maybe it is sufficient to split the name
@riastradh-probcomp this formalism seems to me the most reasonable one. First of all, it is model-independent, i.e., there is no notion of telling the cgpm to "use the same latents for the hypothetical row as the observed row", and second of all, its implementation is closest to what is currently being done. I am going to extend your interpretation above of
How the cgpm decides to deal with the latents for the hypothetical row in the query (reuse them from the observed row, resample them based on the user-specified evidence, marginalize over them, etc.) is not specified by the interface. Different cgpms will have the ability to optimize the query differently. It will be some time before I uniformly refactor all cgpms in the library to adhere to the above convention. (Addition: And it is not straightforward to program the above logic for the venturescript cgpm).
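To make the difference between two of these strategies concrete, here is a minimal sketch on a toy two-cluster mixture. Everything in it is an assumption for illustration (the `WEIGHTS` and `MEANS` parameters, the unit variances, and the function names); it is not the cgpm implementation, only one way the "reuse" and "marginalize" options could differ in the number they return.

```python
import math

def normal_logpdf(x, mean, std=1.0):
    """Log density of a univariate normal."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

# Hypothetical mixture: cluster k has weight WEIGHTS[k] and per-column means
# MEANS[k]; columns are independent unit-variance normals given the cluster.
WEIGHTS = [0.5, 0.5]
MEANS = [{0: -2.0, 1: -2.0}, {0: 2.0, 1: 2.0}]

def logpdf_reuse_latent(query, z):
    """Score the query under the cluster z already assigned to the observed row."""
    return sum(normal_logpdf(x, MEANS[z][c]) for c, x in query.items())

def logpdf_marginalize_latent(query, evidence):
    """Score the query with the cluster summed out, conditioned on the evidence."""
    joint, marginal = [], []
    for k, w in enumerate(WEIGHTS):
        ev = math.log(w) + sum(
            normal_logpdf(x, MEANS[k][c]) for c, x in evidence.items())
        marginal.append(ev)
        joint.append(ev + sum(
            normal_logpdf(x, MEANS[k][c]) for c, x in query.items()))
    # log p(query | evidence) = log p(query, evidence) - log p(evidence)
    return logsumexp(joint) - logsumexp(marginal)
```

The two functions agree when the evidence and the stored assignment tell the same story, and diverge when the observed row's stored latent disagrees with the user-specified evidence, which is why leaving the choice to each cgpm can change the returned value.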
Further justification of
- Moves _populate_evidence from State to View.
- Defines simulate/logpdf for an observed row for gpm-crosscat using the proposal of @riastradh from #116, with one caveat: the user may not override any existing values of the rowid on a per-query basis; only missing values in the row may be constrained by logpdf (for query/evidence) and simulate (for logpdf).
- Updates the test suite to capture the errors raised by incorrect invocations of simulate/logpdf.
- Adds a test for querying the cluster assignment in View.

TODO: Should decide whether non-crosscat cgpms should be updated to handle an observed rowid in the same way. Right now factor analysis / multivariate kde / multivariate knn etc. all have their own conventions. I think the definition should be informed by an actual use case of the other GPMs rather than committing to a particular behavior.
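The caveat in the second bullet can be sketched as a small validation step. This is an illustrative stand-in, not the library's `_populate_evidence`: the function name `merge_observed_into_evidence`, the use of `None` for missing cells, and the error messages are all assumptions.

```python
def merge_observed_into_evidence(observed_row, query, evidence=None):
    """Return the evidence extended with the observed cells of an incorporated row.

    observed_row: dict mapping column -> value, with None marking a missing cell.
    query, evidence: dicts mapping column -> value supplied by the user.
    Raises ValueError if the user tries to override an observed cell.
    """
    evidence = dict(evidence or {})
    observed = {c: v for c, v in observed_row.items() if v is not None}

    # Only missing cells of the row may appear in the query or the evidence.
    for name, cells in (("query", query), ("evidence", evidence)):
        clash = sorted(set(cells) & set(observed))
        if clash:
            raise ValueError(
                "%s overrides observed columns %s" % (name, clash))

    # A column cannot be both queried and conditioned on.
    overlap = sorted(set(query) & set(evidence))
    if overlap:
        raise ValueError("columns %s in both query and evidence" % (overlap,))

    merged = dict(observed)
    merged.update(evidence)
    return merged
```

For example, with `observed_row = {0: 1.5, 1: None, 2: 0.3}`, querying column 1 yields the evidence `{0: 1.5, 2: 0.3}`, while querying column 0 raises an error because that cell is already observed.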
…OBABILITY. We synthesize a fresh_rowid with identical constraints for all values other than the query value. This update allows CGPM to answer BQL PREDICTIVE PROBABILITY queries. Related: probcomp/cgpm#116
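The fresh-rowid idea can be sketched with a toy model. The class `ToyBivariateNormal`, its fixed parameters, and the helper `predictive_logpdf` are hypothetical names introduced only for illustration; the real cgpm and bayeslite code paths are not shown in this thread. Note that in this toy the incorporated rows do not influence the parameters, whereas in a learned model they would.

```python
import math

class ToyBivariateNormal:
    """Hypothetical two-column model (columns 0 and 1) with fixed parameters."""

    def __init__(self, mu, sigma, rho):
        self.mu, self.sigma, self.rho = mu, sigma, rho
        self.rows = {}

    def incorporate(self, rowid, values):
        self.rows[rowid] = dict(values)

    def logpdf(self, rowid, query, evidence=None):
        # Mirrors the behaviour described in the issue: an incorporated
        # rowid is rejected, so callers must pass a hypothetical rowid.
        if rowid in self.rows:
            raise ValueError("rowid %r is already incorporated" % (rowid,))
        evidence = evidence or {}
        (col, x), = query.items()  # exactly one query cell
        mean, std = self.mu[col], self.sigma[col]
        other = 1 - col
        if other in evidence:
            # Standard conditional of a bivariate normal.
            z = (evidence[other] - self.mu[other]) / self.sigma[other]
            mean += self.rho * std * z
            std *= math.sqrt(1 - self.rho ** 2)
        return (-0.5 * ((x - mean) / std) ** 2
                - math.log(std) - 0.5 * math.log(2 * math.pi))

def predictive_logpdf(model, rowid, col):
    """Score an observed cell by replaying its row under a fresh rowid."""
    row = model.rows[rowid]
    query = {col: row[col]}
    # Identical constraints for every value other than the query value.
    evidence = {c: v for c, v in row.items() if c != col}
    fresh_rowid = max(model.rows) + 1
    return model.logpdf(fresh_rowid, query, evidence)
```

A call such as `predictive_logpdf(model, 0, 0)` then answers the question for the observed cell in row 0, column 0, without ever asking the model for the density of an already incorporated rowid.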
Consider the following annotated session with a hypothetical lightweight cgpm language:
The question is what should `logpdf` return for an observed cell? Note that `simulate` is not a problem since we know the distribution of the constrained random variable, but it does not have a density. In general, I would be happy to always throw an error (as we do now) on this query... BUT... BQL has this notion of `PREDICTIVE PROBABILITY` (which is an ad-hoc approximation of some Bayesian quantity which is unclearly/unrigorously specified in the BayesDB paper, but turns out to be extremely useful for real-life data analysis workflows).

So, what do we do? Debate!