What should logpdf of an observed cell return (ie given a rowid which was already incorporated)? #116

Open
fsaad opened this issue Jul 5, 2016 · 5 comments

Comments

@fsaad
Collaborator

fsaad commented Jul 5, 2016

Consider the following annotated session with a hypothetical lightweight cgpm language:

# Creates a normal gpm with NIG prior, 1 output variable named 'x' and no inputs.
>> gpm <- normal-gpm(outputs=['x'], inputs=None)

# Returns density of x=2 from the prior.
>> gpm.logpdf(rowid=1, query={'x': 2})
-0.12

# Returns a sample of x_1 | {gpm, data={}}
>> gpm.simulate(rowid=1, query=['x'])
{'x': 1.28}

# Incorporates observation (rowid=1, {'x'=1.28})
>> gpm.incorporate(rowid=1, query={'x': 1.28})

# Returns sample of x_1 | {gpm, data={(rowid=1, {'x'=1.28})}}
>> gpm.simulate(rowid=1, query=['x'])
{'x': 1.28}

# Distribution of x_1 | {gpm, data={(rowid=1, {'x'=1.28})}} has no density:
# it is a dirac(1.28) measure, which is not absolutely continuous wrt any
# measure I know of.
>> gpm.logpdf(rowid=1, query={'x': 1.28})
?? undefined density ??

# Returns sample of x_2 | {gpm, data={(rowid=1, {'x'=1.28})}}
>> gpm.simulate(rowid=2, query=['x'])
{'x': 0.41}

# Deletes observation (rowid=1, {'x'=1.28})
>> gpm.unincorporate(rowid=1)

# Returns 1 sample of x_1 | {gpm, data={}}
>> gpm.simulate(rowid=1, query=['x'])
{'x': -0.18}

The question is: what should logpdf return for an observed cell? Note that simulate is not a problem, since we know the distribution of the constrained random variable even though it does not have a density. In general, I would be happy to always throw an error (as we do now) on this query...

BUT ... BQL has this notion of PREDICTIVE PROBABILITY (which is an ad-hoc approximation of some Bayesian quantity which is unclearly/unrigorously specified in the BayesDB paper, but turns out to be extremely useful for real life data analysis workflows).

So, what do we do? Debate!
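A minimal Python sketch of the session above, under stated assumptions: the class `NormalGpm`, its hyperparameter names (`m`, `k`, `a`, `b`) and default values are hypothetical and are not the cgpm library's API. It uses the standard Student-t posterior predictive of a normal model with a Normal-Inverse-Gamma prior, and raises an error when `logpdf` is asked about an already incorporated rowid, which is the "always throw an error" behavior described above.

```python
import math
import random


class NormalGpm:
    """Toy one-variable normal gpm with a NIG prior (hypothetical, not cgpm's API)."""

    def __init__(self, m=0.0, k=1.0, a=1.0, b=1.0, seed=0):
        self.m, self.k, self.a, self.b = m, k, a, b  # NIG hyperparameters (assumed)
        self.data = {}                               # rowid -> observed value of 'x'
        self.rng = random.Random(seed)

    def incorporate(self, rowid, query):
        if rowid in self.data:
            raise ValueError('rowid %r already incorporated' % (rowid,))
        self.data[rowid] = query['x']

    def unincorporate(self, rowid):
        del self.data[rowid]

    def _posterior(self):
        # Conjugate NIG update given all incorporated observations.
        xs = list(self.data.values())
        n = len(xs)
        if n == 0:
            return self.m, self.k, self.a, self.b
        mean = sum(xs) / n
        ss = sum((x - mean) ** 2 for x in xs)
        kn = self.k + n
        mn = (self.k * self.m + n * mean) / kn
        an = self.a + n / 2.0
        bn = self.b + 0.5 * ss + self.k * n * (mean - self.m) ** 2 / (2.0 * kn)
        return mn, kn, an, bn

    def logpdf(self, rowid, query):
        if rowid in self.data:
            # The constrained variable is a point mass: no density to report.
            raise ValueError('logpdf is undefined for an observed cell')
        mn, kn, an, bn = self._posterior()
        # Posterior predictive: Student-t with 2*an degrees of freedom,
        # location mn and squared scale bn * (kn + 1) / (an * kn).
        nu = 2.0 * an
        scale2 = bn * (kn + 1.0) / (an * kn)
        z2 = (query['x'] - mn) ** 2 / scale2
        return (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
                - 0.5 * math.log(nu * math.pi * scale2)
                - (nu + 1.0) / 2.0 * math.log1p(z2 / nu))

    def simulate(self, rowid, query=('x',)):
        if rowid in self.data:
            # Observed cell: the conditional distribution is dirac(observed value).
            return {'x': self.data[rowid]}
        mn, kn, an, bn = self._posterior()
        sigma2 = bn / self.rng.gammavariate(an, 1.0)    # sigma^2 ~ InvGamma(an, bn)
        mu = self.rng.gauss(mn, math.sqrt(sigma2 / kn))  # mu | sigma^2
        return {'x': self.rng.gauss(mu, math.sqrt(sigma2))}
```

With this sketch, `simulate(rowid=1)` after `incorporate(rowid=1, {'x': 1.28})` always returns `{'x': 1.28}`, while `logpdf(rowid=1, ...)` raises `ValueError` until the row is unincorporated.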

@riastradh-probcomp
Contributor

riastradh-probcomp commented Jul 6, 2016

Here's one possible interpretation, for a single model, assuming every model has a notion of latent variables: Let a_r, b_r be the values of observable variables in the observed row r, and x_r the value of the latent variable in observed row r; let A_*, B_* be observable variables of an unobserved row, and X_* be the latent variable of an unobserved row. At row r, PREDICTIVE PROBABILITY OF B can be interpreted as Pr[B_* = b_r | A_* = a_r, X_* = x_r].

I think this may even be close to or exactly the same quantity that bayeslite/lovecat effectively computes by shenanigans that look incoherent as written.

@fsaad
Collaborator Author

fsaad commented Jul 6, 2016

@riastradh-probcomp yes, your interpretation is the one that lovecat uses. I put this issue in cgpm, and not in bayeslite, because the question is how to define the above operation in terms of the cgpm interface. It seems like an overwhelming violation of abstraction to compute predictive probability, even if we carefully specify the Bayesian quantity it is computing (like you did above).

Suppose A has index 0 and B has index 1 in your example above. One possible way to compute the predictive probability is to define a rowid' <- cgpm.clone(rowid) method, which synthesizes a new observation rowid' that is identical to rowid. Then we can use cgpm.unincorporate(rowid', query=[1]), followed by cgpm.logpdf(rowid', query={1:b}).

Pah...
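The clone / unincorporate / logpdf recipe above can be sketched as follows. Everything here is an assumption made for illustration: `ToyMixtureCgpm`, its fixed cluster means, the `clone` method (which the comment only proposes) and the helper `predictive_probability` are hypothetical and not part of cgpm. In this toy model the columns are independent normals given a per-row cluster, so the latent X of the cloned row is simply copied from the observed row.

```python
import math


def normal_logpdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


class ToyMixtureCgpm:
    """Hypothetical cgpm: columns are independent normals given a per-row cluster."""

    def __init__(self, means, sigma=1.0):
        self.means = means   # cluster -> {column index: mean}; fixed, never learned
        self.sigma = sigma
        self.rows = {}       # rowid -> {column index: observed value}
        self.latents = {}    # rowid -> cluster assignment (the latent X_r)

    def incorporate(self, rowid, query, cluster):
        self.rows[rowid] = dict(query)
        self.latents[rowid] = cluster

    def clone(self, rowid):
        # Synthesize a new row with the same cells and the same latent.
        fresh = max(self.rows) + 1
        self.rows[fresh] = dict(self.rows[rowid])
        self.latents[fresh] = self.latents[rowid]
        return fresh

    def unincorporate(self, rowid, query=None):
        if query is None:
            # Remove the whole row, including its latent.
            del self.rows[rowid]
            del self.latents[rowid]
        else:
            # Remove only the listed cells.
            for col in query:
                del self.rows[rowid][col]

    def logpdf(self, rowid, query):
        row = self.rows[rowid]
        if any(col in row for col in query):
            raise ValueError('logpdf is undefined for an observed cell')
        z = self.latents[rowid]
        return sum(normal_logpdf(v, self.means[z][c], self.sigma)
                   for c, v in query.items())


def predictive_probability(cgpm, rowid, col):
    """log Pr[B_* = b_r | A_* = a_r, X_* = x_r] via a temporary cloned row."""
    value = cgpm.rows[rowid][col]
    fresh = cgpm.clone(rowid)
    try:
        cgpm.unincorporate(fresh, query=[col])
        return cgpm.logpdf(fresh, {col: value})
    finally:
        cgpm.unincorporate(fresh)  # leave the cgpm state as we found it
```

The `try`/`finally` is a design choice of the sketch: the cloned row is scratch state, so it is removed even if `logpdf` raises.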

@riastradh-probcomp
Contributor

Well. How gross is it, really?

Suppose we have observed fixed values a_r, b_r for the random variables A_r, B_r, i.e. we have a distribution conditioned on A_r = a_r, B_r = b_r among the data D. Then if we interpret cgpm.logpdf(A_r = 42) to mean log Pr[A_r = 42 | D], obviously the answer should be a resounding 0. But is that useful? Would anyone ever want to ask that question, in practice?

On the other hand, we could consistently interpret cgpm.logpdf(A_r = 42) to mean log Pr[A_* = 42 | D, B_* = b_r, X_* = x_r] -- i.e., any time someone asks about an observed cell in an observed row, give the answer instead for the corresponding cell in a hypothetical row having all the other information from the observed row.

On the third hand, maybe there are cases where it is easy to accidentally ask a qualitatively different question -- about a hypothetical value in an unobserved cell for an otherwise observed row, versus about a hypothetical individual sharing every characteristic in common except one observed cell with an observed row -- in which case perhaps there should be a different name for asking the question.

But maybe it is sufficient to split the name logpdf into logpdf_observed and logpdf_unobserved, instead of any more elaborate API -- do we actually have any uses for asking simultaneously about multiple cells in observed and unobserved rows, as logpdf currently supports?

@fsaad
Collaborator Author

fsaad commented Jul 25, 2016

> On the other hand, we could consistently interpret cgpm.logpdf(A_r = 42) to mean log Pr[A_* = 42 | D, B_* = b_r, X_* = x_r] -- i.e., any time someone asks about an observed cell in an observed row, give the answer instead for the corresponding cell in a hypothetical row having all the other information from the observed row.

@riastradh-probcomp this formalism seems to me the most reasonable one. First of all, it is model independent, i.e. there is no notion of telling the cgpm to "use the same latents for the hypothetical row as the observed row", and second of all its implementation is closest to what is currently being done. I am going to extend your interpretation above of cgpm.logpdf(A_r=42) to the case where there is an evidence clause, i.e. cgpm.logpdf(rowid=r, query={A:42}, evidence={B:5}), where the behavior will be to compute:

Pr[A_* = 42 | B_* = 5, X_* = x_r], the point being that b_r has been replaced with the user-specified constraint B_* = 5, while all other row variables are reused.

How the cgpm decides to deal with the latents for the hypothetical row in the query (reuse them from the observed row, resample them based on the user-specified evidence, marginalize over them, etc) is not specified by the interface. Different cgpms will have the ability to optimize the query differently.

It will be some time before I uniformly refactor all cgpms in the library to adhere to the above convention. (Addition: And it is not straightforward to program the above logic for the venturescript cgpm).
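A hedged sketch of the convention with an evidence clause, assuming a toy mixture of independent normals with fixed parameters. All names (`hypothetical_constraints`, `logpdf_reuse_latent`, `logpdf_marginalize_latent`, `means`, `weights`) are hypothetical. The first helper builds the constraints of the hypothetical row: observed cells, minus the queried columns, overridden by user evidence. The other two show two of the latent-handling strategies that the interface deliberately leaves unspecified: reusing the observed row's cluster, or marginalizing the cluster given the constraints.

```python
import math


def normal_logpdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def hypothetical_constraints(observed_row, query, evidence):
    """Observed cells of row r, minus queried columns, overridden by evidence."""
    constraints = {c: v for c, v in observed_row.items() if c not in query}
    constraints.update(evidence)
    return constraints


def logpdf_reuse_latent(means, sigma, cluster, query):
    """Strategy 1: condition on X_* = x_r, the observed row's cluster.

    Columns are independent given the cluster in this toy model, so the
    constraints do not enter the density once the latent is fixed.
    """
    return sum(normal_logpdf(v, means[cluster][c], sigma) for c, v in query.items())


def logpdf_marginalize_latent(means, weights, sigma, constraints, query):
    """Strategy 2: marginalize the cluster of the hypothetical row.

    log p(query | constraints)
        = log sum_z w_z p(constraints | z) p(query | z)
        - log sum_z w_z p(constraints | z)
    """
    marginal, joint = [], []
    for z, w in weights.items():
        lc = math.log(w) + sum(normal_logpdf(v, means[z][c], sigma)
                               for c, v in constraints.items())
        marginal.append(lc)
        joint.append(lc + sum(normal_logpdf(v, means[z][c], sigma)
                              for c, v in query.items()))
    return logsumexp(joint) - logsumexp(marginal)
```

For example, `hypothetical_constraints({'A': 1.0, 'B': 2.0, 'C': 3.0}, {'A': 42.0}, {'B': 5.0})` yields `{'B': 5.0, 'C': 3.0}`: the observed A is dropped because it is queried, and the observed B is replaced by the evidence.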

@riastradh-probcomp
Contributor

Further justification of Pr[A_* = a_r | B_* = b_r]: When, in Crosscat, we evaluate PREDICTIVE PROBABILITY OF A as a Monte Carlo integral of Pr[A_* = a_r | B_* = b_r, X_* = x_{r,i}, M = m_i] over all models m_i (i.e., clusterings) and latent variables x_{r,i} (i.e., the category assignment of row r in model i), I suspect we are effectively computing a Monte Carlo estimate of Pr[A_* = a_r | B_* = b_r] already. If, instead, we evaluated a Monte Carlo integral of Pr[A_* = a_r | B_* = b_r, M = m_i], i.e. averaging over all possible latent variables of row r given the model i, I think that would be another Monte Carlo estimate of the same quantity Pr[A_* = a_r | B_* = b_r].
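Written out, the two Monte Carlo averages being compared are as follows. The names \hat{p}_1 and \hat{p}_2 and the sample count N are notation introduced here, not taken from the thread; the second display is just the law of total probability over the latent of the hypothetical row, written as a sum on the assumption that the latent is discrete (e.g. a category assignment).

```latex
\begin{align*}
\hat{p}_1 &= \frac{1}{N} \sum_{i=1}^{N}
  \Pr[A_* = a_r \mid B_* = b_r,\ X_* = x_{r,i},\ M = m_i], \\
\hat{p}_2 &= \frac{1}{N} \sum_{i=1}^{N}
  \Pr[A_* = a_r \mid B_* = b_r,\ M = m_i], \\
\Pr[A_* = a_r \mid B_* = b_r,\ M = m_i]
  &= \sum_{x} \Pr[A_* = a_r \mid B_* = b_r,\ X_* = x,\ M = m_i]\,
     \Pr[X_* = x \mid B_* = b_r,\ M = m_i].
\end{align*}
```

Whether \hat{p}_1 targets the same quantity as \hat{p}_2 depends on how the sampled x_{r,i} are distributed relative to Pr[X_* = x | B_* = b_r, M = m_i]; the comment above states this only as a suspicion.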

fsaad changed the title from "What should logpdf of an observed cell return?" to "What should logpdf of an observed cell return (ie given a rowid which was already incorporated)?" on Nov 19, 2016
fsaad pushed a commit that referenced this issue Nov 21, 2016
- Moves _populate_evidence from State to View.
- Defines simulate/logpdf for an observed row for gpm-crosscat using
the proposal of @riastradh from #116
with one caveat: the user may not override any existing values of the rowid on
a per-query basis; only missing values in the row may be constrained by
logpdf (for query/evidence) and simulate (for logpdf).
- Updates test suite to capture the errors of incorrect invocations of
simulate/logpdf.
- Adds test for querying the cluster assignment in View.

TODO: Should decide whether non-crosscat cgpms should be updated to handle
an observed rowid in the same way. Right now
factor analysis / multivariate kde / multivariate knn etc all have
their own conventions. I think the definition should be informed by an
actual use case of the other GPMs rather than commit to a particular
behavior.
fsaad pushed a commit to probcomp/bayeslite that referenced this issue Nov 23, 2016
…OBABILITY.

We synthesize a fresh_rowid with identical constraints for all values
other than the query value. This update allows CGPM to answer BQL
PREDICTIVE PROBABILITY queries.

Related: probcomp/cgpm#116
2 participants