What should logpdf of an observed cell return (ie given a rowid which was already incorporated)? #116

fsaad · 2016-07-05T23:40:03Z

Consider the following annotated session with a hypothetical lightweight cgpm language:

# Creates a normal gpm with NIG prior, 1 output variable named 'x' and no inputs.
>> gpm <- normal-gpm(outputs=['x'], inputs=None)

# Returns density of x=2 from the prior.
>> gpm.logpdf(rowid=1, query={'x': 2})
-0.12

# Returns a sample of x_1 | {gpm, data={}}
>> gpm.simulate(rowid=1, query=['x'])
{'x': 1.28}

# Incorporates observation (rowid=1, {'x'=1.28})
>> gpm.incorporate(rowid=1, query={'x': 1.28})

# Returns sample of x_1 | {gpm, data={(rowid=1, {'x'=1.28})}}
>> gpm.simulate(rowid=1, query=['x'])
{'x': 1.28}

# Distribution of x_1 | {gpm, data={(rowid=1, {'x'=1.28})}} has no density
# it is a dirac(1.28) measure which is not absolutely continuous wrt any
# measure I know of.
>> gpm.logpdf(rowid=1, query={'x': 1.28})
?? undefined density ??

# Returns sample of x_2 | {gpm, data={(rowid=1, {'x'=1.28})}}
>> gpm.simulate(rowid=2, query=['x'])
{'x': 0.41}

# Deletes observation (rowid=1, {'x'=1.28})
>> gpm.unincorporate(rowid=1)

# Returns 1 sample of x_1 | {gpm, data={}}
>> gpm.simulate(rowid=1, query=['x'])
{'x': -0.18}

The question is: what should logpdf return for an observed cell? Note that simulate is not a problem, since we know the distribution of the constrained random variable even though it does not have a density. In general, I would be happy to always throw an error (as we do now) on this query...
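For concreteness, here is a minimal Python sketch of the session above, including the current "throw an error" behavior. It is an illustration only: `ToyNormalGpm`, its hyperparameter names (`mu0`, `kappa0`, `alpha0`, `beta0`), and their default values are assumptions made for this example and are not the cgpm library's classes or signatures.

```python
import math


class ToyNormalGpm:
    """Toy normal model with a Normal-Inverse-Gamma prior (hypothetical, not cgpm's class)."""

    def __init__(self, mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0):
        # Hyperparameter names and defaults are assumptions for this sketch.
        self.mu0, self.kappa0, self.alpha0, self.beta0 = mu0, kappa0, alpha0, beta0
        self.data = {}  # rowid -> observed value of 'x'

    def incorporate(self, rowid, query):
        if rowid in self.data:
            raise ValueError("rowid %r already incorporated" % (rowid,))
        self.data[rowid] = query['x']

    def unincorporate(self, rowid):
        del self.data[rowid]

    def _posterior(self):
        # Standard conjugate NIG update given the incorporated data.
        xs = list(self.data.values())
        n = len(xs)
        kappa = self.kappa0 + n
        alpha = self.alpha0 + n / 2.0
        if n == 0:
            return self.mu0, kappa, alpha, self.beta0
        xbar = sum(xs) / n
        ss = sum((x - xbar) ** 2 for x in xs)
        mu = (self.kappa0 * self.mu0 + n * xbar) / kappa
        beta = (self.beta0 + 0.5 * ss
                + self.kappa0 * n * (xbar - self.mu0) ** 2 / (2.0 * kappa))
        return mu, kappa, alpha, beta

    def logpdf(self, rowid, query):
        # Observed cell: x_rowid | data is a point mass, so no density exists.
        if rowid in self.data:
            raise ValueError("logpdf undefined for observed rowid %r" % (rowid,))
        mu, kappa, alpha, beta = self._posterior()
        # The posterior predictive is Student-t with 2*alpha degrees of freedom.
        df = 2.0 * alpha
        scale2 = beta * (kappa + 1.0) / (alpha * kappa)
        z2 = (query['x'] - mu) ** 2 / (df * scale2)
        return (math.lgamma((df + 1.0) / 2.0) - math.lgamma(df / 2.0)
                - 0.5 * math.log(df * math.pi * scale2)
                - (df + 1.0) / 2.0 * math.log1p(z2))
```

With this sketch, `logpdf(rowid=1, ...)` works before `incorporate(rowid=1, ...)`, raises afterwards, and returns the prior value again after `unincorporate(rowid=1)`.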

BUT ... BQL has this notion of PREDICTIVE PROBABILITY (which is an ad-hoc approximation of some Bayesian quantity which is unclearly/unrigorously specified in the BayesDB paper, but turns out to be extremely useful for real life data analysis workflows).

So, what do we do? Debate!


riastradh-probcomp · 2016-07-06T20:39:36Z

Here's one possible interpretation, for a single model, assuming every model has a notion of latent variables: Let a_r, b_r be the values of observable variables in the observed row r, and x_r the value of the latent variable in observed row r; let A_*, B_* be observable variables of an unobserved row, and X_* be the latent variable of an unobserved row. At row r, PREDICTIVE PROBABILITY OF B can be interpreted as Pr[B_* = b_r | A_* = a_r, X_* = x_r].

I think this may even be close to or exactly the same quantity that bayeslite/lovecat effectively computes by shenanigans that look incoherent as written.

fsaad · 2016-07-06T22:36:14Z

@riastradh-probcomp yes, your interpretation is the one that lovecat uses. The reason that I put this issue in cgpm, and not in bayeslite, is how can we define the above operation in terms of the cgpm interface? It seems like an overwhelming violation of abstraction to compute predictive probability, even if we carefully specify the Bayesian quantity it is computing (like you did above).

Suppose A has index 0 and B has index 1 in your example above. One possible way to compute the predictive probability is to define a rowid' <- cgpm.clone(rowid) method, which synthesizes a new observation rowid' that is identical to rowid. Then we can use cgpm.unincorporate(rowid', query=[1]), followed by cgpm.logpdf(rowid', query={1:b}).

Pah...
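The clone-then-unincorporate recipe above can be sketched in Python as follows. This is a toy under stated assumptions: `ToyClusterGpm`, its `clone` method, the cell-level `unincorporate(rowid, query=[...])`, and the fixed cluster means are all hypothetical and exist only for this example. Because the toy fixes its parameters, it sidesteps the fact that in a model whose parameters depend on the data, the cloned row would itself contribute to D.

```python
import math


def normal_logpdf(x, mu, sigma):
    """Log density of Normal(mu, sigma^2) at x."""
    return (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
            - (x - mu) ** 2 / (2.0 * sigma ** 2))


class ToyClusterGpm:
    """Two fixed clusters; each row stores its cells and a latent cluster index."""

    # Hypothetical per-cluster means for column 0 (A) and column 1 (B); unit variance.
    MEANS = {0: {0: -2.0, 1: -2.0}, 1: {0: 2.0, 1: 2.0}}

    def __init__(self):
        self.rows = {}     # rowid -> {column: value}
        self.latents = {}  # rowid -> cluster index

    def incorporate(self, rowid, query, cluster):
        self.rows[rowid] = dict(query)
        self.latents[rowid] = cluster

    def unincorporate(self, rowid, query=None):
        if query is None:
            # Remove the whole row.
            del self.rows[rowid]
            del self.latents[rowid]
        else:
            # Remove only the listed cells; the latent is kept.
            for col in query:
                del self.rows[rowid][col]

    def clone(self, rowid):
        # Hypothetical method: synthesize a new row identical to rowid.
        fresh = max(self.rows) + 1
        self.incorporate(fresh, self.rows[rowid], self.latents[rowid])
        return fresh

    def logpdf(self, rowid, query):
        if any(col in self.rows[rowid] for col in query):
            raise ValueError("cannot query an observed cell")
        k = self.latents[rowid]
        return sum(normal_logpdf(v, self.MEANS[k][c], 1.0)
                   for c, v in query.items())


def predictive_probability(gpm, rowid, col):
    """Clone the row, drop one cell from the clone, and score the dropped value."""
    value = gpm.rows[rowid][col]
    fresh = gpm.clone(rowid)
    try:
        gpm.unincorporate(fresh, query=[col])
        return gpm.logpdf(fresh, {col: value})
    finally:
        gpm.unincorporate(fresh)  # leave the dataset as we found it
```

The `try`/`finally` keeps the temporary row from leaking into the dataset if `logpdf` raises.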

riastradh-probcomp · 2016-07-06T23:21:36Z

Well. How gross is it, really?

Suppose we have observed fixed values a_r, b_r for the random variables A_r, B_r, i.e. we have a distribution conditioned on A_r = a_r, B_r = b_r among the data D. Then if we interpret cgpm.logpdf(A_r = 42) to mean log Pr[A_r = 42 | D], obviously the probability should be a resounding 0 (i.e. a log probability of -infinity) whenever a_r is not 42. But is that useful? Would anyone ever want to ask that question, in practice?

On the other hand, we could consistently interpret cgpm.logpdf(A_r = 42) to mean log Pr[A_* = 42 | D, B_* = b_r, X_* = x_r] -- i.e., any time someone asks about an observed cell in an observed row, give the answer instead for the corresponding cell in a hypothetical row having all the other information from the observed row.

On the third hand, maybe there are cases where it is easy to accidentally ask a qualitatively different question -- about a hypothetical value in an unobserved cell for an otherwise observed row, versus about a hypothetical individual sharing every characteristic in common except one observed cell with an observed row -- in which case perhaps there should be a different name for asking the question.

But maybe it is sufficient to split the name logpdf into logpdf_observed and logpdf_unobserved, instead of any more elaborate API -- do we actually have any uses for asking simultaneously about multiple cells in observed and unobserved rows, as logpdf currently supports?

fsaad · 2016-07-25T13:57:46Z

> On the other hand, we could consistently interpret cgpm.logpdf(A_r = 42) to mean log Pr[A_* = 42 | D, B_* = b_r, X_* = x_r] -- i.e., any time someone asks about an observed cell in an observed row, give the answer instead for the corresponding cell in a hypothetical row having all the other information from the observed row.

@riastradh-probcomp this formalism seems to me the most reasonable one. First of all, it is model independent, i.e. there is no notion of telling the cgpm to "use the same latents for the hypothetical row as the observed row", and second of all, its implementation is closest to what is currently being done. I am going to extend your interpretation above of cgpm.logpdf(A_r=42) to the case where there is an evidence clause, i.e. cgpm.logpdf(rowid=r, query={A:42}, evidence={B:5}), where the behavior will be to compute:

Pr[A_* = 42 | B_* = 5, X_* = x_r],

the point being that b_r has been replaced with the user-specified constraint B_r=5, while all other row variables are reused.

How the cgpm decides to deal with the latents for the hypothetical row in the query (reuse them from the observed row, resample them based on the user-specified evidence, marginalize over them, etc) is not specified by the interface. Different cgpms will have the ability to optimize the query differently.

It will be some time before I uniformly refactor all cgpms in the library to adhere to the above convention. (Addition: And it is not straightforward to program the above logic for the venturescript cgpm).

riastradh-probcomp · 2016-07-25T19:14:14Z

Further justification of Pr[A_* = a_r | B_* = b_r]: When, in Crosscat, we evaluate PREDICTIVE PROBABILITY OF A as a Monte Carlo integral of Pr[A_* = a_r | B_* = b_r, X_* = x_{r,i}, M = m_i] over all models m_i (i.e., clusterings) and latent variables x_{r,i} (i.e., the category assignment of row r in model i), I suspect we are effectively computing a Monte Carlo estimate of Pr[A_* = a_r | B_* = b_r] already. If, instead, we evaluated a Monte Carlo integral of Pr[A_* = a_r | B_* = b_r, M = m_i], i.e. averaging over all possible latent variables of row r given the model i, I think that would be another Monte Carlo estimate of the same quantity Pr[A_* = a_r | B_* = b_r].
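As a small aside on the mechanics, the Monte Carlo integral described above averages the per-model densities, not the per-model log densities. A minimal Python sketch follows, assuming the per-model values are available as log densities; the helper name `logmeanexp` is hypothetical and introduced only for this example.

```python
import math


def logmeanexp(log_values):
    """Return log((1/N) * sum_i exp(log_values[i])), computed stably."""
    m = max(log_values)
    total = sum(math.exp(v - m) for v in log_values)
    return m + math.log(total) - math.log(len(log_values))


# Hypothetical per-model log densities log Pr[A_* = a_r | B_* = b_r, X_* = x_{r,i}, M = m_i].
per_model_logpdfs = [math.log(0.2), math.log(0.4)]
estimate = logmeanexp(per_model_logpdfs)  # log of the averaged density
```

Subtracting the maximum before exponentiating avoids underflow when the per-model log densities are very negative.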

@riastradh

- Moves _populate_evidence from State to View.
- Defines simulate/logpdf for an observed row for gpm-crosscat using the proposal of @riastradh from #116, with one caveat: the user may not override any existing values of the rowid on a per-query basis; only missing values in the row may be constrained by logpdf (for query/evidence) and simulate (for logpdf).
- Updates test suite to capture the errors of incorrect invocations of simulate/logpdf.
- Adds test for querying the cluster assignment in View.

TODO: Should decide whether non-crosscat cgpms should be updated to handle an observed rowid in the same way. Right now factor analysis / multivariate kde / multivariate knn etc. all have their own conventions. I think the definition should be informed by an actual use case of the other GPMs rather than committing to a particular behavior.

…OBABILITY. We synthesize a fresh_rowid with identical constraints for all other values other than the query value. This update allows CGPM to answer BQL PREDICTIVE PROBABILITY queries. Related: probcomp/cgpm#116

fsaad changed the title ~~What should logpdf of an observed cell return?~~ What should logpdf of an observed cell return (ie given a rowid which was already incorporated)? Nov 19, 2016

This was referenced Nov 23, 2016

What should logpdf multirow of observed rows return? #181

Closed

What should logpdf multirow of observed row conditioned on the same observed row return? #182

Closed
