
Refactor pretty-printing and fix latex underscores #6533

Draft
wants to merge 3 commits into main

Conversation

covertg

@covertg covertg commented Feb 18, 2023

What is this PR about?
printing.py had developed a decent amount of duplicated code. I wanted to work on #6508 but found it difficult to parse through the logic. Through the course of understanding how the submodule works, I ended up refactoring it somewhat.

This PR fixes the LaTeX underscore bug as discussed in that issue and also seeks to clean up pretty-printing. In particular,

  • Addresses BUG: KaTeX parse error on Jupyter Notebook when using underscore, etc. #6508 by escaping \text commands to print underscores
  • Unifies str_for_model_var and str_for_potential_or_deterministic
  • Deprecates the unused include_params argument
    • The codebase seems to not use this argument at all, and removing it simplifies the logic of some functions in printing.py as well as in tests. But I could re-incorporate it if preferred.
  • Parses the value of the formatting arg a bit more strictly
  • Changes the printing of constants to include shape information, e.g. <constant (1,2)>
  • Updates tests to reflect changes
  • Adds a handful of other tests for printing functionality: model names, names with edge case characters (underscore, $, ~), and misc tests for the warning and errors that may be thrown by pretty printing
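The new constant printing described above can be sketched roughly as follows (a hypothetical helper for illustration only; pymc's actual code in printing.py differs):

```python
# Sketch of printing a constant with its shape, e.g. "<constant (1, 2)>".
# `str_for_constant` is a made-up name, not a function in pymc.
import numpy as np

def str_for_constant(var_data: np.ndarray, var_type: str = "constant") -> str:
    # Include the array's shape so constants are distinguishable in the output
    return f"<{var_type} {var_data.shape}>"

print(str_for_constant(np.zeros((1, 2))))  # <constant (1, 2)>
```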

Checklist

Major / Breaking Changes

  • Passing include_params to the str_repr function now raises a FutureWarning.
  • The formatting argument is now expected to be exactly either "plain" or "latex".
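The two breaking changes above can be sketched together (a hypothetical validation helper, assuming the behavior described; not pymc's exact code):

```python
# Sketch: strict validation of `formatting` plus a FutureWarning for the
# deprecated `include_params` argument. Names here are illustrative only.
import warnings

def validate_printing_args(formatting: str, include_params=None) -> str:
    if formatting not in ("plain", "latex"):
        raise ValueError(
            f"formatting must be exactly 'plain' or 'latex', got {formatting!r}"
        )
    if include_params is not None:
        warnings.warn(
            "The `include_params` argument has been deprecated. "
            "Future versions will always include parameter values.",
            FutureWarning,
            stacklevel=2,
        )
    return formatting
```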

New features

  • Constant shape information as discussed. (Sorry, issue creep. It's a one-liner now. But I can also leave this for a separate issue if preferred.)
  • Tildes (~) in user-specified strings are converted to dashes (-). Practically this is because str_for_model uses ~ and \sim to align the model str, and it's annoying to typeset a plain tilde character in LaTeX.
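The escaping behavior described in the list above (underscores escaped inside \text, tildes mapped to dashes) could look something like this minimal sketch (`escape_latex_name` is a hypothetical name, not pymc's implementation):

```python
# Sketch of escaping LaTeX-special characters in user-supplied variable names
# so KaTeX/MathJax render them literally. Illustrative only.

def escape_latex_name(name: str) -> str:
    replacements = {
        "_": r"\_",   # a bare underscore triggers a subscript / KaTeX parse error
        "$": r"\$",   # a bare dollar sign would open/close math mode
        "~": "-",     # tildes are awkward to typeset in LaTeX; map to a dash
    }
    for char, escaped in replacements.items():
        name = name.replace(char, escaped)
    return name

def latex_var_name(name: str) -> str:
    # Wrap the escaped name in \text{...} for use inside math mode
    return rf"\text{{{escape_latex_name(name)}}}"

print(latex_var_name("my_var"))  # \text{my\_var}
```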

Bugfixes

Documentation

  • ...

Maintenance

  • While working on this I noticed that the _repr_latex_() methods assigned to Distributions and Models are essentially the same as their str_repr functions, just with latex enforced. Nothing else in the pymc codebase uses these methods, so they looked like a candidate for removal, but I've since realized that this method is necessary for latex to automagically appear when supported by ipython.
  • The inputs to Potentials and Deterministics have been printed as a function-like expression, e.g. Deterministic(f(arg1, ...)). In the updated submodule, we maintain this behavior, but it would be easy to drop the f() wrapper for this case if wanted. On reflection it seems pretty sensible to keep the f().
  • I also noticed that we don't have any tests for printing the model name, so I added some tests involving this. This includes the case for nested models (one pm.Model() created within the context of another). I've never done this myself in pymc, so I could use a sanity check on whether we need/want to be testing this.


covertg commented Feb 18, 2023

Forgot to mention: As a qualitative test, I also checked that the "monolithic" model in test_printing.py renders correctly in both MathJax and KaTeX.


codecov bot commented Feb 18, 2023

Codecov Report

Merging #6533 (8370d97) into main (c2ce47f) will decrease coverage by 9.51%.
The diff coverage is 87.23%.

❗ Current head 8370d97 differs from pull request most recent head f5af848. Consider uploading reports for the commit f5af848 to get more accurate results

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #6533      +/-   ##
==========================================
- Coverage   94.74%   85.24%   -9.51%     
==========================================
  Files         147      147              
  Lines       27861    27855       -6     
==========================================
- Hits        26397    23745    -2652     
- Misses       1464     4110    +2646     
Impacted Files Coverage Δ
pymc/printing.py 87.82% <86.36%> (-0.08%) ⬇️
pymc/distributions/distribution.py 96.31% <100.00%> (ø)
pymc/model.py 89.90% <100.00%> (-0.24%) ⬇️
pymc/tests/logprob/test_utils.py 0.00% <0.00%> (-100.00%) ⬇️
pymc/tests/logprob/test_cumsum.py 0.00% <0.00%> (-100.00%) ⬇️
pymc/tests/logprob/test_mixture.py 0.00% <0.00%> (-100.00%) ⬇️
pymc/tests/logprob/test_abstract.py 0.00% <0.00%> (-100.00%) ⬇️
pymc/tests/logprob/test_rewriting.py 0.00% <0.00%> (-100.00%) ⬇️
pymc/tests/logprob/test_joint_logprob.py 0.00% <0.00%> (-100.00%) ⬇️
pymc/tests/logprob/test_composite_logprob.py 0.00% <0.00%> (-100.00%) ⬇️
... and 18 more

pymc/printing.py Outdated
elif "latex" in formatting:
return rf"\text{{<{var_type}>}}"
# Recurse
return _str_for_input_var(var.owner.inputs[0], formatting=formatting)
Author

I carried this over from the previous version.
When would we receive a DimShuffle here, and can we be sure that only the first input is relevant?


covertg commented Feb 20, 2023

I pushed some changes. Mainly, they clarify the exceptions that are thrown, add a number of printing tests, and fix an edge-case bug or two. I've also updated the original post to reflect the latest. I won't be changing this PR any more until review (though there are a few questions or comments that I will add to the code).

else:
return f"<{var_type} {var_data.shape}>"
elif _has_owner(var):
if isinstance(var.owner.op, DimShuffle):
Author

Still have this question. Is the DimShuffle Op case still one that we need to handle? If so, let's add a test for it.

Member

I think so. My understanding is that this is to retrieve the original RV that has been resized (or DimShuffled).

Author

I see, cool. What would be an example of a model where the original RV gets resized or dimshuffled? (@ricardoV94 )

with Model() as self.model:
# TODO: some variables commented out here as they're not working properly
# in v4 yet (9-jul-2021), so doesn't make sense to test str/latex for them

# Priors for unknown model parameters
alpha = Normal("alpha", mu=0, sigma=10)
b = Normal("beta", mu=0, sigma=10, size=(2,), observed=beta)
# TODO why is this observed?
Author

Shouldn't be a huge deal but it seems curious that b/beta has been observed

Member

Hmm, I've never noticed this. I also don't think that beta should be observed? It does not seem to affect the test nor its purpose though... Curious indeed

# Simulate outcome variable
Y = alpha + X.dot(beta) + np.random.randn(size) * sigma
Y = alpha0 + X.dot(beta0) + np.random.randn(size) * sigma0
with Model() as self.model:
# TODO: some variables commented out here as they're not working properly
Author

Is this still a TODO or is it outdated?

@larryshamalama
Member

Hi @covertg, sorry for the delay in writing back and thanks for your PR :) I'm currently travelling, but I intend to review your PR within the next few days.

Member

@larryshamalama larryshamalama left a comment

Thanks again for your PR @covertg and for your patience. It sounds like you've put a lot of work into this, especially for a heavily utilized functionality of PyMC that is often not a focal point of development.

Overall, this submodule is due for a heavy makeover, as you can tell and have started. I provided some comments and questions but, overall, a dispatch-based approach for printing is probably going to be the best solution in the long run. Evidence for this is that there are a lot of if/else statements going on; a more general solution would be to implement a printing mechanism for each Op. That way, I don't think we would have to delineate between potentials, deterministics, etc. See some references below. I previously attempted to refactor the whole module in #6447, but I finally decided against it.

The PR is still good. Even if we decide to immediately go for a dispatch-based approach, not all effort will be lost, and I personally think that what you have is a good start. I will need to go over the code again before merging. I'm happy to bounce back some iterations if you have questions or concerns. I can even set aside time if you would like to do some pair programming on this work :)

PR 6447
Discussion 6261
Issue 6311
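The dispatch-based approach described above could be sketched with functools.singledispatch, registering one printer per Op type instead of branching with if/else. The Op classes below are stand-ins for illustration, not real pytensor/pymc Ops:

```python
# Sketch: one registered printer per Op class; unregistered Ops fall back to
# the default. Class names here are hypothetical.
from functools import singledispatch

class Op: ...
class NormalRV(Op): ...
class DimShuffle(Op): ...

@singledispatch
def str_for_op(op: Op) -> str:
    # Default printer for Ops without a registered implementation
    return f"<unknown op {type(op).__name__}>"

@str_for_op.register
def _(op: NormalRV) -> str:
    return "Normal"

@str_for_op.register
def _(op: DimShuffle) -> str:
    return "DimShuffle"

print(str_for_op(NormalRV()))  # Normal
```

Dispatching on the Op type would let new distributions opt into pretty-printing by registering a printer, rather than extending a central chain of conditionals.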

@@ -197,7 +197,7 @@ class SymbolicRandomVariable(OpFromGraph):
"""

_print_name: Tuple[str, str] = ("Unknown", "\\operatorname{Unknown}")
"""Tuple of (name, latex name) used for for pretty-printing variables of this type"""
"Tuple of (name, latex name) used for for pretty-printing variables of this type"
Member

Any specific reason for this change?

Author

Ah, sorry. Refreshing on PEP 257, I've realized triple quotes are probably preferred!

with Model() as self.model:
# TODO: some variables commented out here as they're not working properly
# in v4 yet (9-jul-2021), so doesn't make sense to test str/latex for them

# Priors for unknown model parameters
alpha = Normal("alpha", mu=0, sigma=10)
b = Normal("beta", mu=0, sigma=10, size=(2,), observed=beta)
# TODO why is this observed?
Member

Hmm, I've never noticed this. I also don't think that beta should be observed? It does not seem to affect the test nor its purpose though... Curious indeed

alpha, sigma = 1, 1
beta = [1, 2.5]

alpha0, sigma0, beta0 = 1, 1, [1, 2.5]
Member

maybe alpha_true, sigma_true, beta_true? This notation is also okay

# This is a bit hacky but seems like the best we got
if (
hasattr(var, "str_repr")
and callable(var.str_repr)
Member

if hasattr(var, "str_repr"), it must necessarily be callable, right?
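A quick counterexample to the question above (illustrative only): hasattr alone does not guarantee callability, since an attribute named str_repr could be plain data, so the extra check is defensive.

```python
# A class whose str_repr attribute exists but is not callable
class Weird:
    str_repr = "not a function"

w = Weird()
print(hasattr(w, "str_repr"), callable(w.str_repr))  # True False
```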

@@ -1116,7 +1116,7 @@ class _LKJCholeskyCovBaseRV(RandomVariable):
ndim_supp = 1
ndims_params = [0, 0, 1]
dtype = "floatX"
_print_name = ("_lkjcholeskycovbase", "\\operatorname{_lkjcholeskycovbase}")
_print_name = ("_lkjcholeskycovbase", r"\operatorname{\_lkjcholeskycovbase}")
Member

Why the addition of \ in front of _? All the other random variables follow ("rv_name", "\\operatorname{rv_name}"). Without the \, do you obtain a strange looking string?
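For context, a short illustration of why the leading backslash matters here (illustrative LaTeX, not from the diff):

```latex
% In math mode a bare underscore starts a subscript, and KaTeX raises a
% parse error for text-like commands containing one:
%   broken:  $\operatorname{_lkjcholeskycov}$
% Escaping it renders a literal underscore instead:
%   fixed:   $\operatorname{\_lkjcholeskycov}$
```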

Member

Oops, thinking about this again... the answer is probably in the title of this PR and related to issue #6508

Author

Yep exactly!

@@ -1164,7 +1164,7 @@ def rng_fn(self, rng, n, eta, D, size):
# be safely resized. Because of this, we add the thin SymbolicRandomVariable wrapper
class _LKJCholeskyCovRV(SymbolicRandomVariable):
default_output = 1
_print_name = ("_lkjcholeskycov", "\\operatorname{_lkjcholeskycov}")
_print_name = ("_lkjcholeskycov", r"\operatorname{\_lkjcholeskycov}")
Member

Same comment as above

else:
return f"<{var_type} {var_data.shape}>"
elif _has_owner(var):
if isinstance(var.owner.op, DimShuffle):
Member

I think so. My understanding is that this is to retrieve the original RV that has been resized (or DimShuffled).

warnings.warn(
"The `include_params` argument has been deprecated. Future versions will always include parameter values.",
FutureWarning,
stacklevel=2,
Member

Your point about deprecating include_params seems valid to me. IIRC, when browsing this file a few weeks or months ago, include_params was used to delineate whether variables were printed as arguments to other distributions or not. You seem to have addressed this in _get_varname_distname_args.



def _is_potential_or_deterministic(var: Variable) -> bool:
# This is a bit hacky but seems like the best we got
Member

@larryshamalama larryshamalama Mar 3, 2023

Interesting function 😅. I never thought of checking if a variable is created via Potential or Deterministic this way. If it works, this sounds great but, as you suggest, it does look hackish...

(Previous _is_potential_or_deterministic below for reference)

def _is_potential_or_deterministic(var: Variable) -> bool:
    try:
        return var.str_repr.__func__.func is str_for_potential_or_deterministic
    except AttributeError:
        # in case other code overrides str_repr, fallback
        return False

Author

Yep, I know I know....... 😅
I figured that it's not that far removed from the original implementation, since in the big-picture they're both inspecting whether a var's str_repr is a functools partial that was made in a certain way.

Anyways, a dispatch-based approach or at least requiring a model object would obviate this type of approach

return text


def _is_potential_or_deterministic(var: Variable) -> bool:
Member

@ricardoV94 ricardoV94 Mar 3, 2023

Can we use the model object for this?

if var in model.deterministics or var in model.potentials,

Just pass the model variable along
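The suggestion above could be sketched as follows, assuming a model object is passed along (hypothetical signature; the current pymc code uses functools.partial on str_repr instead):

```python
# Sketch: check membership in the model's registered collections rather than
# inspecting the variable's str_repr attribute. Signature is illustrative only.
def is_potential_or_deterministic(var, model) -> bool:
    return var in model.deterministics or var in model.potentials
```

This would drop the hacky attribute inspection entirely, at the cost of requiring a model object wherever the check is made.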

rv_reprs = [rv.str_repr(formatting=formatting, include_params=include_params) for rv in all_rv]
rv_reprs = [rv_repr for rv_repr in rv_reprs if "TransformedDistribution()" not in rv_repr]

rv_reprs = [rv.str_repr(formatting=formatting, **kwargs) for rv in all_rv]
Member

I know it's not the point of this PR but can we already remove the assumption that model RVs have this str_repr attached to it?

Just call the respective function str_for_dist or str_for_deterministic on the RV

It would fix #6311

Author

Interesting, I hadn't been aware of that issue. I don't think that we can quite yet remove that assumption, essentially because right now the str_repr function (str_for_model_var in this PR, or str_for_dist and str_for_deterministic in mainline) doesn't know how to distinguish between deterministics and potentials.

At the moment that info is inserted using functools.partial whenever the Model registers a new deterministic or potential. But, related to the above comment: currently we don't require that a model object is on the context stack in order to pretty-print a model variable. Requiring a model object would simplify this though

Member

Yes it would be fine to require a model to be on the context for the fancier pretty printing.

We can have a much more limited pretty printing for standalone variables (or none at all)


covertg commented Mar 7, 2023

@larryshamalama of course, thank you for the review! I understand this PR ballooned a good deal and also recognize that I don't have the full big-picture vision that y'all have. My bad for missing your previous work on the dispatch-based approach.

So thanks for the comments. Did my best to follow up on the specific ones that y'all have left. Moving forward -- I'd be very glad to help work on this in whatever way would best serve pymc in the long run. It'd be fun to work on that together! If I may ask, what led you to decide against the work in #6447?

Also a big-picture question for us. Do we want model graphing to remain separate from model printing? This came to me when considering the question (in comments above) of, do we need to pretty-print pymc variables without a model on the context stack. If the focus is on pretty-printing a full model, it seems like graph-visualizing and model-pretty-printing could possibly be unified. But perhaps that is its own big can of worms.

@ricardoV94
Member

Also a big-picture question for us. Do we want model _graph_ing to remain separate from model _print_ing? This came to me when considering the question (in comments above) of, do we need to pretty-print pymc variables without a model on the context stack. If the focus is on pretty-printing a full model, it seems like graph-visualizing and model-pretty-printing could possibly be unified. But perhaps that is its own big can of worms.

The big value of the pretty-printing is in printing the whole Model, not individual variables, so I would say that we don't need to try too hard to pretty-print individual variables.

@ricardoV94 ricardoV94 marked this pull request as draft August 24, 2023 11:28
@ricardoV94
Member

@covertg Any interest in picking up on this again? I did a small fix on #6942 but the underscore thing is still broken.
