Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify documentation for matplotlib datasets #2536

Open
astrojuanlu opened this issue Apr 25, 2023 · 4 comments
Open

Clarify documentation for matplotlib datasets #2536

astrojuanlu opened this issue Apr 25, 2023 · 4 comments
Labels
Component: Documentation 📄 Issue/PR for markdown and API documentation

Comments

@astrojuanlu
Copy link
Member

Description

The documentation to visualize plots with matplotlib contains this snippet:

import matplotlib.pyplot as plt
import seaborn as sn

...


def create_confusion_matrix(companies: pd.DataFrame):
    actuals = [0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1]
    predicted = [1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1]
    data = {"y_Actual": actuals, "y_Predicted": predicted}
    df = pd.DataFrame(data, columns=["y_Actual", "y_Predicted"])
    confusion_matrix = pd.crosstab(
        df["y_Actual"], df["y_Predicted"], rownames=["Actual"], colnames=["Predicted"]
    )
    sn.heatmap(confusion_matrix, annot=True)
    return plt

It's very puzzling that the node returns the matplotlib.pyplot state machine through the plt variable. However, it works!

The MatplotlibWriter documentation, on the other hand, hints that the node should return the matplotlib.figure.Figure, which also worked an feels way more natural.

If returning pyplot is expected, I think we should make it more clear in both places of the docs. If we want to discourage it (yes please), then we should tweak the example in the narrative docs so it returns a Figure instead.

Related: kedro-org/kedro-plugins#529.

Context

The problem with the documentation of datasets is that the actual logic lives in _load and _save methods that are private and therefore not documented. I recall there are generic issues to "redesign the catalog" #1778 #1981 but I don't know if they are about this.

@astrojuanlu astrojuanlu added the Component: Documentation 📄 Issue/PR for markdown and API documentation label Apr 25, 2023
@antonymilne
Copy link
Contributor

I'm surprised that this works, although looking at the matplotlib.pyplot source makes it clear why it does:

def savefig(*args, **kwargs):
    fig = gcf()
    res = fig.savefig(*args, **kwargs)
    ...

Either way, this is definitely not the intended use of the dataset and I agree we should not encourage it. Let's change the docs to return the Figure object instead.

As for documentation of datasets, you're right that this is one disadvantage of the current approach. To get around the problem at the moment we just put all the docs in the class docstring, which is not great but it's ok I think. As in the github issues you mention, IMO there's more fundamental problems with AbstractDataSet and the like, but changing from the template method pattern would be a big change (see also #2409 (comment)). So if we eventually do that, getting the better documentation would just be a nice side-benefit.

@astrojuanlu
Copy link
Member Author

To get around the problem at the moment we just put all the docs in the class docstring, which is not great but it's ok I think.

Yep... The story for multi-repository docs is not impossible, but requires some focus: #2072 (comment)

@astrojuanlu
Copy link
Member Author

xref kedro-org/kedro-plugins#353 and a conversation we had in Slack: https://linen-slack.kedro.org/t/15875966/i-have-a-plot-saved-as-an-image-in-my-catalog-yml-and-the-fi#6d98204f-5d9f-497b-88df-80a23980dcfa

yeah this one is tricky, because I would expect a MatplotlibDataset to save/load figures, rather than the raw PNG files
but proper serialization of figures has been an ongoing theme in matplotlib for 14+ years
these days you can pickle the figure, at least

The fact that some datasets are meant to be write-only (hence "artifacts") is in line with the discussions we're having in #1936 (comment)

@astrojuanlu
Copy link
Member Author

This should not be tackled before kedro-org/kedro-plugins#353

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Documentation 📄 Issue/PR for markdown and API documentation
Projects
Status: No status
Development

No branches or pull requests

3 participants