Additional comments for #289 #291

Open
dzeber opened this issue Sep 22, 2022 · 0 comments

dzeber commented Sep 22, 2022

@alberginia Here are some additional comments or nice-to-have fixes we can do for #289.

  • Remove the ML_copies prefix from the names of the example notebooks, now that
    they're in their own folder
  • Remove the sys.path.append("../..") cells from the notebooks. They should not be
    necessary if the environment is set up as described in the README.
  • The projected class density plots can be difficult to read. Could we add a
    sentence describing how to read the plots and what exactly is plotted in each panel?

Possible modification to the class density plots:

IIUC the class densities are currently conveying two types of info: the density of where the sampled points fall in the feature space (not informative), and the relative prevalence of the classes. For example, in the wine dataset notebook, the SVM predicts everything as class 0, so the plot (cell 16) shows densities that are an artifact of sampling. If the main goal of the plot is to compare class distributions, in the case of 2 classes, it might be better to have a 2d heatmap plot where regions of the feature space are more blue if more samples are assigned class 1 and more red if more samples are assigned class 0. With this approach, the plot for the wine dataset SVM would be solid red in every panel. Also, if you use this approach for the multivariable_density_comparison plots, in the 2-class case you can combine them into a single plot.

With this approach, the histograms at the diagonals could be turned into stacked barplots showing the proportions of each univariate bin that belong to each class (something like this: https://stackoverflow.com/a/41266416). That could work for the multiclass case as well.
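To make the stacked barplot idea concrete, here is a minimal sketch on synthetic data (not the wine dataset). The column names, the 3-class label, and the helper `plot_class_proportions` are assumptions made up for illustration; only `pd.cut`, `pd.crosstab` and `DataFrame.plot` are real pandas calls.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; the column names are hypothetical.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "alcohol": rng.uniform(8, 14, size=500),
    "recommend": rng.integers(0, 3, size=500),  # 3 classes: 0, 1, 2
})

# Bin the feature into 10 equal-width univariate bins.
df["alcohol_bin"] = pd.cut(df["alcohol"], bins=10)

# Rows are bins, columns are classes. normalize="index" turns the counts
# into the share of each bin's samples that belongs to each class.
proportions = pd.crosstab(df["alcohol_bin"], df["recommend"], normalize="index")


def plot_class_proportions(table, ax=None):
    """Hypothetical helper: draw one stacked bar per bin (needs matplotlib)."""
    ax = table.plot(kind="bar", stacked=True, width=1.0, ax=ax)
    ax.set_ylabel("share of samples in bin")
    return ax
```

Calling `plot_class_proportions(proportions)` draws the bars. Because each row sums to 1, every bar has the same height, so the panel shows only the class mix in each bin and not how densely the bin was sampled.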

I tried this out with one pair of features in the wine dataset notebook. For this approach I think we have to do the binning manually.

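import pandas as pd
import seaborn as sns

# Work on a copy so the notebook's dataframe is not modified in place.
d = visualization_original.df.copy()
# Bin the 2 features into 10 equal-width bins (labels=False returns integer bin indices).
d["fb"] = pd.cut(d["fixed acidity"], bins=10, labels=False)
# (Minus sign flips the order of the values to make the plot match the original)
d["ab"] = pd.cut(-d["alcohol"], bins=10, labels=False)
# .copy() avoids a SettingWithCopyWarning on the assignment below.
dd = d[["fb", "ab", "recommend"]].copy()
# For some reason, recommend is categorical?
dd["recommend"] = dd["recommend"].astype(int)
# Convert to 2d-bin format (10 x 10 table). Each cell is the mean of recommend
# in that bin, i.e. the share of class 1 if the labels are 0/1.
dp = dd.pivot_table("recommend", index="ab", columns="fb", aggfunc="mean")

sns.heatmap(dp, cmap="seismic_r")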

[Screenshot: heatmap produced by the code above]

Original:
[Screenshot: the original class density plot]
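One detail that might matter for the "solid red" case: `sns.heatmap` picks its color limits from the data by default, so panels would not share a scale. The sketch below pins the scale to [0, 1]. It uses synthetic data, and the function names `class_share_table` and `plot_class_share` are hypothetical helpers, not part of the project.

```python
import numpy as np
import pandas as pd


def class_share_table(df, x, y, label, bins=10):
    """Mean of a 0/1 `label` column in each 2-D bin of features `x` and `y`."""
    binned = pd.DataFrame({
        "xb": pd.cut(df[x], bins=bins, labels=False),
        "yb": pd.cut(df[y], bins=bins, labels=False),
        "label": df[label].astype(int),
    })
    return binned.pivot_table("label", index="yb", columns="xb", aggfunc="mean")


def plot_class_share(table, ax=None):
    """Draw the table with a fixed 0..1 color scale (needs seaborn)."""
    import seaborn as sns
    # With seismic_r, 0 maps to red and 1 maps to blue.
    return sns.heatmap(table, cmap="seismic_r", vmin=0, vmax=1, ax=ax)


# Synthetic stand-in for a classifier that predicts class 0 everywhere.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "fixed acidity": rng.uniform(4, 16, size=1000),
    "alcohol": rng.uniform(8, 15, size=1000),
})
demo["recommend"] = 0
table = class_share_table(demo, "fixed acidity", "alcohol", "recommend")
```

With `vmin=0, vmax=1`, `plot_class_share(table)` colors every populated cell with the class-0 end of the colormap, and the same color means the same class share in every panel. The sketch does not negate the y feature as the snippet above does; calling `invert_yaxis()` on the returned axes is one way to flip the order instead.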
