Additional comments for #289 #291

dzeber · 2022-09-22T00:33:55Z

@alberginia Here are some additional comments or nice-to-have fixes we can do for #289.

Remove the ML_copies prefix from the names of the example notebooks, now that they're in their own folder.
Remove sys.path.append("../..") cells from notebooks. This should not be necessary if the environment is set up as described in the README.
The projected class density plots can be difficult to read. Could we add a sentence describing how to read the plots and what exactly is plotted in the panels?

Possible modification to the class density plots:

IIUC the class densities are currently conveying two types of info: the density of where the sampled points fall in the feature space (not informative), and the relative prevalence of the classes. For example, in the wine dataset notebook, the SVM predicts everything as class 0, so the plot (cell 16) shows densities that are an artifact of sampling. If the main goal of the plot is to compare class distributions, in the case of 2 classes, it might be better to have a 2d heatmap plot where regions of the feature space are more blue if more samples are assigned class 1 and more red if more samples are assigned class 0. With this approach, the plot for the wine dataset SVM would be solid red in every panel. Also, if you use this approach for the multivariable_density_comparison plots, in the 2-class case you can combine them into a single plot.

With this approach, the histograms at the diagonals could be turned into stacked barplots showing the proportions of each univariate bin that belong to each class (something like this: https://stackoverflow.com/a/41266416). That could work for the multiclass case as well.
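To make the stacked barplot idea concrete, here is a minimal sketch of how the per-bin class proportions could be computed. It runs on synthetic stand-in data rather than the wine dataset, and the helper name class_proportions_by_bin is hypothetical (it is not part of the library). The column names are only borrowed from the notebook for readability.

```python
import numpy as np
import pandas as pd


def class_proportions_by_bin(df, feature, label, bins=10):
    """Return a (bin x class) table: share of each class within each bin of `feature`.

    Hypothetical helper for illustration only.
    """
    # Equal-width bins; labels=False yields integer bin indices 0..bins-1.
    binned = pd.cut(df[feature], bins=bins, labels=False)
    # normalize="index" rescales each row (bin) so its class shares sum to 1.
    # Bins that contain no samples do not appear in the result.
    return pd.crosstab(binned, df[label], normalize="index")


# Synthetic stand-in data (an assumption, not the wine dataset).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.0, size=500),
    # Three classes, to show that the same table works in the multiclass case.
    "recommend": rng.integers(0, 3, size=500),
})

props = class_proportions_by_bin(demo, "alcohol", "recommend")
```

Calling props.plot.bar(stacked=True) on the result should then draw one bar per bin, split by class share, which is roughly what the linked Stack Overflow answer shows.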

I tried this out with one pair of features in the wine dataset notebook. For this approach I think we have to do the binning manually.

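import pandas as pd
import seaborn as sns

# Work on a copy so the notebook's dataframe does not pick up the helper columns.
d = visualization_original.df.copy()
# Bin the 2 features into 10 equal-width bins each.
d["fb"] = pd.cut(d["fixed acidity"], bins=10, labels=False)
# (Minus sign flips the order of the values to make the plot match the original)
d["ab"] = pd.cut(-d["alcohol"], bins=10, labels=False)
# Copy the column subset to avoid pandas' SettingWithCopyWarning on the assignment below.
dd = d[["fb", "ab", "recommend"]].copy()
# For some reason, recommend is categorical? Convert to int so it can be averaged.
dd["recommend"] = dd["recommend"].astype(int)
# Convert to 2d-bin format (10 x 10 table); each cell holds the mean of recommend,
# i.e. the share of samples in that cell assigned class 1.
dp = dd.pivot_table("recommend", index="ab", columns="fb")

# seismic_r maps low values (class 0) to red and high values (class 1) to blue.
sns.heatmap(dp, cmap="seismic_r")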
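If we went this way, the manual binning could be wrapped in a small helper so it can be reused for every pair of features. The following is only a sketch under assumptions: class_proportion_grid is a hypothetical name, the data is synthetic (not the wine dataset), and the all_zero column just imitates a model that predicts class 0 everywhere, like the SVM in the notebook.

```python
import numpy as np
import pandas as pd


def class_proportion_grid(df, x, y, label, bins=10):
    """Mean of a 0/1 `label` in each cell of a bins x bins grid over features x and y.

    Hypothetical helper for illustration only.
    """
    binned = pd.DataFrame({
        "xb": pd.cut(df[x], bins=bins, labels=False),
        # Negate y so that larger values land in the top rows of the table.
        "yb": pd.cut(-df[y], bins=bins, labels=False),
        "label": df[label].astype(int),
    })
    # Default aggregation is the mean; cells without samples come out as NaN.
    return binned.pivot_table(values="label", index="yb", columns="xb")


# Synthetic stand-in data (an assumption, not the wine dataset).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "fixed acidity": rng.normal(7.0, 1.0, size=1000),
    "alcohol": rng.normal(10.5, 1.0, size=1000),
})
demo["all_zero"] = 0  # imitates a model that assigns class 0 to every sample
demo["split"] = (demo["alcohol"] > 10.5).astype(int)  # class depends on alcohol only

grid_zero = class_proportion_grid(demo, "fixed acidity", "alcohol", "all_zero")
grid_split = class_proportion_grid(demo, "fixed acidity", "alcohol", "split")
```

When plotting, it is probably worth pinning the color scale, e.g. sns.heatmap(grid_zero, cmap="seismic_r", vmin=0, vmax=1). Otherwise the colors are scaled to the range of each table, and the all-class-0 case would not be guaranteed to come out as solid red.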

Original: [figure: the original class density plot]
