practice_lectures/lec22.Rmd

---
title: "Practice Lecture 22 MATH 342W Queens College"
author: "Professor Adam Kapelner"
---


# Missingness

Take a look at an housing dataset from Australia:

https://www.kaggle.com/dansbecker/melbourne-housing-snapshot/home?select=melb_data.csv#


```{r}
rm(list = ls())
pacman::p_load(tidyverse, magrittr, data.table, skimr, R.utils)
apts = fread("melb_data.csv.bz2")
skim(apts)
```

We drop all character variables first just for expedience in the demo. If you were building a prediction model, you would scour them carefully to see if there is any signal in them you can use, and then mathematize them to metrics if so.

```{r}
apts %<>%
  select_if(is.numeric) %>%
  select(Price, everything())
```

Imagine we were trying to predict `Price`. So let's section our dataset:

```{r}
y = apts$Price
X = apts %>% 
  select(-Price)
rm(apts)
```

Let's first create a matrix with $p$ columns that represents missingness

```{r}
M = as_tibble(apply(is.na(X), 2, as.numeric))
colnames(M) = paste("is_missing_", colnames(X), sep = "")
M %<>% 
  select_if(function(x){sum(x) > 0})
head(M)
skim(M)
```

Some of these missing indicators might be collinear because they share all the rows they are missing on. Let's filter those out if they exist:

```{r}
M = as_tibble(t(unique(t(M))))
skim(M)
```

Without imputing and without using missingness as a predictor in its own right, let's see what we get with a basic linear model now:

```{r}
lin_mod_listwise_deletion = lm(y ~ ., X)
summary(lin_mod_listwise_deletion)
```

Not bad ... but this is only on the data that has full records! There are 6,750 observations dropped!

Now let's impute using the package. we cannot fit RF models to the entire dataset (it's 13,580 observations) so we will sample 2,000 observations for each of the trees. This is a typical strategy when fitting RF. It definitely reduces variance but increases bias. But we don't have a choice since we don't want to wait forever. We will see that boosting is faster so it is preferred for large sample sizes.

```{r}
pacman::p_load(missForest)
Ximp = missForest(data.frame(X), sampsize = rep(2000, ncol(X)))$ximp
skim(Ximp)
```


Now we consider our imputed dataset as the design matrix.

```{r}
linear_mod_impute = lm(y ~ ., Ximp)
summary(linear_mod_impute)
```
We can do even better if we use all the information i.e. including the missingness. We take our imputed dataset, combine it with our missingness indicators for a new design matrix.

```{r}
Ximp_and_missing_dummies = data.frame(cbind(Ximp, M))
linear_mod_impute_and_missing_dummies = lm(y ~ ., Ximp_and_missing_dummies)
summary(linear_mod_impute_and_missing_dummies)
```

Not much gain, but it the right thing to do. For those in 343... it checks out nicely:

```{r}
anova(linear_mod_impute, linear_mod_impute_and_missing_dummies)
```


Are these two better models than the original model that was built with listwise deletion of observations with missingness?? 

Are they even comparable? It is hard to compare the two models since the first model was built with only non-missing observations which may be easy to predict on and the second was built with the observations that contained missingness. Those extra 6,750 are likely more difficult to predict on. So we cannot do the comparison!

Maybe one apples-to-apples comparison is you can replace all the missingness in the original dataset with something naive e.g. the average and then see who does better. This at least keeps the same observations.

```{r}
X %<>% mutate(Rooms = as.numeric(Rooms))
Xnaive = X %>%
 replace_na(as.list(colMeans(X, na.rm = TRUE)))
linear_mod_naive_without_missing_dummies = lm(y ~ ., Xnaive)
summary(linear_mod_naive_without_missing_dummies)
```

There is a clear gain to imputing and using is_missing dummy features to reduce delta (55.3% vs 52.4% Rsqs).

Note: this is just an illustration of best practice. It didn't necessarily have to "work".


# Using Probability Estimation to do Classification

First repeat quickly (a) load the adult data (b) do a training / test split and (c) build the logisitc model.

```{r}
rm(list = ls())
pacman::p_load_gh("coatless/ucidata")
data(adult)
adult = na.omit(adult) #kill any observations with missingness

set.seed(1)
train_size = 5000
train_indices = sample(1 : nrow(adult), train_size)
adult_train = adult[train_indices, ]
y_train = adult_train$income
X_train = adult_train
X_train$income = NULL

test_size = 5000
test_indices = sample(setdiff(1 : nrow(adult), train_indices), test_size)
adult_test = adult[test_indices, ]
y_test = adult_test$income
X_test = adult_test
X_test$income = NULL

logistic_mod = glm(income ~ ., adult_train, family = "binomial")
p_hats_train = predict(logistic_mod, adult_train, type = "response")
p_hats_test = predict(logistic_mod, adult_test, type = "response")
```

Let's establish a rule: if the probability estimate is greater than or equal to 50%, let's classify the observation as positive, otherwise 0.

```{r}
y_hats_train = factor(ifelse(p_hats_train >= 0.5, ">50K", "<=50K"))
```

How did this "classifier" do in-sample?

```{r}
mean(y_hats_train != y_train)
in_sample_conf_table = table(y_train, y_hats_train)
in_sample_conf_table
```

And the performance stats:

```{r}
n = sum(in_sample_conf_table)
fp = in_sample_conf_table[1, 2]
fn = in_sample_conf_table[2, 1]
tp = in_sample_conf_table[2, 2]
tn = in_sample_conf_table[1, 1]
num_pred_pos = sum(in_sample_conf_table[, 2])
num_pred_neg = sum(in_sample_conf_table[, 1])
num_pos = sum(in_sample_conf_table[2, ])
num_neg = sum(in_sample_conf_table[1, ])
precision = tp / num_pred_pos
cat("precision", round(precision * 100, 2), "%\n")
recall = tp / num_pos
cat("recall", round(recall * 100, 2), "%\n")
false_discovery_rate = 1 - precision
cat("false_discovery_rate", round(false_discovery_rate * 100, 2), "%\n")
false_omission_rate = fn / num_pred_neg
cat("false_omission_rate", round(false_omission_rate * 100, 2), "%\n")
```

That was in-sample which may be overfit. Howe about oos?

```{r}
y_hats_test = factor(ifelse(p_hats_test >= 0.5, ">50K", "<=50K"))
mean(y_hats_test != y_test)
oos_conf_table = table(y_test, y_hats_test)
oos_conf_table
```

A tad bit worse. Here are estimates of the future performance for each class:

```{r}
n = sum(oos_conf_table)
fp = oos_conf_table[1, 2]
fn = oos_conf_table[2, 1]
tp = oos_conf_table[2, 2]
tn = oos_conf_table[1, 1]
num_pred_pos = sum(oos_conf_table[, 2])
num_pred_neg = sum(oos_conf_table[, 1])
num_pos = sum(oos_conf_table[2, ])
num_neg = sum(oos_conf_table[1, ])
precision = tp / num_pred_pos
cat("precision", round(precision * 100, 2), "%\n")
recall = tp / num_pos
cat("recall", round(recall * 100, 2), "%\n")
false_discovery_rate = 1 - precision
cat("false_discovery_rate", round(false_discovery_rate * 100, 2), "%\n")
false_omission_rate = fn / num_pred_neg
cat("false_omission_rate", round(false_omission_rate * 100, 2), "%\n")
```

Worse than in-sample (which was expected). But still could be "good enough" depending on your definition of "good enough".

However... this whole classifier hinged on the decision of the prob-threshold = 50%! What if we change this default threshold??

# Asymmetric Cost Classifiers

Let's establish a *new* rule: if the probability estimate is greater than or equal to 90%, let's classify the observation as positive, otherwise 0.

```{r}
y_hats_test = factor(ifelse(p_hats_test >= 0.9, ">50K", "<=50K"))
mean(y_hats_test != y_test)
oos_conf_table = table(y_test, y_hats_test)
oos_conf_table
```

Of course the misclassification error went up! But now look at the confusion table! The second column represents all $\hat{y} = 1$ and there's not too many of them! Why? You've made it *much* harder to classify something as positive. Here's the new additional performance metrics now:

```{r}
n = sum(oos_conf_table)
fp = oos_conf_table[1, 2]
fn = oos_conf_table[2, 1]
tp = oos_conf_table[2, 2]
tn = oos_conf_table[1, 1]
num_pred_pos = sum(oos_conf_table[, 2])
num_pred_neg = sum(oos_conf_table[, 1])
num_pos = sum(oos_conf_table[2, ])
num_neg = sum(oos_conf_table[1, ])
precision = tp / num_pred_pos
cat("precision", round(precision * 100, 2), "%\n")
recall = tp / num_pos
cat("recall", round(recall * 100, 2), "%\n")
false_discovery_rate = 1 - precision
cat("false_discovery_rate", round(false_discovery_rate * 100, 2), "%\n")
false_omission_rate = fn / num_pred_neg
cat("false_omission_rate", round(false_omission_rate * 100, 2), "%\n")
```

We don't make many false discoveries but we make a lot of false omissions! It's a tradeoff...


# Receiver-Operator Curve Plot

The entire classifier is indexed by that indicator function probability threshold which creates the classification decision. Why not see look at the entire range of possible classification models. We do this with a function. We will go through it slowly and explain each piece:

```{r}
#' Computes performance metrics for a binary probabilistic classifer
#'
#' Each row of the result will represent one of the many models and its elements record the performance of that model so we can (1) pick a "best" model at the end and (2) overall understand the performance of the probability estimates a la the Brier scores, etc.
#'
#' @param p_hats  The probability estimates for n predictions
#' @param y_true  The true observed responses
#' @param res     The resolution to use for the grid of threshold values (defaults to 1e-3)
#'
#' @return        The matrix of all performance results
compute_metrics_prob_classifier = function(p_hats, y_true, res = 0.001){
  #we first make the grid of all prob thresholds
  p_thresholds = seq(0 + res, 1 - res, by = res) #values of 0 or 1 are trivial
  
  #now we create a matrix which will house all of our results
  performance_metrics = matrix(NA, nrow = length(p_thresholds), ncol = 12)
  colnames(performance_metrics) = c(
    "p_th",
    "TN",
    "FP",
    "FN",
    "TP",
    "miscl_err",
    "precision",
    "recall",
    "FDR",
    "FPR",
    "FOR",
    "miss_rate"
  )
  
  #now we iterate through each p_th and calculate all metrics about the classifier and save
  n = length(y_true)
  for (i in 1 : length(p_thresholds)){
    p_th = p_thresholds[i]
    y_hats = factor(ifelse(p_hats >= p_th, ">50K", "<=50K"))
    confusion_table = table(
      factor(y_true, levels = c("<=50K", ">50K")),
      factor(y_hats, levels = c("<=50K", ">50K"))
    )
      
    fp = confusion_table[1, 2]
    fn = confusion_table[2, 1]
    tp = confusion_table[2, 2]
    tn = confusion_table[1, 1]
    npp = sum(confusion_table[, 2])
    npn = sum(confusion_table[, 1])
    np = sum(confusion_table[2, ])
    nn = sum(confusion_table[1, ])
  
    performance_metrics[i, ] = c(
      p_th,
      tn,
      fp,
      fn,
      tp,
      (fp + fn) / n,
      tp / npp, #precision
      tp / np,  #recall
      fp / npp, #false discovery rate (FDR)
      fp / nn,  #false positive rate (FPR)
      fn / npn, #false omission rate (FOR)
      fn / np   #miss rate
    )
  }
  
  #finally return the matrix
  performance_metrics
}
```

Now let's generate performance results for the in-sample data:

```{r}
pacman::p_load(data.table, magrittr)
performance_metrics_in_sample = compute_metrics_prob_classifier(p_hats_train, y_train) %>% data.table
performance_metrics_in_sample
```

Now let's plot the ROC curve

```{r}
pacman::p_load(ggplot2)
ggplot(performance_metrics_in_sample) +
  geom_line(aes(x = FPR, y = recall)) +
  geom_abline(intercept = 0, slope = 1, col = "orange") + 
  coord_fixed() + xlim(0, 1) + ylim(0, 1)
```

Now calculate the area under the curve (AUC) which is used to evaluate the probabilistic classifier (just like the Brier score) using a trapezoid area function. 

```{r}
pacman::p_load(pracma)
-trapz(performance_metrics_in_sample$FPR, performance_metrics_in_sample$recall)
```

This is not bad at all!

Note that I should add $<0, 0>$ and $<1, 1>$ as points before this is done but I didn't...

How do we do out of sample?


```{r}
performance_metrics_oos = compute_metrics_prob_classifier(p_hats_test, y_test) %>% data.table
performance_metrics_oos
```

And graph the ROC:


```{r}
#first we do our own melting of two data frames together to make it long format
performance_metrics_in_and_oos = rbind(
    cbind(performance_metrics_in_sample, data.table(sample = "in")),
    cbind(performance_metrics_oos, data.table(sample = "out"))
)
ggplot(performance_metrics_in_and_oos) +
  geom_line(aes(x = FPR, y = recall, col = sample)) +
  geom_abline(intercept = 0, slope = 1, col = "orange") + 
  coord_fixed() + xlim(0, 1) + ylim(0, 1)
```


```{r}
-trapz(performance_metrics_oos$FPR, performance_metrics_oos$recall)
```

Not bad at all - only a tad worse! In the real world it's usually a lot worse. We are lucky we have n = 5,000 in both a train and test set.

# Detection Error Tradeoff curve

```{r}
table(y_test) / length(y_test)
table(y_train) / length(y_train)
ggplot(performance_metrics_in_and_oos) +
  geom_line(aes(x = FDR, y = miss_rate, col = sample)) +
  coord_fixed() + xlim(0, 1) + ylim(0, 1)
```

What is the interpretation of this plot?

#Using AUC to Compare Probabilistic Classification Models

What would the effect be of less information on the same traing set size? Imagine we didn't know the features: occupation, education, education_num, relationship, marital_status. How would we do relative to the above? Worse!

```{r}
pacman::p_load(data.table, tidyverse, magrittr)
if (!pacman::p_isinstalled(ucidata)){
  pacman::p_load_gh("coatless/ucidata")
} else {
  pacman::p_load(ucidata)
}

data(adult)
adult = na.omit(adult) #kill any observations with missingness

set.seed(1)
train_size = 5000
train_indices = sample(1 : nrow(adult), train_size)
adult_train = adult[train_indices, ]
y_train = adult_train$income
X_train = adult_train
X_train$income = NULL

test_size = 5000
test_indices = sample(setdiff(1 : nrow(adult), train_indices), test_size)
adult_test = adult[test_indices, ]
y_test = adult_test$income
X_test = adult_test
X_test$income = NULL

logistic_mod_full = glm(income ~ ., adult_train, family = "binomial")
p_hats_test = predict(logistic_mod_full, adult_test, type = "response")

performance_metrics_oos_full_mod = compute_metrics_prob_classifier(p_hats_test, y_test) %>% data.table

logistic_mod_red = glm(income ~ . - occupation - education - education_num - relationship - marital_status, adult_train, family = "binomial")
p_hats_test = predict(logistic_mod_red, adult_test, type = "response")

performance_metrics_oos_reduced_mod = compute_metrics_prob_classifier(p_hats_test, y_test) %>% data.table


ggplot(rbind(
  performance_metrics_oos_full_mod[, model := "full"],
  performance_metrics_oos_reduced_mod[, model := "reduced"]
)) +
  geom_point(aes(x = FPR, y = recall, shape = model, col = p_th), size = 1) +
  geom_abline(intercept = 0, slope = 1) + 
  coord_fixed() + xlim(0, 1) + ylim(0, 1) + 
  scale_colour_gradientn(colours = rainbow(5))
```

and we can see clearly that the AUC is worse. This means that the full model dominates the reduced model for every FPR / TPR pair.

```{r}
pacman::p_load(pracma)
-trapz(performance_metrics_oos_reduced_mod$FPR, performance_metrics_oos_reduced_mod$recall)
-trapz(performance_metrics_oos_full_mod$FPR, performance_metrics_oos_full_mod$recall)
```

As we lose information that is related to the true causal inputs, we lose predictive ability. Same story for this entire data science class since error due to ignorance increases! And certainly no different in probabilistic classifiers.

Here's the same story with the DET curve:

```{r}
ggplot(rbind(
  performance_metrics_oos_full_mod[, model := "full"],
  performance_metrics_oos_reduced_mod[, model := "reduced"]
)) +
  geom_point(aes(x = FDR, y = miss_rate, shape = model, col = p_th), size = 1) +
  coord_fixed() + xlim(0, 1) + ylim(0, 1) + 
  scale_colour_gradientn(colours = rainbow(5))
```


# Choosing a Decision Threshold Based on Asymmetric Costs and Rewards

The ROC and DET curves gave you a glimpse into all the possible classification models derived from a probability estimation model. Each point on that curve is a separate $g(x)$ with its own performance metrics. How do you pick one?

Let's create rewards and costs. Imagine we are trying to predict income because we want to sell people an expensive item e.g. a car. We want to advertise our cars via a nice packet in the mail. The packet costs \$5. If we send a packet to someone who really does make $>50K$/yr then we are expected to make \$1000. So we have rewards and costs below:

```{r}
r_tp = 0
c_fp = -5
c_fn = -1000
r_tn = 0
```

Let's return to the linear logistic model with all features. Let's calculate the overall oos average reward per observation (per person) for each possible $p_{th}$:

```{r}
n = nrow(adult_test)
performance_metrics_oos_full_mod$avg_cost = 
  (r_tp * performance_metrics_oos_full_mod$TP +
  c_fp * performance_metrics_oos_full_mod$FP +
  c_fn * performance_metrics_oos_full_mod$FN +
  r_tn * performance_metrics_oos_full_mod$TN) / n
```

Let's plot average reward (reward per person) by threshold:

```{r}
ggplot(performance_metrics_oos_full_mod) +
  geom_line(aes(x = p_th, y = avg_cost)) #+
  # xlim(0, 0.05) + ylim(-5,0)
```

Obviously, the best decision is $p_{th} \approx 0$ which means you classifiy almost everything as a positive. This makes sense because the mailing is so cheap. What are the performance characteristics of the optimal model?

```{r}
i_star = which.max(performance_metrics_oos_full_mod$avg_cost)
performance_metrics_oos_full_mod[i_star, ]

# performance_metrics_oos_full_mod[, .(p_th, avg_cost)]
```

The more interesting problem is where the cost of advertising is higher:

```{r}
r_tp = 0
c_fp = -200
c_fn = -1000
r_tn = 0
performance_metrics_oos_full_mod$avg_cost = 
  (r_tp * performance_metrics_oos_full_mod$TP +
  c_fp * performance_metrics_oos_full_mod$FP +
  c_fn * performance_metrics_oos_full_mod$FN +
  r_tn * performance_metrics_oos_full_mod$TN) / n
ggplot(performance_metrics_oos_full_mod) +
  geom_point(aes(x = p_th, y = avg_cost), lwd = 0.01)
```

What are the performance characteristics of the optimal model?

```{r}
i_star = which.max(performance_metrics_oos_full_mod$avg_cost)
performance_metrics_oos_full_mod[i_star, ]
```

If $g_{pr}$ is closer to $f_{pr}$, what happens? 

All the threshold-derived classification models get better and you are guaranteed to make more money since you have a better discriminating eye.


# Asymmetric Cost Models in Trees and RF

There is also a way to make asymmetric cost models with trees. Let's load up the adult dataset where the response is 1 if the person makes more than $50K per year and 0 if they make less than $50K per year.

```{r}
rm(list = ls())
options(java.parameters = "-Xmx4000m")
pacman::p_load(YARF)
pacman::p_load_gh("coatless/ucidata")
data(adult)
adult %<>% 
  na.omit #kill any observations with missingness
```

Let's use samples of 2,000 to run experiments:

```{r}
train_size = 2000
train_indices = sample(1 : nrow(adult), train_size)
adult_train = adult[train_indices, ]
y_train = adult_train$income
X_train = adult_train %>% select(-income)
test_indices = setdiff(1 : nrow(adult), train_indices)
adult_test = adult[test_indices, ]
y_test = adult_test$income
X_test = adult_test %>% select(-income)
```

What does the $y$'s look like?

```{r}
table(y_train)
```

Very imbalanced. This would off-the-bat make y=0 the default.

Now make a regular RF and look at the oob confusion table and FDR and FOR:

```{r}
num_trees = 500
yarf_mod = YARF(X_train, y_train, num_trees = num_trees, calculate_oob_error = FALSE)
y_hat_test = predict(yarf_mod, X_test)
oos_confusion = table(y_test, y_hat_test)
oos_confusion
cat("FDR =", oos_confusion[1, 2] / sum(oos_confusion[, 2]), "\n")
cat("FOR =", oos_confusion[2, 1] / sum(oos_confusion[, 1]), "\n")
```

High FDR rate and low FOR rate. Let's try to change this and reduce the FDR by oversampling 0's.

```{r}
idx_0 = which(y_train == "<=50K")
n_0 = length(idx_0)
idx_1 = which(y_train == ">50K")
n_1 = length(idx_1)

bootstrap_indices = list()
for (m in 1 : num_trees){
  bootstrap_indices[[m]] = c( #note n_0' + n_1' doesn't equal n. You can make it so with one more line of code...
    sample(idx_0, round(2.0 * n_0), replace = TRUE),
    sample(idx_1, round(0.5 * n_1), replace = TRUE)
  )
}
yarf_mod_asymmetric = YARF(X_train, y_train, bootstrap_indices = bootstrap_indices, calculate_oob_error = FALSE)
y_hat_test = predict(yarf_mod_asymmetric, X_test)
oos_confusion = table(y_test, y_hat_test)
oos_confusion
cat("FDR =", oos_confusion[1, 2] / sum(oos_confusion[, 2]), "\n")
cat("FOR =", oos_confusion[2, 1] / sum(oos_confusion[, 1]), "\n")
```

You can even vary the sampling and trace out ROC / DET curves. See function `YARFROC`.