simulated_zinb_get_priors.Rmd

---
title: "ZeroInflatedPoisson"
author: "Anders Sundelin"
date: "2022-10-18"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(dplyr)
library(boot)
```

```{r}
CHAINS <- 3
CORES <- 2
THREADS <- 2
```


# Simulating Issue Introduction and Removal with Zero-inflated Negative Binomial and brms

The goal of this Rmd is to arrive at reasonably sane, scientifically grounded priors.

In order to do so, we will simulate data that conceptually "look like" the data we see "in the wild", but have a known source distribution (specified by us).
This is done just to assess the quality of the priors, and does not imply that the true data-generating process obeys the same formula.
We merely want to assess the strength of the priors, in order to not make them too strong, as that would prevent the posterior from dominating the priors.

## Inventing Data Points

We know that our data for both ADD, DUP and COMPLEX are very right-skewed. After experimenting with exponential distributions, we instead settled for calculating lognormal distribution metrics instead.

Shape of the Introduced issues:
```
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
  0.0000   0.0000   0.0000   0.2331   0.0000 150.0000 
sd: 2.281958
Prop.zeros: 0.9341762
```

Shape of Removed issues:
```
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.000   0.000   0.000   0.151   0.000 112.000 
sd 1.841085
Prop.zeros: 0.9563002
```
Shape of the added lines:
```
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    1.00    6.00   31.76   28.00 3772.00 
sd: 88.80997
Prop.zeros: 0.1543845
```

Shape of the complexity score:
```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    3.00   16.00   58.83   52.00 1244.00 
sd 128.2378
Prop.zeros: 0.1182959
```

Shape of the duplicated blocks (over 50% of the duplicated blocks are 0, 75% are 1 or 2:
```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   0.000   7.085   2.000 664.000 
sd 33.09433
Prop.zeros: 0.6055084
```

Shape of removed lines: (lower mean/median/quantiles than added, but in general the same shape):
```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    0.00    2.00   20.69   12.00 3413.00 
sd 78.87213
Prop.zeros: 0.257587
```


```{r}
# input the requested mean and sd, output is the mu of the rlnorm
rlnorm_mu <- function(mean, sd) {
  log(mean^2/sqrt(mean^2+sd^2))
}
rlnorm_sd <- function(mean, sd) {
  sqrt(log(1+sd^2/mean^2))
}

mu_added <- rlnorm_mu(31.76, 88.80997)
sd_added <- rlnorm_sd(31.76, 88.80997)
```

```{r}
mu_complex <- rlnorm_mu(58.83, 128.2378)
sd_complex <- rlnorm_sd(58.83, 128.2378)
# in particular for dup, we have many more zeros than what we expect. Squeeze down the mu, while keeping the sd, to get more zeros
mu_dup <- rlnorm_mu(log(20.69^2), 78.87213)
sd_dup <- rlnorm_sd(log(20.69^2), 78.87213)
```


```{r}
N <- 3500
set.seed(20230104)

ADD <- rlnorm(N, meanlog = mu_added, sdlog=sd_added)
COMPLEX <- rlnorm(N, meanlog = mu_complex, sdlog=sd_complex)
DUP <- rlnorm(N, meanlog = mu_dup, sdlog=sd_dup)
```

The main thing to look for is the maximum value (is it reasonable, within an order of magnitude), number of zeros/extremely low values, and the shape in between. We want to have a sharp right-skewed distribution, as that is how our own data looks like.

All three of our predictors have a similar shape, so we use the same formula to simulate them, though all use unique lognormal distributions.

```{r}
data.frame(add=DUP) |> ggplot(aes(x=add)) + geom_histogram(binwidth = 1) + ggtitle("Simulated DUP data")
```

The overall shape looks similar to our original dataset.
As we are looking for integer counts, it seems plausible we could truncate the logarithms to get to the integer part. This will allow us to reach the value `0`, which features a lot in our data.

```{r}
ADD <- trunc(ADD)
COMPLEX <- trunc(COMPLEX)
DUP <- trunc(DUP)
```


### Proportion of zeros in the predictor data

Proportion of zeros in the resulting data set:

```{r}
z <- data.frame(a=DUP) |> filter(a == 0) |> tally()
z/length(DUP)
```

So, our ADD only has 4.5% zeros in itself. Our model has 15%
Our COMPLEX is an even worse fit, it has 1.4%, versus the real 11.8%
Our original DUP had 16.2%, whereas the real has 60%. If we double the sd, we will at least get 31.1% zeros. Not enough, but still something...
Clearly, the DUP statistics does not follow a proper lognormal distribution, something else is influencing the large number of zeros.

All of our data underestimate the number of zeros in our simulated model. 

If we squeeze down the mean of DUP by taking the logarithm of the mean, we can get 80% zeros there, more than we have in the existing data.
Squaring the mean, and taking the logarithm, means we again got down to 62% zeros after truncationg. We settle for that solution for now.

```{r}
data.frame(add=trunc(ADD)) |> ggplot(aes(x=add)) + geom_histogram(binwidth = 1) + ggtitle("Simulated ADD data")

```
```{r}
summary(ADD)
```

```{r}
summary(COMPLEX)
```
```{r}
summary(DUP)
```

Which distribution we use is not that important, just that its values are reasonably true to the form of the collected data.

### Preparing predictor data

Due to the right-skewedness, our model uses the logarithm of the datapoints in the linear regression. Otherwise the very few extreme values will have extremely high leverage over the shape of the linear model. We are also heeding the advice of Gelman: https://statmodeling.stat.columbia.edu/2019/08/21/you-should-usually-log-transform-your-positive-data/
This also means that the linear model will be based on the _magnitude_ of the change in parameter (e.g. 0, 1, 2.7, 7.3, 19.8 added lines/existing complexity). So each individual line will have marginally lower impact, but the scale of the parameter will still matter.

```{r}
logADD <- log(ADD+1)
logCOMPLEX <- log(COMPLEX+1)
logDUP <- log(DUP+1)
```

As we know that our true parameters have a lower bound of 0, we add 1 to the parameter before taking the logarithm.

Our linear regression parameters are less spread out, but the lower bound is still 0 (meaning zero added lines):

```{r}
data.frame(logADD=logADD) |> ggplot(aes(x=logADD)) + geom_histogram(bins = 50) + ggtitle("Simulated logADD data")
```
```{r}
data.frame(x=logCOMPLEX) |> ggplot(aes(x=x)) + geom_histogram(bins = 50) + ggtitle("Simulated logCOMPLEX data")
data.frame(x=logDUP) |> ggplot(aes(x=x)) + geom_histogram(bins = 50) + ggtitle("Simulated logDUP data")

```

So, now we have three parameters that should reasonably well resemble our own data, even if the underlying process is totally different.

We follow recommendations from McElreath and others, and scale our log values -- this fixes the standard deviation to 1, and centers around the mean of the logarithm (that is, the average magnitude of the change).

```{r}
A <- scale(logADD)
C <- scale(logCOMPLEX)
D <- scale(logDUP)

summary(A)
```

```{r}
data.frame(A=A) |> ggplot(aes(x=A)) + geom_histogram(bins = 50) + ggtitle("Scaled simulated logADD data")
```

This means that the intercept parameter (0) is to be interpreted as the `mean magnitude` of the parameter.
Negative values represent values with a lower magnitude, and higher represent a higher magnitude.

## Inventing a Linear Model based on our generated data

Next, we come up with some coefficients that we want our priors and our model to retrieve.

Thinking about the true data generating process, we state the following:

* Because of scaling, our intercept will reflect the mean magnitude of the parameters.
* It is likely that there are zero issues introduced if there are very few lines added or removed. So the zero inflation part of the model should reflect the number of added lines - when the number of added lines are low, there should be a comparatively high chance of introducing zero issues. Conversely, when more lines are added, the chance of hitting a zero decreases correspondingly (on the logit scale, i.e. "-4 is never, 4 is always, and 0 means 50%")
* Furthermore, it seems likely that the direction of the relationship between predictors and logLambda would be positive, or at least non-negative, overall. We do not expect the number of introduced issues to decrease as either of the predictors increase (but the effect could still be zero, i.e. no relationship).


For our analysis, we use the following formula:

* Intercept varies per team and repo. In repo R1, Team A is less well behaved than Team B. However, in Repo R2, both teams behave "bad" (unrealistic so, just to show the impact of the intercept).
* We do not vary the slope per team or repo at this time, all slopes are fixed (i.e. they are population-level effects).

```{r}
# model that both teams have the same intercept on the second repo, just to see the contrast with the first repo.
inputData <- data.frame(A=A, C=C, D=D,
                        team=rep(c("A", "B"), N/2),
                        repo=rep(c(rep("R1", N/2),
                                   rep("R2", N/2))),
                        I=rep(c(rep(c(-0.5, -2), N/4),
                                rep(c(-1, -1.5), N/4))))

logLambda <- inputData |> mutate(team = as.factor(team), 
                                 repo = as.factor(repo),
                                 lambda = I +  1.2 * A + 0.8 * C + 0.6 * D)
logLambda |> ggplot(aes(x=lambda)) + geom_histogram(bins = 50)
```

Check that the generated data produces reasonably well behaved outcomes

```{r}
logLambda |> ggplot(aes(x=exp(lambda))) + geom_histogram(bins = 400)
```

Construct the full negative binomial model. To help understand the impact of the shape parameter: https://influentialpoints.com/Training/negative_binomial_distribution-principles-properties-assumptions.htm

After shape about 8, the distribution becomes essentially symmetrical. We settle for a fixed shape, 25 (based on trial and error, and viewing the resulting distributions).

```{r}
mu <- exp(logLambda$lambda)
size <- 25
data.frame(count=rnbinom(350, mu=mu, size=size)) |> ggplot(aes(x=count)) + geom_histogram(bins=400)
```

Now construct the zero-inflation part. We start off with introducing 80% zeros, for the mean added lines (A == 0), and then decreasing that the more lines that were added (hence, when the number of added lines are less than the average magnitude, the chance of hitting a zero is greater than 80%).

```{r}
untouched_logit <- logit(0.80) - 0.25 * A
data.frame(p=inv.logit(untouched_logit)) |> ggplot(aes(x=p)) + geom_histogram(bins = 100)
p <- inv.logit(untouched_logit)
```

```{r}
prob_untouched <- p
untouched <- rbinom(N, 1, prob_untouched)
introd <- (1-untouched) * rnbinom(N, mu=mu, size=size)
```

This gives the following distribution of data.

```{r}
data.frame(introd=introd) |> ggplot(aes(x=introd)) + geom_histogram(bins=400) 
```

Zooming in:

```{r}
data.frame(introd=introd) |> ggplot(aes(x=introd)) + geom_histogram(bins=400) + scale_y_continuous(limits=c(0,75))
```


```{r}
summary(introd)
sd(introd)
```

```{r}
z <- data.frame(introd=introd) |> filter(introd == 0) |> tally()
z$n/N
```

### Scientific model

Our log-then-center approach of the predictors turns into the following model:

$log(\lambda) = \beta_{0,i} + \beta_A A_i + \beta_C C_i + \beta_D D_i$

where 

$A_i = \frac{ln(added_i+1) - \hat{\mu_{\alpha}}}{\hat{\sigma_{\alpha}}}$

and 
$\hat{\mu_{\alpha}} = mean(ln(added+1))$
$\hat{\sigma_{\alpha}} = stddev(ln(added+1))$

$C_i = \frac{ln(complexity_i+1) - \hat{\mu_{\gamma}}}{\hat{\sigma_{\gamma}}}$

and 
$\hat{\mu_{\gamma}} = mean(ln(complexity+1))$
$\hat{\sigma_{\gamma}} = stddev(ln(complexity+1))$

and

$D_i = \frac{ln(dupblocks_i+1) - \hat{\mu_{\delta}}}{\hat{\sigma_{\delta}}}$

and 
$\hat{\mu_{\delta}} = mean(ln(dupblocks+1))$
$\hat{\sigma_{\delta}} = stddev(ln(dupblocks+1))$

This corresponds to the following multiplicative model for $\lambda$ (also called $\mu$ in the negative binomial context):

$\lambda_i = e^{\beta_{0,i}} (\frac{added_i+1}{e^\hat{\mu_\alpha}})^{\frac{\beta_A}{\hat{\sigma_\alpha}}} (\frac{complexity_i+1}{e^\hat{\mu_\gamma}})^{\frac{\beta_C}{\hat{\sigma_\gamma}}} (\frac{duplicates_i+1}{e^\hat{\mu_\delta}})^{\frac{\beta_D}{\hat{\sigma_\delta}}}$

Here, $added$, $complexity$ and $duplicates$ are on their natural scale, starting at 0 and counting upwards.

Now, because we chose to let the intercept vary per repo and team, the exponent will change accordingly (hence the $i$ in $\beta_{0,i}$). 
If we were to vary the other betas per team and/or repo, they would also be individually changed like this. But in the current model, they are fixed.
The other parameters, $\hat{\mu_\alpha}$ and $\hat{\sigma_\alpha}$, et al. are just normalizing constants.

## Recovering the model parameters

```{r}
library(brms)
library(tidybayes)
library(bayesplot)
```

Argumenting for the priors:

* For the betas on logADD, logDUP and logCOMPLEX, we take a cautious approach, and state a prior or Normal(0, 0.25). On the log-scale, this means that on average, we expect "no change" for a beta (in the natural scale, an exponent of 0 is always 1), but allow 95% of the prior probability to vary between -0.5 and 0.5 (+/- 2 std.dev from mean).
* We do expect the intercept for the zero-inflation to be positive. At 0, the logit-link of the zi-part determines that there is 50% chance of seeing a zero. And we have many more zeros in our data than 50% (about 80%, judging from the histogram above). We set a prior of Normal(1, 0.25), which corresponds to having 95% probability between $inv_logit(0.5) = 0.62$ and $inv_logit(1.5) = 0.81$.
* The shape parameter is really tricky. With some experimentation, we decide to move away from the default $gamma(0.01,0.01)$ to $gamma(1, 0.25)$. The analysis for that is performed below.


### Discussing shape prior selection

With a the default shape prior of gamma(0.01,0.01), and 3500 samples, we immediately retrieve an estimate close to the real shape, albeit with a wide 95% CI interval, [13,83].

With shape prior of $Cauchy(0,25)$, and 350 samples, we got 7 divergent transitions.
Changing instead to $Gamma(3, 0.1)$, still 350 samples, sampled OK, but had 2 high Pareto k values that was solved by reloo. But the posterior looked almost like the prior, indicating that the prior might have been too strong.
Changing instead to $Gamma(1, 0.1)$, still 350 samples, sampled OK, but had 1 high Pareto k alue, 2 low neff_ratio values (lower than 0.1 threshold), and the model failed to find the correct shape value.
Default shape prior $Gamma(0.01, 0.01)$, still 350 samples, sampled OK, but had 1 high Pareto k value, solved by reloo. But it fails dismally to find the correct shape (25) - ending somewhere at 0.17, and corresponding bad predictions. Using 3500 samples helps the weak prior arrive at the reasonable shape.

Conclusion - if we suspect large shape values, we could always swap the default prior for the $Gamma(3, 0.1)$ one, and pay attention to if and how the posterior has changed. It might help dealing with small sample sizes in the groups.

```{r}
d <- data.frame(y=introd, logADD=A, logDUP=D, logCOMPLEX=C, team=logLambda$team, repo=logLambda$repo)
get_prior(data=d,
                        family=zero_inflated_negbinomial,
                        formula=bf(y ~ 1 + logADD + logDUP + logCOMPLEX + (1 | team/repo),
                                   zi ~ 1 + logADD))
```

On my laptop, this takes about 12 minutes to sample, so pur yourself a nice beverage and enjoy life...

```{r}
M_recover_params <- brm(data=d,
                        family=zero_inflated_negbinomial,
                        formula=bf(y ~ 1 + logADD + logDUP + logCOMPLEX + (1 | team/repo),
                                   zi ~ 1 + logADD),
                        prior = c(prior(normal(0, 0.5), class = Intercept),
                                  prior(normal(0, 0.25), class = b),
                                  prior(weibull(2, 1), class = sd),
                                  prior(weibull(2, 1), class = sd, group=team:repo),
                                  prior(weibull(2, 1), class = sd, group=team),
                                  prior(normal(0, 0.5), class = Intercept, dpar=zi),
                                  prior(normal(0.5, 0.5), class = b, dpar=zi),
                                  prior(gamma(1, 0.1), class = shape)),
                        warmup=1000,
                        iter=4000,
                        chains=CHAINS,
                        cores=CORES,
                        threads=threading(THREADS),
                        backend="cmdstanr",
                        save_pars = save_pars(all=T),
                        adapt_delta=0.95)
```

```{r}
m <- M_recover_params 
stopifnot(rhat(m) < 1.01)
stopifnot(neff_ratio(m) > 0.2)
mcmc_trace(m) # ok
loo <- loo(m) 
plot(loo)

```

```{r, eval=F}
loo <- loo(m, moment_match = T, reloo=T)
plot(loo)
```
```{r}
m <- add_criterion(m, "loo")
loo
```

How does Rhat and neff_ratio compare?

```{r}
rhat(m) |> mcmc_rhat()
neff_ratio(m) |> mcmc_neff() # + scale_x_continuous(limits=c(0, 0.5))
```
All $\hat{R}$ values are ok (close to 1), and also the n_eff_ratio are much greater than 0.1, which is commonly used as a cut-off threshold.

We should be able to trust our model.

How does it compare?

```{r}
summary(m)
```

The zero-inflation intercept are estimated to be $inv\_logit(1.36)$, which corresponds to 0.796

```{r}
inv.logit(1.36)
```

This is very close to our original parameter, 0.80.

Credible interval would be between 1.23 and 1.49, and the slope is estimated to -0.21, with a reliably negative 95% Credible Interval on the logit scale (-0.33 to -0.10).

```{r}
inv.logit(c(1.23, 1.49))
```

The population-level intercept is very close to 0, with a symmetric CI of [-0.84, 0.82]. This in itself is not surprising, as we chose -0.5, -2 and 1, 1 in our perfectly balanced sample.

The coefficient for logADD are specified as between 1.12 and 1.23, with a point estimate of 1.18. The real value is 1.20, close to the estimate.

The coefficient for logDUP are specified as between 0.53 and 0.60, with a point estimate of 0.56, slightly lower than the real 0.60 - but, given the large number of zeros in DUP, still very good.
Likewise, the estimate for logCOMPLEX are specified as between 0.71 and 0.8, with a point estimate of 0.75, slightly lower than the real 0.80.


### Prior vs posterior

Be aware that R `rgamma` uses shape and rate, or shape and scale. Rate parameter is the inverse of the scale - so `rate=0.1` is scale `10`. https://bookdown.org/content/4857/monsters-and-mixtures.html#over-dispersed-counts
Stan uses shape $\alpha$ and rate (inverse scale) $\beta$. https://mc-stan.org/docs/2_21/functions-reference/gamma-distribution.html


```{r}
post <- as_draws_df(m)

gamma_prior <- data.frame(x=rgamma(N, shape=1, rate=0.1))
cauchy_prior <- data.frame(x=rcauchy(N, location=0, scale=25))

post %>%
      ggplot() +
      geom_density(data = gamma_prior, aes(x=x),
                   color = "transparent", fill = "blue", alpha = .5) +
      geom_density(aes(x = shape),
                   color = "transparent", fill = "#A65141", alpha = .5) +
      annotate(geom = "text", 
               x = c(20, 5), y = c(0.15, 0.15), 
               label = c("posterior", "prior"), 
               color = c("#A65141", "blue")) +
      scale_x_continuous(limits = c(0,30)) +
      scale_y_continuous(limits = c(0,1)) +
      ggtitle("NegBinom Shape parameter, prior and posterior, limited to [0,1]", 
              subtitle = "Prior distribution is wider still") +
      geom_vline(xintercept=25) +
      labs(x = "gamma shape distribution")

```


### Conditional effects
```{r}
condeffect_logCOMPLEX_by_logADD <- function(d, aTeam, aRepo) {
  nd <- d |> mutate(logADD=0, logDUP=0, team=aTeam, repo=aRepo, logCOMPLEX=seq(from=-2, to=4, length.out=N))
  nd1 <- d |> mutate(logADD=1, logDUP=0, team=aTeam, repo=aRepo, logCOMPLEX=seq(from=-2, to=4, length.out=N))
  nd2 <- d |> mutate(logADD=2, logDUP=0, team=aTeam, repo=aRepo, logCOMPLEX=seq(from=-2, to=4, length.out=N))
  nd3 <- d |> mutate(logADD=3, logDUP=0, team=aTeam, repo=aRepo, logCOMPLEX=seq(from=-2, to=4, length.out=N))
  f <- fitted(m, newdata=nd, probs=c(.055, .945)) |> data.frame() |> bind_cols(nd)
  f1 <- fitted(m, newdata=nd1, probs=c(.055, .945)) |> data.frame() |> bind_cols(nd1)
  f2 <- fitted(m, newdata=nd2, probs=c(.055, .945)) |> data.frame() |> bind_cols(nd2)
  f3 <- fitted(m, newdata=nd3, probs=c(.055, .945)) |> data.frame() |> bind_cols(nd3)
  f$logADD <- "0"
  f1$logADD <- "1"
  f2$logADD <- "2"
  f3$logADD <- "3"
  ftot <- rbind(f, f1, f2, f3)
  return(ftot)
}

plot_logCOMPLEX_by_logADD <- function(ftot) {
  return(ftot |> ggplot(aes(x=logCOMPLEX)) + 
    geom_smooth(aes(y=Estimate, ymin=Q5.5, ymax=Q94.5, group=logADD, color=logADD), stat="identity", alpha=.25, size=.5) +
    geom_point(data=bind_cols(d, m$criteria$loo$diagnostics), aes(y=y, size = pareto_k, color=team, shape=repo), alpha=0.2)
  )
}

ftot <- condeffect_logCOMPLEX_by_logADD(d, "A", "R1")
plot_logCOMPLEX_by_logADD(ftot)
```
Rescaling to better show the relationship

```{r}
plot_logCOMPLEX_by_logADD(ftot) + scale_y_continuous(limits=c(0,25)) + 
 ggtitle("Conditional effects of complexity, for team A in repo R1 with four levels of ADD",
          subtitle = "Color of observation is the team, size is the Pareto k value for that observation")
```

```{r}
ftot <- condeffect_logCOMPLEX_by_logADD(d, "B", "R1")
plot_logCOMPLEX_by_logADD(ftot)
```

```{r}
plot_logCOMPLEX_by_logADD(ftot) + scale_y_continuous(limits=c(0,25)) + 
  ggtitle("Conditional effects of complexity, for team B in repo R1 with four levels of ADD",
          subtitle = "Color of observation is the team, size is the Pareto k value for that observation")
```

```{r}
ftot <- condeffect_logCOMPLEX_by_logADD(d, "A", "R2")
plot_logCOMPLEX_by_logADD(ftot)
```

```{r}
plot_logCOMPLEX_by_logADD(ftot) + scale_y_continuous(limits=c(0,25)) + 
  ggtitle("Conditional effects of complexity, for team A in repo R2 with four levels of ADD",
          subtitle = "Color of observation is the team, size is the Pareto k value for that observation")

```

Clearly, our beta for R2 is unrealistically large. We would not encounter such betas in real situations.

```{r}
ftot <- condeffect_logCOMPLEX_by_logADD(d, "B", "R2")
plot_logCOMPLEX_by_logADD(ftot)
```

```{r}
plot_logCOMPLEX_by_logADD(ftot) + scale_y_continuous(limits=c(0,25)) + 
  ggtitle("Conditional effects of complexity, for team B in repo R2 with four levels of ADD",
          subtitle = "Color of observation is the team, size is the Pareto k value for that observation")

```


```{r}
condeffect_logADD_by_logCOMPLEX <- function(d, aTeam, aRepo) {
  ndmin1 <- d |> mutate(logCOMPLEX=-1, logDUP=0, team=aTeam, repo=aRepo, logADD=seq(from=-2, to=5, length.out=N))
  nd <- d |> mutate(logCOMPLEX=0, logDUP=0, team=aTeam, repo=aRepo, logADD=seq(from=-2, to=5, length.out=N))
  nd1 <- d |> mutate(logCOMPLEX=1, logDUP=0, team=aTeam, repo=aRepo, logADD=seq(from=-2, to=5, length.out=N))
  nd2 <- d |> mutate(logCOMPLEX=2, logDUP=0, team=aTeam, repo=aRepo, logADD=seq(from=-2, to=5, length.out=N))
  nd3 <- d |> mutate(logCOMPLEX=3, logDUP=0, team=aTeam, repo=aRepo, logADD=seq(from=-2, to=5, length.out=N))
  fmin1 <- fitted(m, newdata=ndmin1, probs=c(.055, .945)) |> data.frame() |> bind_cols(ndmin1)
  f <- fitted(m, newdata=nd, probs=c(.055, .945)) |> data.frame() |> bind_cols(nd)
  f1 <- fitted(m, newdata=nd1, probs=c(.055, .945)) |> data.frame() |> bind_cols(nd1)
  f2 <- fitted(m, newdata=nd2, probs=c(.055, .945)) |> data.frame() |> bind_cols(nd2)
  f3 <- fitted(m, newdata=nd3, probs=c(.055, .945)) |> data.frame() |> bind_cols(nd3)
  fmin1$logCOMPLEX <- "-1"
  f$logCOMPLEX <- "0"
  f1$logCOMPLEX <- "1"
  f2$logCOMPLEX <- "2"
  f3$logCOMPLEX <- "3"
  ftot <- rbind(fmin1, f, f1, f2, f3)
  return(ftot)
}

plot_logADD_by_logCOMPLEX <- function(ftot) {
  return(ftot |> ggplot(aes(x=logADD)) + 
    geom_smooth(aes(y=Estimate, ymin=Q5.5, ymax=Q94.5, group=logCOMPLEX, color=logCOMPLEX), stat="identity", alpha=.25, size=.5) + 
    geom_point(data=bind_cols(d, m$criteria$loo$diagnostics), aes(y=y, size = pareto_k, color=team, shape=repo), alpha=0.2)
  )
}
```


```{r}
ftot <- condeffect_logADD_by_logCOMPLEX(d, "A", "R1")
plot_logADD_by_logCOMPLEX(ftot)
```

```{r}
plot_logADD_by_logCOMPLEX(ftot) + scale_y_continuous(limit=c(0,25)) + 
  ggtitle("Conditional effects of added lines, for team A in repo R1, and five levels of complexity", 
          subtitle = "Color of observation is the team, size is the Pareto k value for that observation")
```

```{r}
ftot <- condeffect_logADD_by_logCOMPLEX(d, "B", "R1")
plot_logADD_by_logCOMPLEX(ftot)
```

```{r}
plot_logADD_by_logCOMPLEX(ftot) + scale_y_continuous(limit=c(0,25)) + 
  ggtitle("Conditional effects of added lines, for team B in repo R1, and five levels of complexity", 
          subtitle = "Color of observation is the team, size is the Pareto k value for that observation")
```

```{r}
ftot <- condeffect_logADD_by_logCOMPLEX(d, "A", "R2")
plot_logADD_by_logCOMPLEX(ftot)
```

```{r}
plot_logADD_by_logCOMPLEX(ftot) + scale_y_continuous(limit=c(0,25)) + 
  ggtitle("Conditional effects of added lines, for team A in repo R2, and five levels of complexity", 
          subtitle = "Color of observation is the team, size is the Pareto k value for that observation")
```

```{r}
ftot <- condeffect_logADD_by_logCOMPLEX(d, "B", "R2")
plot_logADD_by_logCOMPLEX(ftot)
```

```{r}
plot_logADD_by_logCOMPLEX(ftot) + scale_y_continuous(limit=c(0,25)) + 
  ggtitle("Conditional effects of added lines, for team B in repo R2, and five levels of complexity", 
          subtitle = "Color of observation is the team, size is the Pareto k value for that observation")
```

```{r}
ftot |> ggplot(aes(x=logADD)) + geom_smooth(aes(y=Estimate, ymin=Q5.5, ymax=Q94.5, group=logCOMPLEX, color=logCOMPLEX), stat="identity", alpha=.25, size=.5) + 
  geom_point(data=bind_cols(d, m$criteria$loo$diagnostics), aes(y=y, size = pareto_k, color=team, shape=repo), alpha=0.2) + scale_y_continuous(limit=c(0,25)) + 
  ggtitle("Conditional effects of added lines, for team A and five levels of complexity", subtitle = "Color of observation is the team, size is the Pareto k value for that observation")
```

#### Conclusion

So, if team B, in repo R1 makes an addition that is two standard deviations magnitude big (relative to the average magnitude), in a file that is three orders of magnitude complex, given that the average number of duplicates exist in the file, we would expect five duplicates to be added. If the file is average in complexity, we would only expect about 0-1 duplicates to be added.

If team A makes a similar change, we would expect about 17 duplicates to be added to the file, and about 2-3 duplicates if the file is average in complexity.


#### Model parameters

```{r}
mcmc_areas(m, regex_pars = c("^b_", "^b_zi"))
```

```{r}
mcmc_areas(m, regex_pars = c("^[^b].*ntercept"))
```

These intercepts distributions are really wide. And the general intercept parameter is very close to 0 (not strange, as we had -0.5, -2 and two times 1, and we had a perfectly balanced sample, i.e. equally many data points for each combination of factors.)

But why is the residuals (I interpret r_ parameters as differences from the population mean) so wide? Otherwise, the expected value seems quite close to the real value.

I suspect something to do with the prior settings (but I had 3500 samples for this simulation, so I would have expected the priors to have been overcome by now...)

```{r}
ranef(m)
```


Old way of doing conditional effects plots:

```{r}
nd <- d |> mutate(logCOMPLEX=0, logADD=0, logDUP=seq(from=-2, to=5, length.out=N))

f <- fitted(m, newdata=nd, probs=c(.055, .945)) |> data.frame() |> bind_cols(nd)
f |> ggplot(aes(x=logDUP)) + geom_smooth(aes(y=Estimate, ymin=Q5.5, ymax=Q94.5), stat="identity", alpha=.25, size=.5) + 
  geom_point(data=bind_cols(d, m$criteria$loo$diagnostics), aes(y=y, size = pareto_k), alpha=0.8)

```
```{r}
f |> ggplot(aes(x=logDUP)) + geom_smooth(aes(y=Estimate, ymin=Q5.5, ymax=Q94.5), stat="identity", alpha=.25, size=.5) + 
  geom_point(data=bind_cols(d, m$criteria$loo$diagnostics), aes(y=y, size = pareto_k), alpha=0.8) + scale_y_continuous(limit=c(0,25))

```


## Posterior predictive checks

We can visualize how our model compares related to the simulated data by doing posterior predictive checks.

```{r}
yrep_zinb <- posterior_predict(m)
```

The proportion of zeros fall straight in line with the observed ratio (black line), about 0.875. These are zeros occurring regardless of the zero-inflation or the negative binomial. Our model thinks that about 86%-89% of file changes would have zero issues introduced. The real (simulated) proportion is about 87.5%.

```{r}
ppc_stat(y = d$y, yrep_zinb, stat = function(y) mean(y == 0))
```

The maximum value simulated ranges between 100 to about 2000. The larger values are unrealistic, and in particular the peak around 450 seems to warrant further scrutiny - the most likely cause for this peak would be the large scale value (25 in this simulation).

```{r}
ppc_stat(y = d$y, yrep_zinb, stat = "max") # + scale_x_continuous(limits=c(0,2000))
```

```{r}
max(yrep_zinb)
```

For ZINB distributions, suspended rootograms are recommended by Kleiber and Zeileis (2016).
These rootograms show the difference between observed and expected counts, with bars hanging from the zero-line rather than the expected count line. Therefore we can think of this rootogram as showing information about the model residuals. The scale of the y axis is the square root of the count, in order to elevate smaller differences.

```{r}
pp_check(m, type = "rootogram", style="suspended")
```


```{r}
pp_check(m, type = "rootogram", style="suspended") + scale_x_continuous(limits = c(0,100)) + ggtitle("Suspended rootogram beween 0 and 100.")

```

It appears that the priors did their job, and allowed the data to shine through.


### Conditional Effects

The conditional effect show the effect of varying one parameter, keeping all the others in the model constant (usually at their mean value). We see the comparatively larger impact of logADD and logCOMPLEX, relative to logADD (as we specified in our model, by giving them larger coefficients).
However, above I defined custom functions, which included both several curves (e.g. logADD curves in the logCOMPLEX plot), and the actual data points.

Which of these visualizations are preferred?

```{r}
plot(conditional_effects(m), ask = FALSE)
```