-
Notifications
You must be signed in to change notification settings - Fork 1
/
simulated_zinb_get_priors.Rmd
733 lines (541 loc) · 29.9 KB
/
simulated_zinb_get_priors.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
---
title: "ZeroInflatedPoisson"
author: "Anders Sundelin"
date: "2022-10-18"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(dplyr)
library(boot)
```
```{r}
CHAINS <- 3
CORES <- 2
THREADS <- 2
```
# Simulating Issue Introduction and Removal with Zero-inflated Negative Binomial and brms
The goal of this Rmd is to arrive at reasonably sane, scientifically grounded priors.
In order to do so, we will simulate data that conceptually "look like" the data we see "in the wild", but have a known source distribution (specified by us).
This is done just to assess the quality of the priors, and does not imply that the true data-generating process obeys the same formula.
We merely want to assess the strength of the priors, in order to not make them too strong, as that would prevent the posterior from dominating the priors.
## Inventing Data Points
We know that our data for both ADD, DUP and COMPLEX are very right-skewed. After experimenting with exponential distributions, we instead settled for calculating lognormal distribution metrics instead.
Shape of the Introduced issues:
```
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.0000 0.2331 0.0000 150.0000
sd: 2.281958
Prop.zeros: 0.9341762
```
Shape of Removed issues:
```
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 0.000 0.151 0.000 112.000
sd 1.841085
Prop.zeros: 0.9563002
```
Shape of the added lines:
```
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 1.00 6.00 31.76 28.00 3772.00
sd: 88.80997
Prop.zeros: 0.1543845
```
Shape of the complexity score:
```
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 3.00 16.00 58.83 52.00 1244.00
sd 128.2378
Prop.zeros: 0.1182959
```
Shape of the duplicated blocks (over 50% of the duplicated blocks are 0, 75% are 1 or 2:
```
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 0.000 7.085 2.000 664.000
sd 33.09433
Prop.zeros: 0.6055084
```
Shape of removed lines: (lower mean/median/quantiles than added, but in general the same shape):
```
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 0.00 2.00 20.69 12.00 3413.00
sd 78.87213
Prop.zeros: 0.257587
```
```{r}
# input the requested mean and sd, output is the mu of the rlnorm
rlnorm_mu <- function(mean, sd) {
log(mean^2/sqrt(mean^2+sd^2))
}
rlnorm_sd <- function(mean, sd) {
sqrt(log(1+sd^2/mean^2))
}
mu_added <- rlnorm_mu(31.76, 88.80997)
sd_added <- rlnorm_sd(31.76, 88.80997)
```
```{r}
mu_complex <- rlnorm_mu(58.83, 128.2378)
sd_complex <- rlnorm_sd(58.83, 128.2378)
# in particular for dup, we have many more zeros than what we expect. Squeeze down the mu, while keeping the sd, to get more zeros
mu_dup <- rlnorm_mu(log(20.69^2), 78.87213)
sd_dup <- rlnorm_sd(log(20.69^2), 78.87213)
```
```{r}
N <- 3500
set.seed(20230104)
ADD <- rlnorm(N, meanlog = mu_added, sdlog=sd_added)
COMPLEX <- rlnorm(N, meanlog = mu_complex, sdlog=sd_complex)
DUP <- rlnorm(N, meanlog = mu_dup, sdlog=sd_dup)
```
The main thing to look for is the maximum value (is it reasonable, within an order of magnitude), number of zeros/extremely low values, and the shape in between. We want to have a sharp right-skewed distribution, as that is how our own data looks like.
All three of our predictors have a similar shape, so we use the same formula to simulate them, though all use unique lognormal distributions.
```{r}
data.frame(add=DUP) |> ggplot(aes(x=add)) + geom_histogram(binwidth = 1) + ggtitle("Simulated DUP data")
```
The overall shape looks similar to our original dataset.
As we are looking for integer counts, it seems plausible we could truncate the logarithms to get to the integer part. This will allow us to reach the value `0`, which features a lot in our data.
```{r}
ADD <- trunc(ADD)
COMPLEX <- trunc(COMPLEX)
DUP <- trunc(DUP)
```
### Proportion of zeros in the predictor data
Proportion of zeros in the resulting data set:
```{r}
z <- data.frame(a=DUP) |> filter(a == 0) |> tally()
z/length(DUP)
```
So, our ADD only has 4.5% zeros in itself. Our model has 15%
Our COMPLEX is an even worse fit, it has 1.4%, versus the real 11.8%
Our original DUP had 16.2%, whereas the real has 60%. If we double the sd, we will at least get 31.1% zeros. Not enough, but still something...
Clearly, the DUP statistics does not follow a proper lognormal distribution, something else is influencing the large number of zeros.
All of our data underestimate the number of zeros in our simulated model.
If we squeeze down the mean of DUP by taking the logarithm of the mean, we can get 80% zeros there, more than we have in the existing data.
Squaring the mean, and taking the logarithm, means we again got down to 62% zeros after truncationg. We settle for that solution for now.
```{r}
data.frame(add=trunc(ADD)) |> ggplot(aes(x=add)) + geom_histogram(binwidth = 1) + ggtitle("Simulated ADD data")
```
```{r}
summary(ADD)
```
```{r}
summary(COMPLEX)
```
```{r}
summary(DUP)
```
Which distribution we use is not that important, just that its values are reasonably true to the form of the collected data.
### Preparing predictor data
Due to the right-skewedness, our model uses the logarithm of the datapoints in the linear regression. Otherwise the very few extreme values will have extremely high leverage over the shape of the linear model. We are also heeding the advice of Gelman: https://statmodeling.stat.columbia.edu/2019/08/21/you-should-usually-log-transform-your-positive-data/
This also means that the linear model will be based on the _magnitude_ of the change in parameter (e.g. 0, 1, 2.7, 7.3, 19.8 added lines/existing complexity). So each individual line will have marginally lower impact, but the scale of the parameter will still matter.
```{r}
logADD <- log(ADD+1)
logCOMPLEX <- log(COMPLEX+1)
logDUP <- log(DUP+1)
```
As we know that our true parameters have a lower bound of 0, we add 1 to the parameter before taking the logarithm.
Our linear regression parameters are less spread out, but the lower bound is still 0 (meaning zero added lines):
```{r}
data.frame(logADD=logADD) |> ggplot(aes(x=logADD)) + geom_histogram(bins = 50) + ggtitle("Simulated logADD data")
```
```{r}
data.frame(x=logCOMPLEX) |> ggplot(aes(x=x)) + geom_histogram(bins = 50) + ggtitle("Simulated logCOMPLEX data")
data.frame(x=logDUP) |> ggplot(aes(x=x)) + geom_histogram(bins = 50) + ggtitle("Simulated logDUP data")
```
So, now we have three parameters that should reasonably well resemble our own data, even if the underlying process is totally different.
We follow recommendations from McElreath and others, and scale our log values -- this fixes the standard deviation to 1, and centers around the mean of the logarithm (that is, the average magnitude of the change).
```{r}
A <- scale(logADD)
C <- scale(logCOMPLEX)
D <- scale(logDUP)
summary(A)
```
```{r}
data.frame(A=A) |> ggplot(aes(x=A)) + geom_histogram(bins = 50) + ggtitle("Scaled simulated logADD data")
```
This means that the intercept parameter (0) is to be interpreted as the `mean magnitude` of the parameter.
Negative values represent values with a lower magnitude, and higher represent a higher magnitude.
## Inventing a Linear Model based on our generated data
Next, we come up with some coefficients that we want our priors and our model to retrieve.
Thinking about the true data generating process, we state the following:
* Because of scaling, our intercept will reflect the mean magnitude of the parameters.
* It is likely that there are zero issues introduced if there are very few lines added or removed. So the zero inflation part of the model should reflect the number of added lines - when the number of added lines are low, there should be a comparatively high chance of introducing zero issues. Conversely, when more lines are added, the chance of hitting a zero decreases correspondingly (on the logit scale, i.e. "-4 is never, 4 is always, and 0 means 50%")
* Furthermore, it seems likely that the direction of the relationship between predictors and logLambda would be positive, or at least non-negative, overall. We do not expect the number of introduced issues to decrease as either of the predictors increase (but the effect could still be zero, i.e. no relationship).
For our analysis, we use the following formula:
* Intercept varies per team and repo. In repo R1, Team A is less well behaved than Team B. However, in Repo R2, both teams behave "bad" (unrealistic so, just to show the impact of the intercept).
* We do not vary the slope per team or repo at this time, all slopes are fixed (i.e. they are population-level effects).
```{r}
# model that both teams have the same intercept on the second repo, just to see the contrast with the first repo.
inputData <- data.frame(A=A, C=C, D=D,
team=rep(c("A", "B"), N/2),
repo=rep(c(rep("R1", N/2),
rep("R2", N/2))),
I=rep(c(rep(c(-0.5, -2), N/4),
rep(c(-1, -1.5), N/4))))
logLambda <- inputData |> mutate(team = as.factor(team),
repo = as.factor(repo),
lambda = I + 1.2 * A + 0.8 * C + 0.6 * D)
logLambda |> ggplot(aes(x=lambda)) + geom_histogram(bins = 50)
```
Check that the generated data produces reasonably well behaved outcomes
```{r}
logLambda |> ggplot(aes(x=exp(lambda))) + geom_histogram(bins = 400)
```
Construct the full negative binomial model. To help understand the impact of the shape parameter: https://influentialpoints.com/Training/negative_binomial_distribution-principles-properties-assumptions.htm
After shape about 8, the distribution becomes essentially symmetrical. We settle for a fixed shape, 25 (based on trial and error, and viewing the resulting distributions).
```{r}
mu <- exp(logLambda$lambda)
size <- 25
data.frame(count=rnbinom(350, mu=mu, size=size)) |> ggplot(aes(x=count)) + geom_histogram(bins=400)
```
Now construct the zero-inflation part. We start off with introducing 80% zeros, for the mean added lines (A == 0), and then decreasing that the more lines that were added (hence, when the number of added lines are less than the average magnitude, the chance of hitting a zero is greater than 80%).
```{r}
untouched_logit <- logit(0.80) - 0.25 * A
data.frame(p=inv.logit(untouched_logit)) |> ggplot(aes(x=p)) + geom_histogram(bins = 100)
p <- inv.logit(untouched_logit)
```
```{r}
prob_untouched <- p
untouched <- rbinom(N, 1, prob_untouched)
introd <- (1-untouched) * rnbinom(N, mu=mu, size=size)
```
This gives the following distribution of data.
```{r}
data.frame(introd=introd) |> ggplot(aes(x=introd)) + geom_histogram(bins=400)
```
Zooming in:
```{r}
data.frame(introd=introd) |> ggplot(aes(x=introd)) + geom_histogram(bins=400) + scale_y_continuous(limits=c(0,75))
```
```{r}
summary(introd)
sd(introd)
```
```{r}
z <- data.frame(introd=introd) |> filter(introd == 0) |> tally()
z$n/N
```
### Scientific model
Our log-then-center approach of the predictors turns into the following model:
$log(\lambda) = \beta_{0,i} + \beta_A A_i + \beta_C C_i + \beta_D D_i$
where
$A_i = \frac{ln(added_i+1) - \hat{\mu_{\alpha}}}{\hat{\sigma_{\alpha}}}$
and
$\hat{\mu_{\alpha}} = mean(ln(added+1))$
$\hat{\sigma_{\alpha}} = stddev(ln(added+1))$
$C_i = \frac{ln(complexity_i+1) - \hat{\mu_{\gamma}}}{\hat{\sigma_{\gamma}}}$
and
$\hat{\mu_{\gamma}} = mean(ln(complexity+1))$
$\hat{\sigma_{\gamma}} = stddev(ln(complexity+1))$
and
$D_i = \frac{ln(dupblocks_i+1) - \hat{\mu_{\delta}}}{\hat{\sigma_{\delta}}}$
and
$\hat{\mu_{\delta}} = mean(ln(dupblocks+1))$
$\hat{\sigma_{\delta}} = stddev(ln(dupblocks+1))$
This corresponds to the following multiplicative model for $\lambda$ (also called $\mu$ in the negative binomial context):
$\lambda_i = e^{\beta_{0,i}} (\frac{added_i+1}{e^\hat{\mu_\alpha}})^{\frac{\beta_A}{\hat{\sigma_\alpha}}} (\frac{complexity_i+1}{e^\hat{\mu_\gamma}})^{\frac{\beta_C}{\hat{\sigma_\gamma}}} (\frac{duplicates_i+1}{e^\hat{\mu_\delta}})^{\frac{\beta_D}{\hat{\sigma_\delta}}}$
Here, $added$, $complexity$ and $duplicates$ are on their natural scale, starting at 0 and counting upwards.
Now, because we chose to let the intercept vary per repo and team, the exponent will change accordingly (hence the $i$ in $\beta_{0,i}$).
If we were to vary the other betas per team and/or repo, they would also be individually changed like this. But in the current model, they are fixed.
The other parameters, $\hat{\mu_\alpha}$ and $\hat{\sigma_\alpha}$, et al. are just normalizing constants.
## Recovering the model parameters
```{r}
library(brms)
library(tidybayes)
library(bayesplot)
```
Argumenting for the priors:
* For the betas on logADD, logDUP and logCOMPLEX, we take a cautious approach, and state a prior or Normal(0, 0.25). On the log-scale, this means that on average, we expect "no change" for a beta (in the natural scale, an exponent of 0 is always 1), but allow 95% of the prior probability to vary between -0.5 and 0.5 (+/- 2 std.dev from mean).
* We do expect the intercept for the zero-inflation to be positive. At 0, the logit-link of the zi-part determines that there is 50% chance of seeing a zero. And we have many more zeros in our data than 50% (about 80%, judging from the histogram above). We set a prior of Normal(1, 0.25), which corresponds to having 95% probability between $inv_logit(0.5) = 0.62$ and $inv_logit(1.5) = 0.81$.
* The shape parameter is really tricky. With some experimentation, we decide to move away from the default $gamma(0.01,0.01)$ to $gamma(1, 0.25)$. The analysis for that is performed below.
### Discussing shape prior selection
With a the default shape prior of gamma(0.01,0.01), and 3500 samples, we immediately retrieve an estimate close to the real shape, albeit with a wide 95% CI interval, [13,83].
With shape prior of $Cauchy(0,25)$, and 350 samples, we got 7 divergent transitions.
Changing instead to $Gamma(3, 0.1)$, still 350 samples, sampled OK, but had 2 high Pareto k values that was solved by reloo. But the posterior looked almost like the prior, indicating that the prior might have been too strong.
Changing instead to $Gamma(1, 0.1)$, still 350 samples, sampled OK, but had 1 high Pareto k alue, 2 low neff_ratio values (lower than 0.1 threshold), and the model failed to find the correct shape value.
Default shape prior $Gamma(0.01, 0.01)$, still 350 samples, sampled OK, but had 1 high Pareto k value, solved by reloo. But it fails dismally to find the correct shape (25) - ending somewhere at 0.17, and corresponding bad predictions. Using 3500 samples helps the weak prior arrive at the reasonable shape.
Conclusion - if we suspect large shape values, we could always swap the default prior for the $Gamma(3, 0.1)$ one, and pay attention to if and how the posterior has changed. It might help dealing with small sample sizes in the groups.
```{r}
d <- data.frame(y=introd, logADD=A, logDUP=D, logCOMPLEX=C, team=logLambda$team, repo=logLambda$repo)
get_prior(data=d,
family=zero_inflated_negbinomial,
formula=bf(y ~ 1 + logADD + logDUP + logCOMPLEX + (1 | team/repo),
zi ~ 1 + logADD))
```
On my laptop, this takes about 12 minutes to sample, so pur yourself a nice beverage and enjoy life...
```{r}
M_recover_params <- brm(data=d,
family=zero_inflated_negbinomial,
formula=bf(y ~ 1 + logADD + logDUP + logCOMPLEX + (1 | team/repo),
zi ~ 1 + logADD),
prior = c(prior(normal(0, 0.5), class = Intercept),
prior(normal(0, 0.25), class = b),
prior(weibull(2, 1), class = sd),
prior(weibull(2, 1), class = sd, group=team:repo),
prior(weibull(2, 1), class = sd, group=team),
prior(normal(0, 0.5), class = Intercept, dpar=zi),
prior(normal(0.5, 0.5), class = b, dpar=zi),
prior(gamma(1, 0.1), class = shape)),
warmup=1000,
iter=4000,
chains=CHAINS,
cores=CORES,
threads=threading(THREADS),
backend="cmdstanr",
save_pars = save_pars(all=T),
adapt_delta=0.95)
```
```{r}
m <- M_recover_params
stopifnot(rhat(m) < 1.01)
stopifnot(neff_ratio(m) > 0.2)
mcmc_trace(m) # ok
loo <- loo(m)
plot(loo)
```
```{r, eval=F}
loo <- loo(m, moment_match = T, reloo=T)
plot(loo)
```
```{r}
m <- add_criterion(m, "loo")
loo
```
How does Rhat and neff_ratio compare?
```{r}
rhat(m) |> mcmc_rhat()
neff_ratio(m) |> mcmc_neff() # + scale_x_continuous(limits=c(0, 0.5))
```
All $\hat{R}$ values are ok (close to 1), and also the n_eff_ratio are much greater than 0.1, which is commonly used as a cut-off threshold.
We should be able to trust our model.
How does it compare?
```{r}
summary(m)
```
The zero-inflation intercept are estimated to be $inv\_logit(1.36)$, which corresponds to 0.796
```{r}
inv.logit(1.36)
```
This is very close to our original parameter, 0.80.
Credible interval would be between 1.23 and 1.49, and the slope is estimated to -0.21, with a reliably negative 95% Credible Interval on the logit scale (-0.33 to -0.10).
```{r}
inv.logit(c(1.23, 1.49))
```
The population-level intercept is very close to 0, with a symmetric CI of [-0.84, 0.82]. This in itself is not surprising, as we chose -0.5, -2 and 1, 1 in our perfectly balanced sample.
The coefficient for logADD are specified as between 1.12 and 1.23, with a point estimate of 1.18. The real value is 1.20, close to the estimate.
The coefficient for logDUP are specified as between 0.53 and 0.60, with a point estimate of 0.56, slightly lower than the real 0.60 - but, given the large number of zeros in DUP, still very good.
Likewise, the estimate for logCOMPLEX are specified as between 0.71 and 0.8, with a point estimate of 0.75, slightly lower than the real 0.80.
### Prior vs posterior
Be aware that R `rgamma` uses shape and rate, or shape and scale. Rate parameter is the inverse of the scale - so `rate=0.1` is scale `10`. https://bookdown.org/content/4857/monsters-and-mixtures.html#over-dispersed-counts
Stan uses shape $\alpha$ and rate (inverse scale) $\beta$. https://mc-stan.org/docs/2_21/functions-reference/gamma-distribution.html
```{r}
post <- as_draws_df(m)
gamma_prior <- data.frame(x=rgamma(N, shape=1, rate=0.1))
cauchy_prior <- data.frame(x=rcauchy(N, location=0, scale=25))
post %>%
ggplot() +
geom_density(data = gamma_prior, aes(x=x),
color = "transparent", fill = "blue", alpha = .5) +
geom_density(aes(x = shape),
color = "transparent", fill = "#A65141", alpha = .5) +
annotate(geom = "text",
x = c(20, 5), y = c(0.15, 0.15),
label = c("posterior", "prior"),
color = c("#A65141", "blue")) +
scale_x_continuous(limits = c(0,30)) +
scale_y_continuous(limits = c(0,1)) +
ggtitle("NegBinom Shape parameter, prior and posterior, limited to [0,1]",
subtitle = "Prior distribution is wider still") +
geom_vline(xintercept=25) +
labs(x = "gamma shape distribution")
```
### Conditional effects
```{r}
condeffect_logCOMPLEX_by_logADD <- function(d, aTeam, aRepo) {
nd <- d |> mutate(logADD=0, logDUP=0, team=aTeam, repo=aRepo, logCOMPLEX=seq(from=-2, to=4, length.out=N))
nd1 <- d |> mutate(logADD=1, logDUP=0, team=aTeam, repo=aRepo, logCOMPLEX=seq(from=-2, to=4, length.out=N))
nd2 <- d |> mutate(logADD=2, logDUP=0, team=aTeam, repo=aRepo, logCOMPLEX=seq(from=-2, to=4, length.out=N))
nd3 <- d |> mutate(logADD=3, logDUP=0, team=aTeam, repo=aRepo, logCOMPLEX=seq(from=-2, to=4, length.out=N))
f <- fitted(m, newdata=nd, probs=c(.055, .945)) |> data.frame() |> bind_cols(nd)
f1 <- fitted(m, newdata=nd1, probs=c(.055, .945)) |> data.frame() |> bind_cols(nd1)
f2 <- fitted(m, newdata=nd2, probs=c(.055, .945)) |> data.frame() |> bind_cols(nd2)
f3 <- fitted(m, newdata=nd3, probs=c(.055, .945)) |> data.frame() |> bind_cols(nd3)
f$logADD <- "0"
f1$logADD <- "1"
f2$logADD <- "2"
f3$logADD <- "3"
ftot <- rbind(f, f1, f2, f3)
return(ftot)
}
plot_logCOMPLEX_by_logADD <- function(ftot) {
return(ftot |> ggplot(aes(x=logCOMPLEX)) +
geom_smooth(aes(y=Estimate, ymin=Q5.5, ymax=Q94.5, group=logADD, color=logADD), stat="identity", alpha=.25, size=.5) +
geom_point(data=bind_cols(d, m$criteria$loo$diagnostics), aes(y=y, size = pareto_k, color=team, shape=repo), alpha=0.2)
)
}
ftot <- condeffect_logCOMPLEX_by_logADD(d, "A", "R1")
plot_logCOMPLEX_by_logADD(ftot)
```
Rescaling to better show the relationship
```{r}
plot_logCOMPLEX_by_logADD(ftot) + scale_y_continuous(limits=c(0,25)) +
ggtitle("Conditional effects of complexity, for team A in repo R1 with four levels of ADD",
subtitle = "Color of observation is the team, size is the Pareto k value for that observation")
```
```{r}
ftot <- condeffect_logCOMPLEX_by_logADD(d, "B", "R1")
plot_logCOMPLEX_by_logADD(ftot)
```
```{r}
plot_logCOMPLEX_by_logADD(ftot) + scale_y_continuous(limits=c(0,25)) +
ggtitle("Conditional effects of complexity, for team B in repo R1 with four levels of ADD",
subtitle = "Color of observation is the team, size is the Pareto k value for that observation")
```
```{r}
ftot <- condeffect_logCOMPLEX_by_logADD(d, "A", "R2")
plot_logCOMPLEX_by_logADD(ftot)
```
```{r}
plot_logCOMPLEX_by_logADD(ftot) + scale_y_continuous(limits=c(0,25)) +
ggtitle("Conditional effects of complexity, for team A in repo R2 with four levels of ADD",
subtitle = "Color of observation is the team, size is the Pareto k value for that observation")
```
Clearly, our beta for R2 is unrealistically large. We would not encounter such betas in real situations.
```{r}
ftot <- condeffect_logCOMPLEX_by_logADD(d, "B", "R2")
plot_logCOMPLEX_by_logADD(ftot)
```
```{r}
plot_logCOMPLEX_by_logADD(ftot) + scale_y_continuous(limits=c(0,25)) +
ggtitle("Conditional effects of complexity, for team B in repo R2 with four levels of ADD",
subtitle = "Color of observation is the team, size is the Pareto k value for that observation")
```
```{r}
condeffect_logADD_by_logCOMPLEX <- function(d, aTeam, aRepo) {
ndmin1 <- d |> mutate(logCOMPLEX=-1, logDUP=0, team=aTeam, repo=aRepo, logADD=seq(from=-2, to=5, length.out=N))
nd <- d |> mutate(logCOMPLEX=0, logDUP=0, team=aTeam, repo=aRepo, logADD=seq(from=-2, to=5, length.out=N))
nd1 <- d |> mutate(logCOMPLEX=1, logDUP=0, team=aTeam, repo=aRepo, logADD=seq(from=-2, to=5, length.out=N))
nd2 <- d |> mutate(logCOMPLEX=2, logDUP=0, team=aTeam, repo=aRepo, logADD=seq(from=-2, to=5, length.out=N))
nd3 <- d |> mutate(logCOMPLEX=3, logDUP=0, team=aTeam, repo=aRepo, logADD=seq(from=-2, to=5, length.out=N))
fmin1 <- fitted(m, newdata=ndmin1, probs=c(.055, .945)) |> data.frame() |> bind_cols(ndmin1)
f <- fitted(m, newdata=nd, probs=c(.055, .945)) |> data.frame() |> bind_cols(nd)
f1 <- fitted(m, newdata=nd1, probs=c(.055, .945)) |> data.frame() |> bind_cols(nd1)
f2 <- fitted(m, newdata=nd2, probs=c(.055, .945)) |> data.frame() |> bind_cols(nd2)
f3 <- fitted(m, newdata=nd3, probs=c(.055, .945)) |> data.frame() |> bind_cols(nd3)
fmin1$logCOMPLEX <- "-1"
f$logCOMPLEX <- "0"
f1$logCOMPLEX <- "1"
f2$logCOMPLEX <- "2"
f3$logCOMPLEX <- "3"
ftot <- rbind(fmin1, f, f1, f2, f3)
return(ftot)
}
plot_logADD_by_logCOMPLEX <- function(ftot) {
return(ftot |> ggplot(aes(x=logADD)) +
geom_smooth(aes(y=Estimate, ymin=Q5.5, ymax=Q94.5, group=logCOMPLEX, color=logCOMPLEX), stat="identity", alpha=.25, size=.5) +
geom_point(data=bind_cols(d, m$criteria$loo$diagnostics), aes(y=y, size = pareto_k, color=team, shape=repo), alpha=0.2)
)
}
```
```{r}
ftot <- condeffect_logADD_by_logCOMPLEX(d, "A", "R1")
plot_logADD_by_logCOMPLEX(ftot)
```
```{r}
plot_logADD_by_logCOMPLEX(ftot) + scale_y_continuous(limit=c(0,25)) +
ggtitle("Conditional effects of added lines, for team A in repo R1, and five levels of complexity",
subtitle = "Color of observation is the team, size is the Pareto k value for that observation")
```
```{r}
ftot <- condeffect_logADD_by_logCOMPLEX(d, "B", "R1")
plot_logADD_by_logCOMPLEX(ftot)
```
```{r}
plot_logADD_by_logCOMPLEX(ftot) + scale_y_continuous(limit=c(0,25)) +
ggtitle("Conditional effects of added lines, for team B in repo R1, and five levels of complexity",
subtitle = "Color of observation is the team, size is the Pareto k value for that observation")
```
```{r}
ftot <- condeffect_logADD_by_logCOMPLEX(d, "A", "R2")
plot_logADD_by_logCOMPLEX(ftot)
```
```{r}
plot_logADD_by_logCOMPLEX(ftot) + scale_y_continuous(limit=c(0,25)) +
ggtitle("Conditional effects of added lines, for team A in repo R2, and five levels of complexity",
subtitle = "Color of observation is the team, size is the Pareto k value for that observation")
```
```{r}
ftot <- condeffect_logADD_by_logCOMPLEX(d, "B", "R2")
plot_logADD_by_logCOMPLEX(ftot)
```
```{r}
plot_logADD_by_logCOMPLEX(ftot) + scale_y_continuous(limit=c(0,25)) +
ggtitle("Conditional effects of added lines, for team B in repo R2, and five levels of complexity",
subtitle = "Color of observation is the team, size is the Pareto k value for that observation")
```
```{r}
ftot |> ggplot(aes(x=logADD)) + geom_smooth(aes(y=Estimate, ymin=Q5.5, ymax=Q94.5, group=logCOMPLEX, color=logCOMPLEX), stat="identity", alpha=.25, size=.5) +
geom_point(data=bind_cols(d, m$criteria$loo$diagnostics), aes(y=y, size = pareto_k, color=team, shape=repo), alpha=0.2) + scale_y_continuous(limit=c(0,25)) +
ggtitle("Conditional effects of added lines, for team A and five levels of complexity", subtitle = "Color of observation is the team, size is the Pareto k value for that observation")
```
#### Conclusion
So, if team B, in repo R1 makes an addition that is two standard deviations magnitude big (relative to the average magnitude), in a file that is three orders of magnitude complex, given that the average number of duplicates exist in the file, we would expect five duplicates to be added. If the file is average in complexity, we would only expect about 0-1 duplicates to be added.
If team A makes a similar change, we would expect about 17 duplicates to be added to the file, and about 2-3 duplicates if the file is average in complexity.
#### Model parameters
```{r}
mcmc_areas(m, regex_pars = c("^b_", "^b_zi"))
```
```{r}
mcmc_areas(m, regex_pars = c("^[^b].*ntercept"))
```
These intercepts distributions are really wide. And the general intercept parameter is very close to 0 (not strange, as we had -0.5, -2 and two times 1, and we had a perfectly balanced sample, i.e. equally many data points for each combination of factors.)
But why is the residuals (I interpret r_ parameters as differences from the population mean) so wide? Otherwise, the expected value seems quite close to the real value.
I suspect something to do with the prior settings (but I had 3500 samples for this simulation, so I would have expected the priors to have been overcome by now...)
```{r}
ranef(m)
```
Old way of doing conditional effects plots:
```{r}
nd <- d |> mutate(logCOMPLEX=0, logADD=0, logDUP=seq(from=-2, to=5, length.out=N))
f <- fitted(m, newdata=nd, probs=c(.055, .945)) |> data.frame() |> bind_cols(nd)
f |> ggplot(aes(x=logDUP)) + geom_smooth(aes(y=Estimate, ymin=Q5.5, ymax=Q94.5), stat="identity", alpha=.25, size=.5) +
geom_point(data=bind_cols(d, m$criteria$loo$diagnostics), aes(y=y, size = pareto_k), alpha=0.8)
```
```{r}
f |> ggplot(aes(x=logDUP)) + geom_smooth(aes(y=Estimate, ymin=Q5.5, ymax=Q94.5), stat="identity", alpha=.25, size=.5) +
geom_point(data=bind_cols(d, m$criteria$loo$diagnostics), aes(y=y, size = pareto_k), alpha=0.8) + scale_y_continuous(limit=c(0,25))
```
## Posterior predictive checks
We can visualize how our model compares related to the simulated data by doing posterior predictive checks.
```{r}
yrep_zinb <- posterior_predict(m)
```
The proportion of zeros fall straight in line with the observed ratio (black line), about 0.875. These are zeros occurring regardless of the zero-inflation or the negative binomial. Our model thinks that about 86%-89% of file changes would have zero issues introduced. The real (simulated) proportion is about 87.5%.
```{r}
ppc_stat(y = d$y, yrep_zinb, stat = function(y) mean(y == 0))
```
The maximum value simulated ranges between 100 to about 2000. The larger values are unrealistic, and in particular the peak around 450 seems to warrant further scrutiny - the most likely cause for this peak would be the large scale value (25 in this simulation).
```{r}
ppc_stat(y = d$y, yrep_zinb, stat = "max") # + scale_x_continuous(limits=c(0,2000))
```
```{r}
max(yrep_zinb)
```
For ZINB distributions, suspended rootograms are recommended by Kleiber and Zeileis (2016).
These rootograms show the difference between observed and expected counts, with bars hanging from the zero-line rather than the expected count line. Therefore we can think of this rootogram as showing information about the model residuals. The scale of the y axis is the square root of the count, in order to elevate smaller differences.
```{r}
pp_check(m, type = "rootogram", style="suspended")
```
```{r}
pp_check(m, type = "rootogram", style="suspended") + scale_x_continuous(limits = c(0,100)) + ggtitle("Suspended rootogram beween 0 and 100.")
```
It appears that the priors did their job, and allowed the data to shine through.
### Conditional Effects
The conditional effect show the effect of varying one parameter, keeping all the others in the model constant (usually at their mean value). We see the comparatively larger impact of logADD and logCOMPLEX, relative to logADD (as we specified in our model, by giving them larger coefficients).
However, above I defined custom functions, which included both several curves (e.g. logADD curves in the logCOMPLEX plot), and the actual data points.
Which of these visualizations are preferred?
```{r}
plot(conditional_effects(m), ask = FALSE)
```