lesson1-tasks.Rmd

---
title: "Lesson 1 - Tasks"
output: html_document
---


## Task 1: Installing Stan

Note: Windows users in both Stan and Python might need RTools installed.
Martin has a flash drive with RTools installers so that you don't
have to wait for the large download.

### R

We will use the `cmdstanr` package, follow  https://mc-stan.org/cmdstanr/articles/cmdstanr.html


### Python

We will use the `CmdStanPy` package, follow https://mc-stan.org/cmdstanpy/installation.html


## Task 2: Hello World model

Create a file called `initial.stan` and paste the following simple model. 

```{r, comment="", echo=FALSE}
cat(readLines("stan/lesson1/initial.stan"), sep = "\n")
```

Note that we express all contributions to the target density explicitly with `target += `. 

The corresponding statistical model is 

$$
y_i \sim N(\mu, 1) \\
\mu \sim N(0,1)
$$

i.e. we are estimating the mean of a normal distribution with known standard deviation and
a somewhat informative prior on the mean.

_Note: You'll notice many Stan models use statements like `y ~ normal(mu, 1)` instead of `target += normal(y | mu, 1)`. We will discuss this alternative syntax and the differences in next lesson. _

Simulate data for the model with 10 data points, using the same structure as the model assumes, i.e. draw $\mu$ from its prior, then draw $y$ (using `rnorm` in R, `numpy.random.Generator.normal` in Python)

Compile the model, prepare the data in a format for the model, fit the model and display a summary:

- In R: https://mc-stan.org/cmdstanr/articles/cmdstanr.html#compiling-a-model
  - You may want to use `options(mc.cores = 4)` to compute chains in parallel.
- In Python: https://mc-stan.org/cmdstanpy/users-guide/hello_world.html, you can also use the Arviz package - convert the CmdStanPy object to `InferenceData` via
[arviz.from_cmdstanpy](https://python.arviz.org/en/latest/api/generated/arviz.from_cmdstanpy.html) and then use [arviz.summary](https://python.arviz.org/en/latest/api/generated/arviz.summary.html)

Inspect the summary - how big is the uncertainty? Are the "true" values we used in simulation recovered?

## Task 3: Estimate `sigma`

Add a `sigma` parameter of type `real` to model the standard deviation, with a half-Normal prior i.e. the new mathematical model is:

$$
y_i \sim N(\mu, \sigma) \\
\mu \sim N(0,1) \\
\sigma \sim HalfN(0, 2)
$$

Since standard deviation cannot be negative, add a `<lower=0>` constraint to `sigma`. 

HMC needs the parameter space to be Euclidean (i.e. $\mathbb{R}^K$ for some $K$), so when something is constrained, there is an implicit transformation, so Stan will use `log(sigma)` internally.

There is no builtin half-normal distribution in Stan. But it turns out, using a normal prior on a variable constrained to be positive is the same as using a half-normal distribution. We will go into the reasons for this and some nuances next time.


Generate data according to this model, i.e. first draw $\mu$ and $\sigma$ from their priors, then draw $y$. You can draw from the half-normal distribution simply by taking the absolute value of a draw from the normal distribution.

Explore the summary, do you recover the true parameters?


## Task 4: Our first convergence problems

Use the model from the previous task with dataset of size 1. What happens?

Is there a reason the model may not work well with just a single datapoint?

Does the model from Task 2 also have issues with a dataset of size 1? Why?

**Bonus:** Try this if you feel it won't take you too much time. Assume that the single data point is $y = 1$, make a contour and/or heatmap plot of the posterior density as a function $\mu$ and $\sigma$ (in R use `dnorm( log = TRUE)`, in Python use `scipy.stats.norm` and the `logpdf` method).

Generally, to get a basic understanding of warnings about convergence from Stan, the guide at https://mc-stan.org/misc/warnings is IMHO quite good (I wrote most of it :-D).

## Task 5: Prior as data

Add two new data elements of type `real` - `mu_prior_mean` and `mu_prior_sd` to define the location and scale of the normal prior for $\mu$

Should those data have some constraints attached? If yes, add them as well.

Test that the model now works with some of the datasets you generated previously.

## Task 6: Prior influence

Generate a dataset of size 10 with the prior $\mu \sim N(0,1), \sigma \sim HalfN(0, 2)$. 
Find the least extreme values of `mu_prior_mean` and `mu_prior_sd` that make the posterior 
generated by Stan exclude the true value of $\mu$ from your simulation.

Try the same task with a dataset of size 3 and a dataset of size 100.

## Task 7: Bayesian "t-test"

Expand the model so that you now have two sets of data (possibly of unequal size), each belonging to a different group and
estimate the mean in the first group and the difference in means. More specifically, the model should look like this:

$$
y_A \sim N(\mu, \sigma_A) \\
y_B \sim N(\mu + \delta, \sigma_B) \\
\mu \sim N(0, 1)\\
\delta \sim N\left(0, \frac{1}{2}\right) \\
\sigma_A, \sigma_B \sim HalfN(0, 1)
$$

(you can keep the location and scale of the prior as data, but you can also hardcode it)

Repeatedly simulate data according to the model and fit it.
How big do the groups need to be, so that your posterior interval for $\delta$ excludes 0 most of the time?