
Adding draft of Chapter 12 #37

Merged 1 commit on Dec 7, 2022
767 changes: 767 additions & 0 deletions textbook/12/1/uniform.ipynb


Binary file added textbook/12/2/bias-variance.png
296 changes: 296 additions & 0 deletions textbook/12/2/normal.ipynb


383 changes: 383 additions & 0 deletions textbook/12/3/binomial.ipynb


60 changes: 60 additions & 0 deletions textbook/12/empirical-distributions.ipynb
@@ -0,0 +1,60 @@
{
Collaborator @campbelle1 commented on Dec 5, 2022:

The Probability Distributions paragraph is a bit of a mouthful for someone who may be unfamiliar with the material. Could you offer a simple, anecdotal example to illustrate the meaning of the terms random variable, sample space, and probability distribution after this paragraph? Having such an example, and directly pointing out which terms correspond to what in the example, could help in comprehending this paragraph.

Just to clarify: you defined variance as σ^2(X) and then said the standard deviation is the square root of the variance and is σ^2(X). How so? Wouldn't it be sqrt(σ^2(X))? Just want to make sure I'm understanding...

denoted by s is the -> denoted by s, is the


Member commented:

This looks like a typo resulting in a broken image:

![Chapter 11](../11/Probability.ipynb)

Rather than say a link:

[Chapter 11](../11/Probability.html)


Collaborator (Author) commented:

You are absolutely right that the ! is a typo. Should the file extension be .html instead of .ipynb? Will a rendered html file be in that folder in the built version of the textbook? Asking because I think I made this mistake in other places, if so...

Member commented:

Yes, you're right. The ambiguity in authoring such things is honestly frustrating.

There are generally three ways about it:

  • Write proper HTML, which will be passed through as-is, and work in the textbook.
  • Write Markdown, which should be fixed up before conversion to HTML.
  • Use a Jupyter Book (Sphinx) helper, which will ostensibly make things easier (though this markup may be even less natural).

Here's documentation I found.

And so, any of the following would work:

  • <a href="../11/Probability.html">Chapter 11</a>
  • [Chapter 11](../11/Probability.ipynb)
  • [](../11/Probability.ipynb)
  • [Chapter 11](../11/Probability)
  • [](../11/Probability)
  • {doc}`Chapter 11 <../11/Probability>`
  • {doc}`../11/Probability`

That is, all references are relative, and:

  • HTML should presume the full final document path (with .html suffix)
  • Markdown may either refer to the actual file name (including .ipynb suffix) or omit the suffix entirely
  • Sphinx-style references may not include the file suffix
  • Markdown and Sphinx references may omit the link text, in which case it will be derived from the referenced document's full header (in this case Probability: Mathematical/Theoretical and Computational Approaches).

So! Yes, in short, this would do the trick (among other options):

[Chapter 11](../11/Probability.ipynb)

Collaborator (Author) commented:

Goodness, so confusing! Thank you for clarifying! I have pushed the change, so hopefully it works now! Please let me know if other changes are needed

Member commented:

Thanks! Looks good to me. So this is ready to merge?

Collaborator (Author) commented:

I think so

"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Empirical and Probability Distributions\n",
"*Susanna Lange and Amanda R. Kube Jotte*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the past few chapters, we have discussed methods of sampling individuals from a population and how biased samples can affect the generalizability of our data. Sampling is used to make inferences about a population when gathering information about the entire population is difficult or impossible. We make these inferences through calculating statistics on our sample with the goal of estimating the true population parameter we are interested in.\n",
"\n",
"## Probabilistic Sampling and Random Variables\n",
"Earlier in this book, we learned how to slice dataframes or select elements from arrays. This is a type of sampling known as *deterministic sampling* since there is no chance involved. In this section, we will build on our use of the random.choice function from Chapter 10 to create *probabilistic samples*, where the probability of each unit being chosen is known before sampling is done. Simple random samples (SRS), as we learned in the previous chapter, are samples in which each unit has an equal probability of being chosen. Since we know the probability of each unit being chosen, an SRS is an example of a probabilistic sample. When we are considering a random event or phenomenon, we are interested in the outcome of the event. A *random variable*, often denoted by uppercase letters $X$ or $Y$, is a numerical quantity representing an outcome of the event. The collection of possible outcomes, or the sample space as discussed in [Chapter 11](../11/Probability.ipynb), contains all possible values the random variable can take. Random variables can be either discrete, that is, having a finite or countably infinite sample space, or continuous, that is, having an uncountably infinite sample space. In the case of a discrete random variable, the sample space is a set of possible outcomes. An example of this would be $\\{\\text{Heads}, \\text{Tails} \\}$ for the outcome of a coin flip. In the case of a continuous random variable, the sample space is often an interval of possible outcomes. An example of this would be an interval of possible adult heights in inches [24, 107].\n",
"\n",
"## Probability Distributions\n",
"When we look at all possible values a random variable could take over all possible samples of the same size taken from the same population, we are building a *probability distribution*, or sampling distribution, of that random variable. In fact, a probability distribution corresponds to a function that assigns a probability to each possible value of the random variable, where the domain, or input, is the entire sample space. Such a function, usually denoted $P(X=x)$ where $X$ is a random variable and $x$ is the outcome of an event, is called a *probability density function (pdf)* for continuous random variables or a *probability mass function (pmf)* for discrete random variables.\n",
"$P(X=x)$ must satisfy the following criteria:\n",
"- the probability of each element occurring is greater than or equal to 0\n",
"- the sum of all probabilities of elements in the sample space equals 1\n",
"\n",
"When we are referring to either a pmf or a pdf, we will use the general term probability distribution.\n",
"\n",
"Consider the discrete coin toss example above: the probability mass function computes the probability of getting $\\text{Heads}$, $\\text{Tails}$, or any union or intersection of events in the sample space. For example, $P(X=\\text{``Heads or Tails\"})$ represents the probability that a coin flip will result in either $\\text{Heads}$ or $\\text{Tails}$ (the union of the sample space), which is of course 1, as a coin is guaranteed to give one of these outcomes.\n",
"\n",
"Regarding the continuous height example, the probability density function allows us to compute the probability of getting any interval subset, including unions and intersections, of the given sample space [24, 107]. For example, we can calculate $P(X=``< 60\")$ or, more simply, $P(X < 60)$. However, with continuous random variables, the probability of getting a single value from the sample space, for example $P(X = 76.2)$, is always 0, as the probability of picking this exact value out of the infinite number of values in the sample space is infinitesimally small. For that reason, when discussing continuous random variables, we are interested in intervals such as $P(X < 60)$, $P(X \\geq 100.3)$, or $P(65.6 < X < 72.5)$.\n",
"\n",
"## Measures of Center and Spread\n",
"The probability distribution of a random variable is useful in many ways, one of which is to summarize the data, in particular by providing information on its *center* and *spread*. The *center*, often called the *mean* or *expected value* of a random variable, is denoted $\\mu(X)$ or $E(X)$ and describes the average value of the sample space. The *spread* of the data, or the *variance*, is symbolized by $\\sigma^2(X)=Var(X)$ and describes how the data is dispersed. Another commonly used measure of spread is the *standard deviation*, which is the square root of the variance and is symbolized by $\\sigma(X)$. These measures are used so often that mathematicians have found formulas to calculate them for each probability distribution. We will explore these formulas for specific probability distributions later in this chapter. \n",
"\n",
"## Empirical Distributions\n",
"As a probability distribution depicts *all possible* samples of the same size from a population, it is not based on observed data. However, we can estimate a probability distribution empirically by taking many samples from a population and plotting the distribution of the observed values of a statistic. This is known as an *empirical distribution*. We can calculate measures of center and spread for an empirical distribution using the *sample mean* or *sample variance*. The sample mean is defined as $\\bar{x} = \\frac{\\Sigma x_i}{n}$. In words, to calculate the sample mean, you sum the observed values and then divide by the number of observations. The sample variance is defined as $s^2 = \\frac{\\Sigma (x_i - \\bar{x})^2}{n-1}$, and the sample standard deviation, denoted by $s$, is the square root of this. As these can be cumbersome to calculate by hand, especially for large samples, numpy has functions that will calculate them for us: `np.mean` and `np.std` (note that `np.std` must be passed `ddof=1` to use the sample formula's $n-1$ denominator).\n",
"\n",
"In the rest of this chapter, we will investigate 3 probability distributions, their measures of center and spread, and how to estimate them empirically."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.9.13 64-bit",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.9.13"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}
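The chapter draft above discusses probabilistic sampling, empirical distributions, and the numpy helpers `np.mean` and `np.std`. A minimal sketch of those ideas (not part of the PR; the seed, sample sizes, and the height-distribution parameters are illustrative assumptions, and `ddof=1` is passed so `np.std` uses the n-1 denominator of the sample standard deviation):

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility (assumption)

# Probabilistic sampling: each unit has a known, equal chance of being
# chosen, as in a simple random sample of coin-flip outcomes.
flips = rng.choice(["Heads", "Tails"], size=10_000)

# Empirical distribution: observed relative frequencies estimate the pmf,
# so this should be close to the theoretical P(X = Heads) = 0.5.
p_heads = np.mean(flips == "Heads")

# Sample mean and sample standard deviation of a hypothetical sample of
# adult heights (parameters 66 and 4 inches are assumptions).
heights = rng.normal(66, 4, size=500)
x_bar = np.mean(heights)
s = np.std(heights, ddof=1)  # ddof=1 gives the n-1 denominator

print(p_heads, x_bar, s)
```

Note that `np.std`'s default `ddof=0` computes the population standard deviation (denominator n), which is why `ddof=1` is spelled out here.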
32 changes: 0 additions & 32 deletions textbook/12/placeholder12.ipynb

This file was deleted.

6 changes: 5 additions & 1 deletion textbook/_toc.yml
@@ -92,7 +92,11 @@ chapters:
- file: 11/4/Birthday_Pb_Relaxed_Assumptions

- title: "12. Empirical and Probability Distributions"
file: 12/placeholder12
file: 12/empirical-distributions
sections:
- file: 12/1/uniform
- file: 12/2/normal
- file: 12/3/binomial

- title: "13. Hypothesis Testing"
file: 13/placeholder13