
Adding draft of Chapter 12 #37

Merged 1 commit on Dec 7, 2022
767 changes: 767 additions & 0 deletions textbook/12/1/uniform.ipynb


Binary file added textbook/12/2/bias-variance.png
296 changes: 296 additions & 0 deletions textbook/12/2/normal.ipynb


383 changes: 383 additions & 0 deletions textbook/12/3/binomial.ipynb


60 changes: 60 additions & 0 deletions textbook/12/empirical-distributions.ipynb
@@ -0,0 +1,60 @@
{
Collaborator @campbelle1 commented on Dec 5, 2022:

The Probability Distributions paragraph is a bit of a mouthful for someone who may be unfamiliar with the material. Could you offer a simple, anecdotal example to illustrate the meaning of the terms random variable, sample space, and probability distribution after this paragraph? Having such an example, and directly pointing out which terms correspond to what in the example, could help in comprehending this paragraph.

Just to clarify: you defined variance as σ^2(X) and then said the standard deviation is the square root of the variance and is σ^2(X). How so? Wouldn't it be sqrt(σ^2(X))? Just want to make sure I'm understanding...

denoted by s is the -> denoted by s, is the


Member commented:

This looks like a typo resulting in a broken image:

![Chapter 11](../11/Probability.ipynb)

Rather than say a link:

[Chapter 11](../11/Probability.html)


Collaborator (Author) commented:

You are absolutely right that the ! is a typo. Should the file extension be .html instead of .ipynb? Will a rendered html file be in that folder in the built version of the textbook? Asking because I think I made this mistake in other places, if so...

Member commented:

Yes, you're right. The ambiguity in authoring such things is honestly frustrating.

There are generally three ways about it:

  • Write proper HTML, which will be passed through as-is, and work in the textbook.
  • Write Markdown, which should be fixed up before conversion to HTML.
  • Use a Jupyter Book (Sphinx) helper, which will ostensibly make things easier (though this markup may be even less natural).

Here's documentation I found.

And so, any of the following would work:

  • <a href="../11/Probability.html">Chapter 11</a>
  • [Chapter 11](../11/Probability.ipynb)
  • [](../11/Probability.ipynb)
  • [Chapter 11](../11/Probability)
  • [](../11/Probability)
  • {doc}`Chapter 11 <../11/Probability>`
  • {doc}`../11/Probability`

That is, all references are relative, and:

  • HTML should presume the full final document path (with .html suffix)
  • Markdown may either refer to the actual file name (including .ipynb suffix) or omit the suffix entirely
  • Sphinx-style references may not include the file suffix
  • Markdown and Sphinx references may omit the link text, in which case it will be derived from the referenced document's full header (in this case Probability: Mathematical/Theoretical and Computational Approaches).

So! Yes, in short, this would do the trick (among other options):

[Chapter 11](../11/Probability.ipynb)

Collaborator (Author) commented:

Goodness, so confusing! Thank you for clarifying! I have pushed the change, so hopefully it works now! Please let me know if other changes are needed

Member commented:

Thanks! Looks good to me. So this is ready to merge?

Collaborator (Author) commented:

I think so

"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Empirical and Probability Distributions\n",
"*Susanna Lange and Amanda R. Kube Jotte*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the past few chapters, we have discussed methods of sampling individuals from a population and how biased samples can affect the generalizability of our data. Sampling is used to make inferences about a population when gathering information about the entire population is difficult or impossible. We make these inferences through calculating statistics on our sample with the goal of estimating the true population parameter we are interested in.\n",
"\n",
"## Probabilistic Sampling and Random Variables\n",
"Earlier in this book, we learned how to slice dataframes or select elements from arrays. This is a type of sampling known as *deterministic sampling* since there is no chance involved. In this section, we will build on our use of the random.choice function from Chapter 10 to create *probabilistic samples*, where the probability of each unit being chosen is known before sampling is done. Simple random samples (SRS), as we learned in the previous chapter, are samples in which each unit has an equal probability of being chosen. Since we know the probability of each unit being chosen, an SRS is an example of a probabilistic sample. When we are considering a random event or phenomenon, we are interested in the outcome of the event. A *random variable*, often denoted by uppercase letters $X$ or $Y$, is a numerical quantity representing an outcome of the event. The collection of possible outcomes, or the sample space as discussed in [Chapter 11](../11/Probability.ipynb), contains all possible values the random variable can take. Random variables can be either discrete, that is, having a finite or countably infinite sample space, or continuous, that is, having an uncountably infinite sample space. In the case of a discrete random variable, the sample space is a set of possible outcomes. An example of this would be $\\{\\text{Heads}, \\text{Tails} \\}$ for the outcome of a coin flip. In the case of a continuous random variable, the sample space is often an interval of possible outcomes. An example of this would be an interval of possible adult heights in inches [24, 107].\n",
"\n",
"## Probability Distributions\n",
"When we look at all possible values a random variable could take over all possible samples of the same size taken from the same population, we are building a *probability distribution*, or sampling distribution, of that random variable. In fact, a probability distribution corresponds to a function that assigns a probability to each possible value of the random variable, where the domain, or input, is the entire sample space. Such a function, usually denoted $P(X=x)$ where $X$ is a random variable and $x$ is the outcome of an event, is called a *probability density function (pdf)* for continuous random variables or a *probability mass function (pmf)* for discrete random variables.\n",
"$P(X=x)$ must satisfy the following criteria:\n",
"- the probability of each element occurring is greater than or equal to 0\n",
"- the sum of all probabilities of elements in the sample space equals 1\n",
"\n",
"When we are referring to either a pmf or a pdf, we will use the general term probability distribution.\n",
"\n",
"Consider the discrete coin toss example above: the probability mass function computes the probability of getting $\\text{Heads}$, $\\text{Tails}$, or any union or intersection of events in the sample space. For example, $P(X=\\text{``Heads or Tails\"})$ represents the probability that a coin flip will result in either $\\text{Heads}$ or $\\text{Tails}$ (the union of the sample space), which is of course 1, as a coin is guaranteed to give one of these outcomes.\n",
"\n",
"Regarding the continuous height example, the probability density function allows us to compute the probability of getting any interval subset, including unions and intersections, of the given sample space [24, 107]. For example, we can calculate $P(X=``< 60\")$ or, more simply, $P(X < 60)$. However, with continuous random variables, the probability of getting a single value from the sample space, for example $P(X = 76.2)$, is always 0, as the probability of picking this exact value out of the infinite number of values in the sample space is infinitesimally small. For that reason, when discussing continuous random variables, we are interested in intervals such as $P(X < 60)$, $P(X \\geq 100.3)$, or $P(65.6 < X < 72.5)$.\n",
"\n",
"## Measures of Center and Spread\n",
"The probability distribution of a random variable is useful in many ways, one of which is to summarize the data, in particular by providing information on its *center* and *spread*. The *center*, often called the *mean* or *expected value* of a random variable, is denoted $\\mu(X)$ or $E(X)$ and describes the average value of the sample space. The *spread* of the data, or the *variance*, is symbolized by $\\sigma^2(X)=Var(X)$ and describes how the data is dispersed. Another commonly used measure of spread is the *standard deviation*, which is the square root of the variance and is symbolized by $\\sigma(X)$. These measures are used so often that mathematicians have found formulas to calculate them for each probability distribution. We will explore these formulas for specific probability distributions later in this chapter. \n",
"\n",
"## Empirical Distributions\n",
"As a probability distribution depicts *all possible* samples of the same size from a population, it is not based on observed data. However, we can estimate a probability distribution empirically by taking many samples from a population and plotting the distribution of the observed values of a statistic. This is known as an *empirical distribution*. We can calculate measures of center and spread for an empirical distribution using the *sample mean* or *sample variance*. The sample mean is defined as $\\bar{x} = \\frac{\\Sigma x_i}{n}$. In words, to calculate the sample mean, you sum the observed values and then divide by the number of observations. The sample variance is defined as $s^2 = \\frac{\\Sigma (x_i - \\bar{x})^2}{n-1}$, and the sample standard deviation, denoted by $s$, is the square root of this. As these can be cumbersome to calculate by hand, especially for large samples, numpy has functions that will calculate them for us: `np.mean` and `np.std` (note that `np.std` must be passed `ddof=1` to use the sample formula's $n-1$ denominator).\n",
"\n",
"In the rest of this chapter, we will investigate 3 probability distributions, their measures of center and spread, and how to estimate them empirically."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.9.13 64-bit",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.9.13"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}
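The chapter draft above discusses probabilistic sampling, empirical distributions, and the numpy helpers `np.mean` and `np.std`. A minimal sketch of those ideas (not part of the PR; the seed, sample sizes, and the height-distribution parameters are illustrative assumptions, and `ddof=1` is passed so `np.std` uses the n-1 denominator of the sample standard deviation):

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility (assumption)

# Probabilistic sampling: each unit has a known, equal chance of being
# chosen, as in a simple random sample of coin-flip outcomes.
flips = rng.choice(["Heads", "Tails"], size=10_000)

# Empirical distribution: observed relative frequencies estimate the pmf,
# so this should be close to the theoretical P(X = Heads) = 0.5.
p_heads = np.mean(flips == "Heads")

# Sample mean and sample standard deviation of a hypothetical sample of
# adult heights (parameters 66 and 4 inches are assumptions).
heights = rng.normal(66, 4, size=500)
x_bar = np.mean(heights)
s = np.std(heights, ddof=1)  # ddof=1 gives the n-1 denominator

print(p_heads, x_bar, s)
```

Note that `np.std`'s default `ddof=0` computes the population standard deviation (denominator n), which is why `ddof=1` is spelled out here.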
32 changes: 0 additions & 32 deletions textbook/12/placeholder12.ipynb

This file was deleted.

6 changes: 5 additions & 1 deletion textbook/_toc.yml
@@ -92,7 +92,11 @@ chapters:
- file: 11/4/Birthday_Pb_Relaxed_Assumptions

- title: "12. Empirical and Probability Distributions"
file: 12/placeholder12
file: 12/empirical-distributions
sections:
- file: 12/1/uniform
- file: 12/2/normal
- file: 12/3/binomial

- title: "13. Hypothesis Testing"
file: 13/placeholder13