Adding draft of Chapter 12 #37
@@ -0,0 +1,60 @@
{
Reviewer: This looks like a typo resulting in a broken image:

`![Chapter 11](../11/Probability.ipynb)`

Rather than, say, a link:

`[Chapter 11](../11/Probability.html)`

Author: You are absolutely right that the `!` is a typo. The file should be `.html` instead of `.ipynb`? Will a rendered html file be in that folder in the built version of the textbook? Asking because I think I made this mistake in other places, if so...

Reviewer: Yes, you're right. The ambiguity in authoring such things is honestly frustrating. There are generally three ways to go about it, and any of the following would work: all references are relative. So! Yes, in short, this would do the trick (among other options).

Author: Goodness, so confusing! Thank you for clarifying! I have pushed the change, so hopefully it works now! Please let me know if other changes are needed.

Reviewer: Thanks! Looks good to me. So this is ready to merge?

Author: I think so.
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Empirical and Probability Distributions\n",
    "*Susanna Lange and Amanda R. Kube Jotte*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the past few chapters, we have discussed methods of sampling individuals from a population and how biased samples can affect the generalizability of our data. Sampling is used to make inferences about a population when gathering information about the entire population is difficult or impossible. We make these inferences by calculating statistics on our sample, with the goal of estimating the true population parameter we are interested in.\n",
    "\n",
    "## Probabilistic Sampling and Random Variables\n",
    "Earlier in this book, we learned how to slice dataframes or select elements from arrays. This is a type of sampling known as *deterministic sampling*, since there is no chance involved. In this section, we will build on our use of the random.choice function from Chapter 10 to create *probabilistic samples*, where the probability of each unit being chosen is known before sampling is done. Simple random samples (SRS), as we learned in the previous chapter, are samples in which each unit has an equal probability of being chosen. Since we know the probability of each unit being chosen, an SRS is an example of a probabilistic sample. When we are considering a random event or phenomenon, we are interested in the outcome of the event. A *random variable*, often denoted by an uppercase letter such as $X$ or $Y$, is a numerical quantity representing an outcome of the event. The collection of possible outcomes, or the sample space as discussed in [Chapter 11](../11/Probability.html), contains all possible values the random variable can take. Random variables can be either discrete, meaning the sample space contains finitely or countably infinitely many elements, or continuous, meaning the sample space contains uncountably many elements. In the case of a discrete random variable, the sample space is a set of possible outcomes. An example of this would be $\\{\\text{Heads}, \\text{Tails}\\}$ for the outcome of a coin flip. In the case of a continuous random variable, the sample space is often an interval of possible outcomes. An example of this would be an interval of possible adult heights in inches, $[24, 107]$.\n",
    "\n",
    "## Probability Distributions\n",
    "When we look at all possible values a random variable could take over all possible samples of the same size taken from the same population, we are building a *probability distribution*, or sampling distribution, of that random variable. In fact, a probability distribution corresponds to a function that assigns a probability to each possible value of the random variable, where the domain, or input, is the entire sample space. Such a function, usually denoted $P(X=x)$ where $X$ is a random variable and $x$ is a possible outcome, is called a *probability density function (pdf)* for continuous random variables or a *probability mass function (pmf)* for discrete random variables.\n",
    "$P(X=x)$ must satisfy the following criteria:\n",
    "- the probability of each element occurring is greater than or equal to 0\n",
    "- the sum of all probabilities of elements in the sample space equals 1\n",
    "\n",
    "When we are referring to either a pmf or a pdf, we will use the general term probability distribution.\n",
    "\n",
    "Consider the discrete coin toss example above: the probability mass function will compute the probability of getting $\\text{Heads}$, $\\text{Tails}$, or any union or intersection of events in the sample space. For example, $P(X=\\text{``Heads or Tails\"})$ represents the probability that a coin flip will result in either $\\text{Heads}$ or $\\text{Tails}$ (the union of the sample space), which is of course 1, as a coin is guaranteed to give one of these outcomes.\n",
    "\n",
    "Regarding the continuous height example, the probability density function will compute the probability of getting any interval subset, including unions and intersections, of the given sample space $[24, 107]$. For example, we can calculate $P(X=``< 60\")$ or, more simply, $P(X < 60)$. However, with continuous random variables, the probability of getting a single value from the sample space, for example $P(X = 76.2)$, is always 0, as the probability of picking this exact value out of the infinite number of values in the sample space is infinitesimally small. For that reason, when discussing continuous random variables, we are interested in intervals such as $P(X < 60)$, $P(X \\geq 100.3)$, or $P(65.6 < X < 72.5)$.\n",
    "\n",
    "## Measures of Center and Spread\n",
    "The probability distribution of a random variable is useful in many ways, one of which is to summarize the data, in particular by providing information on the *center* and *spread* of the data. The *center*, often called the *mean* or *expected value* of a random variable, is denoted $\\mu(X)$ or $E(X)$. This describes the average value of the sample space. The *spread* of the data, or the *variance*, is symbolized by $\\sigma^2(X)=Var(X)$. This describes how the data are dispersed. Another commonly used measure of spread is the *standard deviation*, which is the square root of the variance and is symbolized by $\\sigma(X)$. These measures are used so often that mathematicians have derived formulas to calculate them for each probability distribution. We will explore these formulas for specific probability distributions later in this chapter.\n",
    "\n",
    "## Empirical Distributions\n",
    "As a probability distribution depicts *all possible* samples of the same size from a population, it is not based on observed data. However, we can estimate a probability distribution empirically by taking many samples from a population and plotting the distribution of the observed values of a statistic. This is known as an *empirical distribution*. We can calculate measures of center and spread for an empirical distribution using the *sample mean* or *sample variance*. The sample mean is defined as $\\bar{x} = \\frac{\\Sigma x_i}{n}$. In words, to calculate the sample mean, you sum the observed values and then divide by the number of observations. The sample variance is defined as $s^2 = \\frac{\\Sigma (x_i - \\bar{x})^2}{n-1}$, and the sample standard deviation, denoted by $s$, is the square root of this. As these can be cumbersome to calculate by hand, especially for large samples, numpy has functions that will calculate them for us: `np.mean`, `np.var`, and `np.std`. (Note that `np.var` and `np.std` divide by $n$ by default; pass `ddof=1` to get the $n-1$ denominator used in the definition above.)\n",
    "\n",
    "In the rest of this chapter, we will investigate three probability distributions, their measures of center and spread, and how to estimate them empirically."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.9.13 64-bit",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.9.13"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
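The sampling and summary ideas in the chapter draft above can be sketched in a few lines of NumPy. This is a hypothetical illustration, not code from the notebook; the fair-coin example and all variable names are assumptions:

```python
import numpy as np

# Sample space of a discrete random variable: a single coin flip.
sample_space = ["Heads", "Tails"]

rng = np.random.default_rng(0)  # seeded for reproducibility

# A probabilistic sample: each outcome's probability (0.5) is known
# before any sampling is done.
flips = rng.choice(sample_space, size=10_000, p=[0.5, 0.5])

# Empirical estimate of P(X = "Heads"); the true value is 0.5.
p_heads = np.mean(flips == "Heads")

# Encode outcomes numerically (Heads = 1, Tails = 0) to compute
# measures of center and spread of the empirical distribution.
x = (flips == "Heads").astype(float)
sample_mean = np.mean(x)
sample_var = np.var(x, ddof=1)  # ddof=1 gives the n-1 denominator (sample variance)
sample_std = np.std(x, ddof=1)  # square root of the sample variance
```

With 10,000 flips, the empirical estimate `p_heads` lands close to the theoretical probability 0.5, and `sample_std` is the square root of `sample_var`, matching the definitions in the text.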
This file was deleted.
Reviewer: The Probability Distributions paragraph is a bit of a mouthful for someone who may be unfamiliar. Could you possibly offer a simple, anecdotal example to illustrate the meaning of the terms random variable, sample space, and probability distribution after this paragraph? I think having such an example, and directly pointing out which terms are what in the example, could help in comprehending this paragraph.
Reviewer: Just to clarify, you defined variance as σ^2(X) and then said standard deviation is the square root of variance and is σ^2(X). How so? Wouldn't it be sqrt(σ^2(X))? Just want to make sure I'm understanding...
denoted by s is the -> denoted by s, is the