datacarpentry · naupaka · Oct 3, 2024 · Oct 2, 2024 · Oct 2, 2024 · Oct 2, 2024
diff --git a/episodes/03-basics-factors-dataframes.Rmd b/episodes/03-basics-factors-dataframes.Rmd
@@ -185,7 +185,15 @@
 of 29 variables (columns). Double-clicking on the name of the object will open
 a view of the data in a new tab.
 
-![RStudio data frame view]("fig/rstudio_dataframeview.png")
+![RStudio data frame view]("episodes/fig/rstudio_dataframeview.png")
+
+We can also quickly query the dimensions of the data frame using `dim()`. The first number, `801`, is the number of rows and the second, `29`, is the number of columns.
+
+```{r, purl=FALSE}
+## get the dimensions (rows, columns) of a data frame
+
+dim(variants)
+```
 
 ## Summarizing, subsetting, and determining the structure of a data frame.
 
@@ -208,12 +216,17 @@
 other variables (e.g. `sample_id`) are treated as characters data (more on this
 in a bit).
 
-There is a lot to work with, so we will subset the first three columns into a
-new data frame using the `data.frame()` function.
+There is a lot to work with, so we will subset some of the columns into a new data
+frame using the `data.frame()` function. To subset (index) a two-dimensional object,
+we give the row and column indices inside square brackets, separated by a comma. The
+left-hand side of the comma selects the rows, and the right-hand side selects the
+columns (e.g. `[row index, column index]`).
 
-```{r, purl=FALSE}
-## put the first three columns of variants into a new data frame called subset
+Let's put the columns 1, 2, 3, and 6 into a new data frame called subset:
 
+```{r, purl=FALSE}
+## Notice that we wrap the column numbers in c() to pass them as a single vector
+## on the right-hand side of the comma.
 subset <- data.frame(variants[, c(1:3, 6)])
 ```
 
@@ -228,12 +241,13 @@
 
 Ok, thats a lot up unpack! Some things to notice.
 
-- the object type `data.frame` is displayed in the first row along with its
+- The object type `data.frame` is displayed in the first row along with its
   dimensions, in this case 801 observations (rows) and 4 variables (columns)
-- Each variable (column) has a name (e.g. `sample_id`). This is followed
-  by the object mode (e.g. chr, int, etc.). Notice that before each
+- Each variable (column) has a name (e.g. `sample_id`). Notice that before each
   variable name there is a `$` - this will be important later.
-
+- Each variable name is followed by the data type it contains (e.g. chr, int, etc.).
+  The `int` type denotes integer data, a numeric type that can only store whole
+  numbers (i.e. no decimal points).
 
 
   :::::::::::::::::::::::::::::::::::::::  challenge
@@ -297,10 +311,19 @@
 ```
 
 There are 801 alleles (one for each row). To simplify, lets look at just the
-single-nucleotide alleles (SNPs). We can use some of the vector indexing skills
-from the last episode.
+single-nucleotide alleles (SNPs). 
+
+Let's review some of the vector indexing skills from the last episode that can help:
 
 ```{r, purl=FALSE}
+# This compares every allele to the single nucleotide "A" and returns a TRUE/FALSE vector
+alt_alleles == "A"
+
+# Then, we use that logical vector as an index to pull out all the elements that match.
+alt_alleles[alt_alleles == "A"]
+
+# If we repeat this for each nucleotide A, T, G, and C, and connect them using `c()`,
+# we can index all the single nucleotide changes.
 snps <- c(alt_alleles[alt_alleles == "A"],
   alt_alleles[alt_alleles=="T"],
   alt_alleles[alt_alleles=="G"],
@@ -318,7 +341,13 @@
 ```
 
 Whoops! Though the `plot()` function will do its best to give us a quick plot,
-it is unable to do so here. One way to fix this it to tell R to treat the SNPs
+it is unable to do so here. Let's use `str()` to see why this might be:
+
+```{r, purl=FALSE}
+str(snps)
+```
+
+R may not know how to plot a character vector! One way to fix this is to tell R to treat the SNPs
 as categories (i.e. a factor vector); we will create a new object to avoid
 confusion using the `factor()` function:
 
@@ -349,9 +378,12 @@
 
 ```{r, purl=FALSE}
 summary(factor_snps)
+
+# Compare with the summary of the character vector
+summary(snps)
 ```
 
-As you can imagine, this is already useful when you want to generate a tally.
+As you can imagine, factors are already useful when you want to generate a tally.
 
 :::::::::::::::::::::::::::::::::::::::::  callout
 
@@ -812,12 +844,12 @@
 First, in the RStudio menu go to **File**, select **Import Dataset**, and
 choose **From Excel...** (notice there are several other options you can
 explore).

 ![RStudio import menu]("fig/rstudio_import_menu.png")

 Next, under **File/Url:** click the <KBD>Browse</KBD> button and navigate to the **Ecoli\_metadata.xlsx** file located at `/home/dcuser/dc_sample_data/R`.
 You should now see a preview of the data to be imported:

 ![RStudio import screen]("fig/rstudio_import_screen.png")

 Notice that you have the option to change the data type of each variable by