Addressing issues in mega issue section: R basics continued - factors and data frames #290

Merged
merged 5 commits into from
Oct 3, 2024
58 changes: 45 additions & 13 deletions episodes/03-basics-factors-dataframes.Rmd
@@ -185,7 +185,15 @@
of 29 variables (columns). Double-clicking on the name of the object will open
a view of the data in a new tab.

![RStudio data frame view]("fig/rstudio_dataframeview.png")
![RStudio data frame view]("epidoes/fig/rstudio_dataframeview.png")

Check warning on line 188 in episodes/03-basics-factors-dataframes.Rmd (GitHub Actions / Build markdown source files if valid): [image missing alt-text]: "epidoes/fig/rstudio_dataframeview.png"

We can also quickly query the dimensions of the data frame using `dim()`. The first number, `801`, is the number of rows, and the second, `29`, is the number of columns.

```{r, purl=FALSE}
## get the dimensions (rows, columns) of a data frame

dim(variants)
```
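
As a minimal sketch of what `dim()` reports, the made-up data frame `toy` below (a hypothetical stand-in, not the `variants` data) has 3 rows and 2 columns. `nrow()` and `ncol()` return each dimension on its own.

```r
# Hypothetical toy data frame standing in for `variants`
toy <- data.frame(sample_id = c("s1", "s2", "s3"),
                  POS = c(100L, 250L, 400L))

dim(toy)   # rows first, then columns
nrow(toy)  # number of rows only
ncol(toy)  # number of columns only
```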

## Summarizing, subsetting, and determining the structure of a data frame.

@@ -208,12 +216,17 @@
other variables (e.g. `sample_id`) are treated as character data (more on this
in a bit).

There is a lot to work with, so we will subset the first three columns into a
new data frame using the `data.frame()` function.
There is a lot to work with, so we will subset the columns into a new data frame using
the `data.frame()` function. To subset (index) a two-dimensional object, we give the
row and column positions inside square brackets, separated by a comma. The left-hand
side of the comma indicates the rows you want to subset, and the right-hand side
indicates the columns (e.g. `[row index, column index]`).

```{r, purl=FALSE}
## put the first three columns of variants into a new data frame called subset
Let's put columns 1, 2, 3, and 6 into a new data frame called `subset`:

```{r, purl=FALSE}
## Notice that we wrap the column numbers in c() to pass them as a vector
## on the right-hand side of the comma.
subset <- data.frame(variants[, c(1:3, 6)])
```
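
The `[row index, column index]` pattern can be tried on a small example first. This is only a sketch: the data frame `toy` and its column names are made up for illustration and are not part of the lesson data.

```r
# Hypothetical toy data frame; names and values are made up
toy <- data.frame(id = c("a", "b", "c", "d"),
                  pos = c(10L, 20L, 30L, 40L),
                  ref = c("A", "T", "G", "C"),
                  qual = c(91, 85, 99, 70))

toy[2, ]            # row 2, all columns (nothing after the comma)
toy[, 3]            # all rows, column 3 (nothing before the comma)
toy[1:2, c(1, 4)]   # rows 1 and 2, columns 1 and 4
```

Leaving one side of the comma empty means "keep everything" along that dimension, which is why `variants[, c(1:3, 6)]` keeps all rows.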

@@ -228,12 +241,13 @@

Ok, that's a lot to unpack! Some things to notice:

- the object type `data.frame` is displayed in the first row along with its
- The object type `data.frame` is displayed in the first row along with its
dimensions, in this case 801 observations (rows) and 4 variables (columns)
- Each variable (column) has a name (e.g. `sample_id`). This is followed
by the object mode (e.g. chr, int, etc.). Notice that before each
- Each variable (column) has a name (e.g. `sample_id`). Notice that before each
variable name there is a `$` - this will be important later.

- Each variable name is followed by the data type it contains (e.g. chr, int, etc.).
  The `int` type is an integer, a kind of numeric data that can only
  store whole numbers (i.e. no decimal points).
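
To see where the type abbreviations in the list above come from, here is a small sketch using made-up vectors (the names `whole`, `decimal`, and `text` are hypothetical). `class()` gives the full type name, while `str()` prints the short form next to each `$` column name.

```r
# Made-up values, only to show how R labels each type
whole <- c(1L, 2L, 3L)     # the L suffix asks R to store integers
decimal <- c(1.5, 2, 3)    # numbers without L are stored as doubles
text <- c("A", "T", "G")

class(whole)    # "integer"
class(decimal)  # "numeric"
class(text)     # "character"

# str() abbreviates these types when describing a data frame
str(data.frame(whole, decimal, text))
```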


::::::::::::::::::::::::::::::::::::::: challenge
@@ -297,10 +311,19 @@
```

There are 801 alleles (one for each row). To simplify, let's look at just the
single-nucleotide alleles (SNPs). We can use some of the vector indexing skills
from the last episode.
single-nucleotide alleles (SNPs).

Let's review some of the vector indexing skills from the last episode that can help:

```{r, purl=FALSE}
# This compares every allele with the single nucleotide "A" and returns a TRUE/FALSE vector
alt_alleles == "A"

# Then, we use that logical vector as an index to pull out all the matching elements.
alt_alleles[alt_alleles == "A"]

# If we repeat this for each nucleotide A, T, G, and C, and combine the results using `c()`,
# we collect all the single-nucleotide changes.
snps <- c(alt_alleles[alt_alleles == "A"],
alt_alleles[alt_alleles=="T"],
alt_alleles[alt_alleles=="G"],
@@ -318,7 +341,13 @@
```

Whoops! Though the `plot()` function will do its best to give us a quick plot,
it is unable to do so here. One way to fix this is to tell R to treat the SNPs
it is unable to do so here. Let's use `str()` to see why this might be:

```{r, purl=FALSE}
str(snps)
```

`snps` is a character vector, and `plot()` does not know how to plot one! One way to fix this is to tell R to treat the SNPs
as categories (i.e. a factor vector); we will create a new object to avoid
confusion using the `factor()` function:

@@ -349,9 +378,12 @@

```{r, purl=FALSE}
summary(factor_snps)

# Compare with the summary of the original character vector
summary(snps)
```
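
The difference between the two summaries can be reproduced on a small made-up vector. This is a sketch only: `toy_snps` and `toy_factor` are hypothetical names, and the values are not taken from the lesson data.

```r
# Made-up alleles standing in for `snps`
toy_snps <- c("A", "T", "G", "A", "C", "A", "T")

# Character vector: summary() describes the vector rather than counting values
summary(toy_snps)

# Factor: summary() returns one count per level
toy_factor <- factor(toy_snps)
summary(toy_factor)

# The levels are the distinct values, sorted alphabetically by default
levels(toy_factor)
```

Here the factor summary counts three `A`, one `C`, one `G`, and two `T`, which is the kind of tally the character vector does not give directly.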

As you can imagine, this is already useful when you want to generate a tally.
As you can imagine, factors are already useful when you want to generate a tally.

::::::::::::::::::::::::::::::::::::::::: callout

@@ -812,12 +844,12 @@
First, in the RStudio menu go to **File**, select **Import Dataset**, and
choose **From Excel...** (notice there are several other options you can
explore).

Check warning on line 847 in episodes/03-basics-factors-dataframes.Rmd (GitHub Actions / Build markdown source files if valid): [image missing alt-text]: "fig/rstudio_import_menu.png"
![RStudio import menu]("fig/rstudio_import_menu.png")

Next, under **File/Url:** click the <KBD>Browse</KBD> button and navigate to the **Ecoli\_metadata.xlsx** file located at `/home/dcuser/dc_sample_data/R`.
You should now see a preview of the data to be imported:

Check warning on line 852 in episodes/03-basics-factors-dataframes.Rmd (GitHub Actions / Build markdown source files if valid): [image missing alt-text]: "fig/rstudio_import_screen.png"
![RStudio import screen]("fig/rstudio_import_screen.png")

Notice that you have the option to change the data type of each variable by