-
Notifications
You must be signed in to change notification settings - Fork 0
/
2017-12-05 Intro to R.Rmd
218 lines (152 loc) · 10.9 KB
/
2017-12-05 Intro to R.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
---
title: "Intro"
author: "MKJ"
date: "11/28/2017"
output: html_document
---
#Help with installation
- R
- RStudio
#Intro to workshop
- Welcome to R! This workshop is for people with little or no R experience. We will introduce the R statistical computing environment, RStudio, and the dataset that we will work with for the remainder of the lesson. We will cover very basic functionality in R, including variables, functions, and importing/inspecting data frames.
- Who I am
- Who are you? Why do you want to learn R?
#Goals for today
- Learn the basics of R
- Introduce the tidyverse
- Second hour = dplyr
#--------------------------------------------------
#Lesson
### R Studio
Let's start by learning about RStudio. Open RStudio. **R** is the underlying statistical computing environment, but using R alone is no fun. **RStudio** is a graphical integrated development environment that makes using R much easier.
- Panes in RStudio. There are four panes, and their orientation is configurable under "Tools -- Global Options." I just stick with the defaults:
- Editor in the top left
- Console bottom left
- Environment/history on the top right
- Plots/help on the bottom right.
- Projects: first, start a new project in a new folder somewhere easy to remember. When we start reading in data it'll be important that the _code and the data are in the same place._ Creating a project creates an Rproj file that opens R running _in that folder_. This way, when you want to read in dataset _whatever.txt_, you just tell it the filename rather than a full path. This is critical for reproducibility, and we'll talk about that more later.
- Code that you type into the console is code that R executes. From here forward we will use the editor window to write a script that we can save to a file and run it again whenever we want to. We usually give it a `.R` extension, but it's just a plain text file. If you want to send commands from your editor to the console, use `CMD`+`Enter` (`Ctrl`+`Enter` on Windows).
- Anything after a `#` sign is a comment. Use them liberally to *comment your code*.
## Basic operations
R can be used as a glorified calculator. Try typing this in directly into the console. Make sure you're typing into into the editor, not the console, and save your script. Use the run button, or press `CMD`+`Enter` (`Ctrl`+`Enter` on Windows).
```{r calculator}
2+2
5*4
2^3
```
R knows order of operations and scientific notation.
```{r calculator2}
2+3*4/(5+3)*15/2^2+3*4^2
5e4
```
However, to do useful and interesting things, we need to assign _values_ to _objects_. To create objects, we need to give it a name followed by the assignment operator `<-` and the value we want to give it:
```{r assignment}
weight_kg <- 55
```
`<-` is the assignment operator. Assigns values on the right to objects on the left, it is like an arrow that points from the value to the object. Mostly similar to `=` but not always. Learn to use `<-` as it is good programming practice. Using `=` in place of `<-` can lead to issues down the line. The keyboard shortcut for inserting the `<-` operator is `Alt-dash` on a pc or `option-dash` on a mac.
Objects can be given any name such as `x`, `current_temperature`, or `subject_id`. You want your object names to be explicit and not too long. They cannot start with a number (`2x` is not valid but `x2` is). R is case sensitive (e.g., `weight_kg` is different from `Weight_kg`). There are some names that cannot be used because they represent the names of fundamental functions in R. It is also recommended to use nouns for variable names, and verbs for function names.
When assigning a value to an object, R does not print anything. You can force to print the value by typing the name:
```{r printAssignment}
weight_kg
```
Now that R has `weight_kg` in memory, we can do arithmetic with it. For instance, we may want to convert this weight in pounds (weight in pounds is 2.2 times the weight in kg).
```{r modAssignment}
2.2 * weight_kg
```
We can also change a variable's value by assigning it a new one:
```{r newAssignment}
weight_kg <- 57.5
2.2 * weight_kg
```
This means that assigning a value to one variable does not change the values of other variables. For example, let's store the animal's weight in pounds in a variable.
```{r calculationWithVar}
weight_lb <- 2.2 * weight_kg
```
and then change `weight_kg` to 100.
```{r modAssignment2}
weight_kg <- 100
```
What do you think is the current content of the object `weight_lb`? 126.5 or 220?
## Functions
R has built-in functions.
```{r fns}
# Notice that this is a comment.
# Anything behind a # is "commented out" and is not run.
sqrt(144)
log(1000)
```
Get help by typing a question mark in front of the function's name, or `help(functionname)`:
```
help(log)
?log
```
Note syntax highlighting when typing this into the editor. Also note how we pass *arguments* to functions. The `base=` part inside the parentheses is called an argument, and most functions use arguments. Arguments modify the behavior of the function. Functions some input (e.g., some data, an object) and other options to change what the function will return, or how to treat the data provided. Finally, see how you can *nest* one function inside of another (here taking the square root of the log-base-10 of 1000).
```{r log}
log(1000)
log(1000, base=10)
log(1000, 10)
sqrt(log(1000, base=10))
```
----
**EXERCISE `r .ex``r .ex=.ex+1`**
See `?abs` and calculate the square root of the log-base-10 of the absolute value of `-4*(2550-50)`. Answer should be `2`.
----
## Data Frames
There are _lots_ of different basic data structures in R. If you take any kind of longer introduction to R you'll probably learn about arrays, lists, matrices, etc. Let's skip straight to the data structure you'll probably use most -- the **data frame**. We use data frames to store heterogeneous tabular data in R: tabular, meaning that individuals or observations are typically represented in rows, while variables or features are represented as columns; heterogeneous, meaning that columns/features/variables can be different classes (on variable, e.g. age, can be numeric, while another, e.g., cause of death, can be text).
We will initially be using the `read_*` functions from the [**readr** or **tidyverse** package]. These functions load data into a _tibble_ instead of R's traditional data.frame. Tibbles are data frames, but they tweak some older behaviors to make life a little easier. There is a nice reading about this on Hadley Wickham's r4ds.had.co.nz page.
The data we are going to look at is cleaned data from the National Health and Nutrition Examination Survey from the CDC.
There are some built-in functions for reading in data in text files. These functions are _read-dot-something_ -- for example, `read.csv()` reads in comma-delimited text data; `read.delim()` reads in tab-delimited text, etc. We're going to read in data a little bit differently here using the [readr](http://readr.tidyverse.org/) package. When you load the readr package, you'll have access to very similar looking functions, named _read-underscore-something_ -- e.g., `read_csv()`. You have to have the readr package installed to access these functions. Compared to the base functions, they're _much_ faster, they're good at guessing the types of data in the columns, they don't do some of the other silly things that the base functions do. We're going to use another package later on called [dplyr](https://cran.r-project.org/web/packages/dplyr/index.html), and if you have the dplyr package loaded as well, and you read in the data with readr, the data will display nicely.
First let's install and load those packages.
```{r loadpkgs}
#install.packages("tidyverse")
library(tidyverse)
```
If you see a warning that looks like this: `Error in library(packageName) : there is no package called 'packageName'`, then you don't have the package installed correctly.
### `read_csv()`
Now, let's actually load the data. You can get help for the import function with `?read_csv`. When we load data we assign it to a variable just like any other, and we can choose a name for that data. Since we're going to be referring to this data a lot, let's give it a short easy name to type. I'm going to call it `nh`. Once we've loaded it we can type the name of the object itself (`nh`) to see it printed to the screen.
```{r loaddata}
nh <- read_csv(file="nhanes.csv")
nh
```
Take a look at that output. The nice thing about loading dplyr and reading in data with readr is that data frames are displayed in a much more friendly way. This dataset has 5,000 rows and 32 columns. When you import data this way and try to display the object in the console, instead of trying to display all 5,000 rows, you'll only see 10 by default. Also, if you have so many columns that the data would wrap off the edge of your screen, those columns will not be displayed, but you'll see at the bottom of the output which, if any, columns were hidden from view. If you want to see the whole dataset, there are two ways to do this. First, you can click on the name of the data.frame in the **Environment** panel in RStudio. Or you could use the `View()` function (_with a capital V_).
```{r view, eval=FALSE}
View(nh)
```
## Inspecting data.frame objects
There are several built-in functions that are useful for working with data frames.
* Content:
* `head()`: shows the first few rows
* `tail()`: shows the last few rows
* Size:
* `dim()`: returns a 2-element vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
* `nrow()`: returns the number of rows
* `ncol()`: returns the number of columns
* Summary:
* `colnames()` (or just `names()`): returns the column names
* `str()`: structure of the object and information about the class, length and content of each column
* `summary()`: works differently depending on what kind of object you pass to it. Passing a data frame to the `summary()` function prints out useful summary statistics about numeric column (min, max, median, mean, etc.)
```{r data_frame_functions}
head(nh)
tail(nh)
dim(nh)
names(nh)
str(nh)
summary(nh)
```
## Accessing variables & subsetting data frames
We can access individual variables within a data frame using the `$` operator, e.g., `mydataframe$specificVariable`. Let's print out the SmokingStatus entries in the data. Then let's calculate the average Weight across all the data (using the built-in `mean()` function).
```{r}
# display all Smoking Statuses
nh$SmokingStatus
#mean Weight (kg)
mean(nh$Weight)
```
Oops. We got NA as the average weight. Why?
```{r}
mean(nh$Weight, na.rm = TRUE) #kg
```
Ok. That makes sense. We might actually be interested in seeing how Smoking, Diabetes, etc impact Weight. We'll get to that in the dplyr lesson
----
**EXERCISE `r .ex``r .ex=.ex+1`**
1. What's the standard deviation of Income (hint: get help on the `sd` function with `?sd`).
1. What's the range of Pulse represented in the data? (hint: `range()`).