lecture_day1-intermediate.Rmd

---
title: "R 'hadley' workshop"
author: "Aurélien Ginolhac"
date: "2^nd^ June 2016"
output:
  ioslides_presentation:
    css: style.css
    logo: img/uni.png
    smaller: yes
    fig_width: 6
    fig_height: 5
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, cache = TRUE)
```

# dplyr

## dplyr - intro

There's a cheatsheet!

```{r echo=FALSE, out.width='90%'}
knitr::include_graphics("http://lsru.github.io/r_workshop/img/dplyr_cheatsheet.png")
```


```{r}
with(mtcars, aggregate(mpg, list(cyl), mean))

library("dplyr") # library("dplyr", warn.conflicts = FALSE)
mtcars %>%
  group_by(cyl) %>%
  summarize(mean(mpg))
```

[source by Steve Simpson](http://data-steve.github.io/base-r-groupby-tapply-ave-by/)

## Web-based app to learn by practise

step-by-step tidying and manipulating data frames

https://exploratory.io/

## nycflights13

`flights` is a `tbl_df`

```{r}
library("dplyr", warn.conflicts = FALSE)
library("nycflights13")
flights
```

## glimpse


Use `glimpse` to show some values and types per column. Environment tab does it too

```{r}
glimpse(flights)
```

## filter: inspect subsets of data

How many flights flew to La Guardia, NY in 2013? Expecting none...

```{r}
flights %>%
  filter(dest == "LGA")
```
base version equivalent (`subset` could also be used)
```{r}
flights[which(flights$dest == "LGA"), ]
```

## filter: multiple conditions AND (&)

How many flights flew to Madison in first week of January?

```{r}
# Comma separated conditions are combined with '&'
flights %>%
  filter(dest == "MSN", month == 1, day <= 7)
```


## filter:  multiple conditions OR

```{r, eval = FALSE}
flights %>%
  filter(dest == "MSN" | dest == "ORD" | dest == "MDW")
```

For more complicated checks, prefer a set operation.  
The following 2 are equivalent:

```{r, eval = FALSE}
flights %>%
  filter(is.element(dest, c("MSN", "ORD", "MDW")))
```

```{r, eval = FALSE}
flights %>%
  filter(dest %in% c("MSN", "ORD", "MDW"))
```


## arrange: sort columns

Perform a nested sorting of all flights in NYC:

    1. By which airport they departed
    2. year
    3. month
    4. day


```{r}
flights %>%
  arrange(origin, year, month, day)
```

## arrange desc: reverses sorting

Find the longest delays for flights to Madison.
<!--Find the longest delayed flights to Madison.-->
<!-- This would mean to search for the longest flight whatever the delay...-->

```{r}
flights %>%
  filter(dest == "MSN") %>%
  arrange(desc(arr_delay)) %>%
  select(arr_delay, everything()) # way to reorder arr_delay 1st column
```

## 

Find the most delayed (in minutes) flight in 2013

```{r}
flights %>%
  arrange(desc(arr_delay)) %>%
  select(arr_delay, everything()) %>% head(3)
```

```{r}
1272 / 60
```

## select columns

```{r}
flights %>%
  select(origin, year, month, day)
```

## select's helpers

`select` has many helper functions. See `?select`.

works with `tidyr` functions too
```{r}
flights %>%
  select(origin, year:day, starts_with("dep"))
```

## negative selecting

We can drop columns by "negating" the name. Since helpers
give us column names, we can negate them too.

```{r}
flights %>%
  select(-dest, -starts_with("arr"),
         -ends_with("time"))
```

## Recap: Verbs for inspecting data


* convert to a `tbl_df`. Now in `[tibble](https://github.com/hadley/tibble)`
* `glimpse` - some of each column
* `filter` - subsetting
* `arrange` - sorting (`desc` to reverse the sort)
* `select` - picking (and omitting) columns

## rename

Rename columns with `rename(NewName = OldName)`. To keep the order
correct, read/remember the renaming `=` as "was".

```{r}
flights %>%
  rename(y = year, m = month, d = day)
```


## mutate

- How much departure delay did the flight make up in the air?  
- Note that new variables can be used right away

```{r}
flights %>%
  mutate(
    gain = arr_delay - dep_delay,
    speed = (distance / air_time) * 60,
    gain_per_hour = gain / (air_time / 60)) %>%
  select(gain:gain_per_hour)
```


## Could the gain be explained by speed?


```{r, warning=FALSE, fig.height=3.5}
library("ggplot2")
flights %>%
  mutate(gain = arr_delay - dep_delay,
    speed = (distance / air_time) * 60) %>%
  sample_n(10000) %>% # subsample 1e5 rows randomly
  ggplot(aes(x = gain, y = speed))+
  geom_point(alpha = 0.4)
```


## group_by

For the flights to Madison, let's compute the average delay per month.

instead of `aggregate`, `dplyr` has its own grouping function.  
Here, we `group_by` date. See the helpful reminder from `tbl_df` print method

```{r}
flights %>%
  filter(dest == "MSN") %>%
  group_by(month) %>%
  #Some values are missing, thus tell `mean` to remove them from the calculation.
  summarise(mean_dep_delay = mean(dep_delay, na.rm = TRUE))
```


## group_by (2)

Work per day, note the tibble info about the 365 groupings

```{r}
by_day <- flights %>%
  group_by(year, month, day)
by_day
```

Note that one level (right most) is removed from grouping.


## summarise

Now we use `summarise` to compute (several) aggregate values within
each group (per day). `summarise` returns one row per group.

```{r}
by_day %>%
  summarise(
    flights = n(), # dplyr specific function
    avg_delay = mean(dep_delay, na.rm = TRUE),
    n_planes = n_distinct(tailnum)) # dplyr specific function
```

## Exercice

* In average, how many flights does a single plane perform each day?

* plot the distribution, display the mean / median (`geom_vline()`)

* plot the average delay per day. Use `tidyr:unite` and `as.Date`

* which day should be avoided?

## Solution 1

```{r}
by_day %>%
  summarise(flights = n(),
            avg_delay = mean(dep_delay, na.rm = TRUE),
            n_planes = n_distinct(tailnum)) %>%
  mutate(avg_flights = flights / n_planes)
```

## Solution 2


```{r, fig.align = 'center', out.width='100%', fig.height=3.1}
by_day %>%
  summarise(flights = n(),
            avg_delay = mean(dep_delay, na.rm = TRUE),
            n_planes = n_distinct(tailnum)) %>%
  mutate(avg_flights = flights / n_planes) %>%
  ggplot()+
  geom_density(aes(x = avg_flights))+
  geom_vline(aes(xintercept = mean(avg_flights)), colour = "red")+
  geom_vline(aes(xintercept = median(avg_flights)), colour = "blue")
```


## Solution 3

```{r, fig.align = 'center', out.width='100%', fig.height=3.3, warning = FALSE}
library("tidyr")
by_day %>%
  summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ungroup() %>% 
  unite(date, -avg_delay, sep = "-") %>%
  mutate(date = as.Date(date)) %>%
  ggplot()+geom_bar(aes(x = date, y = avg_delay), stat = "identity")
```

## Solution 3


```{r, fig.align = 'center', out.width='80%', fig.height=3.5}
by_day %>%
  summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_delay))
```

What's wrong?

## Solution 3.2

Mind that `arrange` uses grouping! (will change in version `0.4.5`)

```{r, fig.align = 'center', out.width='100%'}
by_day %>%
  summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ungroup %>%
  arrange(desc(avg_delay))
```


## Exercice


* Find the destinations with the highest average arrival delay?
  - discard flights with missing arrival delays
  - count the number of flights per destination
  - discard results with less than 10 flights: mean will not be meaningful

## Solution


```{r}
flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(dest) %>%
  summarise(mean = mean(arr_delay),
            n = n()) %>%
  filter(n > 10) %>%
  arrange(desc(mean))
  
```


## Is there a spatial pattern for those delays?

First get the GPS coordinate of airports using the data frame `airports`

```{r}
airports
```

## join two data frames

```{r}
delays <- flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(dest) %>%
  summarise(mean = mean(arr_delay),
            n = n()) %>%
  filter(n > 10) %>%
  arrange(desc(mean)) %>%
  inner_join(airports, by = c("dest" = "faa")) # provide the equivalence since columns have a different name
```

We could have used **left_join** but 4 rows with a 3-letters acronym have no correspondance in the `airports` data frame. 

**inner_join** narrows down the lines that are present in both data frames.


## join types

```{r, echo=FALSE, out.width='60%'}
knitr::include_graphics("http://www.dofactory.com/Images/sql-joins.png")
```

Of note: **anti_join** can select rows in which identifiers are **absent** in the second data frame.

## plot on a map

```{r, echo=TRUE, eval=FALSE, out.width='100%'}
library("ggplot2")
library("maps") # US map
ggplot(delays)+
  geom_point(aes(x = lon, y = lat, colour = mean), size = 3, alpha = 0.8)+
  scale_color_gradient2()+borders("state")
```
```{r, echo=FALSE, out.width='70%'}
knitr::include_graphics("img/delays_usmap_grey.png")
```

## plot on a map, with text

<!-- I'm not able to set warning or message to FALSE! -->
<!-- Index out of range error for map_data... -->
```{r, eval=FALSE, message = TRUE, warning = TRUE, out.width='100%'}
library("ggrepel")
filter(delays, lon > -140) %>% # remove Honolulu
  ggplot()+geom_point(aes(x = lon, y = lat, colour = mean), size = 3, alpha = 0.8)+
  geom_text_repel(aes(x = lon, y = lat, label = name), size = 2.5)+
  scale_color_gradient2()+theme_classic()+borders("state")
```
```{r, echo=FALSE, out.width='70%'}
knitr::include_graphics("img/delays_usmap.png")
```

## plot on a map, with conditional text

```{r, out.width='100%', eval=FALSE}
filter(delays, lon > -140) %>% # remove Honolulu
  ggplot()+geom_point(aes(x = lon, y = lat, colour = mean),
                      size = 3, alpha = 0.8)+borders("state")+
  geom_label_repel(data = delays %>% filter(mean > 20),
            aes(x = lon, y = lat + 1, label = name), fill = "brown", colour = "white", size = 3)+
  scale_color_gradient2()+theme_classic()
```

```{r, echo=FALSE, out.width='70%'}
knitr::include_graphics("img/delays_cond_usmap.png")
```

## tally / count

`tally` is a shortcut to counting the number of items per group.

```{r}
flights %>%
  group_by(dest, month) %>%
  tally() %>% head(3) # could sum up with multiple tally calls
```
`count` does the grouping for you
```{r}
flights %>%
  count(dest, month) %>% head(3)
```


## That covers 80% of dplyr


- select
- filter
- arrange
- glimpse
- rename
- mutate
- group_by, ungroup
- summarise

## Other 20%


- assembly: `bind_rows`, `bind_cols`
- windows function, `min_rank`, `dense_rank`, `cumsum`
- column-wise operations: `mutate_each`, `transmute`, `summarise_each`
- join tables together: `right_join`, `full_join`
- filtering joins: `semi_join`, `anti_join`
- `do`: arbitrary code on each chunk
- different types of tabular data (databases, data.tables)

## bind_rows + purrr

How to read in and merge files in 2 lines

```{r}
library("purrr")
library("readxl")
files <- list.files(path = "./data/", pattern = "xlsx$", full.names = TRUE)
df <- lapply(files, read_excel) %>% bind_rows(.id = "file_number")
```

using `purrr`
```{r}
purr_df <- map_df(files, read_excel, .id = "file_number")
```
filenames are better
```{r}
library("stringr")
files %>%
  set_names(nm = str_match(basename(.), "([^.]+)\\.[[:alnum:]]+$")[, 2]) %>%
  map_df(read_excel, .id = "file_name") -> purr_name_df
```


# Appendix

## Coding style


`R` has a rather flexible and permissive syntax. However, being more strict tends to ease the debugging process.

See [Hadley's recommendations](http://adv-r.had.co.nz/Style.html)

```{r, eval=FALSE}
long_function_name <- function(a = "a long argument", 
                               b = "another argument",
                               c = "another long argument") {
  # As usual code is indented by two spaces.
}
```


## Useful rstudio shortcuts


- Scripting (replace  <kbd>Cmd</kbd> by <kbd>Ctrl</kbd> for PC)
    + <kbd>Cmd</kbd> + <kbd>-</kbd>: insert <kbd> <- </kbd>
    + <kbd>Cmd</kbd> + <kbd>Shift</kbd> + <kbd>M</kbd>: insert <kbd>%>%</kbd>
    + <kbd>Alt</kbd> + <kbd>↑</kbd> or <kbd>↓</kbd>: move line up / down
    + <kbd>Cmd</kbd> + <kbd>Alt</kbd> + <kbd>↑</kbd> or <kbd>↓</kbd>: copy line up   / down
    + <kbd>Ctrl</kbd> + <kbd>Alt</kbd> + <kbd>↑</kbd> or <kbd>↓</kbd>: multi-line edition
- `# analysis step ####` for navigating
- Running
    + <kbd>Cmd</kbd> + <kbd>Enter</kbd>: run code
    + <kbd>Cmd</kbd> + <kbd>Shift</kbd> + <kbd>P</kbd>: re-run Previous code
    + <kbd>Cmd</kbd> + <kbd>Shift</kbd> + <kbd>K</kbd>: Knit document


## Rstudio addins

[Addins](https://rstudio.github.io/rstudioaddins/) are small integrated packages that solve small tasks.

```{r, echo = FALSE, out.width='60%'}
knitr::include_graphics("img/rstudio_addins.png")
```

[Dean Attali](deanattali.com/) created a [package](https://github.com/daattali/addinslist#readme) to explore and manage your addins.

Sometimes, cause crashes

## Recommended reading

- [R for data science](http://r4ds.had.co.nz/)

```{r, echo = FALSE, out.width='10%'}
knitr::include_graphics("https://raw.githubusercontent.com/hadley/r4ds/master/cover.png")
```

- [Advanced in R](http://adv-r.had.co.nz/) by Hadley

- About R weirdness
    + [R inferno](http://www.burns-stat.com/documents/books/the-r-inferno/) by Patrick Burns
    + [Rbitrary](https://ironholds.org/projects/rbitrary/) by Oliver Keyes
    
See why:
- `<-` and not `=`
- `::` and not `:::`
- `library` and not `require` etc.

## Acknowledgments

* Hadley Wickham
* Steve Simpson
* Jenny Bryan
* Dean Attali
* David Robinson
* Eric Koncina