-
Notifications
You must be signed in to change notification settings - Fork 0
/
mod_wrangle.qmd
565 lines (405 loc) · 31 KB
/
mod_wrangle.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
---
title: "Data Harmonization & Wrangling"
code-annotations: hover
---
## Overview
Now that we have covered how to find data and use data visualization methods to explore it, we can move on to combining separate data files and preparing that combined data file for analysis. For the purposes of this module we're adopting a very narrow view of harmonization and a very broad view of wrangling but this distinction aligns well with two discrete philosophical/practical arenas. To make those definitions explicit:
- <u>"Harmonization" = process of combining separate primary data objects into one object</u>. This includes things like synonymizing columns, or changing data format to support combination. This _excludes_ quality control steps--even those that are undertaken before harmonization begins.
- <u>"Wrangling" = all modifications to data meant to create an analysis-ready 'tidy' data object</u>. This includes quality control, unit conversions, and data 'shape' changes to name a few. Note that attaching ancillary data to your primary data object (e.g., attaching temperature data to a dataset on plant species composition) _also falls into this category!_
## Learning Objectives
After completing this module you will be able to:
- <u>Identify</u> typical steps in data harmonization and wrangling workflows
- <u>Create</u> a harmonization workflow
- <u>Define</u> quality control
- <u>Summarize</u> typical operations in a quality control workflow
- <u>Use</u> regular expressions to perform flexible text operations
- <u>Write</u> custom functions to reduce code duplication
- <u>Identify</u> value of and typical obstacles to data 'joining'
- <u>Explain</u> benefits and drawbacks of using data shape to streamline code
- <u>Design</u> a complete data wrangling workflow
## Preparation
TBD (<u>T</u>o <u>B</u>e <u>D</u>etermined)
## Needed Packages
If you'd like to follow along with the code chunks included throughout this module, you'll need to install the following packages:
```{r install-pkgs}
#| eval: false
# Note that these lines only need to be run once per computer
## So you can skip this step if you've installed these before
install.packages("ltertools")
install.packages("lterdatasampler")
install.packages("psych")
install.packages("supportR")
install.packages("tidyverse")
```
We'll load the Tidyverse meta-package here to have access to many of its useful tools when we need them later.
```{r libs}
#| message: false
# Load tidyverse
library(tidyverse)
```
## Harmonizing Data
Data harmonization is an interesting topic in that it is _vital_ for synthesis projects but only very rarely relevant for primary research. Synthesis projects must reckon with the data choices made by each team of original data collectors. These collectors may or may not have recorded their judgement calls (or indeed, any metadata) but before synthesis work can be meaningfully done these independent datasets must be made comparable to one another and combined.
For tabular data, we recommend using the [`ltertools` R package](https://lter.github.io/ltertools/) to perform any needed harmonization. This package relies on a "column key" to translate the original column names into equivalents that apply across all datasets. Users can generate this column key however they would like but Google Sheets is a strong option as it allows multiple synthesis team members to simultaneously work on filling in the needed bits of the key. If you already have a set of files locally, `ltertools` does offer a `begin_key` function that creates the first two required columns in the column key.
The column key requires three columns:
1. "source" -- Name of the raw file
2. "raw_name" -- Name of all raw columns in that file to be synonymized
3. "tidy_name" -- New name for each raw column that should be carried to the harmonized data
Note that any raw names either not included in the column key or that lack a tidy name equivalent will be excluded from the final data object. For more information, consult the `ltertools` [package vignette](https://lter.github.io/ltertools/articles/ltertools.html). For convenience, we're attaching the visual diagram of this method of harmonization from the package vignette.
<p align="center">
<img src="images/image_harmonize-workflow.png" alt="Four color-coded tables are in a soft rectangle. One is pulled out and its column names are replaced based on their respective 'tidy names' in the column key table. This is done for each of the other tables then the four tables--with fixed column names--are combined into a single data table" width="90%">
</p>
## Wrangling Data
Data wrangling is a _huge_ subject that covers a wide range of topics. In this part of the module, we'll attempt to touch on a wide range of tools that may prove valuable to your data wrangling efforts. This is certainly non-exhaustive and you'll likely find new tools that fit your coding style and professional intuition better. However, hopefully the topics covered below provide a nice 'jumping off' point to reproducibly prepare your data for analysis and visualization work later in the lifecycle of the project.
To begin, we'll load the Plum Island Ecosystems fiddler crab dataset we've used in other modules.
```{r prep}
#| message: false
#| warning: false
# Load the lterdatasampler package
library(lterdatasampler)
# Load the fiddler crab dataset
data(pie_crab)
```
### Exploring the Data
Before beginning any code operations, it's important to get a sense for the data. Characteristics like the dimensions of the dataset, the column names, and the type of information stored in each column are all crucial pre-requisites to knowing what tools can or should be used on the data.
Checking the data structure is one way of getting a lot of this high-level information.
```{r structure}
# Check dataset structure
str(pie_crab)
```
For data that are primarily numeric, you may find data summary functions to be valuable. Note that most functions of this type do not provide useful information on text columns so you'll need to find that information elsewhere.
```{r summary}
# Get a simple summary of the data
summary(pie_crab)
```
For text columns it can sometimes be useful to simply look at the unique entries in a given column and sort them alphabetically for ease of parsing.
```{r sort-unique}
# Look at the sites included in the data
sort(unique(pie_crab$site))
```
For those of you who think more visually, a histogram can be a nice way of examining numeric data. There are simple histogram functions in the 'base' packages of most programming languages but it can sometimes be worth it to use those from special libraries because they can often convey additional detail.
```{r multi-hist}
#| message: false
#| fig-align: center
#| fig-width: 4
#| fig-height: 3
# Load the psych library
library(psych)
# Get the histogram of crab "size" (carapace width in mm)
psych::multi.hist(pie_crab$size)
```
:::{.callout-warning icon="false"}
#### Discussion: Data Exploration Tools
With a group of 4-5 others discuss the following questions:
- What tools do you use when exploring new data?
- Do you already use any of these tools to explore your data?
- If you do, why do you use them?
- If not, where do you think they might be valuable to include?
- What value--if any--do you see in including these exploratory efforts in your code workflow?
:::
### Quality Control
You may have encountered the phrase "QA/QC" (<u>Q</u>uality <u>A</u>ssurance / <u>Q</u>uality <u>C</u>ontrol) in relation to data cleaning. Technically, quality assurance only encapsulates _preventative_ measures for reducing errors. One example of QA would be using a template for field datasheets because using standard fields reduces the risk that data are recorded inconsistently and/or incompletely. Quality control on the other hand refers to all steps taken to resolve errors _after_ data are collected. Any code that you write to fix typos or remove outliers from a dataset falls under the umbrella of QC.
In synthesis work, QA is only very rarely an option. You'll be working with datasets that have already been collected and attempting to handle any issues _post hoc_ which means the vast majority of data wrangling operations will be quality control methods. These QC efforts can be **incredibly** time-consuming so using a programming language (like R or Python) is a dramatic improvement over manually looking through the data using Microsoft Excel or other programs like it.
#### QC Considerations
The datasets you gather for your synthesis project will likely have a multitude of issues you'll need to resolve before the data are ready for visualization or analysis. Some of these issues may be clearly identified in that datasets' metadata or apply to all datasets but it is good practice to make a thorough QC effort as early as is feasible. Keep the following data issues and/or checks in mind as we cover code tools that may be useful in this context later in the module.
- Verify taxonomic classificiations against authorities
- [ITIS](https://www.itis.gov/), [GBIF](https://www.gbif.org/), and [WoRMS](https://www.marinespecies.org/) are all examples of taxonomic authorities
- Note that many of these authorities have R or Python libraries that can make this verification step scripted rather than dependent on manual searches
- Handle missing data
- Some datasets will use a code to indicate missing values (likely identified in their metadata) while others will just have empty cells
- Check for unreasonable values / outliers
- Can use conditionals to create "flags" for these values or just filter them out
- Check geographic coordinates' reasonability
- E.g., western hemisphere coordinates may lack the minus sign
- Check date formatting
- I.e., if all sampling is done in the first week of each month it can be difficult to say whether a given date is formatted as MM/DD/YY or DD/MM/YY
- Consider spatial and temporal granularity among datasets
- You may need to aggregate data from separate studies in different ways to ensure that the data are directly comparable across all of the data you gather
- Handle duplicate data / rows
#### Number Checking
When you read in a dataset and a column that _should be_ numeric is instead read in as a character, it can be a sign that there are malformed numbers lurking in the background. Checking for and resolving these non-numbers is preferable to simply coercing the column into being numeric because the latter method typically changes those values to 'NA' where a human might be able to deduce the true number each value 'should be.'
```{r num-check}
# Load the supportR package
library(supportR)
# Create an example dataset with non-numbers in ideally numeric columns
fish_ct <- data.frame("species" = c("salmon", "bass", "halibut", "moray eel"),
"count" = c(12, "14x", "_23", 1))
# Check for malformed numbers in column(s) that should be numeric
bad_nums <- supportR::num_check(data = fish_ct, col = "count")
```
In the above example, "14x" would be coerced to NA if you simply force the column without checking but you could drop the "x" with text replacing methods once you use tools like this one to flag it for your attention.
#### Text Replacement
One of the simpler ways of handling text issues is just to replace a string with another string. Most programming languages support this functionality.
```{r text-replace}
# Use pattern match/replace to simplify problem entries
fish_ct$count <- gsub(pattern = "x|_", replacement = "", x = fish_ct$count)
# Check that they are fixed
bad_nums <- supportR::num_check(data = fish_ct, col = "count")
```
The vertical line in the `gsub` example above lets us search for (and replace) multiple patterns. Note however that while you can search for many patterns at once, only a single replacement value can be provided with this function.
#### Regular Expressions
You may sometimes want to perform more generic string matching where you don't necessarily know--or want to list--all possible strings to find and replace. For instance, you may want remove any letter in a numeric column or find and replace numbers with some sort of text note. "Regular expressions" are how programmers specify these generic matches and using them can be a nice way of streamlining code.
```{r regex}
# Make a test vector
regex_vec <- c("hello", "123", "goodbye", "456")
# Find all numbers and replace with the letter X
gsub(pattern = "[[:digit:]]", replacement = "x", x = regex_vec)
# Replace any number of letters with only a single 0
gsub(pattern = "[[:alpha:]]+", replacement = "0", x = regex_vec)
```
The [`stringr` package cheatsheet](https://github.com/rstudio/cheatsheets/blob/afaa1fec4c5b9aabfa886218b6ba20317446d378/strings.pdf) has a really nice list of regular expression options that you may find valuable if you want to delve deeper on this topic. Scroll to the second page of the PDF to see the most relevant parts.
### Conditionals
Rather than finding and replacing content, you may want to create a new column based on the contents of a different column. In plain language you might phrase this as 'if column X has \[some values\] then column Y should have \[other values\]'. These operations are called <u>conditionals</u> and are an important part of data wrangling.
If you only want your conditional to support two outcomes (as in an either/or statement) there are useful functions that support this. Let's return to our Plum Island Ecosystems crab dataset for an example.
```{r ifelse}
# Make a new colum with an either/or conditional
pie_crab_v2 <- pie_crab %>%
dplyr::mutate(size_category = ifelse(test = (size >= 15), # <1>
yes = "big",
no = "small"),
.after = size)
# Count the number of crabs in each category
pie_crab_v2 %>%
dplyr::group_by(size_category) %>%
dplyr::summarize(crab_ct = dplyr::n())
```
1. `mutate` makes a new column, `ifelse` is actually doing the conditional
If you have multiple different conditions you _can_ just stack these either/or conditionals together but this gets cumbersome quickly. It is preferable to instead use a function that supports as many alternates as you want!
```{r case-when}
# Make a new column with several conditionals
pie_crab_v2 <- pie_crab %>%
dplyr::mutate(size_category = dplyr::case_when(
size <= 10 ~ "tiny", # <1>
size > 10 & size <= 15 ~ "small",
size > 15 & size <= 20 ~ "big",
size > 20 ~ "huge",
TRUE ~ "uncategorized"), # <2>
.after = size)
# Count the number of crabs in each category
pie_crab_v2 %>%
dplyr::group_by(size_category) %>%
dplyr::summarize(crab_ct = dplyr::n())
```
1. Syntax is 'test ~ what to do when true'
2. This line is a catch-all for any rows that _don't_ meet previous conditions
Note that you can use functions like this one when you do have an either/or conditional if you prefer this format.
:::{.callout-note icon="false"}
#### Activity: Conditionals
In a script, attempt the following with the PIE crab data:
- Create a column indicating when air temperature is above or below 13° Fahrenheit
- Create a column indicating whether water temperature is lower than the first quartile, between the first quartile and the median water temp, between the median and the third quartile or greater than the third quartile
- _Hint:_ consult the `summary` function output!
:::
### Uniting / Separating Columns
Sometimes one column has multiple pieces of information that you'd like to consider separately. A date column is a common example of this because particular months are always in a given season regardless of the specific day or year. So, it can be useful to break a complete date (i.e., year/month/day) into its component bits to be better able to access those pieces of information.
```{r separate-wider-delim}
# Split date into each piece of temporal info
pie_crab_v3 <- pie_crab_v2 %>%
tidyr::separate_wider_delim(cols = date,
delim = "-", # <1>
names = c("year", "month", "day"),
cols_remove = TRUE) # <2>
# Check that out
str(pie_crab_v3)
```
1. 'delim' is short for "delimiter" which we covered in the Reproducibility module
2. This argument specifies whether to remove the original column when making the new columns
While breaking apart a column's contents can be useful, it can also be helpful to combine the contents of several columns!
```{r unite}
# Re-combine data information back into date
pie_crab_v4 <- pie_crab_v3 %>%
tidyr::unite(col = "date",
sep = "/", # <1>
year:day,
remove = FALSE) # <2>
# Structure check
str(pie_crab_v4)
```
1. This is equivalent to the 'delim' argument in the previous function
2. Comparable to the 'cols_remove' argument in the previous function
Note in this output how despite re-combining data information the column is listed as a character column! Simply combining or separating data is not always enough so you need to really lean into frequent data structure checks to be sure that your data are structured in the way that you want.
### Joining Data
Often the early steps of a synthesis project involve combining the data tables horizontally. You might imagine that you have two groups' data on sea star abundance and--once you've synonymized the column names--you can simply 'stack' the tables on top of one another. Slightly trickier but no less common is combining tables by the contents of a shared column (or columns). Cases like this include wanting to combine your sea star table with ocean temperature data from the region of each group's research. You can't simply attach the columns because that assumes that the row order is identical between the two data tables (and indeed, that there are the same number of rows in both to begin with!). In this case, if both data tables shared some columns (perhaps "site" and coordinate columns) you can use **joins** to let your computer match these key columns and make sure that only appropriate rows are combined.
Because joins are completely dependent upon the value in both columns being an _exact_ match, it is a good idea to carefully check the contents of those columns before attempting a join to make sure that the join will be successful.
```{r diff-check-1}
# Create a fish taxonomy dataframe that corresponds with the earlier fish dataframe
fish_tax <- data.frame("species" = c("salmon", "bass", "halibut", "eel"),
"family" = c("Salmonidae", "Serranidae", "Pleuronectidae", "Muraenidae"))
# Check to make sure that the 'species' column matches between both tables
supportR::diff_check(old = fish_ct$species, new = fish_tax$species)
```
```{r diff-check-2}
# Use text replacement methods to fix that mistake in one table
fish_tax_v2 <- fish_tax %>%
dplyr::mutate(species = gsub(pattern = "^eel$", # <1>
replacement = "moray eel",
x = species))
# Re-check to make sure that fixed it
supportR::diff_check(old = fish_ct$species, new = fish_tax_v2$species)
```
1. The symbols around "eel" mean that we're only finding/replacing _exact_ matches. It doesn't matter in this context but often replacing a partial match would result in more problems. For example, replacing "eel" with "moray eel" could make "electric eel" into "electric moray eel".
Now that the shared column matches between the two two dataframes we can use a join to combine them! There are four types of join:
1. left/right join
2. full join (a.k.a. outer join)
3. inner join
4. anti join
You can learn more about the types of join [here](https://nceas.github.io/scicomp-workshop-tidyverse/join.html) or [here](https://njlyon0.github.io/teach_r-for-biologists/materials/slides_4a.html#/title-slide) but the quick explanation is that <u>the name of the join indicates whether the rows of the "left" and/or the "right" table are retained in the combined table</u>. In synthesis work a left join or full join is most common (where you have your primary data in the left position and some ancillary/supplementary dataset in the right position).
```{r join}
# Let's combine the fish count and fish taxonomy information
fish_df <- fish_ct %>%
# Actual join step
dplyr::left_join(y = fish_tax_v2, by = "species") %>% # <1>
# Move 'family' column to the left of all other columns
dplyr::relocate(family, .before = dplyr::everything())
# Look at the result of that
fish_df
```
1. The 'by' argument accepts a vector of column names found in both data tables
:::{.callout-note icon="false"}
#### Activity: Separating Columns & Joining Data
In a script, attempt the following with the PIE crab data:
1. Create a data frame where you bin months into seasons (i.e., winter, spring, summer, fall)
- Use your judgement on which month(s) should fall into each given PIE's latitude/location
2. Join your season table to the PIE crab data based on month
- _Hint:_ you may need to modify the PIE dataset to ensure both data tables share at least one column upon which they can be joined
3. Calculate the average size of crabs in each season in order to identify which season correlates with the largest crabs
:::
### Leveraging Data Shape
You may already be familiar with data shape but fewer people recognize how playing with the shape of data can make certain operations _dramatically_ more efficient. If you haven't encountered it before, any data table can be said to have one of two 'shapes': either **long** or **wide**. Wide data have all measured variables from a single observation in one row (typically resulting in more columns than rows or "wider" data tables). Long data usually have one observation split into many rows (typically resulting in more rows than columns or "longer" data tables).
Data shape is often important for statistical analysis or visualization but it has an under-appreciated role to play in quality control efforts as well. If many columns have the shared criteria for what constitutes "tidy", you can reshape the data to get all of those values into a single column (i.e., reshape longer), perform any needed wrangling, then--when you're finished--reshape back into the original data shape (i.e., reshape wider). As opposed to applying the same operations repeatedly to each column individually.
Let's consider an example to help clarify this. We'll simulate a butterfly dataset where both the number of different species and their sex were recorded in the same column. This makes the column not technically numeric and therefore unusable in analysis or visualization.
```{r shape}
# Generate a butterfly dataframe
bfly_v1 <- data.frame("pasture" = c("PNW", "PNW", "RCS", "RCS"),
"monarch" = c("14m", "10f", "7m", "16f"),
"melissa_blue" = c("32m", "2f", "6m", "0f"),
"swallowtail" = c("1m", "3f", "0m", "5f"))
# First we'll reshape this into long format
bfly_v2 <- bfly_v1 %>%
tidyr::pivot_longer(cols = -pasture,
names_to = "butterfly_sp",
values_to = "count_sex")
# Check what that leaves us with
head(bfly_v2, n = 4)
# Let's separate count from sex to be more usable later
bfly_v3 <- bfly_v2 %>%
tidyr::separate_wider_regex(cols = count_sex,
c(count = "[[:digit:]]+", sex = "[[:alpha:]]")) %>%
# Make the 'count' column a real number now
dplyr::mutate(count = as.numeric(count))
# Re-check output
head(bfly_v3, n = 4)
# Reshape back into wide-ish format
bfly_v4 <- bfly_v3 %>%
tidyr::pivot_wider(names_from = "butterfly_sp", values_from = count)
# Re-re-check output
head(bfly_v4)
```
While we absolutely _could_ have used the same function to break apart count and butterfly sex data it would have involved copy/pasting the same information repeatedly. By pivoting to long format first, we can greatly streamline our code. This can also be advantageous for unit conversions, applying data transformations, or checking text column contents among many other possible applications.
### Loops
Another way of simplfying repetitive operations is to use a "for loop" (often called simply "loops"). Loops allow you to iterate across a piece of code for a set number of times. Loops require you to define an "index" object that will change itself at the end of each iteration of the loop before beginning the next iteration. This index object's identity will be determined by whatever set of values you define at the top of the loop.
Here's a very bare bones example to demonstrate the fundamentals.
```{r loop-1}
# Loop across each number between 2 and 4
for(k in 2:4){ # <1>
# Square the number
result <- k^2
# Message that outside of the loop
message(k, " squared is ", result)
} # <2>
```
1. 'k' is our index object in this loop
2. Note that the operations to iterate across are wrapped in curly braces (`{...}`)
Once you get the hang of loops, they can be a nice way of simplifying your code in a relatively human-readable way! Let's return to our Plum Island Ecosystems crab dataset for a more nuanced example.
```{r loop-2}
# Create an empty list
crab_list <- list()
# Let's loop across size categories of crab
for(focal_size in unique(pie_crab_v4$size_category)){ # <1>
# Subset the data to just this size category
focal_df <- pie_crab_v4 %>%
dplyr::filter(size_category == focal_size)
# Calculate average and standard deviation of size within this category
size_avg <- mean(focal_df$size, na.rm = T)
size_dev <- sd(focal_df$size, na.rm = T)
# Assemble this into a data table and add to our list
crab_list[[focal_size]] <- data.frame("size_category" = focal_size,
"size_mean" = size_avg,
"size_sd" = size_dev)
} # Close loop
# Unlist the outputs into a dataframe
crab_df <- purrr::list_rbind(x = crab_list) # <2>
# Check out the resulting data table
crab_df
```
1. Note that this is not the most efficient way of doing group-wise summarization but is--hopefully--a nice demonstration of loops!
2. When all elements of your list have the same column names, `list_rbind` efficiently stacks those elements into one longer data table.
### Custom Functions
Finally, writing your own, customized functions can be really useful particularly when doing synthesis work. Custom functions are generally useful for reducing duplication and increasing ease of maintenance (see the note on custom functions in the SSECR [Reproducibility module](https://lter.github.io/ssecr/mod_reproducibility.html#consider-custom-functions)) and also can be useful end products of synthesis work in and of themselves.
If one of your group's outputs is a new standard data format or analytical workflow, the functions that you develop to aid yourself become valuable to anyone who adopts your synthesis project's findings into their own workflows. If you get enough functions you can even release a package that others can install and use on their own computers. Such packages are a valuable product of synthesis efforts and can be a nice addition to a robust scientific resume/CV.
```{r custom-fxns}
#| fig-align: center
#| fig-width: 4
#| fig-height: 3
# Define custom function
crab_hist <- function(df, size_cat){
# Subset data to the desired category
data_sub <- dplyr::filter(.data = df, size_category == size_cat)
# Create a histogram
p <- psych::multi.hist(x = data_sub$size)
}
# Invoke function
crab_hist(df = pie_crab_v4, size_cat = "tiny")
```
When writing your own functions it can also be useful to program defensively. This involves anticipating likely errors and writing your own error messages that are more informative to the user than whatever machine-generated error would otherwise get generated
```{r custom-fxns-improved}
#| fig-align: center
#| fig-width: 4
#| fig-height: 3
# Define custom function
crab_hist <- function(df, size_cat = "small"){ # <1>
# Error out if 'df' isn't the right format
if(is.data.frame(df) != TRUE) # <2>
stop("'df' must be provided as a data frame")
# Error out if the data doesn't have the right columns
if(all(c("size_category", "size") %in% names(df)) != TRUE) # <3>
stop("'df' must include a 'size' and 'size_category' column")
# Error out for unsupported size category values
if(size_cat %in% unique(df$size_category) != TRUE)
stop("Specified 'size_cat' not found in provided data")
# Subset data to the desired category
data_sub <- dplyr::filter(.data = df, size_category == size_cat)
# Create a histogram
p <- psych::multi.hist(x = data_sub$size)
}
# Invoke new-and-improved function
crab_hist(df = pie_crab_v4) # <4>
```
1. The default category is now set to "small"
2. We recommend phrasing your error checks with this format (i.e., 'if \<some condition\> is _not_ true, then \<informative error/warning message\>)
3. The `%in%` operator lets you check whether one value matches any element of a set of accepted values. Very useful in contexts like this because the alternative would be a lot of separate "or" conditionals
4. We don't need to specify the 'size_cat' argument because we can rely on the default
:::{.callout-note icon="false"}
#### Activity: Custom Functions
In a script, attempt the following on the PIE crab data:
- Write a function that:
- (A) calculates the median of the user-supplied column
- (B) determines whether each value is above, equal to, or below the median
- (C) makes a column indicating the results of step B
- Use the function on the _standard deviation_ of water temperature
- Use it again on the standard deviation of air temperature
- Revisit your function and identify 2-3 likely errors
- Write custom checks (and error messages) for the set of likely issues you just identified
:::
## Additional Resources
### Papers & Documents
- Todd-Brown, K.E.O. _et al._ [Reviews and Syntheses: The Promise of Big Diverse Soil Data, Moving Current Practices Towards Future Potential](https://bg.copernicus.org/articles/19/3505/2022/bg-19-3505-2022.html). **2022**. _Biogeosciences_
- Elgarby, O. [The Ultimate Guide to Data Cleaning](https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4). **2019**. _Medium_
- Borer, E. _et al._ [Some Simple Guidelines for Effective Data Management](https://esajournals.onlinelibrary.wiley.com/doi/full/10.1890/0012-9623-90.2.205). **2009**. _Ecological Society of America Bulletin_
### Workshops & Courses
- The Carpentries. [Data Analysis and Visualization in R for Ecologists: Working with Data](https://datacarpentry.org/R-ecology-lesson/working-with-data.html). **2024**.
- LTER Scientific Computing Team. [Coding in the Tidyverse](https://lter.github.io/workshop-tidyverse/). **2023**.
- National Center for Ecological Analysis and Synthesis (NCEAS) Learning Hub [coreR: Cleaning & Wrangling Data](https://learning.nceas.ucsb.edu/2023-10-coreR/session_08.html). **2023**.
- NCEAS Learning Hub [coreR: Writing Functions & Packages](https://learning.nceas.ucsb.edu/2023-10-coreR/session_16.html). **2023**.
- Delta Science Program [Data Munging / QA / QC / Cleaning](https://nceas.github.io/oss-lessons/data-munging-qa-qc-cleaning/data-munging-qa-qc-cleaning.html). **2019**.
### Websites
- Fox, J. [Ten Commandments for Good Data Management](https://dynamicecology.wordpress.com/2016/08/22/ten-commandments-for-good-data-management/). **2016**. _Dynamic Ecology_