-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
154 lines (114 loc) · 3.88 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/"
)
options(digits = 4, width = 120)
```
# dbc: Dictionary-based cleaning
<!-- badges: start -->
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/#experimental)
[![R-CMD-check](https://github.com/epicentre-msf/dbc/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/epicentre-msf/dbc/actions/workflows/R-CMD-check.yaml)
<!-- badges: end -->
Tools for creating and applying dictionaries of value-replacement pairs, to
clean non-valid values of numeric, categorical, or date-type variables within a
dataset.
### Installation
Install from GitHub with:
```{r, eval=FALSE}
# install.packages("remotes")
remotes::install_github("epicentre-msf/dbc")
```
### Example usage
#### Example dataset and dictionary
```{r}
library(dbc)
data(ll1) # example messy dataset
data(dict_categ1) # example dictionary of categorical vars and allowed values
ll1
```
#### Check and clean numeric variables
##### 1. Produce an initial dictionary of non-valid numeric values
```{r}
dict_clean_numeric <- check_numeric(
ll1,
vars = c("age", "contacts"), # cols that should be numeric
fn = as.integer # values not coercible by `fn` are non-valid
)
dict_clean_numeric
```
##### 2. Manually review non-valid values and give appropriate replacements, or use keyword ".na" to indicate that the value has been reviewed and cannot be corrected.
Normally one would do this step in a spreadsheet but we'll do it in R here for simplicity.
```{r}
dict_clean_numeric$replacement <- c(".na", "39", "10", ".na")
```
##### 3. Use the cleaning dictionary to clean numeric variables
```{r}
clean_numeric(
ll1,
vars = c("age", "contacts"),
dict_clean = dict_clean_numeric,
fn = as.integer
)
```
##### 4. If the original dataset is updated, we can repeat the cleaning steps while retaining the corrections already made in the previous cleaning dictionary.
*Check for new non-valid numeric values, after incorporating previous cleaning*
```{r}
dict_clean_numeric_update <- check_numeric(
ll2, # same as ll1 but with 3 additional entries
vars = c("age", "contacts"),
dict_clean = dict_clean_numeric, # incorporate previous cleaning before checking
fn = as.integer,
return_all = TRUE # return original cleaning dict + new entries
)
dict_clean_numeric_update
```
*Manually specify replacement for new non-valid entry*
```{r}
dict_clean_numeric_update$replacement[5] <- "6"
```
*Apply updated cleaning dictionary to updated dataset*
```{r}
clean_numeric(
ll2,
vars = c("age", "contacts"),
dict_clean = dict_clean_numeric_update,
fn = as.integer
)
```
#### Check and clean categorical variables
##### 1. Produce an initial dictionary of non-valid categorical values
```{r}
dict_clean_categ <- check_categorical(
ll1,
dict_allowed = dict_categ1 # dictionary of categorical vars and their allowed values
)
dict_clean_categ
```
##### 2. Manually review non-valid values and give appropriate replacements, or use keyword ".na" to indicate that the value has been reviewed and cannot be corrected.
Again, we would normally do this step in a spreadsheet but we do it in R here for simplicity.
```{r}
dict_clean_categ$replacement <- c(
"Years",
"Years",
"Cured",
".na",
"M",
".na",
"Suspected"
)
```
##### 3. Use the cleaning dictionary to clean categorical variables
```{r}
clean_categorical(
ll1,
dict_allowed = dict_categ1,
dict_clean = dict_clean_categ
)
```
##### 4. If the original dataset is updated, we can repeat the cleaning steps while retaining the corrections already made in the previous cleaning dictionary, as in the example with numeric variables above.