Vignette data cleaning and workflows #92

florianm · 2020-08-25T08:46:36Z

Feature

From the ODK Forum
vignette "u r ODK, now what"

Fun things to do with ODK data via ruODK downstream:

For smaller projects, build the analysis straight from the ruODK Rmd template.
For medium-sized projects, build an R package and turn the analysis into a vignette. Example: https://dbca-wa.github.io/rOzCBI/articles/analysis.html
For larger projects, build data ETL pipelines into data warehouses that handle QA and serve as the point of truth for the records. This makes data from ODK a read-only, upstream record of "data as captured", and shifts QA/data cleaning downstream of ODK.
For things like ETL pipelines, check out the R package drake, wonderful screencast by Miles McBain here.
An example drake plan for ETL using ruODK is here, broken up into extract/transform/upload steps with a few extras around skip logic and resolving/QAing user names; workhorse functions are from wastdr.
For QA, Rich Iannone's pointblank package is amazingly useful. Check out his screencast here.
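
The extract/transform/upload split described above can be sketched as follows. This is a minimal, language-agnostic illustration in Python, not the drake plan or the wastdr functions mentioned in this issue; the field names, the `user_aliases` lookup, and the dict standing in for a warehouse are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    """One record 'as captured'. Frozen, so downstream steps cannot edit it."""
    instance_id: str
    submitter: str
    species: str


def extract(raw_rows):
    """Extract: wrap raw rows in immutable records."""
    return [Submission(r["instance_id"], r["submitter"], r["species"]) for r in raw_rows]


def transform(submissions, user_aliases):
    """Transform: resolve user names and normalise values into NEW rows."""
    return [
        {
            "instance_id": s.instance_id,
            # Hypothetical alias table mapping free-text names to canonical users.
            "submitter": user_aliases.get(s.submitter.strip().lower(), s.submitter),
            "species": s.species.strip().lower(),
        }
        for s in submissions
    ]


def load(cleaned_rows, warehouse):
    """Upload: upsert by instance_id so that rerunning the pipeline is idempotent."""
    for row in cleaned_rows:
        warehouse[row["instance_id"]] = row
    return warehouse


# Made-up example data, for illustration only.
raw = [
    {"instance_id": "uuid:a", "submitter": " Flo ", "species": "Flatback "},
    {"instance_id": "uuid:b", "submitter": "someone", "species": "unsure"},
]
captured = extract(raw)
warehouse = load(transform(captured, {"flo": "florianm"}), {})
warehouse = load(transform(captured, {"flo": "florianm"}), warehouse)  # rerun is harmless
```

Keeping the captured records immutable and writing cleaned rows to a separate store is one way to keep ODK data read-only while QA happens downstream.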

lognaturel · 2021-03-29T21:41:38Z

Related to this theme of cleaning, I've been wondering whether it might be possible/in scope to use the upcoming submission review features to automatically flag records with possible issues. I'm imagining something like defining data constraints and for any submission/row that violates them, calling home to Central to set the review status to "Has Issues" with the constraint violation text as the note. I describe this with surface understanding of ruODK so please don't take the suggestion literally but hopefully it helps illustrate the concept! Broadly, I'm interested in ways that users can automatically flag suspicious submissions. Happy to move the conversation somewhere else more appropriate if this is not feeling like the right place!

florianm · 2021-03-30T02:10:43Z

Oh that's a great idea, thanks for the suggestion!
I could imagine this as a worked example in a vignette. There could be an angle of "turn data validation errors into suggestions for form validation".

The use cases in which my users want to update records are:

An enumerator encounters an animal/secondary sign (e.g. turtle track in sand)/object of species/classification X, but is not sure and selects "unsure" as species. The form prompts to take a photo. If there's a photo, a QA operator can then review the photo and update the species.
An enumerator is absolutely sure to have seen species X far, far outside its distribution (something we know only downstream of ODK), so it's highly likely that the species is wrongly identified. These records should also be reviewed.

In my own use case, all data from ODK are imported into a data warehouse (Django), where we audit all QA operations (edits - django-revision) and decisions (quality levels - django-fsm). That's of course the most heavyweight implementation.

For a light-weight implementation purely in ODK Central / R / ruODK, I could imagine:

ruODK downloads and parses all data.
Rich Iannone's pointblank R package is used to interrogate the data and easily define data validation rules.
Let pointblank create agent reports showing which records fail which quality check.
Tabulate pointblank's failing records with links to ODK Central: view each failing record, mark one/all as "Has Issues".

This outsources all validation logic to pointblank, and focuses ruODK on the use case "mark this list of records as Has Issues".
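
The flow outlined above (validate downstream, then hand over a list of records to mark as "Has Issues") could look roughly like the sketch below. It is written in plain Python to stay self-contained and is not the pointblank or ruODK API. The rule definitions, the `EXPECTED_LAT` lookup, and the field names are hypothetical, and the `"hasIssues"` value is assumed to be what ODK Central expects as the review state; no network call is made here.

```python
# Hypothetical latitude bounds per species, something only known downstream of ODK.
EXPECTED_LAT = {"flatback": (-30.0, -10.0)}


def within_expected_range(row):
    """Pass if the species has no known bounds or the latitude falls inside them."""
    bounds = EXPECTED_LAT.get(row["species"])
    return bounds is None or bounds[0] <= row["latitude"] <= bounds[1]


# Each rule is (note text shown to the reviewer, predicate returning True on pass).
RULES = [
    ("Species recorded as 'unsure': review photo", lambda r: r["species"] != "unsure"),
    ("Location outside expected range for species", within_expected_range),
]


def check_submissions(rows, rules):
    """Return {instance_id: [messages]} for every row failing at least one rule."""
    flagged = {}
    for row in rows:
        for message, passes in rules:
            if not passes(row):
                flagged.setdefault(row["instance_id"], []).append(message)
    return flagged


def review_updates(flagged):
    """Turn failures into one update per record: review state plus a note."""
    return [
        {"instance_id": iid, "review_state": "hasIssues", "note": "; ".join(msgs)}
        for iid, msgs in flagged.items()
    ]


# Made-up example data, for illustration only.
rows = [
    {"instance_id": "uuid:a", "species": "flatback", "latitude": -20.5},
    {"instance_id": "uuid:b", "species": "unsure", "latitude": -21.0},
    {"instance_id": "uuid:c", "species": "flatback", "latitude": -42.0},
]
updates = review_updates(check_submissions(rows, RULES))
```

Here the validation step only produces a list of updates; sending them to Central stays a separate, small step, which matches the split between validation logic and the "mark this list of records" use case.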

lognaturel · 2021-03-30T22:16:04Z

That sounds really great, @florianm! Thanks for the outline. Would be amazing to make some of these a reality. I'll see if I can help make that happen.

florianm · 2023-06-10T09:54:16Z

Stu Norris suggested using the Microsoft 365 R package to notify enumerators via email about data quality issues.

florianm added the "feature" label (a feature request or enhancement) Aug 25, 2020

florianm added this to the Release 1.0 milestone Aug 25, 2020

florianm self-assigned this Aug 25, 2020

florianm added the "documentation" label (Package and function level documentation) and removed the "feature" label (a feature request or enhancement) Oct 20, 2020
