Vignette data cleaning and workflows #92

Open
florianm opened this issue Aug 25, 2020 · 4 comments
Assignees
Labels
documentation Package and function level documentation

Comments

@florianm
Collaborator

florianm commented Aug 25, 2020

Feature

From the ODK Forum
vignette "u r ODK, now what"

Fun things to do with ODK data via ruODK downstream:

  • For smaller projects, build the analysis straight from the ruODK Rmd template.
  • For medium-sized projects, build an R package and turn the analysis into a vignette. Example: https://dbca-wa.github.io/rOzCBI/articles/analysis.html
  • For larger projects, build data ETL pipelines into data warehouses that handle QA and serve as the point of truth for the records. This makes data from ODK read-only, upstream "data as captured", and shifts QA/data cleaning downstream of ODK.
  • For things like ETL pipelines, check out the R package drake, wonderful screencast by Miles McBain here.
  • An example drake plan for ETL using ruODK is here, broken up into extract/transform/upload steps with a few extras around skip logic and resolving/QAing user names; workhorse functions are from wastdr.
  • For QA, Rich Iannone's pointblank package is amazingly useful. Check out his screencast here.
@florianm florianm added the feature a feature request or enhancement label Aug 25, 2020
@florianm florianm added this to the Release 1.0 milestone Aug 25, 2020
@florianm florianm self-assigned this Aug 25, 2020
@florianm florianm added documentation Package and function level documentation and removed feature a feature request or enhancement labels Oct 20, 2020
@lognaturel

Related to this theme of cleaning, I've been wondering whether it might be possible/in scope to use the upcoming submission review features to automatically flag records with possible issues. I'm imagining something like defining data constraints and for any submission/row that violates them, calling home to Central to set the review status to "Has Issues" with the constraint violation text as the note. I describe this with surface understanding of ruODK so please don't take the suggestion literally but hopefully it helps illustrate the concept! Broadly, I'm interested in ways that users can automatically flag suspicious submissions. Happy to move the conversation somewhere else more appropriate if this is not feeling like the right place!

@florianm
Collaborator Author

Oh that's a great idea, thanks for the suggestion!
I could imagine this as a worked example in a vignette. There could be an angle of "turn data validation errors into suggestions for form validation".

The use cases in which my users want to update records are:

  • An enumerator encounters an animal/secondary sign (e.g. turtle track in sand)/object of species/classification X, but is not sure and selects "unsure" as species. The form prompts to take a photo. If there's a photo, a QA operator can then review the photo and update the species.
  • An enumerator is absolutely sure to have seen species X far, far outside its distribution (something we know only downstream of ODK), so it's highly likely that the species is wrongly identified. These records should also be reviewed.

In my own use case, all data from ODK are imported into a data warehouse (Django), where we audit all QA operations (edits - django-revision) and decisions (quality levels - django-fsm). That's of course the most heavyweight implementation.

For a light-weight implementation purely in ODK Central / R / ruODK, I could imagine:

  • ruODK downloads and parses all data.
  • Rich Iannone's pointblank R package is used to interrogate the data and define the data validation rules.
  • Let pointblank create agent reports showing which records fail which quality check.
  • Tabulate pointblank's failing records with links to ODK Central: view each failing record, mark one/all as "Has Issues".

This outsources all validation logic to pointblank, and focuses ruODK on the use case "mark this list of records as Has Issues".
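As a rough sketch of that light-weight workflow, here is the "rules in, table of failing records out" step in base R only. Everything in it is an assumption for illustration: the `submissions` table stands in for data downloaded by ruODK, the two rules and the latitude cut-off are made up, plain functions stand in for pointblank's validation steps, and the Central base URL, project ID, form ID and link pattern are hypothetical.

```r
# Hypothetical submissions, standing in for a table downloaded and parsed by ruODK.
submissions <- data.frame(
  id = c("uuid:a1", "uuid:b2", "uuid:c3", "uuid:d4"),
  species = c("unsure", "natator-depressus", "chelonia-mydas", "unsure"),
  latitude = c(-20.3, -21.9, -35.0, -20.7),
  stringsAsFactors = FALSE
)

# Validation rules (assumed for illustration): each function returns TRUE for
# rows that pass. The rule name doubles as the note a reviewer would see.
rules <- list(
  "Species recorded as 'unsure': review photo if present" = function(d) {
    d$species != "unsure"
  },
  "Sighting outside assumed distribution (latitude < -30)" = function(d) {
    d$latitude >= -30
  }
)

# Apply every rule and stack the failing rows into one long table,
# one row per (record, failed rule) pair.
failures <- do.call(rbind, lapply(names(rules), function(note) {
  failed <- submissions[!rules[[note]](submissions), , drop = FALSE]
  if (nrow(failed) == 0) return(NULL)
  data.frame(id = failed$id, note = note, stringsAsFactors = FALSE)
}))

# Hypothetical ODK Central coordinates and link pattern (assumptions, adjust
# to your own server).
central_url <- "https://central.example.org"
project_id <- 1L
form_id <- "turtle-track"

# Instance IDs contain a colon, so percent-encode them before building links.
encoded_ids <- vapply(
  failures$id, utils::URLencode, character(1),
  reserved = TRUE, USE.NAMES = FALSE
)
failures$link <- sprintf(
  "%s/#/projects/%d/forms/%s/submissions/%s",
  central_url, project_id, form_id, encoded_ids
)

print(failures)
```

The resulting `failures` table is the list a "mark these records as Has Issues" helper would consume, with the `note` column as the constraint violation text. A record that breaks several rules shows up once per rule.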

@lognaturel

That sounds really great, @florianm! Thanks for the outline. Would be amazing to make some of these a reality. I'll see if I can help make that happen.

@florianm
Collaborator Author

Stu Norris suggested using the Microsoft 365 R package to notify enumerators by email about data quality issues.


2 participants