Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distinguish NA (missing) from NA (accumulated) #547

Open
sbfnk opened this issue Feb 14, 2024 · 13 comments · May be fixed by #851
Open

Distinguish NA (missing) from NA (accumulated) #547

sbfnk opened this issue Feb 14, 2024 · 13 comments · May be fixed by #851

Comments

@sbfnk
Copy link
Contributor

sbfnk commented Feb 14, 2024

Enabling this would, I think, require some sort of that marks dates explicitly as missing vs. NA.

I think this would be my preferred option as it would be more general but I also think it can be addressed in its own review as it would be a superset of this PR.

My thought on how that would work is to have a new variable (accumulate) that indicates which days should be summed.

Originally posted by @seabbs in #534 (review)

@sbfnk
Copy link
Contributor Author

sbfnk commented Feb 14, 2024

My thought on how that would work is to have a new variable (accumulate) that indicates which days should be summed.

Another option would be to distinguish between explicit (date exists in the data, value is NA meaning missing) vs. implicit (date doesn't exist in the data meaning accumulate) NAs which might be easier preprocessing if potentially easier to inadvertently get wrong.

@seabbs
Copy link
Contributor

seabbs commented Feb 19, 2024

yeah potentially but also think that could be a bit dangerous. I would instead suggest making a helper function that maps from that structure to the less dangerous explicit version for those that clearly want that.

@seabbs
Copy link
Contributor

seabbs commented Feb 19, 2024

and if going that way I'd suggest that becomes a dependent issue

@sbfnk
Copy link
Contributor Author

sbfnk commented Sep 16, 2024

With appropriate warnings messages as suggested in #771 then I think this is the best option as it can take all information from a 2-column data frame as before:

distinguish between explicit (date exists in the data, value is NA meaning missing) vs. implicit (date doesn't exist in the data meaning accumulate) NAs

@seabbs
Copy link
Contributor

seabbs commented Sep 16, 2024

I don't think I agree. I think there should be one way of handling missing data (as missing) and it can throw a warning if creating missing dates saying what it is doing.

I think overloading NAs like we have done for accumulation is confusing and dangerous and would much prefer a separate feature describing this.

Something I think we want to be aware of is non-standard schemes. These could be 1. Non-constant reporting and 2. repeated reporting (some counts are reported twice as aggregates of different dates).

I haven't really seen the latter and its quite an edge case so its unclear to me if we really want to support it or not.

@sbfnk
Copy link
Contributor Author

sbfnk commented Sep 16, 2024

I'm open to suggestions and acknowledge there are dangers in overloading interpretations. My ideal would be one in which it's fairly straightforward (and safe) to handle the common cases of daily/weekly data on incidence/prevalence and missingness that could correspond to zeroes or missed reports.

@jamesmbaazam
Copy link
Contributor

With appropriate warnings messages as suggested in #771 then I think this is the best option as it can take all information from a 2-column data frame as before:

distinguish between explicit (date exists in the data, value is NA meaning missing) vs. implicit (date doesn't exist in the data meaning accumulate) NAs

This can now be checked with the test_data_complete() function introduced in #774, when merged.

@seabbs
Copy link
Contributor

seabbs commented Sep 23, 2024

My ideal would be one in which it's fairly straightforward (and safe) to handle the common cases of daily/weekly data on incidence/prevalence and missingness that could correspond to zeroes or missed reports

Do you not think this is possible without overloading using the kind of approach I have suggested with helper utilities?

@sbfnk
Copy link
Contributor Author

sbfnk commented Oct 22, 2024

My ideal would be one in which it's fairly straightforward (and safe) to handle the common cases of daily/weekly data on incidence/prevalence and missingness that could correspond to zeroes or missed reports

Do you not think this is possible without overloading using the kind of approach I have suggested with helper utilities?

It probably is. It could relate to supporting option (1) as suggested in #346 (comment)
with an additional fill_missing(na = "accumulate") function or one with a better name, and then failing with an error if any data with NAs is passed to estimate_infections(). As mentioned there it would be quite a breaking change but also one which probably helps avoid the kind of confusion and potential for error that we have seen.

@seabbs
Copy link
Contributor

seabbs commented Oct 23, 2024

I'm not sure I totally follow this now. Does the suggestion support a mixture of accumulation and missing data?

@sbfnk
Copy link
Contributor Author

sbfnk commented Oct 23, 2024

estimate_infections would expect an accumulation column if one wanted to accumulate but a user could create this based on some interpretation of NAs using a separate function.

@seabbs
Copy link
Contributor

seabbs commented Oct 23, 2024

and NAs would go back to working only with missing data and you wouldn't support repeated reporting?

@seabbs
Copy link
Contributor

seabbs commented Oct 23, 2024

if yes and no sounds good

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants