Bsweger/skip data checks option #47
@annakrystalli This isn't ready for a full review, but before going further, I would love your thoughts on the direction of these changes. [This question(s) turned out longer than I'd anticipated--maybe worth a synchronous chat?]
The new option to skip data checks has the performance impact we want:
hub_path_cloud <- s3_bucket("uscdc-flusight-hub-v1")
# below will fail if we don't specify file_format, because the hub's valid formats
# are parquet AND csv (because we're no longer validating file formats during open_dataset)
# https://github.com/hubverse-org/flusight_hub_archive/blob/main/hub-config/admin.json#L10
hub_con <- connect_hub(hub_path_cloud, file_format = "parquet")
That said, I don't love how the change introduces different default behavior for S3-based hubs. Should we:
If we accept the second statement, it doesn't really make sense that we set
Given the above (and considering the possibility that our arrow options for cloud-based hubs may diverge even further from the local options as we learn more), does it make sense to try using DuckDB instead of arrow in
Just to note, I have not looked at the code yet, so this is a more high-level response. I totally agree with you on all your areas and degrees of hesitation. Personally, I feel it's worse for the default settings to return an error than to be slow, and I definitely don't feel the default for S3 hub connections should be to assume the hub is a hubverse transformed hub and therefore set the format to parquet by default.
Given the above, I almost feel like we should not have different defaults for local vs. S3 but instead issue a performance warning with more (or links to more) information about skipping certain checks and their implications/expectations. This is also making me feel even more strongly that the very special situation of a hubverse cloud transformed hub should be able to be detected/communicated and override defaults.
Re DuckDB, I am definitely curious and open to exploring that as the default!
One last thought. If not a special file, perhaps a dedicated function for specifically accessing hubverse transformed cloud hubs with appropriate defaults might work?
Thanks for weighing in--we're in agreement that this PR as written is not the way to go for solving our performance issues on cloud-based data. I'm not clear on what you mean by a "special file." Can you say more about that?
I think trying DuckDB is a worthy experiment: the performance limitations we're hitting with
For example, I tried passing in a specific list of S3 URIs (instead of the model-output directory) to
That strategy was not successful (branch here), seemingly because Arrow's R implementation will make an S3 call for every file in the list: apache/arrow#35715 (this issue was filed by a friend of yours 😄).
If DuckDB can provide a consistent interface for all cloud-based hubs, regardless of where they are hosted or what the file format is, my .02 is that would be preferable to options that require special handling for Hubverse-hosted connections.
I was referring to the suggestion I had made in this issue: #37 (comment)
😜 yeah I think I experimented with that initially and then abandoned it...
💯 I've always felt collecting everything into some sort of database is the right and most robust approach and should be an option in the hubverse ecosystem, so I also agree that trying DuckDB is a worthy experiment!
@annakrystalli Been thinking about this some more. @matthewcornell and I are continuing to explore DuckDB for consuming data from a cloud-based hub. What we learn will likely be relevant to potentially using DuckDB in
Agree with your statement that defaults shouldn't cause errors. So what if we add "skip checks" but never make it the default? That way we have an option that will allow people to connect to something like the flusight archives, even if they have to be explicit.
Yes, agree that's the easiest effective way forward. Also what do you think about this suggestion?
Is it weird to issue a message by default when connecting to cloud?
BTW when you're ready for an actual code review just let me know!
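As a rough illustration of the by-default message floated above, here is a minimal sketch. The helper name `inform_cloud_performance()` and the message wording are hypothetical and not part of hubData; the only assumption about arrow is that `s3_bucket()` returns an object inheriting from `SubTreeFileSystem`, while local hubs are plain character paths.

```r
# Hypothetical helper (not part of hubData): emit a hint when a hub lives
# in cloud storage and data checks are still enabled.
inform_cloud_performance <- function(hub_path, skip_checks = FALSE) {
  # Assumption: cloud hubs are arrow SubTreeFileSystem objects, as returned
  # by arrow::s3_bucket(); local hubs are character paths.
  is_cloud <- inherits(hub_path, "SubTreeFileSystem")
  if (is_cloud && !isTRUE(skip_checks)) {
    message(
      "Connecting to a cloud hub with file checks enabled can be slow. ",
      "If the hub has a single model-output file format, consider ",
      "`skip_checks = TRUE`."
    )
  }
  # Return whether the path looked like a cloud hub, invisibly.
  invisible(is_cloud)
}
```

Because it uses `message()` rather than `warning()`, callers who find the hint noisy could silence it with `suppressMessages()` without changing the connection behavior.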
1376dd7 to 4f52c73
@annakrystalli PR is ready for review!
A message like that would probably help people, but I wasn't sure where to put it. After
Happy to update this PR, but if we don't have a good idea/wording in the short-term, I'd suggest merging this and revisiting later.
This all looks pretty good to me, just some minor change requests.
I think it would also be good to have a short section about this in the connect_hub
vignette: https://github.com/hubverse-org/hubData/blob/main/vignettes/articles/connect_hub.Rmd
One last comment. We generally also need to bump the version in the
This changeset adds an optional skip_checks parameter to connect_hub.R and connect_model_output.R per the requirements outlined in hubverse-org#37. When working with hub data on a local filesystem, the behavior is unchanged. When working with hub data in an S3 bucket, the connect functions will now skip data checks by default to improve performance. The former connection behavior for S3-based hubs can be obtained by explicitly setting skip_checks=FALSE. This commit fixes the test suite to work when using skip_checks=FALSE to force the previous behavior. The next commit will add new tests to ensure the new behavior works as intended.
This changeset updates the test suite to test the behavior of skip_checks = TRUE (which is the default for S3-based hubs). However, the code as written will not work when there are multiple file types (e.g., csv and parquet), because it performs an Arrow open_dataset for each file type. That doesn't work when exclude_invalid_files is FALSE because open_dataset will then grab every file every time it is run.
Because connect_hub and connect_model_output rely on the use of "exclude_invalid_files=TRUE" when making multiple passes of arrow::open_dataset (one for each file format), we cannot allow skip_checks=TRUE for hubs that contain more than one model-output format. Otherwise, open_dataset would grab all the files every time and cause errors when a user tries to run queries against the resulting arrow table.
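The restriction described in this commit message could be enforced with a guard along the following lines. This is a minimal sketch, not the code in this PR: the function name `validate_skip_checks()`, its arguments, and the error wording are all hypothetical, and `file_formats` stands in for whatever formats the hub's configuration declares.

```r
# Hypothetical guard (illustrative only, not the code in this PR):
# refuse skip_checks when the hub declares more than one file format.
validate_skip_checks <- function(file_formats, skip_checks = FALSE) {
  formats <- unique(file_formats)
  if (isTRUE(skip_checks) && length(formats) > 1) {
    stop(
      "skip_checks = TRUE requires a single model-output file format; found: ",
      paste(formats, collapse = ", "),
      call. = FALSE
    )
  }
  # Return the validated flag invisibly so the call can be used inline.
  invisible(skip_checks)
}
```

Failing early like this would surface the problem at connection time, rather than later when a query runs against a dataset that mixes csv and parquet files.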
90fa882 to 59d8bc6
@annakrystalli Thanks for the review--I pushed updates to address the requested changes, so it's ready for another look!
```
hub_path_cloud <- s3_bucket("hubverse/hubutils/testhubs/parquet/")
hub_con_cloud <- connect_hub(hub_path_cloud, file_format = "parquet", skip_checks = TRUE)
hub_con_cloud
```
This is great. I do wonder whether there should be a header for the section to bring attention to the topic (it'll appear in the TOC too) and also an introductory sentence to warn folks that the default behaviour of running these checks can be extremely slow on very large hubs in the cloud?
Fixes #43
Fixes #37
This PR adds a `skip_checks` option to `hubData`'s `connect_hub` and `connect_model_output` functions. When set to `TRUE`, these functions will skip the checks on individual files in the hub's model output directory (default value is `FALSE`, i.e., no behavior change). `skip_checks` improves performance when connecting to a cloud-based hub, though it can only be used on hubs that have a single file format in their model output directory.
Timing without `skip_checks`: 8 min (4042 model output files)
Timing with `skip_checks`: 8 sec (4042 model output files)
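The timings above work out to roughly a 60x difference (480 seconds vs. 8 seconds). A comparison like this could be reproduced with a small timing wrapper; the sketch below is an assumption about how one might measure it, not how the numbers above were obtained. `time_connection()` is a hypothetical helper, and the commented usage borrows the bucket name from earlier in this thread without claiming the reported timings came from that hub.

```r
# Hypothetical helper: time any zero-argument function that opens a hub
# connection and return the connection with elapsed wall-clock seconds.
time_connection <- function(connect_fn) {
  timing <- system.time(con <- connect_fn())
  list(connection = con, seconds = unname(timing[["elapsed"]]))
}

# Usage sketch (not run; needs network access plus the arrow and hubData
# packages):
# hub_path_cloud <- arrow::s3_bucket("uscdc-flusight-hub-v1")
# slow <- time_connection(function() {
#   hubData::connect_hub(hub_path_cloud, file_format = "parquet")
# })
# fast <- time_connection(function() {
#   hubData::connect_hub(hub_path_cloud, file_format = "parquet",
#                        skip_checks = TRUE)
# })
# slow$seconds / fast$seconds
```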