
Bsweger/skip data checks option #47

Merged

Conversation

@bsweger commented Jul 9, 2024

Fixes #43
Fixes #37

This PR adds a skip_checks option to hubData's connect_hub and connect_model_output functions. When set to TRUE, these functions will skip the individual file checks in the hub's model output directory (the default value is FALSE, i.e., no behavior change).

skip_checks improves performance when connecting to a cloud-based hub, though it can only be used on hubs that have a single file format in their model output directory.


Timing without skip_checks: 8 min
4042 model output files

> hub_path <- s3_bucket('s3://uscdc-flusight-hub-v1')
> s=Sys.time(); connect_hub(hub_path, file_format = "parquet"); Sys.time()-s;
[snip]
Time difference of 8.027225 mins

Timing with skip_checks: 8 sec
4042 model output files

> hub_path <- s3_bucket('s3://uscdc-flusight-hub-v1')
> s=Sys.time(); connect_hub(hub_path, file_format = "parquet", skip_checks = TRUE); Sys.time()-s;
[snip]
Time difference of 8.030099 secs

@bsweger commented Jul 9, 2024

@annakrystalli This isn't ready for a full review, but before going further, would love your thoughts on the direction of these changes.

[These questions turned out longer than I'd anticipated--maybe worth a synchronous chat?]

The new option to skip data checks has the performance impact we want: connect_hub against the FluSight archives was taking ~9 minutes to run on my laptop and now completes in anywhere between 5 and 7 seconds.

hub_path_cloud <- s3_bucket("uscdc-flusight-hub-v1")
# the call below will fail if we don't specify file_format, because the hub's valid
# formats are parquet AND csv (and we're no longer validating file formats during open_dataset)
# https://github.com/hubverse-org/flusight_hub_archive/blob/main/hub-config/admin.json#L10
hub_con <- connect_hub(hub_path_cloud, file_format = "parquet")

That said, I don't love how the change introduces different default behavior for S3-based hubs. Should we:

  1. force users of S3-based hubs to explicitly pass in skip_checks = TRUE, thereby retaining the default behavior of skip_checks = FALSE regardless of hub location? OR
  2. accept that cloud based hubs have different needs and favor doing the "right" thing by default?

If we accept the second option, it doesn't really make sense that we set skip_checks to TRUE on behalf of S3 users but don't set file_format to parquet, when both parameters are necessary most of the time. I feel okay about the skip_checks default, but I don't think we should assume that every S3-based hub will contain only parquet files.

Given the above (and considering the possibility that our arrow options for cloud-based hubs may diverge even further from the local options as we learn more), does it make sense to try using DuckDB instead of arrow in connect_hub.SubTreeFileSystem (or file an issue to do that later)? A DuckDB approach would (rough sketch after this list):

  • sidestep skip_checks and prevent us from handling S3 differently than other cloud file systems
  • but may introduce new classes of errors and edge cases
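
For concreteness, here is a minimal sketch of what a DuckDB-backed connection might look like. This is illustrative only and not part of this PR: it assumes the duckdb R package with DuckDB's httpfs extension, the S3 glob is hypothetical, and a public bucket may additionally need anonymous-credential configuration.

```r
# Illustrative only -- not part of this PR. Assumes the duckdb R package
# and DuckDB's httpfs extension; the S3 glob below is hypothetical.
library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb())
dbExecute(con, "INSTALL httpfs")
dbExecute(con, "LOAD httpfs")
dbExecute(con, "SET s3_region = 'us-east-1'")

# DuckDB plans the scan over every file matching the glob lazily, with no
# up-front per-file validation pass (unlike exclude_invalid_files = TRUE).
preview <- dbGetQuery(con, "
  SELECT *
  FROM read_parquet('s3://uscdc-flusight-hub-v1/model-output/*/*.parquet')
  LIMIT 5
")
```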

@annakrystalli commented

Just to note, I have not looked at the code yet, so this is a more high-level response.

I totally agree with you on all your areas and degrees of hesitation.

Personally, I feel it's worse for the default settings to return an error than to be slow, and I definitely don't feel the default for S3 hub connections should be to assume the hub is a hubverse-transformed hub and therefore set the format to parquet by default.

Given the above, I almost feel like we should not have different defaults for local vs. S3, but instead issue a performance warning with more (or links to more) information about skipping certain checks and their implications/expectations.

This is also making me feel even more strongly that the very special situation of a hubverse cloud-transformed hub should be detectable/communicated and able to override defaults.
I continue to think a special file might be the way to go (rough sketch below). Marker files like this are certainly common in R (e.g. .Rproj files, _pkgdown files, etc.), I'm pretty sure they are in Python too, and such a file could also act as a location for further hubverse transform configurations (e.g. partitions, etc.)
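
For illustration, detection could be as simple as probing for the marker at the hub root via the arrow filesystem object. A minimal sketch, assuming a hypothetical `.cloud-hub` filename (the helper name is made up too):

```r
# Sketch only: ".cloud-hub" and has_cloud_marker() are hypothetical.
# hub_path is an arrow SubTreeFileSystem, e.g. from arrow::s3_bucket().
has_cloud_marker <- function(hub_path) {
  # Opening the marker file succeeds only if it exists
  marker <- try(hub_path$OpenInputStream(".cloud-hub"), silent = TRUE)
  !inherits(marker, "try-error")
}

# connect_hub() could then flip cloud-friendly defaults:
# if (has_cloud_marker(hub_path)) skip_checks <- TRUE
```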

@annakrystalli commented

Re: DuckDB, I am definitely curious and open to exploring that as the default!

@annakrystalli commented

One last thought. If not a special file, perhaps a dedicated function for specifically accessing hubverse transformed cloud hubs with appropriate defaults might work?

@bsweger commented Jul 9, 2024

> One last thought. If not a special file, perhaps a dedicated function for specifically accessing hubverse transformed cloud hubs with appropriate defaults might work?

Thanks for weighing in--we're in agreement that this PR as written is not the way to go for solving our performance issues on cloud-based data.

I'm not clear on what you mean by a "special file." Can you say more about that?

I think trying DuckDB is a worthy experiment: the performance limitations we're hitting with connect_hub aren't related to partitioning or other physical storage strategies. We seem to be hitting Arrow-related limitations since we can't even get a lazy table object open without S3-related performance issues.

For example, I tried passing a specific list of S3 URIs (instead of the model-output directory) to open_dataset, in hopes that Arrow could open the correct files without needing exclude_invalid_files = TRUE to scan the entire S3 contents.

That strategy was not successful (branch here), seemingly because Arrow's R implementation will make an S3 call for every file in the list: apache/arrow#35715 (this issue was filed by a friend of yours 😄 ).
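
A rough reconstruction of that attempt, with made-up URIs (the real code is on the linked branch):

```r
# Rough reconstruction -- the URIs are invented for illustration.
library(arrow)

uris <- c(
  "s3://uscdc-flusight-hub-v1/model-output/teamA-model/2023-10-14-teamA-model.parquet",
  "s3://uscdc-flusight-hub-v1/model-output/teamB-model/2023-10-14-teamB-model.parquet"
)

# Even with an explicit file list, arrow's R bindings make an S3 call per
# file (apache/arrow#35715), so this stays slow for thousands of files.
ds <- open_dataset(uris, format = "parquet")
```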

If DuckDB can provide a consistent interface for all cloud-based hubs, regardless of where they are hosted or what the file format is, my $0.02 is that it would be preferable to options that require special handling for Hubverse-hosted connections.

@annakrystalli commented

> I'm not clear on what you mean by a "special file." Can you say more about that?

I was referring to the suggestion I had made in this issue: #37 (comment), i.e.:

> It would be nicer however for the function to be able to detect that it's dealing with a cloud hub where the above conditions are guaranteed and set those checks to off by default in that case. Perhaps through including a .cloud-hub file in the root of the hub that the function could use to detect such a hub?

> That strategy was not successful (branch here), seemingly because Arrow's R implementation will make an S3 call for every file in the list: apache/arrow#35715 (this issue was filed by a friend of yours 😄).

😜 yeah, I think I experimented with that initially and then abandoned it...

> If DuckDB can provide a consistent interface for all cloud-based hubs, regardless of where they are hosted or what the file format is, my $0.02 is that it would be preferable to options that require special handling for Hubverse-hosted connections.

💯 I've always felt that collecting everything into some sort of database is the right and most robust approach, and should be an option in the hubverse ecosystem, so I also agree that trying DuckDB is a worthy experiment!

@bsweger commented Jul 17, 2024

@annakrystalli Been thinking about this some more. @matthewcornell and I are continuing to explore DuckDB for consuming data from a cloud-based hub.

What we learn will likely be relevant to potentially using DuckDB in hubData, but in the meantime, what do you think about solving the immediate pain by returning to the "skip data checks" idea, just implemented differently than in this current PR?

Agree with your statement that defaults shouldn't cause errors. So what if we add "skip checks" but never make it the default (hypothetical signature sketched below)? That way we'd have an option that allows people to connect to something like the FluSight archives, even if they have to be explicit.
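
Sketched as a signature, that might look like the following. This is a hypothetical simplification, not hubData's actual function definition:

```r
# Hypothetical sketch of the proposal -- not hubData's actual signature.
# skip_checks defaults to FALSE for every hub, local or S3.
connect_hub <- function(hub_path, file_format = NULL, skip_checks = FALSE) {
  # ... existing connection logic, skipping file checks only on request ...
}

# Explicit opt-in for a large cloud hub such as the FluSight archives:
# connect_hub(s3_bucket("uscdc-flusight-hub-v1"),
#             file_format = "parquet", skip_checks = TRUE)
```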

@annakrystalli commented

Yes, agree that's the easiest effective way forward.

Also what do you think about this suggestion?

> Given the above I almost feel like we should not have different defaults for local vs. S3 but instead issue a performance warning on cloud with more (or links to more) information about skipping certain checks and their implications/expectations.

Is it weird to issue a message by default when connecting to cloud?

@annakrystalli commented Jul 18, 2024

BTW when you're ready for an actual code review just let me know!

@bsweger force-pushed the bsweger/skip_data_checks_option branch 6 times, most recently from 1376dd7 to 4f52c73 on July 19, 2024 at 18:26
@bsweger marked this pull request as ready for review on July 19, 2024 at 18:36
@bsweger commented Jul 19, 2024

@annakrystalli PR is ready for review!

> Is it weird to issue a message by default when connecting to cloud?

A message like that would probably help people, but I wasn't sure where to put it. After connect_hub/connect_model_output runs is too late. We could preemptively display a message about skip_checks after someone creates an s3_bucket object?

Happy to update this PR, but if we don't have a good idea/wording in the short-term, I'd suggest merging this and revisiting later.

@annakrystalli left a comment

This all looks pretty good to me, just some minor change requests.

I think it would also be good to have a short section about this in the connect_hub vignette: https://github.com/hubverse-org/hubData/blob/main/vignettes/articles/connect_hub.Rmd

@annakrystalli commented

One last comment: we generally also need to bump the version in the DESCRIPTION file, although given that I'm going to be merging this in after #49, I can take care of it when I resolve the merge conflicts, if you like.

This changeset adds an optional skip_checks parameter to
connect_hub.R and connect_model_output.R per the requirements
outlined in hubverse-org#37.

When working with hub data on a local filesystem, the behavior
is unchanged. When working with hub data in an S3 bucket, the
connect functions will now skip data checks by default to
improve performance. The former connection behavior for
S3-based hubs can be obtained by explicitly setting
skip_checks=FALSE.

This commit fixes the test suite to work when using
skip_checks=FALSE to force the previous behavior. The
next commit will add new tests to ensure the new behavior
works as intended.

This changeset updates the test suite to test the behavior
of skip_checks = TRUE (which is the default for S3-based hubs).
However, the code as written will not work when there are multiple
file types (e.g., csv and parquet), because it performs an Arrow
open_dataset for each file type. That doesn't work when
exclude_invalid_files is FALSE, because open_dataset will then
grab every file every time it is run.

Because connect_hub and connect_model_output rely on the use of
"exclude_invalid_files=TRUE" when making multiple passes of
arrow::open_dataset (one for each file format), we cannot allow
skip_checks=TRUE for hubs that contain more than one model-output
format. Otherwise, open_dataset would grab all the files every
time and cause errors when a user tries to run queries against
the resulting arrow table.
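
An illustrative sketch of that guard (the variable names are assumed and may differ from the merged hubData code):

```r
# Illustrative sketch -- names may differ from hubData's actual code.
# Reject skip_checks = TRUE when the hub admits more than one file format,
# since each per-format open_dataset() pass would otherwise pick up files
# of every format.
if (isTRUE(skip_checks) && length(file_format) > 1L) {
  cli::cli_abort(
    "{.arg skip_checks} can only be used with a single {.arg file_format}."
  )
}
```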
@bsweger force-pushed the bsweger/skip_data_checks_option branch from 90fa882 to 59d8bc6 on July 25, 2024 at 13:07
@bsweger commented Jul 25, 2024

@annakrystalli Thanks for the review--I pushed updates to address the requested changes, so it's ready for another look!

```r
hub_path_cloud <- s3_bucket("hubverse/hubutils/testhubs/parquet/")
hub_con_cloud <- connect_hub(hub_path_cloud, file_format = "parquet", skip_checks = TRUE)
hub_con_cloud
```
@annakrystalli commented Jul 25, 2024

This is great. I do wonder whether there should be a header for the section to bring attention to the topic (it'll appear in the TOC too) and also an introductory sentence to warn folks that the default behaviour of running these checks can be extremely slow on very large hubs in the cloud?

@annakrystalli merged commit 4f79c69 into hubverse-org:main on Jul 26, 2024
8 checks passed