
Bsweger/skip data checks option #47

Merged

Conversation

@bsweger commented Jul 9, 2024

Fixes #43
Fixes #37

This PR adds a skip_checks option to hubData's connect_hub and connect_model_output functions. When set to TRUE, these functions will skip the individual file checks in the hub's model output directory (the default value is FALSE, i.e., no behavior change).

skip_checks improves performance when connecting to a cloud-based hub, though it can only be used on hubs that have a single file format in their model output directory.


Timing without skip_checks: 8 min
4042 model output files

> hub_path <- s3_bucket('s3://uscdc-flusight-hub-v1')
> s=Sys.time(); connect_hub(hub_path, file_format = "parquet"); Sys.time()-s;
[snip]
Time difference of 8.027225 mins

Timing with skip_checks: 8 sec
4042 model output files

> hub_path <- s3_bucket('s3://uscdc-flusight-hub-v1')
> s=Sys.time(); connect_hub(hub_path, file_format = "parquet", skip_checks = TRUE); Sys.time()-s;
[snip]
Time difference of 8.030099 secs

@bsweger commented Jul 9, 2024

@annakrystalli This isn't ready for a full review, but before going further, would love your thoughts on the direction of these changes.

[These questions turned out longer than I'd anticipated--maybe worth a synchronous chat?]

The new option to skip data checks has the performance impact we want: connect_hub against the FluSight archives was taking ~9 minutes to run on my laptop and now completes in anywhere between 5 and 7 seconds.

hub_path_cloud <- s3_bucket("uscdc-flusight-hub-v1")
# the call below will fail if we don't specify file_format, because the hub's valid
# formats are parquet AND csv (and we're no longer validating file formats during open_dataset)
# https://github.com/hubverse-org/flusight_hub_archive/blob/main/hub-config/admin.json#L10
hub_con <- connect_hub(hub_path_cloud, file_format = "parquet")

That said, I don't love how the change introduces different default behavior for S3-based hubs. Should we:

  1. force users of S3-based hubs to explicitly pass in skip_checks = TRUE, thereby retaining the default behavior of skip_checks = FALSE regardless of hub location? OR
  2. accept that cloud based hubs have different needs and favor doing the "right" thing by default?

If we accept the second option, it doesn't really make sense that we set skip_checks to TRUE on behalf of S3 users but don't set file_format to parquet, when both parameters are necessary most of the time. I feel okay about the skip_checks default, but I don't think we should assume that every S3-based hub will contain only parquet files.

Given the above (and considering the possibility that our arrow options for cloud-based hubs may diverge even further from the local options as we learn more), does it make sense to try using DuckDB instead of arrow in connect_hub.SubTreeFileSystem (or file an issue to do that later)? A DuckDB approach would (rough sketch after this list):

  • sidestep skip_checks and prevent us from handling S3 differently than other cloud file systems
  • but may introduce new classes of errors and edge cases
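
For concreteness, here is a minimal sketch of what a DuckDB-backed connection might look like. This is illustrative only and not part of this PR: it assumes the duckdb R package with DuckDB's httpfs extension, the S3 glob is hypothetical, and a public bucket may additionally need anonymous-credential configuration.

```r
# Illustrative only -- not part of this PR. Assumes the duckdb R package
# and DuckDB's httpfs extension; the S3 glob below is hypothetical.
library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb())
dbExecute(con, "INSTALL httpfs")
dbExecute(con, "LOAD httpfs")
dbExecute(con, "SET s3_region = 'us-east-1'")

# DuckDB plans the scan over every file matching the glob lazily, with no
# up-front per-file validation pass (unlike exclude_invalid_files = TRUE).
preview <- dbGetQuery(con, "
  SELECT *
  FROM read_parquet('s3://uscdc-flusight-hub-v1/model-output/*/*.parquet')
  LIMIT 5
")
```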

@annakrystalli commented

Just to note, I have not looked at the code yet, so this is a more high-level response.

I totally agree with you on all your areas and degrees of hesitation.

Personally, I feel it's worse for the default settings to return an error than to be slow, and I definitely don't feel the default for S3 hub connections should be to assume the hub is a hubverse-transformed hub and therefore set the format to parquet by default.

Given the above, I almost feel like we should not have different defaults for local vs. S3, but instead issue a performance warning with more (or links to more) information about skipping certain checks and their implications/expectations.

This is also making me feel even more strongly that the very special situation of a hubverse cloud-transformed hub should be detectable/communicated and able to override defaults.
I continue to think a special file might be the way to go (rough sketch below). Marker files like this are certainly common in R (e.g. .Rproj files, _pkgdown files, etc.), I'm pretty sure they are in Python too, and such a file could also act as a location for further hubverse transform configurations (e.g. partitions, etc.)
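
For illustration, detection could be as simple as probing for the marker at the hub root via the arrow filesystem object. A minimal sketch, assuming a hypothetical `.cloud-hub` filename (the helper name is made up too):

```r
# Sketch only: ".cloud-hub" and has_cloud_marker() are hypothetical.
# hub_path is an arrow SubTreeFileSystem, e.g. from arrow::s3_bucket().
has_cloud_marker <- function(hub_path) {
  # Opening the marker file succeeds only if it exists
  marker <- try(hub_path$OpenInputStream(".cloud-hub"), silent = TRUE)
  !inherits(marker, "try-error")
}

# connect_hub() could then flip cloud-friendly defaults:
# if (has_cloud_marker(hub_path)) skip_checks <- TRUE
```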

@annakrystalli commented

Re: DuckDB, I am definitely curious and open to exploring that as the default!

@annakrystalli commented

One last thought. If not a special file, perhaps a dedicated function for specifically accessing hubverse transformed cloud hubs with appropriate defaults might work?

@bsweger commented Jul 9, 2024

> One last thought. If not a special file, perhaps a dedicated function for specifically accessing hubverse transformed cloud hubs with appropriate defaults might work?

Thanks for weighing in--we're in agreement that this PR as written is not the way to go for solving our performance issues on cloud-based data.

I'm not clear on what you mean by a "special file." Can you say more about that?

I think trying DuckDB is a worthy experiment: the performance limitations we're hitting with connect_hub aren't related to partitioning or other physical storage strategies. We seem to be hitting Arrow-related limitations since we can't even get a lazy table object open without S3-related performance issues.

For example, I tried passing a specific list of S3 URIs (instead of the model-output directory) to open_dataset, in hopes that Arrow could open the correct files without needing exclude_invalid_files = TRUE to scan the entire S3 contents.

That strategy was not successful (branch here), seemingly because Arrow's R implementation will make an S3 call for every file in the list: apache/arrow#35715 (this issue was filed by a friend of yours 😄 ).
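
A rough reconstruction of that attempt, with made-up URIs (the real code is on the linked branch):

```r
# Rough reconstruction -- the URIs are invented for illustration.
library(arrow)

uris <- c(
  "s3://uscdc-flusight-hub-v1/model-output/teamA-model/2023-10-14-teamA-model.parquet",
  "s3://uscdc-flusight-hub-v1/model-output/teamB-model/2023-10-14-teamB-model.parquet"
)

# Even with an explicit file list, arrow's R bindings make an S3 call per
# file (apache/arrow#35715), so this stays slow for thousands of files.
ds <- open_dataset(uris, format = "parquet")
```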

If DuckDB can provide a consistent interface for all cloud-based hubs, regardless of where they are hosted or what the file format is, my $0.02 is that it would be preferable to options that require special handling for Hubverse-hosted connections.

@annakrystalli commented

> I'm not clear on what you mean by a "special file." Can you say more about that?

I was referring to the suggestion I had made in this issue: #37 (comment), i.e.:

> It would be nicer however for the function to be able to detect that it's dealing with a cloud hub where the above conditions are guaranteed and set those checks to off by default in that case. Perhaps through including a .cloud-hub file in the root of the hub that the function could use to detect such a hub?

> That strategy was not successful (branch here), seemingly because Arrow's R implementation will make an S3 call for every file in the list: apache/arrow#35715 (this issue was filed by a friend of yours 😄).

😜 yeah, I think I experimented with that initially and then abandoned it...

> If DuckDB can provide a consistent interface for all cloud-based hubs, regardless of where they are hosted or what the file format is, my $0.02 is that it would be preferable to options that require special handling for Hubverse-hosted connections.

💯 I've always felt that collecting everything into some sort of database is the right and most robust approach, and should be an option in the hubverse ecosystem, so I also agree that trying DuckDB is a worthy experiment!

@bsweger commented Jul 17, 2024

@annakrystalli Been thinking about this some more. @matthewcornell and I are continuing to explore DuckDB for consuming data from a cloud-based hub.

What we learn will likely be relevant to potentially using DuckDB in hubData, but in the meantime, what do you think about solving the immediate pain by returning to the "skip data checks" idea, just implemented differently than in this current PR?

Agree with your statement that defaults shouldn't cause errors. So what if we add "skip checks" but never make it the default (hypothetical signature sketched below)? That way we'd have an option that allows people to connect to something like the FluSight archives, even if they have to be explicit.
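
Sketched as a signature, that might look like the following. This is a hypothetical simplification, not hubData's actual function definition:

```r
# Hypothetical sketch of the proposal -- not hubData's actual signature.
# skip_checks defaults to FALSE for every hub, local or S3.
connect_hub <- function(hub_path, file_format = NULL, skip_checks = FALSE) {
  # ... existing connection logic, skipping file checks only on request ...
}

# Explicit opt-in for a large cloud hub such as the FluSight archives:
# connect_hub(s3_bucket("uscdc-flusight-hub-v1"),
#             file_format = "parquet", skip_checks = TRUE)
```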

@annakrystalli commented

Yes, agree that's the easiest effective way forward.

Also what do you think about this suggestion?

> Given the above I almost feel like we should not have different defaults for local vs. S3 but instead issue a performance warning on cloud with more (or links to more) information about skipping certain checks and their implications/expectations.

Is it weird to issue a message by default when connecting to cloud?

@annakrystalli commented Jul 18, 2024

BTW when you're ready for an actual code review just let me know!

@bsweger force-pushed the bsweger/skip_data_checks_option branch 6 times, most recently from 1376dd7 to 4f52c73 on July 19, 2024 at 18:26
@bsweger marked this pull request as ready for review on July 19, 2024 at 18:36
@bsweger commented Jul 19, 2024

@annakrystalli PR is ready for review!

> Is it weird to issue a message by default when connecting to cloud?

A message like that would probably help people, but I wasn't sure where to put it. After connect_hub/connect_model_output runs is too late. We could preemptively display a message about skip_checks after someone creates an s3_bucket object?

Happy to update this PR, but if we don't have a good idea/wording in the short-term, I'd suggest merging this and revisiting later.

@annakrystalli left a comment

This all looks pretty good to me, just some minor change requests.

I think it would also be good to have a short section about this in the connect_hub vignette: https://github.com/hubverse-org/hubData/blob/main/vignettes/articles/connect_hub.Rmd

@annakrystalli commented

One last comment: we generally also need to bump the version in the DESCRIPTION file, although given that I'm going to be merging this in after #49, I can take care of it when I resolve the merge conflicts, if you like.

This changeset adds an optional skip_checks parameter to
connect_hub.R and connect_model_output.R per the requirements
outlined in hubverse-org#37.

When working with hub data on a local filesystem, the behavior
is unchanged. When working with hub data in an S3 bucket, the
connect functions will now skip data checks by default to
improve performance. The former connection behavior for
S3-based hubs can be obtained by explicitly setting
skip_checks=FALSE.

This commit fixes the test suite to work when using
skip_checks=FALSE to force the previous behavior. The
next commit will add new tests to ensure the new behavior
works as intended.

This changeset updates the test suite to test the behavior
of skip_checks = TRUE (which is the default for S3-based hubs).
However, the code as written will not work when there are multiple
file types (e.g., csv and parquet), because it performs an Arrow
open_dataset for each file type. That doesn't work when
exclude_invalid_files is FALSE, because open_dataset will then
grab every file every time it is run.

Because connect_hub and connect_model_output rely on the use of
"exclude_invalid_files=TRUE" when making multiple passes of
arrow::open_dataset (one for each file format), we cannot allow
skip_checks=TRUE for hubs that contain more than one model-output
format. Otherwise, open_dataset would grab all the files every
time and cause errors when a user tries to run queries against
the resulting arrow table.
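
An illustrative sketch of that guard (the variable names are assumed and may differ from the merged hubData code):

```r
# Illustrative sketch -- names may differ from hubData's actual code.
# Reject skip_checks = TRUE when the hub admits more than one file format,
# since each per-format open_dataset() pass would otherwise pick up files
# of every format.
if (isTRUE(skip_checks) && length(file_format) > 1L) {
  cli::cli_abort(
    "{.arg skip_checks} can only be used with a single {.arg file_format}."
  )
}
```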
@bsweger force-pushed the bsweger/skip_data_checks_option branch from 90fa882 to 59d8bc6 on July 25, 2024 at 13:07
@bsweger commented Jul 25, 2024

@annakrystalli Thanks for the review--I pushed updates to address the requested changes, so it's ready for another look!

```r
hub_path_cloud <- s3_bucket("hubverse/hubutils/testhubs/parquet/")
hub_con_cloud <- connect_hub(hub_path_cloud, file_format = "parquet", skip_checks = TRUE)
hub_con_cloud
```
@annakrystalli commented Jul 25, 2024

This is great. I do wonder whether there should be a header for the section to bring attention to the topic (it'll appear in the TOC too) and also an introductory sentence to warn folks that the default behaviour of running these checks can be extremely slow on very large hubs in the cloud?

@annakrystalli merged commit 4f79c69 into hubverse-org:main on Jul 26, 2024
8 checks passed