Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add collect_hub function #18

Merged
merged 16 commits into from
Apr 1, 2024
Merged

Add collect_hub function #18

merged 16 commits into from
Apr 1, 2024

Conversation

annakrystalli
Copy link
Contributor

@annakrystalli annakrystalli commented Mar 27, 2024

This PR adds a collect_hub() function which wraps dplyr::collect() but also converts the output to a model_out_tbl class object by default where possible. The function also accepts additional arguments that can be passed to as_model_out_tbl().

Th PR resolves:

I've also modified the R CMD Check workflow to run nightly so that we can pick up any issues arising from upgrades in dependencies promptly. Let's test it out and I can roll it out to all our mature packages when ready.

Copy link

codecov bot commented Mar 27, 2024

Codecov Report

Attention: Patch coverage is 93.33333% with 2 lines in your changes are missing coverage. Please review.

Project coverage is 87.23%. Comparing base (18ff938) to head (ca2eec0).

Files Patch % Lines
R/collect_hub.R 90.47% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #18      +/-   ##
==========================================
+ Coverage   86.98%   87.23%   +0.25%     
==========================================
  Files           9       10       +1     
  Lines         676      705      +29     
==========================================
+ Hits          588      615      +27     
- Misses         88       90       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link

github-actions bot commented Mar 27, 2024

@annakrystalli
Copy link
Contributor Author

annakrystalli commented Mar 27, 2024

The macos latest build is failing because of the temporary issue described in #15 . Not sure what to do about it. I could fix it by installing from the Apache R Universe version in the workflow. That's not where most users would do though so, until we swapped it back to CRAN, we
a. wouldn't get errors representative of median user experience
b. I won't know when the issue is fixed unless I explictly track when a macOS arrow fix is pushed to CRAN.

Happy to hear people's thoughts.

@bsweger
Copy link
Contributor

bsweger commented Mar 29, 2024

Leaving this here for other reviewers who might want to fetch this feature branch and give it a spin locally. Where hubverse-infrastructure-test is an S3 bucket with valid hub-config contents and an empty model-output directory:

dplyr::collect() 😢

> load_all()
ℹ Loading hubData
> hub_bad_path <- s3_bucket('hubverse-infrastructure-test')
> hubData::connect_hub(hub_path = hub_bad_path) %>% dplyr::collect()
Error in UseMethod("collect") : 
  no applicable method for 'collect' applied to an object of class "c('hub_connection', 'list')"
In addition: Warning message:
In hubData::connect_hub(hub_path = hub_bad_path) :
  No files of file formats "csv", "parquet", and "arrow" found in model output directory.

hub_collect() 😀

> tbl <- connect_hub(hub_bad_path) %>% collect_hub()
Warning messages:
1: In connect_hub(hub_bad_path) :
  No files of file formats "csv", "parquet", and "arrow" found in model output directory.
2: Hub is empty. No data to collect. Returning `NULL` 

Copy link
Contributor

@bsweger bsweger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @annakrystalli. Glad we're trying to handle the hard bits on behalf of users!

One or two inline notes, but nothing that would prevent rolling out this improvement.

@@ -5,6 +5,8 @@ on:
branches: [main, master]
pull_request:
branches: [main, master]
schedule:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense! Would love to get to a place where these kinds of small operational changes can be in separate PRs so we can get 'em merged in without waiting for review of new features.

@@ -54,12 +54,26 @@ hub_con

To access data from a hub connection you can use dplyr verbs and construct querying pipelines.

You can use `dplyr`'s `collect()` function:
Copy link
Contributor

@bsweger bsweger Mar 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we have collect_hub(), is there any reason someone would want to use dplyr collect()?

As someone with less R proficiency than many folks on the team, I'm left wondering what to do when presented with multiple options like this. Is it worth recommending a default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

collect_hub is mainly a wrapper around dplyr::collect() (a very well known tidyverse function) with some extras. It depends on what they are doing with their data next but there is no reason they must use collect_hub(), it just conveniently outputs a model_out_tbl which many downstream hubverse package functions expect.

I've refactored the article a bit to bring more attention to the benefits of collect_hub and also used it in the connect_hub examples but ultimately collect will work just as well. It just might need an extra step to coerce data to model_out_tbl if used in downstream hubverse functionality

@annakrystalli annakrystalli merged commit 1aa242a into main Apr 1, 2024
9 of 10 checks passed
@annakrystalli annakrystalli deleted the feature/collect branch April 1, 2024 15:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants