Add collect_hub function #18

annakrystalli · 2024-03-27T10:52:00Z

This PR adds a collect_hub() function which wraps dplyr::collect() but also converts the output to a model_out_tbl class object by default where possible. The function also accepts additional arguments that can be passed to as_model_out_tbl().
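
As a rough illustration of the behaviour described above, here is a minimal base-R sketch of the "collect, then convert where possible" pattern. It is not the implementation in this PR: `collect_hub_sketch()`, `as_model_out_tbl_sketch()`, the `collect_fn` and `silent` arguments, and the required column names are all assumptions made for the example, and a plain data frame stands in for the result of `dplyr::collect()`.

```r
# Columns assumed (for this sketch only) to be required for conversion.
required_cols <- c("model_id", "output_type", "output_type_id", "value")

# Hypothetical stand-in for as_model_out_tbl(): checks columns, tags the class.
as_model_out_tbl_sketch <- function(tbl) {
  missing_cols <- setdiff(required_cols, names(tbl))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  class(tbl) <- c("model_out_tbl", class(tbl))
  tbl
}

# Hypothetical wrapper: collect first, then convert, falling back to the
# plain collected data when conversion is not possible.
collect_hub_sketch <- function(x, collect_fn = identity, silent = FALSE) {
  tbl <- collect_fn(x)
  tryCatch(
    as_model_out_tbl_sketch(tbl),
    error = function(e) {
      if (!silent) {
        warning(
          "Could not convert to model_out_tbl; returning collected data as is.",
          call. = FALSE
        )
      }
      tbl
    }
  )
}

# Made-up example data, not taken from any hub.
ok <- data.frame(
  model_id = "team-a", output_type = "mean",
  output_type_id = NA, value = 1.5
)
bad <- data.frame(x = 1)

converted <- collect_hub_sketch(ok)                # gains the model_out_tbl class
fallback <- collect_hub_sketch(bad, silent = TRUE) # returned unchanged
```

The fallback branch is the "where possible" part: data that cannot be converted is still returned rather than raising an error.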

This PR resolves:

I've also modified the R CMD Check workflow to run nightly so that we can pick up any issues arising from upgrades in dependencies promptly. Let's test it out and I can roll it out to all our mature packages when ready.

codecov · 2024-03-27T10:56:23Z

Codecov Report

Attention: Patch coverage is 93.33333%, with 2 lines in your changes missing coverage. Please review.

Project coverage is 87.23%. Comparing base (18ff938) to head (ca2eec0).

Files	Patch %	Lines
R/collect_hub.R	90.47%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #18      +/-   ##
==========================================
+ Coverage   86.98%   87.23%   +0.25%     
==========================================
  Files           9       10       +1     
  Lines         676      705      +29     
==========================================
+ Hits          588      615      +27     
- Misses         88       90       +2

github-actions · 2024-03-27T12:24:58Z

🚀 Deployed on https://660acaa966b233bc1dc1cffd--hubdata-pr-preview.netlify.app

annakrystalli · 2024-03-27T12:37:39Z

The macOS-latest build is failing because of the temporary issue described in #15. Not sure what to do about it. I could fix it by installing the Apache R Universe version of arrow in the workflow. That's not where most users would install from, though, so until we swapped it back to CRAN:
a. we wouldn't get errors representative of the median user experience
b. I wouldn't know when the issue is fixed unless I explicitly track when a macOS arrow fix is pushed to CRAN.

Happy to hear people's thoughts.

bsweger · 2024-03-29T13:58:46Z

Leaving this here for other reviewers who might want to fetch this feature branch and give it a spin locally. Where hubverse-infrastructure-test is an S3 bucket with valid hub-config contents and an empty model-output directory:

dplyr::collect() 😢

> load_all()
ℹ Loading hubData
> hub_bad_path <- s3_bucket('hubverse-infrastructure-test')
> hubData::connect_hub(hub_path = hub_bad_path) %>% dplyr::collect()
Error in UseMethod("collect") : 
  no applicable method for 'collect' applied to an object of class "c('hub_connection', 'list')"
In addition: Warning message:
In hubData::connect_hub(hub_path = hub_bad_path) :
  No files of file formats "csv", "parquet", and "arrow" found in model output directory.

collect_hub() 😀

> tbl <- connect_hub(hub_bad_path) %>% collect_hub()
Warning messages:
1: In connect_hub(hub_bad_path) :
  No files of file formats "csv", "parquet", and "arrow" found in model output directory.
2: Hub is empty. No data to collect. Returning `NULL`

bsweger

Thanks, @annakrystalli. Glad we're trying to handle the hard bits on behalf of users!

One or two inline notes, but nothing that would prevent rolling out this improvement.

bsweger · 2024-03-29T14:07:18Z

.github/workflows/R-CMD-check.yaml

@@ -5,6 +5,8 @@ on:
    branches: [main, master]
  pull_request:
    branches: [main, master]
+  schedule:


Makes sense! Would love to get to a place where these kinds of small operational changes can be in separate PRs so we can get 'em merged in without waiting for review of new features.

bsweger · 2024-03-29T14:11:04Z

vignettes/articles/connect_hub.Rmd

@@ -54,12 +54,26 @@ hub_con

 To access data from a hub connection you can use dplyr verbs and construct querying pipelines.

+You can use `dplyr`'s `collect()` function:


Now that we have collect_hub(), is there any reason someone would want to use dplyr collect()?

As someone with less R proficiency than many folks on the team, I'm left wondering what to do when presented with multiple options like this. Is it worth recommending a default?

annakrystalli · Apr 1, 2024

collect_hub() is mainly a wrapper around dplyr::collect() (a very well-known tidyverse function) with some extras. It depends on what they are doing with their data next, but there is no reason they must use collect_hub(). It just conveniently outputs a model_out_tbl, which many downstream hubverse package functions expect.

I've refactored the article a bit to bring more attention to the benefits of collect_hub() and also used it in the connect_hub examples, but ultimately collect() will work just as well. It just might need an extra step to coerce the data to a model_out_tbl if it is used in downstream hubverse functionality.

annakrystalli added 6 commits March 26, 2024 11:35

Run R CMD Check nightly

7bf138d

Add collect_hub function. Resolves #17

069fc0e

update s3_bucket re-export docs

834ec58

Add note about using arrow::to_duckdb to extend available queries

dad7655

reduce project std space number for tabs to conform to linter

9b32b03

Add note about arrow installation troubleshooting. Resolves #15

9bfcbcc

Add more detail to news listing

0d4189a

annakrystalli added 5 commits March 27, 2024 12:58

Break up long line

d61860c

restyle

7466eaf

Add duckdb to Suggests

de6315a

add namespace to exprs

2b82287

add dbplyr to suggests

f9664c9

annakrystalli added 2 commits March 28, 2024 17:39

Replace all null task id properties with required = NA

72c4f4a

Merge pull request #19 from Infectious-Disease-Modeling-Hubs/enhancem…

3f12b29

…ent/handle-null-taskids Replace all null task id properties with required = NA

bsweger approved these changes Mar 29, 2024

nickreich mentioned this pull request Mar 29, 2024

add netlify integrataion for preview builds on PRs reichlab/reichlab.github.io#175

Closed

annakrystalli added 2 commits April 1, 2024 17:50

use collect_hub in examples

808b08f

Mention cloud connections (Resolves #22). Minor article refactor to e…

ca2eec0

…mphasize key features of tools

annakrystalli merged commit 1aa242a into main Apr 1, 2024
9 of 10 checks passed

annakrystalli deleted the feature/collect branch April 1, 2024 15:02
