Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Builder: additional workflow steps #351

Merged
merged 6 commits into from
Apr 5, 2023
Merged

Conversation

bkmartinjr
Copy link
Contributor

@bkmartinjr bkmartinjr commented Apr 5, 2023

Implements the remaining build workflow steps as a fully automated process.

  1. Pre-build host checks
  2. Build SOMA
  3. Validate SOMA
  4. Create reports
  5. Copy data to S3 bucket
  6. Update the release.json to do the release
  7. Clean up old releases
  8. Copy logs to S3

This PR adds steps 5-8, although step 7 was present as a manually invoked sub-module prior to this PR.

Changes/additions:

Note to reviewers: if you wish to test it, you can do so by creating a build dir, including a config.yaml that overrides the destination buckets. For example, the following config.yaml is a quick way to test with an abbreviated SOMA build:

cellxgene_census_S3_path: "s3://your-test-bucket/cell-census"
logs_S3_path: "s3://your-test-bucket/builder"
test_first_n: 15
host_validation_disable: True
release_cleanup_days: 10

Caveat: this approach requires that at least one release exist in the specified bucket, along with a release.json, etc. The release can be faked (i.e., the soma and h5ads paths can contain any empty file).

@bkmartinjr bkmartinjr marked this pull request as ready for review April 5, 2023 03:03
Copy link
Collaborator

@atolopko-czi atolopko-czi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Suggesting we upload reports too and two other very minor suggestions.

"--dryrun",
action=argparse.BooleanOptionalAction,
default=True,
help="Display, but do NOT perform actions. Useful for previewing actions. Default: True",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
help="Display, but do NOT perform actions. Useful for previewing actions. Default: True",
help="Display, but do NOT perform S3 data update actions, including Census data upload, logs upload, and removal of older Census releases. Useful for previewing actions. Default: True",

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This suggested edit does not make any sense in this test main. The only function provided by this CLI is to sync/copy a user-specified directory to S3 (any directory). It is completely agnostic about which directory and which S3 path.

Can you clarify what you are trying to accomplish?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, sorry, wrong place for the edit, this is not top-level entryp oint. If there's a better place to doc what dryrun does overall (config file), that's where I'd put it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmm...I'm not sure. It literally does what it says currently - short-circuits the data copy, at the very lowest level. It is primarily intended for testing purposes.

I'll give it some thought.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is minor, don't give it too much thought. :). I simply figured it would be good to document in one place what dryrun does in total, since it affects a few of the pipeline's steps. "Skips all S3 updates" is sufficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the config docs in build_state.py already contained:

"dryrun": False, # if True, will disable copy of data/logs/reports to S3 buckets

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expanded comment, and extended this config to apply to ALL S3 writes (including the write of release.json).

@@ -58,6 +58,9 @@ def do_build(args: CensusBuildArgs, skip_completed_steps: bool = False) -> int:
do_build_soma,
do_validate_soma,
do_create_reports,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also sync reports, and I don't think that's currently handled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was not in the build process the last time I checked, but is easily added.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wasn't, I just realized the omission today while doing the release follow-up tasks. Ty!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

"uri": urlcat(args.config.cellxgene_census_S3_path, args.build_tag, "h5ads/"),
"s3_region": "us-west-2",
},
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose this release declaration could be moved into make_a_release, which would allow it to be unit tested.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushing all of the data layout assumptions down to the module feels like it pollutes the abstraction. I'll add some tests for the main.do_* methods instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added tests for all of the new workflow steps, called from the top-level

Copy link
Collaborator

@atolopko-czi atolopko-czi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you adding the reports sync and additional tests.

"consolidate": True,
"disable_dirty_git_check": True,
"dryrun": False, # if True, will disable copy of data/logs/etc to S3 buckets
"dryrun": False, # if True, will disable copy of data/logs/reports/release.json to S3 buckets. Will NOT disable local build, etc.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@bkmartinjr bkmartinjr merged commit c269115 into main Apr 5, 2023
@bkmartinjr bkmartinjr deleted the bkmartinjr/do-release-staging branch April 5, 2023 19:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Builder workflow: add release step Builder workflow: add data copy step
2 participants