improved build workflow #272

Merged
merged 39 commits into from
Mar 23, 2023
10b30cd
reorganize census builder
Mar 17, 2023
fedb07f
Merge branch 'main' into bkmartinjr/265-build-workflow
Mar 17, 2023
2399c97
refactor files in build_soma
Mar 17, 2023
c30d38b
fix GHA unit test
Mar 17, 2023
d3cb272
fix GHA unit test
Mar 17, 2023
613cc46
additional refactoring for top-level workflow
Mar 20, 2023
376e852
Merge branch 'main' into bkmartinjr/265-build-workflow
Mar 20, 2023
4ae07c1
add missing package to dependency list
Mar 20, 2023
a8963ee
cleanup host validation config
Mar 20, 2023
76c6bbd
update test CLI for host validation
Mar 20, 2023
bccd4f1
more namespace refactoring
Mar 21, 2023
6dcb088
add reports to workflow
Mar 21, 2023
e8829e0
lint
Mar 21, 2023
2c2dba9
handle default config correctly
Mar 21, 2023
4680c87
fix typo in defaults
Mar 21, 2023
cb0d3f7
fix report typo
Mar 21, 2023
d307616
fix state load issue; enable multi-process by default
Mar 21, 2023
1f70ab9
fix typo in program name
Mar 21, 2023
e40653a
add build resumption
Mar 21, 2023
18e1b67
dockerfile update
Mar 22, 2023
c35554a
docker build refinement
Mar 22, 2023
fe93cb3
refine builder build process
Mar 22, 2023
5da72ec
add GHA for docker image build
Mar 22, 2023
accb53b
update readme
Mar 22, 2023
b9a2606
Merge branch 'main' into bkmartinjr/265-build-workflow
Mar 22, 2023
40b1c90
fix entry point
Mar 22, 2023
86993ee
more readme edits
Mar 22, 2023
2c8e2b0
fix owlready2 installation in docker image
Mar 22, 2023
f103b5c
PR feedback
Mar 22, 2023
756d6f7
PR feedback
Mar 22, 2023
1510369
fix email address in metadata
Mar 22, 2023
74bd3bf
Merge branch 'main' into bkmartinjr/265-build-workflow
Mar 22, 2023
0566b17
add file size integrity check on downloads
Mar 23, 2023
4618fb1
Merge branch 'main' into bkmartinjr/265-build-workflow
Mar 23, 2023
84acafb
add missing broken process pool logger
Mar 23, 2023
4f50e11
tweak developer Makefile for builder
Mar 23, 2023
e4f4c57
clean up comments
Mar 23, 2023
2eb588e
PR feedback
Mar 23, 2023
509ffd2
fix typo
Mar 23, 2023
30 changes: 28 additions & 2 deletions .github/workflows/py-build.yml
@@ -2,8 +2,9 @@ name: Python cell_census build
 
 on:
   pull_request:
-    paths-ignore:
-      - "api/r/**"
+    paths:
+      - "api/python/**"
+      - "tools/cell_census_builder/**"
   push:
     branches: [main]
   workflow_dispatch:
@@ -34,3 +35,28 @@ jobs:
         uses: actions/upload-artifact@v3
         with:
           path: api/python/cell_census/dist/*
+
+  build_docker_container:
+    name: Build Docker image for Census Builder
+    runs-on: ubuntu-latest
+
+    steps:
+      - uses: actions/checkout@v3
+        with:
+          fetch-depth: 0
+
+      - uses: actions/setup-python@v4
+        with:
+          python-version: "3.10"
+
+      - name: Install deps
+        run: |
+          python -m pip install -U pip setuptools build
+
+      - name: Build package
+        run: python -m build
+        working-directory: tools/cell_census_builder/
+
+      - name: Build image
+        run: docker build --build-arg=COMMIT_SHA=$(git rev-parse --short HEAD) -t cell-census-builder .
+        working-directory: tools/cell_census_builder/
4 changes: 2 additions & 2 deletions .github/workflows/py-unittests.yml
@@ -53,8 +53,8 @@ jobs:
       - name: Install dependencies
         run: |
           python -m pip install -U pip setuptools wheel
-          pip install -r ./tools/scripts/requirements.txt -r ./tools/scripts/requirements-dev.txt
-          pip install -e ./tools/
+          pip install -e ./tools/cell_census_builder/
+          pip install -r ./tools/scripts/requirements-dev.txt
       - name: Test with pytest (builder)
         run: |
           PYTHONPATH=. coverage run --parallel-mode -m pytest ./tools/cell_census_builder/tests/
1 change: 1 addition & 0 deletions .pre-commit-config.yaml
@@ -54,3 +54,4 @@ repos:
           - numpy
           - typing_extensions
           - types-setuptools
+          - types-PyYAML
18 changes: 18 additions & 0 deletions tools/cell_census_builder/Dockerfile
@@ -0,0 +1,18 @@
FROM ubuntu:22.04

ENV DEBIAN_FRONTEND=noninteractive

ARG COMMIT_SHA
ENV COMMIT_SHA=${COMMIT_SHA}

RUN apt update && apt -y full-upgrade && apt -y install python3.10-venv python3-pip awscli

ADD entrypoint.sh /
ADD dist/ /tools/cell_census_builder

RUN python3 -m pip install -U pip Cython wheel build
RUN python3 -m pip install /tools/cell_census_builder/*.whl

WORKDIR /census-build

ENTRYPOINT ["/bin/bash", "/entrypoint.sh"]
24 changes: 24 additions & 0 deletions tools/cell_census_builder/Makefile
@@ -0,0 +1,24 @@
# Build docker image. This Makefile is for convenience in development,
# and as a means to manually build in advance of pushing the image to
# a registry.
#
# COMING SOON: Docker builds for routine use are created by a GHA, and
# will be available in a Docker repository.

# Create the image
.PHONY: image
image: clean
	python3 -m build .
	docker build --build-arg=COMMIT_SHA=$$(git rev-parse --short HEAD) -t cell-census-builder .

# Clean Python build
.PHONY: clean
clean:
	rm -rf build dist

# Prune docker cache
.PHONY: prune
prune:
	docker system prune -f
	if [ "$$(docker ps -aq)" ]; then docker rm -f $$(docker ps -aq) ; fi
	if [ "$$(docker images -q)" ]; then docker rmi -f $$(docker images -q) ; fi
137 changes: 111 additions & 26 deletions tools/cell_census_builder/README.md
@@ -1,16 +1,113 @@
# README

This is a tool to build the SOMA instantiation of the Cell Census schema, as specified in this doc:
This package contains code to build and release the Cell Census in the SOMA format, as specified in the
[data schema](https://github.com/chanzuckerberg/cell-census/blob/main/docs/cell_census_schema.md).

https://docs.google.com/document/d/1GKndzCk9q_1SdYOq3BeCxWgp-o2NSQkEmSBaBPKnNI8/
Collaborator: Unless running from Batch is right around the corner, worth keeping this link or moving its contents here. E.g. you still need to know how to provision EC2 instance and setup swap.

Contributor Author: this was a link to an obsolete schema spec, which I replaced with the correct link. It was not a build process doc, which has never been part of this README (and probably should not IMHO given that it contains internal infra info).

Contributor Author: ps. once I get all this landed, and refined, I will update the "manual build process" doc, not linked here.

Collaborator: ah, great!

This tool is not intended for end-users - it is used by the CELLxGENE team to periodically create and release all
CELLxGENE data in the above format. The remainder of this document is intended for users of the
build package.

CAVEATS (READ THIS):
Please see the top-level [README](../../README.md) for more information on the Cell Census and
using the Cell Census data.

1. The code is written to the still-rapidly-evolving and **pre-release** Python SOMA API, _and will be subject to change_ as the SOMA API and `tiledbsoma` evolve and stabilize.
2. The schema implemented by this code is still evolving and subject to change.
3. The `cell_census_builder` package requires Python 3.9 or later.
## Overview

## Usage
This package contains sub-modules, each of which automates elements of the Cell Census build and release process.
They are wrapped at the package top-level by a `__main__` which implements the Cell Census build process,
with standard defaults.

The top-level build can be invoked as follows:

- Create a working directory, e.g., `census-build` or equivalent.
- If any configuration defaults need to be overridden, create a `config.yaml` in the working directory containing the default overrides. _NOTE:_ by default you do not need to create a `config.yaml` file -- the defaults are appropriate to build the full Census.
- Run the build as `python -m cell_census_builder your-working_dir`

This will perform four steps (more will be added in the future):

- host validation
- build soma
- validate soma
- build reports (e.g., summary)

This will result in the following file tree:

```
working_dir:
|
+-- config.yaml # build config (user provided, read-only)
Member: Could be useful to provide an example config.yaml that can be copied as necessary.

Contributor Author: sure. later info in this doc points the user at the build_state.py file. And by default you should not specify a config file if doing the standard census build. It is only for dev reasons that you would use it.

Contributor Author: example added, plus verbiage about not needing to provide a config for the standard census build

Collaborator: rename to dev-config.yaml?

Contributor Author (bkmartinjr, Mar 23, 2023): I would rather leave as-is. It is the config, for any purpose. I just have defaults set so that it isn't required in our current workflow, but I can easily imagine situations where it is used in the future.

that said, if you feel strongly, I'm not going to force this issue :-)

+-- state.yaml # build runtime state (eg., census version tag, etc)
+-- build-version # defaults to current date, e.g., 2023-01-20
| +-- soma
| +-- h5ads
+-- logs # log files from various stages
| +-- build.log
| +-- ...
+-- reports
+-- census-summary-VERSION.txt
+-- census-diff-VERSION.txt
```
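As a rough illustration of the layout above, the following sketch computes the expected paths for a given working directory. The function `build_layout` and its dictionary keys are hypothetical names and not part of `cell_census_builder`; it also assumes that `VERSION` in the report file names is the same string as the build-version directory name, which the tree suggests but does not state.

```python
from datetime import date
from pathlib import Path
from typing import Dict, Optional


def build_layout(working_dir: str, build_tag: Optional[str] = None) -> Dict[str, Path]:
    """Return the paths a build is expected to produce under working_dir.

    Hypothetical helper: names and keys are illustrative only.
    """
    root = Path(working_dir)
    # The tree above says the build version defaults to the current date.
    tag = build_tag or date.today().isoformat()
    return {
        "config": root / "config.yaml",
        "state": root / "state.yaml",
        "soma": root / tag / "soma",
        "h5ads": root / tag / "h5ads",
        "build_log": root / "logs" / "build.log",
        # Assumption: VERSION in the report names equals the build tag.
        "summary_report": root / "reports" / f"census-summary-{tag}.txt",
        "diff_report": root / "reports" / f"census-diff-{tag}.txt",
    }


if __name__ == "__main__":
    for name, path in build_layout("/tmp/census-build", "2023-01-20").items():
        print(f"{name:15s} {path}")
```

A helper like this can be handy in tests or post-build scripts that need to locate the SOMA output or the reports without hard-coding paths.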

## Building and using the Docker container

### Prerequisites

You will need:

- Linux - known to work on Ubuntu 20 and 22, and should work fine on most other (modern) Linux distros
- Docker - [primary installation instructions](https://docs.docker.com/engine/install/ubuntu/#installation-methods) and [important post-install configuration](https://docs.docker.com/engine/install/linux-postinstall/#manage-docker-as-a-non-root-user)
- Python 3.9+

### Build & run

The standard Census build is expected to be done via a Docker container. To build the required image, do a `git pull` to the version you want to use, and do the following to create a docker image called `cell-census-builder`:

```shell
cd tools/cell_census_builder
make image
```

To use the container to build the _full_ census, with default options, pick a working directory (e.g., /tmp/census-build), and:

```shell
mkdir /tmp/census-build
chmod ug+s /tmp/census-build # optional, but makes permissions handling simpler
docker run --mount type=bind,source="/tmp/census-build",target='/census-build' cell-census-builder
```

### Build configuration options

This is primarily for the use of package developers. The defaults are suitable for the standard Census build, and are defined in the `build_state.py` file.

If you need to override a default, create `config.yaml` in the build working directory and specify the overrides. An example `config.yaml` might look like:

```
verbose: 2 # debug level logging
consolidate: false # disable TileDB consolidation
```
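The override behavior described above can be sketched as a merge of user-supplied values over a table of defaults. This is a minimal sketch under stated assumptions, not the code in `build_state.py`: the names `DEFAULTS`, `merge_config` and `load_config` are hypothetical, the default values shown are placeholders, and rejecting unknown keys is a design choice of this sketch rather than documented builder behavior.

```python
from pathlib import Path
from typing import Any, Dict

# Placeholder defaults for illustration; the real ones live in build_state.py.
DEFAULTS: Dict[str, Any] = {"verbose": 1, "consolidate": True}


def merge_config(defaults: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return defaults updated with overrides, rejecting unknown keys."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return {**defaults, **overrides}


def load_config(working_dir: str) -> Dict[str, Any]:
    """Load config.yaml from working_dir if present, else use the defaults."""
    path = Path(working_dir) / "config.yaml"
    if not path.exists():
        return dict(DEFAULTS)
    import yaml  # PyYAML, listed in the package dependencies

    with path.open() as f:
        overrides = yaml.safe_load(f) or {}  # an empty file parses to None
    return merge_config(DEFAULTS, overrides)
```

With the example `config.yaml` above, `load_config` would return a dictionary with `verbose` set to 2 and `consolidate` set to false, leaving any other defaults untouched.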

### Commands to clean up local Docker state on your EC2 instance (while building an image)

Docker keeps intermediate layers and images around, and if your machine doesn't have enough memory, you might run into issues. You can remove these cached layers and images by running the following commands.

```shell
docker system prune
docker rm -f $(docker ps -aq)
docker rmi -f $(docker images -q)
```
Collaborator: Move into a Makefile target for convenience?

Contributor Author: excellent idea.


## Module-specific notes

### `host_validation` module

This module provides a set of checks that the current host machine has the requisite capabilities
to build the census (e.g., free disk space). It raises an exception (non-zero process exit) if the host is
unable to meet the base requirements.

Stand-alone usage: `python -m cell_census_builder.host_validation`
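To make the idea concrete, here is a minimal sketch of a host check of this kind, using only the standard library. It is not the `host_validation` implementation: the function names, the specific checks (free disk space and CPU count) and the thresholds are all assumptions chosen for illustration.

```python
import os
import shutil
from typing import List


def check_host(path: str = ".", min_free_disk_gb: float = 1.0, min_cpus: int = 1) -> List[str]:
    """Return a list of human-readable problems; an empty list means the host passes.

    Hypothetical checks and thresholds, for illustration only.
    """
    problems = []
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < min_free_disk_gb:
        problems.append(f"free disk {free_gb:.1f} GB is below the required {min_free_disk_gb} GB")
    cpus = os.cpu_count() or 1  # cpu_count() may return None
    if cpus < min_cpus:
        problems.append(f"{cpus} CPUs available, {min_cpus} required")
    return problems


def validate_or_raise(path: str = ".", **requirements: float) -> None:
    """Raise if the host fails any check, so the process exits non-zero."""
    problems = check_host(path, **requirements)
    if problems:
        raise RuntimeError("host validation failed: " + "; ".join(problems))


if __name__ == "__main__":
    validate_or_raise(".", min_free_disk_gb=1.0, min_cpus=1)
    print("host OK")
```

Collecting all problems before raising lets the operator see every unmet requirement in one run, rather than fixing them one at a time.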

### `build_soma` module

Stand-alone use: `python -m cell_census_builder.build_soma ...`

TL;DR:

@@ -25,26 +122,26 @@ The build process:
- Step 3: Write the axis dataframes for each experiment, filtering the datasets and cells to include (serialized iteration of dataset H5ADs).
- Step 4: Write the X layers for each experiment (parallelized iteration of filtered dataset H5ADs).
- Step 5: Write datasets manifest and summary info.
- (Optional) Consolidate TileDB data
- (Optional) Consolidate TileDB data
- (Optional) Validate the entire Cell Census, re-reading from storage.

Modes of operation:
a) (default) creating the entire "cell census" using all files currently in the CELLxGENE repository.
b) creating a smaller "cell census" from a user-provided list of files (a "manifest")

### Mode (a) - creating the full cell census from the entire CELLxGENE (public) corpus:
#### Mode (a) - creating the full cell census from the entire CELLxGENE (public) corpus:

- On a large-memory machine with _ample_ free (local) disk (eg, 3/4 TB or more) and swap (1 TB or more)
- To create a cell census at `<census_path>`, execute:
> $ python -m cell_census_builder -mp --max-workers 12 <census_path> build
- Tips:
- `-v` to view info-level logging during run, or `-v -v` for debug-level logging
- `--test-first-n <#>` to test build on a subset of datasets
- `--build-tag $(date +'%Y%m%d_%H%M%S')` to produce non-conflicting census build directories during testing
- Tips:
- `-v` to view info-level logging during run, or `-v -v` for debug-level logging
- `--test-first-n <#>` to test build on a subset of datasets
- `--build-tag $(date +'%Y%m%d_%H%M%S')` to produce non-conflicting census build directories during testing

If you run out of memory, reduce `--max-workers`. You can also try a higher number if you have lots of CPU & memory.

### Mode (b) - creating a cell census from a user-provided list of H5AD files:
#### Mode (b) - creating a cell census from a user-provided list of H5AD files:

- Create a manifest file, in CSV format, containing two columns: dataset_id, h5ad_uri. Example:
```csv
@@ -55,15 +152,3 @@ If you run out of memory, reduce `--max-workers`. You can also try a higher numb
You can specify a file system path or a URI in the second field
- To create a cell census at `<census_path>`, execute:
> $ python -m cell_census_builder <census_path> build --manifest <the_manifest_file.csv>
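The manifest format described for mode (b) is simple enough to validate before starting a long build. The sketch below parses a two-column CSV of `dataset_id, h5ad_uri` pairs. It is an illustration, not the builder's own parser: `read_manifest` is a hypothetical name, the dataset IDs and paths in the demo are made up, and it assumes the file has no header row and that duplicate dataset IDs are an error.

```python
import csv
import io
from typing import Dict, TextIO


def read_manifest(f: TextIO) -> Dict[str, str]:
    """Parse a manifest of `dataset_id, h5ad_uri` rows into a dict.

    Assumptions of this sketch: no header row, blank lines are skipped,
    and a repeated dataset_id is treated as an error.
    """
    manifest: Dict[str, str] = {}
    for row_num, row in enumerate(csv.reader(f), start=1):
        if not row or all(not cell.strip() for cell in row):
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"line {row_num}: expected 2 columns, got {len(row)}")
        dataset_id, h5ad_uri = (cell.strip() for cell in row)
        if dataset_id in manifest:
            raise ValueError(f"line {row_num}: duplicate dataset_id {dataset_id!r}")
        manifest[dataset_id] = h5ad_uri
    return manifest


if __name__ == "__main__":
    # Made-up entries: the second field may be a local path or a URI.
    demo = io.StringIO("ds1, /data/ds1.h5ad\nds2, s3://example-bucket/ds2.h5ad\n")
    print(read_manifest(demo))
```

Running a check like this first catches malformed rows in seconds, instead of partway through a multi-hour build.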

### Other info

There are more options discoverable via the `--help` command line option.

Note on required host resources:

- all H5AD files not on the local disk will be downloaded/cached locally. There must be
sufficient local file system space. Location of cache can be controlled with the
environment variable `FSSPEC_CACHE_DIR`
- each H5AD will be read into memory, in its entirety. Sufficient RAM must be present to
allow for this (and to do so for multiple H5ADs concurrently if you use the `--multi-process` option)
Empty file.
3 changes: 3 additions & 0 deletions tools/cell_census_builder/entrypoint.sh
@@ -0,0 +1,3 @@
#!/bin/bash

python3 -m cell_census_builder .
72 changes: 72 additions & 0 deletions tools/cell_census_builder/pyproject.toml
@@ -0,0 +1,72 @@
[build-system]
requires = ["setuptools>=45", "setuptools_scm[toml]>=6.2"]
build-backend = "setuptools.build_meta"

[project]
name = "cell_census_builder"
dynamic = ["version"]
description = "Build Cell Census"
authors = [
{ name = "Chan Zuckerberg Initiative", email = "[email protected]" }
]
license = { text = "MIT" }
readme = "README.md"
requires-python = ">= 3.9, < 3.11" # Python 3.11 is pending numba support
classifiers = [
"Development Status :: 2 - Pre-Alpha",
"Intended Audience :: Developers",
"Intended Audience :: Information Technology",
"Intended Audience :: Science/Research",
"License :: OSI Approved :: MIT License",
"Programming Language :: Python",
"Topic :: Scientific/Engineering :: Bio-Informatics",
"Operating System :: POSIX :: Linux",
"Operating System :: MacOS :: MacOS X",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
]
dependencies= [
"typing_extensions",
"pyarrow",
"pandas",
"anndata>=0.8",
"numpy",
"cell_census==0.10.0",
"scipy",
"fsspec",
"s3fs",
"requests",
"aiohttp",
"Cython", # required by owlready2
"wheel", # required by owlready2
"owlready2",
"gitpython",
"attrs>=22.2.0",
"psutil",
"pyyaml",
]

[tool.setuptools.packages.find]
where = ["src"]
include = ["cell_census_builder*"] # package names should match these glob patterns (["*"] by default)
exclude = ["tests*"] # exclude packages matching these glob patterns (empty by default)

[tool.setuptools_scm]
root = "../.."

[tool.black]
line-length = 120
target_version = ['py39']

[tool.mypy]
show_error_codes = true
ignore_missing_imports = true
warn_unreachable = true
strict = true
plugins = "numpy.typing.mypy_plugin"

[tool.ruff]
select = ["E", "F", "B", "I"]
ignore = ["E501", "E402", "C408", ]
line-length = 120
target-version = "py39"
7 changes: 7 additions & 0 deletions tools/cell_census_builder/src/cell_census_builder/__init__.py
@@ -0,0 +1,7 @@
from importlib import metadata

try:
    __version__ = metadata.version("cell_census_builder")
except metadata.PackageNotFoundError:
    # package is not installed
    __version__ = "0.0.0-unknown"