
Ci quality workflows #1423

Merged

Conversation

enzymezoo-code
Contributor

This lays the foundation for a testing suite. Pytest was chosen for its simplicity, extensibility, and scalability.

The file tests/inference/test_inference.py is an e2e test that generates images and asserts that they are not blank.

The plan for these automated tests is to orchestrate them through GitHub workflows, ensuring code changes are thoroughly tested prior to merging.

Currently, a few of these tests fail (e.g. sampler="uni_pc_bh2" and scheduler="exponential"). We will either need to make these tests pass or decide to limit which sampler+scheduler combinations are fully supported.
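For context, the "not blank" assertion can be as simple as checking that the decoded image is not a near-solid black frame. A minimal sketch (assuming Pillow/NumPy, with an illustrative tolerance rather than the one used in the actual test file):

```python
import numpy as np
from PIL import Image

def assert_image_not_blank(image_path: str) -> None:
    """Fail if the generated image is (near-)solid black, i.e. sampling produced nothing."""
    img = np.asarray(Image.open(image_path).convert("RGB"), dtype=np.float32)
    # A completely black output has (near-)zero mean and zero variance; allow a
    # small tolerance for compression noise.
    assert img.mean() > 1.0 or img.std() > 1.0, f"{image_path} appears to be blank"
```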

@enzymezoo-code
Contributor Author

Another issue:

The comfy/cli_args.py file parses args on import, so the test suite cannot use any of pytest's built-in command-line args as it stands.

Example error message:

```
> pytest --co
tests/inference/test_inference.py:20: in <module>
    from comfy.samplers import KSampler
comfy/samplers.py:5: in <module>
    from comfy import model_management
comfy/model_management.py:3: in <module>
    from comfy.cli_args import args
comfy/cli_args.py:97: in <module>
    args = parser.parse_args()
../../../anaconda3/lib/python3.9/argparse.py:1827: in parse_args
    self.error(msg % ' '.join(argv))
../../../anaconda3/lib/python3.9/argparse.py:2581: in error
    self.exit(2, _('%(prog)s: error: %(message)s\n') % args)
../../../anaconda3/lib/python3.9/argparse.py:2568: in exit
    _sys.exit(status)
E   SystemExit: 2
collecting ... usage: pytest [-h] [--listen [IP]] [--port PORT] [--enable-cors-header [ORIGIN]] [--extra-model-paths-config PATH [PATH ...]] [--output-directory OUTPUT_DIRECTORY] [--temp-directory TEMP_DIRECTORY] [--auto-launch]
              [--disable-auto-launch] [--cuda-device DEVICE_ID] [--cuda-malloc | --disable-cuda-malloc] [--dont-upcast-attention] [--force-fp32 | --force-fp16] [--fp16-vae | --fp32-vae | --bf16-vae]
              [--directml [DIRECTML_DEVICE]] [--disable-ipex-optimize] [--preview-method [none,auto,latent2rgb,taesd]] [--use-split-cross-attention | --use-quad-cross-attention | --use-pytorch-cross-attention]
              [--disable-xformers] [--gpu-only | --highvram | --normalvram | --lowvram | --novram | --cpu] [--disable-smart-memory] [--dont-print-server] [--quick-test-for-ci] [--windows-standalone-build]
              [--disable-metadata]
pytest: error: unrecognized arguments: --co
```

The best solution would be to fix cli_args.py so it does not parse args on import.

Since test_inference.py only uses this import statement to get a list of samplers and schedulers, I will hard-code those lists for now.
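One possible shape for that fix, sketched here only for illustration (not necessarily what gets committed later in this thread), is to keep the parser definition at module level but skip real parsing when pytest owns sys.argv:

```python
# comfy/cli_args.py (sketch)
import argparse
import sys

parser = argparse.ArgumentParser()
# ... existing add_argument() calls ...

if "pytest" in sys.modules:
    # Under pytest, sys.argv belongs to pytest; parse an empty argv so that
    # importing this module never swallows flags like --co.
    args = parser.parse_args([])
else:
    args = parser.parse_args()
```

With something along these lines in place, `pytest --co` and the other built-in pytest flags would no longer be rejected by ComfyUI's parser at import time.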

@M1kep
Contributor

M1kep commented Sep 5, 2023

What sort of runtime speeds are you seeing in GHA?

I see some of the test workflows are using 40 steps; I feel these runs could take hours.

@guill
Contributor

guill commented Sep 6, 2023

Right now, it looks like these tests only check that the workflow completed (and returned a non-black result). There's definitely value in that, but I'm not sure it'll be able to serve as a basis for a full testing suite. In order to get full coverage, we'll want:

  1. Comparison of image results with 'expected' image results (possibly via a perceptual hash to avoid bloating the repository).
  2. Manual review of image results when they differ.
  3. The ability to determine whether caching functioned properly.

IMO, before merging this, we should have simple examples of those test cases. (In particular, I'm concerned about this code's ability to generalize to 2 without creating a second entirely parallel testing system.)
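A perceptual-hash comparison along the lines of item 1 above could be quite small. A sketch using the third-party imagehash package (the hash function and distance threshold are placeholders that would need tuning):

```python
import imagehash
from PIL import Image

def images_match(expected_path: str, actual_path: str, max_distance: int = 4) -> bool:
    """Compare two images by perceptual hash instead of storing full reference images."""
    expected_hash = imagehash.phash(Image.open(expected_path))
    actual_hash = imagehash.phash(Image.open(actual_path))
    # Subtracting two hashes gives the Hamming distance between them; small
    # distances mean perceptually similar images.
    return expected_hash - actual_hash <= max_distance
```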

@ltdrdata
Collaborator

ltdrdata commented Sep 6, 2023

> Right now, it looks like these tests only check that the workflow completed (and returned a non-black result). There's definitely value in that, but I'm not sure it'll be able to serve as a basis for a full testing suite. In order to get full coverage, we'll want:
>
> 1. Comparison of image results with 'expected' image results (possibly via a perceptual hash to avoid bloating the repository).
> 2. Manual review of image results when they differ.
> 3. The ability to determine whether caching functioned properly.
>
> IMO, before merging this, we should have simple examples of those test cases. (In particular, I'm concerned about this code's ability to generalize to 2 without creating a second entirely parallel testing system.)

We should be able to set up a structure to compare reference images and result images by pixel differences.
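A pixel-difference structure like that might look roughly as follows (mean absolute difference over RGB values; the tolerance is a placeholder that would need calibrating against real renders):

```python
import numpy as np
from PIL import Image

def pixel_difference(reference_path: str, result_path: str) -> float:
    """Mean absolute per-pixel difference between a reference image and a test result."""
    ref = np.asarray(Image.open(reference_path).convert("RGB"), dtype=np.float32)
    out = np.asarray(Image.open(result_path).convert("RGB"), dtype=np.float32)
    assert ref.shape == out.shape, "reference and result images must have the same size"
    return float(np.abs(ref - out).mean())

# Example threshold; small sampler nondeterminism should stay well below it.
# assert pixel_difference("ref.png", "result.png") < 2.0
```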

enzymezoo-code and others added 3 commits September 6, 2023 15:50
* Add image comparison tests

* Comparison tests do not pass with empty metadata

* Ensure tests are run in correct order

* Save image files  with test name

* Update tests readme
@enzymezoo-code
Contributor Author

> Right now, it looks like these tests only check that the workflow completed (and returned a non-black result). There's definitely value in that, but I'm not sure it'll be able to serve as a basis for a full testing suite. In order to get full coverage, we'll want:
>
> 1. Comparison of image results with 'expected' image results (possibly via a perceptual hash to avoid bloating the repository).
> 2. Manual review of image results when they differ.
> 3. The ability to determine whether caching functioned properly.
>
> IMO, before merging this, we should have simple examples of those test cases. (In particular, I'm concerned about this code's ability to generalize to 2 without creating a second entirely parallel testing system.)

You're right! A full set of tests needs all those things.

Update:

  1. I've added image comparisons with tests/compare/test_quality.py
  2. Manual review: test_quality.py creates a metrics.md file, which shows which tests passed and which did not. It also creates a folder of grid images to compare "ground truth" to the newly generated image.
  3. Determine whether caching functioned properly: @guill Do you have suggestions on where to start with this?

@enzymezoo-code
Contributor Author

@M1kep Good suggestion. I've reduced the default step count in the tests to 20.

I estimate they'll take about half an hour to run through all the sampler and scheduler combinations (on a 3090 with xformers only).

@guill
Contributor

guill commented Sep 7, 2023

> > Right now, it looks like these tests only check that the workflow completed (and returned a non-black result). There's definitely value in that, but I'm not sure it'll be able to serve as a basis for a full testing suite. In order to get full coverage, we'll want:
> >
> > 1. Comparison of image results with 'expected' image results (possibly via a perceptual hash to avoid bloating the repository).
> > 2. Manual review of image results when they differ.
> > 3. The ability to determine whether caching functioned properly.
> >
> > IMO, before merging this, we should have simple examples of those test cases. (In particular, I'm concerned about this code's ability to generalize to 2 without creating a second entirely parallel testing system.)
>
> You're right! A full set of tests needs all those things.
>
> Update:
>
> 1. I've added image comparisons with tests/compare/test_quality.py
> 2. Manual review: test_quality.py creates a metrics.md file, which shows which tests passed and which did not. It also creates a folder of grid images to compare "ground truth" to the newly generated image.
> 3. Determine whether caching functioned properly: @guill Do you have suggestions on where to start with this?

Awesome! This is coming along great.

Right now, detecting caching requires listening to the websocket events. You can do it one of two ways:

  1. Use the `execution_cached` message to detect which nodes are cached.
  2. Use the `executing` message to detect which nodes are not cached.

If I can be a little selfish, I would vote for the latter. (In my execution model refactor PR, we won't know whether nodes within components can be cached until those components are instantiated.)
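A rough sketch of the second approach, tracking `executing` messages to infer which nodes actually ran (the message shapes follow ComfyUI's /ws protocol as I understand it; the surrounding client/prompt plumbing is illustrative rather than the PR's actual fixture):

```python
import json
import websocket  # from the websocket-client package


def collect_executed_nodes(ws: websocket.WebSocket) -> set:
    """Drain websocket messages for one prompt and return the ids of nodes that executed.

    Any node in the submitted graph that never shows up here was served from cache.
    """
    executed = set()
    while True:
        out = ws.recv()
        if not isinstance(out, str):
            continue  # skip binary preview frames
        message = json.loads(out)
        if message.get("type") == "executing":
            node = message["data"].get("node")
            if node is None:  # node == None signals the prompt has finished
                break
            executed.add(node)
    return executed


# Usage sketch: open the socket with ws.connect(f"ws://{server}/ws?clientId={client_id}")
# *before* POSTing the prompt to /prompt with the same client_id, then call
# collect_executed_nodes(ws) and compare the result against the node ids in the graph.
```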

@guill
Contributor

guill commented Sep 7, 2023

Also, this doesn't necessarily need to be decided in this PR, but I'd like to propose implementing tests using a more code-oriented method of creating graphs rather than exporting the json from the front-end. I think it's going to be a lot easier to deal with in git history, a lot easier to parameterize (such as where you're currently hard-coding the sampler node IDs), and a lot easier for us to reuse common subgraphs to keep tests terse.

As an example, here's what your current test json would look like using the GraphBuilder from my branch (https://github.com/guill/ComfyUI/blob/node_expansion/comfy/graph_utils.py):

```python
from comfy import graph_utils  # GraphBuilder lives in comfy/graph_utils.py on that branch

# SAMPLERS[i] and comfy_client below come from the existing test module
# (the hard-coded sampler list and the server/client fixture).
graph = graph_utils.GraphBuilder(prefix="")

# Initial Sampling
loader = graph.node("CheckpointLoaderSimple", ckpt_name="sd_xl_base_1.0.safetensors")
latent = graph.node("EmptyLatentImage", width=1024, height=1024, batch_size=1)
prompt = graph.node("CLIPTextEncode", text="a photo of a cat", clip=loader.out(1))
# We can have comments too!
# For example, we're explicitly using this instead of another CLIPTextEncode node to test _____.
negative_prompt = graph.node("ConditioningZeroOut", conditioning=prompt.out(0))
sampler = graph.node("KSamplerAdvanced",
    add_noise="enable",
    noise_seed=42,
    steps=20,
    cfg=7.5,
    sampler_name=SAMPLERS[i],
    scheduler="normal",
    start_at_step=0,
    end_at_step=32,
    return_with_leftover_noise="enable",
    model=loader.out(0),
    positive=prompt.out(0),
    negative=negative_prompt.out(0),
    latent_image=latent.out(0))

# Refining
rloader = graph.node("CheckpointLoaderSimple", ckpt_name="sd_xl_refiner_1.0.safetensors")
rprompt = graph.node("CLIPTextEncode", text="a photo of a cat", clip=rloader.out(1))
rnegative_prompt = graph.node("CLIPTextEncode", text="", clip=rloader.out(1))
rsampler = graph.node("KSamplerAdvanced",
    add_noise="disable",
    noise_seed=42,
    steps=20,
    cfg=7.5,
    sampler_name=SAMPLERS[i],
    scheduler="normal",
    start_at_step=32,
    end_at_step=10000,
    return_with_leftover_noise="disable",
    model=rloader.out(0),
    positive=rprompt.out(0),
    negative=rnegative_prompt.out(0),
    latent_image=sampler.out(0))
decode = graph.node("VAEDecode", samples=rsampler.out(0), vae=loader.out(2))

comfy_client.get_images(graph=graph.finalize(), save=False)
```

(Dry-coded, so sorry if there are any missing commas or the like.)

The GraphBuilder itself is pretty small/simple, and could probably just be pulled into this PR if we want to go that route.
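For what it's worth, a builder-based graph also drops straight into pytest parametrization. A minimal sketch, assuming a hypothetical build_sdxl_graph() helper that wraps the snippet above, a stand-in SAMPLERS list, and the comfy_client fixture used in the existing tests:

```python
import pytest

SAMPLERS = ["euler", "dpmpp_2m", "uni_pc_bh2"]  # stand-in for the hard-coded list


@pytest.mark.parametrize("sampler_name", SAMPLERS)
def test_sdxl_sampler(comfy_client, sampler_name):
    # build_sdxl_graph is a hypothetical helper wrapping the GraphBuilder code above,
    # with the sampler name threaded through instead of SAMPLERS[i].
    graph = build_sdxl_graph(sampler_name)
    images = comfy_client.get_images(graph=graph.finalize(), save=False)
    assert len(images) > 0
```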

* Add build test github workflow
@comfyanonymous
Owner

3039b08

This should fix the issue with the args.

@guill
Contributor

guill commented Sep 16, 2023

@enzymezoo-code Do you plan on continuing to push this PR through? If not, I may try to do so myself -- I'm looking to get some unit tests running against my branch (with comparisons against mainline) and this PR looks like a great start.

@enzymezoo-code
Contributor Author

@comfyanonymous Can we merge this in?

@comfyanonymous comfyanonymous changed the base branch from master to temp September 19, 2023 03:17
@comfyanonymous comfyanonymous merged commit 26cd840 into comfyanonymous:temp Sep 19, 2023
HorusElohim pushed a commit to HorusElohim/PyfyUI that referenced this pull request Sep 22, 2023
* Add inference tests

* Clean up

* Rename test graph file

* Add readme for tests

* Separate server fixture

* test file name change

* Assert images are generated

* Clean up comments

* Add __init__.py so tests can run with command line `pytest`

* Fix command line args for pytest

* Loop all samplers/schedulers in test_inference.py

* Ci quality workflows compare (comfyanonymous#1)

* Add image comparison tests

* Comparison tests do not pass with empty metadata

* Ensure tests are run in correct order

* Save image files  with test name

* Update tests readme

* Reduce step counts in tests to ~halve runtime

* Ci quality workflows build (comfyanonymous#2)

* Add build test github workflow
rklaffehn pushed a commit to rklaffehn/ComfyUI that referenced this pull request Nov 14, 2023