
[Serve] resnet50 benchmarking #29096

Merged · 10 commits merged into ray-project:master on Oct 17, 2022

Conversation

@sihanwang41 (Contributor) commented Oct 5, 2022

Signed-off-by: Sihan Wang [email protected]

Why are these changes needed?

  • ResNet50 is used, following MLCommons.
  • While MLCommons tests only model inference, this benchmark also covers the image download and image-to-tensor conversion steps, which is closer to a real-world use case.
  • In the release tests, the CPU handles downloading and tensor conversion, and the GPU handles model inference. Sending all images to a single replica showed high latency due to the large data transfer, so the images are intentionally split across different replicas for downloading and tensor conversion, which boosts throughput considerably (see the sketch below).
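
Below is a minimal sketch of the CPU-preprocess / GPU-inference split described above, assuming the modern Ray Serve deployment and handle APIs. The deployment names, replica counts, and preprocessing pipeline are illustrative stand-ins, not the PR's actual benchmark code.

import asyncio
import io

import requests
import torch
from PIL import Image
from torchvision import models, transforms

from ray import serve


@serve.deployment(num_replicas=4, ray_actor_options={"num_cpus": 1})
class Preprocessor:
    """CPU replicas: download an image and convert it to a tensor."""

    def __init__(self):
        self.transform = transforms.Compose(
            [transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor()]
        )

    def __call__(self, uri: str) -> torch.Tensor:
        raw = requests.get(uri, timeout=60).content
        image = Image.open(io.BytesIO(raw)).convert("RGB")
        return self.transform(image)


@serve.deployment(ray_actor_options={"num_gpus": 1})
class Resnet50:
    """GPU replica: runs model inference on preprocessed tensors."""

    def __init__(self):
        self.model = models.resnet50(pretrained=True).eval().cuda()

    def __call__(self, tensors: list) -> list:
        with torch.no_grad():
            logits = self.model(torch.stack(tensors).cuda())
        return logits.argmax(dim=1).cpu().tolist()


@serve.deployment
class Ingress:
    def __init__(self, preprocessor, model):
        self.preprocessor = preprocessor
        self.model = model

    async def __call__(self, request):
        uris = await request.json()
        # Fan the URIs out across the Preprocessor replicas so no single
        # replica pays the full download/convert cost, then run inference.
        tensors = await asyncio.gather(*[self.preprocessor.remote(u) for u in uris])
        return await self.model.remote(list(tensors))


app = Ingress.bind(Preprocessor.bind(), Resnet50.bind())
# serve.run(app)  # then send JSON lists of image URIs to localhost:8000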

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@sihanwang41 changed the title from "[Serve] restnet 50 benchmarking" to "[Serve] restnet50 benchmarking" on Oct 5, 2022
@architkulkarni (Contributor) left a comment:

Looks good to me, just had a few questions.

Do you mind adding a short description in a comment at the top of benchmark.py? It could just be what you have in the PR description.

release/serve_tests/workloads/serve_resnet_benchmark.py (outdated)

async def fetch(session):
    async with session.get(
        "http://localhost:8000/", json=input_uris * int(data_size / len(input_uris))
Contributor:

Looks like we're repeating the images here. Probably a dumb question, but the inference doesn't do any caching of the results anywhere, right? If it did, the benchmark wouldn't be correct.

@sihanwang41 (Contributor, author):

Yes, we don't do any caching. Caching results is out of scope for this PR.
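
For reference, a runnable version of the fetch helper quoted above might look like the sketch below. The input_uris, data_size, and endpoint payload are illustrative stand-ins, not the PR's exact values.

import asyncio

import aiohttp

input_uris = ["https://example.com/cat.jpg", "https://example.com/dog.jpg"]
data_size = 100  # total number of (possibly repeated) images per request


async def fetch(session: aiohttp.ClientSession):
    # Repeat the small URI list until it reaches data_size images; since no
    # results are cached server-side, repeats still exercise the full pipeline.
    async with session.get(
        "http://localhost:8000/", json=input_uris * int(data_size / len(input_uris))
    ) as response:
        return await response.json()


async def main():
    # Assumes the Serve app from the PR is running on localhost:8000.
    async with aiohttp.ClientSession() as session:
        return await fetch(session)


if __name__ == "__main__":
    print(asyncio.run(main()))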

release/serve_tests/workloads/serve_resnet_benchmark.py (outdated)
release/serve_tests/workloads/serve_resnet_benchmark.py (outdated)

save_test_results(
    {test_name: result},
    default_output_file="/tmp/serve_resent_benchmark.json",
@architkulkarni (Contributor):

Do you know how the release test infra finds this file? It might have to be named /tmp/release_test_out.json or use the env var TEST_OUTPUT_JSON in order for the "fetch results" step to work. Do you mind double-checking this?

Contributor:

+1, please use TEST_OUTPUT_JSON and follow other files for the JSON schema.

Contributor:
So that it can be shown in our perf dashboard.
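
For context, a minimal sketch of the pattern the reviewers are asking for: honor the TEST_OUTPUT_JSON environment variable so the release-test fetch-results step can find the file. The result payload here is illustrative; the real schema should follow existing release tests.

import json
import os


def save_test_results(results: dict, default_output_file: str) -> None:
    # The release-test infra reads TEST_OUTPUT_JSON when set, so prefer it
    # over the hard-coded default path.
    path = os.environ.get("TEST_OUTPUT_JSON", default_output_file)
    with open(path, "w") as f:
        json.dump(results, f)


save_test_results(
    {"serve_resnet_benchmark": {"throughput_qps": 123.4}},
    default_output_file="/tmp/release_test_out.json",
)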

@simon-mo (Contributor) left a comment:

Should we send URIs, or should we send the images directly? What exactly is this benchmark built to test/evaluate?

release/serve_tests/workloads/serve_resnet_benchmark.py (outdated)
release/serve_tests/workloads/serve_resnet_benchmark.py (outdated)
release/serve_tests/compute_tpl_gpu_node.yaml (outdated)
)

async def _get_tensor_from_img(self, uri: str):
    return await asyncio.coroutine(self.utils.prepare_input_from_uri)(uri)
Contributor:

If prepare_input_from_uri is blocking, making it async won't help here.

@sihanwang41 (Contributor, author):

Makes sense.
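
To illustrate the reviewer's point: asyncio.coroutine only wraps the blocking call, so the event loop still stalls while it runs; offloading to a thread pool actually frees the loop. A sketch, assuming utils.prepare_input_from_uri is the blocking download/convert helper quoted above:

import asyncio


async def get_tensor_from_img(utils, uri: str):
    loop = asyncio.get_running_loop()
    # Run the blocking download + tensor conversion in the default thread
    # pool so the replica's event loop stays free to serve other requests.
    return await loop.run_in_executor(None, utils.prepare_input_from_uri, uri)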

release/serve_tests/workloads/serve_resnet_benchmark.py (outdated)
@sihanwang41 (Contributor, author) commented:

> Should we send URIs, or should we send the images directly? What exactly is this benchmark built to test/evaluate?

Test: CPU + GPU + ResNet50 performance.

I think it is more practical to download the images and convert them to tensors inside the deployment code instead of passing the images directly in the HTTP request.

@simon-mo (Contributor) commented Oct 6, 2022:

I see. If downloading is on the critical path, we should definitely put the images in S3.

@simon-mo (Contributor) left a comment:

Please see Archit's comment about TEST_OUTPUT_JSON.

@simon-mo (Contributor) commented Oct 7, 2022:

We can merge this after a successful release test demo run.

@simon-mo (Contributor) commented Oct 7, 2022:

(As a stretch goal, a smoke test would be preferred so that we can run it in CI as well: https://github.com/ray-project/ray/blob/master/release/BUILD#L5-L39)

@c21 added the Ray 2.1 label on Oct 7, 2022
@c21 (Contributor) commented Oct 8, 2022:

Can it be merged?

@simon-mo (Contributor) commented:

Lint failed:

release/serve_tests/workloads/serve_resnet_benchmark.py:85:65: F841 local variable 'fp' is assigned to but never used

Signed-off-by: Sihan Wang <[email protected]>
@richardliaw changed the title from "[Serve] restnet50 benchmarking" to "[Serve] resnet50 benchmarking" on Oct 14, 2022
@sihanwang41 added the release-blocker P0 (Issue that blocks the release) label on Oct 14, 2022
@simon-mo merged commit 08fbdfb into ray-project:master on Oct 17, 2022
sihanwang41 added a commit to sihanwang41/ray that referenced this pull request Oct 20, 2022
sihanwang41 added a commit to sihanwang41/ray that referenced this pull request Oct 20, 2022
rickyyx pushed a commit that referenced this pull request Oct 21, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
Labels: release-blocker P0 (Issue that blocks the release)
5 participants