DO-NOT-MERGE: discussion of common helm chart's probe/resources #431

Draft · wants to merge 2 commits into base: main

Conversation

@lianhao lianhao (Collaborator) commented Sep 13, 2024

Description

A summary of the proposed changes, as well as the relevant motivation and context.

Issues

List the issue or RFC link this PR is working on. If there is no such link, please mark it as n/a.

Type of change

List the type of change as one of the options below. Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds new functionality)
  • Breaking change (fix or feature that would break existing design and interface)

Dependencies

List any newly introduced third-party dependencies, if they exist.

Tests

Describe the tests that you ran to verify your changes.


- Common components' value files include different probe timing sections for CPU and for accelerators

- Their deployment templates select one based on .Values.accelDevice value (empty for CPU)
Collaborator Author

agreed

@eero-t eero-t (Contributor) commented Sep 13, 2024

I'll come up with a PR which selects better probe timings (and custom metrics) based on *.accelDevice values (hopefully early next week).
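
As a rough sketch of what that selection could look like (the key names, numbers and template fragment below are illustrative, not the actual PR contents):

# values.yaml (illustrative key names)
accelDevice: ""                # empty for CPU, e.g. "gaudi" when an accelerator is used
startupProbe:
  cpu:
    periodSeconds: 5
    failureThreshold: 120      # CPU startup/warmup can take far longer
  accel:
    periodSeconds: 5
    failureThreshold: 30

# deployment.yaml template fragment selecting one of the two timing sections
startupProbe:
  httpGet:
    path: /health              # endpoint is illustrative
    port: http
  {{- if .Values.accelDevice }}
  {{- toYaml .Values.startupProbe.accel | nindent 2 }}
  {{- else }}
  {{- toYaml .Values.startupProbe.cpu | nindent 2 }}
  {{- end }}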

Collaborator

Agree that we need to set different values for different devices; for CPU especially, some models' startup is really slow.
What's the problem if we don't introduce the extra *.accelDevice and just use the current variables .Values.livenessProbe/.Values.readinessProbe/.Values.startupProbe in each chart?

@eero-t eero-t (Contributor) commented Sep 16, 2024

What's the problem if we don't introduce the extra *.accelDevice and just use the current variables .Values.livenessProbe/.Values.readinessProbe/.Values.startupProbe in each chart?

Intent of the probes

The intent of startup and liveness probes is to restart deadlocked (or otherwise frozen) pods, on the assumption that a restart gets them working again. I.e. startup and liveness probes make sense if there are deadlock/freezing problems with the service, and a restart actually gets rid of the problem (eventually Kubernetes backs off from restarting the failing instance, which at least gets rid of its resource usage).

Readiness probes are intended to avoid routing traffic to services that are temporarily down.

See: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/

Issues with too short timings

Issues that I have seen with the current OPEA probe values, when multiple instances get scheduled on the same Xeon:

  • Startup: scaled-up instances never serve any queries; they get restarted because their startup does not finish in time. That means they just consume CPU on repeated restarts and slow down the already-working inference service instances instead of doing useful work
  • Liveness: stressed services that are working fine, just a bit slowly, get restarted, which (as with the startup probes) just makes the situation worse
  • Readiness: probe failures keep all instances mostly in a non-Ready state, i.e. the Service object does not route any queries to them; instead queries get buffered to a single overworked instance, which increases the latencies

The situation was noticeably improved by increasing the probe timings.

I haven't seen any deadlocks, so at least in my setup both startup & liveness probes are just harmful. Readiness probes can be useful though, if they're fine-tuned to balance traffic from overworked (non-Ready) instances to ones with free capacity (in Ready state).

Issues with too long timings

If a restart can fix the issue and the timings are too long, fixing the issue is unnecessarily delayed. But if no traffic is routed to a Pod due to its non-Ready state, that's just a potential performance decrease, not a functional issue.

However, if a pod stays in Ready state despite its liveness probe failing, it means more queries being routed to a non-working pod and being lost when it is eventually restarted => the liveness test should not be something that can fail while the readiness test succeeds, and as a subset of that requirement, liveness probe timeouts should not be shorter than readiness ones (they could be longer though).
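
To make that concrete, here is a hedged sketch of CPU-oriented probe values respecting that rule (the endpoint and all numbers are placeholders, not values proposed in this PR):

# CPU-oriented probe timing sketch; endpoint and numbers are placeholders
readinessProbe:
  httpGet:
    path: /health
    port: http
  periodSeconds: 10
  timeoutSeconds: 4
livenessProbe:
  httpGet:
    path: /health
    port: http
  periodSeconds: 30
  timeoutSeconds: 8          # not shorter than the readiness timeout
  failureThreshold: 10       # restart only after prolonged unresponsiveness
startupProbe:
  httpGet:
    path: /health
    port: http
  periodSeconds: 10
  failureThreshold: 180      # allows ~30 min for slow CPU model load/warmup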

@eero-t eero-t (Contributor) commented Sep 16, 2024

when multiple instances get scheduled on the same Xeon:

Partly this is a problem of the current OPEA Helm charts missing suitable per-model resource requests / affinity / scheduler topology / NRI policies. If each model + service arg + device combo specified its resource needs well enough, and instances were isolated well enough from each other when running on CPU, there would be less need to increase the default CPU probe timings so much.
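
For illustration, the kind of per-model resource spec meant here might look roughly like this (file name, key names, model id and all numbers are placeholders, not measured values):

# common/tgi/cpu/neural-chat-7b.yaml -- placeholder numbers, not measured values
tgi:
  LLM_MODEL_ID: Intel/neural-chat-7b-v3-3
  resources:
    requests:
      cpu: "16"
      memory: 48Gi
    limits:
      cpu: "32"
      memory: 64Gi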


- GMC device variant manifests are generated for all relevant components, not just TGI

(I don't think probe timings would need to be fine-tuned based on which model is used.)
Collaborator Author

Is this true? A large model may need more time to be ready, compared to a small model.

@eero-t eero-t (Contributor) commented Sep 13, 2024

While there naturally is a difference between them, I was thinking that for models fitting into a single Gaudi, the current default probe values could be enough.

If warmup/startup is extraordinarily slow with some larger model, probe timings can always be overridden in the <component>/<device>/<model>.yaml file, but that can come later, after the separate per-model resource usage files are there.

PS. The CPU side is more problematic, because there are so many additional variables affecting the perf (starting from underlying node HW differences, and the isolation and scheduling policies in effect), but there's not much that can be done. Having separate files for every different config would not be practical (too many files), but there can be some additional documentation on what kind of (manual) fine-tuning the user may need to do.
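
For reference, a per-model probe override as mentioned above could be just a few extra lines (hypothetical file name, key names and numbers):

# common/tgi/gaudi/<large-model>.yaml -- hypothetical override for a slow-to-warm model
tgi:
  LLM_MODEL_ID: <model-id>           # placeholder
  startupProbe:
    periodSeconds: 10
    failureThreshold: 120            # longer warmup budget than the default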

Collaborator

The startup time varies a lot for different CPU models combined with different AI models/sizes. And the tgi version matters too, per my limited observation.
However, for a dedicated AIExample we'll have a default model/tgi, so the default probe timing can be set accordingly, with a reminder in the docs about the possibility of tuning.

Contributor

The startup time varies a lot for different CPU models combined with different AI models/sizes. And the tgi version matters too, per my limited observation.

AFAIK there's an order-of-magnitude performance difference between accelerator devices (real ones, not e.g. iGPUs and small NPUs) and CPUs for inferencing.

@yongfengdu Are you saying that if you use the largest model (fitting into a single device) listed in the OPEA docs on Gaudi TGI, and the smallest model listed in the OPEA docs on Xeon TGI, the latter starts faster because there's such a large difference between those models?


- Their deployment templates select one based on .Values.accelDevice value (empty for CPU)

- All <device>-values.yaml files set appropriate <subchart>.accelDevice value (not just ChatQnA)
Collaborator Author

So we can use .accelDevice to control all the k8s resource spec variants, and the top-level chart only has to set that .accelDevice for each subchart?
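
E.g. something along these lines in a top-level <device>-values.yaml (a sketch; the subchart names follow the existing aliases and the values are illustrative):

# chatqna/gaudi-values.yaml -- sketch of per-subchart accelDevice settings
tgi:
  accelDevice: "gaudi"
teirerank:
  accelDevice: "gaudi"
tei:
  accelDevice: ""              # embedding kept on CPU in this example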

-f common/teirerank/gaudi/bge-reranker-base.yaml
-f common/tei/cpu/bge-base.yaml
-f common/data-prep/gpu/values.yaml
(These would provide values with subchart prefix/heading so they can be used from top-level charts)
Collaborator Author

From the top-level chart's perspective, the common tgi chart can be used twice, i.e. once for chat completion and once for guardrails, so providing a subchart prefix/heading in common/tgi/gaudi/neural-chat-7b.yaml may not be enough; we may need to ask the end user to use '--set-file subchart_prefix=common/tgi/gaudi/neural-chat-7b.yaml' in the top-level chart deployment.

One more question @yongfengdu: will this affect the repo chart if end users try to install the helm chart not from source code, but from the chart repo?

Contributor

There can just be two files that are identical except for the subchart name (separate files in case somebody wants to use different models for them):

  • common/tgi/gaudi/neural-chat-7b.yaml
  • common/tgi/gaudi/neural-chat-7b-guardrails.yaml

As they're otherwise identical, generating/updating the extra files should be automated:

# generate the "-guardrails" variants, differing only in the subchart (alias) name
for dev in cpu gaudi; do
  models=$(ls "$dev"/ | grep -v guardrails | sed 's/\.yaml$//')
  for model in $models; do
    sed 's/^tgi:/tgi-guardrails:/' "$dev/$model.yaml" > "$dev/$model-guardrails.yaml"
  done
done

Btw. Are there other components besides TGI that may be used under multiple names within the same chart?

@lianhao lianhao (Collaborator Author) commented Sep 13, 2024

Yes, we actually reuse the llm-uservice chart for both codegen/docsum, with different images set in the top-level chart. Also, we're thinking of reusing the llm-uservice chart for all the different variants of llm text-generation. But since those services are quite simple, we can make them follow the accelDevice way, so the common chart's value files contain all the variants' config and the top-level chart only has to specify the variant name.

@eero-t eero-t (Contributor) commented Sep 13, 2024

Ok, so only two components (tgi & llm-uservice) are currently used with multiple aliases.

GMC needs separate manifest files to be generated for each of these subservices, so that it can apply the correct resource requests when their models differ and/or are changed.

=> The script generating the GMC manifest files could compose their names based on the subchart name used within the Helm yaml override file, and what model is specified there. But for consistency it may be better if every file name (at least for these components) includes the subchart name used within it (not just some of the files). For example:

  • common/tgi/gaudi/tgi.neural-chat-7b.yaml
  • common/tgi/gaudi/tgi-guardrails.neural-chat-7b.yaml

Collaborator Author

Ok, so only two components (tgi & llm-uservice) are currently used with multiple aliases.

Those are just 2 examples. Multiple services will be like llm-uservice, but most of them will not depend on any specific model; they're intermediate services which talk to multiple different backends that are related to specific models.

Collaborator

Helm repo usage should be tested. As far as I know, helm package will package all the files in the directory, but into tar.gz format; I'm not sure if we can specify files inside a tarball.

Besides, the proposed values.yaml structure looks too complex to me (as a developer), not to mention to end users. I would prefer to define a dedicated scenario and include all OPT values in one single values.yaml file (like Gaudi-4node-neural-7b.yaml).
Keep the default values.yaml as simple as it can be, so users can start quickly with suboptimal settings.

BTW, besides the tgi/llm-uservice aliases, tei/teirerank are actually the same chart with different names; they could be merged with different aliases.

@eero-t eero-t (Contributor) commented Sep 16, 2024

Besides, the proposed values.yaml structure looks too complex to me (as a developer), not to mention to end users.

Currently the user needs to:

  1. Change models to the ones he prefers
  2. Run the current non-working OPEA Helm chart
  3. Profile resource usage and test suitable probe timings
  4. Update the resource usage + probe timing values in the Helm charts to get a working OPEA install

With this proposal, the OPEA project would do those steps for the user; he just needs to select the files matching the models he's interested in. How would that be harder for the user?

(The above process is not needed for models that fit into a single Gaudi, because a Gaudi cannot be shared and Gaudi drivers do not have significant CPU-side usage. However, OPEA Helm is supposed to support CPU installs too.)

I would prefer to define a dedicated scenario and include all OPT values in one single values.yaml file (like Gaudi-4node-neural-7b.yaml). Keep the default values.yaml as simple as it can be, so users can start quickly with suboptimal settings.

The top-level chart can naturally have some default <device>-values.yaml duplicating the device/model file content for its subcharts. However, IMHO those should be autogenerated from the component files, so that there's only a single place that needs to be updated whenever a given model version and its corresponding service args, resource values etc. are changed.

BTW, besides the tgi/llm-uservice aliases, tei/teirerank are actually the same chart with different names; they could be merged with different aliases.

I see that they use the same image, just with a different model, but why does the teirerank one not have a Gaudi values file?

@eero-t eero-t (Contributor) commented Sep 16, 2024

I would prefer to define a dedicated scenario and include all OPT values in one single values.yaml file (like Gaudi-4node-neural-7b.yaml).

@yongfengdu That won't work for GMC. Because GMC is used for changing the model, it will also need to have information about the other things that need to change with the model (args, resource requests, etc.).

Mismatching or missing pod resources mean (see the sketch after this list):

  • Too small resource limits: pods will be CPU throttled, and outright killed when they go over their memory limits
  • Too large resource requests: pods won't fit onto nodes, or nodes are badly utilized (good utilization is an important production cluster criterion)
  • No requests/limits: the pod gets the worst QoS class (other pods have precedence over it): https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/
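
A minimal sketch of the difference (placeholder numbers): setting requests equal to limits for all containers gives the pod Guaranteed QoS, requests below limits gives Burstable, and omitting both leaves it at BestEffort:

# Guaranteed QoS: requests == limits for both cpu and memory (placeholder numbers)
resources:
  requests:
    cpu: "8"
    memory: 16Gi
  limits:
    cpu: "8"
    memory: 16Gi
# omitting requests and limits entirely => BestEffort, the lowest QoS class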

Contributor

(The above process is not needed for models that fit into a single Gaudi, because a Gaudi cannot be shared and Gaudi drivers do not have significant CPU-side usage. However, OPEA Helm is supposed to support CPU installs too.)

Note that at least TEI can use much more CPU-side memory than what it uses on the device side: huggingface/text-embeddings-inference#280

In the next TEI release, it should be possible to affect this by limiting the number of CPUs available to TEI: huggingface/text-embeddings-inference#410
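
If that lands, the CPU-side usage could presumably be capped from the chart with a normal resources limit, e.g. (placeholder numbers; assumes a cgroup-aware TEI would size its thread pool from the CPU limit):

# common/tei/cpu/values.yaml fragment -- placeholder numbers
resources:
  requests:
    cpu: "4"
    memory: 8Gi
  limits:
    cpu: "8"               # assumption: a cgroup-aware TEI would derive its CPU count from this
    memory: 16Gi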

@eero-t (Contributor) commented Sep 13, 2024

Btw. Here are the parameters used by NIM Helm installations: https://github.com/NVIDIA/nim-deploy/blob/main/helm/nim-llm/README.md#parameters
