DO-NOT-MERGE: discussion of common helm chart's probe/resources #431

Draft · wants to merge 2 commits into base: main

Conversation

@lianhao lianhao (Collaborator) commented Sep 13, 2024

Description

A summary of the proposed changes, as well as the relevant motivation and context.

Issues

List the issue or RFC link this PR is working on. If there is no such link, please mark it as n/a.

Type of change

List the type of change as one of the options below. Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds new functionality)
  • Breaking change (fix or feature that would break existing design and interface)

Dependencies

List any newly introduced third-party dependencies, if they exist.

Tests

Describe the tests that you ran to verify your changes.


- Common components' value files include different probe timing sections for CPU and for accelerators

- Their deployment templates select one based on .Values.accelDevice value (empty for CPU)
Collaborator Author

agreed

@eero-t eero-t (Contributor) commented Sep 13, 2024

I'll come up with a PR which selects better probe timings (and custom metrics) based on *.accelDevice values (hopefully early next week).
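
As a rough sketch of what that selection could look like (the key names, numbers and template fragment below are illustrative, not the actual PR contents):

# values.yaml (illustrative key names)
accelDevice: ""                # empty for CPU, e.g. "gaudi" when an accelerator is used
startupProbe:
  cpu:
    periodSeconds: 5
    failureThreshold: 120      # CPU startup/warmup can take far longer
  accel:
    periodSeconds: 5
    failureThreshold: 30

# deployment.yaml template fragment selecting one of the two timing sections
startupProbe:
  httpGet:
    path: /health              # endpoint is illustrative
    port: http
  {{- if .Values.accelDevice }}
  {{- toYaml .Values.startupProbe.accel | nindent 2 }}
  {{- else }}
  {{- toYaml .Values.startupProbe.cpu | nindent 2 }}
  {{- end }}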

Collaborator

Agree that we need to set different values for different devices; for CPU especially, some models' startup is really slow.
What's the problem if we don't introduce the extra *.accelDevice and just use the current variables .Values.livenessProbe/.Values.readinessProbe/.Values.startupProbe in each chart?

@eero-t eero-t (Contributor) commented Sep 16, 2024

What's the problem if we don't introduce the extra *.accelDevice and just use the current variables .Values.livenessProbe/.Values.readinessProbe/.Values.startupProbe in each chart?

Intent of the probes

The intent of startup and liveness probes is to restart deadlocked (or otherwise frozen) pods, on the assumption that a restart gets them working again. I.e. startup and liveness probes make sense if there are deadlock/freezing problems with the service, and a restart actually gets rid of the problem (eventually Kubernetes backs off from restarting the failing instance, which at least gets rid of its resource usage).

Readiness probes are intended to avoid routing traffic to services that are temporarily down.

See: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/

Issues with too short timings

Issues that I have seen with the current OPEA probe values, when multiple instances get scheduled on the same Xeon:

  • Startup: scaled-up instances never serve any queries; they get restarted because their startup does not finish in time. That means they just consume CPU on repeated restarts and slow down the already-working inference service instances instead of doing useful work
  • Liveness: stressed services that are working fine, just a bit slowly, get restarted, which (as with the startup probes) just makes the situation worse
  • Readiness: probe failures keep all instances mostly in a non-Ready state, i.e. the Service object does not route any queries to them; instead queries get buffered to a single overworked instance, which increases the latencies

The situation was noticeably improved by increasing the probe timings.

I haven't seen any deadlocks, so at least in my setup both startup & liveness probes are just harmful. Readiness probes can be useful though, if they're fine-tuned to balance traffic from overworked (non-Ready) instances to ones with free capacity (in Ready state).

Issues with too long timings

If a restart can fix the issue and the timings are too long, fixing the issue is unnecessarily delayed. But if no traffic is routed to a Pod due to its non-Ready state, that's just a potential performance decrease, not a functional issue.

However, if a pod stays in Ready state despite its liveness probe failing, it means more queries being routed to a non-working pod and being lost when it is eventually restarted => the liveness test should not be something that can fail while the readiness test succeeds, and as a subset of that requirement, liveness probe timeouts should not be shorter than readiness ones (they could be longer though).
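
To make that concrete, here is a hedged sketch of CPU-oriented probe values respecting that rule (the endpoint and all numbers are placeholders, not values proposed in this PR):

# CPU-oriented probe timing sketch; endpoint and numbers are placeholders
readinessProbe:
  httpGet:
    path: /health
    port: http
  periodSeconds: 10
  timeoutSeconds: 4
livenessProbe:
  httpGet:
    path: /health
    port: http
  periodSeconds: 30
  timeoutSeconds: 8          # not shorter than the readiness timeout
  failureThreshold: 10       # restart only after prolonged unresponsiveness
startupProbe:
  httpGet:
    path: /health
    port: http
  periodSeconds: 10
  failureThreshold: 180      # allows ~30 min for slow CPU model load/warmup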

@eero-t eero-t (Contributor) commented Sep 16, 2024

when multiple instances get scheduled on the same Xeon:

Partly this is a problem of the current OPEA Helm charts missing suitable per-model resource requests / affinity / scheduler topology / NRI policies. If each model + service arg + device combo specified its resource needs well enough, and instances were isolated well enough from each other when running on CPU, there would be less need to increase the default CPU probe timings so much.
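
For illustration, the kind of per-model resource spec meant here might look roughly like this (file name, key names, model id and all numbers are placeholders, not measured values):

# common/tgi/cpu/neural-chat-7b.yaml -- placeholder numbers, not measured values
tgi:
  LLM_MODEL_ID: Intel/neural-chat-7b-v3-3
  resources:
    requests:
      cpu: "16"
      memory: 48Gi
    limits:
      cpu: "32"
      memory: 64Gi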


- GMC device variant manifests are generated for all relevant components, not just TGI

(I don't think probe timings would need to be fine-tuned based on which model is used.)
Collaborator Author

Is this true? A large model may need more time to be ready, compared to a small model.

@eero-t eero-t (Contributor) commented Sep 13, 2024

While there naturally is a difference between them, I was thinking that for models fitting into a single Gaudi, the current default probe values could be enough.

If warmup/startup is extraordinarily slow with some larger model, probe timings can always be overridden in the <component>/<device>/<model>.yaml file, but that can come later, after the separate per-model resource usage files are there.

PS. The CPU side is more problematic, because there are so many additional variables affecting the perf (starting from underlying node HW differences, and the isolation and scheduling policies in effect), but there's not much that can be done. Having separate files for every different config would not be practical (too many files), but there can be some additional documentation on what kind of (manual) fine-tuning the user may need to do.
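
For reference, a per-model probe override as mentioned above could be just a few extra lines (hypothetical file name, key names and numbers):

# common/tgi/gaudi/<large-model>.yaml -- hypothetical override for a slow-to-warm model
tgi:
  LLM_MODEL_ID: <model-id>           # placeholder
  startupProbe:
    periodSeconds: 10
    failureThreshold: 120            # longer warmup budget than the default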

Collaborator

The startup time varies a lot for different CPU models combined with different AI models/sizes. And the tgi version matters too, per my limited observation.
However, for a dedicated AIExample we'll have a default model/tgi, so the default probe timing can be set accordingly, with a reminder in the docs about the possibility of tuning.

Contributor

The startup time varies a lot for different CPU models combined with different AI models/sizes. And the tgi version matters too, per my limited observation.

AFAIK there's an order-of-magnitude performance difference between accelerator devices (real ones, not e.g. iGPUs and small NPUs) and CPUs for inferencing.

@yongfengdu Are you saying that if you use the largest model (fitting into a single device) listed in the OPEA docs on Gaudi TGI, and the smallest model listed in the OPEA docs on Xeon TGI, the latter starts faster because there's such a large difference between those models?


- Their deployment templates select one based on .Values.accelDevice value (empty for CPU)

- All <device>-values.yaml files set appropriate <subchart>.accelDevice value (not just ChatQnA)
Collaborator Author

So we can use .accelDevice to control all the k8s resource spec variants, and the top-level chart only has to set that .accelDevice for each subchart?
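
E.g. something along these lines in a top-level <device>-values.yaml (a sketch; the subchart names follow the existing aliases and the values are illustrative):

# chatqna/gaudi-values.yaml -- sketch of per-subchart accelDevice settings
tgi:
  accelDevice: "gaudi"
teirerank:
  accelDevice: "gaudi"
tei:
  accelDevice: ""              # embedding kept on CPU in this example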

-f common/teirerank/gaudi/bge-reranker-base.yaml
-f common/tei/cpu/bge-base.yaml
-f common/data-prep/gpu/values.yaml
(These would provide values with subchart prefix/heading so they can be used from top-level charts)
Collaborator Author

From the top-level chart's perspective, the common tgi chart can be used twice, i.e. once for chat completion and once for guardrails, so providing a subchart prefix/heading in common/tgi/gaudi/neural-chat-7b.yaml may not be enough; we may need to ask the end user to use '--set-file subchart_prefix=common/tgi/gaudi/neural-chat-7b.yaml' in the top-level chart deployment.

One more question @yongfengdu: will this affect the repo chart if end users try to install the helm chart not from source code, but from the chart repo?

Contributor

There can just be two files that are identical except for the subchart name (separate files in case somebody wants to use different models for them):

  • common/tgi/gaudi/neural-chat-7b.yaml
  • common/tgi/gaudi/neural-chat-7b-guardrails.yaml

As they're otherwise identical, generating/updating the extra files should be automated:

# generate the "-guardrails" variants, differing only in the subchart (alias) name
for dev in cpu gaudi; do
  models=$(ls "$dev"/ | grep -v guardrails | sed 's/\.yaml$//')
  for model in $models; do
    sed 's/^tgi:/tgi-guardrails:/' "$dev/$model.yaml" > "$dev/$model-guardrails.yaml"
  done
done

Btw. Are there other components besides TGI that may be used under multiple names within the same chart?

@lianhao lianhao (Collaborator Author) commented Sep 13, 2024

Yes, we actually reuse the llm-uservice chart for both codegen/docsum, with different images set in the top-level chart. Also, we're thinking of reusing the llm-uservice chart for all the different variants of llm text-generation. But since those services are quite simple, we can make them follow the accelDevice way, so the common chart's value files contain all the variants' config and the top-level chart only has to specify the variant name.

@eero-t eero-t (Contributor) commented Sep 13, 2024

Ok, so only two components (tgi & llm-uservice) are currently used with multiple aliases.

GMC needs separate manifest files to be generated for each of these subservices, so that it can apply the correct resource requests when their models differ and/or are changed.

=> The script generating the GMC manifest files could compose their names based on the subchart name used within the Helm yaml override file, and what model is specified there. But for consistency it may be better if every file name (at least for these components) includes the subchart name used within it (not just some of the files). For example:

  • common/tgi/gaudi/tgi.neural-chat-7b.yaml
  • common/tgi/gaudi/tgi-guardrails.neural-chat-7b.yaml

Collaborator Author

Ok, so only two components (tgi & llm-uservice) are currently used with multiple aliases.

Those are just 2 examples. Multiple services will be like llm-uservice, but most of them will not depend on any specific model; they're intermediate services which talk to multiple different backends that are related to specific models.

Collaborator

Helm repo usage should be tested. As far as I know, helm package will package all the files in the directory, but into tar.gz format; I'm not sure if we can specify files inside a tarball.

Besides, the proposed values.yaml structure looks too complex to me (as a developer), not to mention to end users. I would prefer to define a dedicated scenario and include all OPT values in one single values.yaml file (like Gaudi-4node-neural-7b.yaml).
Keep the default values.yaml as simple as it can be, so users can start quickly with suboptimal settings.

BTW, besides the tgi/llm-uservice aliases, tei/teirerank are actually the same chart with different names; they could be merged with different aliases.

@eero-t eero-t (Contributor) commented Sep 16, 2024

Besides, the proposed values.yaml structure looks too complex to me (as a developer), not to mention to end users.

Currently the user needs to:

  1. Change models to the ones he prefers
  2. Run the current non-working OPEA Helm chart
  3. Profile resource usage and test suitable probe timings
  4. Update the resource usage + probe timing values in the Helm charts to get a working OPEA install

With this proposal, the OPEA project would do those steps for the user; he just needs to select the files matching the models he's interested in. How would that be harder for the user?

(The above process is not needed for models that fit into a single Gaudi, because a Gaudi cannot be shared and Gaudi drivers do not have significant CPU-side usage. However, OPEA Helm is supposed to support CPU installs too.)

I would prefer to define a dedicated scenario and include all OPT values in one single values.yaml file (like Gaudi-4node-neural-7b.yaml). Keep the default values.yaml as simple as it can be, so users can start quickly with suboptimal settings.

The top-level chart can naturally have some default <device>-values.yaml duplicating the device/model file content for its subcharts. However, IMHO those should be autogenerated from the component files, so that there's only a single place that needs to be updated whenever a given model version and its corresponding service args, resource values etc. are changed.

BTW, besides the tgi/llm-uservice aliases, tei/teirerank are actually the same chart with different names; they could be merged with different aliases.

I see that they use the same image, just with a different model, but why does the teirerank one not have a Gaudi values file?

@eero-t eero-t (Contributor) commented Sep 16, 2024

I would prefer to define a dedicated scenario and include all OPT values in one single values.yaml file (like Gaudi-4node-neural-7b.yaml).

@yongfengdu That won't work for GMC. Because GMC is used for changing the model, it will also need to have information about the other things that need to change with the model (args, resource requests, etc.).

Mismatching or missing pod resources mean (see the sketch after this list):

  • Too small resource limits: pods will be CPU throttled, and outright killed when they go over their memory limits
  • Too large resource requests: pods won't fit onto nodes, or nodes are badly utilized (good utilization is an important production cluster criterion)
  • No requests/limits: the pod gets the worst QoS class (other pods have precedence over it): https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/
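
A minimal sketch of the difference (placeholder numbers): setting requests equal to limits for all containers gives the pod Guaranteed QoS, requests below limits gives Burstable, and omitting both leaves it at BestEffort:

# Guaranteed QoS: requests == limits for both cpu and memory (placeholder numbers)
resources:
  requests:
    cpu: "8"
    memory: 16Gi
  limits:
    cpu: "8"
    memory: 16Gi
# omitting requests and limits entirely => BestEffort, the lowest QoS class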

Contributor

(The above process is not needed for models that fit into a single Gaudi, because a Gaudi cannot be shared and Gaudi drivers do not have significant CPU-side usage. However, OPEA Helm is supposed to support CPU installs too.)

Note that at least TEI can use much more CPU-side memory than what it uses on the device side: huggingface/text-embeddings-inference#280

In the next TEI release, it should be possible to affect this by limiting the number of CPUs available to TEI: huggingface/text-embeddings-inference#410
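
If that lands, the CPU-side usage could presumably be capped from the chart with a normal resources limit, e.g. (placeholder numbers; assumes a cgroup-aware TEI would size its thread pool from the CPU limit):

# common/tei/cpu/values.yaml fragment -- placeholder numbers
resources:
  requests:
    cpu: "4"
    memory: 8Gi
  limits:
    cpu: "8"               # assumption: a cgroup-aware TEI would derive its CPU count from this
    memory: 16Gi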

@eero-t (Contributor) commented Sep 13, 2024

Btw. Here are the parameters used by NIM Helm installations: https://github.com/NVIDIA/nim-deploy/blob/main/helm/nim-llm/README.md#parameters
