DO-NOT-MERGE: discussion of common helm chart's probe/resources #431
Let's use this file for a discussion of how to set up the Helm probes and resources in the common Helm charts, e.g. tgi, tei, etc.
Below is the proposal from @eero-t:
[1] What's IMHO needed to fix things for probe timings:
- Common components' value files include different probe timing sections for CPU and for accelerators
- Their deployment templates select one based on the .Values.accelDevice value (empty for CPU); see the sketch further below
- All <device>-values.yaml files set the appropriate <subchart>.accelDevice value (not just ChatQnA)
So we can use .accelDevice to control all the k8s resource spec variants, and a top-level chart only has to set that .accelDevice?
- GMC device variant manifests are generated for all relevant components, not just TGI
(I don't think probe timings would need to be fine-tuned based on which model is used.)
Is this true? A large model may need more time to be ready, compared to a small model.

While there naturally is a difference between them, I was thinking that for models fitting into a single Gaudi, the current default probe values could be enough. If warmup/startup is extraordinarily slow with some larger model, probe timings can always be overridden in the PS. The CPU side is more problematic, because there are so many additional variables affecting the performance (starting from underlying node HW differences, and the isolation and scheduling policies in effect), but there's not much that can be done about that. Having separate files for every different config would not be practical (too many files), but there can be some additional documentation on what kind of (manual) fine-tuning the user may need to do.

The startup time varies a lot for different CPU models combined with different AI models/sizes. And the TGI version matters too, per my limited observation.
AFAIK there's an order-of-magnitude performance difference between accelerator devices (real ones, not e.g. iGPUs and small NPUs) and CPUs for inferencing. @yongfengdu Are you saying that if you use the largest model (fitting into a single device) listed in the OPEA docs on Gaudi TGI, and the smallest model listed in the OPEA docs on Xeon TGI, the latter starts faster, because there's such a large difference between those models?
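As a rough sketch of how proposal [1] could look in a common component chart: the layout below is invented for illustration (accelDevice comes from the proposal, but startupProbeTimings and the /health endpoint are assumptions, and the numbers are not profiled values).

```yaml
# values.yaml of a common component (hypothetical layout): separate probe
# timing sections for CPU and for accelerators.
accelDevice: ""            # empty => CPU; a <device>-values.yaml sets e.g. "gaudi"
startupProbeTimings:
  cpu:
    periodSeconds: 30
    failureThreshold: 120  # slow CPU warmup: allow up to ~1 hour
  accel:
    periodSeconds: 5
    failureThreshold: 120  # ~10 minutes is plenty on a real accelerator
```

The deployment template would then pick one of the two sections, so a <device>-values.yaml only needs to set <subchart>.accelDevice:

```yaml
# templates/deployment.yaml fragment (sketch)
{{- $timings := ternary .Values.startupProbeTimings.accel .Values.startupProbeTimings.cpu (ne .Values.accelDevice "") }}
startupProbe:
  httpGet:
    path: /health          # health endpoint differs per component; an assumption here
    port: http
  periodSeconds: {{ $timings.periodSeconds }}
  failureThreshold: {{ $timings.failureThreshold }}
```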
[2] What's IMHO needed to fix resource requests:
- Current sub-optimal component arguments are optimized, and the resulting resource requirements are profiled, for all relevant models
- For example on SPR, the CPU TGI data type can be set to bfloat16, which halves its memory usage
- Observability support + suitable Grafana dashboards will help with profiling
- Instead of the subcomponent model & corresponding resources being specified in the top-level chart, the helm install command uses a suitable model+resource+args file from the given component, like this:
  -f common/tgi/gaudi/neural-chat-7b.yaml
  -f common/teirerank/gaudi/bge-reranker-base.yaml
  -f common/tei/cpu/bge-base.yaml
  -f common/data-prep/gpu/values.yaml
(These would provide values with a subchart prefix/heading so they can be used from top-level charts.)
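A hypothetical sketch of what one of these files could contain, just to show the subchart prefix idea; field names such as LLM_MODEL_ID and extraCmdArgs are assumptions about the chart interface, and the numbers are illustrative rather than profiled values:

```yaml
# common/tgi/gaudi/neural-chat-7b.yaml (illustrative content)
tgi:                                    # subchart prefix, so a top-level chart can consume this as-is
  LLM_MODEL_ID: Intel/neural-chat-7b-v3-3
  accelDevice: "gaudi"
  resources:
    limits:
      habana.ai/gaudi: 1                # Habana device plugin resource name
    requests:
      cpu: 2
      memory: 16Gi
  extraCmdArgs: ["--max-input-length", "2048"]   # per-model args would live here too
```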
From the top-level chart's perspective, a common tgi chart can be used twice, i.e. once for chat completion and once for guardrails, so providing a subchart prefix/heading in common/tgi/gaudi/neural-chat-7b.yaml may not be enough; we may need to ask the end user to use '--set-file subchart_prefix=common/tgi/gaudi/neural-chat-7b.yaml' in the top-level chart deployment. One more question @yongfengdu: will this affect the repo chart if end users try to install the Helm chart not from source code, but from the chart repo?

There can just be two files that are identical except for the subchart name (separate files, in case somebody wants to use different models for them):
As they're otherwise identical, generating/updating the extra files should be automated:
Btw, are there other components than TGI that may be used under multiple names within the same chart?

Yes, we actually reuse the llm-uservice chart for both codegen/docsum, with different images set in the top-level chart. We're also thinking of reusing the llm-uservice chart for all the different llm text-generation variants. But since those services are quite simple, we can make them follow the accelDevice way, so that the common chart's value files contain all the variants' config and the top-level chart only has to specify the variant name.

Ok, so only two components (tgi & llm-uservice) are currently used with multiple aliases. GMC needs separate manifest files to be generated for each of these subservices, so that it can apply correct resource requests when their models differ and/or are changed. => The script generating the GMC manifest files could compose their names based on the subchart name used within the Helm yaml override file, and on what model is specified there. But for consistency, it may be better if every file name (at least for these components) includes the subchart name used within it (not just some of the files). For example:
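A purely hypothetical illustration (file names and model IDs invented here) of two override files that are identical except for the subchart alias, with the alias and model encoded in the file name:

```yaml
# tgi_neural-chat-7b_gaudi.yaml (hypothetical name and content)
tgi:
  LLM_MODEL_ID: Intel/neural-chat-7b-v3-3
  resources:
    limits:
      habana.ai/gaudi: 1
---
# tgi-guardrails_guard-model_gaudi.yaml (hypothetical): same structure,
# only the top-level alias (and possibly the model) differs
tgi-guardrails:
  LLM_MODEL_ID: some-org/some-guard-model    # placeholder model ID
  resources:
    limits:
      habana.ai/gaudi: 1
```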
That's just 2 examples. Multiple services will be llm-uservice alike, but most of them will not depend on any specific model; they're intermediate services which talk to multiple different backends, and it's those backends that are tied to specific models.

Helm repo usage should be tested; as far as I know, helm package packages all the files in the directory, but in tar.gz format, and I'm not sure if we can specify individual files inside a tarball. Besides, the proposed values.yaml structure looks too complex to me (as a developer), not to mention to end users. I would prefer to define dedicated scenarios and include all OPT values in one single values.yaml file (like Gaudi-4node-neural-7b.yaml). BTW, besides the tgi/llm-uservice aliases, tei/teirerank are actually the same chart with a different chart name; they could be merged with different aliases.
Currently the user needs to:
With this proposal, the OPEA project would do those steps for the user; he just needs to select the files matching the models he's interested in. How would that be harder for the user? (The above process is not needed for models that fit into a single Gaudi, because Gaudi cannot be shared, and Gaudi drivers do not have significant CPU-side usage. However, OPEA Helm is supposed to support CPU installs too.)
The top-level chart can naturally have some default
I see that they use the same image, just with a different model, but why
@yongfengdu That won't work for GMC. Because GMC is used for changing the model, it will also need information about the other things that need to be changed with the model (args, resource requests, etc.). Mismatching or missing pod resources mean:
Note that at least TEI can use much more CPU-side memory than what it uses on the device side: huggingface/text-embeddings-inference#280 In the next TEI release, it should be possible to affect this by limiting the number of CPUs available for TEI: huggingface/text-embeddings-inference#410
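As a sketch of the kind of override such limiting would amount to in a chart (the tei prefix and all numbers are illustrative, not profiled; whether a future TEI release actually sizes itself from these limits depends on the linked issue):

```yaml
# Hypothetical CPU-capping override for the TEI subchart
tei:
  resources:
    requests:
      cpu: 4
      memory: 4Gi
    limits:
      cpu: 4          # caps the CPUs available to TEI (see the linked TEI issue)
      memory: 8Gi     # backstop against the CPU-side memory growth noted above
```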
- There could also be a global option for ignoring (CPU-side) resource requests, which can be used when things need to be re-profiled
- GMC applies resource specs generated from these when the user changes the model
If there are combinations of the above which are common between different top-level charts, I would suggest a Makefile merging the relevant ones into common files (e.g. -f common/gaudi-model-defaults.yaml), to avoid duplicating things that may need to be updated whenever args, models, or image versions change.
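A hypothetical sketch of what such a merged common file could look like, simply concatenating the per-component defaults under their subchart prefixes (the model-ID variable names and models are illustrative assumptions):

```yaml
# common/gaudi-model-defaults.yaml (generated by the Makefile, not hand-edited)
tgi:
  LLM_MODEL_ID: Intel/neural-chat-7b-v3-3
  accelDevice: "gaudi"
  resources:
    limits:
      habana.ai/gaudi: 1
teirerank:
  RERANK_MODEL_ID: BAAI/bge-reranker-base
  accelDevice: "gaudi"
  resources:
    limits:
      habana.ai/gaudi: 1
tei:
  EMBEDDING_MODEL_ID: BAAI/bge-base-en-v1.5
  accelDevice: ""          # embedding kept on CPU in this example
```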
agreed
I'll come up with a PR which selects better probe timings (and custom metrics) based on *.accelDevice values (hopefully early next week).
Agree that we need to set different values for different devices; especially on CPU, some models' startup is really slow.
What's the problem if we don't introduce the extra *.accelDevice, and instead use the current variables .Values.livenessProbe/.Values.readinessProbe/.Values.startupProbe in each chart?
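For context, that alternative would presumably look something like the fragment below: each <device>-values.yaml overrides the existing per-chart probe variables directly (a sketch with illustrative numbers, assuming the charts pass these blocks straight into the pod spec). The trade-off being discussed is that every device values file then has to repeat full probe blocks per subchart, whereas a single accelDevice switch keeps the timing knowledge inside the common chart.

```yaml
# Hypothetical gaudi-values.yaml fragment overriding the existing probe variables
tgi:
  startupProbe:
    initialDelaySeconds: 5
    periodSeconds: 5
    failureThreshold: 120
  readinessProbe:
    periodSeconds: 5
    timeoutSeconds: 2
  livenessProbe:
    periodSeconds: 10
    timeoutSeconds: 4
```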
Intent of the probes
The intent of startup and liveness probes is to restart deadlocked (or otherwise frozen) pods, on the assumption that a restart gets them working again. I.e. startup and liveness probes make sense if there are deadlock/freezing problems with the service, and a restart actually gets rid of the problem (eventually Kubernetes backs off from restarting a failing instance, which at least gets rid of its resource usage).
Readiness probes are intended to avoid routing traffic to services that are temporarily down.
See: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
Issues with too short timings
Issues that I have seen with the current OPEA probe values, when multiple instances get scheduled on the same Xeon:
The situation was noticeably improved by increasing the probe timings.
I haven't seen any deadlocks, so at least in my setup both startup & liveness probes are just harmful. Readiness probes can be useful though, if they're fine-tuned to balance traffic from overworked (non-Ready) instances to ones with free capacity (in Ready state).
Issues with too long timings
If a restart can fix the issue and the timings are too long, fixing the issue is unnecessarily delayed. But if no traffic is routed to a Pod due to its non-Ready state, that's just a potential performance decrease, not a functional issue.
However, if a pod stays in the Ready state despite its liveness probe failing, more queries get routed to a non-working pod and are lost when it is eventually restarted. => The liveness test should not be something that can fail while the readiness test succeeds, and as a subset of that requirement, liveness probe timeouts should not be shorter than readiness ones (they could be longer, though).
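As a sketch of those constraints with illustrative-only numbers (slow-starting CPU case): the startup probe absorbs the long warmup, readiness reacts quickly, and liveness is strictly more tolerant than readiness.

```yaml
startupProbe:              # covers slow model load/warmup; other probes run only after this succeeds
  periodSeconds: 30
  failureThreshold: 120    # up to ~1 hour before the pod is considered failed
readinessProbe:            # quick to take an overloaded instance out of rotation
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 3
livenessProbe:             # more tolerant than readiness, so a merely busy (non-Ready) pod is not restarted
  periodSeconds: 10
  timeoutSeconds: 4
  failureThreshold: 6
```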
Partly this is a problem of the current OPEA Helm charts missing suitable per-model resource requests / affinity / scheduler topology / NRI policies. If each model + service args + device combo specified its resource needs well enough, and the instances were isolated enough from each other when running on CPU, there would be less need to increase the default probe timings for CPU so much.
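A sketch of the kind of per-model CPU sizing and spreading meant here (all numbers illustrative; whether the chart exposes topologySpreadConstraints is an assumption, and NRI policies are configured on the node side, outside the chart):

```yaml
tgi:
  resources:
    requests:
      cpu: 32
      memory: 64Gi
    limits:
      cpu: 32          # requests == limits with integer CPUs => Guaranteed QoS, so a
      memory: 64Gi     # static CPU manager policy can give the pod exclusive cores
  topologySpreadConstraints:   # spread heavy inference pods instead of stacking them on one Xeon
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app.kubernetes.io/name: tgi
```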