Add tokenizer config path argument #2110

RawthiL · 2023-12-07T17:46:25Z

The possibility of modifying two out of the tree config files was added in PR 1761, however the tokenizer configuration was left out (not sure why). I believe that we should enable the user to select a custom tokenizer_configs.yaml file by means of a new parameter:

    --tokenizer-configs-paths tokenizer_configs.yaml

In my particular case this is required to enable the separation of the tokenizer endpoint from the model endpoint. By means of the PR, the user will be able to do the following:

1- Set a tokenizer and model deployment config:

    --tokenizer-configs-paths tokenizer_configs.yaml
    --model-deployment-paths model_deployments.yaml

2- Assign different urls to each endpoint:
(model_deployments.yaml)

model_deployments:
  - name: neurips/local
    model_name: neurips/local
    tokenizer_name: neurips/local
    max_sequence_length: 2048
    client_spec:
      class_name: "helm.proxy.clients.http_model_client.HTTPModelClient"
      args: {
        base_url: "http://192.168.1.1:8080"
      }
    window_service_spec:
      class_name: "helm.benchmark.window_services.http_model_window_service.HTTPModelWindowService"
      args: {}

(tokenizer_configs.yaml)

tokenizer_configs:
  - name: neurips/local
    tokenizer_spec:
      class_name: "helm.proxy.tokenizers.http_model_tokenizer.HTTPModelTokenizer"
      args: {
        base_url: "http://192.168.1.2:8080"
      }
    end_of_text_token: "<|endoftext|>"
    prefix_token: "<|endoftext|>"

Note: The other way of changing the URL is through the HELM_HTTP_MODEL_BASE_URL environment variable, but this changes both the model and the tokenizer URL which not what I required.

yifanmai · 2023-12-07T19:28:34Z

Thanks for this suggestion! This PR looks good to me.

Note that we have another PR #1996 that removes the --*-paths flags - the user should then put their user-specific tokenizer_configs.yaml under ./prod_env/ (which can be configured using the --local-path argument) - would this work for your use case, or would you still prefer having the --tokenizer-configs-path flag?

RawthiL · 2023-12-07T19:54:47Z

Oh, I have not looked at PR #1996, it seems much more natural.

Also I found after submitting this PR that PR #1761 was working somehow by chance I think. The custom config passed to model-deployment-paths was being accepted because I suspect that at some point, before entering the main loop, the register_deployments_if_not_already_registered() function was being called:

helm-run \
    --conf-paths /home/helm/config/test_helm.conf \
    --suite v2 \
    --max-eval-instances 10 \
    --model-metadata-paths /home/helm/config/models_metadata.yaml \
    --tokenizer-configs-paths /home/helm/config/tokenizer_configs.yaml \
    --model-deployment-paths /home/helm/config/model_deployments.yaml

---> Reading model deployments from /home/helm/.local/lib/python3.8/site-packages/helm/config/model_deployments.yaml... # This is before entering the main loop!
main {
  Reading model deployments from /home/helm/config/model_deployments.yaml...
  Reading tokenizer configs from /home/helm/config/tokenizer_configs.yaml...
  Read 1 run entries from /home/helm/config/test_helm.conf

--->  Reading tokenizer configs from /home/helm/.local/lib/python3.8/site-packages/helm/config/tokenizer_configs.yaml... # This is after reading custom config, at register_helm_configurations()

  1 entries produced 1 run specs
  run_specs {
    RunSpec(....)
  } [0.0s]

Also the model-metadata-paths parameter was only being registered when register_model_deployment() was being called inside register_model_deployments_from_path.
All this meant that this PR is failing because the tokenizer is not set at registered at any point (TOKENIZERS_REGISTERED remains False) and then the configuration is overridden when register_helm_configurations() is called.

I was struggling with this and about to ask why this behavior is so strange, why are the flags TOKENIZERS_REGISTERED, DEPLOYMENTS_REGISTERED and METADATAS_REGISTERED not set to True after the config is read from an specific path (register_xxxxxxx_from_path functions)...
It would be great if I can get some clarification on why this works like that or if this is some non expected behavior.

Anyway, I will test PR #1996 , and see it it works for me.

JosselinSomervilleRoberts · 2023-12-08T01:24:03Z

@RawthiL that is indeed a weird behavior. Can you let me know if #1996 fixes your problem? Also, we are still thinking about the best way to do this with @yifanmai so feel free to leave any feedback!

RawthiL · 2023-12-08T12:27:25Z

Hi @JosselinSomervilleRoberts , I think that PR #1996 approach is more simple, less verbose. Sadly, I was not able to execute the helm-run command using its code, there was an error. I left a comment there.

Anyways, PR #1996 will still suffer from the behavior that I commented earlier. If someone does the following:

register_tokenizer_configs_from_path(some_path)  # to register a custom config on `some_path` (for any reason).
register_helm_configurations() # to load the rest of the configuration files

In this case the tokenizer config on some_path will be overridden by the one in the default path when register_helm_configurations is called. The last modifications in this PR solve this by setting the flags TOKENIZERS_REGISTERED, DEPLOYMENTS_REGISTERED and METADATAS_REGISTERED at the moment of loading, not inside the register_xxxxxx_if_not_already_registered functions.
If I'm not overlooking other mechanic, I think this feature is useful, regardless if the per-file paths are kept or not.

yifanmai · 2023-12-20T01:58:21Z

@RawthiL I just merged #2142 which also addresses this issue. Could you try it out and see if it works for you?

Otherwise, feel free to add the --tokenizer-configs-paths, --model-deployment-paths, and --model-metadata-paths flags in this pull request - you may have to copy some code from an old commit in main, because the other two flags are not in main currently. I'll be happy to help review and merge it.

yifanmai · 2023-12-20T02:02:24Z

For the record, you'll need to run from main to get the fixes i.e. pip install crfm-helm@git+https://github.com/stanford-crfm/helm.git@main

I'm planning on releasing v0.4.0 tomorrow, which will also have the fixes.

RawthiL · 2023-12-20T14:40:00Z

@yifanmai I have installed the current main and it seems to be working perfectly!

For anyone stepping into this, now you can pass the configuration path to the run command using the local-path :

helm-run \
    --conf-paths /home/helm/config/test_helm.conf \
    --suite test-2 \
    --local-path /home/helm/config

where /home/helm/config contains:

models_metadata.yaml
tokenizer_configs.yaml
model_deployments.yaml

Now (e3cf155) the behavior is as expected, the config is loaded only inside the main function and overridden by the one in the given path:

...
main {
  Reading tokenizer configs from /home/helm/.local/lib/python3.8/site-packages/helm/config/tokenizer_configs.yaml...
  Reading model deployments from /home/helm/.local/lib/python3.8/site-packages/helm/config/model_deployments.yaml...
  Reading tokenizer configs from /home/helm/config/tokenizer_configs.yaml...
  Reading model deployments from /home/helm/config/model_deployments.yaml...
  WARNING: Could not find model metadata for model custom-test/local of model deployment custom-test/local
  WARNING: Could not find model metadata for model custom-test-other/local of model deployment custom-test-other/local
  Read 2 run entries from /home/helm/config/test_helm.conf
...

I will close this as the current method is better.
Thanks for the updates!

added tokenizer config path argument

afd3a93

RawthiL changed the title ~~added tokenizer config path argument~~ Add tokenizer config path argument Dec 7, 2023

RawthiL marked this pull request as draft December 7, 2023 17:48

RawthiL marked this pull request as ready for review December 7, 2023 18:00

RawthiL marked this pull request as draft December 7, 2023 18:26

yifanmai mentioned this pull request Dec 7, 2023

Add private configs #1996

Merged

setting register flag at the time of reading file

dbee57e

RawthiL closed this Dec 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add tokenizer config path argument #2110

Add tokenizer config path argument #2110

RawthiL commented Dec 7, 2023 •

edited

Loading

yifanmai commented Dec 7, 2023

RawthiL commented Dec 7, 2023

JosselinSomervilleRoberts commented Dec 8, 2023

RawthiL commented Dec 8, 2023

yifanmai commented Dec 20, 2023

yifanmai commented Dec 20, 2023

RawthiL commented Dec 20, 2023

Add tokenizer config path argument #2110

Add tokenizer config path argument #2110

Conversation

RawthiL commented Dec 7, 2023 • edited Loading

yifanmai commented Dec 7, 2023

RawthiL commented Dec 7, 2023

JosselinSomervilleRoberts commented Dec 8, 2023

RawthiL commented Dec 8, 2023

yifanmai commented Dec 20, 2023

yifanmai commented Dec 20, 2023

RawthiL commented Dec 20, 2023

RawthiL commented Dec 7, 2023 •

edited

Loading