Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tokenizer config path argument #2110

Closed

Conversation

RawthiL
Copy link

@RawthiL RawthiL commented Dec 7, 2023

The possibility of modifying two out of the tree config files was added in PR 1761, however the tokenizer configuration was left out (not sure why). I believe that we should enable the user to select a custom tokenizer_configs.yaml file by means of a new parameter:

    --tokenizer-configs-paths tokenizer_configs.yaml

In my particular case this is required to enable the separation of the tokenizer endpoint from the model endpoint. By means of the PR, the user will be able to do the following:

1- Set a tokenizer and model deployment config:

    --tokenizer-configs-paths tokenizer_configs.yaml
    --model-deployment-paths model_deployments.yaml

2- Assign different urls to each endpoint:
(model_deployments.yaml)

model_deployments:
  - name: neurips/local
    model_name: neurips/local
    tokenizer_name: neurips/local
    max_sequence_length: 2048
    client_spec:
      class_name: "helm.proxy.clients.http_model_client.HTTPModelClient"
      args: {
        base_url: "http://192.168.1.1:8080"
      }
    window_service_spec:
      class_name: "helm.benchmark.window_services.http_model_window_service.HTTPModelWindowService"
      args: {}

(tokenizer_configs.yaml)

tokenizer_configs:
  - name: neurips/local
    tokenizer_spec:
      class_name: "helm.proxy.tokenizers.http_model_tokenizer.HTTPModelTokenizer"
      args: {
        base_url: "http://192.168.1.2:8080"
      }
    end_of_text_token: "<|endoftext|>"
    prefix_token: "<|endoftext|>"

Note: The other way of changing the URL is through the HELM_HTTP_MODEL_BASE_URL environment variable, but this changes both the model and the tokenizer URL which not what I required.

@RawthiL RawthiL changed the title added tokenizer config path argument Add tokenizer config path argument Dec 7, 2023
@RawthiL RawthiL marked this pull request as draft December 7, 2023 17:48
@RawthiL RawthiL marked this pull request as ready for review December 7, 2023 18:00
@RawthiL RawthiL marked this pull request as draft December 7, 2023 18:26
@yifanmai
Copy link
Collaborator

yifanmai commented Dec 7, 2023

Thanks for this suggestion! This PR looks good to me.

Note that we have another PR #1996 that removes the --*-paths flags - the user should then put their user-specific tokenizer_configs.yaml under ./prod_env/ (which can be configured using the --local-path argument) - would this work for your use case, or would you still prefer having the --tokenizer-configs-path flag?

@yifanmai yifanmai mentioned this pull request Dec 7, 2023
@RawthiL
Copy link
Author

RawthiL commented Dec 7, 2023

Oh, I have not looked at PR #1996, it seems much more natural.

Also I found after submitting this PR that PR #1761 was working somehow by chance I think. The custom config passed to model-deployment-paths was being accepted because I suspect that at some point, before entering the main loop, the register_deployments_if_not_already_registered() function was being called:

helm-run \
    --conf-paths /home/helm/config/test_helm.conf \
    --suite v2 \
    --max-eval-instances 10 \
    --model-metadata-paths /home/helm/config/models_metadata.yaml \
    --tokenizer-configs-paths /home/helm/config/tokenizer_configs.yaml \
    --model-deployment-paths /home/helm/config/model_deployments.yaml

---> Reading model deployments from /home/helm/.local/lib/python3.8/site-packages/helm/config/model_deployments.yaml... # This is before entering the main loop!
main {
  Reading model deployments from /home/helm/config/model_deployments.yaml...
  Reading tokenizer configs from /home/helm/config/tokenizer_configs.yaml...
  Read 1 run entries from /home/helm/config/test_helm.conf

--->  Reading tokenizer configs from /home/helm/.local/lib/python3.8/site-packages/helm/config/tokenizer_configs.yaml... # This is after reading custom config, at register_helm_configurations()

  1 entries produced 1 run specs
  run_specs {
    RunSpec(....)
  } [0.0s]

Also the model-metadata-paths parameter was only being registered when register_model_deployment() was being called inside register_model_deployments_from_path.
All this meant that this PR is failing because the tokenizer is not set at registered at any point (TOKENIZERS_REGISTERED remains False) and then the configuration is overridden when register_helm_configurations() is called.

I was struggling with this and about to ask why this behavior is so strange, why are the flags TOKENIZERS_REGISTERED, DEPLOYMENTS_REGISTERED and METADATAS_REGISTERED not set to True after the config is read from an specific path (register_xxxxxxx_from_path functions)...
It would be great if I can get some clarification on why this works like that or if this is some non expected behavior.

Anyway, I will test PR #1996 , and see it it works for me.

@JosselinSomervilleRoberts
Copy link
Contributor

@RawthiL that is indeed a weird behavior. Can you let me know if #1996 fixes your problem? Also, we are still thinking about the best way to do this with @yifanmai so feel free to leave any feedback!

@RawthiL
Copy link
Author

RawthiL commented Dec 8, 2023

Hi @JosselinSomervilleRoberts , I think that PR #1996 approach is more simple, less verbose. Sadly, I was not able to execute the helm-run command using its code, there was an error. I left a comment there.

Anyways, PR #1996 will still suffer from the behavior that I commented earlier. If someone does the following:

register_tokenizer_configs_from_path(some_path)  # to register a custom config on `some_path` (for any reason).
register_helm_configurations() # to load the rest of the configuration files

In this case the tokenizer config on some_path will be overridden by the one in the default path when register_helm_configurations is called. The last modifications in this PR solve this by setting the flags TOKENIZERS_REGISTERED, DEPLOYMENTS_REGISTERED and METADATAS_REGISTERED at the moment of loading, not inside the register_xxxxxx_if_not_already_registered functions.
If I'm not overlooking other mechanic, I think this feature is useful, regardless if the per-file paths are kept or not.

@yifanmai
Copy link
Collaborator

@RawthiL I just merged #2142 which also addresses this issue. Could you try it out and see if it works for you?

Otherwise, feel free to add the --tokenizer-configs-paths, --model-deployment-paths, and --model-metadata-paths flags in this pull request - you may have to copy some code from an old commit in main, because the other two flags are not in main currently. I'll be happy to help review and merge it.

@yifanmai
Copy link
Collaborator

For the record, you'll need to run from main to get the fixes i.e. pip install crfm-helm@git+https://github.com/stanford-crfm/helm.git@main

I'm planning on releasing v0.4.0 tomorrow, which will also have the fixes.

@RawthiL
Copy link
Author

RawthiL commented Dec 20, 2023

@yifanmai I have installed the current main and it seems to be working perfectly!

For anyone stepping into this, now you can pass the configuration path to the run command using the local-path :

helm-run \
    --conf-paths /home/helm/config/test_helm.conf \
    --suite test-2 \
    --local-path /home/helm/config

where /home/helm/config contains:

  • models_metadata.yaml
  • tokenizer_configs.yaml
  • model_deployments.yaml

Now (e3cf155) the behavior is as expected, the config is loaded only inside the main function and overridden by the one in the given path:

...
main {
  Reading tokenizer configs from /home/helm/.local/lib/python3.8/site-packages/helm/config/tokenizer_configs.yaml...
  Reading model deployments from /home/helm/.local/lib/python3.8/site-packages/helm/config/model_deployments.yaml...
  Reading tokenizer configs from /home/helm/config/tokenizer_configs.yaml...
  Reading model deployments from /home/helm/config/model_deployments.yaml...
  WARNING: Could not find model metadata for model custom-test/local of model deployment custom-test/local
  WARNING: Could not find model metadata for model custom-test-other/local of model deployment custom-test-other/local
  Read 2 run entries from /home/helm/config/test_helm.conf
...

I will close this as the current method is better.
Thanks for the updates!

@RawthiL RawthiL closed this Dec 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants