[Bugfix]: serialize config instances by value when using --trust-remote-code #6751
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI, as it is required to merge (or just use auto-merge).
Great observation! In general we should not expect any new code when we use multi-node serving; HF's dynamically downloaded code and modules are a pain in this case. Converting the object to a dict makes sense to me. Does it have any side effects? Or could we just do it in all cases?
Yeah, this is something that we'll need to look at further. If the custom config class only adds new attributes and default values (e.g. no new methods used by the modeling code), then this conversion to a dict should work.
Will this be merged in the next release? It'd be great to have Ray engine support for Phi-3 and other models that use remote code.
Signed-off-by: Travis Johnson <[email protected]>
In #6607 (comment), a similar failure is reported. I'm also still working out how to test that the conversion works correctly.
Signed-off-by: Travis Johnson <[email protected]>
In my testing, I found that most attributes of the custom config could be attached to the PretrainedConfig, but some configurations are expected to be class attributes, and those would not be preserved. I did some more investigation into the serialization and found a much better solution: registering the dynamic module with cloudpickle so that config instances are serialized by value. This means that the custom config class no longer needs to be importable on the workers.
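The class-attribute caveat can be sketched in plain Python (`CustomConfig`, `rope_impl`, and `hidden_size` are hypothetical names, not taken from the PR):

```python
# Hypothetical sketch: copying a custom config's instance attributes
# into a plain dict drops anything defined at class level.
class CustomConfig:
    rope_impl = "su"            # class attribute: lives on the class

    def __init__(self):
        self.hidden_size = 128  # instance attribute: lives in __dict__


cfg = CustomConfig()
as_dict = dict(vars(cfg))       # only instance attributes are copied

print("hidden_size" in as_dict)  # True
print("rope_impl" in as_dict)    # False: class attributes are lost
```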
@youkaichao @rkooo567 This PR is now ready for review. Please take a look :)
Do you happen to know how the by-value serialization works?
Not in any detail, but I assume that it somehow serializes the class definition along with the instance data in what it communicates between the workers.
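That intuition — cloudpickle shipping the class definition alongside the instance data — can be checked directly. A minimal sketch, assuming `cloudpickle` is installed; `DynamicConfig` is a hypothetical stand-in for a runtime-generated HF config class:

```python
import pickle

import cloudpickle


def make_dynamic_class():
    # Stands in for a class generated at runtime (like the config
    # classes Transformers builds under transformers_modules): it is
    # not importable by name in another process.
    class DynamicConfig:
        def __init__(self, hidden_size=64):
            self.hidden_size = hidden_size

    return DynamicConfig


cfg = make_dynamic_class()(hidden_size=128)

# Standard pickle serializes classes by reference (module + qualified
# name), so an instance of a local class cannot be pickled at all.
by_ref_failed = False
try:
    pickle.dumps(cfg)
except Exception:
    by_ref_failed = True

# cloudpickle embeds the class definition in the payload, so the
# receiver reconstructs it without importing anything.
restored = pickle.loads(cloudpickle.dumps(cfg))
print(by_ref_failed, restored.hidden_size)  # True 128
```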
```python
# See: https://github.com/cloudpipe/cloudpickle?tab=readme-ov-file#overriding-pickles-serialization-mechanism-for-importable-constructs
try:
    import transformers_modules
    ray.cloudpickle.register_pickle_by_value(transformers_modules)
except ImportError:
    # No dynamic module was generated (the model does not use remote code).
    pass
```
Can use `import cloudpickle` rather than `ray.cloudpickle`.
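For a module that is importable at registration time, the by-value behavior has to be opted into explicitly. A minimal sketch of what the registration does, assuming `cloudpickle` is installed; the `fake_remote_module` file is a hypothetical stand-in for the generated `transformers_modules` package:

```python
import os
import pickle
import sys
import tempfile

import cloudpickle

# Write a throwaway module to disk to stand in for the dynamically
# generated transformers_modules package.
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "fake_remote_module.py"), "w") as f:
    f.write("class RemoteConfig:\n"
            "    def __init__(self, n):\n"
            "        self.n = n\n")
sys.path.insert(0, tmp)
import fake_remote_module

# Opt the whole module into by-value pickling: class definitions now
# travel inside the payload instead of being looked up by name.
cloudpickle.register_pickle_by_value(fake_remote_module)
payload = cloudpickle.dumps(fake_remote_module.RemoteConfig(3))

# Simulate a worker node that cannot import the module.
sys.path.remove(tmp)
del sys.modules["fake_remote_module"]

obj = pickle.loads(payload)
print(obj.n)  # 3
```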
It is not currently possible to run vLLM with a model that requires `--trust-remote-code` if the server spans multiple nodes. The server will crash when it attempts to communicate the dynamically generated configuration class to a remote Ray worker; the crux of the error is a failure to import the dynamic `transformers_modules` module that Transformers generates when it loads the remote-code configuration for the model. In a multi-node context, this module is generated on the head node when `transformers.AutoConfig.from_pretrained()` is called, but it isn't generated on the other nodes.

This is a very similar issue to what was resolved in #871 and #4285, but now in a multi-node context. As noted in #871, the import failure occurs when Ray attempts to communicate the `ModelConfig` object, whose `hf_config` and `hf_text_config` reference the dynamically imported config class from `transformers_modules`, to the worker node. The fix in #871 became the util function `init_cached_hf_modules`, which runs `transformers.dynamic_module_utils` on each worker during the initialization of the `WorkerWrapperBase`. This generates the dynamic module base in `~/.cache/huggingface/modules` (which does need to happen once on each node) and also modifies the module search path to include the cache directory (which needs to happen in every worker), but it does not generate `transformers_modules`. Use of `init_cached_hf_modules` fixed the single-node case thanks to the modification of the module import path, but it does not fix the multi-node case.

A workaround would be to run vLLM or `transformers.AutoConfig.from_pretrained` on each node manually to generate the modules (or to get the generated module files onto each node some other way).

The implementation proposed in this PR is to use a feature of the `cloudpickle` library that allows the config objects to be serialized by value instead of by reference, so that the custom config class does not need to be importable in the remote workers. See https://github.com/cloudpipe/cloudpickle?tab=readme-ov-file#overriding-pickles-serialization-mechanism-for-importable-constructs. Doing this also obviates the need for `init_cached_hf_modules()`.

A similar error is reported in #6607, even without multiple GPUs. In that case, the failure occurs when using `--trust-remote-code` with `--engine-use-ray`. The fix proposed here resolves that issue as well.

Alternatives considered for multi-node (these do not fix the `--engine-use-ray` case):

- Call `AutoConfig.from_pretrained()` on each worker to use `transformers` to generate the dynamic module

FIX #3593
FIX #4169
FIX #6263
FIX #6607
FIX #8553
Also fixes the issue raised in #4986 (comment)