
[Bugfix]: serialize config instances by value when using --trust-remote-code #6751

Open · tjohnson31415 wants to merge 5 commits into main from fix-distributed-trust-remote-code

Conversation

@tjohnson31415 (Contributor) commented Jul 24, 2024

It is not currently possible to run vLLM with a model that requires --trust-remote-code if the server spans multiple nodes. The server will crash with an error when it attempts to communicate the dynamically generated configuration class to a remote Ray worker. The crux of the error is:

ray.exceptions.RaySystemError: System error: No module named 'transformers_modules'

This error arises due to the dynamic transformers_modules module generated by Transformers when it loads the remote code configuration for the model. In a multi-node context, this module is generated on the head node when transformers.AutoConfig.from_pretrained() is called, but it isn't generated on the other nodes.

This is very similar to the issue resolved in #871 and #4285, but now in a multi-node context. As noted in #871, the import failure occurs when Ray attempts to communicate the ModelConfig object, whose hf_config and hf_text_config reference the dynamically imported config class from transformers_modules, to the worker node. The fix in #871 became the util function init_cached_hf_modules, which calls init_hf_modules() from transformers.dynamic_module_utils on each worker during initialization of the WorkerWrapperBase. This generates the dynamic module base in ~/.cache/huggingface/modules (which does need to happen once on each node) and also adds the cache directory to the module search path (which needs to happen in every worker), but it does not generate transformers_modules itself. Use of init_cached_hf_modules fixed the single-node case thanks to the modification of the module import path, but it does not fix the multi-node case.
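For reference, a paraphrased sketch of that helper as described above (see vllm/utils.py for the actual code):

```python
# Paraphrased sketch of vLLM's init_cached_hf_modules helper.
from transformers.dynamic_module_utils import init_hf_modules

def init_cached_hf_modules() -> None:
    # Creates the module base under ~/.cache/huggingface/modules and adds
    # it to sys.path, but does not generate transformers_modules itself.
    init_hf_modules()
```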

A workaround is to manually run vLLM or transformers.AutoConfig.from_pretrained on each node to generate the modules (or to copy the generated module files onto each node some other way); a minimal sketch follows.
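For illustration, running this once per node lets transformers generate the transformers_modules files under ~/.cache/huggingface/modules ("my-org/my-model" is a placeholder for a model that ships remote code):

```python
# Pre-generate the dynamic transformers_modules package on this node.
from transformers import AutoConfig

AutoConfig.from_pretrained("my-org/my-model", trust_remote_code=True)
```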


The implementation proposed in this PR is to utilize a feature in the cloudpickle library that allows the config objects to be serialized by-value instead of by-reference so that the custom config class does not need to be importable in the remote workers.
See https://github.com/cloudpipe/cloudpickle?tab=readme-ov-file#overriding-pickles-serialization-mechanism-for-importable-constructs

Doing this also obviates the need for init_cached_hf_modules().
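As a rough standalone sketch of the approach (assuming a remote-code model has already been loaded, so that transformers_modules exists and is importable):

```python
import cloudpickle
import transformers_modules  # generated by transformers when remote code is loaded

# Classes from this module are now pickled by value: their definitions are
# embedded in the payload instead of being referenced by import path, so the
# receiving process does not need to be able to import the module.
cloudpickle.register_pickle_by_value(transformers_modules)
```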

A similar error is reported in #6607, even without multiple GPUs. In that case, the failure occurs when using --trust-remote-code with --engine-use-ray. The fix proposed here resolves this issue as well.

Alternatives considered for multi-node (these do not fix the --engine-use-ray case):

  • serialize configs as instances of classes that aren't dynamically imported, e.g. PretrainedConfig or Dict
  • figure out how to copy files from the head node to all other nodes via Ray
  • add code to run AutoConfig.from_pretrained() on each worker to use transformers to generate the dynamic module

FIX #3593
FIX #4169
FIX #6263
FIX #6607
FIX #8553
Also fixes the issue raised in #4986 (comment)


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs will not trigger a full CI run by default. Instead, only the fastcheck CI will run, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI, since it is required for merging (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@youkaichao (Member)

Great observation! In general, we should not expect any new code on the workers when using multi-node serving; HF's dynamically downloaded code and modules are a pain in this case. Converting the object to a dict makes sense to me. Does it have any side effects? Or could we just do it in all cases?

@tjohnson31415 (Contributor, Author) commented Jul 25, 2024

Converting the object to a dict makes sense to me. Does it have any side effects? Or could we just do it in all cases?

Yeah, this is something that we'll need to look at further. If the custom config class only adds new attributes and default values (e.g., no new methods used by the modeling code), then this conversion to a PretrainedConfig while preserving extra attributes should be fine, under the assumption that downstream code mostly treats the config objects as plain-old-data classes with read-only attributes. I think that is the standard case, but I haven't done much investigation into edge cases.
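For concreteness, a hypothetical sketch of that conversion (not the code in this PR): rebuild the custom config as a plain PretrainedConfig, carrying over the instance attributes.

```python
from transformers import PretrainedConfig

def to_plain_config(custom_config: PretrainedConfig) -> PretrainedConfig:
    # to_dict() captures the extra instance attributes set by the custom
    # class; PretrainedConfig.__init__ re-attaches unknown keys via setattr.
    return PretrainedConfig(**custom_config.to_dict())
```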

@justinthelaw
Will this be merged in the next release? It'd be great to have engine Ray support for Phi3 and others with remote code.

@tjohnson31415 force-pushed the fix-distributed-trust-remote-code branch from e25a0f0 to 602c0d3 on August 1, 2024
@tjohnson31415 (Contributor, Author) commented Aug 1, 2024

In #6607 (comment), the No module named 'transformers_modules' error is reported to also occur when using --engine-use-ray with --trust-remote-code. In that case, the engine worker does not have its Python path updated to load the dynamic modules. The approach in this PR would resolve that issue as well, provided the conversion to PretrainedConfig happens earlier. I've updated the PR to include the fix for this.

I'm also still working out how to test that the conversion to PretrainedConfig preserves all relevant attributes.

@tjohnson31415 tjohnson31415 changed the title [Bugfix]: use PretrainedConfig to communicate config objects with trust remote code [Bugfix]: serialize config instances by value when using --trust-remote-code Aug 1, 2024
@tjohnson31415 (Contributor, Author) commented Aug 1, 2024

In my testing, I found that most attributes of the custom config could be attached to the PretrainedConfig, but some configurations are expected to be class attributes, and those would not be preserved (e.g., attribute_map, keys_to_ignore_at_inference).
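A small hypothetical illustration of the limitation: attribute_map lives on the class, so instance-level copying drops it.

```python
from transformers import PretrainedConfig

class CustomConfig(PretrainedConfig):
    # Class attribute mapping an alias to the real attribute name.
    attribute_map = {"hidden_dim": "hidden_size"}

cfg = CustomConfig(hidden_size=128)
plain = PretrainedConfig(**cfg.to_dict())
print(cfg.hidden_dim)                 # 128, resolved through attribute_map
print(hasattr(plain, "hidden_dim"))   # False: the class-level mapping is lost
```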

I did some more investigation into the serialization and found a much better solution: the cloudpickle serialization library used by Ray supports serializing instances of dynamic classes by-value:
https://github.com/cloudpipe/cloudpickle?tab=readme-ov-file#overriding-pickles-serialization-mechanism-for-importable-constructs

This means that classes from transformers_modules no longer need to be importable on remote workers. Using this feature simply requires registering the dynamic modules for by-value serialization instead of the default by-reference serialization. I'll note that cloudpickle considers the feature experimental, but it seems much better than the other options I've tried.

@tjohnson31415 tjohnson31415 marked this pull request as ready for review August 1, 2024 23:13
@tjohnson31415 (Contributor, Author)
@youkaichao @rkooo567 This PR is now ready for review. Please take a look :)

@youkaichao (Member)

Do you happen to know how the serialization by value works?

@tjohnson31415 (Contributor, Author)

Do you happen to know how the serialization by value works?

Not in any detail, but I assume it serializes the class definition along with the instance data in the payload communicated between workers.
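That intuition matches cloudpickle's documented behavior. A self-contained sketch (module and class names here are made up) that mimics a dynamically generated module:

```python
import sys
import types

import cloudpickle

# Stand-in for transformers_modules: a module created at runtime that a
# remote process could not import on its own.
mod = types.ModuleType("fake_dynamic_module")
exec(
    "class MyConfig:\n"
    "    def __init__(self, hidden_size):\n"
    "        self.hidden_size = hidden_size\n",
    mod.__dict__,
)
sys.modules["fake_dynamic_module"] = mod  # register_pickle_by_value checks this

cloudpickle.register_pickle_by_value(mod)
payload = cloudpickle.dumps(mod.MyConfig(128))
# The bytes embed MyConfig's definition, so cloudpickle.loads(payload) works
# even in a fresh interpreter that has never imported fake_dynamic_module.
print(cloudpickle.loads(payload).hidden_size)  # 128
```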

# See: https://github.com/cloudpipe/cloudpickle?tab=readme-ov-file#overriding-pickles-serialization-mechanism-for-importable-constructs
try:
    import transformers_modules
    ray.cloudpickle.register_pickle_by_value(transformers_modules)
except ImportError:
    pass  # excerpt truncated in the review; except clause assumed here
A reviewer (Member) commented on the diff above:

Can use import cloudpickle rather than ray.cloudpickle
