Support huggingface snapshot locations in pyfunc log_model artifacts parameter #9362
Conversation
Can we add a pyfunc local serving test to one (not both) of the added tests to validate that loading the pyfunc model with "snapshot" location works for the serving container? (a manual test for Model Serving with this save mode is worthwhile to validate as well :) )
Thanks Serena for the PR!
The code looks good to me, dropped some minor comments.
LGTM
When building a model for model serving, resolution of the hub repository is failing. Do we need to add fallback logic so that, if the snapshot location cannot be resolved, we fetch the model from the hub (a sketch of this idea follows the traceback below)? The serving build failure reports only the error message, while loading as pyfunc from a different environment gives a bit more insight into what is going on (the tokenizer load stage cannot resolve a local path that doesn't exist): HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/tmp/bert-tiny'. Use `repo_type` argument if needed.
---------------------------------------------------------------------------
HFValidationError Traceback (most recent call last)
File <command-3159396656015369>:2
1 # loaded_pyfunc_model = mlflow.pyfunc.load_model(model_uri=pyfunc_model_uri)
----> 2 loaded_pyfunc_model = mlflow.pyfunc.load_model(model_uri="dbfs:/databricks/mlflow-tracking/3159396656015360/edaa28ec924948359a329712e4589443/artifacts/question_answering_model")
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-5d5bb896-e863-4e4f-a816-a9c6c53541ae/lib/python3.10/site-packages/mlflow/pyfunc/__init__.py:630, in load_model(model_uri, suppress_warnings, dst_path)
628 _add_code_from_conf_to_system_path(local_path, conf, code_key=CODE)
629 data_path = os.path.join(local_path, conf[DATA]) if (DATA in conf) else local_path
--> 630 model_impl = importlib.import_module(conf[MAIN])._load_pyfunc(data_path)
631 predict_fn = conf.get("predict_fn", "predict")
632 return PyFuncModel(model_meta=model_meta, model_impl=model_impl, predict_fn=predict_fn)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-5d5bb896-e863-4e4f-a816-a9c6c53541ae/lib/python3.10/site-packages/mlflow/pyfunc/model.py:342, in _load_pyfunc(model_path)
337 artifacts[saved_artifact_name] = os.path.join(
338 model_path, saved_artifact_info[CONFIG_KEY_ARTIFACT_RELATIVE_PATH]
339 )
341 context = PythonModelContext(artifacts=artifacts)
--> 342 python_model.load_context(context=context)
343 signature = mlflow.models.Model.load(model_path).signature
344 return _PythonModelPyfuncWrapper(
345 python_model=python_model, context=context, signature=signature
346 )
File /root/.ipykernel/64369/command-3159396656015368-2533779477:14, in load_context(self, context)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-5d5bb896-e863-4e4f-a816-a9c6c53541ae/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py:652, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
649 return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
651 # Next, let's try to use the tokenizer_config file to get the tokenizer class.
--> 652 tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
653 if "_commit_hash" in tokenizer_config:
654 kwargs["_commit_hash"] = tokenizer_config["_commit_hash"]
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-5d5bb896-e863-4e4f-a816-a9c6c53541ae/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py:496, in get_tokenizer_config(pretrained_model_name_or_path, cache_dir, force_download, resume_download, proxies, use_auth_token, revision, local_files_only, subfolder, **kwargs)
434 """
435 Loads the tokenizer configuration from a pretrained model tokenizer configuration.
436
(...)
493 tokenizer_config = get_tokenizer_config("tokenizer-test")
494 ```"""
495 commit_hash = kwargs.get("_commit_hash", None)
--> 496 resolved_config_file = cached_file(
497 pretrained_model_name_or_path,
498 TOKENIZER_CONFIG_FILE,
499 cache_dir=cache_dir,
500 force_download=force_download,
501 resume_download=resume_download,
502 proxies=proxies,
503 use_auth_token=use_auth_token,
504 revision=revision,
505 local_files_only=local_files_only,
506 subfolder=subfolder,
507 _raise_exceptions_for_missing_entries=False,
508 _raise_exceptions_for_connection_errors=False,
509 _commit_hash=commit_hash,
510 )
511 if resolved_config_file is None:
512 logger.info("Could not locate the tokenizer configuration file, will try to use the model config instead.")
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-5d5bb896-e863-4e4f-a816-a9c6c53541ae/lib/python3.10/site-packages/transformers/utils/hub.py:417, in cached_file(path_or_repo_id, filename, cache_dir, force_download, resume_download, proxies, use_auth_token, revision, local_files_only, subfolder, repo_type, user_agent, _raise_exceptions_for_missing_entries, _raise_exceptions_for_connection_errors, _commit_hash)
414 user_agent = http_user_agent(user_agent)
415 try:
416 # Load from URL or cache if already cached
--> 417 resolved_file = hf_hub_download(
418 path_or_repo_id,
419 filename,
420 subfolder=None if len(subfolder) == 0 else subfolder,
421 repo_type=repo_type,
422 revision=revision,
423 cache_dir=cache_dir,
424 user_agent=user_agent,
425 force_download=force_download,
426 proxies=proxies,
427 resume_download=resume_download,
428 use_auth_token=use_auth_token,
429 local_files_only=local_files_only,
430 )
432 except RepositoryNotFoundError:
433 raise EnvironmentError(
434 f"{path_or_repo_id} is not a local folder and is not a valid model identifier "
435 "listed on 'https://huggingface.co/models'\nIf this is a private repository, make sure to "
436 "pass a token having permission to this repo with `use_auth_token` or log in with "
437 "`huggingface-cli login` and pass `use_auth_token=True`."
438 )
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-5d5bb896-e863-4e4f-a816-a9c6c53541ae/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py:110, in validate_hf_hub_args.<locals>._inner_fn(*args, **kwargs)
105 for arg_name, arg_value in chain(
106 zip(signature.parameters, args), # Args values
107 kwargs.items(), # Kwargs values
108 ):
109 if arg_name in ["repo_id", "from_id", "to_id"]:
--> 110 validate_repo_id(arg_value)
112 elif arg_name == "token" and arg_value is not None:
113 has_token = True
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-5d5bb896-e863-4e4f-a816-a9c6c53541ae/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py:158, in validate_repo_id(repo_id)
155 raise HFValidationError(f"Repo id must be a string, not {type(repo_id)}: '{repo_id}'.")
157 if repo_id.count("/") > 1:
--> 158 raise HFValidationError(
159 "Repo id must be in the form 'repo_name' or 'namespace/repo_name':"
160 f" '{repo_id}'. Use `repo_type` argument if needed."
161 )
163 if not REPO_ID_REGEX.match(repo_id):
164 raise HFValidationError(
165 "Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are"
166 " forbidden, '-' and '.' cannot start or end the name, max length is 96:"
167 f" '{repo_id}'."
168 )
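For illustration, here is a minimal sketch of the kind of fallback being suggested, inside a hypothetical pyfunc wrapper's load_context. The "snapshot" artifact key matches this PR's usage; the fallback repo id, the wrapper class, and its predict behavior are assumptions for the example, not part of the PR:

import os

from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

import mlflow


class QAModelWithFallback(mlflow.pyfunc.PythonModel):
    # Hypothetical fallback repo id, used only when the logged snapshot
    # location cannot be resolved in the current environment.
    FALLBACK_REPO_ID = "prajjwal1/bert-tiny"

    def load_context(self, context):
        snapshot_path = context.artifacts.get("snapshot")
        if not snapshot_path or not os.path.isdir(snapshot_path):
            # The saved snapshot path (e.g. /tmp/bert-tiny) does not exist
            # here, so re-fetch the repo from the hub instead of letting
            # the tokenizer load fail with HFValidationError.
            snapshot_path = snapshot_download(repo_id=self.FALLBACK_REPO_ID)
        self.tokenizer = AutoTokenizer.from_pretrained(snapshot_path)

    def predict(self, context, model_input):
        # Trivial placeholder behavior for the sketch.
        return [self.tokenizer.tokenize(t) for t in model_input["text"]]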
LGTM! Curious - how long does the new test case take?
A bit concerned about the approach here. We seem to be enabling users to save models whose file dependencies are not contained in the model directory. E.g. if I call mlflow.pyfunc.save_model() with a snapshot dependency and then upload with mlflow.log_artifact(), the uploaded model won't be usable, because the snapshot dependency isn't part of the model directory and wouldn't have been uploaded.
@BenWilson2 @serena-ruan I was imagining another approach where we enable users to download the snapshot directly to the model directory when mlflow.pyfunc.log_model() is called, e.g. by passing

artifacts={
    "my_hf_model": "huggingface:/..."
}

or something comparable, where MLflow recognizes this special artifact scheme and knows how to download the snapshot directly (see the sketch after this comment).
Another notable limitation of the current approach is that it only supports one snapshot per MLflow model, which may be unsuitable for ensemble use cases
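To make the proposal concrete, here is a rough sketch of how such a scheme could be resolved at save time. It is purely hypothetical: the function name, its place in the save path, and the scheme handling are assumptions rather than an actual MLflow API, and it assumes a huggingface_hub version whose snapshot_download supports the local_dir parameter:

import os

from huggingface_hub import snapshot_download


def _resolve_pyfunc_artifact(name, uri, model_dir):
    # Hypothetical resolver: a "huggingface:/<repo_id>" artifact URI is
    # materialized directly inside the model directory, so the logged
    # model stays self-contained.
    if uri.startswith("huggingface:/"):
        repo_id = uri[len("huggingface:/"):]
        local_dir = os.path.join(model_dir, "artifacts", name)
        return snapshot_download(repo_id=repo_id, local_dir=local_dir)
    # Otherwise fall back to the existing local-path / artifact-URI logic.
    return uri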
I'm a little bit confused.
@serena-ruan the only feature set associated with this approach should be the first element that you mentioned. This is only for users who just want to log, register, and deploy a pre-trained model directly from the hub. With the architecture defined and this special option called, we fetch each component of the pipeline directly from the hub into the model's artifact directory, bypassing the cache location.
@@ -377,7 +377,7 @@ def __init__(self, chdr=False, remove_on_exit=True):
         self._remove = remove_on_exit

     def __enter__(self):
-        self._path = os.path.abspath(tempfile.mkdtemp())
+        self._path = os.path.abspath(create_tmp_dir())
Curious why we need this change.
https://databricks.atlassian.net/browse/ES-836443
It seems that on Databricks we should use /local_disk0.
On Databricks, /tmp storage space cannot be scaled, but /local_disk0 is designed to be scalable, so we should save large temporary files to /local_disk0.
For small files, using /tmp is fine, right? I don't think we always need to use /local_disk0.
We definitely should not be using /tmp for these very large LLMs. Is there a concern you have, @harupy, with using EBS-backed scalable storage within Databricks for everything? It will likely:
- Be a simpler implementation
- Not require size-estimation logic that, for LLMs, would require extensive recursive directory inspection (there are a LOT of files in these repos)
- Prevent any potential issues with other usages of the /tmp swap space on the Spark driver
To keep the logic simple, I think we can always use /local_disk0. It doesn't have a performance difference from /tmp.
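For reference, a minimal sketch of what a create_tmp_dir helper along these lines might look like. The Databricks detection via DATABRICKS_RUNTIME_VERSION and the /local_disk0/tmp base path are assumptions for illustration; the actual helper lives in MLflow's utilities:

import os
import tempfile


def create_tmp_dir():
    # Assumption: a Databricks runtime is detectable via the
    # DATABRICKS_RUNTIME_VERSION environment variable.
    if "DATABRICKS_RUNTIME_VERSION" in os.environ:
        # /local_disk0 is scalable storage on Databricks, unlike the
        # fixed-size /tmp, so put large temporary files there.
        base = "/local_disk0/tmp"
        os.makedirs(base, exist_ok=True)
        return tempfile.mkdtemp(dir=base)
    return tempfile.mkdtemp()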
LGTM!
LGTM!
Related Issues/PRs
#xxx

What changes are proposed in this pull request?
If users download huggingface models to a snapshot location, then specifying it with artifacts={"snapshot": snapshot_location} can avoid downloading the model twice. This should be useful for LLMs (a usage sketch follows).
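A hedged sketch of the intended usage, assuming a small hub repo such as prajjwal1/bert-tiny and a simple fill-mask wrapper (both illustrative choices, not part of the PR):

from huggingface_hub import snapshot_download
from transformers import pipeline

import mlflow


class MaskFillModel(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # The "snapshot" artifact resolves to the local snapshot
        # directory, so no second download happens at load time.
        self.pipe = pipeline("fill-mask", model=context.artifacts["snapshot"])

    def predict(self, context, model_input):
        # Assumes a DataFrame input with a "text" column.
        return self.pipe(list(model_input["text"]))


# Download the repo once; the same snapshot is reused when logging.
snapshot_location = snapshot_download(repo_id="prajjwal1/bert-tiny")

with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="fill_mask_model",
        python_model=MaskFillModel(),
        artifacts={"snapshot": snapshot_location},
    )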
How is this patch tested?

Does this PR change the documentation?
Release Notes
Is this a user-facing change?
To log an LLM hosted by huggingface with the PyFunc flavor, it's suggested to specify artifacts={"snapshot": snapshot_location} when logging the model, so that the model always points to the same snapshot_location and downloading the model twice is avoided.

What component(s), interfaces, languages, and integrations does this PR affect?
Components
- area/artifacts: Artifact stores and artifact logging
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages
- area/examples: Example code
- area/gateway: AI Gateway service, Gateway client APIs, third-party Gateway integrations
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
- area/projects: MLproject format, project running backends
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/server-infra: MLflow Tracking server backend
- area/tracking: Tracking Service, tracking client APIs, autologging

Interface
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
- area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
- area/windows: Windows support

Language
- language/r: R APIs and clients
- language/java: Java APIs and clients
- language/new: Proposals for new client languages

Integrations
- integrations/azure: Azure and Azure ML integrations
- integrations/sagemaker: SageMaker integrations
- integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:
- rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
- rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- rn/feature - A new user-facing feature worth mentioning in the release notes
- rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
- rn/documentation - A user-facing documentation change worth mentioning in the release notes