
Support huggingface snapshot locations in pyfunc log_model artifacts parameter #9362

Merged: serena-ruan merged 20 commits into mlflow:master on Aug 24, 2023

Conversation

serena-ruan (Collaborator)

Related Issues/PRs

#xxx

What changes are proposed in this pull request?

If users have already downloaded a huggingface model to a snapshot location, specifying that location with artifacts={"snapshot": snapshot_location} avoids downloading the model twice. This should be especially useful for LLMs.
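For illustration, here's a minimal sketch of the intended usage (the model id, task, and wrapper class below are illustrative, not part of this PR):

```python
import mlflow
from huggingface_hub import snapshot_download

# Reuses an existing local snapshot if the model was already downloaded.
snapshot_location = snapshot_download(repo_id="prajjwal1/bert-tiny")


class MaskFillModel(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        from transformers import pipeline

        # "snapshot" resolves to the copy of the snapshot stored under the
        # logged model's artifact directory.
        self.fill_mask = pipeline("fill-mask", model=context.artifacts["snapshot"])

    def predict(self, context, model_input):
        return [self.fill_mask(text) for text in model_input["text"]]


with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="mask_fill_model",
        python_model=MaskFillModel(),
        artifacts={"snapshot": snapshot_location},
    )
```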

How is this patch tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests (describe details, including test results, below)

Does this PR change the documentation?

  • No. You can skip the rest of this section.
  • Yes. Make sure the changed pages / sections render correctly in the documentation preview.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

To log an LLM hosted on huggingface with the PyFunc flavor, it is recommended to specify artifacts={"snapshot": snapshot_location} when logging the model, so that the model always points to the same snapshot_location and the model is not downloaded twice.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/gateway: AI Gateway service, Gateway client APIs, third-party Gateway integrations
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

@mlflow-automation (Collaborator)

mlflow-automation commented Aug 17, 2023

Documentation preview for f08b88a will be available here when this CircleCI job completes successfully.


@github-actions bot added labels area/models (MLmodel format, model serialization/deserialization, flavors) and rn/feature (Mention under Features in Changelogs) on Aug 17, 2023
@BenWilson2 (Member)

Can we add a pyfunc local serving test to one (not both) of the added tests to validate that loading the pyfunc model with "snapshot" location works for the serving container?

(a manual test for Model Serving with this save mode is worthwhile to validate as well :) )

@chenmoneygithub (Collaborator) left a comment

Thanks Serena for the PR!

The code looks good to me, dropped some minor comments.

(Inline review comments on tests/transformers/test_transformers_model_export.py and mlflow/pyfunc/model.py, resolved)
@WeichenXu123 (Collaborator) left a comment

LGTM

Signed-off-by: Serena Ruan <[email protected]> (several commits)
(Inline review comments on mlflow/pyfunc/model.py, resolved)
@BenWilson2 (Member)

When building a model for model serving, the resolution of the hub repository is failing.
An error occurred while loading the model. Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/tmp/bert-tiny'. Use `repo_type` argument if needed..

Do we need to add fallback logic to ensure that if the snapshot location cannot be resolved, we fetch the model?
With the current implementation, when loading in a different environment, the components of the base pipeline cannot be found (as they are looking for the local snapshot directory which is only present in the original environment that was used for logging).

Serving build failure reports the error message only:

An error occurred while loading the model. Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/tmp/bert-tiny'. Use `repo_type` argument if needed..

While loading as pyfunc from a different environment gives a bit more insight into what is going on (the tokenizer load stage cannot resolve a local path that doesn't exist):

HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/tmp/bert-tiny'. Use `repo_type` argument if needed.
---------------------------------------------------------------------------
HFValidationError                         Traceback (most recent call last)
File <command-3159396656015369>:2
      1 # loaded_pyfunc_model = mlflow.pyfunc.load_model(model_uri=pyfunc_model_uri)
----> 2 loaded_pyfunc_model = mlflow.pyfunc.load_model(model_uri="dbfs:/databricks/mlflow-tracking/3159396656015360/edaa28ec924948359a329712e4589443/artifacts/question_answering_model")

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-5d5bb896-e863-4e4f-a816-a9c6c53541ae/lib/python3.10/site-packages/mlflow/pyfunc/__init__.py:630, in load_model(model_uri, suppress_warnings, dst_path)
    628 _add_code_from_conf_to_system_path(local_path, conf, code_key=CODE)
    629 data_path = os.path.join(local_path, conf[DATA]) if (DATA in conf) else local_path
--> 630 model_impl = importlib.import_module(conf[MAIN])._load_pyfunc(data_path)
    631 predict_fn = conf.get("predict_fn", "predict")
    632 return PyFuncModel(model_meta=model_meta, model_impl=model_impl, predict_fn=predict_fn)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-5d5bb896-e863-4e4f-a816-a9c6c53541ae/lib/python3.10/site-packages/mlflow/pyfunc/model.py:342, in _load_pyfunc(model_path)
    337     artifacts[saved_artifact_name] = os.path.join(
    338         model_path, saved_artifact_info[CONFIG_KEY_ARTIFACT_RELATIVE_PATH]
    339     )
    341 context = PythonModelContext(artifacts=artifacts)
--> 342 python_model.load_context(context=context)
    343 signature = mlflow.models.Model.load(model_path).signature
    344 return _PythonModelPyfuncWrapper(
    345     python_model=python_model, context=context, signature=signature
    346 )

File /root/.ipykernel/64369/command-3159396656015368-2533779477:14, in load_context(self, context)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-5d5bb896-e863-4e4f-a816-a9c6c53541ae/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py:652, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    649     return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    651 # Next, let's try to use the tokenizer_config file to get the tokenizer class.
--> 652 tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
    653 if "_commit_hash" in tokenizer_config:
    654     kwargs["_commit_hash"] = tokenizer_config["_commit_hash"]

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-5d5bb896-e863-4e4f-a816-a9c6c53541ae/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py:496, in get_tokenizer_config(pretrained_model_name_or_path, cache_dir, force_download, resume_download, proxies, use_auth_token, revision, local_files_only, subfolder, **kwargs)
    434 """
    435 Loads the tokenizer configuration from a pretrained model tokenizer configuration.
    436 
   (...)
    493 tokenizer_config = get_tokenizer_config("tokenizer-test")
    494 ```"""
    495 commit_hash = kwargs.get("_commit_hash", None)
--> 496 resolved_config_file = cached_file(
    497     pretrained_model_name_or_path,
    498     TOKENIZER_CONFIG_FILE,
    499     cache_dir=cache_dir,
    500     force_download=force_download,
    501     resume_download=resume_download,
    502     proxies=proxies,
    503     use_auth_token=use_auth_token,
    504     revision=revision,
    505     local_files_only=local_files_only,
    506     subfolder=subfolder,
    507     _raise_exceptions_for_missing_entries=False,
    508     _raise_exceptions_for_connection_errors=False,
    509     _commit_hash=commit_hash,
    510 )
    511 if resolved_config_file is None:
    512     logger.info("Could not locate the tokenizer configuration file, will try to use the model config instead.")

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-5d5bb896-e863-4e4f-a816-a9c6c53541ae/lib/python3.10/site-packages/transformers/utils/hub.py:417, in cached_file(path_or_repo_id, filename, cache_dir, force_download, resume_download, proxies, use_auth_token, revision, local_files_only, subfolder, repo_type, user_agent, _raise_exceptions_for_missing_entries, _raise_exceptions_for_connection_errors, _commit_hash)
    414 user_agent = http_user_agent(user_agent)
    415 try:
    416     # Load from URL or cache if already cached
--> 417     resolved_file = hf_hub_download(
    418         path_or_repo_id,
    419         filename,
    420         subfolder=None if len(subfolder) == 0 else subfolder,
    421         repo_type=repo_type,
    422         revision=revision,
    423         cache_dir=cache_dir,
    424         user_agent=user_agent,
    425         force_download=force_download,
    426         proxies=proxies,
    427         resume_download=resume_download,
    428         use_auth_token=use_auth_token,
    429         local_files_only=local_files_only,
    430     )
    432 except RepositoryNotFoundError:
    433     raise EnvironmentError(
    434         f"{path_or_repo_id} is not a local folder and is not a valid model identifier "
    435         "listed on 'https://huggingface.co/models'\nIf this is a private repository, make sure to "
    436         "pass a token having permission to this repo with `use_auth_token` or log in with "
    437         "`huggingface-cli login` and pass `use_auth_token=True`."
    438     )

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-5d5bb896-e863-4e4f-a816-a9c6c53541ae/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py:110, in validate_hf_hub_args.<locals>._inner_fn(*args, **kwargs)
    105 for arg_name, arg_value in chain(
    106     zip(signature.parameters, args),  # Args values
    107     kwargs.items(),  # Kwargs values
    108 ):
    109     if arg_name in ["repo_id", "from_id", "to_id"]:
--> 110         validate_repo_id(arg_value)
    112     elif arg_name == "token" and arg_value is not None:
    113         has_token = True

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-5d5bb896-e863-4e4f-a816-a9c6c53541ae/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py:158, in validate_repo_id(repo_id)
    155     raise HFValidationError(f"Repo id must be a string, not {type(repo_id)}: '{repo_id}'.")
    157 if repo_id.count("/") > 1:
--> 158     raise HFValidationError(
    159         "Repo id must be in the form 'repo_name' or 'namespace/repo_name':"
    160         f" '{repo_id}'. Use `repo_type` argument if needed."
    161     )
    163 if not REPO_ID_REGEX.match(repo_id):
    164     raise HFValidationError(
    165         "Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are"
    166         " forbidden, '-' and '.' cannot start or end the name, max length is 96:"
    167         f" '{repo_id}'."
    168     )

@chenmoneygithub (Collaborator) left a comment

LGTM! Curious - how long does the new test case take?

(Inline review comments on mlflow/pyfunc/model.py and tests/transformers/test_transformers_model_export.py, resolved)
@dbczumar (Collaborator) left a comment

A bit concerned about the approach here. We seem to be enabling users to save models where the file dependencies are not contained in the model directory. E.g. if I call mlflow.pyfunc.save_model() with a snapshot dependency and then upload with mlflow.log_artifact(), the uploaded model won't be usable because the snapshot dependency isn't part of the model directory and wouldn't have been uploaded.

@BenWilson2 @serena-ruan I was imagining another approach where we enable users to download the snapshot directly to the model directory when mlflow.pyfunc.log_model() is called, e.g. by passing

artifacts={
    "my_hf_model": "huggingface:/..."
}

or something comparable, where MLflow recognizes this special artifact scheme and knows how to download the snapshot directly.

Another notable limitation of the current approach is that it only supports one snapshot per MLflow model, which may be unsuitable for ensemble use cases
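For concreteness, a rough sketch of how such a scheme could be recognized (the helper name and structure are hypothetical; this is a proposal sketch, not what the PR implements):

```python
import os

from huggingface_hub import snapshot_download
from mlflow.artifacts import download_artifacts


def resolve_artifact(name, uri, dst_dir):
    """Hypothetical resolver for the proposed 'huggingface:/' artifact scheme."""
    if uri.startswith("huggingface:/"):
        repo_id = uri[len("huggingface:/"):].lstrip("/")
        # Materialize the snapshot directly inside the model directory so the
        # logged model is self-contained.
        return snapshot_download(
            repo_id=repo_id,
            local_dir=os.path.join(dst_dir, name),
            local_dir_use_symlinks=False,
        )
    # Everything else goes through MLflow's normal artifact resolution.
    return download_artifacts(artifact_uri=uri, dst_path=dst_dir)
```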

@serena-ruan (Collaborator, Author)

serena-ruan commented Aug 21, 2023

(Quoting @dbczumar's comment above.)

I'm a little bit confused. With artifacts={"my_hf_model": "huggingface:/..."}, does this mean that if users want to log a huggingface model, they should not create the model first, but instead log the model id into artifacts directly? Then the model download is triggered by mlflow.log_model, because we resolve the 'huggingface:/...' path and download the model into the model artifacts directory.
But if users already have the model locally, what we should do is avoid downloading it twice, so we would move the content from the existing download location into the model artifact directory. In that case, should we let them specify the downloaded location instead of a scheme like 'huggingface:/...' that re-downloads the model?

Signed-off-by: Serena Ruan <[email protected]>
@BenWilson2 (Member)

@serena-ruan the only feature set associated with this approach should be the first element that you mentioned. This is only for users who want to log, register, and deploy a pre-trained model directly from the hub. With the architecture defined and this special option specified, we fetch each component of the pipeline from the hub directly into the model artifact directory, bypassing the cache location.
One way to avoid the download to the cache is to specify local_dir_use_symlinks=False along with a local_dir when calling huggingface_hub.hf_hub_download() while saving each of the component elements.
If the version already exists in the local system cache, we can use huggingface_hub.try_to_load_from_cache() to skip the download (it returns the path to the file in the local cache, from which the contents can be copied directly).
I do not think that we should be moving files, though.
We'll need to be clear in the docs and examples about what this feature is for (logging a model without keeping a local copy in the user's cache) and how it isn't much of a time saver if a user wants to test out the model / components / pipeline locally.
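A rough sketch of the per-file flow described above (the helper name and structure are my assumptions, not the actual implementation in this PR):

```python
import shutil
from pathlib import Path

from huggingface_hub import hf_hub_download, try_to_load_from_cache


def fetch_repo_file(repo_id, filename, dst_dir, revision=None):
    """Place one repository file into the model artifact directory, reusing the
    local huggingface cache when that revision is already present."""
    cached = try_to_load_from_cache(repo_id, filename, revision=revision)
    if isinstance(cached, str):
        # Cache hit: copy the cached file instead of downloading again.
        dst = Path(dst_dir) / filename
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(cached, dst)
        return str(dst)
    # Cache miss: download straight into the artifact directory as a real file
    # (no symlink back into the hub cache).
    return hf_hub_download(
        repo_id,
        filename,
        revision=revision,
        local_dir=dst_dir,
        local_dir_use_symlinks=False,
    )
```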

Signed-off-by: Serena Ruan <[email protected]>
@serena-ruan (Collaborator, Author)

  1. Downloading the model into the model's artifacts directory (screenshots attached)

Signed-off-by: Serena Ruan <[email protected]>
This reverts commit 4bddc0e.

Signed-off-by: Serena Ruan <[email protected]>
Signed-off-by: Serena Ruan <[email protected]>
@@ -377,7 +377,7 @@ def __init__(self, chdr=False, remove_on_exit=True):
         self._remove = remove_on_exit
 
     def __enter__(self):
-        self._path = os.path.abspath(tempfile.mkdtemp())
+        self._path = os.path.abspath(create_tmp_dir())
(Member)

Curious why we need this change.

(Collaborator Author)

https://databricks.atlassian.net/browse/ES-836443
It seems that on Databricks we should use /local_disk0.

(Collaborator)

On Databricks, /tmp storage space cannot be scaled, but /local_disk0 is designed to be scalable, so we should save large temporary files to /local_disk0.

@harupy (Member) commented Aug 23, 2023

For small files, using /tmp is fine, right? I don't think we always need to use /local_disk0.

(Member)

We definitely should not be using /tmp for these very large LLMs. Is there a concern you have, @harupy, with using EBS scalable storage within Databricks for everything? It will likely:

  1. Be a simpler implementation
  2. Not require size estimation logic that, for LLMs, will require extensive recursive directory inspection (there are a LOT of files in these repos)
  3. Prevent any potential issues with other usages of the /tmp swap space on the Spark Driver

(Collaborator)

To keep the logic simple, I think we can always use /local_disk0. It doesn't have a performance difference from /tmp.
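For reference, a minimal sketch of what a create_tmp_dir() along these lines might look like (detecting Databricks via the DATABRICKS_RUNTIME_VERSION environment variable is an assumption, not necessarily how MLflow does it):

```python
import os
import tempfile


def create_tmp_dir():
    # On Databricks, prefer the scalable /local_disk0 volume over the
    # fixed-size /tmp mount for potentially large temporary files.
    if os.environ.get("DATABRICKS_RUNTIME_VERSION") and os.path.isdir("/local_disk0"):
        base = "/local_disk0/tmp"
        os.makedirs(base, exist_ok=True)
        return tempfile.mkdtemp(dir=base)
    return tempfile.mkdtemp()
```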

@BenWilson2 (Member) left a comment

LGTM!

@harupy (Member) left a comment

LGTM!

Signed-off-by: Serena Ruan <[email protected]>
Signed-off-by: Serena Ruan <[email protected]>
@serena-ruan serena-ruan merged commit 1dd338d into mlflow:master Aug 24, 2023
44 checks passed
Labels: area/models (MLmodel format, model serialization/deserialization, flavors), rn/feature (Mention under Features in Changelogs)
7 participants