Vertex AI - model versioning doesn't work with CreateAutoMLTextTrainingJobOperator #37400

Closed
1 of 2 tasks
devinmnorris opened this issue Feb 13, 2024 · 4 comments · Fixed by #38417
Labels
area:providers good first issue kind:bug This is a clearly a bug provider:google Google (including GCP) related issues

Comments

@devinmnorris

Apache Airflow Provider(s)

google

Versions of Apache Airflow Providers

apache-airflow-providers-google==10.12.0

Apache Airflow version

2.6.3

Operating System

Ubuntu 22.04.3 LTS

Deployment

Docker-Compose

Deployment details

No response

What happened

When creating AutoML Text training jobs with CreateAutoMLTextTrainingJobOperator and passing the resource name or model ID of an existing model as the parent_model parameter, an entirely new model at Version 1 shows up in the Vertex AI Model Registry.

What you think should happen instead

Since we provided an argument to parent_model, the model uploaded by the job should be a version of the existing parent model.

How to reproduce

If your model registry already has an existing model to use as the parent model, skip to step 3. Otherwise:

  1. Train the initial model
  2. Get the initial model's resource name
  3. Train a new model, specifying parent_model=initial_model_resource_name
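from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonVirtualenvOperator
from airflow.providers.google.cloud.operators.vertex_ai.auto_ml import (
    CreateAutoMLTextTrainingJobOperator,
)

# PROJECT_ID, REGION, and DATASET_ID are placeholders for your own values.


def get_parent_model(project_id: str):
    # Imported inside the callable because it runs in a separate virtualenv.
    from google.cloud import aiplatform

    aiplatform.init(project=project_id)
    # Sort so the most recently updated model comes first.
    models = sorted(
        aiplatform.Model.list(), key=lambda m: m.version_update_time, reverse=True
    )

    return models[0].resource_name


# The dag_id, start_date, and schedule values below are illustrative placeholders.
with DAG(
    dag_id="vertex_ai_model_versioning",
    start_date=datetime(2024, 1, 1),
    schedule=None,
) as dag: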
    initial_model = CreateAutoMLTextTrainingJobOperator(
        task_id="create_auto_ml_training_job-1",
        project_id=PROJECT_ID,
        region=REGION,
        display_name="automl-training-job-1",
        training_fraction_split=0.8,
        test_fraction_split=0.2,
        dataset_id=DATASET_ID,
        prediction_type="classification",
    )

    initial_model_resource_name = PythonVirtualenvOperator(
        task_id="initial_model_resource_name",
        python_callable=get_parent_model,
        requirements=["google-cloud-aiplatform"],
        op_kwargs={
            "project_id": PROJECT_ID,
        },
    )

    model_version_2 = CreateAutoMLTextTrainingJobOperator(
        task_id="create_auto_ml_training_job-2",
        project_id=PROJECT_ID,
        region=REGION,
        display_name="automl-training-job-2",
        parent_model=initial_model_resource_name.output,
        training_fraction_split=0.8,
        test_fraction_split=0.2,
        dataset_id=DATASET_ID,
        prediction_type="classification",
    )

    initial_model >> initial_model_resource_name >> model_version_2

Anything else

This problem only occurs when using the CreateAutoMLTextTrainingJobOperator, and not with the Vertex AI SDK for Python. For example, we were able to implement model versioning successfully using something like:

google-cloud-aiplatform==1.41.0

from google.cloud import aiplatform

aiplatform.init(project=PROJECT, location=LOCATION)

text_dataset = aiplatform.TextDataset(DATASET_ID)

job = aiplatform.AutoMLTextTrainingJob(
    display_name=display_name,
    prediction_type="classification",
    multi_label=False,
)

model = job.run(
    dataset=text_dataset,
    model_display_name=model_display_name,
    training_fraction_split=0.8,
    validation_fraction_split=0.1,
    test_fraction_split=0.1,
    parent_model=PARENT_MODEL_ID,
    is_default_version=is_default_version,
)

model.wait()

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@devinmnorris devinmnorris added area:providers kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels Feb 13, 2024

boring-cyborg bot commented Feb 13, 2024

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

@Lee-W Lee-W added the provider:google Google (including GCP) related issues label Feb 14, 2024
@eladkal
Contributor

eladkal commented Feb 14, 2024

cc @MaksYermak can you take a look?
I think you completed the system tests for Vertex AI. If the tests pass, then maybe we are missing some coverage for this bug?

@eladkal eladkal added good first issue and removed needs-triage label for new issues that we didn't triage yet labels Feb 14, 2024
@VladaZakharova
Contributor

Hi @devinmnorris !
Regarding your example from the "Anything else" section, can you please provide the value of the PARENT_MODEL_ID parameter?
As far as I can see from the implementation we have, the operator indeed takes only the model_id as an input parameter, not the resource_name.

@devinmnorris
Author

Hi @VladaZakharova :)

We tried both the SDK and the Operator approach, each with both of the following:

  • model_id, e.g., 1234567890
  • resource_name, e.g., projects/1234/locations/us-central1/models/1234567890

It seems that either works when using the SDK, and neither works when using the Operator.
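
For anyone reproducing this with both identifier forms, here is a minimal sketch of a helper that accepts either a bare model ID or a full resource name and splits it into its parts. The function name split_parent_model is hypothetical and is not part of Airflow or the Vertex AI SDK. The optional @version suffix handling is an assumption about how versioned model references are written. It only parses strings and makes no API calls.

```python
import re

# Hypothetical helper for illustration; not part of Airflow or the Vertex AI SDK.
# Matches projects/<project>/locations/<location>/models/<model_id>[@<version>]
_RESOURCE_NAME = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/models/(?P<model_id>[^/@]+)"
    r"(?:@(?P<version>[^/]+))?$"
)


def split_parent_model(parent_model: str) -> dict:
    """Split a model ID or full resource name into its components."""
    match = _RESOURCE_NAME.match(parent_model)
    if match:
        return match.groupdict()

    # Otherwise treat the input as a bare model ID, optionally with "@version".
    model_id, _, version = parent_model.partition("@")
    if not model_id or "/" in model_id:
        raise ValueError(f"Unrecognized parent_model value: {parent_model!r}")
    return {
        "project": None,
        "location": None,
        "model_id": model_id,
        "version": version or None,
    }
```

Calling split_parent_model("1234567890")["model_id"] and split_parent_model("projects/1234/locations/us-central1/models/1234567890")["model_id"] should both give the same ID, which makes it easy to pass a consistent value to parent_model while testing each form.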
