Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BigQueryInsertJobOperator is broken on any type of job except query #23826

Closed
2 tasks done
vaaalik opened this issue May 20, 2022 · 24 comments · Fixed by #24165
Closed
2 tasks done

BigQueryInsertJobOperator is broken on any type of job except query #23826

vaaalik opened this issue May 20, 2022 · 24 comments · Fixed by #24165
Labels
area:providers kind:bug This is a clearly a bug provider:google Google (including GCP) related issues

Comments

@vaaalik
Copy link
Contributor

vaaalik commented May 20, 2022

Apache Airflow Provider(s)

google

Versions of Apache Airflow Providers

apache-airflow-providers-google==7.0.0

Apache Airflow version

2.2.5

Operating System

MacOS 12.2.1

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

What happened

We are using BigQueryInsertJobOperator to load data from parquet files in Google Cloud Storage with this kind of configuration:

BigQueryInsertJobOperator(
        task_id="load_to_bq",
        configuration={
            "load": {
                "writeDisposition": "WRITE_APPEND",
                "createDisposition": "CREATE_IF_NEEDED",
                "destinationTable": destination_table,
                "sourceUris": source_files
                "sourceFormat": "PARQUET"
            }
        }

After upgrade to apache-airflow-providers-google==7.0.0 all load jobs are now broken. I believe that problem lies in this line:

table = job.to_api_repr()["configuration"]["query"]["destinationTable"]

So it's trying to get the destination table from query job config and makes it impossible to use any other type of job.

What you think should happen instead

No response

How to reproduce

Use BigQueryInsertJobOperator to submit any type of job except query

Anything else

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/providers/google/cloud/operators/bigquery.py", line 2170, in execute
    table = job.to_api_repr()["configuration"]["query"]["destinationTable"]
KeyError: 'query'

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@vaaalik vaaalik added area:providers kind:bug This is a clearly a bug labels May 20, 2022
@boring-cyborg
Copy link

boring-cyborg bot commented May 20, 2022

Thanks for opening your first issue here! Be sure to follow the issue template!

@raphaelauv
Copy link
Contributor

#23165 @wojsamjan

@potiuk
Copy link
Member

potiuk commented May 22, 2022

Do you think #23165 fixes this issue ? Can we close it @raphaelauv ?

@raphaelauv
Copy link
Contributor

No It's the reason of the issue

@potiuk
Copy link
Member

potiuk commented May 22, 2022

Ah OK. That was not clear @wojsamjan -> can you please take a look and fix it ?

@uranusjr uranusjr added the provider:google Google (including GCP) related issues label May 23, 2022
@kardoo
Copy link

kardoo commented May 31, 2022

I have exactly the issue.
I moved all my queries from operator BigQueryExecuteQueryOperator to BigQueryInsertJobOperator because it will be deprecated and now all queries no longer work.
What could be the best approach? Wait for a fix? Make a custom operator? I cannot find a workarround.
Thanks

@raphaelauv
Copy link
Contributor

try this

from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from typing import Any
from google.api_core.exceptions import Conflict
from airflow.exceptions import AirflowException
from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook
from airflow.providers.google.cloud.links.bigquery import BigQueryTableLink


class FixBigQueryInsertJobOperator(BigQueryInsertJobOperator):

    def execute(self, context: Any):
        hook = BigQueryHook(
            gcp_conn_id=self.gcp_conn_id,
            delegate_to=self.delegate_to,
            impersonation_chain=self.impersonation_chain,
        )
        self.hook = hook
        job_id = self._job_id(context)
        try:
            job = self._submit_job(hook, job_id)
            self._handle_job_error(job)
        except Conflict:
            # If the job already exists retrieve it
            job = hook.get_job(
                project_id=self.project_id,
                location=self.location,
                job_id=job_id,
            )
            if job.state in self.reattach_states:
                # We are reattaching to a job
                job.result(timeout=self.result_timeout, retry=self.result_retry)
                self._handle_job_error(job)
            else:
                # Same job configuration so we need force_rerun
                raise AirflowException(
                    f"Job with id: {job_id} already exists and is in {job.state} state. If you "
                    f"want to force rerun it consider setting `force_rerun=True`."
                    f"Or, if you want to reattach in this scenario add {job.state} to `reattach_states`"
                )

        if "query" in job.to_api_repr()["configuration"]:
            if "destinationTable" in job.to_api_repr()["configuration"]["query"]:
                table = job.to_api_repr()["configuration"]["query"]["destinationTable"]
                BigQueryTableLink.persist(
                    context=context,
                    task_instance=self,
                    dataset_id=table["datasetId"],
                    project_id=table["projectId"],
                    table_id=table["tableId"],
                )
        self.job_id = job.job_id
        return job.job_id

@kardoo
Copy link

kardoo commented May 31, 2022

Although I was avoiding this solution because I would have to refactor hundreds of queries, I guess its the best solution.
I wonder, if a fix arrives quickly enough so I could avoid this work. :)
Thank you so much for sharing.

@eladkal
Copy link
Contributor

eladkal commented May 31, 2022

This is open source project - the fix will arrive when someone decides to spend the time on fixing it.
If you volunteer I'd be happy assign this issue to you and assist with code review.

@kardoo
Copy link

kardoo commented May 31, 2022

My apologies if I was misunderstood. If I have time in the next couple of weeks, I will let you know.

@takuma11248250
Copy link

hi.
I have exactly the issue.
When do you expect this issue to be resolved?

@potiuk
Copy link
Member

potiuk commented Jun 3, 2022

hi. I have exactly the issue. When do you expect this issue to be resolved?

Have you read comments above @takuma11248250 ?

This is open source project - the fix will arrive when someone decides to spend the time on fixing it.
If you volunteer I'd be happy assign this issue to you and assist with code review.

This will be solved and released when someone solves it. You will not get answers on when it will be fixed but by contributing and providing fixes and testing you can help with speeding it up.

Let me revert the question. Do you expect to help with it by providing all details to help us solve it ?

Can we cound you to observe the issue and when we release an RC you will help with testing @takuma11248250 ?

@potiuk
Copy link
Member

potiuk commented Jun 3, 2022

And one more thing @takuma11248250 - if you need it fast actually you can help - I just merged the fix by @raphaelauv that should address it. This is an open-source project and you can easily apply the fix by cherry-picking this change.

When can we axpect that you apply it to your test system and report that the fix worked?

@takuma11248250
Copy link

@potiuk
Thank you for contact. Glad to see the fix is merge. We have now switched to a different operator to handle this. I am using google's cloud composer and will verify when it is reflected in that. thank you.

@potiuk
Copy link
Member

potiuk commented Jun 4, 2022

@potiuk Thank you for contact. Glad to see the fix is merge. We have now switched to a different operator to handle this. I am using google's cloud composer and will verify when it is reflected in that. thank you.

You do not have to wait for Composer. When we release an RC (soon) you can install it manually and test it by adding specifically apache-airflow-provider-google==X.Y.Zrc1 to your requirements even in Composer. Can we count on your @takuma11248250 to test it and react with "yeah, we tested it, it works" when we release the RC? I am happy to tag you int the issue when we gather feedback when providers are released. In fact, I am going to do it regardless of your answer.

I would really appreciate it - since you are interested in knowing when it will be released, you are the perfect source of information whether the source has been released. Failing to test it will basically be problematic for you in case it is not fixed, because the next wave of providers is going to be released in about month, but bt testing the RC you have a chance to confirm that the fix worked for you. By not doing it, you are risking that the fix will not fix your problem so you will have to wait at least another month or maybe longer.

But you can prevent that from happening by investing a little time and testing the change while it is in RC stage.

@gilangardya
Copy link

Hi @potiuk I'm having the same problem with my Composer, I can help test the RC version once it's released.

@potiuk
Copy link
Member

potiuk commented Jun 6, 2022

RC will be out likely tomorrow :)

@potiuk
Copy link
Member

potiuk commented Jun 6, 2022

Or maybe even today

@potiuk
Copy link
Member

potiuk commented Jun 7, 2022

The RC is out #24289 @takuma11248250 @gilangardya I'd love your "test status" in the #24289 - that's where everyone else will be posting theirs.

@DrStriky
Copy link

DrStriky commented Jun 7, 2022

I installed the new 8.0.0rc1 on google composer and it seems to have fixed the problem.

Thx for your help @raphaelauv

@gilangardya
Copy link

@potiuk I've tested it and it's working as expected, the test details are in test status #24289 (comment)

Thank you @raphaelauv, and everyone 👍

@MazrimT
Copy link

MazrimT commented Jun 9, 2022

I know this is already closed and fix on its way, but would it be possible for someone who tested this on gcp to explain how you tested? I can not find a way to override operators versions and would very much like to be able to help out testing stuff like this in the future.

@DrStriky
Copy link

DrStriky commented Jun 9, 2022

@MazrimT yeah sure

On the Google composer you can define pypi packages.
In this tab you need to add

apache-airflow-providers-google ==8.0.0rc2

This should install this specific version, and override the composer internal version of that package

@MazrimT
Copy link

MazrimT commented Jun 9, 2022

@DrStriky I have no idea what I've been doing before, I thought I tried exactly that but it didn't work, now it worked :) Thank you for your patience with a gcp beginner.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers kind:bug This is a clearly a bug provider:google Google (including GCP) related issues
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants