
fix 422 invalid value error caused by long k8s pod name #13299

Merged: 3 commits merged into apache:master from the qp_k8sname branch on Jan 29, 2021

Conversation

@houqp (Member) commented Dec 24, 2020

K8s pod names follow the DNS_SUBDOMAIN naming convention, which can be
broken down into one or more DNS_LABELs separated by `.`.

While the max length of a pod name (DNS_SUBDOMAIN) is 253, each label
component (DNS_LABEL) of the name cannot be longer than 63. Pod names
generated by the k8s executor currently contain only one label, which means
the total effective name length cannot be greater than 63.

This patch concatenates the uuid to pod_id using `.` to generate the pod
name, thus extending the max name length to 63 + len(uuid).

Reference: https://github.com/kubernetes/kubernetes/blob/release-1.1/docs/design/identifiers.md
Relevant discussion: kubernetes/kubernetes#79351 (comment)
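
To make the scheme concrete, here is a minimal sketch of the idea in Python (make_pod_name, the truncation rule, and the hex uuid form are illustrative assumptions, not the exact patch):

import uuid

MAX_LABEL_LEN = 63      # max length of a single DNS_LABEL
MAX_POD_NAME_LEN = 253  # max length of the full DNS_SUBDOMAIN name

def make_pod_name(pod_id: str) -> str:
    # Join a truncated pod_id and a uuid with '.', so each part is its own
    # DNS_LABEL and only the per-label 63-char limit applies to it.
    suffix = uuid.uuid4().hex  # 32 alphanumeric chars: a valid DNS_LABEL
    label = pod_id[:MAX_LABEL_LEN].strip('-.')
    name = f'{label}.{suffix}'
    assert len(name) <= MAX_POD_NAME_LEN
    return name

print(make_pod_name('very-long-dag-id-' + 'x' * 100))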

@boring-cyborg (bot) added the provider:cncf-kubernetes and area:Scheduler labels on Dec 24, 2020
@@ -367,24 +367,6 @@ def _annotations_to_key(self, annotations: Dict[str, str]) -> Optional[TaskInstanceKey]:

return TaskInstanceKey(dag_id, task_id, execution_date, try_number)

@staticmethod
def _make_safe_pod_id(safe_dag_id: str, safe_task_id: str, safe_uuid: str) -> str:
@houqp (Member Author):

Removing this, since the method is not used anywhere.

@houqp force-pushed the qp_k8sname branch 6 times, most recently from ea91ff0 to 498382e on December 24, 2020 07:37
@github-actions (bot):

The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.

@houqp force-pushed the qp_k8sname branch 3 times, most recently from 0359279 to 1e54988 on December 25, 2020 09:41
@houqp (Member Author) commented Dec 25, 2020

Is the k8s image test supposed to be flaky? I am seeing failures from it in my other unrelated PRs as well as master.

@potiuk (Member) commented Dec 25, 2020

Not THAT flaky, I think. This looks like a legit problem.

@potiuk (Member) commented Dec 25, 2020

But yeah... looks like master has the same problems :(

============ 32 failed, 23 passed, 2 warnings in 474.24s (0:07:54) =============

Something to fix in master then :(

@potiuk (Member) commented Dec 25, 2020

Hopefully this one will fix it : #13316

@potiuk (Member) commented Dec 26, 2020

Should be fixed with the latest fix @houqp :). Can you please rebase and check?

@pageldev (Contributor):

Nice, this probably closes #13189

@potiuk (Member) commented Dec 26, 2020

The main issue has been solved, but I think the two remaining failures need to be fixed in this PR @houqp :(

@houqp (Member Author) commented Dec 27, 2020

Yes, this should fix #13189 @grepthat. Thanks @potiuk for the quick fix; let me dig into what's going on with the integration test. It's odd that we have been running with this patch for a couple of days without any issue.

@github-actions (bot):

The Workflow run is cancelling this PR. Building images for the PR has failed. Follow the workflow link to check the reason.

@houqp force-pushed the qp_k8sname branch 2 times, most recently from 4a18216 to ec156bd on December 27, 2020 05:01
@houqp (Member Author) commented Dec 27, 2020

@potiuk looks like the kind cluster is not picking up the code change in my branch. I am able to reproduce this locally with breeze: the task pod names are created with `-` instead of `.` before the uuid. I looked into the scheduler pod config; it's using apache/airflow:master-python3.6-kubernetes as the container image. I also double-checked the /home/airflow/.local/lib/python3.6/site-packages/airflow/kubernetes/pod_generator.py file in the scheduler pod, and it indeed doesn't have my change.

For any executor code change, what other changes do I need to make to get the kind cluster to pick up my code?

@potiuk (Member) commented Dec 27, 2020

This should work out-of-the-box @houqp. There was a recent change, though, where the production images are built from packages rather than directly from sources. But the packages are locally prepared using the PR sources, so it should, in principle, work fine. I will take a look.

Kind request: this change #13323 should vastly help in analysing it much faster. It introduces grouping of the logs so that it will be much easier to analyse any problems.

If you can take a look, we merge that one, and you rebase your change on top, it will be much easier to analyse :)

@houqp (Member Author) commented Jan 8, 2021

@ashb I double-checked the watcher code; it's using the scheduler job id label as the filter: `kwargs = {'label_selector': f'airflow-worker={scheduler_job_id}'}`, so changing the pod name should have no impact on the watcher.
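
For context, a minimal sketch (not the executor's actual watcher code; the namespace and job id are made up) of how a label-selector watch behaves with the kubernetes Python client; events are matched purely on the label, so the pod name never enters into it:

from kubernetes import client, config, watch

config.load_kube_config()  # or load_incluster_config() when running in a pod
v1 = client.CoreV1Api()

scheduler_job_id = '123'  # hypothetical job id
w = watch.Watch()
for event in w.stream(
    v1.list_namespaced_pod,
    namespace='airflow',
    label_selector=f'airflow-worker={scheduler_job_id}',
):
    pod = event['object']
    print(event['type'], pod.metadata.name)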

@houqp (Member Author) commented Jan 8, 2021

@dimberman would appreciate your review on this as well :)

@pageldev (Contributor) commented Jan 14, 2021

Jumping in on this since I've been looking into the long pod name problem as well 👋

Will KubernetesExecutor::try_adopt_task_instances still work? As far as I understand, it will try to recreate a pod_id from the database entries without making the dag or task id safe first, therefore yielding a wrong pod_id?

@houqp (Member Author) commented Jan 18, 2021

@grepthat yes, it will still work, because it only uses `{'label_selector': f'airflow-worker={scheduler_job_id}'}` to filter pods.

@pageldev (Contributor):

@houqp In adopt_launched_task a pod id is reconstructed from the 'dag_id' and 'task_id' labels of the pod (which were made safe previously), and then it checks whether that reconstructed pod id is in the pod_ids list:

dag_id = pod.metadata.labels['dag_id']
task_id = pod.metadata.labels['task_id']
pod_id = create_pod_id(dag_id=dag_id, task_id=task_id)
if pod_id not in pod_ids:

The pod_ids dict passed to adopt_launched_task is keyed by pod ids reconstructed from the raw dag_id and task_id (without making the IDs safe beforehand):

pod_ids = {
create_pod_id(dag_id=ti.dag_id, task_id=ti.task_id): ti for ti in tis if ti.external_executor_id
}

Will this not yield problems?

@houqp (Member Author) commented Jan 18, 2021

@grepthat I see what you mean. Yes, that pod_ids dict construction code will need to be changed to use label-safe values as well.
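
A sketch of the kind of change being discussed, assuming Airflow's existing make_safe_label_value helper (create_pod_id and tis are as in the snippets quoted above; the exact call site may differ in the final commit):

from airflow.kubernetes.pod_generator import make_safe_label_value

# Sanitize the ids before building the lookup keys, so they match the
# label values that were made safe when the pods were created.
pod_ids = {
    create_pod_id(
        dag_id=make_safe_label_value(ti.dag_id),
        task_id=make_safe_label_value(ti.task_id),
    ): ti
    for ti in tis
    if ti.external_executor_id
}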

Qingping Hou added 2 commits January 19, 2021 16:26
@github-actions (bot):

The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.

@houqp (Member Author) commented Jan 20, 2021

@grepthat @ashb @kaxil @dimberman @brandondtb I pushed a commit to add more label sanitizing; ready for another round of review.

@kaxil (Member) commented Jan 21, 2021

Ping @dimberman

@kaxil (Member) commented Jan 27, 2021

@dimberman Can you take a look at this one?

@kaxil (Member) commented Jan 27, 2021

@grepthat Can you please take a look too :) thanks

@github-actions (bot):

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest master at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

@github-actions (bot) added the full tests needed label on Jan 27, 2021
@pageldev (Contributor):

@kaxil @houqp Looks good 👍 I checked this on a test DAG with a long task name (via nested Task Groups). Attached is the DAG as a reference:

process_long_taskname.py
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils.task_group import TaskGroup

dag = DAG(
    'process_long_task',
    default_args={
        'owner': 'airflow',
        'depends_on_past': False,
        'retries': 0,
        'start_date': datetime(1970, 1, 1),
        'retry_delay': timedelta(seconds=30),
    },
    description='',
    schedule_interval=None,
    catchup=False,
)

TG_survey00000 = TaskGroup(
    "TG_survey00000",
    tooltip="",
    dag=dag,
)

TG_incremental_adjustment_survey00000_f608c63d9b = TaskGroup(
    "TG_incremental_adjustment_survey00000_f608c63d9b",
    tooltip="",
    parent_group=TG_survey00000,
    dag=dag,
)

TG_msac_10_survey00000_1c1a34cf10 = TaskGroup(
    "TG_msac_10_survey00000_1c1a34cf10",
    tooltip="",
    parent_group=TG_incremental_adjustment_survey00000_f608c63d9b,
    dag=dag,
)

TG_adjuster_786931747d = TaskGroup(
    "TG_bundle_adjuster_786931747d",
    tooltip="",
    parent_group=TG_msac_10_survey00000_1c1a34cf10,
    dag=dag,
)

TG_color_0_521cd0b3f7 = TaskGroup(
    "TG_color_0_521cd0b3f7",
    tooltip="",
    parent_group=TG_adjuster_786931747d,
    dag=dag,
)

T_finalize_5b57782bb2 = BashOperator(
    task_id='T_finalize_5b57782bb2',
    bash_command='echo "executing nested task" && sleep 10',
    dag=dag,
    task_group=TG_color_0_521cd0b3f7,
)

@kaxil merged commit 862443f into apache:master on Jan 29, 2021
kaxil pushed a commit that referenced this pull request on Jan 29, 2021 (cherry picked from commit 862443f)
@houqp deleted the qp_k8sname branch on January 30, 2021 08:25
kaxil pushed a commit that referenced this pull request on Feb 4, 2021 (cherry picked from commit 862443f)
kaxil pushed a commit to astronomer/airflow that referenced this pull request on Apr 12, 2021 (cherry picked from commit 862443f)