
Add Apache Beam operators - refactor operator - common Dataflow logic #976

Closed
wants to merge 22 commits

Conversation

TobKed
Member

@TobKed TobKed commented Feb 2, 2021


^ Add meaningful description above

- Read the Pull Request Guidelines for more information.
- In case of a fundamental code change, an Airflow Improvement Proposal (AIP) is needed.
- In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
- In case of backwards incompatible changes, please leave a note in UPDATING.md.

@github-actions

github-actions bot commented Feb 2, 2021

The PR is likely OK to be merged with just a subset of tests for the default Python and Database versions, without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full test matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest master or amend the last commit of the PR, and push it with --force-with-lease.

@TobKed TobKed force-pushed the add-apache-beam branch 3 times, most recently from 8e1d421 to 1a4844e Compare February 3, 2021 18:18
XD-DENG and others added 20 commits February 3, 2021 18:30
- Use template literals instead of '+' for forming strings, when applicable
- remove unused variables (gantt.html)
- remove unused function arguments, when applicable
For the regular providers, the vast majority are at version `1.0.1` with only a documentation update - but this way we will have a consistent set of documentation (including commit history), and when we release to PyPI, the READMEs will be much smaller and link to the documentation.

We have two new providers (version 1.0.0):

* neo4j
* apache.beam

There are a few providers with changes:

Breaking changes (2.0.0):

* google
* slack

Feature changes (1.1.0):

* amazon
* exasol
* http
* microsoft.azure
* openfaas
* sftp
* snowflake
* ssh

There were also a few providers with 'real' bugfixes (1.0.1):

* apache.hive
* cncf.kubernetes
* docker
* elasticsearch
* exasol
* mysql
* openfaas
* papermill
* presto
* sendgrid
* sqlite

The "backport packages" documentation is prepared only for those providers that had actual bugfixes/features/breaking changes:

```
amazon apache.hive cncf.kubernetes docker elasticsearch exasol google
http microsoft.azure mysql openfaas papermill presto sendgrid sftp
slack snowflake sqlite ssh
```

Only those will be generated with `2021.2.5` calver version.
Fixes the issue wherein regardless of what role anonymous users are assigned (via the `AUTH_ROLE_PUBLIC` env var), they can't see any DAGs.

Current behavior:
Anonymous users are handled as a special case by Airflow's DAG-related security methods (`.has_access()` and `.get_accessible_dags()`). Rather than checking the `AUTH_ROLE_PUBLIC` value for role permissions, the methods reject access to view or edit any DAGs.

Changes in this PR:
Rather than hardcoding permission rules inside the security methods, this change checks the `AUTH_ROLE_PUBLIC` value and gives anonymous users all permissions linked to the designated role. 

**This places security in the hands of the Airflow users. If the value is set to `Admin`, anonymous users will have full admin functionality.**

This also changes how the `Public` role is created. Currently, the `Public` role is created automatically by Flask App Builder. This PR explicitly declares `Public` as a default role with no permissions in `security.py`. This change makes it easier to test.

closes: apache#13340
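For context, `AUTH_ROLE_PUBLIC` is a Flask AppBuilder setting read from the webserver configuration. A minimal sketch of how a deployment might opt in (the value `"Viewer"` here is an illustrative choice, not a recommendation):

```python
# webserver_config.py -- Flask AppBuilder settings read by the Airflow webserver.
# AUTH_ROLE_PUBLIC names the role whose permissions anonymous users receive.
# WARNING: setting this to "Admin" gives unauthenticated visitors full admin rights.
AUTH_ROLE_PUBLIC = "Viewer"  # illustrative; a permission-less "Public" role stays the safe default
```

With this PR, anonymous users inherit exactly the permissions attached to the named role, instead of being unconditionally denied DAG access.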
…pache#14032)

In the case of an OperationalError (caused by deadlocks or network blips), the scheduler will now retry those methods 3 times.

closes apache#11899
closes apache#13668
closes apache#14050
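The retry behavior described above can be sketched as follows. This is an illustrative helper, not Airflow's actual implementation; the name `run_with_db_retry` is hypothetical, and a stand-in exception class replaces `sqlalchemy.exc.OperationalError` to keep the sketch self-contained:

```python
import time

class OperationalError(Exception):
    """Stand-in for sqlalchemy.exc.OperationalError (deadlocks, network blips)."""

MAX_DB_RETRIES = 3  # the PR retries the scheduler's DB-touching methods 3 times

def run_with_db_retry(fn, retries=MAX_DB_RETRIES, backoff=0.0):
    """Hypothetical helper: retry a DB-touching callable on OperationalError."""
    for attempt in range(1, retries + 1):
        try:
            return fn()
        except OperationalError:
            if attempt == retries:
                raise  # retries exhausted: surface the error to the caller
            time.sleep(backoff * attempt)  # simple linear backoff between attempts

# Demo: a flaky operation that fails twice, then succeeds on the 3rd attempt.
calls = {"n": 0}
def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OperationalError("deadlock detected")
    return "ok"

print(run_with_db_retry(flaky_query))  # prints "ok" on the 3rd attempt
```

The key property is that a transient failure on attempts 1 or 2 is swallowed, while a failure on the final attempt propagates as before.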

We were not deserializing `BaseOperator.sla` properly, so we were returning a float instead of a `timedelta` object.

Example: `100.0` instead of `timedelta(seconds=100)`

And because of a check in `_manage_sla` in `SchedulerJob` and `DagFileProcessor`,
we were skipping SLAs.

SchedulerJob:
https://github.com/apache/airflow/blob/88bdcfa0df5bcb4c489486e05826544b428c8f43/airflow/jobs/scheduler_job.py#L1766-L1768

DagFileProcessor:
https://github.com/apache/airflow/blob/88bdcfa0df5bcb4c489486e05826544b428c8f43/airflow/jobs/scheduler_job.py#L395-L397
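The shape of the bug can be illustrated with a minimal round-trip sketch (this mirrors the described behavior, not Airflow's actual serializer code):

```python
from datetime import timedelta

sla = timedelta(seconds=100)

# Serialization stores the SLA as a plain number of seconds.
serialized = sla.total_seconds()   # 100.0, a float

# Buggy deserialization returned the float as-is, so type checks against
# timedelta failed downstream and SLA handling was skipped entirely.
buggy = serialized
assert not isinstance(buggy, timedelta)

# The fix rebuilds the timedelta from the stored seconds.
fixed = timedelta(seconds=serialized)
assert fixed == sla
assert isinstance(fixed, timedelta)
```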
This fixes the test `test_should_load_plugins_from_property`, which is currently quarantined as a "Heisentest".

Current behavior:
The test currently fails because the records that it expects to find in the logger are not present.

Cause:
While the test sets the log level to "DEBUG", it doesn't specify which logger to update. Python loggers are namespaced (typically based on the current module's path), but the name has to be given explicitly. In the absence of a specified name, `logging.getLogger()` returns the root logger.

The test is therefore updating the log level of the root logger, but plugins_manager.py defines a namespaced logger, `log = logging.getLogger(__name__)`, used throughout the file. Since a different logger is used, the original log level (INFO in this case) still applies. INFO is a higher level than DEBUG, so the calls to `log.debug()` get filtered out, and when the test looks for log records it finds an empty list.

Fix:
Specify which logger to update when modifying the log level in the test.
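The fix can be demonstrated in isolation. The logger name `"airflow.plugins_manager"` below stands in for the module's `__name__`, and the explicit INFO level on the `"airflow"` parent logger is an assumption standing in for Airflow's logging configuration:

```python
import logging

# Airflow's logging config sets an explicit level on the "airflow" logger
# (assumed here; this is what blocks inheritance from the root logger).
logging.getLogger("airflow").setLevel(logging.INFO)

# plugins_manager.py uses a namespaced, module-level logger:
log = logging.getLogger("airflow.plugins_manager")  # i.e. getLogger(__name__)

# Buggy test setup: no logger name, so only the ROOT logger changes.
logging.getLogger().setLevel(logging.DEBUG)
# The namespaced logger still resolves INFO from its "airflow" parent,
# so log.debug() records are filtered out.
assert log.getEffectiveLevel() == logging.INFO

# Fix: target the logger that plugins_manager actually uses.
logging.getLogger("airflow.plugins_manager").setLevel(logging.DEBUG)
assert log.getEffectiveLevel() == logging.DEBUG
```

Effective levels are resolved by walking up the logger hierarchy until an explicitly-set level is found, which is why changing the root logger alone had no effect here.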
And pytest 6 removed a class that the rerunfailures plugin was using, so
we have to upgrade that too.
…pache#14067)

Only `Admin` or `Op` roles should have permissions to view Configurations.

Previously, users with the `User` or `Viewer` role were able to get/view configurations using
the REST API or in the Webserver. From Airflow 2.0.1, only users with the `Admin` or `Op` role are able
to get/view Configurations.
…pache#13826)

* Update pod-template-file.kubernetes-helm-yaml

* Fix ssh-key access issue

This change allows `dags.gitSync.containerName` to read the ssh-key from the file system.
Similar to this https://github.com/varunvora/airflow/blob/ce0e6280d2ea39838e9f0617625cd07a757c3461/chart/templates/scheduler/scheduler-deployment.yaml#L92
It solves apache#13680 issue for private repositories.

Co-authored-by: Denis Krivenko <[email protected]>
* Add instruction for running docs locally

* Fix RST syntax

* Update docs/README.rst

Co-authored-by: Kaxil Naik <[email protected]>

Co-authored-by: Kaxil Naik <[email protected]>
`attachable` is only a property of compose version 3.1 files, but we are
still on 2.2.

This was failing on self-hosted runners with an error
`networks.example.com value Additional properties are not allowed
('attachable' was unexpected)`
@TobKed TobKed force-pushed the add-apache-beam-dataflow-refactor branch from 642366b to dfa5232 Compare February 5, 2021 11:45
@TobKed
Member Author

TobKed commented Feb 5, 2021

I moved work on it to: apache#14094

@TobKed TobKed closed this Feb 5, 2021