feat(docs): tutorial for writing a custom transformer #2959

Merged
merged 10 commits into from
Jul 28, 2021
4 changes: 3 additions & 1 deletion metadata-ingestion/README.md
@@ -948,7 +948,9 @@ sink:

## Transformations

See the [transformers guide](./transformers.md).
If you'd like to modify data before it reaches the ingestion sinks – for instance, adding additional owners or tags – you can write your own transformer module and integrate it with DataHub.

Check out the [transformers guide](./transformers.md) for more info!

## Using as a library

@@ -1,3 +1,4 @@
# see https://datahubproject.io/docs/metadata-ingestion/transformers for original tutorial
from datahub.configuration.common import ConfigModel


16 changes: 5 additions & 11 deletions metadata-ingestion/transformers.md
@@ -27,7 +27,7 @@ transformers:

:::tip

If you'd like to add more complex logic for assigning tags, you can use the more generic [`add_dataset_tags` transformer](./src/datahub/ingestion/transformer/add_dataset_tags.py), which calls a user-provided function to determine the tags for each dataset.
If you'd like to add more complex logic for assigning tags, you can use the more generic add_dataset_tags transformer, which calls a user-provided function to determine the tags for each dataset.

:::
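
As a rough illustration of what such a user-provided function might look like, here is a minimal sketch. It is not DataHub's actual callback signature: to stay self-contained it takes a dataset URN string and returns tag URN strings, whereas the real transformer works with DataHub's own metadata classes. The names `custom_tags` and `make_tag_urn` and the tag names are assumptions made for this example.

```python
from typing import List


def make_tag_urn(tag: str) -> str:
    """Hypothetical helper: build a tag URN string from a plain tag name."""
    return f"urn:li:tag:{tag}"


def custom_tags(dataset_urn: str) -> List[str]:
    """Hypothetical tag function: decide which tags a dataset should get.

    Every dataset gets an "ingested" tag; datasets whose URN ends with the
    PROD environment marker also get a "production" tag.
    """
    tags = ["ingested"]
    if dataset_urn.endswith(",PROD)"):
        tags.append("production")
    return [make_tag_urn(tag) for tag in tags]
```

The point of the sketch is only that the tagging decision lives in ordinary Python, so it can branch on anything available about the dataset.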

@@ -47,12 +47,6 @@ transformers:
- "urn:li:corpGroup:groupname"
```

:::tip

If you'd like to add more complex logic for assigning ownership, you can use the more generic [`add_dataset_ownership` transformer](./src/datahub/ingestion/transformer/add_dataset_ownership.py), which calls a user-provided function to determine the ownership of each dataset. See below for a guide on how to set this up.

:::

## Writing a custom transformer from scratch

The examples above use classes that are already implemented in the ingestion framework. However, more advanced cases often require custom code, for instance if you'd like to use conditional logic or rewrite properties. In such cases, we can add our own module and define the arguments it takes as a custom transformer.
@@ -188,7 +182,9 @@ def transform_one(self, mce: MetadataChangeEventClass) -> MetadataChangeEventCla

### Installing the package

Now that we've defined the transformer, we can set up the package to install it and make it visible to the ingestion framework. To do so, create a `setup.py` in the same directory:
Now that we've defined the transformer, we need to make it visible to DataHub. This can be done by making sure the Python file is available as a local import.

Alternatively, create a `setup.py` in the same directory as our transform script to make it visible globally. After installing this package (e.g. with `python setup.py install` or `pip install -e .`), our module will be importable as `custom_transform_example`.

```python
from setuptools import find_packages, setup
@@ -198,14 +194,12 @@ setup(
version="1.0",
packages=find_packages(),
# if you don't already have DataHub installed, add it under install_requires
# install_requires=["acryl-datahub"]
# install_requires=["acryl-datahub"]
)
```
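
Before running a recipe, it can help to confirm that the module is actually importable from the current environment. The sketch below uses only the standard library. The helper names `is_importable` and `resolve_dotted_path` are made up for this example, and resolving a dotted `module.ClassName` string this way is an assumption about how such a reference can be looked up, not a description of DataHub's internals.

```python
import importlib
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


def resolve_dotted_path(path: str):
    """Import the module part of 'module.Attr' and return the named attribute."""
    module_name, _, attr_name = path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# Example check after installing the package:
#   is_importable("custom_transform_example")
#   resolve_dotted_path("custom_transform_example.AddCustomOwnership")
```

If `is_importable` returns `False`, the package is probably not installed in the same environment that runs the ingestion.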

### Running the transform

After installing this package (e.g. with `python setup.py` or `pip install -e .`), our module will be installed and importable.

```yaml
transformers:
- type: "custom_transform_example.AddCustomOwnership"