[ETL-329] delete check #12

Merged · 5 commits · Feb 2, 2023
14 changes: 10 additions & 4 deletions src/glue/jobs/s3_to_json.py

```diff
@@ -130,10 +130,16 @@ def get_metadata(basename: str) -> dict:
     if metadata["type"] == "HealthKitV2Samples":
         metadata["subtype"] = basename_components[1]
     if (
-        metadata["type"] == "HealthKitV2Samples"
-        and basename_components[-2] == "Deleted"
-    ):
-        metadata["type"] = "HealthKitV2Samples_Deleted"
+        metadata["type"]
+        in [
+            "HealthKitV2Samples",
+            "HealthKitV2Heartbeat",
+            "HealthKitV2Electrocardiogram",
+            "HealthKitV2Workouts",
+        ]
+        and basename_components[-2] == "Deleted"
+    ):
+        metadata["type"] = "{}_Deleted".format(metadata["type"])
```
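The effect of the change can be sketched in isolation; `resolve_type` below is a hypothetical standalone mirror of the updated conditional, not the job's actual code:

```python
def resolve_type(metadata: dict, basename_components: list) -> dict:
    # Mirror of the new check: any of the four HealthKit types whose
    # basename has "Deleted" as its second-to-last component gets a
    # "_Deleted" suffix appended to its type.
    if (
        metadata["type"]
        in [
            "HealthKitV2Samples",
            "HealthKitV2Heartbeat",
            "HealthKitV2Electrocardiogram",
            "HealthKitV2Workouts",
        ]
        and basename_components[-2] == "Deleted"
    ):
        metadata["type"] = "{}_Deleted".format(metadata["type"])
    return metadata

# A deleted-data export is renamed; other types pass through unchanged.
assert resolve_type(
    {"type": "HealthKitV2Workouts"},
    ["HealthKitV2Workouts", "Deleted", "20201022-20211022.json"],
)["type"] == "HealthKitV2Workouts_Deleted"
assert resolve_type(
    {"type": "HealthKitV2Statistics"},
    ["HealthKitV2Statistics", "20201022-20211022.json"],
)["type"] == "HealthKitV2Statistics"
```

Before this PR only `HealthKitV2Samples` took that branch; the change generalizes it to all four HealthKit types.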
Member:

Nit: I definitely prefer to use f-strings when possible, but here the double/single quote nesting is annoying:

```python
f"{metadata['type']}_Deleted"
```

We could do:

```python
meta_type = metadata['type']
f"{meta_type}_Deleted"
```

I am also OK with `.format` in this specific scenario.

@rxu17 (Contributor, Author) replied on Feb 1, 2023:

Yup, I have seen it, but I thought that in this scenario it would look a bit cleaner to use `.format`.
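The trade-off discussed above can be illustrated directly (this snippet is not from the PR; note that before Python 3.12, an f-string cannot reuse the same quote character inside its braces):

```python
metadata = {"type": "HealthKitV2Samples"}

# .format sidesteps the quote-nesting question entirely:
via_format = "{}_Deleted".format(metadata["type"])

# An f-string must switch to single quotes inside the braces
# (on Python < 3.12, reusing double quotes is a syntax error):
via_fstring = f"{metadata['type']}_Deleted"

# Or pull the value into a temporary variable first:
meta_type = metadata["type"]
via_variable = f"{meta_type}_Deleted"

assert via_format == via_fstring == via_variable == "HealthKitV2Samples_Deleted"
```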

```diff
     logger.debug("metadata = %s", metadata)
     return metadata
```
4 changes: 4 additions & 0 deletions tests/Dockerfile

```dockerfile
FROM amazon/aws-glue-libs:glue_libs_3.0.0_image_01

RUN pip3 install pytest-datadir
ENTRYPOINT ["bash", "-l"]
```
59 changes: 56 additions & 3 deletions tests/README.md

### Running tests
Tests are defined in the `tests` folder in this project.

#### Running tests using Docker
All tests can be run inside a Docker container, which includes all the necessary
Glue/Spark dependencies and simulates the environment in which the Glue jobs
run. A Dockerfile is included in the `tests` directory.

To run tests locally, first configure your AWS credentials, then build, launch,
and attach to the Docker container as follows.

Run the following commands to run tests for the `s3_to_json` script (in develop).

1. Navigate to the directory with the Dockerfile

```shell script
cd tests
```

2. Build the docker image from the Dockerfile

```shell script
docker build -t <some_name_for_container> .
```

3. Run the newly built image:

```shell script
docker run --rm -it \
-v ~/.aws:/home/glue_user/.aws \
-v ~/recover/:/home/glue_user/workspace/recover \
-e DISABLE_SSL=true -p 4040:4040 -p 18080:18080 <some_name_for_container>
```

4. Navigate to your repo inside the container

```shell script
cd <repo name>
```

5. Finally run the following (now that you are inside the running container)
to execute the tests:

```shell script
python3 -m pytest
```

#### Running tests using pipenv
Use [pipenv](https://pipenv.pypa.io/en/latest/index.html) to install
[pytest](https://docs.pytest.org/en/latest/) and run tests locally outside of
a Docker image.

Note that only the lambda function tests can be run with pipenv locally;
running the other tests this way will fail under pytest, since
`test_s3_to_json.py` has to be run inside the Docker container described above.

Run the following command from the repo root to run tests for the lambda
function (in develop). You can run this locally or inside the docker image.

```shell script
python3 -m pytest tests/test_s3_to_glue_lambda.py -v
```
Empty file removed tests/__init__.py
54 changes: 54 additions & 0 deletions tests/test_s3_to_json.py

```python
import os
import io
import json
import zipfile

import boto3
import pytest

from src.glue.jobs import s3_to_json


class TestS3ToJsonS3:
    def test_get_metadata_type(self):
        assert (
            s3_to_json.get_metadata(
                "HealthKitV2Samples_AppleExerciseTime_20201022-20211022.json"
            )["type"]
            == "HealthKitV2Samples"
        )

        assert (
            s3_to_json.get_metadata(
                "HealthKitV2Statistics_20201022-20211022.json"
            )["type"]
            == "HealthKitV2Statistics"
        )

        # These assertions check that deleted HealthKit data has
        # "_Deleted" appended to its type.
        assert (
            s3_to_json.get_metadata(
                "HealthKitV2Samples_Deleted_20201022-20211022.json"
            )["type"]
            == "HealthKitV2Samples_Deleted"
        )

        assert (
            s3_to_json.get_metadata(
                "HealthKitV2Heartbeat_Samples_Deleted_20201022-20211022.json"
            )["type"]
            == "HealthKitV2Heartbeat_Deleted"
        )

        assert (
            s3_to_json.get_metadata(
                "HealthKitV2Electrocardiogram_Samples_Deleted_20201022-20211022.json"
            )["type"]
            == "HealthKitV2Electrocardiogram_Deleted"
        )

        assert (
            s3_to_json.get_metadata(
                "HealthKitV2Workouts_Deleted_20201022-20211022.json"
            )["type"]
            == "HealthKitV2Workouts_Deleted"
        )
```