[ETL-492] Update Glue table schemas to match crawler findings #63

Merged
merged 1 commit into from
Jul 7, 2023

Conversation

philerooski
Contributor

No description provided.

@philerooski philerooski requested a review from a team as a code owner July 6, 2023 20:56
@philerooski philerooski temporarily deployed to develop July 6, 2023 21:02 — with GitHub Actions Inactive
@philerooski philerooski temporarily deployed to develop July 6, 2023 21:06 — with GitHub Actions Inactive
@philerooski philerooski temporarily deployed to develop July 6, 2023 21:20 — with GitHub Actions Inactive
@philerooski philerooski temporarily deployed to develop July 6, 2023 21:22 — with GitHub Actions Inactive
Contributor

@rxu17 rxu17 left a comment

LGTM! Just a comment

Contributor

Will this schema change for the production data affect test runs with the pilot data given that the pilot data doesn't have these updated fields/new fields?

Contributor Author

The changes here fall into a few different categories:

  • New fields -- these changes don't affect the pilot data. The fields aren't in the data, so they aren't included in the parquet.
  • Overdue changes -- In some cases, specifically EnrolledParticipants.EventDates and SymptomLog.Value, we hadn't updated the types after the transformations we apply to these fields in the S3 to JSON job. They are strings in the original data, but they contain string-encoded JSON. We load them as dicts and write them as JSON objects in the ndjson, so the schemas now reflect the object structure.
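
To make the "overdue changes" case concrete, here is a minimal sketch of that kind of transformation. It is not the actual S3 to JSON job code: the helper name `decode_json_fields`, the sample record, and the keys inside `EventDates` are all hypothetical, chosen only to show a string-encoded JSON field becoming a nested object in the ndjson output.

```python
import json


def decode_json_fields(record, fields):
    """Return a copy of `record` with the named string-encoded JSON fields parsed.

    Hypothetical helper for illustration; not taken from the repository.
    """
    decoded = dict(record)
    for field in fields:
        value = decoded.get(field)
        # Only parse values that are still strings; leave everything else alone.
        if isinstance(value, str):
            decoded[field] = json.loads(value)
    return decoded


# Hypothetical raw record: EventDates arrives as a string containing JSON.
raw = {
    "ParticipantID": "abc",
    "EventDates": '{"enrollment": "2023-01-01"}',
}

decoded = decode_json_fields(raw, ["EventDates"])

# One ndjson line: EventDates is now a nested JSON object, not a quoted string.
ndjson_line = json.dumps(decoded)
print(ndjson_line)
```

Because the written value is an object rather than a string, a table schema that still declares the column as a string no longer describes what is in the ndjson, which is the mismatch the schema update addresses.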

So the pilot data isn't affected by these changes. It's fair to ask why nothing was breaking before, given the overdue changes. TBH, I don't think Glue entirely respects the data types we use in our tables. In some cases it seems to just use whatever data type it detects and can load the data with.

Contributor Author

@thomasyu888 see my comment above

Member

@thomasyu888 thomasyu888 left a comment

🔥 I'm going to pre-approve, but I have the same question as Rixing. Thanks for pushing this through!

@philerooski philerooski merged commit ab9331e into main Jul 7, 2023
@philerooski philerooski deleted the etl-492 branch July 7, 2023 21:33
3 participants