Increase JSON to Parquet timeout #94

Merged: philerooski merged 1 commit into main from etl-583 on Dec 12, 2023
Conversation

philerooski (Contributor) commented:

This is approximately the largest timeout we could use (1,200 minutes is 20 hours) while keeping the entire workflow time under ~22-24 hours.

@rxu17 (Contributor) left a review:

LGTM!

@@ -49,7 +49,7 @@ Parameters:
   TimeoutInMinutes:
     Type: Number
     Description: The job timeout in minutes (integer).
-    Default: 720
+    Default: 1200
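
For context, a minimal hypothetical sketch of how a CloudFormation template typically wires a parameter like this into an AWS Glue job's Timeout property (the logical IDs, role reference, and script path below are placeholders, not taken from this repository):

```yaml
# Hypothetical wiring; the actual template in this repository may differ.
Parameters:
  TimeoutInMinutes:
    Type: Number
    Description: The job timeout in minutes (integer).
    Default: 1200

Resources:
  JsonToParquetJob:                 # placeholder logical ID
    Type: AWS::Glue::Job
    Properties:
      Role: !Ref GlueJobRole        # placeholder role reference
      # AWS::Glue::Job's Timeout is expressed in minutes, matching the parameter.
      Timeout: !Ref TimeoutInMinutes
      Command:
        Name: glueetl
        ScriptLocation: s3://example-bucket/json_to_parquet.py  # placeholder path
```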
rxu17 (Contributor) commented:

This means we're getting json to parquet jobs that run over 12 hours right now? For the bigger data types like HealthKit?

philerooski (Contributor, Author) replied:

Just FitbitIntraday. HealthKitV2Samples is our second largest data type, but it only takes ~2 hours.

rxu17 (Contributor) replied:

Do you think we need to specify TimeoutInMinutes for each data type so we know generally how long each data type should be taking?

philerooski (Contributor, Author) replied:

Well... if it ain't broke, don't fix it? Doing it for each data type would add a lot of complexity for little to no benefit. We could do something like what we do for the number of workers, where we have a separate LargeJobNumberOfWorkers variable for just the larger jobs, but for now using the same timeout for everything hasn't caused any issues, so my vote is for doing nothing.
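
For illustration only (nothing in this PR does this): a per-size timeout could mirror the existing LargeJobNumberOfWorkers pattern with a hypothetical parameter along these lines:

```yaml
Parameters:
  TimeoutInMinutes:
    Type: Number
    Description: The job timeout in minutes (integer).
    Default: 720
  # Hypothetical parameter mirroring LargeJobNumberOfWorkers; not part of this PR.
  LargeJobTimeoutInMinutes:
    Type: Number
    Description: Timeout in minutes for the largest data types (e.g. FitbitIntraday).
    Default: 1200
```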

rxu17 (Contributor) replied:

Sounds good! Do we have a table somewhere of the current average job run times (or at least run-time ranges) and workers per data type? I feel like it might still be good to have in case something takes abnormally long compared to usual. What do you think?

philerooski (Contributor, Author) replied:

We don't. If I thought a job might be running for an abnormally long time, I would compare it with previous job run times. My main objection to having a table with this information is that since the size of the data is always changing (and we may make other changes, like the number of workers or even architectural changes), the table would need to be updated regularly, at least monthly. I don't think it would even be referenced monthly.

philerooski merged commit e882fee into main on Dec 12, 2023
14 checks passed
philerooski deleted the etl-583 branch on December 12, 2023 at 23:40