Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(taps): Add api costs hook #704

Merged
merged 14 commits into from
Jun 21, 2022
Merged

Conversation

laurentS
Copy link
Contributor

@laurentS laurentS commented Jun 8, 2022

This PR addresses #348 and adds a simple callback mechanism to the sdk that is:

  • totally optional: taps won't see any difference in behaviour (not even logging output) until they implement calculate_api_request_cost in their streams.
  • flexible: each tap can aggregate cost along any number of dimensions/domains. The only constraint is to provide all dimensions on each request (even if associated cost is 0)
  • easy to implement: see Add cost callback for rest calls MeltanoLabs/tap-github#142
  • untested 😅 I'm not sure what the best way to test it is (the implementation in tap-github kind of does that, but it would be good to cover it in the sdk itself).
  • undocumented 🙈 I'll add that once we're all happy with the code

Here is an example of output (at the very end of the tap run) for tap-github with the implementation linked above:

time=2022-06-08 12:24:42 name=tap-github level=INFO message=Total API costs for stream languages: {'rest': 2, 'graphql': 0, 'search': 0}
time=2022-06-08 12:24:42 name=tap-github level=INFO message=Total API costs for stream repositories: {'rest': 2, 'graphql': 0, 'search': 0}
time=2022-06-08 12:24:42 name=tap-github level=INFO message=Total API costs for stream stargazers: {'rest': 0, 'graphql': 7, 'search': 0}

On the downside, the logging above could be improved for better machine handling. log_api_costs could be overridden by tap developers for this. And the way I call the logging method at the end feels a bit hacky, but I couldn't think of a better way to do it.

@laurentS laurentS changed the title Add api costs hook feat(taps): Add api costs hook Jun 8, 2022
@laurentS
Copy link
Contributor Author

laurentS commented Jun 8, 2022

@aaronsteers @edgarrmondragon it looks like the CI setup is not fully there yet for PRs, the job on python 3.10 (py, external) complains about missing credentials for tap-google-analytics, and the pre-commit one cannot find poetry. I'm not sure how to fix that last one.

@edgarrmondragon
Copy link
Collaborator

@laurentS Don't worry about the pre-commit check. The external tests can be retried by an admin (I'll look into making it wait for approval instead of just failing)

@codecov-commenter
Copy link

codecov-commenter commented Jun 9, 2022

Codecov Report

Merging #704 (e609492) into main (3cdb614) will increase coverage by 0.05%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #704      +/-   ##
==========================================
+ Coverage   85.26%   85.31%   +0.05%     
==========================================
  Files          34       34              
  Lines        3386     3399      +13     
==========================================
+ Hits         2887     2900      +13     
  Misses        499      499              
Impacted Files Coverage Δ
singer_sdk/streams/core.py 87.33% <100.00%> (+0.16%) ⬆️
singer_sdk/streams/rest.py 87.74% <100.00%> (+0.57%) ⬆️
singer_sdk/tap_base.py 72.89% <100.00%> (+0.12%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3cdb614...e609492. Read the comment docs.

@edgarrmondragon edgarrmondragon self-requested a review June 14, 2022 04:11
@edgarrmondragon edgarrmondragon self-assigned this Jun 14, 2022
Copy link
Collaborator

@edgarrmondragon edgarrmondragon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@laurentS I've added a few suggestions to rename api to sync in the base class, since I can imagine a more generic cost calculation for files, databases, etc.

Other points that might be good to start a discussion on:

  1. Which logger should be used. This is using the tap/stream logger, but I wonder if it makes sense to use a different one so metrics and tap behavior are not mixed in logs (releated to Reduce noise from 'ignored properties' warnings #383 (comment))

  2. Adding some tests to validate cost calculation in the general case.

singer_sdk/streams/core.py Outdated Show resolved Hide resolved
singer_sdk/streams/core.py Outdated Show resolved Hide resolved
singer_sdk/streams/rest.py Outdated Show resolved Hide resolved
singer_sdk/streams/rest.py Outdated Show resolved Hide resolved
singer_sdk/tap_base.py Outdated Show resolved Hide resolved
laurentS and others added 2 commits June 14, 2022 06:49
Co-authored-by: Edgar R. M. <[email protected]>
@laurentS
Copy link
Contributor Author

@laurentS I've added a few suggestions to rename api to sync in the base class, since I can imagine a more generic cost calculation for files, databases, etc.
Ok, makes sense, I've applied your suggestions. Thanks for writing them! I had gone with api since it was all network requests, but it's true the requests could be to anything.

Other points that might be good to start a discussion on:

1. **Which logger should be used**. This is using the tap/stream logger, but I wonder if it makes sense to use a different one so metrics and tap behavior are not mixed in logs (releated to [#383 (comment)](https://github.com/meltano/sdk/issues/383#issuecomment-1154701837))

From my perspective, this would be much better. Right now, the costs are logged at info level, but there is a lot of logging that I have no interest in at that same level. I hesitated to make the costs warnings but semantically, that feels wrong. I like your approach of splitting behaviour logging from metrics (though in general I have no use for the general metrics telling me how long each network request took, or the number of records for each of them). I guess we could add something like the _metric_logging_function to allow users to set the level they want for various parts of logging (behaviour, metrics, costs?).

2. Adding some tests to validate cost calculation in the general case.

I will try to write something for this.

@laurentS
Copy link
Contributor Author

@edgarrmondragon I've updated the code to match your suggestions, and added some testing to validate the summing code.
What do you suggest we do about the logging? If we go with something different as per the comment you linked to, should we do this as part of a separate PR?

singer_sdk/tap_base.py Outdated Show resolved Hide resolved
tests/core/test_streams.py Outdated Show resolved Hide resolved
@edgarrmondragon
Copy link
Collaborator

@edgarrmondragon I've updated the code to match your suggestions, and added some testing to validate the summing code. What do you suggest we do about the logging?

@laurentS thanks for the updates! I left a single suggestion. Other than that, this looks good to merge 😄.

If we go with something different as per the comment you linked to, should we do this as part of a separate PR?

Let's leave that for another PR, another day 👍

Copy link
Collaborator

@edgarrmondragon edgarrmondragon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@laurentS alright, this lgtm. Thank you!

@edgarrmondragon edgarrmondragon merged commit f75e9d6 into meltano:main Jun 21, 2022
edgarrmondragon added a commit that referenced this pull request Jun 30, 2022
* Add api costs hook

* Correct typing hints for older pythons

* Rename api to sync

Co-authored-by: Edgar R. M. <[email protected]>

* Apply suggestions from code review

Co-authored-by: Edgar R. M. <[email protected]>

* Rename cost methods

* Add sync costs calculation test

* Use a single loop for logging costs

Co-authored-by: Edgar R. M. <[email protected]>

* Update tap_base.py

* Add test for log_sync_costs

Co-authored-by: Edgar R. M. <[email protected]>

* Add missing import

Co-authored-by: Edgar R. M. <[email protected]>
Co-authored-by: Eric Boucher <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants