Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Source Salesforce: fix pagination in REST API streams #9151

Merged
merged 10 commits into from
Jan 18, 2022

Conversation

augan-rymkhan
Copy link
Contributor

@augan-rymkhan augan-rymkhan commented Dec 28, 2021

What

Resolves 9136
Salesforce returns 1000 records in response for some streams. As page_size = 2000 this condition is not met in this case in next_page_token method. So that user gets only the first 1000 records, as the method returns None.

How

The main change is start using native Salesforce pagination in REST API streams.
If the initial query (REST API) returns only part of the results, the end of the response will contain a field called nextRecordsUrl. In such cases, request the next batch of records using nextRecordsUrl and repeat until all records have been retrieved. For more info go here.
The solution will work only for REST API.
BULK API stream should work as before.


This change is Reviewable

@github-actions github-actions bot added the area/connectors Connector related issues label Dec 28, 2021
@augan-rymkhan augan-rymkhan changed the title Source Salesforce: fix next_page_token Source Salesforce: improve pagination in REST API streams Dec 28, 2021
@augan-rymkhan
Copy link
Contributor Author

augan-rymkhan commented Dec 28, 2021

/test connector=connectors/source-salesforce

🕑 connectors/source-salesforce https://github.com/airbytehq/airbyte/actions/runs/1630501517
❌ connectors/source-salesforce https://github.com/airbytehq/airbyte/actions/runs/1630501517
🐛 https://gradle.com/s/obs545k4iqipo

@jrhizor jrhizor temporarily deployed to more-secrets December 28, 2021 12:45 Inactive
@augan-rymkhan
Copy link
Contributor Author

augan-rymkhan commented Dec 29, 2021

I found that some streams have problem with native pagination on Salesforce side.
For example:
Stream: PermissionSetTabSetting
TotalSize: 17513
Actual returned records: 4780

The problem is totalSize returned in result does not always match to the actual number of records returned.

image

Today I also found that, BULK API reads 4780 records from PermissionSetTabSetting stream too. I assume, TotalSize value might be incorrect. So, I completed the implementation provided in this PR.

@augan-rymkhan
Copy link
Contributor Author

augan-rymkhan commented Dec 30, 2021

/test connector=connectors/source-salesforce

🕑 connectors/source-salesforce https://github.com/airbytehq/airbyte/actions/runs/1637295218
❌ connectors/source-salesforce https://github.com/airbytehq/airbyte/actions/runs/1637295218
🐛 https://gradle.com/s/jsgcbmzounj4m
Python short test summary info:

=========================== short test summary info ============================
FAILED test_core.py::TestBasicRead::test_read[inputs1] - docker.errors.Contai...
FAILED test_full_refresh.py::TestFullRefresh::test_sequential_reads[inputs1]
FAILED test_incremental.py::TestIncremental::test_two_sequential_reads[inputs1]
FAILED test_incremental.py::TestIncremental::test_state_with_abnormally_large_values[inputs1]
=================== 4 failed, 17 passed in 124.54s (0:02:04) ===================

@augan-rymkhan augan-rymkhan temporarily deployed to more-secrets December 30, 2021 11:24 Inactive
@jrhizor jrhizor temporarily deployed to more-secrets December 30, 2021 11:25 Inactive
@augan-rymkhan augan-rymkhan temporarily deployed to more-secrets December 30, 2021 11:53 Inactive
@augan-rymkhan augan-rymkhan temporarily deployed to more-secrets December 30, 2021 17:30 Inactive
@augan-rymkhan
Copy link
Contributor Author

augan-rymkhan commented Dec 30, 2021

/test connector=connectors/source-salesforce

🕑 connectors/source-salesforce https://github.com/airbytehq/airbyte/actions/runs/1638267997
❌ connectors/source-salesforce https://github.com/airbytehq/airbyte/actions/runs/1638267997
🐛 https://gradle.com/s/4thvlkcpcvjm6
Python short test summary info:

=========================== short test summary info ============================
FAILED test_incremental.py::TestIncremental::test_two_sequential_reads[inputs1]
FAILED test_incremental.py::TestIncremental::test_state_with_abnormally_large_values[inputs0]
FAILED test_incremental.py::TestIncremental::test_state_with_abnormally_large_values[inputs1]
=================== 3 failed, 18 passed in 819.43s (0:13:39) ===================

@augan-rymkhan
Copy link
Contributor Author

augan-rymkhan commented Jan 4, 2022

/test connector=connectors/source-salesforce

🕑 connectors/source-salesforce https://github.com/airbytehq/airbyte/actions/runs/1652557572
✅ connectors/source-salesforce https://github.com/airbytehq/airbyte/actions/runs/1652557572
Python tests coverage:

	 ---------- coverage: platform linux, python 3.8.10-final-0 -----------
	 Name                                                 Stmts   Miss  Cover
	 ------------------------------------------------------------------------
	 source_acceptance_test/__init__.py                       2      0   100%
	 source_acceptance_test/base.py                          10      4    60%
	 source_acceptance_test/config.py                        74      6    92%
	 source_acceptance_test/conftest.py                     109    109     0%
	 source_acceptance_test/plugin.py                        47     47     0%
	 source_acceptance_test/tests/__init__.py                 4      0   100%
	 source_acceptance_test/tests/test_core.py              242     96    60%
	 source_acceptance_test/tests/test_full_refresh.py       38      0   100%
	 source_acceptance_test/tests/test_incremental.py        69     38    45%
	 source_acceptance_test/utils/__init__.py                 6      0   100%
	 source_acceptance_test/utils/asserts.py                 37      2    95%
	 source_acceptance_test/utils/common.py                  54     17    69%
	 source_acceptance_test/utils/compare.py                 62     23    63%
	 source_acceptance_test/utils/connector_runner.py       110     48    56%
	 source_acceptance_test/utils/json_schema_helper.py     115     14    88%
	 ------------------------------------------------------------------------
	 TOTAL                                                  979    404    59%
	 ---------- coverage: platform linux, python 3.8.10-final-0 -----------
	 Name                                 Stmts   Miss  Cover
	 --------------------------------------------------------
	 source_salesforce/__init__.py            2      0   100%
	 source_salesforce/api.py               122     29    76%
	 source_salesforce/exceptions.py          1      0   100%
	 source_salesforce/rate_limiting.py      22      6    73%
	 source_salesforce/source.py             57     20    65%
	 source_salesforce/streams.py           240    147    39%
	 source_salesforce/utils.py               8      7    12%
	 --------------------------------------------------------
	 TOTAL                                  452    209    54%
	 ---------- coverage: platform linux, python 3.8.10-final-0 -----------
	 Name                                 Stmts   Miss  Cover
	 --------------------------------------------------------
	 source_salesforce/__init__.py            2      0   100%
	 source_salesforce/api.py               122     47    61%
	 source_salesforce/exceptions.py          1      0   100%
	 source_salesforce/rate_limiting.py      22      6    73%
	 source_salesforce/source.py             57     24    58%
	 source_salesforce/streams.py           240     59    75%
	 source_salesforce/utils.py               8      0   100%
	 --------------------------------------------------------
	 TOTAL                                  452    136    70%

@jrhizor jrhizor temporarily deployed to more-secrets January 4, 2022 07:26 Inactive
@augan-rymkhan augan-rymkhan temporarily deployed to more-secrets January 14, 2022 09:27 Inactive
@augan-rymkhan augan-rymkhan temporarily deployed to more-secrets January 14, 2022 09:44 Inactive
@augan-rymkhan
Copy link
Contributor Author

/test connector=connectors/source-salesforce

@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets January 14, 2022 09:47 Inactive
@augan-rymkhan augan-rymkhan changed the title Source Salesforce: improve pagination in REST API streams Source Salesforce: fix pagination in REST API streams Jan 14, 2022
@augan-rymkhan
Copy link
Contributor Author

After tests successfully passed, only comments and unit test were added, but the last test run did not started.
Also rate limit is reached, I will run again after limit is reset. PR is ready for code review.

@augan-rymkhan
Copy link
Contributor Author

/test connector=connectors/source-salesforce

@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets January 14, 2022 17:47 Inactive
@augan-rymkhan augan-rymkhan temporarily deployed to more-secrets January 17, 2022 08:20 Inactive
@augan-rymkhan
Copy link
Contributor Author

augan-rymkhan commented Jan 17, 2022

/test connector=connectors/source-salesforce

🕑 connectors/source-salesforce https://github.com/airbytehq/airbyte/actions/runs/1706977536
✅ connectors/source-salesforce https://github.com/airbytehq/airbyte/actions/runs/1706977536
Python tests coverage:

	 ---------- coverage: platform linux, python 3.8.10-final-0 -----------
	 Name                                                 Stmts   Miss  Cover
	 ------------------------------------------------------------------------
	 source_acceptance_test/__init__.py                       2      0   100%
	 source_acceptance_test/base.py                          10      4    60%
	 source_acceptance_test/config.py                        74      6    92%
	 source_acceptance_test/conftest.py                     109    109     0%
	 source_acceptance_test/plugin.py                        47     47     0%
	 source_acceptance_test/tests/__init__.py                 4      0   100%
	 source_acceptance_test/tests/test_core.py              242     96    60%
	 source_acceptance_test/tests/test_full_refresh.py       38      0   100%
	 source_acceptance_test/tests/test_incremental.py        69     38    45%
	 source_acceptance_test/utils/__init__.py                 6      0   100%
	 source_acceptance_test/utils/asserts.py                 37      2    95%
	 source_acceptance_test/utils/common.py                  54     17    69%
	 source_acceptance_test/utils/compare.py                 62     23    63%
	 source_acceptance_test/utils/connector_runner.py       110     48    56%
	 source_acceptance_test/utils/json_schema_helper.py     115     14    88%
	 ------------------------------------------------------------------------
	 TOTAL                                                  979    404    59%
	 ---------- coverage: platform linux, python 3.8.10-final-0 -----------
	 Name                                 Stmts   Miss  Cover
	 --------------------------------------------------------
	 source_salesforce/__init__.py            2      0   100%
	 source_salesforce/api.py               122     29    76%
	 source_salesforce/exceptions.py          1      0   100%
	 source_salesforce/rate_limiting.py      22      6    73%
	 source_salesforce/source.py             57     20    65%
	 source_salesforce/streams.py           243    150    38%
	 source_salesforce/utils.py               8      7    12%
	 --------------------------------------------------------
	 TOTAL                                  455    212    53%
	 ---------- coverage: platform linux, python 3.8.10-final-0 -----------
	 Name                                 Stmts   Miss  Cover
	 --------------------------------------------------------
	 source_salesforce/__init__.py            2      0   100%
	 source_salesforce/api.py               122     47    61%
	 source_salesforce/exceptions.py          1      0   100%
	 source_salesforce/rate_limiting.py      22      6    73%
	 source_salesforce/source.py             57     24    58%
	 source_salesforce/streams.py           243     42    83%
	 source_salesforce/utils.py               8      0   100%
	 --------------------------------------------------------
	 TOTAL                                  455    119    74%

@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets January 17, 2022 08:34 Inactive
Copy link
Contributor

@vitaliizazmic vitaliizazmic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please provide more detail about changes.

return f"/services/data/{self.sf_api.version}/queryAll"

def next_page_token(self, response: requests.Response) -> str:
response_data = response.json()
if len(response_data["records"]) == self.page_size and self.primary_key and self.name not in UNSUPPORTED_FILTERING_STREAMS:
return f"WHERE {self.primary_key} >= '{response_data['records'][-1][self.primary_key]}' "
return response_data.get("nextRecordsUrl")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vitaliizazmic
If REST API query result has more than 1 page, the next page url is present in response body as nextRecordsUrl.

"""
If `next_page_token` is set, subsequent requests use `nextRecordsUrl`.
"""
return next_page_token
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nextRecordsUrl is relative url, so we can return it here instead original url

"""
If `next_page_token` is set, subsequent requests use `nextRecordsUrl`, and do not include any parameters.
"""
return {}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we trying to get 2nd and more pages we don't need to send any params, use just nextRecordsUrl for api call.


if self.primary_key and self.name not in UNSUPPORTED_FILTERING_STREAMS:
query += f"ORDER BY {self.primary_key} ASC LIMIT {self.page_size}"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, in REST API, we don't need to LIMIT results set, because limit does not work for all streams.
I noticed, that when we limit by 2000 (page_size) some stream always returns 1000 records per page, other stream 465 per page.

@@ -259,6 +266,32 @@ def next_page_token(self, last_record: dict) -> str:
if self.primary_key and self.name not in UNSUPPORTED_FILTERING_STREAMS:
return f"WHERE {self.primary_key} >= '{last_record[self.primary_key]}' "

def request_params(
Copy link
Contributor Author

@augan-rymkhan augan-rymkhan Jan 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before the change, request_params was inherited from SalesfroceStream. But in this PR, parent's method was changed to respect new pagination approach. So I just added this method here not to break BULK API functionality.

@@ -324,13 +358,13 @@ def request_params(
}

stream_date = stream_state.get(self.cursor_field)
start_date = next_page_token or stream_date or self.start_date
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need next_page_token here, because next_page_token is relative URL for the next chunk of results.

@@ -352,3 +386,26 @@ class BulkIncrementalSalesforceStream(BulkSalesforceStream, IncrementalSalesforc
def next_page_token(self, last_record: dict) -> str:
if self.name not in UNSUPPORTED_FILTERING_STREAMS:
return last_record[self.cursor_field]

def request_params(
Copy link
Contributor Author

@augan-rymkhan augan-rymkhan Jan 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before the change, this method was inherited from IncrementalSalesforceStream. In this PR request_params is overriden not to break existing functionality, because parents methods were changed.

return f"/services/data/{self.sf_api.version}/queryAll"

def next_page_token(self, response: requests.Response) -> str:
response_data = response.json()
if len(response_data["records"]) == self.page_size and self.primary_key and self.name not in UNSUPPORTED_FILTERING_STREAMS:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main issue was here, self.page_size = 2000, but real records count per page was different than 2000, so this method returned None: only the first page was read in this case.

Copy link
Contributor

@sergei-solonitcyn sergei-solonitcyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good enough, I like it.

@github-actions github-actions bot added the area/documentation Improvements or additions to documentation label Jan 18, 2022
@augan-rymkhan augan-rymkhan temporarily deployed to more-secrets January 18, 2022 14:10 Inactive
@augan-rymkhan
Copy link
Contributor Author

augan-rymkhan commented Jan 18, 2022

/publish connector=connectors/source-salesforce

🕑 connectors/source-salesforce https://github.com/airbytehq/airbyte/actions/runs/1713155369
✅ connectors/source-salesforce https://github.com/airbytehq/airbyte/actions/runs/1713155369

@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets January 18, 2022 14:19 Inactive
@augan-rymkhan augan-rymkhan temporarily deployed to more-secrets January 18, 2022 14:37 Inactive
@augan-rymkhan augan-rymkhan merged commit 62c433e into master Jan 18, 2022
@augan-rymkhan augan-rymkhan deleted the arymkhan/fix_salesforce_pagination branch January 18, 2022 14:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Source Salesforce: REST API reads only 1000 records from some streams
6 participants