Refactor BigQuery Destination Integration tests #20851
Conversation
Affected Connector Report

NOTE: See "Actionable Items" below for how to resolve warnings and errors.

✅ Destinations (2)

Connector | Version | Changelog | Publish |
---|---|---|---|
destination-bigquery | 1.2.9 | ✅ | ✅ |
destination-bigquery-denormalized | 1.2.9 | ✅ | ✅ |
👀 Other Modules (1)
- base-normalization
Actionable Items

Category | Status | Actionable Item |
---|---|---|
Version | ❌ mismatch | The version of the connector is different from its normal variant. Please bump the version of the connector. |
| ⚠ doc not found | The connector does not seem to have a documentation file. This can be normal (e.g. a basic connector like source-jdbc is not published or documented). Please double-check to make sure that it is not a bug. |
Changelog | ⚠ doc not found | The connector does not seem to have a documentation file. This can be normal (e.g. a basic connector like source-jdbc is not published or documented). Please double-check to make sure that it is not a bug. |
| ❌ changelog missing | There is no changelog for the current version of the connector. If you are the author of the current version, please add a changelog. |
Publish | ⚠ not in seed | The connector is not in the seed file (e.g. source_definitions.yaml), so its publication status cannot be checked. This can be normal (e.g. some connectors are cloud-specific, and only listed in the cloud seed file). Please double-check to make sure that it is not a bug. |
| ❌ diff seed version | The connector exists in the seed file, but the latest version is not listed there. This usually means that the latest version is not published. Please use the /publish command to publish the latest version. |
checking my understanding:
- BigQueryDAT - runs the full DAT suite against a standard inserts config
- BigQueryGcsDAT - runs the full DAT suite against a GCS staging config. Extends BigQueryDAT for convenience
  - Maybe we should have a root-level BigQueryDAT and then inherit BigQueryStandardInsertsDAT + BigQueryGcsDAT from it? It might reduce some code duplication between them in the setup method (sketch below)
- BigQueryDestinationTest - verifies a couple different expected success/failure configs with basic behaviors (check, write) but doesn't run the full DAT process against all of them?
left a couple small comments+questions but overall this makes sense.
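To make the root-class idea concrete, here is a minimal sketch of what that hierarchy could look like. The class names, config file paths, and the setUpConfig method are assumptions for illustration, not the actual code in this PR; only BigQueryDestinationTestUtils.createConfig comes from the change itself.

// Sketch only: illustrative names, not the real classes in this PR.
import com.fasterxml.jackson.databind.JsonNode;
import java.nio.file.Path;

abstract class AbstractBigQueryDestinationAcceptanceTest {

  protected JsonNode config;

  // shared setup lives in the root class: read credentials, build the config, create the dataset
  protected void setUpConfig(final String datasetId) throws Exception {
    config = BigQueryDestinationTestUtils.createConfig(getConfigPath(), datasetId);
  }

  // each subclass only declares which secrets/config file it runs against
  protected abstract Path getConfigPath();
}

class BigQueryStandardInsertsDestinationAcceptanceTest extends AbstractBigQueryDestinationAcceptanceTest {

  @Override
  protected Path getConfigPath() {
    return Path.of("secrets/credentials-standard-inserts.json"); // hypothetical file name
  }
}

class BigQueryGcsDestinationAcceptanceTest extends AbstractBigQueryDestinationAcceptanceTest {

  @Override
  protected Path getConfigPath() {
    return Path.of("secrets/credentials-gcs-staging.json"); // hypothetical file name
  }
}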
final String tmpConfigAsString = Files.readString(configFile);
final JsonNode tmpConfigJson = Jsons.deserialize(tmpConfigAsString);
final JsonNode tmpCredentialsJson = tmpConfigJson.get(BigQueryConsts.BIGQUERY_BASIC_CONFIG);
Builder<Object, Object> mapBuilder = ImmutableMap.builder();
nitpick: ObjectNode finalConfig = (ObjectNode) Jsons.emptyObject() and build it directly
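A rough sketch of that idea; the setAll/put calls and the CONFIG_DATASET_ID constant are assumptions about the final config shape, only BIGQUERY_BASIC_CONFIG appears in the snippet above.

// Sketch: build the ObjectNode directly instead of going through an ImmutableMap builder.
// Requires com.fasterxml.jackson.databind.node.ObjectNode.
final ObjectNode finalConfig = (ObjectNode) Jsons.emptyObject();
finalConfig.setAll((ObjectNode) tmpCredentialsJson);           // copy the basic credential fields
finalConfig.put(BigQueryConsts.CONFIG_DATASET_ID, datasetId);  // constant name assumed
return finalConfig;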
public class BigQueryDestinationTestUtils {

  public static JsonNode createConfig(Path configFile, String datasetId) throws IOException {
not sure I understand - it looks like this is reading a json blob and shuffling some stuff around. Can we just store the final json blob in GSM?
good point. We should make the secrets stored in GSM have the structure that the tests expect and avoid all this extra shuffling
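If the secret stored in GSM already had the final shape the tests expect, the setup could arguably shrink to something like this sketch:

// Sketch: deserialize the secrets file as-is, with no restructuring in test code.
final JsonNode config = Jsons.deserialize(Files.readString(configFile));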
/**
 * Remove all the GCS output from the tests.
 */
public static boolean tearDownGcs(AmazonS3 s3Client, JsonNode config, Logger LOGGER) {
nitpick (nonblocking): would be cool to have this live in base-java-s3, but idk if our build.gradle are set up to support this easily
final JsonNode properties = config.get(BigQueryConsts.LOADING_METHOD);
final String gcsBucketName = properties.get(BigQueryConsts.GCS_BUCKET_NAME).asText();
final String gcs_bucket_path = properties.get(BigQueryConsts.GCS_BUCKET_PATH).asText();
this means that if we ever want to test the gcs_bucket_path = "" scenario, we'll want to set up a new bucket for it - which is probably correct, just want to call it out
is gcs_bucket_path = "" a valid config for staging?
I think so? Not aware of any reason why not
try {
  dataset = BigQueryDestinationTestUtils.initDataSet(config, bigquery, datasetId);
} catch(Exception ex) {
  //ignore
shouldn't we fail if we can't create the dataset? Or is this to handle expected errors like permissions problems?
good point. I got lost in my own refactoring and left this try-catch behind
}
}));
protected void addShutdownHook() {
  Runtime.getRuntime().addShutdownHook(new Thread(() -> {
I'd mildly prefer to have this in an @AfterEach method, but this is fine if that would be complicated (also looks like you inherited this from the existing test structure, so feel free to ignore)
Hm.. I thought it was already called after each test, because it overrides tearDown. I'll add the @AfterEach annotation. The shutdown hooks were there before refactoring and AFAIK the reason is to execute cleanup even if the test crashes.
I'm pretty sure junit takes care of running AfterEach methods, even if tests blow up (except maybe in cases where the entire runtime crashes?)
but yeah, don't worry about this if it looks super messy. Sounds like this has been working correctly as a shutdown hook anyway
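For reference, a sketch of the @AfterEach variant being discussed; tearDownBigQuery is a hypothetical helper, while tearDownGcs matches the utility method shown earlier in this PR.

// Sketch: JUnit 5 runs @AfterEach even when the test method itself throws.
// Requires org.junit.jupiter.api.AfterEach.
@AfterEach
void tearDown() {
  BigQueryDestinationTestUtils.tearDownGcs(s3Client, config, LOGGER);
  tearDownBigQuery(); // hypothetical helper that deletes the test dataset
}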
throw new IllegalStateException("""
    Json config not found. Must provide path to a big query credentials file,
    please add file with creds to
    ../destination-bigquery/secrets/credentials-with-missed-dataset-creation-role.json.""");
Suggested change:
- ../destination-bigquery/secrets/credentials-with-missed-dataset-creation-role.json.""");
+ <...>/destination-bigquery/secrets/credentials-with-missed-dataset-creation-role.json.""");

since .. might get misinterpreted
Looks like we broke DATs for destination-bigquery-denormalized back in November. I opened a separate PR to fix it: #20871
/test connector=connectors/destination-bigquery
Build Passed
Test summary info:
/test connector=connectors/destination-bigquery-denormalized
Build Passed
Test summary info:
@@ -70,7 +70,7 @@ public class BigQueryUtils {
DateTimeFormatter.ofPattern("[yyyy][yy]['-']['/']['.'][' '][MMM][MM][M]['-']['/']['.'][' '][dd][d]" +
    "[[' ']['T']HH:mm[':'ss[.][SSSSSS][SSSSS][SSSS][SSS][' '][z][zzz][Z][O][x][XXX][XX][X]]]");
private static final String USER_AGENT_FORMAT = "%s (GPN: Airbyte)";
private static final String CHECK_TEST_DATASET_SUFFIX = "_airbyte_check_stage_tmp_" + System.currentTimeMillis();
Just curious whether removing this timestamp may cause issues during simultaneous test execution? E.g. the same tests running for different PRs, a local run, etc?
This shouldn't affect anything since it's implemented further down in checkHashCreateAndDeleteDatasetRole, and CHECK_TEST_DATASET_SUFFIX is only used within that method
The reason I moved the timestamp from class instantiation time to test execution time is that it was causing name collision when tests run too fast.
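In other words, roughly this pattern; this is a sketch of the idea rather than the exact diff, and the method name is illustrative.

// Sketch: compute the temporary dataset suffix per call instead of once at class-load time,
// so back-to-back runs (or parallel CI jobs) don't collide on the same dataset name.
private static String checkTestDatasetSuffix() {
  return "_airbyte_check_stage_tmp_" + System.currentTimeMillis();
}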
}

protected void tarDownGcs() {
this probably should be tearDownGcs
Overall, looks good and the usage of config files cleans up a lot of the overhead from before too. Thanks for taking on this change
if(s3Client == null) {
  return;
}
if(BigQueryUtils.getLoadingMethod(config) != UploadingMethod.GCS) {
This if check can probably be grouped together with the preceding if statement. Also, just to make sure this is clear, is the reason for these empty returns basically to say we shouldn't be logging any warnings? From what I can understand, it seems that if these values don't exist then it will log a warning in the try-catch block.
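e.g. a sketch of the grouped version, using the same calls as the snippet above:

// Sketch: one combined guard instead of two back-to-back early returns.
if (s3Client == null || BigQueryUtils.getLoadingMethod(config) != UploadingMethod.GCS) {
  return;
}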
throw new IllegalStateException("""
    Json config not found. Must provide path to a big query credentials file,
    please add file with creds to
    <...>/destination-bigquery/secrets/credentials-with-missed-dataset-creation-role.json.""");
Might be a good place to use the constants in a string format instead of having the values re-typed. Something like

throw new IllegalStateException(String.format("""
    Json config not found. Must provide path to a big query credentials file,
    please add file with creds to
    <...>/destination-bigquery/%s""",
    CREDENTIALS_WITH_MISSED_CREATE_DATASET_ROLE_PATH));

That said, these hardcoded strings already exist in the tests, so it's a minor nit and it's not functionally changing anything. The reason for the nit is that it explicitly gives the reader an understanding of where the file name came from.
Good catch. I should actually remove all of this copy pasta and just iterate over the array of paths.
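Something like this sketch, reusing the ALL_PATHS list that beforeAll already iterates over; the message text is illustrative.

// Sketch: derive the error message from the missing path instead of hardcoding each file name.
for (final Path path : ALL_PATHS) {
  if (!Files.exists(path)) {
    throw new IllegalStateException(String.format(
        "Json config not found. Must provide path to a BigQuery credentials file, "
            + "please add the file at <...>/destination-bigquery/%s", path));
  }
}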
/test connector=connectors/destination-bigquery
/test connector=connectors/destination-bigquery-denormalized
Build Passed
Test summary info:

/test connector=connectors/destination-bigquery
Build Passed
Test summary info:
/approve-and-merge reason="to unblock other PRs that modify BigQuery tests"
public static void beforeAll() throws IOException {
  for(Path path : ALL_PATHS) {
    if (!Files.exists(path)) {
      throw new IllegalStateException(
Small thing - I believe this exception will be thrown on the first file to not exist, then exit the loop. It might be more convenient if all missing files are shown in the error message.
final List<Path> missingPaths = ALL_PATHS.stream().filter(path -> !Files.exists(path)).collect(Collectors.toList());
if (!missingPaths.isEmpty()) {
  throw new IllegalStateException(
      // ...
  );
}
tornDown = false;
addShutdownHook();
configs = new HashMap<String, JsonNode>() {{
Just curious if there was motivation to remove the ImmutableMap.builder() flow here?
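For comparison, the builder-style flow being referred to would look roughly like this sketch; the map keys and config variables are illustrative, not the actual entries in this PR.

// Sketch: Guava's ImmutableMap.builder() instead of the double-brace HashMap initializer.
configs = ImmutableMap.<String, JsonNode>builder()
    .put("standard-inserts", standardInsertsConfig) // illustrative keys/values
    .put("gcs-staging", gcsStagingConfig)
    .build();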
try {
  dataset = BigQueryDestinationTestUtils.initDataSet(config, bigquery, datasetId);
} catch(Exception ex) {
  //ignore
I think we should at least log a warning here, otherwise if an exception occurs at this point it would be difficult to debug.
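e.g. a sketch of the logged variant, assuming the class's existing LOGGER field:

// Sketch: keep the test running but record why dataset creation failed.
try {
  dataset = BigQueryDestinationTestUtils.initDataSet(config, bigquery, datasetId);
} catch (final Exception ex) {
  LOGGER.warn("Unable to create dataset {}: {}", datasetId, ex.getMessage());
}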
@@ -374,11 +322,17 @@ void testWriteFailure(final DatasetIdResetter resetDatasetId) throws Exception {
}

private Set<String> fetchNamesOfTablesInDb() throws InterruptedException {
  if(dataset == null || bigquery == null) {
I think the two conditionals on lines 325-327 and 333-335 can be combined. The QueryJobConfiguration AFAIK is just building a query rather than executing it. Still a little rusty on my Java but I think an Optional would be good here:

Optional<Dataset> potentialDataset = Optional.ofNullable(dataset);
if (potentialDataset.map(d -> !d.exists()).orElse(true) || bigquery == null) {
  return Collections.emptySet();
}
Left a few comments but mostly LGTM
This refactoring reduces code duplication and moves configuration logic from code into config files.
The purpose of this change is to make it easier to add more test cases for configuration variations such as
data location, accounts with various permission combinations, and account impersonation. This change was inspired by the impersonation PR. In order to run all the tests with impersonated credentials, we would have had to write a lot more code; after this refactoring, we can just add two more credential files (one for a positive test, one for a negative test) and roughly 10 lines of code.
Part of the change is removing BigQueryGcsDestinationTest and replacing it with a GCS-enabled config file (I added the new config files to GSM before submitting this PR).
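As a rough illustration of what those extra lines might look like for a new credential variation, here is a sketch; the file name, constant, and map key are hypothetical.

// Sketch: registering a hypothetical impersonation credentials file as one more test config.
private static final Path IMPERSONATION_CONFIG_PATH =
    Path.of("secrets/credentials-with-impersonation.json"); // hypothetical secrets file

// ...and in beforeAll(), next to the existing entries:
configs.put("impersonation",
    BigQueryDestinationTestUtils.createConfig(IMPERSONATION_CONFIG_PATH, datasetId));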