-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[controller] Harden update-store workflow #1091
base: main
Are you sure you want to change the base?
Conversation
9edf55a
to
a019bc8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! This looks like a good change! I didn't have time to look at it fully yet, but I left one minor comment so far.
...es/venice-controller/src/main/java/com/linkedin/venice/controller/util/UpdateStoreUtils.java
Outdated
Show resolved
Hide resolved
e42ddde
to
a81e40e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a pass. Would recommend having another pair of eyes especially on the following files due to my limited context in the code base
- VeniceParentHelixAdmin
- VeniceHelixAdmin
- UpdateStoreUtils
internal/venice-common/src/main/java/com/linkedin/venice/ConfigKeys.java
Show resolved
Hide resolved
internal/venice-common/src/main/java/com/linkedin/venice/ConfigKeys.java
Show resolved
Hide resolved
internal/venice-common/src/main/java/com/linkedin/venice/helix/StoragePersonaRepository.java
Show resolved
Hide resolved
internal/venice-common/src/main/java/com/linkedin/venice/meta/HybridStoreConfig.java
Outdated
Show resolved
Hide resolved
internal/venice-common/src/main/java/com/linkedin/venice/meta/PartitionerConfigImpl.java
Show resolved
Hide resolved
services/venice-controller/src/main/java/com/linkedin/venice/controller/Admin.java
Show resolved
Hide resolved
services/venice-controller/src/main/java/com/linkedin/venice/controller/Admin.java
Show resolved
Hide resolved
...oller/src/main/java/com/linkedin/venice/controller/init/SystemStoreInitializationHelper.java
Show resolved
Hide resolved
...ntroller/src/main/java/com/linkedin/venice/controller/kafka/consumer/AdminExecutionTask.java
Show resolved
Hide resolved
...es/venice-controller/src/main/java/com/linkedin/venice/controller/util/UpdateStoreUtils.java
Show resolved
Hide resolved
0d3e2c1
to
916b14d
Compare
a296585
to
33d1620
Compare
3fa383a
to
e4ebdbe
Compare
e4ebdbe
to
408d463
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I mostly made sure that the checks and description in the summary matches w/ the code changes.
Final comments on the strategy about releasing the changes
- Do we have branch based releases for complex changes/features?
- It feels merging this w/ master and finding issues during certification can be potentially difficult to deal/mitigate as surface area of the changes is large and reverting such large commit accurately (assuming other changes pile in) might be tricky.
How do plan to certify this w/ our current certification flow?
if (params.getRegionsFilter().isPresent()) { | ||
Set<String> regionsFilter = parseRegionsFilterList(params.getRegionsFilter().get()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get rid of this conditional to make this more readable? Perhaps something like
Set<String>regionsFilter = params.getRegionsFilter().map(this::parseRegionsFilterList).orElse(Collections.emptySet());
This way you have a top level early exit in case of regions filter not containing region name.
clusterName, | ||
regionsFilter, | ||
multiClusterConfigs.getRegionName()); | ||
return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We expect caller to handle null is it?
addToUpdatedConfigs( | ||
updatedConfigs, | ||
BACKUP_VERSION_RETENTION_MS, | ||
backupVersionRetentionMs, | ||
updatedStore::setBackupVersionRetentionMs); | ||
addToUpdatedConfigs( | ||
updatedConfigs, | ||
NATIVE_REPLICATION_SOURCE_FABRIC, | ||
nativeReplicationSourceFabric, | ||
updatedStore::setNativeReplicationSourceFabric); | ||
addToUpdatedConfigs( | ||
updatedConfigs, | ||
LATEST_SUPERSET_SCHEMA_ID, | ||
latestSupersetSchemaId, | ||
updatedStore::setLatestSuperSetValueSchemaId); | ||
addToUpdatedConfigs( | ||
updatedConfigs, | ||
MIN_COMPACTION_LAG_SECONDS, | ||
minCompactionLagSeconds, | ||
updatedStore::setMinCompactionLagSeconds); | ||
addToUpdatedConfigs( | ||
updatedConfigs, | ||
MAX_COMPACTION_LAG_SECONDS, | ||
maxCompactionLagSeconds, | ||
updatedStore::setMaxCompactionLagSeconds); | ||
addToUpdatedConfigs(updatedConfigs, MAX_RECORD_SIZE_BYTES, maxRecordSizeBytes, updatedStore::setMaxRecordSizeBytes); | ||
addToUpdatedConfigs( | ||
updatedConfigs, | ||
MAX_NEARLINE_RECORD_SIZE_BYTES, | ||
maxNearlineRecordSizeBytes, | ||
updatedStore::setMaxNearlineRecordSizeBytes); | ||
addToUpdatedConfigs( | ||
updatedConfigs, | ||
UNUSED_SCHEMA_DELETION_ENABLED, | ||
unusedSchemaDeletionEnabled, | ||
updatedStore::setUnusedSchemaDeletionEnabled); | ||
addToUpdatedConfigs( | ||
updatedConfigs, | ||
BLOB_TRANSFER_ENABLED, | ||
blobTransferEnabled, | ||
updatedStore::setBlobTransferEnabled); | ||
addToUpdatedConfigs( | ||
updatedConfigs, | ||
STORAGE_NODE_READ_QUOTA_ENABLED, | ||
storageNodeReadQuotaEnabled, | ||
updatedStore::setStorageNodeReadQuotaEnabled); | ||
addToUpdatedConfigs(updatedConfigs, REGULAR_VERSION_ETL_ENABLED, regularVersionETLEnabled, regularVersionETL -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to group this configurations into some sort of logical category? Reason being, we can break this method into smaller methods where each of those smaller methods take care of updating their logical group of configuration.
It definitely makes it more readable and enables developers to make changes within a group of configuration to reason about potential implication and validations.
throw new VeniceHttpException(HttpStatus.SC_BAD_REQUEST, errorMessage, ErrorType.INVALID_CONFIG); | ||
} | ||
|
||
if (updatedStore.isHybrid() && newPartitionCount == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it okay for non-hybrid stores to have newPartitionCount = 0
is it?
if (newPartitionCount < minPartitionNum && newPartitionCount != 0) { | ||
throw new VeniceHttpException( | ||
HttpStatus.SC_BAD_REQUEST, | ||
"Partition count must be at least " + minPartitionNum + " for store: " + storeName | ||
+ ". If a specific partition count is not required, set it to 0.", | ||
ErrorType.INVALID_CONFIG); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we be consistent w/ rest and log the error before throwing the exception?
if (oldStore.getPartitionerConfig() == null) { | ||
originalPartitionerConfig = new PartitionerConfigImpl(); | ||
} else { | ||
originalPartitionerConfig = oldStore.getPartitionerConfig(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor:
Some of the places below starts with initialization and then checks for null and reinitializes with the default in the event of null
. e.g., ln: 1186, ln: 1197
Although I personally prefer
Optional.ofNullable(...).orElse(defaultInstance)
Feel free to make them consistent w/ the rest at the very least.
public static boolean isIncrementalPushSupported( | ||
boolean multiRegion, | ||
boolean activeActiveReplicationEnabled, | ||
HybridStoreConfig hybridStoreConfig) { | ||
// Only hybrid stores can support incremental push | ||
if (!AdminUtils.isHybrid(hybridStoreConfig)) { | ||
return false; | ||
} | ||
|
||
// If the system is running in multi-region mode, we need to validate the data replication policies | ||
if (!multiRegion) { | ||
return true; | ||
} | ||
|
||
// A/A can always support incremental push | ||
if (activeActiveReplicationEnabled) { | ||
return true; | ||
} | ||
|
||
DataReplicationPolicy dataReplicationPolicy = hybridStoreConfig.getDataReplicationPolicy(); | ||
return dataReplicationPolicy == DataReplicationPolicy.AGGREGATE | ||
|| dataReplicationPolicy == DataReplicationPolicy.NONE; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we leverage this in the UpdateStoreUtils
for validation of scenarios for incremental push allowed? Seems redudant checks
|
||
sendAdminMessageAndWaitForConsumed(clusterName, storeName, message); | ||
} finally { | ||
releaseAdminMessageLock(clusterName, storeName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we guarantee that we don't run into scenario where the admin is stuck and this never executes? Is there some sort of timed lock releases?
Harden update-store workflow
This PR unifies the
UpdateStore
logic between child controller and parent controllers. We've faced many issues due to the structure of the code (especiallyVeniceParentHelixAdmin
) where a set of store-update operations puts it into a non-healthy state. These are all the changes in this PR:UpdateStoreUtils
that is called by both parent and child controllers to apply the store update and perform the necessary validations.NON_AGGREGATE
data replication policyACTIVE_ACTIVE
data replication policy is only supported when Active-Active replication is enabledDataReplicationPolicy
isAGGREGATE
ordinal
fromBackupStrategy
enum was being used to write to the admin channel. This is problematic when the enums evolve. Added agetValue
function and used that instead.controller.external.superset.schema.generation.enabled
has been added to replacecontroller.parent.external.superset.schema.generation.enabled
because external superset schema generation must be allowed in single-region mode also.controller.parent.external.superset.schema.generation.enabled
has been marked deprecated, but it has not been completely removed yet for backward compatibility reasons.Some side-effects of this change are:
SupersetSchemaGeneratorWithCustomProp
had a bug where if the first schema has a custom prop, the future superset schema generation would fail as Avro doesn't allow overriding custom props. This got caught as the update store logic now also tries creating superset schema if a store enabled RC or WC, or if it previously had a superset schema.There are a few other changes that we should do, but are not done in this PR:
All major operations should only be allowed on the parent controller - Create a store, delete a store, add schemas. We should exclude some system stores from this check like we have for the check allowing batch push to admin in child
Recommendation for Reviewers
I recommend going through the changes in at least two passes. In the first pass, look through all the stuff that has been purely deleted. (Skip
ParentControllerConfigUpdateUtils
as that has been renamed toPrimaryControllerConfigUpdateUtils
and it's contents partially moved toUpdateStoreUtils
. So, GitHub doesn't detect is as a renaming). While doing this pass, you can also glance through various small changes that do not need much pondering over.In the second pass, follow this review order:
VeniceControllerClusterConfig
,VeniceControllerConfig
,VeniceControllerMultiClusterConfig
VeniceControllerService
Admin
UpdateStoreUtils
PrimaryControllerConfigUpdateUtils
AdminUtils
UpdateStoreWrapper
VeniceHelixAdmin
VeniceParentHelixAdmin
SupersetSchemaGeneratorWithCustomProp
How was this PR tested?
Added new tests. Modified existing tests. GH CI
Does this PR introduce any user-facing changes?