Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filter out documents of unknown types during migrations #104690

Closed
joshdover opened this issue Jul 7, 2021 · 9 comments · Fixed by #105213
Closed

Filter out documents of unknown types during migrations #104690

joshdover opened this issue Jul 7, 2021 · 9 comments · Fixed by #105213
Assignees
Labels
Feature:Migrations project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc

Comments

@joshdover
Copy link
Contributor

joshdover commented Jul 7, 2021

After #103341 merged, we've noticed many upgrade failures due to documents still existing for types that have been removed from Kibana's codebase. We've had no process for tracking these removals so there are some SO types that have likely had old stale documents in customers' .kibana index for some time.

Prior to #103341, these documents would have caused issues for v2 migrations, causing the migration to fail halfway through the upgrade and leaving the cluster in a state that could not easily be rolled back. While we now fail earlier to enable a smooth rollback, the number of documents with unknown types seems to be much larger than we have anticipated. (in v1 migrations, these documents would not have failed the migration only if they never had any migrations registered for them).

We are attempting to handle this situation by finding all the types that have been removed and adding them to the filter that is used to ignore these outdated documents in #104507. The problem is that this audit may not cover 100% of cases because auditing older branches is not foolproof and in some cases the old branches no longer work at all.

An alternative option may be to simply only migrate known types and ignore all others. This option was not previously considered in the original RFC section that discussed this scenario. Given that the v2 migration architecture retains the previous index, I think this option is now viable since it will not delete the documents of unknown types.

The main issue with this option is that it does not provide an easy path to migrate these documents later should a plugin become enabled that previously used this SO type. For instance, imagine the following scenarios:

Scenario 1: custom plugin

  • User installs Kibana 7.13
  • User adds a custom plugin to 7.13 that registered an custom-type SO type and stores some data
  • User wants to upgrade to 7.14, but custom plugin doesn't yet support it. They decide to upgrade anyway, without the plugin installed.
  • During 7.14 upgrade, the custom-type documents are filtered out and left in the old 7.13 index.
  • User later installs the plugin again for 7.14, but all of their old data is missing

If the plugin did not create a migration in 7.14, the user could simply copy the documents from the 7.13 index to the 7.14 index and everything should work fine. However if the custom plugin did register migrations, the workaround would involve setting up a separate 7.13 cluster with the custom-type documents in the 7.13 index, running an upgrade to 7.14, then exporting those objects and reimporting them into the original 7.14 cluster.

Scenario 2: disable plugin

  • User installs Kibana 7.13, creates some data in SIEM
  • User sets xpack.security_solution.enabled: false
  • User wants to upgrade to 7.14
  • During 7.14 upgrade, the cases documents are filtered out and left in the old 7.13 index.
  • User later enabled the security_solution plugin again for 7.14, but all of their old data is missing

The workaround would involve setting up a separate 7.13 cluster with the cases (and any other related) documents in the 7.13 index, running an upgrade to 7.14, then exporting those objects and reimporting them into the original 7.14 cluster.

If we're ok with just not providing a good user experience here, then I can't come up with a reason not to do this. I don't think this the custom plugin scenario is case we particularly need to support and a (painful) workaround does exist. Providing a good, bulletproof default experience is probably more important. The second scenario for disabling plugins I'm less sure about. At least with the current behavior we very explicitly make them choose to delete these documents.

It's also worth noting that we're moving away from allowing customers to disable most first-party plugins which will help mitigate this issue, but that won't be available until 8.0. See #89584 for discussion.

cc @pgayvallet @kobelb

@joshdover joshdover added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient Feature:Migrations labels Jul 7, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-core (Team:Core)

@lukeelmers
Copy link
Member

the workaround would involve setting up a separate 7.13 cluster with the custom-type documents in the 7.13 index, running an upgrade to 7.14, then exporting those objects and reimporting them

Would running the 7.14 upgrade be necessary? I thought we ran migrations on SO import? Perhaps the workaround be simplified to:

  1. Set up 7.13 cluster
  2. Export objects of the type you want
  3. Import those into the new 7.14 cluster once you've enabled the relevant plugin(s), and migrations should run

This would be another area where #55404 could potentially help... if we returned some sort of warning on ignored SO types, we could prompt the user to export those for future reference before attempting the initial upgrade. Then step (1) in the workaround above wouldn't be necessary.

@pgayvallet
Copy link
Contributor

pgayvallet commented Jul 7, 2021

These scenarios (and their workarounds) work as long as we don't have references between a disabled and an enabled type during the migration matching the disabled type's convertToMultiNamespaceTypeVersion.

Please refer to #101351 (comment) for the detailed edge case, but in short:

  • we're in version 7.14.0
  • we got two types:
    • foo that has its convertToMultiNamespaceTypeVersion set to 7.15.0
    • bar that has outward references to foo
  • type foo got unregistered by having its owning plugin disabled
  • we migrate to 7.15.0

consequence:

  • The references from bar to foo will not be properly updated during the migration where bar was disabled. However, as the bar documents' coreMigrationVersion will be already set to 7.15.0, when re-enabling the foo type, the references will still not be updated, resulting in broken references

Meaning that the suggested workaround

The workaround would involve setting up a separate 7.13 cluster with the cases (and any other related) documents in the 7.13 index, running an upgrade to 7.14, then exporting those objects and reimporting them into the original 7.14 cluster.

Would not work, as we would need to also alter the references of objects already migrated in our Kibana index.

This is the main problem here.

Now, I would be more than happy to ignore this edge case, and just go with the proposed approach/workarounds, but we need to be aware that this may come back and bite us during the 7.15 and/or 8.0 migration, which are the currently scheduled convertToMultiNamespaceTypeVersion releases.

I think we do need to check the list available here: #100489, check which types are using references to types from other plugins (e.g the references that we could potentially break by disabling a specific plugin before/during the migration), and the risk that their owning plugin can/could be disabled in our customers base.

@joshdover
Copy link
Contributor Author

Would not work, as we would need to also alter the references of objects already migrated in our Kibana index.

This is the main problem here.

Yes, I had forgotten about the references ID rewriting. With all of these moving parts, it seems that the most thorough way to continue supporting this use case would be to implement the suggested solution in #101351 (comment) related to splitting the coreMigrationVersion field into two fields:

This one is way harder to solve. To be honest, I don't really have any good idea to address that, apart from splitting coreMigrationVersion into two things

  • convertMigrationVersion, which would be the version for the convert migrations
  • referencesMigration, which would be a record type->boolean.

Having this per-type boolean would allow to specifically flag which references were converted or not. That way, in the edge previously describe edge case, bar.referencesMigration.foo would still be undefined, meaning that we could theoretically be able to migrate them when the type got re-enabled.

This way we can migrate the references as needed separately from the ID. This is going to add additional complexity to the migration algorithm which I would like to avoid. But if we need to keep supporting this use case, I don't see another good option available.

Regardless of the long-term solution, this seems far out of scope for the 7.14 release. For the 7.14 release we have two options:

  1. Continue with the current implementation and add all types that have ever been registered by first-party plugins to the ignore list so they are skipped and not retains on upgrade.
    • Pros: very low effort, low risk. I'll be done later today with auditing for first-party types all the way back to 6.0.
    • Cons: still has the drawback for any disabled plugins that have data stored in Saved Objects. Though as Luke pointed out above, the workdaround would just involve exporting the objects from the 7.13 index and importing the into 7.14. That seems acceptable to me for an undocumented & 'experimental' feature.
  2. Skip all documents of unknown types from the client-side index step. This would allow these documents of unknown types to be retained in the Kibana index. This would not be a long-term solution but would be acceptable for the 7.14 release since no types are being converted to multi-namespace in this release.
    • Pros: preserves the behavior from 7.13 and prior
    • Cons: will require a higher-risk change to go in after 7.14 FF. It will also require more additional effort in 7.15 to support these unmigrated docs for the multi-namespace migration. Continues to expose us to future data integrity bugs around these orphaned documents. Having to anticipate all of these scenarios and combinations (enabled/disabled/partially migrated) will continue to be a long-term challenge.
  3. Exclude all documents of unknown types from the migration completely. This would ignore these documents and leave them only in the prior index. We would log a warning if unknown documents are found rather than fail the migration.
    • Pros: all the benefits of (1) and some of the benefits of (2). Would allow us to keep the multi-namespace migration as-is.
    • Cons: No nagging feedback would be given to the user and they could unintentionally delete these old indices which may include data from plugins they currently have disabled but want to enable again later. Maybe we could add some UI alert in the future for this?

I strongly prefer option (1) for the long-term benefits, though it could make for a bumpy upgrade for any users who have disabled a plugin. (3) is also attractive but I worry about giving our users a foot-gun. Making a decision here is quite hard without much visibility into how and when our users are using the plugin disabling 'feature'.

@joshdover
Copy link
Contributor Author

I've completed an audit going back to version 5.5 where Saved Objects were added. There is only one additional SO type that needs to be filtered out for option (1) and I'll be opening a PR for this. Here's the results of my audit: https://gist.github.com/joshdover/e3c449d165498da49e8e4a4fb6f2cef5

@joshdover
Copy link
Contributor Author

I summarized these options and tradeoffs in a bit more digestable way. I'm going to dig in to our data tomorrow to see if we can determine an estimated % of clusters that are disabling plugins and see if the most common plugins would be affected by options 1 or 3.

Summary Table

Option Documents for disabled plugins after upgrade Data loss 8.0 upgrade risk Data integrity risk
1: Fail on all documents of disabled plugin types. Ignore types that are no longer in use. Retained in backup index, need to import from backup to get data back
🟡
All deletes are performed explicitly by customer
🟢
Low
🟢
Better
🟢
2: Skip all documents from disabled plugin types from ID rewrite migration for now. Retained in current index, just enable plugin and restart to get data back
🟢
No risk
🟢
Medium
🟡
Worse
🟠
3: Only migrate types for plugins that are enabled at first upgrade. Retained in backup index, need to import from back to get data back
🟡
Documents are left in backup indices which are not deleted. Customer may not realize this.
🟡
Low
🟢
Better
🟢

Detailed table

Option Short-term Impact Long-term Impact 7.14 Risk
1: Fail on all documents of disabled plugin types. Ignore types that are no longer in use. Could frustrate users on the first upgrade to 7.14 who have to make a choice about something they may not understand. Avoids unintentional data loss for users using plugin disabling. Cruft in the Kibana indices starts to get cleaned up, which is important for data integrity. Many larger installations are already plagued with broken references.

Simplifying the testing matrix for that migration would improve our confidence that the ID rewrite will be seamless for the 8.0 release.

Lowest, no changes required for 7.14.
2: Skip all documents from disabled plugin types from ID rewrite migration for now. Same behavior as 7.13 and prior. Documents of disabled plugins will copied but not migrated. Must come up with a long-term solution for 7.15 for the upcoming multi-namespace rollout. Stopgap solution. We continue to support disabling plugins across version upgrades.

Risk of 8.0 ID rewrite upgrade is higher due to complexity. Larger, older installations continue to have problems on upgrades long-term.

Medium, requires a few minor changes to the migration algorithm, but test coverage in this area is very high.
3: Only migrate types for plugins that are enabled at first upgrade. No migrations will fail due to data from disabled plugins.

Data for disabled plugins will not be retained in the index, but be kept in backup index.

Similar to (1), data is cleaned up to only select the data that is currently used by the Kibana config and simplifies testing and implementation of ID rewrite migration. Helps mitigate future data integrity issues. Medium, requires a few minor changes to the migration algorithm, but test coverage in this area is very high.

@pgayvallet
Copy link
Contributor

pgayvallet commented Jul 9, 2021

This way we can migrate the references as needed separately from the ID. This is going to add additional complexity to the migration algorithm which I would like to avoid. But if we need to keep supporting this use case, I don't see another good option available.

Yea, the added complexity is all but negligible, and I agree that I would gladly avoid this option if possible

  1. Continue with the current implementation and add all types that have ever been registered by first-party plugins to the ignore list so they are skipped and not retains on upgrade.
  2. Skip all documents of unknown types from the client-side index step
  3. Exclude all documents of unknown types from the migration completely. This would ignore these documents and leave them only in the prior index

I agree with your summary and your order of preference.

To explain a little more, I do think 2. is plain dangerous in the mid/long term. Even if the only effectively dangerous scenario is the 'upgrade to from 7.14 or prior to 7.15/8.0 or higher with disabled plugins, such possibilities are still real until we forbid to disabled plugins and corruption WILL occur when trying to reenable the plugins if there are outbound ref from already migrated docs to the disabled type, which seems like reason enough to just not want to try it.

With all the previous versions check you performed the last few days, I feel pretty confident that option 1. should be fine, as we supposedly identified all the previous types that are no longer registered, and as we added a test to make sure that any future removed SO type has to be added to our list, so I think we don't need to do 3.. 1. may sound harsh, but that's imho a mandatory step toward better data integrity, which would not only avoid future migrations' issues, but should also improve the runtime stability of Kibana overall.

Also, I'll add that we're very close to the BC deadline, and implementing 3. in such short notice is a little riskier than if we were not under such time pressure (even if it is doable).

@kobelb
Copy link
Contributor

kobelb commented Jul 9, 2021

My preference is to go with option 2 in the short term. I fully acknowledge that this is the kick the can down the road option, and we're going to have to deal with this problem in the future as soon as we allow saved-object IDs to be regenerated.

My thinking is that this will buy us time to figure out what upgrade scenarios we need to support and evaluate our options against those upgrade scenarios. This also allows us to brainstorm other more complex solutions we can implement. IMO, this is more in line with the make-it-minor mindset where we are willing to invest more effort in solutions that provide our users a seamless upgrade experience.

@joshdover joshdover self-assigned this Jul 12, 2021
@joshdover
Copy link
Contributor Author

In some offline discussions, we decided to go with option (2) for the remainder of the 7.x series and add a few things to make the 8.0 upgrade smoother & improve long-term data integrity:

  • In 7.14, start logging a warning for unknown document types detected during migration, but do not fail the upgrade. Retain these documents in the index and migrate them if the document type becomes available or enabled later.
  • Add a deprecation to the Upgrade Assistant for unknown document types. This deprecation should allow users to either delete the offending docs or help them identify which plugins they need to re-enable to make this warning go away.
  • In 8.0, fail fast when unknown document types are detected and when 1st party plugins are disabled. This will ensure that we do not encounter problems with the ID rewrites required for sharing across spaces features. None of these ID rewrite migrations will happen before the 8.0 release, so this will not cause any problems before that release.

I will have a PR up to fix the 7.14/7.x scenario first, followed by PRs for adding Upgrade Assistant deprecations and the fail-fast logic for the master/8.0 branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature:Migrations project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants