[Fleet] Changes in package install format should be applied on Stack upgrades #121099

Closed
5 tasks
Tracked by #108554
joshdover opened this issue Dec 13, 2021 · 5 comments · Fixed by #135485
Labels
bug Fixes for quality problems that affect the customer experience Team:Fleet Team label for Observability Data Collection Fleet team

Comments

@joshdover
Contributor

joshdover commented Dec 13, 2021

Whenever Fleet makes changes to how package assets are installed into Elasticsearch or Kibana, the changes are not applied to packages that were installed on prior versions of Kibana, until the package is upgraded. This can cause a discrepancy in what we can document and instruct users to do, based on what Stack version they were running when they installed a package instead of which Stack version they are running now.

A simple example would be a case where we add a new field to the _meta object applied to all Elasticsearch assets (index templates, ingest pipelines, etc.).

Examples of recent changes that are affected by this issue:

Future planned changes that will be affected:

Ideally, when the Stack is upgraded, all package assets are installed in a consistent format, regardless of whether the package was originally installed before or after the Stack upgrade.

Today, users can work around this limitation by forcing a package to be reinstalled via the API:

curl --request POST \
  --url http://localhost:5601/api/fleet/epm/packages/<name>/<version> \
  --user elastic:changeme \
  --header 'content-type: application/json' \
  --header 'kbn-xsrf: x' \
  --data '{"force": true}'
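
For scripting the workaround, the same request can be built in TypeScript. This is a minimal sketch, assuming the endpoint, headers, and body of the curl command above. `buildReinstallRequest` is a hypothetical helper, not part of Fleet, and the package name and version in the usage comment are placeholders.

```typescript
// Hypothetical helper (not part of Fleet): builds the force-reinstall request
// that the curl command above sends.
interface ReinstallRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

function buildReinstallRequest(
  kibanaUrl: string,
  name: string,
  version: string,
  user: string,
  password: string
): ReinstallRequest {
  // Basic auth header, equivalent to curl's --user flag.
  const auth = btoa(`${user}:${password}`);
  const path = `${encodeURIComponent(name)}/${encodeURIComponent(version)}`;
  return {
    url: `${kibanaUrl}/api/fleet/epm/packages/${path}`,
    init: {
      method: 'POST',
      headers: {
        'content-type': 'application/json',
        'kbn-xsrf': 'x',
        authorization: `Basic ${auth}`,
      },
      // force: true asks Fleet to reinstall even though the version is unchanged.
      body: JSON.stringify({ force: true }),
    },
  };
}

// Usage (needs a running Kibana, so it is left commented out):
// const { url, init } = buildReinstallRequest(
//   'http://localhost:5601', 'my_package', '1.0.0', 'elastic', 'changeme');
// const response = await fetch(url, init);
```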

Design

This section summarizes the design that was chosen based on discussion in the Fleet team. See the full discussion of implementation options below for more detail about the rationale.

  • Implement install format version tracking on packages
    • Add an install_format_schema_version field to the epm-packages objects
    • Add an INSTALL_FORMAT_SCHEMA_VERSION constant to Fleet's implementation code that we manually increment when there's a schema change
    • Replace the existing logic to use the new install format field so that we reinstall any package that was not installed on the latest schema version. The new logic should:
      • Query for any epm-packages objects where install_format_schema_version is null or < INSTALL_FORMAT_SCHEMA_VERSION constant in code
      • Reinstall each package found in above query
  • Implement an automated test for detecting code changes where we forgot to increment the schema version
    • Install several packages with the current install schema, ideally ones that contain all assets and as many options as possible
    • Take an Elasticsearch snapshot (or data tarball) to be used for future tests. This snapshot represents the current state to be expected of the current install format schema.
    • On each PR run a test that:
      1. Restores ES from the snapshot above
      2. Starts Kibana and waits for Fleet's setup and upgrade process to complete
      3. Captures the state of those package installs on disk by fetching all Elasticsearch and Kibana assets that were installed
      4. Force-reinstalls each package (on the same version)
      5. Captures the state of each package again and compares it to the state captured in step 3; any difference fails the test
        • Note that this comparison needs care to be stable: it should ignore fields that were added to objects when they were installed or that represent state (such as the create_time field on transforms)
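
The version-tracking rule in the first bullet above could look roughly like the following. This is a sketch under stated assumptions: the schema version is modeled as an integer, `Installation` is a simplified stand-in for the epm-packages attributes, and the filtering happens in memory rather than through a saved objects query. `needsReinstall` and `selectPackagesToReinstall` are hypothetical names.

```typescript
// Assumption: an integer that is bumped by hand whenever the install format changes.
const INSTALL_FORMAT_SCHEMA_VERSION = 2;

// Simplified stand-in for the attributes of an epm-packages saved object.
interface Installation {
  name: string;
  version: string;
  install_format_schema_version?: number | null;
}

// A package needs a reinstall when it has no recorded schema version
// (installed before tracking existed) or an older one.
function needsReinstall(
  installation: Installation,
  currentSchemaVersion: number = INSTALL_FORMAT_SCHEMA_VERSION
): boolean {
  const installed = installation.install_format_schema_version;
  return installed === undefined || installed === null || installed < currentSchemaVersion;
}

function selectPackagesToReinstall(installations: Installation[]): Installation[] {
  return installations.filter((installation) => needsReinstall(installation));
}

// Illustrative data only; these package names are made up.
const exampleInstallations: Installation[] = [
  { name: 'legacy_pkg', version: '0.9.0' },
  { name: 'old_pkg', version: '1.0.0', install_format_schema_version: 1 },
  { name: 'current_pkg', version: '1.1.0', install_format_schema_version: 2 },
];

console.log(selectPackagesToReinstall(exampleInstallations).map((p) => p.name));
// prints [ 'legacy_pkg', 'old_pkg' ]
```

Each selected package would then be reinstalled at its current version, and the reinstall would record the new schema version so the package is skipped on the next boot.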

Full design discussion

Types of asset changes that we need to handle

High-level options

Given the number of different ways the package install format could change, we either need to:

  1. Always reinstall a package when there may have been a code change that affects a given package (heavy handed, but likely more foolproof)
  2. Be able to detect when a code change affects a given package, solved either by:
    1. Generating some "intermediate representation" and comparing it against the last intermediate representation. This could be as simple as generating all of the expected assets and then producing a stable hash of the assets. We could then store this hash in the epm-packages object and whenever the expected hash doesn't match the last stored hash, we reinstall the package.
    2. Fetching the current state of all assets and comparing them against what they're expected to be. This would require a lot of network activity with ES and would need special work for each asset type. Handling the fields that are added by ES vs. the fields we actually care about would be challenging.
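
To make the "stable hash" idea in option 2.1 concrete, here is a minimal sketch. It assumes the expected assets can be represented as plain JSON values and that SHA-256 over a key-sorted serialization is an acceptable hash; `stableStringify`, `hashAssets`, and `shouldReinstall` are hypothetical names, not Fleet APIs.

```typescript
import { createHash } from 'node:crypto';

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialize with object keys sorted so that key order never changes the hash.
function stableStringify(value: Json): string {
  if (value === null || typeof value !== 'object') {
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    return `[${value.map((item) => stableStringify(item)).join(',')}]`;
  }
  const entries = Object.keys(value)
    .sort()
    .map((key) => `${JSON.stringify(key)}:${stableStringify(value[key])}`);
  return `{${entries.join(',')}}`;
}

// Hash of all assets Fleet expects to install for one package.
// Array order still matters, so callers should pass assets in a fixed order.
function hashAssets(expectedAssets: Json[]): string {
  return createHash('sha256').update(stableStringify(expectedAssets)).digest('hex');
}

// Reinstall when the hash stored on the epm-packages object is missing or stale.
function shouldReinstall(expectedAssets: Json[], lastStoredHash: string | undefined): boolean {
  return hashAssets(expectedAssets) !== lastStoredHash;
}
```

Because the hash is computed from the assets Fleet would generate, not from what is currently in Elasticsearch, this check needs no network round trips at comparison time.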

I believe we should attempt to go with option (1) and first test the performance of this strategy with ~100 packages before attempting any of the strategies under option (2).

Minimizing reinstalls for heavy-handed approach (option 1)

We need to be able to know when a package should be reinstalled to avoid reinstalling all packages on every Kibana boot which could be performance intensive and have other undesirable side-effects like rolling over data streams. Ways to track this:

  1. Add an install_format_schema_version field to epm-packages objects and then trigger a reinstall any time the current schema version > install_format_schema_version
    • Pros: very simple to implement, only triggers reinstalls when we make a change
    • Cons: we may sometimes forget to increment this number, so changes are missed. Depending on the change, this could break new features, but it could be worked around with a package reinstall
    • @joshdover recommends we take this approach, which is similar to what we're doing for Fleet's global assets today (see notes), but would extend it to all package assets.
  2. Add an installed_by_kibana_version field to the epm-packages objects and then trigger a reinstall any time the current Kibana version is > installed_by_kibana_version.
    • Pros: very simple to implement
    • Cons: triggers a reinstall on each Kibana release, even patch releases 👎
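
The difference between the two triggers shows up on a patch upgrade. The sketch below assumes plain "major.minor.patch" version strings and an integer schema version; the function names and version numbers are illustrative only.

```typescript
// Parses "major.minor.patch"; pre-release suffixes such as "-SNAPSHOT" are not handled here.
function parseVersion(version: string): number[] {
  return version.split('.').map((part) => Number(part));
}

// True when `current` is a later release than `installedBy`.
function isNewerVersion(current: string, installedBy: string): boolean {
  const a = parseVersion(current);
  const b = parseVersion(installedBy);
  for (let i = 0; i < 3; i++) {
    const left = a[i] ?? 0;
    const right = b[i] ?? 0;
    if (left !== right) return left > right;
  }
  return false;
}

// Tracking option 2: reinstall whenever Kibana itself is newer.
function reinstallByKibanaVersion(currentKibana: string, installedByKibana: string): boolean {
  return isNewerVersion(currentKibana, installedByKibana);
}

// Tracking option 1: reinstall only when the install format schema version was bumped.
function reinstallBySchemaVersion(currentSchema: number, installedSchema: number | undefined): boolean {
  return installedSchema === undefined || installedSchema < currentSchema;
}

// A patch upgrade that did not change the install format:
console.log(reinstallByKibanaVersion('8.1.1', '8.1.0')); // true: every package is reinstalled
console.log(reinstallBySchemaVersion(1, 1)); // false: nothing is reinstalled
```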

Detecting install format changes automatically

If we later find that the manual strategy of updating an install format schema version in our code is too prone to human error, we should explore detecting changes in the install format automatically, either at dev/test time or in production.

During development

We could attempt to detect these changes automatically before a PR is merged with some automated tests. This could work like:

  • Install many packages with the current install schema
  • Take an Elasticsearch snapshot (or data tarball) to be used for future tests
  • On each PR run a test that:
    • Restores ES from the last snapshot
    • Starts Kibana and waits for Fleet's setup and upgrade process to complete
    • Captures the state of those package installs on disk by fetching all assets (maybe leveraging elastic-package export?)
    • Force-reinstalls each package (on the same version)
    • Captures the state of each package again and compares it to the state captured before reinstalling; any difference fails the test

This would catch scenarios where the install format schema version should have been incremented in code but was not.
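
The before/after comparison in the last step could be sketched as follows. It assumes the captured assets are plain JSON and that the only fields to ignore are listed in `VOLATILE_FIELDS`; `create_time` is the example the issue gives for transforms, and the real list would need to be worked out per asset type. `stripVolatile` and `sameInstallState` are hypothetical helpers.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Fields that Elasticsearch or Kibana set at install time and that represent
// state rather than install format. Assumption: extend per asset type as needed.
const VOLATILE_FIELDS = new Set<string>(['create_time']);

// Returns a copy with volatile fields removed and object keys sorted,
// so that serializing the result is stable.
function stripVolatile(value: Json): Json {
  if (value === null || typeof value !== 'object') {
    return value;
  }
  if (Array.isArray(value)) {
    return value.map((item) => stripVolatile(item));
  }
  const result: { [key: string]: Json } = {};
  for (const key of Object.keys(value).sort()) {
    if (!VOLATILE_FIELDS.has(key)) {
      result[key] = stripVolatile(value[key]);
    }
  }
  return result;
}

// True when two captured states differ only in volatile fields or key order.
function sameInstallState(before: Json, after: Json): boolean {
  return JSON.stringify(stripVolatile(before)) === JSON.stringify(stripVolatile(after));
}
```

A test would call `sameInstallState` once per captured asset and fail on the first `false`, which signals that the install format changed without a schema version bump.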

We likely would not be able to catch all changes with this strategy since there could be some types of changes that only affect some packages and it'd be impractical to try to test this against all packages. However, it's likely that some margin of error here would be acceptable and the vast majority of changes that require a reinstall should be caught with this strategy.

In production (option 2)

Detecting when the package install format has changed in production would be more complicated, require much more effort, and introduce more uncertainty during the upgrade process. However, it could potentially catch cases that the more manual approach outlined above (option 1) misses, where a package needed a reinstall and did not get one.

I don't recommend pursuing this option unless we find this to be a persistent problem. As Fleet stabilizes, we don't intend to make ongoing changes to the install format and I suspect the test-based strategy above will find cases that we miss.

@joshdover's recommendation

In summary, I recommend we do the following:

  • Add an install_format_schema_version field to the epm-packages objects
  • Add an INSTALL_FORMAT_SCHEMA_VERSION constant to Fleet's implementation code that we manually increment when there's a schema change
  • Replace the existing logic to use the new install format field so that we reinstall any package that was not installed on the latest schema version
  • Implement the automated test outlined above for detecting cases where we forgot to increment the schema version

Notes

  • As part of this change, we should remove our setup logic that reinstalls packages to ensure the Fleet global pipeline is properly attached to package index templates. See [Fleet] Fleet reinstalls non-managed packages on Kibana boot #120363 for more info.
  • We have some existing logic for deciding when to reinstall packages when the Fleet final pipeline is upgraded:
    if (globalAssetsRes.some((asset) => asset.isCreated)) {
      // Update existing index template
      const packages = await getInstallations(soClient);
      await Promise.all(
        packages.saved_objects.map(async ({ attributes: installation }) => {
          if (installation.install_source !== 'registry') {
            logger.error(
              `Package needs to be manually reinstalled ${installation.name} after installing Fleet global assets`
            );
            return;
          }
          await installPackage({
            installSource: installation.install_source,
            savedObjectsClient: soClient,
            pkgkey: pkgToPkgKey({ name: installation.name, version: installation.version }),
            esClient,
            spaceId: DEFAULT_SPACE_ID,
            // Force install the package will update the index template and the datastream write indices
            force: true,
          }).catch((err) => {
            logger.error(
              `Package needs to be manually reinstalled ${installation.name} after installing Fleet global assets: ${err.message}`
            );
          });
        })
      );
    }
  • The logic for installing the Fleet final pipeline is smart enough to know when it needs to be updated; however, the logic for the default component template is not.
@joshdover joshdover added bug Fixes for quality problems that affect the customer experience Team:Fleet Team label for Observability Data Collection Fleet team labels Dec 13, 2021
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@joshdover
Contributor Author

One concern with implementing this change is the performance issues we've encountered when writing templates and pipelines in bulk to Elasticsearch. See #110500 for my initial investigations into this. We could either work around this issue with some UI state that blocks users from upgrading or removing packages that are in this migrating state, or work with Elasticsearch on elastic/elasticsearch#77505 so that we could include these migrations in our normal Fleet setup flow, which will soon block Kibana startup.

@joshdover
Contributor Author

@elastic/fleet I've completed a first draft of the design options above along with my recommendation. Would love to get feedback on this in the next week or so.

@joshdover
Contributor Author

I've updated the issue with the summarized implementation plan based on the live discussion we had last week. This is now ready for implementation.

@ruflin
Member

ruflin commented Jun 7, 2022

The term "reinstall" is used several times. The way I understand reinstall is that all assets (including policy integrations instances) are wiped and the package is installed again. I doubt this is the case as otherwise lots of policies would fall apart. I assume all assets except policies are added again (are assets removed before added again?). Maybe we should refer to it as refresh or something similar?

My main concern around reinstall is how often we roll over data streams. Ideally we should never have to roll over data streams for these changes but just update mappings and settings. I'm aware this is not always possible. But with Fleet becoming more mature, I also expect the number of these changes to go down.

A more general note: we should have docs on which changes we consider just "internals" of Fleet and therefore not breaking, and which changes are breaking. For example, a change to the content of @custom templates is likely a breaking change, as this is owned by the user. Changes to the base templates that add more sub-templates should likely not be a breaking change, as the resulting mapping before and after is identical.
