[Automatic Import] Reproducible sampling of log entries #191598
Conversation
...create_integration/create_integration_assistant/steps/data_stream_step/sample_logs_input.tsx
@elasticmachine merge upstream
I don't have much extra to add, just a quick question, and I do agree with the other comment here: #191598 (comment)
Else it's LGTM for me :)
.../components/create_integration/create_integration_assistant/steps/data_stream_step/utils.tsx
💛 Build succeeded, but was flaky
Failed CI Steps
Test Failures
Metrics [docs]
Module Count
Async chunks
History
To update your PR or re-run it, just comment with: `@elasticmachine merge upstream`

cc @ilyannn
LGTM
## Release note

Automatic Import now performs reproducible sampling from the list of log entries instead of just truncating them.

## Summary

When the user uploads a log sample that is too large for us to handle, we would previously simply truncate it at `MaxLogsSampleRows` entries. With this change, we perform a reproducible random sampling instead.

User notification remains the same ("truncated") for now (it is also translated into different languages).

This sampling process:

1. Keeps the first entry as-is for header detection.
2. Selects the remaining entries at random from the list.
3. Shuffles the entries other than the first one (even if there are fewer entries than `MaxLogsSampleRows`).
4. Is reproducible, since the random seed is fixed.

**Sampling** allows us to extract more information from the user-provided data compared to truncation, while **reproducibility** is important for being able to provide customer support.

This brings us another step towards the implementation of elastic/security-team#9844

### Risk Matrix

| Risk | Probability | Severity | Mitigation/Notes |
|---------------------------|-------------|----------|-------------------------|
| Behaviour of `seedrandom` package changes in the future, breaking the tests | Low | Low | This package is already used in Kibana |
| Users misunderstand how the sampling works and upload non-anonymized data, expecting that only the first rows are sent to the LLM | Low | Low | We should change the text in a future PR |

---------

Co-authored-by: Elastic Machine <[email protected]>

(cherry picked from commit 444fc48)
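The four-step sampling process above can be sketched as follows. This is an illustrative stand-in, not the PR's actual code: the PR uses the `seedrandom` package, which is replaced here by a tiny `mulberry32` PRNG so the snippet is dependency-free, and the function name `sampleLogs` and the default seed are hypothetical.

```typescript
// Minimal seeded PRNG (mulberry32) standing in for `seedrandom`,
// so that the sampling is reproducible across runs.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // in [0, 1)
  };
}

function sampleLogs(rows: string[], maxRows: number, seed = 42): string[] {
  if (rows.length === 0) return [];
  const rand = mulberry32(seed);
  // 1. Keep the first entry as-is for header detection.
  const [header, ...rest] = rows;
  // 2./3. Shuffle the remaining entries (Fisher–Yates) with the fixed seed,
  // even when there are fewer entries than maxRows; taking a prefix of the
  // shuffle then amounts to a random selection.
  for (let i = rest.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [rest[i], rest[j]] = [rest[j], rest[i]];
  }
  // 4. Reproducible: the same input and seed always yield the same sample.
  return [header, ...rest.slice(0, maxRows - 1)];
}
```

Because the seed is fixed, re-running the sampling on the same upload returns the same subset, which is what makes support investigations tractable.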
💚 All backports created successfully
Note: Successful backport PRs will be merged automatically after passing CI. Questions? Please refer to the [Backport tool documentation](https://github.com/sqren/backport)
(#192507) # Backport

This will backport the following commits from `main` to `8.15`:

- [[Automatic Import] Reproducible sampling of log entries (#191598)](#191598)

### Questions ?

Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

Co-authored-by: Ilya Nikokoshev <[email protected]>
## Release Notes

Automatic Import now analyses a larger number of samples to generate an integration.

## Summary

Closes elastic/security-team#9844

**Added: Backend Sampling**

We pass 100 rows (these numeric values are adjustable) to the backend [^1]

[^1]: As before, deterministically selected on the frontend, see #191598

The Categorization chain now processes the samples in batches, performing a number of review cycles after the initial categorization (but not more than 5, tuned so that we stay under the 2-minute limit for a single API call).

To decide when to stop processing, we keep a list of _stable_ samples as follows:

1. The list is initially empty.
2. For each review we select a random subset of 40 samples, preferring to pick the not-yet-stable samples.
3. After each review (when the LLM potentially gives us new processors or changes the old ones) we compare the new pipeline results with the old pipeline results.
4. Reviewed samples whose categorization did not change are added to the stable list.
5. Any samples whose categorization changed are removed from the stable list.
6. If all samples are stable, we finish processing.

**Removed: User Notification**

Using 100 samples provides a balance between the expected complexity and the time budget we work with. We might want to change it in the future, possibly dynamically, so the specific number is of no importance to the user. Thus we remove the truncation notification.

**Unchanged:**

- No batching is done in the related chain: it seems to work as-is.

**Refactored:**

- We centralize the sizing constants in the `x-pack/plugins/integration_assistant/common/constants.ts` file.
- We remove the unused state key `formattedSamples` and combine `modelJSONInput` back into `modelInput`.

> [!NOTE]
> I had difficulty generating new graph diagrams, so they remain unchanged.
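The stable-samples stopping rule above can be sketched as a small loop. This is a hypothetical illustration, not the chain's actual code: the `review` callback stands in for the LLM review cycle, and batch selection is simplified to a deterministic "unstable first" ordering rather than the random subset the PR describes.

```typescript
// Runs review cycles until every sample's categorization stops changing,
// or until maxReviews cycles have been spent (the time-budget cap).
function reviewUntilStable(
  samples: string[],
  review: (batch: string[]) => Map<string, string>, // sample -> category
  maxReviews = 5,
  batchSize = 40
): Map<string, string> {
  const categories = new Map<string, string>();
  const stable = new Set<string>(); // 1. the stable list starts empty
  for (let cycle = 0; cycle < maxReviews; cycle++) {
    // 2. pick a batch, preferring samples that are not yet stable
    const unstable = samples.filter((s) => !stable.has(s));
    const rest = samples.filter((s) => stable.has(s));
    const batch = [...unstable, ...rest].slice(0, batchSize);
    // 3. re-run the (possibly updated) pipeline on the batch
    const result = review(batch);
    for (const [sample, category] of result) {
      // 4./5. unchanged categorization -> stable; changed -> not stable
      if (categories.get(sample) === category) stable.add(sample);
      else stable.delete(sample);
      categories.set(sample, category);
    }
    // 6. stop once all samples are stable
    if (samples.every((s) => stable.has(s))) break;
  }
  return categories;
}
```

With a review step whose output has settled, every sample needs to be seen unchanged once before it counts as stable, so convergence takes at least two cycles per sample; the `maxReviews` cap keeps the whole process under the API-call time limit even when results never settle.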
(cherry picked from commit fc3ce54)
(#196386) # Backport

This will backport the following commits from `main` to `8.x`:

- [[Auto Import] Use larger number of samples on the backend (#196233)](#196233)

### Questions ?

Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

Co-authored-by: Ilya Nikokoshev <[email protected]>
For example, when uploading this 10 MB sample file of repeating Falcon events, the combined JSON should always be