
[8.15] [Automatic Import] Reproducible sampling of log entries (#191598) #192507

Merged 1 commit into elastic:8.15 on Sep 10, 2024

Conversation

kibanamachine (Contributor)

Backport

This will backport the following commits from main to 8.15:

Questions?

Please refer to the Backport tool documentation

## Release note

Automatic Import now performs reproducible sampling from the list of log
entries instead of just truncating them.

## Summary

Previously, when the user uploaded a log sample that was too large for us to handle, we simply truncated it at `MaxLogsSampleRows` entries. With this change, we perform reproducible random sampling instead.

The user notification remains the same ("truncated") for now, in part because it is already translated into different languages.

This sampling process:

1. Keeps the first entry as-is for header detection.
2. Selects the remaining entries at random from the list.
3. Shuffles the entries other than the first one (even if there are fewer entries than `MaxLogsSampleRows`).
4. Is reproducible, since the random seed is fixed.

**Sampling** allows us to extract more information from the user-provided data compared to truncation, while **reproducibility** is important for providing customer support.
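
For illustration, here is a minimal TypeScript sketch of this kind of seeded sampling. `MaxLogsSampleRows` is mentioned above, but the seed value and the function name are hypothetical, not the actual Kibana implementation:

```ts
import seedrandom from 'seedrandom';

// Hypothetical values for illustration; the real constant and seed live
// in the Kibana source.
const MaxLogsSampleRows = 10;
const SAMPLE_SEED = 'automatic-import';

function sampleLogEntries(entries: string[]): string[] {
  if (entries.length === 0) {
    return entries;
  }
  // A fixed seed makes the sampling reproducible across runs.
  const rng = seedrandom(SAMPLE_SEED);

  // Step 1: keep the first entry as-is for header detection.
  const [header, ...rest] = entries;

  // Steps 2-3: a seeded Fisher-Yates shuffle of the remaining entries;
  // this runs even when no truncation is needed.
  for (let i = rest.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    [rest[i], rest[j]] = [rest[j], rest[i]];
  }

  // Taking the first N - 1 shuffled entries amounts to a uniform random
  // sample (the header occupies the remaining slot).
  return [header, ...rest.slice(0, MaxLogsSampleRows - 1)];
}
```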

This brings us another step towards the implementation of
elastic/security-team#9844

### Risk Matrix

| Risk | Probability | Severity | Mitigation/Notes |
|------|-------------|----------|------------------|
| Behaviour of the `seedrandom` package changes in the future, breaking the tests | Low | Low | This package is already used elsewhere in Kibana |
| Users misunderstand how the sampling works and upload non-anonymized data, expecting that only the first rows are sent to the LLM | Low | Low | We should change the text in a future PR |
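
The first risk row refers to the determinism the tests depend on: with a fixed seed, `seedrandom` produces the same sequence every time. A minimal check (the seed string here is arbitrary):

```ts
import seedrandom from 'seedrandom';

// Two generators created with the same seed yield identical sequences.
const a = seedrandom('fixed-seed');
const b = seedrandom('fixed-seed');
console.log(a() === b()); // true
console.log(a() === b()); // true
```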

---------

Co-authored-by: Elastic Machine <[email protected]>
(cherry picked from commit 444fc48)
@kibana-ci (Collaborator)

💚 Build Succeeded

Metrics [docs]

Module Count

Fewer modules lead to a faster build time.

| id | before | after | diff |
|----|--------|-------|------|
| integrationAssistant | 547 | 556 | +9 |

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app.

| id | before | after | diff |
|----|--------|-------|------|
| integrationAssistant | 939.5KB | 947.3KB | +7.8KB |

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @ilyannn

@kibanamachine kibanamachine merged commit 3671c5d into elastic:8.15 Sep 10, 2024
20 of 22 checks passed