[Auto Import] Fix cases where LLM generates incorrect array field access #196207

Merged: 7 commits, Oct 15, 2024

@ilyannn ilyannn commented Oct 14, 2024

Release Note

Fixes cases where the LLM was likely to generate invalid processors containing array access in Automatic Import.

Context

Previously, the LLM would occasionally try to add related fields or apply categorization conditions using a field whose path goes through an array. Here's an example output of the review step in the related chain:

output:
  - field: related.ip
    value_field: source.ip
  - field: related.user
    value_field: user.id
  - field: related.user
    value_field: user.email
  - field: related.hosts
    value_field: host.hostname
  - field: related.user
    value_field: ai_falcon_202410141910.audit.event.AuditKeyValues.[?Key=='assigned_to_uid'].ValueString | [0] 
(check the original event)
{
    "metadata": {
        "customerIDString": "8f69fe9e-b995-4204-95ad-44f9bcf75b6b",
        "offset": 10,
        "eventType": "UserActivityAuditEvent",
        "eventCreationTime": 1581603262000,
        "version": "1.0"
    },
    "event": {
        "UserId": "[email protected]",
        "UserIp": "192.168.6.8",
        "OperationName": "detection_update",
        "ServiceName": "detections",
        "AuditKeyValues": [
            {
                "Key": "detection_id",
                "ValueString": "ldt:5a6fd0b7347440cd74cb84855a8aee18:17180539745"
            },
            {
                "Key": "new_state",
                "ValueString": "in_progress"
            },
            {
                "Key": "assigned_to",
                "ValueString": "First Last"
            },
            {
                "Key": "assigned_to_uid",
                "ValueString": "[email protected]"
            }
        ],
        "UTCTimestamp": 1581603262
    }
}
(check the event as seen by the LLM)
 {
    "@timestamp": "2020-02-13T14:14:22.000Z",
    "ecs": {
      "version": "8.11.0"
    },
    "related": {
      "user": [
        "[email protected]"
      ],
      "ip": [
        "192.168.6.8"
      ]
    },
    "organization": {
      "id": "8f69fe9e-b995-4204-95ad-44f9bcf75b6b"
    },
    "ai_falcon_202410141910": {
      "audit": {
        "event": {
          "UTCTimestamp": 1581603262,
          "ServiceName": "detections",
          "AuditKeyValues": [
            {
              "ValueString": "ldt:5a6fd0b7347440cd74cb84855a8aee18:17180539745",
              "Key": "detection_id"
            },
            {
              "ValueString": "in_progress",
              "Key": "new_state"
            },
            {
              "ValueString": "First Last",
              "Key": "assigned_to"
            },
            {
              "ValueString": "[email protected]",
              "Key": "assigned_to_uid"
            }
          ]
        },
        "metadata": {
          "eventType": "UserActivityAuditEvent",
          "offset": 10,
          "version": "1.0",
          "eventCreationTime": 1581603262000
        }
      }
    },
    "source": {
      "ip": "192.168.6.8"
    },
    "event": {
      "action": "detection_update",
      "category": [
        "intrusion_detection"
      ],
      "type": [
        "info"
      ]
    },
    "user": {
      "id": "[email protected]"
    },
    "tags": [
      "_geoip_database_unavailable_GeoLite2-City.mmdb",
      "_geoip_database_unavailable_GeoLite2-ASN.mmdb",
      "_geoip_database_unavailable_GeoLite2-City.mmdb",
      "_geoip_database_unavailable_GeoLite2-ASN.mmdb"
    ]
  },

The problem is that such an access is invalid and leads to an immediate error (key part highlighted):

[Screenshot SCR-20241014-snfl: the resulting error, key part highlighted]
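To make "a path that goes through an array" concrete, here is a minimal TypeScript sketch. It is not code from this PR: the helper name `pathGoesThroughArray` and the `Json` type are assumptions made for illustration, and it only models plain dotted paths against a sample document, not the way Elasticsearch itself resolves fields.

```typescript
// Hypothetical JSON type used only for this illustration.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Returns true when resolving the dotted path would require looking up
// a key inside an array, e.g. "event.AuditKeyValues.ValueString".
// A path that merely ends at an array, or that does not exist, returns false.
function pathGoesThroughArray(doc: { [key: string]: Json }, path: string): boolean {
  let current: Json = doc;
  for (const segment of path.split('.')) {
    if (Array.isArray(current)) {
      // We still have a segment to resolve, but the current value is an array.
      return true;
    }
    if (current === null || typeof current !== 'object') {
      // Scalar reached before the path ended: the path does not exist.
      return false;
    }
    const next: Json | undefined = current[segment];
    if (next === undefined) {
      return false;
    }
    current = next;
  }
  return false;
}
```

With the event shown above, a check like this would flag `ai_falcon_202410141910.audit.event.AuditKeyValues.ValueString` but not `source.ip`.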

Explicit instructions to avoid brackets or array access did not seem to be enough either. Owing to the aggressiveness of our review instructions, the LLM would simply try a different syntax:

output:
  - field: related.ip
    value_field: source.ip
  - field: related.user
    value_field: user.id
  - field: related.user
    value_field: user.email
  - field: related.hosts
    value_field: host.hostname
  - field: related.user
    value_field: ai_falcon_202410141910.audit.event.AuditKeyValues.ValueString 

The suggested solution is to remove all arrays from the information shown to the LLM in the related chain. This guarantees that no illegal access will ever be attempted.

Summary

  • Introduces a utility function to remove all arrays from a JSON object.
  • Applies this function for all LLM calls in the related chain.
  • Modifies the prompts of the related and categorization chains so that they skip arrays as well.
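As a rough illustration of the first bullet, the sketch below removes every array-valued key from a JSON object before it is shown to the LLM. The name `removeArrays` and the `Json` type are hypothetical; the utility actually added in this PR may differ in name and in details.

```typescript
// Hypothetical JSON types used only for this illustration.
type Json = null | boolean | number | string | Json[] | JsonObject;
type JsonObject = { [key: string]: Json };

// Returns a deep copy of the object with every array-valued key dropped.
// Nested objects are kept, even if they end up empty.
function removeArrays(value: JsonObject): JsonObject {
  const result: JsonObject = {};
  for (const [key, child] of Object.entries(value)) {
    if (Array.isArray(child)) {
      // Skip arrays entirely so no path through them can be suggested.
      continue;
    }
    if (child !== null && typeof child === 'object') {
      result[key] = removeArrays(child);
    } else {
      result[key] = child;
    }
  }
  return result;
}
```

Under this sketch an object whose only children are arrays is kept as an empty object, which is consistent with the empty `related` object in the Testing output.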

Testing

(check the event as seen by the LLM now)
{
    "@timestamp": "2020-02-13T14:14:22.000Z",
    "ecs": {
      "version": "8.11.0"
    },
    "related": {},
    "ai_falcon_202410141910": {
      "audit": {
        "event": {
          "UTCTimestamp": 1581603262,
          "OperationName": "detection_update",
          "ServiceName": "detections"
        },
        "metadata": {
          "customerIDString": "8f69fe9e-b995-4204-95ad-44f9bcf75b6b",
          "offset": 10,
          "version": "1.0",
          "eventCreationTime": 1581603262000
        }
      }
    },
    "source": {
      "ip": "192.168.6.8"
    },
    "event": {
      "action": "UserActivityAuditEvent"
    },
    "user": {
      "id": "[email protected]"
    }
  },

For maintainers

  • This will appear in the Release Notes and follow the guidelines

@ilyannn added labels release_note:fix, backport:prev-minor, Team:Security-Scalability, Feature:AutomaticImport, bug on Oct 14, 2024
@ilyannn changed the title from "[Auto Import] Reduce the cases where LLM generates array field access" to "[Auto Import] Fix cases where LLM generates array field access" on Oct 14, 2024
@ilyannn changed the title from "[Auto Import] Fix cases where LLM generates array field access" to "[Auto Import] Fix cases where LLM generates incorrect array field access" on Oct 14, 2024
@ilyannn ilyannn marked this pull request as ready for review October 14, 2024 23:51
@ilyannn ilyannn requested a review from a team as a code owner October 14, 2024 23:51
@elasticmachine

Pinging @elastic/security-scalability (Team:Security-Scalability)

@bhapas bhapas left a comment

How do custom fields that hold an array of IPs, hosts, or users get into the related fields?
We will never be able to put them into the related fields, right?
For example:

{
some_user: [ 'john' , 'smith' ],
some_host: [ 'machine1' , 'machine2']
}

ilyannn commented Oct 15, 2024

How do custom fields that hold an array of IPs, hosts, or users get into the related fields? We will never be able to put them into the related fields, right? For example:

{
some_user: [ 'john' , 'smith' ],
some_host: [ 'machine1' , 'machine2']
}

I assume they get there with some custom Painless code, which we can also add in the future. I actually don't know – maybe it works in this example; we'll need to check.

But if there is a dictionary inside some_user then I don't think it's simple.

@elasticmachine

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

@bhapas bhapas left a comment

Can merge this now. Will need to work on the fields inside arrays for related graph in a later PR.

@ilyannn ilyannn merged commit 8abe259 into elastic:main Oct 15, 2024
24 checks passed
@kibanamachine

Starting backport for target branches: 8.x

https://github.com/elastic/kibana/actions/runs/11348150567

kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Oct 15, 2024
…ess (elastic#196207)

Co-authored-by: Bharat Pasupula <[email protected]>
(cherry picked from commit 8abe259)
@kibanamachine

💚 All backports created successfully

Status Branch Result
8.x

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

kibanamachine added a commit that referenced this pull request Oct 15, 2024
…ld access (#196207) (#196329)

# Backport

This will backport the following commits from `main` to `8.x`:
- [[Auto Import] Fix cases where LLM generates incorrect array field
access (#196207)](#196207)

<!--- Backport version: 9.4.3 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

Co-authored-by: Ilya Nikokoshev <[email protected]>
Labels
backport:prev-minor, bug, Feature:AutomaticImport, release_note:fix, Team:Security-Scalability, v8.16.0, v9.0.0