
[Auto Import] Use larger number of samples on the backend #196233

Merged
merged 20 commits on Oct 15, 2024

Conversation

ilyannn
Contributor

@ilyannn commented Oct 15, 2024

Release Notes

Automatic Import now analyses a larger number of samples when generating an integration.

Summary

Closes https://github.com/elastic/security-team/issues/9844

Important

This PR also contains the functionality of #196228 and #196207; they should be merged before this one.

Added: Backend Sampling

We pass 100 rows (these numeric values are adjustable) to the backend.

The Categorization chain now processes the samples in batches, performing a number of review cycles after the initial categorization (at most 5, tuned so that we stay under the 2-minute limit for a single API call).

To decide when to stop processing, we maintain a list of stable samples as follows:

  1. The list is initially empty.
  2. For each review we select a random subset of 40 samples, preferring samples that are not yet stable.
  3. After each review, in which the LLM may add new processors or change existing ones, we compare the new pipeline results with the old ones.
  4. Reviewed samples whose categorization did not change are added to the stable list.
  5. Any samples whose categorization changed are removed from the stable list.
  6. If all samples are stable, we finish processing.
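The stability loop above can be sketched as follows. This is a minimal illustration, not the plugin's actual code: names such as `selectReviewBatch`, `updateStable`, and the two constants are hypothetical.

```typescript
// Hypothetical sketch of the stable-sample bookkeeping described above.
// Constant names and helper signatures are illustrative only.
const MAX_REVIEW_CYCLES = 5;
const REVIEW_BATCH_SIZE = 40;

type Sample = string;

// Pick up to `size` samples for the next review, preferring samples
// that are not yet on the stable list.
function selectReviewBatch(all: Sample[], stable: Set<Sample>, size: number): Sample[] {
  const shuffle = (xs: Sample[]) => [...xs].sort(() => Math.random() - 0.5);
  const unstable = shuffle(all.filter((s) => !stable.has(s)));
  const filler = shuffle(all.filter((s) => stable.has(s)));
  return [...unstable, ...filler].slice(0, size);
}

// One review cycle: compare the new pipeline results for the reviewed
// samples against the old results and update the stable set.
function updateStable(
  stable: Set<Sample>,
  reviewed: Sample[],
  oldResults: Map<Sample, string>,
  newResults: Map<Sample, string>
): Set<Sample> {
  const next = new Set(stable);
  for (const sample of reviewed) {
    if (oldResults.get(sample) === newResults.get(sample)) {
      next.add(sample); // categorization unchanged -> stable
    } else {
      next.delete(sample); // categorization changed -> no longer stable
    }
  }
  return next;
}
```

Processing finishes as soon as the stable set covers all samples, or after `MAX_REVIEW_CYCLES` iterations.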

Removed: User Notification

Using 100 samples strikes a balance between the expected complexity and the time budget we work with. We might want to change this number in the future, possibly dynamically, so the specific value is of no importance to the user. We therefore remove the truncation notification.

Unchanged:

  • No batching is performed in the related chain: it seems to work as-is.

Refactored:

  • We centralize the sizing constants in the x-pack/plugins/integration_assistant/common/constants.ts file.
  • We remove the unused state key formattedSamples and combine modelJSONInput back into modelInput.
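For illustration, the centralized constants module might look roughly like this; the constant names and comments below are hypothetical, and the real definitions live in x-pack/plugins/integration_assistant/common/constants.ts.

```typescript
// Hypothetical sketch of centralized sizing constants (names are illustrative,
// not the actual exports of constants.ts).
export const FRONTEND_SAMPLE_ROWS = 100; // rows passed to the backend
export const CATEGORIZATION_REVIEW_BATCH_SIZE = 40; // samples per review cycle
export const CATEGORIZATION_REVIEW_MAX_CYCLES = 5; // keeps a single API call under ~2 minutes
```

Keeping these in one module makes it easy to tune the sampling behaviour later, possibly dynamically.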

Note

I had difficulty generating new graph diagrams, so they remain unchanged.

Testing

Postgres

25 samples, 1 review cycle, 50s for categorization: ai_postgres_202410150832-1.0.0.zip

(generated ingest pipeline)
```yaml
---
description: Pipeline to process ai_postgres_202410150832 audit logs
processors:
  - set:
      tag: set_ecs_version
      field: ecs.version
      value: 8.11.0
  - set:
      tag: copy_original_message
      field: originalMessage
      copy_from: message
  - csv:
      tag: parse_csv
      field: message
      target_fields:
        - ai_postgres_202410150832.audit.timestamp
        - ai_postgres_202410150832.audit.database
        - ai_postgres_202410150832.audit.user
        - ai_postgres_202410150832.audit.process_id
        - ai_postgres_202410150832.audit.client_address
        - ai_postgres_202410150832.audit.session_id
        - ai_postgres_202410150832.audit.line_num
        - ai_postgres_202410150832.audit.command_tag
        - ai_postgres_202410150832.audit.session_start_time
        - ai_postgres_202410150832.audit.virtual_transaction_id
        - ai_postgres_202410150832.audit.transaction_id
        - ai_postgres_202410150832.audit.error_severity
        - ai_postgres_202410150832.audit.sql_state_code
        - ai_postgres_202410150832.audit.message
        - ai_postgres_202410150832.audit.column15
        - ai_postgres_202410150832.audit.column16
        - ai_postgres_202410150832.audit.column17
        - ai_postgres_202410150832.audit.column18
        - ai_postgres_202410150832.audit.column19
        - ai_postgres_202410150832.audit.column20
        - ai_postgres_202410150832.audit.column21
        - ai_postgres_202410150832.audit.application_name
        - ai_postgres_202410150832.audit.backend_type
        - ai_postgres_202410150832.audit.column24
      description: Parse CSV input
  - rename:
      ignore_missing: true
      if: ctx.event?.original == null
      tag: rename_message
      field: originalMessage
      target_field: event.original
  - remove:
      ignore_missing: true
      if: ctx.event?.original != null
      tag: remove_copied_message
      field: originalMessage
  - remove:
      ignore_missing: true
      tag: remove_message
      field: message
  - rename:
      ignore_missing: true
      field: ai_postgres_202410150832.audit.transaction_id
      target_field: transaction.id
  - convert:
      ignore_failure: true
      ignore_missing: true
      field: ai_postgres_202410150832.audit.process_id
      target_field: process.pid
      type: long
  - rename:
      ignore_missing: true
      field: ai_postgres_202410150832.audit.error_severity
      target_field: log.level
  - script:
      tag: script_convert_array_to_string
      description: Ensures the date processor does not receive an array value.
      lang: painless
      source: |
        if (ctx.ai_postgres_202410150832?.audit?.session_start_time != null &&
            ctx.ai_postgres_202410150832.audit.session_start_time instanceof ArrayList){
            ctx.ai_postgres_202410150832.audit.session_start_time = ctx.ai_postgres_202410150832.audit.session_start_time[0];
        }
  - date:
      if: ctx.ai_postgres_202410150832?.audit?.session_start_time != null
      tag: date_processor_ai_postgres_202410150832.audit.session_start_time
      field: ai_postgres_202410150832.audit.session_start_time
      target_field: event.start
      formats:
        - yyyy-MM-dd HH:mm:ss z
  - rename:
      ignore_missing: true
      field: ai_postgres_202410150832.audit.message
      target_field: message
  - script:
      tag: script_convert_array_to_string
      description: Ensures the date processor does not receive an array value.
      lang: painless
      source: |
        if (ctx.ai_postgres_202410150832?.audit?.timestamp != null &&
            ctx.ai_postgres_202410150832.audit.timestamp instanceof ArrayList){
            ctx.ai_postgres_202410150832.audit.timestamp = ctx.ai_postgres_202410150832.audit.timestamp[0];
        }
  - date:
      if: ctx.ai_postgres_202410150832?.audit?.timestamp != null
      tag: date_processor_ai_postgres_202410150832.audit.timestamp
      field: ai_postgres_202410150832.audit.timestamp
      target_field: '@timestamp'
      formats:
        - yyyy-MM-dd HH:mm:ss.SSS z
  - rename:
      ignore_missing: true
      field: ai_postgres_202410150832.audit.database
      target_field: destination.domain
  - rename:
      ignore_missing: true
      field: ai_postgres_202410150832.audit.client_address
      target_field: source.address
  - rename:
      ignore_missing: true
      field: ai_postgres_202410150832.audit.user
      target_field: user.name
  - script:
      tag: script_drop_null_empty_values
      description: Drops null/empty values recursively.
      lang: painless
      source: |
        boolean dropEmptyFields(Object object) {
          if (object == null || object == "") {
            return true;
          } else if (object instanceof Map) {
            ((Map) object).values().removeIf(value -> dropEmptyFields(value));
            return (((Map) object).size() == 0);
          } else if (object instanceof List) {
            ((List) object).removeIf(value -> dropEmptyFields(value));
            return (((List) object).length == 0);
          }
          return false;
        }
        dropEmptyFields(ctx);
  - geoip:
      ignore_missing: true
      tag: geoip_source_ip
      field: source.ip
      target_field: source.geo
  - geoip:
      ignore_missing: true
      tag: geoip_source_asn
      database_file: GeoLite2-ASN.mmdb
      field: source.ip
      target_field: source.as
      properties:
        - asn
        - organization_name
  - rename:
      ignore_missing: true
      tag: rename_source_as_asn
      field: source.as.asn
      target_field: source.as.number
  - rename:
      ignore_missing: true
      tag: rename_source_as_organization_name
      field: source.as.organization_name
      target_field: source.as.organization.name
  - geoip:
      ignore_missing: true
      tag: geoip_destination_ip
      field: destination.ip
      target_field: destination.geo
  - geoip:
      ignore_missing: true
      tag: geoip_destination_asn
      database_file: GeoLite2-ASN.mmdb
      field: destination.ip
      target_field: destination.as
      properties:
        - asn
        - organization_name
  - rename:
      ignore_missing: true
      tag: rename_destination_as_asn
      field: destination.as.asn
      target_field: destination.as.number
  - rename:
      ignore_missing: true
      tag: rename_destination_as_organization_name
      field: destination.as.organization_name
      target_field: destination.as.organization.name
  - append:
      if: ctx.ai_postgres_202410150832?.audit?.column24 == 'checkpointer'
      field: event.category
      value:
        - database
      allow_duplicates: false
  - append:
      if: ctx.ai_postgres_202410150832?.audit?.column24 == 'checkpointer'
      field: event.type
      value:
        - info
      allow_duplicates: false
  - append:
      if: ctx.ai_postgres_202410150832?.audit?.column24 == 'client backend'
      field: event.category
      value:
        - database
      allow_duplicates: false
  - append:
      if: ctx.ai_postgres_202410150832?.audit?.column24 == 'client backend'
      field: event.type
      value:
        - access
      allow_duplicates: false
  - append:
      if: ctx.ai_postgres_202410150832?.audit?.column24 == 'postmaster'
      field: event.category
      value:
        - database
      allow_duplicates: false
  - append:
      if: ctx.message?.contains('starting PostgreSQL')
      field: event.type
      value:
        - info
      allow_duplicates: false
  - append:
      if: >-
        ctx.ai_postgres_202410150832?.audit?.column24 == 'postmaster' &&
        !ctx.message?.contains('starting PostgreSQL')
      field: event.type
      value:
        - info
      allow_duplicates: false
  - append:
      if: ctx.ai_postgres_202410150832?.audit?.column24 == 'startup'
      field: event.category
      value:
        - database
      allow_duplicates: false
  - append:
      if: ctx.ai_postgres_202410150832?.audit?.column24 == 'startup'
      field: event.type
      value:
        - info
      allow_duplicates: false
  - append:
      if: ctx.ai_postgres_202410150832?.audit?.command_tag == 'authentication'
      field: event.category
      value:
        - database
      allow_duplicates: false
  - append:
      if: ctx.ai_postgres_202410150832?.audit?.command_tag == 'authentication'
      field: event.type
      value:
        - info
      allow_duplicates: false
  - append:
      if: ctx.message?.contains('connection received:')
      field: event.category
      value:
        - database
      allow_duplicates: false
  - append:
      if: ctx.message?.contains('connection received:')
      field: event.type
      value:
        - info
      allow_duplicates: false
  - append:
      if: ctx.message?.contains('disconnection:')
      field: event.category
      value:
        - database
      allow_duplicates: false
  - append:
      if: ctx.message?.contains('disconnection:')
      field: event.type
      value:
        - info
      allow_duplicates: false
  - append:
      if: >-
        ctx.message?.contains('parameter') &&
        ctx.message?.contains('changed to')
      field: event.category
      value:
        - configuration
      allow_duplicates: false
  - append:
      if: >-
        ctx.message?.contains('parameter') &&
        ctx.message?.contains('changed to')
      field: event.type
      value:
        - change
      allow_duplicates: false
  - append:
      if: ctx.ai_postgres_202410150832?.audit?.column24 == 'not initialized'
      field: event.type
      value:
        - info
      allow_duplicates: false
  - append:
      if: ctx.ai_postgres_202410150832?.audit?.command_tag == 'authentication'
      field: event.type
      value:
        - access
        - info
      allow_duplicates: false
  - append:
      if: ctx.source?.address != null
      field: related.ip
      value: '{{{source.address}}}'
      allow_duplicates: false
  - append:
      if: ctx.user?.name != null
      field: related.user
      value: '{{{user.name}}}'
      allow_duplicates: false
  - append:
      if: ctx.destination?.domain != null
      field: related.hosts
      value: '{{{destination.domain}}}'
      allow_duplicates: false
  - remove:
      ignore_missing: true
      tag: remove_fields
      field:
        - ai_postgres_202410150832.audit.process_id
  - remove:
      ignore_failure: true
      ignore_missing: true
      if: ctx?.tags == null || !(ctx.tags.contains("preserve_original_event"))
      tag: remove_original_event
      field: event.original
on_failure:
  - append:
      field: error.message
      value: >-
        Processor {{{_ingest.on_failure_processor_type}}} with tag
        {{{_ingest.on_failure_processor_tag}}} in pipeline
        {{{_ingest.on_failure_pipeline}}} failed with message:
        {{{_ingest.on_failure_message}}}
  - set:
      field: event.kind
      value: pipeline_error
```
(example event)
        {
            "@timestamp": "2021-01-04T01:06:54.227Z",
            "ai_postgres_202410150832": {
                "audit": {
                    "backend_type": "psql",
                    "column24": "client backend",
                    "command_tag": "idle",
                    "line_num": "5",
                    "session_id": "5ff26a0c.56",
                    "session_start_time": "2021-01-04 01:06:20 UTC",
                    "sql_state_code": "00000",
                    "timestamp": "2021-01-04 01:06:54.227 UTC",
                    "virtual_transaction_id": "3/4"
                }
            },
            "destination": {
                "domain": "postgres"
            },
            "ecs": {
                "version": "8.11.0"
            },
            "event": {
                "category": [
                    "database"
                ],
                "original": "2021-01-04 01:06:54.227 UTC,\"postgres\",\"postgres\",86,\"172.24.0.1:45126\",5ff26a0c.56,5,\"idle\",2021-01-04 01:06:20 UTC,3/4,0,LOG,00000,\"statement: SELECT name FROM  (SELECT pg_catalog.lower(name) AS name FROM pg_catalog.pg_settings   WHERE context != 'internal'   UNION ALL SELECT 'all') ss  WHERE substring(name,1,7)='log_min' LIMIT 1000\",,,,,,,,,\"psql\",\"client backend\"",
                "start": "2021-01-04T01:06:20.000Z",
                "type": [
                    "access"
                ]
            },
            "log": {
                "level": "LOG"
            },
            "message": "statement: SELECT name FROM  (SELECT pg_catalog.lower(name) AS name FROM pg_catalog.pg_settings   WHERE context != 'internal'   UNION ALL SELECT 'all') ss  WHERE substring(name,1,7)='log_min' LIMIT 1000",
            "process": {
                "pid": 86
            },
            "related": {
                "hosts": [
                    "postgres"
                ],
                "ip": [
                    "172.24.0.1:45126"
                ],
                "user": [
                    "postgres"
                ]
            },
            "source": {
                "address": "172.24.0.1:45126"
            },
            "tags": [
                "preserve_original_event"
            ],
            "transaction": {
                "id": "0"
            },
            "user": {
                "name": "postgres"
            }
        }

Teleport Audit Events

28 samples, 1 review cycle, 50s for categorization: ai_teleport_202410150835-1.0.0.zip

(generated ingest pipeline)
---
description: Pipeline to process ai_teleport_202410150835 audit logs
processors:
  - set:
      tag: set_ecs_version
      field: ecs.version
      value: 8.11.0
  - set:
      tag: copy_original_message
      field: originalMessage
      copy_from: message
  - rename:
      ignore_missing: true
      if: ctx.event?.original == null
      tag: rename_message
      field: originalMessage
      target_field: event.original
  - remove:
      ignore_missing: true
      if: ctx.event?.original != null
      tag: remove_copied_message
      field: originalMessage
  - remove:
      ignore_missing: true
      tag: remove_message
      field: message
  - json:
      tag: json_original
      field: event.original
      target_field: ai_teleport_202410150835.audit
  - rename:
      ignore_missing: true
      field: ai_teleport_202410150835.audit.event
      target_field: event.action
  - rename:
      ignore_missing: true
      field: ai_teleport_202410150835.audit.uid
      target_field: event.id
  - rename:
      ignore_missing: true
      field: ai_teleport_202410150835.audit.code
      target_field: event.code
  - script:
      tag: script_convert_array_to_string
      description: Ensures the date processor does not receive an array value.
      lang: painless
      source: |
        if (ctx.ai_teleport_202410150835?.audit?.time != null &&
            ctx.ai_teleport_202410150835.audit.time instanceof ArrayList){
            ctx.ai_teleport_202410150835.audit.time = ctx.ai_teleport_202410150835.audit.time[0];
        }
  - date:
      if: ctx.ai_teleport_202410150835?.audit?.time != null
      tag: date_processor_ai_teleport_202410150835.audit.time
      field: ai_teleport_202410150835.audit.time
      target_field: event.start
      formats:
        - yyyy-MM-dd'T'HH:mm:ss.SSS'Z'
        - ISO8601
  - rename:
      ignore_missing: true
      field: ai_teleport_202410150835.audit.user
      target_field: user.name
  - rename:
      ignore_missing: true
      field: ai_teleport_202410150835.audit.success
      target_field: event.outcome
  - rename:
      ignore_missing: true
      field: ai_teleport_202410150835.audit.user_agent
      target_field: user_agent.original
  - rename:
      ignore_missing: true
      field: ai_teleport_202410150835.audit.addr.remote
      target_field: source.ip
  - rename:
      ignore_missing: true
      field: ai_teleport_202410150835.audit.identity.client_ip
      target_field: source.ip
  - script:
      tag: script_convert_array_to_string
      description: Ensures the date processor does not receive an array value.
      lang: painless
      source: >
        if (ctx.ai_teleport_202410150835?.audit?.identity?.prev_identity_expires
        != null &&
            ctx.ai_teleport_202410150835.audit.identity.prev_identity_expires instanceof ArrayList){
            ctx.ai_teleport_202410150835.audit.identity.prev_identity_expires = ctx.ai_teleport_202410150835.audit.identity.prev_identity_expires[0];
        }
  - date:
      if: >-
        ctx.ai_teleport_202410150835?.audit?.identity?.prev_identity_expires !=
        null
      tag: >-
        date_processor_ai_teleport_202410150835.audit.identity.prev_identity_expires
      field: ai_teleport_202410150835.audit.identity.prev_identity_expires
      target_field: event.end
      formats:
        - yyyy-MM-dd'T'HH:mm:ss'Z'
        - ISO8601
  - rename:
      ignore_missing: true
      field: ai_teleport_202410150835.audit.identity.route_to_database.protocol
      target_field: network.protocol
  - rename:
      ignore_missing: true
      field: ai_teleport_202410150835.audit.identity.route_to_database.username
      target_field: destination.user.name
  - rename:
      ignore_missing: true
      field: ai_teleport_202410150835.audit.db_protocol
      target_field: network.protocol
  - rename:
      ignore_missing: true
      field: ai_teleport_202410150835.audit.db_uri
      target_field: url.full
  - rename:
      ignore_missing: true
      field: ai_teleport_202410150835.audit.server_hostname
      target_field: destination.domain
  - rename:
      ignore_missing: true
      field: ai_teleport_202410150835.audit.server_addr
      target_field: destination.address
  - rename:
      ignore_missing: true
      field: ai_teleport_202410150835.audit.proto
      target_field: network.protocol
  - script:
      tag: script_convert_array_to_string
      description: Ensures the date processor does not receive an array value.
      lang: painless
      source: |
        if (ctx.ai_teleport_202410150835?.audit?.session_stop != null &&
            ctx.ai_teleport_202410150835.audit.session_stop instanceof ArrayList){
            ctx.ai_teleport_202410150835.audit.session_stop = ctx.ai_teleport_202410150835.audit.session_stop[0];
        }
  - date:
      if: ctx.ai_teleport_202410150835?.audit?.session_stop != null
      tag: date_processor_ai_teleport_202410150835.audit.session_stop
      field: ai_teleport_202410150835.audit.session_stop
      target_field: event.end
      formats:
        - yyyy-MM-dd'T'HH:mm:ss.SSSSSSSSS'Z'
        - ISO8601
  - rename:
      ignore_missing: true
      field: ai_teleport_202410150835.audit.addr.local
      target_field: source.address
  - rename:
      ignore_missing: true
      field: ai_teleport_202410150835.audit.tx
      target_field: network.bytes
  - rename:
      ignore_missing: true
      field: ai_teleport_202410150835.audit.url
      target_field: url.original
  - rename:
      ignore_missing: true
      field: ai_teleport_202410150835.audit.connector
      target_field: event.provider
  - script:
      tag: script_drop_null_empty_values
      description: Drops null/empty values recursively.
      lang: painless
      source: |
        boolean dropEmptyFields(Object object) {
          if (object == null || object == "") {
            return true;
          } else if (object instanceof Map) {
            ((Map) object).values().removeIf(value -> dropEmptyFields(value));
            return (((Map) object).size() == 0);
          } else if (object instanceof List) {
            ((List) object).removeIf(value -> dropEmptyFields(value));
            return (((List) object).length == 0);
          }
          return false;
        }
        dropEmptyFields(ctx);
  - geoip:
      ignore_missing: true
      tag: geoip_source_ip
      field: source.ip
      target_field: source.geo
  - geoip:
      ignore_missing: true
      tag: geoip_source_asn
      database_file: GeoLite2-ASN.mmdb
      field: source.ip
      target_field: source.as
      properties:
        - asn
        - organization_name
  - rename:
      ignore_missing: true
      tag: rename_source_as_asn
      field: source.as.asn
      target_field: source.as.number
  - rename:
      ignore_missing: true
      tag: rename_source_as_organization_name
      field: source.as.organization_name
      target_field: source.as.organization.name
  - geoip:
      ignore_missing: true
      tag: geoip_destination_ip
      field: destination.ip
      target_field: destination.geo
  - geoip:
      ignore_missing: true
      tag: geoip_destination_asn
      database_file: GeoLite2-ASN.mmdb
      field: destination.ip
      target_field: destination.as
      properties:
        - asn
        - organization_name
  - rename:
      ignore_missing: true
      tag: rename_destination_as_asn
      field: destination.as.asn
      target_field: destination.as.number
  - append:
      if: ctx.event?.action == 'user.login'
      field: event.category
      value:
        - authentication
      allow_duplicates: false
  - append:
      if: ctx.event?.action == 'user.login'
      field: event.type
      value:
        - start
      allow_duplicates: false
  - append:
      if: ctx.event?.action == 'user.login' && ctx.event?.outcome == false
      field: event.type
      value:
        - end
      allow_duplicates: false
  - append:
      if: '[''session.start'', ''session.end''].contains(ctx.event?.action)'
      field: event.category
      value:
        - session
      allow_duplicates: false
  - append:
      if: ctx.event?.action == 'session.start'
      field: event.type
      value:
        - start
      allow_duplicates: false
  - append:
      if: ctx.event?.action == 'session.end'
      field: event.type
      value:
        - end
      allow_duplicates: false
  - append:
      if: ctx.event?.action == 'cert.create'
      field: event.category
      value:
        - configuration
      allow_duplicates: false
  - append:
      if: ctx.event?.action == 'cert.create'
      field: event.type
      value:
        - creation
      allow_duplicates: false
  - append:
      if: ctx.event?.action == 'db.session.start'
      field: event.category
      value:
        - database
      allow_duplicates: false
  - append:
      if: ctx.event?.action == 'db.session.start'
      field: event.type
      value:
        - access
      allow_duplicates: false
  - append:
      if: ctx.event?.action == 'db.session.start' && ctx.event?.outcome == false
      field: event.type
      value:
        - error
      allow_duplicates: false
  - append:
      if: '[''role.created'', ''user.create''].contains(ctx.event?.action)'
      field: event.category
      value:
        - iam
      allow_duplicates: false
  - append:
      if: '[''role.created'', ''user.create''].contains(ctx.event?.action)'
      field: event.type
      value:
        - creation
      allow_duplicates: false
  - append:
      if: ctx.event?.action == 'session.data'
      field: event.category
      value:
        - process
      allow_duplicates: false
  - append:
      if: ctx.event?.action == 'session.data'
      field: event.type
      value:
        - info
      allow_duplicates: false
  - append:
      if: ctx.event?.action == 'session.upload'
      field: event.category
      value:
        - file
      allow_duplicates: false
  - append:
      if: ctx.event?.action == 'session.upload'
      field: event.type
      value:
        - creation
      allow_duplicates: false
  - append:
      if: ctx.event?.action == 'session.leave'
      field: event.type
      value:
        - info
      allow_duplicates: false
  - append:
      if: ctx.source?.ip != null
      field: related.ip
      value: '{{{source.ip}}}'
      allow_duplicates: false
  - append:
      if: ctx.user?.name != null
      field: related.user
      value: '{{{user.name}}}'
      allow_duplicates: false
  - append:
      if: ctx.destination?.domain != null
      field: related.hosts
      value: '{{{destination.domain}}}'
      allow_duplicates: false
  - append:
      if: ctx.destination?.user?.name != null
      field: related.user
      value: '{{{destination.user.name}}}'
      allow_duplicates: false
  - rename:
      ignore_missing: true
      tag: rename_destination_as_organization_name
      field: destination.as.organization_name
      target_field: destination.as.organization.name
  - remove:
      ignore_failure: true
      ignore_missing: true
      if: ctx?.tags == null || !(ctx.tags.contains("preserve_original_event"))
      tag: remove_original_event
      field: event.original
on_failure:
  - append:
      field: error.message
      value: >-
        Processor {{{_ingest.on_failure_processor_type}}} with tag
        {{{_ingest.on_failure_processor_tag}}} in pipeline
        {{{_ingest.on_failure_pipeline}}} failed with message:
        {{{_ingest.on_failure_message}}}
  - set:
      field: event.kind
      value: pipeline_error 
(example event)
        {
            "ai_teleport_202410150835": {
                "audit": {
                    "cluster_name": "teleport.ericbeahan.com",
                    "db_origin": "config-file",
                    "db_service": "example-dynamodb",
                    "db_type": "dynamodb",
                    "db_user": "ExampleTeleportDynamoDBRole",
                    "ei": 0,
                    "error": "access to db denied. User does not have permissions. Confirm database user and name.",
                    "message": "access to db denied. User does not have permissions. Confirm database user and name.",
                    "namespace": "default",
                    "private_key_policy": "none",
                    "server_id": "b321c207-fd08-46c8-b248-0c20436feb62",
                    "sid": "b3710ea5-a293-4e24-ab3e-6e6d14d6358f",
                    "time": "2024-02-23T19:24:34.602Z",
                    "user_kind": 1
                }
            },
            "ecs": {
                "version": "8.11.0"
            },
            "event": {
                "action": "db.session.start",
                "category": [
                    "database"
                ],
                "code": "TDB00W",
                "id": "8ea0beff-640b-471d-a5f1-5ab0e3090278",
                "original": "{\"ei\":0,\"event\":\"db.session.start\",\"uid\":\"8ea0beff-640b-471d-a5f1-5ab0e3090278\",\"code\":\"TDB00W\",\"time\":\"2024-02-23T19:24:34.602Z\",\"cluster_name\":\"teleport.ericbeahan.com\",\"user\":\"teleport-admin\",\"user_kind\":1,\"sid\":\"b3710ea5-a293-4e24-ab3e-6e6d14d6358f\",\"private_key_policy\":\"none\",\"namespace\":\"default\",\"server_id\":\"b321c207-fd08-46c8-b248-0c20436feb62\",\"success\":false,\"error\":\"access to db denied. User does not have permissions. Confirm database user and name.\",\"message\":\"access to db denied. User does not have permissions. Confirm database user and name.\",\"db_service\":\"example-dynamodb\",\"db_protocol\":\"dynamodb\",\"db_uri\":\"aws://dynamodb.us-east-2.amazonaws.com\",\"db_user\":\"ExampleTeleportDynamoDBRole\",\"db_type\":\"dynamodb\",\"db_origin\":\"config-file\"}",
                "outcome": false,
                "start": "2024-02-23T19:24:34.602Z",
                "type": [
                    "access",
                    "error"
                ]
            },
            "network": {
                "protocol": "dynamodb"
            },
            "related": {
                "user": [
                    "teleport-admin"
                ]
            },
            "tags": [
                "preserve_original_event"
            ],
            "url": {
                "full": "aws://dynamodb.us-east-2.amazonaws.com"
            },
            "user": {
                "name": "teleport-admin"
            }
        }

PAN-OS Traffic

100 samples, 4 review cycles, 120s, then 100s for categorization: ai_panw_202410150813-1.0.0.zip

(generated ingest pipeline)
---
description: Pipeline to process ai_panw_202410150813 traffic logs
processors:
  - set:
      tag: set_ecs_version
      field: ecs.version
      value: 8.11.0
  - set:
      tag: copy_original_message
      field: originalMessage
      copy_from: message
  - csv:
      tag: parse_csv
      field: message
      target_fields:
        - ai_panw_202410150813.traffic.version
        - ai_panw_202410150813.traffic.timestamp
        - ai_panw_202410150813.traffic.serial_number
        - ai_panw_202410150813.traffic.log_type
        - ai_panw_202410150813.traffic.subtype
        - ai_panw_202410150813.traffic.config_version
        - ai_panw_202410150813.traffic.time_generated
        - ai_panw_202410150813.traffic.source_ip
        - ai_panw_202410150813.traffic.destination_ip
        - ai_panw_202410150813.traffic.nat_source_ip
        - ai_panw_202410150813.traffic.nat_destination_ip
        - ai_panw_202410150813.traffic.rule_name
        - ai_panw_202410150813.traffic.source_user
        - ai_panw_202410150813.traffic.destination_user
        - ai_panw_202410150813.traffic.application
        - ai_panw_202410150813.traffic.virtual_system
        - ai_panw_202410150813.traffic.source_zone
        - ai_panw_202410150813.traffic.destination_zone
        - ai_panw_202410150813.traffic.inbound_interface
        - ai_panw_202410150813.traffic.outbound_interface
        - ai_panw_202410150813.traffic.log_action
        - ai_panw_202410150813.traffic.time_received
        - ai_panw_202410150813.traffic.session_id
        - ai_panw_202410150813.traffic.repeat_count
        - ai_panw_202410150813.traffic.source_port
        - ai_panw_202410150813.traffic.destination_port
        - ai_panw_202410150813.traffic.nat_source_port
        - ai_panw_202410150813.traffic.nat_destination_port
        - ai_panw_202410150813.traffic.flags
        - ai_panw_202410150813.traffic.protocol
        - ai_panw_202410150813.traffic.action
        - ai_panw_202410150813.traffic.bytes
        - ai_panw_202410150813.traffic.bytes_sent
        - ai_panw_202410150813.traffic.bytes_received
        - ai_panw_202410150813.traffic.packets
        - ai_panw_202410150813.traffic.start_time
        - ai_panw_202410150813.traffic.elapsed_time
        - ai_panw_202410150813.traffic.category
        - ai_panw_202410150813.traffic.padding
        - ai_panw_202410150813.traffic.sequence_number
        - ai_panw_202410150813.traffic.action_flags
        - ai_panw_202410150813.traffic.source_location
        - ai_panw_202410150813.traffic.destination_location
        - ai_panw_202410150813.traffic.padding_2
        - ai_panw_202410150813.traffic.packets_sent
        - ai_panw_202410150813.traffic.packets_received
        - ai_panw_202410150813.traffic.session_end_reason
        - ai_panw_202410150813.traffic.device_group_hierarchy_level_1
        - ai_panw_202410150813.traffic.device_group_hierarchy_level_2
        - ai_panw_202410150813.traffic.device_group_hierarchy_level_3
        - ai_panw_202410150813.traffic.device_group_hierarchy_level_4
        - ai_panw_202410150813.traffic.virtual_system_name
        - ai_panw_202410150813.traffic.device_name
        - ai_panw_202410150813.traffic.action_source
        - ai_panw_202410150813.traffic.source_vm_uuid
        - ai_panw_202410150813.traffic.destination_vm_uuid
        - ai_panw_202410150813.traffic.tunnel_id_imsi
        - ai_panw_202410150813.traffic.monitor_tag_imei
        - ai_panw_202410150813.traffic.parent_session_id
        - ai_panw_202410150813.traffic.parent_start_time
        - ai_panw_202410150813.traffic.tunnel_type
        - ai_panw_202410150813.traffic.sctp_association_id
        - ai_panw_202410150813.traffic.sctp_chunks
        - ai_panw_202410150813.traffic.sctp_chunks_sent
        - ai_panw_202410150813.traffic.sctp_chunks_received
        - ai_panw_202410150813.traffic.rule_uuid
        - ai_panw_202410150813.traffic.http2_connection
        - ai_panw_202410150813.traffic.app_flap_count
        - ai_panw_202410150813.traffic.policy_id
        - ai_panw_202410150813.traffic.link_changes
        - ai_panw_202410150813.traffic.sdwan_cluster
        - ai_panw_202410150813.traffic.sdwan_device_type
        - ai_panw_202410150813.traffic.sdwan_cluster_type
        - ai_panw_202410150813.traffic.sdwan_site
        - ai_panw_202410150813.traffic.dynusergroup_name
        - ai_panw_202410150813.traffic.xff_ip
        - ai_panw_202410150813.traffic.src_uuid
        - ai_panw_202410150813.traffic.dst_uuid
      description: Parse CSV input
  - rename:
      ignore_missing: true
      if: ctx.event?.original == null
      tag: rename_message
      field: originalMessage
      target_field: event.original
  - remove:
      ignore_missing: true
      if: ctx.event?.original != null
      tag: remove_copied_message
      field: originalMessage
  - remove:
      ignore_missing: true
      tag: remove_message
      field: message
  - rename:
      ignore_missing: true
      field: ai_panw_202410150813.traffic.nat_source_port
      target_field: source.nat.port
  - rename:
      ignore_missing: true
      field: ai_panw_202410150813.traffic.bytes_received
      target_field: destination.bytes
  - rename:
      ignore_missing: true
      field: ai_panw_202410150813.traffic.packets
      target_field: network.packets
  - rename:
      ignore_missing: true
      field: ai_panw_202410150813.traffic.source_ip
      target_field: source.ip
  - rename:
      ignore_missing: true
      field: ai_panw_202410150813.traffic.protocol
      target_field: network.transport
  - rename:
      ignore_missing: true
      field: ai_panw_202410150813.traffic.destination_ip
      target_field: destination.ip
  - rename:
      ignore_missing: true
      field: ai_panw_202410150813.traffic.action
      target_field: event.action
  - rename:
      ignore_missing: true
      field: ai_panw_202410150813.traffic.rule_name
      target_field: rule.name
  - rename:
      ignore_missing: true
      field: ai_panw_202410150813.traffic.nat_destination_ip
      target_field: destination.nat.ip
  - script:
      tag: script_convert_array_to_string
      description: Ensures the date processor does not receive an array value.
      lang: painless
      source: |
        if (ctx.ai_panw_202410150813?.traffic?.start_time != null &&
            ctx.ai_panw_202410150813.traffic.start_time instanceof ArrayList){
            ctx.ai_panw_202410150813.traffic.start_time = ctx.ai_panw_202410150813.traffic.start_time[0];
        }
  - date:
      if: ctx.ai_panw_202410150813?.traffic?.start_time != null
      tag: date_processor_ai_panw_202410150813.traffic.start_time
      field: ai_panw_202410150813.traffic.start_time
      target_field: event.start
      formats:
        - yyyy/MM/dd HH:mm:ss
  - rename:
      ignore_missing: true
      field: ai_panw_202410150813.traffic.destination_port
      target_field: destination.port
  - rename:
      ignore_missing: true
      field: ai_panw_202410150813.traffic.device_name
      target_field: host.name
  - rename:
      ignore_missing: true
      field: ai_panw_202410150813.traffic.packets_sent
      target_field: source.packets
  - rename:
      ignore_missing: true
      field: ai_panw_202410150813.traffic.packets_received
      target_field: destination.packets
  - rename:
      ignore_missing: true
      field: ai_panw_202410150813.traffic.source_port
      target_field: source.port
  - rename:
      ignore_missing: true
      field: ai_panw_202410150813.traffic.nat_destination_port
      target_field: destination.nat.port
  - rename:
      ignore_missing: true
      field: ai_panw_202410150813.traffic.bytes_sent
      target_field: source.bytes
  - rename:
      ignore_missing: true
      field: ai_panw_202410150813.traffic.nat_source_ip
      target_field: source.nat.ip
  - rename:
      ignore_missing: true
      field: ai_panw_202410150813.traffic.application
      target_field: network.application
  - convert:
      ignore_failure: true
      ignore_missing: true
      field: ai_panw_202410150813.traffic.bytes
      target_field: network.bytes
      type: long
  - rename:
      ignore_missing: true
      field: ai_panw_202410150813.traffic.destination_location
      target_field: destination.geo.country_name
  - rename:
      ignore_missing: true
      field: ai_panw_202410150813.traffic.source_user
      target_field: source.user.name
  - rename:
      ignore_missing: true
      field: ai_panw_202410150813.traffic.rule_uuid
      target_field: rule.uuid
  - script:
      tag: script_drop_null_empty_values
      description: Drops null/empty values recursively.
      lang: painless
      source: |
        boolean dropEmptyFields(Object object) {
          if (object == null || object == "") {
            return true;
          } else if (object instanceof Map) {
            ((Map) object).values().removeIf(value -> dropEmptyFields(value));
            return (((Map) object).size() == 0);
          } else if (object instanceof List) {
            ((List) object).removeIf(value -> dropEmptyFields(value));
            return (((List) object).length == 0);
          }
          return false;
        }
        dropEmptyFields(ctx);
  - geoip:
      ignore_missing: true
      tag: geoip_source_ip
      field: source.ip
      target_field: source.geo
  - geoip:
      ignore_missing: true
      tag: geoip_source_asn
      database_file: GeoLite2-ASN.mmdb
      field: source.ip
      target_field: source.as
      properties:
        - asn
        - organization_name
  - rename:
      ignore_missing: true
      tag: rename_source_as_asn
      field: source.as.asn
      target_field: source.as.number
  - rename:
      ignore_missing: true
      tag: rename_source_as_organization_name
      field: source.as.organization_name
      target_field: source.as.organization.name
  - geoip:
      ignore_missing: true
      tag: geoip_destination_ip
      field: destination.ip
      target_field: destination.geo
  - geoip:
      ignore_missing: true
      tag: geoip_destination_asn
      database_file: GeoLite2-ASN.mmdb
      field: destination.ip
      target_field: destination.as
      properties:
        - asn
        - organization_name
  - rename:
      ignore_missing: true
      tag: rename_destination_as_asn
      field: destination.as.asn
      target_field: destination.as.number
  - rename:
      ignore_missing: true
      tag: rename_destination_as_organization_name
      field: destination.as.organization_name
      target_field: destination.as.organization.name
  - append:
      field: event.category
      value:
        - network
      allow_duplicates: false
  - append:
      if: ctx.event?.action == 'allow'
      field: event.type
      value:
        - connection
        - allowed
      allow_duplicates: false
  - append:
      if: ctx.ai_panw_202410150813?.traffic?.subtype == 'end'
      field: event.type
      value:
        - end
      allow_duplicates: false
  - append:
      if: ctx.ai_panw_202410150813?.traffic?.subtype == 'start'
      field: event.type
      value:
        - start
      allow_duplicates: false
  - append:
      if: ctx.ai_panw_202410150813?.traffic?.session_id != null
      field: event.category
      value:
        - network
        - session
      allow_duplicates: false
  - append:
      if: ctx.ai_panw_202410150813?.traffic?.log_type == 'TRAFFIC'
      field: event.type
      value:
        - info
      allow_duplicates: false
  - append:
      if: ctx.network?.transport != null
      field: event.category
      value:
        - network
      allow_duplicates: false
  - append:
      if: ctx.network?.application != null
      field: event.type
      value:
        - protocol
      allow_duplicates: false
  - append:
      if: ctx.network?.bytes != null || ctx.network?.packets != null
      field: event.category
      value:
        - network
      allow_duplicates: false
  - append:
      if: ctx.event?.action == 'reset server'
      field: event.type
      value:
        - denied
      allow_duplicates: false
  - append:
      if: ctx.ai_panw_202410150813?.traffic?.session_end_reason != null
      field: event.type
      value:
        - end
      allow_duplicates: false
  - append:
      if: ctx.source?.ip != null && ctx.destination?.ip != null
      field: event.type
      value:
        - connection
      allow_duplicates: false
  - append:
      if: ctx.ai_panw_202410150813?.traffic?.subtype == 'start'
      field: event.type
      value:
        - start
      allow_duplicates: false
  - append:
      if: ctx.ai_panw_202410150813?.traffic?.subtype == 'deny'
      field: event.type
      value:
        - denied
      allow_duplicates: false
  - append:
      if: ctx.ai_panw_202410150813?.traffic?.subtype == 'drop'
      field: event.type
      value:
        - denied
      allow_duplicates: false
  - append:
      if: ctx.event?.action == 'drop'
      field: event.type
      value:
        - denied
      allow_duplicates: false
  - append:
      if: ctx.source?.ip != null
      field: related.ip
      value: '{{{source.ip}}}'
      allow_duplicates: false
  - append:
      if: ctx.destination?.ip != null
      field: related.ip
      value: '{{{destination.ip}}}'
      allow_duplicates: false
  - append:
      if: ctx.source?.nat?.ip != null
      field: related.ip
      value: '{{{source.nat.ip}}}'
      allow_duplicates: false
  - append:
      if: ctx.destination?.nat?.ip != null
      field: related.ip
      value: '{{{destination.nat.ip}}}'
      allow_duplicates: false
  - append:
      if: ctx.host?.name != null
      field: related.hosts
      value: '{{{host.name}}}'
      allow_duplicates: false
  - append:
      if: ctx.source?.user?.name != null
      field: related.user
      value: '{{{source.user.name}}}'
      allow_duplicates: false
  - remove:
      ignore_missing: true
      tag: remove_fields
      field:
        - ai_panw_202410150813.traffic.bytes
  - remove:
      ignore_failure: true
      ignore_missing: true
      if: ctx?.tags == null || !(ctx.tags.contains("preserve_original_event"))
      tag: remove_original_event
      field: event.original
on_failure:
  - append:
      field: error.message
      value: >-
        Processor {{{_ingest.on_failure_processor_type}}} with tag
        {{{_ingest.on_failure_processor_tag}}} in pipeline
        {{{_ingest.on_failure_pipeline}}} failed with message:
        {{{_ingest.on_failure_message}}}
  - set:
      field: event.kind
      value: pipeline_error 
Example event:
        {
            "ai_panw_202410150813": {
                "traffic": {
                    "action_flags": "0x0",
                    "action_source": "from-policy",
                    "category": "computer-and-internet-info",
                    "config_version": "2049",
                    "destination_zone": "untrust",
                    "device_group_hierarchy_level_1": "0",
                    "device_group_hierarchy_level_2": "0",
                    "device_group_hierarchy_level_3": "0",
                    "device_group_hierarchy_level_4": "0",
                    "elapsed_time": "586",
                    "flags": "0x400053",
                    "inbound_interface": "ethernet1/2",
                    "log_action": "send_to_mac",
                    "log_type": "TRAFFIC",
                    "outbound_interface": "ethernet1/1",
                    "padding": "0",
                    "padding_2": "0",
                    "parent_session_id": "0",
                    "repeat_count": "1",
                    "sctp_association_id": "0",
                    "sctp_chunks": "0",
                    "sctp_chunks_received": "0",
                    "sctp_chunks_sent": "0",
                    "sequence_number": "32091112",
                    "serial_number": "012801096514",
                    "session_end_reason": "tcp-fin",
                    "session_id": "22751",
                    "source_location": "192.168.0.0-192.168.255.255",
                    "source_zone": "trust",
                    "start_time": "2018/11/30 15:59:04",
                    "subtype": "end",
                    "time_generated": "2018/11/30 16:09:07",
                    "time_received": "2018/11/30 16:09:07",
                    "timestamp": "2018/11/30 16:09:07",
                    "tunnel_id_imsi": "0",
                    "tunnel_type": "N/A",
                    "version": "Nov 30 16:09:08 PA-220 1",
                    "virtual_system": "vsys1"
                }
            },
            "destination": {
                "bytes": "5976",
                "geo": {
                    "city_name": "Changchun",
                    "continent_name": "Asia",
                    "country_iso_code": "CN",
                    "country_name": "China",
                    "location": {
                        "lat": 43.88,
                        "lon": 125.3228
                    },
                    "region_iso_code": "CN-22",
                    "region_name": "Jilin Sheng"
                },
                "ip": "175.16.199.1",
                "nat": {
                    "ip": "175.16.199.1",
                    "port": "443"
                },
                "packets": "20",
                "port": "443"
            },
            "ecs": {
                "version": "8.11.0"
            },
            "event": {
                "action": "allow",
                "category": [
                    "network",
                    "session"
                ],
                "original": "Nov 30 16:09:08 PA-220 1,2018/11/30 16:09:07,012801096514,TRAFFIC,end,2049,2018/11/30 16:09:07,192.168.15.207,175.16.199.1,192.168.1.63,175.16.199.1,new_outbound_from_trust,,,apple-maps,vsys1,trust,untrust,ethernet1/2,ethernet1/1,send_to_mac,2018/11/30 16:09:07,22751,1,55113,443,16418,443,0x400053,tcp,allow,7734,1758,5976,36,2018/11/30 15:59:04,586,computer-and-internet-info,0,32091112,0x0,192.168.0.0-192.168.255.255,United States,0,16,20,tcp-fin,0,0,0,0,,PA-220,from-policy,,,0,,0,,N/A,0,0,0,0",
                "start": "2018-11-30T15:59:04.000Z",
                "type": [
                    "connection",
                    "allowed",
                    "end",
                    "info",
                    "protocol"
                ]
            },
            "host": {
                "name": "PA-220"
            },
            "network": {
                "application": "apple-maps",
                "bytes": 7734,
                "packets": "36",
                "transport": "tcp"
            },
            "related": {
                "hosts": [
                    "PA-220"
                ],
                "ip": [
                    "192.168.15.207",
                    "175.16.199.1",
                    "192.168.1.63"
                ]
            },
            "rule": {
                "name": "new_outbound_from_trust"
            },
            "source": {
                "bytes": "1758",
                "ip": "192.168.15.207",
                "nat": {
                    "ip": "192.168.1.63",
                    "port": "16418"
                },
                "packets": "16",
                "port": "55113"
            },
            "tags": [
                "preserve_original_event"
            ]
        }

Checklist

For maintainers

  • This will appear in the Release Notes and follow the guidelines

Footnotes

  1. As before, deterministically selected on the frontend, see https://github.com/elastic/kibana/pull/191598

@ilyannn ilyannn added release_note:enhancement enhancement New value added to drive a business result backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) 8.16 candidate Team:Security-Scalability Team label for Security Integrations Scalability Team Feature:AutomaticImport labels Oct 15, 2024
@ilyannn ilyannn self-assigned this Oct 15, 2024
@ilyannn ilyannn marked this pull request as ready for review October 15, 2024 07:59
@ilyannn ilyannn requested a review from a team as a code owner October 15, 2024 07:59
@elasticmachine

Pinging @elastic/security-scalability (Team:Security-Scalability)

bhapas
bhapas previously approved these changes Oct 15, 2024
@bhapas bhapas left a comment

Tested locally. Looks good overall.

@@ -7,11 +7,11 @@

import React from 'react';
import { act, fireEvent, render, waitFor, type RenderResult } from '@testing-library/react';
import '@testing-library/jest-dom';
Why this?

)
);

newStableSamples.sort();
@bhapas bhapas Oct 15, 2024

Why sort? To persist order?

Contributor Author

Honestly to make this readable when debugging.

Contributor

But the numbers are also converted to strings, so the sort order is like 1, 10, 100, 2, 21, 22, 23, 24, 3, 30, etc., which is not really readable in this case.

@ilyannn ilyannn Oct 15, 2024

Well it was better than nothing 😄 Anyway, agreed, I've removed it.

@bhapas bhapas left a comment

Probably need to add chunking to the related graph too. The run here seems to pick up all the pipeline_results to find one related field. We can reduce the number of tokens passed here.

@ilyannn

ilyannn commented Oct 15, 2024

Probably need to add chunking to the related graph too. The run here seems to pick up all the pipeline_results to find one related field. We can reduce the number of tokens passed here.

This will only reduce the number of tokens if we do not need new back-and-forth cycles. In the case you linked, we know there was a single related field, but if we implement the algorithm to always pass 20 samples and extrapolate the results to all 100 samples, there will likely be integrations where we miss the related fields in these 20 samples. Then the cost will be at least:

  • 20 samples passed
  • 100 samples passed for validation
  • another 20 samples passed
  • 100 samples passed for validation

which would be much larger than the current cost. The current approach also uses fewer tokens than ECS Mapping and Categorization.

I do agree we can think about reducing the number of tokens, but I think a much better way is to include additional information when doing ECS Mapping. We can just ask during that mapping if the field is likely to contain an IP, host, or user name and prune out all the other fields.
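As a rough sketch of the cost arithmetic in the comment above, counted in sample-passes rather than exact tokens (the batch sizes are the ones quoted in the thread; this is an illustration, not a measurement):

```typescript
// Illustrative cost comparison in sample-passes (not exact token counts).
// Current approach: pass all 100 samples once.
const currentCost = 100;

// Proposed 20-sample extrapolation, assuming one missed related field
// forces a second back-and-forth cycle:
// 20 reviewed + 100 validated, twice.
const proposedWorstCase = 20 + 100 + 20 + 100; // 240 sample-passes
```

So the unlucky path of the 20-sample variant costs more than double the current single pass, which is the "much larger" claim above.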

@bhapas bhapas left a comment


Sure. We can merge this PR now. Tested flows locally and everything seems to be working fine.

We can experiment with related graph in a different PR if you wish to.

@ilyannn ilyannn enabled auto-merge (squash) October 15, 2024 14:23
@ilyannn ilyannn merged commit fc3ce54 into elastic:main Oct 15, 2024
22 checks passed
@kibanamachine

Starting backport for target branches: 8.x

https://github.com/elastic/kibana/actions/runs/11350245785

@elasticmachine

💚 Build Succeeded

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
integrationAssistant 55 56 +1

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
integrationAssistant 878.4KB 878.1KB -297.0B
Unknown metric groups

API count

id before after diff
integrationAssistant 66 71 +5

History

cc @ilyannn

kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Oct 15, 2024
…6233)

## Release Notes

Automatic Import now analyses a larger number of samples to generate an
integration.

## Summary

Closes elastic/security-team#9844

**Added: Backend Sampling**

We pass 100 rows (these numeric values are adjustable) to the backend
[^1]

[^1]: As before, deterministically selected on the frontend, see
elastic#191598

The Categorization chain now processes the samples in batches,
performing after initial categorization a number of review cycles (but
not more than 5, tuned so that we stay under the 2 minute limit for a
single API call).

To decide when to stop processing we keep the list of _stable_ samples
as follows:

1. The list is initially empty.
2. For each review we select a random subset of 40 samples, preferring
to pick up the not-stable samples.
3. After each review – when the LLM potentially gives us new or changes
the old processors – we compare the new pipeline results with the old
pipeline results.
4. Those reviewed samples that did not change their categorization are
added to the stable list.
5. Any samples that have changed their categorization are removed from
the stable list.
6. If all samples are stable, we finish processing.

**Removed: User Notification**

Using 100 samples provides a balance between expected complexity and
time budget we work with. We might want to change it in the future,
possibly dynamically, making the specific number of no importance to the
user. Thus we remove the truncation notification.

**Unchanged:**

- No batching is made in the related chain: it seems to work as-is.

**Refactored:**

- We centralize the sizing constants in the
`x-pack/plugins/integration_assistant/common/constants.ts` file.
- We remove the unused state key `formattedSamples` and combine
`modelJSONInput` back into `modelInput`.

> [!NOTE]
> I had difficulty generating new graph diagrams, so they remain
unchanged.

(cherry picked from commit fc3ce54)
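The stable-samples bookkeeping in the review-cycle steps above can be sketched as follows. This is a minimal illustration: the function and variable names are invented here (not the actual Kibana implementation), and the random batch selection is replaced by a deterministic unstable-first pick for clarity.

```typescript
type SampleId = string;

// Step 2: pick a review batch, preferring samples not yet marked stable.
// (The real chain picks a random subset; this sketch picks deterministically.)
function selectReviewBatch(
  all: SampleId[],
  stable: Set<SampleId>,
  batchSize: number
): SampleId[] {
  const unstable = all.filter((id) => !stable.has(id));
  const alreadyStable = all.filter((id) => stable.has(id));
  return [...unstable, ...alreadyStable].slice(0, batchSize);
}

// Steps 3-5: after a review, compare old and new pipeline results for the
// reviewed samples; unchanged categorizations are added to the stable list,
// changed ones are removed from it.
function updateStableList(
  stable: Set<SampleId>,
  reviewed: SampleId[],
  oldResults: Map<SampleId, string>,
  newResults: Map<SampleId, string>
): Set<SampleId> {
  const next = new Set(stable);
  for (const id of reviewed) {
    if (oldResults.get(id) === newResults.get(id)) {
      next.add(id);
    } else {
      next.delete(id);
    }
  }
  return next;
}

// Step 6: stop once every sample is stable.
function isDone(all: SampleId[], stable: Set<SampleId>): boolean {
  return all.every((id) => stable.has(id));
}
```

A review loop would then call `selectReviewBatch`, run the revised pipeline, feed both result sets to `updateStableList`, and repeat until `isDone` returns true or the cycle cap (5 in this PR) is reached.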
@kibanamachine

💚 All backports created successfully

Status Branch Result
8.x

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

@ilyannn ilyannn deleted the auto-import/backend-sampling-rebuilt branch October 15, 2024 17:07
kibanamachine added a commit that referenced this pull request Oct 15, 2024
) (#196386)

# Backport

This will backport the following commits from `main` to `8.x`:
- [[Auto Import] Use larger number of samples on the backend
(#196233)](#196233)

<!--- Backport version: 9.4.3 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)


Co-authored-by: Ilya Nikokoshev <[email protected]>