[Elastic Agent] Default processors created per input can result in high agent CPU usage #35000

cmacknz · 2023-04-03T19:15:54Z

Background

Starting from 8.6 the default global processors for Beats that are run by agent are configured in code instead of being read from the default Beat configuration file. Beats managed by agent no longer read a configuration file at startup and instead wait for their initial configuration to be sent by agent. This change was done in #34149.

The implementation from #34149 makes the processors global by configuring them for each input run by Elastic Agent. Looking at the beat-rendered-config.yml file available in the agent diagnostics for example shows the processors at the input level:

- data_stream:
    dataset: system.auth
    type: logs
  exclude_files:
  - .gz$
  id: logfile-system.auth-8913b026-f53e-43f4-909b-d7c91335f141
  index: logs-system.auth-vault
  multiline:
    match: after
    pattern: ^\s
  paths:
  - /var/log/auth.log*
  - /var/log/secure*
  processors:
  - add_host_metadata:
      when:
        not:
          contains:
            tags: forwarded
  - add_cloud_metadata: null
  - add_docker_metadata: null
  - add_kubernetes_metadata: null

This is in contrast to the default configuration file which defines a single instance of each processor for the process by defining them at the top level of the configuration (see Where are processors valid for details):

beats/x-pack/filebeat/filebeat.yml

Lines 167 to 172 in 91906c9

    
           processors: 
        
             - add_host_metadata: 
        
                 when.not.contains.tags: forwarded 
        
             - add_cloud_metadata: ~ 
        
             - add_docker_metadata: ~ 
        
             - add_kubernetes_metadata: ~

Problem

Similarly in 8.6 there was a change to the aws-s3 input to create a new beat.Client for each new SQS worker in #33658 to improve performance. This results in a new input pipeline being constructed for each SQS worker, each of which gets its own instance of the per input processors as of 8.6.

This was not a problem until 8.7, when it was discovered that each instance of a beat input pipeline was referencing an accidentally global instance of the per input processors. This was fixed in #34761. The change in #34761 now results in each input pipeline constructing a new instance of the global processors.

Each of these global processors is expensive to create and includes code to try to perform expensive work only at initialization time. The problem is this only works if there is a single instance of the processor, otherwise each unique instance of the processor attempts to reinitialize itself often performing the exact same request multiple times. For example:

In the case of Docker or Kubernetes each instance is subscribing/watching the Docker/k8s API for changes. And each instance is keeping its own in-memory state of the Docker/k8s resources. This ideally should only be done once per process for both the sake of the Beat and the Docker/K8s API.
The repeated construction of these processors by inputs like aws-s3 and filestream will be slow. These processors generally have a high instantiation cost. Like for docker the constructor creates a new docker client and tries making an API call before returning. Or in add_cloud_metadata there was in inadvertent change that introduces a 3s worst-case construction cost ([libbeat] add_cloud_metadata - startup blocked by AWS IMSDv2 token fetch #33058).

Impact

As of 8.7 we observing extremely high CPU usage for Beats run under agent and the agent itself in situations where inputs are frequently created. For example in the case of the add_cloud_metadata processor we are observing the agent itself being spammed by repeated log messages from the add_cloud_metadata initialization sequence:

{"log.level":"info","@timestamp":"2023-04-01T00:14:59.605Z","message":"add_cloud_metadata: hosting provider type detected as aws, metadata={\"cloud\":{\"account\":{\"id\":\"REDACTED\"},\"availability_zone\":\"us-west-2d\",\"image\":{\"id\":\"ami-REDACTED\"},\"instance\":{\"id\":\"i-REDACTED\"},\"machine\":{\"type\":\"r6g.xlarge\"},\"provider\":\"aws\",\"region\":\"us-west-2\",\"service\":{\"name\":\"EC2\"}}}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-s3-default","type":"aws-s3"},"log":{"source":"aws-s3-default"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"add_cloud_metadata","log.origin":{"file.line":106,"file.name":"add_cloud_metadata/add_cloud_metadata.go"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-01T00:14:59.715Z","message":"add_cloud_metadata: hosting provider type detected as aws, metadata={\"cloud\":{\"account\":{\"id\":\"REDACTED\"},\"availability_zone\":\"us-west-2d\",\"image\":{\"id\":\"ami-REDACTED\"},\"instance\":{\"id\":\"i-REDACTED\"},\"machine\":{\"type\":\"r6g.xlarge\"},\"provider\":\"aws\",\"region\":\"us-west-2\",\"service\":{\"name\":\"EC2\"}}}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-s3-default","type":"aws-s3"},"log":{"source":"aws-s3-default"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"add_cloud_metadata","log.origin":{"file.line":106,"file.name":"add_cloud_metadata/add_cloud_metadata.go"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-01T00:14:59.716Z","message":"add_cloud_metadata: hosting provider type detected as aws, metadata={\"cloud\":{\"account\":{\"id\":\"REDACTED\"},\"availability_zone\":\"us-west-2d\",\"image\":{\"id\":\"ami-REDACTED\"},\"instance\":{\"id\":\"i-REDACTED\"},\"machine\":{\"type\":\"r6g.xlarge\"},\"provider\":\"aws\",\"region\":\"us-west-2\",\"service\":{\"name\":\"EC2\"}}}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-s3-default","type":"aws-s3"},"log":{"source":"aws-s3-default"},"log.logger":"add_cloud_metadata","log.origin":{"file.line":106,"file.name":"add_cloud_metadata/add_cloud_metadata.go"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-01T00:14:59.824Z","message":"add_cloud_metadata: hosting provider type detected as aws, metadata={\"cloud\":{\"account\":{\"id\":\"REDACTED\"},\"availability_zone\":\"us-west-2d\",\"image\":{\"id\":\"ami-REDACTED\"},\"instance\":{\"id\":\"i-REDACTED\"},\"machine\":{\"type\":\"r6g.xlarge\"},\"provider\":\"aws\",\"region\":\"us-west-2\",\"service\":{\"name\":\"EC2\"}}}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-s3-default","type":"aws-s3"},"log":{"source":"aws-s3-default"},"log.origin":{"file.line":106,"file.name":"add_cloud_metadata/add_cloud_metadata.go"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"add_cloud_metadata","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-01T00:15:00.062Z","message":"add_cloud_metadata: hosting provider type detected as aws, metadata={\"cloud\":{\"account\":{\"id\":\"REDACTED\"},\"availability_zone\":\"us-west-2d\",\"image\":{\"id\":\"ami-REDACTED\"},\"instance\":{\"id\":\"i-REDACTED\"},\"machine\":{\"type\":\"r6g.xlarge\"},\"provider\":\"aws\",\"region\":\"us-west-2\",\"service\":{\"name\":\"EC2\"}}}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-s3-default","type":"aws-s3"},"log":{"source":"aws-s3-default"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"add_cloud_metadata","log.origin":{"file.line":106,"file.name":"add_cloud_metadata/add_cloud_metadata.go"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-01T00:15:00.194Z","message":"add_cloud_metadata: hosting provider type detected as aws, metadata={\"cloud\":{\"account\":{\"id\":\"REDACTED\"},\"availability_zone\":\"us-west-2d\",\"image\":{\"id\":\"ami-REDACTED\"},\"instance\":{\"id\":\"i-REDACTED\"},\"machine\":{\"type\":\"r6g.xlarge\"},\"provider\":\"aws\",\"region\":\"us-west-2\",\"service\":{\"name\":\"EC2\"}}}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-s3-default","type":"aws-s3"},"log":{"source":"aws-s3-default"},"log.origin":{"file.line":106,"file.name":"add_cloud_metadata/add_cloud_metadata.go"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"add_cloud_metadata","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-01T00:15:00.282Z","message":"add_cloud_metadata: hosting provider type detected as aws, metadata={\"cloud\":{\"account\":{\"id\":\"REDACTED\"},\"availability_zone\":\"us-west-2d\",\"image\":{\"id\":\"ami-REDACTED\"},\"instance\":{\"id\":\"i-REDACTED\"},\"machine\":{\"type\":\"r6g.xlarge\"},\"provider\":\"aws\",\"region\":\"us-west-2\",\"service\":{\"name\":\"EC2\"}}}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-s3-default","type":"aws-s3"},"log":{"source":"aws-s3-default"},"log.logger":"add_cloud_metadata","log.origin":{"file.line":106,"file.name":"add_cloud_metadata/add_cloud_metadata.go"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-01T00:15:00.394Z","message":"add_cloud_metadata: hosting provider type detected as aws, metadata={\"cloud\":{\"account\":{\"id\":\"REDACTED\"},\"availability_zone\":\"us-west-2d\",\"image\":{\"id\":\"ami-REDACTED\"},\"instance\":{\"id\":\"i-REDACTED\"},\"machine\":{\"type\":\"r6g.xlarge\"},\"provider\":\"aws\",\"region\":\"us-west-2\",\"service\":{\"name\":\"EC2\"}}}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-s3-default","type":"aws-s3"},"log":{"source":"aws-s3-default"},"log.logger":"add_cloud_metadata","log.origin":{"file.line":106,"file.name":"add_cloud_metadata/add_cloud_metadata.go"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-01T00:15:00.477Z","message":"add_cloud_metadata: hosting provider type detected as aws, metadata={\"cloud\":{\"account\":{\"id\":\"REDACTED\"},\"availability_zone\":\"us-west-2d\",\"image\":{\"id\":\"ami-REDACTED\"},\"instance\":{\"id\":\"i-REDACTED\"},\"machine\":{\"type\":\"r6g.xlarge\"},\"provider\":\"aws\",\"region\":\"us-west-2\",\"service\":{\"name\":\"EC2\"}}}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-s3-default","type":"aws-s3"},"log":{"source":"aws-s3-default"},"log.logger":"add_cloud_metadata","log.origin":{"file.line":106,"file.name":"add_cloud_metadata/add_cloud_metadata.go"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-01T00:15:00.542Z","message":"add_cloud_metadata: hosting provider type detected as aws, metadata={\"cloud\":{\"account\":{\"id\":\"REDACTED\"},\"availability_zone\":\"us-west-2d\",\"image\":{\"id\":\"ami-REDACTED\"},\"instance\":{\"id\":\"i-REDACTED\"},\"machine\":{\"type\":\"r6g.xlarge\"},\"provider\":\"aws\",\"region\":\"us-west-2\",\"service\":{\"name\":\"EC2\"}}}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-s3-default","type":"aws-s3"},"log":{"source":"aws-s3-default"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"add_cloud_metadata","log.origin":{"file.line":106,"file.name":"add_cloud_metadata/add_cloud_metadata.go"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-01T00:15:00.629Z","message":"add_cloud_metadata: hosting provider type detected as aws, metadata={\"cloud\":{\"account\":{\"id\":\"REDACTED\"},\"availability_zone\":\"us-west-2d\",\"image\":{\"id\":\"ami-REDACTED\"},\"instance\":{\"id\":\"i-REDACTED\"},\"machine\":{\"type\":\"r6g.xlarge\"},\"provider\":\"aws\",\"region\":\"us-west-2\",\"service\":{\"name\":\"EC2\"}}}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"aws-s3-default","type":"aws-s3"},"log":{"source":"aws-s3-default"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"add_cloud_metadata","log.origin":{"file.line":106,"file.name":"add_cloud_metadata/add_cloud_metadata.go"},"ecs.version":"1.6.0"}

This comes from the code below which includes a sync.Once block that is being defeated by a new instance of the processor being created for each individual input:

beats/libbeat/processors/add_cloud_metadata/add_cloud_metadata.go

Lines 98 to 112 in 91906c9

    
           func (p *addCloudMetadata) init() { 
        
           	p.initOnce.Do(func() { 
        
           		result := p.fetchMetadata() 
        
           		if result == nil { 
        
           			p.logger.Info("add_cloud_metadata: hosting provider type not detected.") 
        
           			return 
        
           		} 
        
           		p.metadata = result.metadata 
        
           		p.logger.Infof("add_cloud_metadata: hosting provider type detected as %v, metadata=%v", 
        
           			result.provider, result.metadata.String()) 
        
           	}) 
        
           } 
        
           func (p *addCloudMetadata) getMeta() mapstr.M { 
        
           	p.init()

Solution

When Beats run under agent we need to create the default global processors at the Beat process level, instead of the input level to match what is done in the global configuration file.

The text was updated successfully, but these errors were encountered:

elasticmachine · 2023-04-03T19:15:57Z

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

rdner · 2023-04-04T10:01:45Z

It does not seem like the processors are marked as global in #34149 they're just added to the config, right? We would need a way to distinguish them from the rest.

However, I'm against using anything global. Global processors are very dangerous as we learnt in #34761

Perhaps instead we should modify the processors themselves to better manage their connections (singleton connections shared by all the processors of this type) instead of hacking the global processors. This makes way more sense to me.

cmacknz · 2023-04-04T12:49:44Z

These processors were global in every release prior to 8.6. Global processors are a standard feature of Beats, and a commonly requested feature for agent (https://github.com/elastic/ingest-dev/issues/2442). They aren't something we can get away with not supporting.

I agree that where the processors are instantiated shouldn't matter, but in this case we have a critical performance degradation that can easily be fixed by reverting to a known good configuration. This is significantly less risk and effort than rewriting each of the default add_x_metadata processors.

I don't see a reason why we shouldn't take the easiest path and restore the behaviour we had before, even if we don't like the way these processors work architecturally.

rdner · 2023-04-04T15:37:25Z

This write-up might be helpful for the person working on this issue #34716 (comment)

Ignore the issue itself, the comment contains a lot of processor-related code that will help to navigate.

cmacknz added Team:Elastic-Agent Label for the Agent team v8.7.0 labels Apr 3, 2023

cmacknz added the Agent label Apr 3, 2023

cmacknz changed the title ~~[Elastic Agent] Default processors should be created at the process level instead of per input~~ [Elastic Agent] Default processors are created per input can result in high agent CPU usage Apr 3, 2023

cmacknz changed the title ~~[Elastic Agent] Default processors are created per input can result in high agent CPU usage~~ [Elastic Agent] Default processors created per input can result in high agent CPU usage Apr 3, 2023

pierrehilbert assigned fearful-symmetry Apr 4, 2023

cmacknz mentioned this issue Apr 4, 2023

Migrate the default global Beat processors to the shipper elastic/elastic-agent-shipper#292

Open

fearful-symmetry mentioned this issue Apr 5, 2023

Fix performance issues with processors scaling under agent #35031

Merged

4 tasks

pierrehilbert closed this as completed in #35031 Apr 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Elastic Agent] Default processors created per input can result in high agent CPU usage #35000

[Elastic Agent] Default processors created per input can result in high agent CPU usage #35000

cmacknz commented Apr 3, 2023

elasticmachine commented Apr 3, 2023

rdner commented Apr 4, 2023

cmacknz commented Apr 4, 2023

rdner commented Apr 4, 2023 •

edited

Loading

[Elastic Agent] Default processors created per input can result in high agent CPU usage #35000

[Elastic Agent] Default processors created per input can result in high agent CPU usage #35000

Comments

cmacknz commented Apr 3, 2023

Background

Problem

Impact

Solution

elasticmachine commented Apr 3, 2023

rdner commented Apr 4, 2023

cmacknz commented Apr 4, 2023

rdner commented Apr 4, 2023 • edited Loading

rdner commented Apr 4, 2023 •

edited

Loading