
[Fleet] Agents are intermittently showing as off-line in Kibana Fleet #21025

Closed
EricDavisX opened this issue Sep 8, 2020 · 9 comments · Fixed by #21037
Labels
Ingest Management:beta2 (Group issues for ingest management beta2), regression

Comments

@EricDavisX
Contributor

[Fleet] Agents are being reported as going offline in Kibana Fleet. I'm also seeing many entries in the Agent Activity log with the same timestamp. Is that somehow expected? It seems strange. Screenshots are below.

  • Found on the latest deployed 8.0 snapshot. Maybe it's fixed already; I know snapshots had been broken for a few days. I'm filing a quick ticket with repro steps so we don't spend time on it if it's already OK.

Tested on:
https://kibana.endpoint.elastic.dev/app/ingestManager#/fleet/agents/946f0178-31e6-4e2d-be4b-556320bb55e0

  • Deployed nightly with a new/fresh install of the latest master.

As of now, it's running code from Sept 3 (today is Sept 8).
edavis-mbp:kibana_elastic edavis$ git show -s 60986d4f8202016c98409c2926ccf29d9d2ee7e0
commit 60986d4f8202016c98409c2926ccf29d9d2ee7e0
Author: Yuliia Naumenko [email protected]
Date: Thu Sep 3 13:07:23 2020 -0700

Maybe related to the bug cited by the e2e tests (logged against standalone mode, but maybe it's bigger than we knew): #20992

The 'type ahead' to get more/better info from the logs through Ingest is not working currently (logged separately). I can dig in and pull more logs from the agent hosts later if help is needed, but there's no need for this to sit idle waiting on me, so I'm filing it now.

This seems to impact both Agents that have Endpoint and those that don't. But all Endpoint integrations appear up and alive in the Security app, so the Agent must be basically OK!?

Screenshots (timestamps are repeated):
  • linux
  • win

Impacting both Endpoint and non-Endpoint enabled hosts:
  • non-endpoint-policy-agents
  • Endpoint-policy-agents

@elasticmachine
Collaborator

Pinging @elastic/ingest-management (Team:Ingest Management)

@ph
Contributor

ph commented Sep 8, 2020

@nchaulet Could it be linked to the performance changes we did?

@nchaulet
Member

nchaulet commented Sep 8, 2020

It could be linked, yes. It looks like the same events are sent again and again; maybe the change I made to add a 5-minute timeout is not working here.

I am doing some tests to check what is happening here.

@nchaulet
Member

nchaulet commented Sep 8, 2020

Just did a test against https://kibana.endpoint.elastic.dev/ and my agent is correctly reported as online.

(screenshot: Screen Shot 2020-09-08 at 4.03.07 PM)

@EricDavisX Do you know how those agents are run? Somewhere on a server, running as a service? And do we have logs from these agents?

@EricDavisX
Contributor Author

EricDavisX commented Sep 9, 2020

I do know! Our wiki page at /display/DEV/Endpoint+and+Ingest+Nightly+Dev+Demo+Server has details.
For now you can look at the siem-team repo:
https://github.com/elastic/siem-team/blob/master/cm/ansible/roles/deploy-agent/tasks/linux-main.yml

It will show something like this in Ansible:

- name: Create install directory
  file:
    path: "{{ install_dir_linux }}"
    mode: "0755"
    state: directory

- name: Set download url
  set_fact:
    agent_url: "{{ snapshots.json | json_query('packages.\"' + agent_handle_linux + '.tar.gz\".url') }}"

- name: Download and Extract Agent zip
  unarchive:
    remote_src: yes
    src: "{{ agent_url }}"
    dest: "{{ install_dir_linux }}"

- name: Enroll the agent
  become: yes
  shell:  "{{ install_dir_linux }}/{{ agent_handle_linux }}/elastic-agent enroll -f https://{{ kibana_username }}:{{ kibana_password }}@kibana.{{ domain_name }}:443 {{ enroll_token }}"

- name: Create the service file
  template:
    dest: /etc/systemd/system/fleet-agent.service
    src: fleet-agent.service.j2
    mode: '0644'
  register: service_file

- name: reload systemd configs to pickup changes
  systemd:
    daemon_reload: yes
  when: service_file.changed

- name: restart fleet-agent service
  systemd:
    name: fleet-agent.service
    state: restarted
    enabled: yes

It has worked before, and still seems to work, to start the Agent (I'm just not sure how long the agents will stay up?).
There are other Ansible files that get the token and do more supporting things. The fleet-agent.service.j2 template referenced above isn't included here; a rough sketch is below.
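
For reference, a minimal sketch of what a fleet-agent.service.j2 template like the one referenced above could contain. This is an assumption for illustration only, not the actual template from the siem-team repo; the elastic-agent run invocation and the paths are guesses based on the install tasks above.

# Hypothetical fleet-agent.service.j2 -- illustrative sketch, not the real template
[Unit]
Description=Elastic Agent enrolled in Fleet
After=network-online.target
Wants=network-online.target

[Service]
# Assumes the Agent was extracted to {{ install_dir_linux }}/{{ agent_handle_linux }} as in the tasks above
ExecStart={{ install_dir_linux }}/{{ agent_handle_linux }}/elastic-agent run
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target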

@EricDavisX
Contributor Author

> Just did a test against https://kibana.endpoint.elastic.dev/ and my agent is correctly reported as online.
>
> (screenshot: Screen Shot 2020-09-08 at 4.03.07 PM)
>
> @EricDavisX Do you know how those agents are run? Somewhere on a server, running as a service?

Also @nchaulet, if you used a 7.8 Agent, that's totally cheating. :) Can we confirm again and keep researching with a full 8.0 env? It's helpful to know, though, that the older Agent works; it suggests the problem may indeed be on the Agent side.

@nchaulet
Member

nchaulet commented Sep 9, 2020

Got a repro locally with a timeout. My bad, I did not check with @michalpristas or @blakerouse what the timeout is for the checkin request. @michalpristas, how complicated is it to modify the timeout for the checkin request? (It's set to 5 minutes on the Kibana side.)

20-09-08T21:41:12.003-0400	ERROR	application/fleet_gateway.go:176	Could not communicate with Checking API will retry, error: fail to checkin to fleet: Post "http://localhost:5601/api/ingest_manager/fleet/agents/7b3785e6-7b3d-4d24-836d-bcf3de6ff8aa/checkin?": net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Until we have a proper fix, this can be worked around by adding this to the kibana.yml config:

xpack.ingestManager.fleet.pollingRequestTimeout: 60000
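
To illustrate the mismatch, here is a minimal Go sketch (not the actual agent code) of why a client-side timeout shorter than Kibana's long-poll window produces the error in the log above: the client aborts the checkin request before Kibana ever responds. The 60-second timeout, the URL, and the placeholder agent ID are assumptions taken from the log line and the config setting above.

package main

import (
	"fmt"
	"net/http"
	"strings"
	"time"
)

func main() {
	// Hypothetical agent-side client: gives up after 60 seconds,
	// while Kibana may hold the checkin request open for up to 5 minutes.
	client := &http.Client{Timeout: 60 * time.Second}

	// Hypothetical checkin URL; the agent ID is a placeholder.
	url := "http://localhost:5601/api/ingest_manager/fleet/agents/<agent-id>/checkin"

	resp, err := client.Post(url, "application/json", strings.NewReader("{}"))
	if err != nil {
		// Against a long-polling server this fails with:
		// net/http: request canceled (Client.Timeout exceeded while awaiting headers)
		fmt.Println("checkin failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("checkin status:", resp.Status)
}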

@michalpristas
Contributor

michalpristas commented Sep 9, 2020

@nchaulet Not complicated at all; I will prepare a PR.
Can you link me to the change that caused this? (So I can link the PRs.)

@ph
Contributor

ph commented Sep 9, 2020

@nchaulet If I understand correctly, this issue should be assigned to @michalpristas?
