
Unstable connection between master and workers according to cluster.log in Reliability test #4385

Closed
pro-akim opened this issue Aug 2, 2023 · 7 comments · Fixed by #5477
Assignees
Labels
level/task (Task issue) · qa_known (Issues that are already known by the QA team) · type/bug

Comments

@pro-akim
Member

pro-akim commented Aug 2, 2023

Wazuh version: 4.6.0
Component: Wazuh server cluster

Description

An error in the automated tests is produced by the output generated in the cluster.log file. The log shows that the machine fails to connect to its master node, and the same message appears consistently throughout the file.

Running reliability/test_cluster/test_cluster_logs/test_cluster_connection/test_cluster_connection.py::test_cluster_connection

  if disconnected_nodes:
>           pytest.fail(f'The following nodes disconnected from master at any point:\n- ' + '\n- '.join(disconnected_nodes))
E           Failed: The following nodes disconnected from master at any point:
E           - worker_21
E           - worker_13
E           - worker_4
E           - worker_6
E           - worker_18
E           - worker_20
E           - worker_11
E           - worker_2
E           - worker_14
E           - worker_3
E           - worker_9
E           - worker_17
E           - worker_22
E           - worker_7
E           - worker_5
E           - worker_19
E           - worker_12
E           - worker_25
E           - worker_8
E           - worker_24
E           - worker_23
E           - worker_10
E           - worker_16
E           - worker_15
E           - worker_1

test_cluster_logs/test_cluster_connection/test_cluster_connection.py:47: Failed
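For context, the check that produces this failure essentially scans each worker's cluster.log under the artifacts path and flags any worker that logged a disconnection from the master. Below is a minimal sketch of that idea, assuming a hypothetical worker_*/logs/cluster.log layout, hypothetical disconnection patterns, and an artifacts_path fixture; the real test in test_cluster_connection.py may differ:

```python
# Minimal sketch, not the actual test: flag workers whose cluster.log
# contains a disconnection message at any point.
import re
from pathlib import Path

import pytest

# Assumed disconnection markers; the real test may match other lines.
DISCONNECTION_RE = re.compile(r'The master closed the connection|Connection reset')


def find_disconnected_nodes(artifacts_path):
    """Return the names of workers that logged a disconnection from the master."""
    disconnected = []
    # Assumed artifacts layout: <artifacts_path>/worker_N/logs/cluster.log
    for log_file in Path(artifacts_path).glob('worker_*/logs/cluster.log'):
        if DISCONNECTION_RE.search(log_file.read_text(errors='replace')):
            disconnected.append(log_file.parent.parent.name)
    return disconnected


def test_cluster_connection(artifacts_path):  # artifacts_path: assumed pytest fixture
    disconnected_nodes = find_disconnected_nodes(artifacts_path)
    if disconnected_nodes:
        pytest.fail('The following nodes disconnected from master at any point:\n- '
                    + '\n- '.join(disconnected_nodes))
```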

Steps to reproduce

It can be reproduced by running the Workload benchmark test pipeline and then running the reliability tests manually using the wazuh-qa tests for 4.6.0.
The pipeline should be run with the same parameters as https://ci.wazuh.info/job/CLUSTER-Workload_benchmarks_metrics/274/

Report is attached here.

@pro-akim pro-akim closed this as completed Aug 2, 2023
@pro-akim pro-akim transferred this issue from wazuh/wazuh Aug 2, 2023
@pro-akim pro-akim reopened this Aug 2, 2023
@pro-akim
Member Author

pro-akim commented Aug 3, 2023

Update

It seems related to wazuh/wazuh#17928

@Selutario
Contributor

Selutario commented Dec 13, 2023

All the reliability tests:

  • test_cluster_connection
  • test_cluster_error_logs
  • test_check_logs_order_workers
  • test_check_logs_order_master
  • test_cluster_sync
  • test_cluster_task_order

are not currently prepared to analyze logs where one of the managers has been restarted. They should be adapted to handle this correctly, since it is a common scenario in the workload benchmarks when the API performance is being tested.

For example, test_check_logs_order_workers checks whether the logs of every cluster task (Agent-info sync, Integrity check, etc.) appear in the expected order. However, tasks are reinitialized when the connection is broken by a cluster restart, so the test should also reset the expected order when it finds a broken-connection log, as in the sketch below.
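A minimal sketch of that adaptation, assuming the test keeps a per-task pointer into an expected message sequence (the task names and message prefixes are taken from worker cluster.log lines; the actual test code is structured differently and may track more tasks):

```python
# Hypothetical sketch: restart the expected-order check whenever a
# broken-connection line appears, since cluster tasks are reinitialized then.
import re

# Assumed reset markers; the real test may use other patterns.
CONNECTION_BROKEN_RE = re.compile(r'The master closed the connection|Connection reset')
TASK_RE = re.compile(r'\[(Agent-info sync|Integrity check)\] (.+)')

# Illustrative per-task message order (prefixes of the expected log lines).
EXPECTED_ORDER = {
    'Agent-info sync': ['Permission to synchronize granted', 'Starting', 'Obtained',
                        'Sending chunks', 'Finished'],
    'Integrity check': ['Permission to synchronize granted', 'Starting', 'Compressing',
                        'Sending zip file', 'Zip file sent', 'Finished'],
}


def check_worker_log_order(lines):
    """Assert that task logs appear in order, restarting after broken connections."""
    position = {task: 0 for task in EXPECTED_ORDER}
    for line in lines:
        if CONNECTION_BROKEN_RE.search(line):
            # Connection broken: tasks start over, so the expected order does too.
            position = {task: 0 for task in EXPECTED_ORDER}
            continue
        match = TASK_RE.search(line)
        if not match:
            continue
        task, message = match.groups()
        expected = EXPECTED_ORDER[task][position[task]]
        assert message.startswith(expected), f'Unexpected {task} log: {message}'
        position[task] = (position[task] + 1) % len(EXPECTED_ORDER[task])
```

Without the reset, the first 'Permission to synchronize granted' line logged after a reconnection would be compared against the 'Finished' entry still pending from the interrupted cycle, and the check would fail.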

@Selutario
Contributor

It was reported here that some tests did not run in the latest build. We should fix it as part of this issue if the problem keeps happening:

@nico-stefani
Member

Update

I have been reading the tests' code and taking the first steps toward a resolution.

@nico-stefani
Member

Update

I have been debugging the test_check_logs_order_workers test and got the check working for a particular log sample:

2023/08/14 19:42:19 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Agent-info sync] Permission to synchronize granted.
2023/08/14 19:42:19 INFO: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Agent-info sync] Starting.
2023/08/14 19:42:19 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Agent-info sync] Obtained 29 chunks of data in 0.235s.
2023/08/14 19:42:19 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Agent-info sync] Sending chunks.
2023/08/14 19:42:20 INFO: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Main] The master closed the connection
2023/08/14 19:43:50 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Agent-info sync] Permission to synchronize granted.
2023/08/14 19:43:50 INFO: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Agent-info sync] Starting.
2023/08/14 19:43:50 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Agent-info sync] Obtained 29 chunks of data in 0.232s.
2023/08/14 19:43:52 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Agent-info sync] Sending chunks.
2023/08/14 19:43:54 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Main] Command received: 'b'dapi''
2023/08/14 19:43:54 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [D API] Received request: b'master*124633 {"f": {"__callable__": {"__name__": "restart", "__module__": "wazuh.manager", "__qualname__": "restart", "__type__": "function"}}, "f_kwargs": {}, "request_type": "distributed_master", "wait_for_complete": false, "from_cluster": true, "is_async": false, "local_client_arg": null, "basic_services": ["wazuh-modulesd", "wazuh-analysisd", "wazuh-execd", "wazuh-db", "wazuh-remoted"], "rbac_permissions": {"agent:create": {"*:*:*": "allow"}, "group:create": {"*:*:*": "allow"}, "agent:read": {"agent:id:*": "allow", "agent:group:*": "allow"}, "agent:delete": {"agent:id:*": "allow", "agent:group:*": "allow"}, "agent:modify_group": {"agent:id:*": "allow", "agent:group:*": "allow"}, "agent:reconnect": {"agent:id:*": "allow", "agent:group:*": "allow"}, "agent:restart": {"agent:id:*": "allow", "agent:group:*": "allow"}, "agent:upgrade": {"agent:id:*": "allow", "agent:group:*": "allow"}, "group:read": {"group:id:*": "allow"}, "group:delete": {"group:id:*": "allow"}, "group:update_config": {"group:id:*": "allow"}, "group:modify_assignments": {"group:id:*": "allow"}, "active-response:command": {"agent:id:*": "allow"}, "security:create": {"*:*:*": "allow"}, "security:create_user": {"*:*:*": "allow"}, "security:read_config": {"*:*:*": "allow"}, "security:update_config": {"*:*:*": "allow"}, "security:revoke": {"*:*:*": "allow"}, "security:edit_run_as": {"*:*:*": "allow"}, "security:read": {"role:id:*": "allow", "policy:id:*": "allow", "user:id:*": "allow", "rule:id:*": "allow"}, "security:update": {"role:id:*": "allow", "policy:id:*": "allow", "user:id:*": "allow", "rule:id:*": "allow"}, "security:delete": {"role:id:*": "allow", "policy:id:*": "allow", "user:id:*": "allow", "rule:id:*": "allow"}, "cluster:status": {"*:*:*": "allow"}, "manager:read": {"*:*:*": "allow"}, "manager:read_api_config": {"*:*:*": "allow"}, "manager:update_config": {"*:*:*": "allow"}, "manager:restart": {"*:*:*": "allow"}, "cluster:read_api_config": {"node:id:*": "allow"}, "cluster:read": {"node:id:*": "allow"}, "cluster:restart": {"node:id:*": "allow"}, "cluster:update_config": {"node:id:*": "allow"}, "ciscat:read": {"agent:id:*": "allow"}, "decoders:read": {"decoder:file:*": "allow"}, "decoders:delete": {"decoder:file:*": "allow"}, "decoders:update": {"*:*:*": "allow"}, "lists:read": {"list:file:*": "allow"}, "lists:delete": {"list:file:*": "allow"}, "lists:update": {"*:*:*": "allow"}, "rootcheck:clear": {"agent:id:*": "allow"}, "rootcheck:read": {"agent:id:*": "allow"}, "rootcheck:run": {"agent:id:*": "allow"}, "rules:read": {"rule:file:*": "allow"}, "rules:delete": {"rule:file:*": "allow"}, "rules:update": {"*:*:*": "allow"}, "mitre:read": {"*:*:*": "allow"}, "sca:read": {"agent:id:*": "allow"}, "syscheck:clear": {"agent:id:*": "allow"}, "syscheck:read": {"agent:id:*": "allow"}, "syscheck:run": {"agent:id:*": "allow"}, "syscollector:read": {"agent:id:*": "allow"}, "logtest:run": {"*:*:*": "allow"}, "task:status": {"*:*:*": "allow"}, "vulnerability:read": {"agent:id:*": "allow"}, "vulnerability:run": {"*:*:*": "allow"}, "rbac_mode": "white"}, "current_user": "", "broadcasting": true, "nodes": ["master", "CLUSTER-Workload_benchmarks_metrics_B283_manager_15", "CLUSTER-Workload_benchmarks_metrics_B283_manager_1", "CLUSTER-Workload_benchmarks_metrics_B283_manager_10", "CLUSTER-Workload_benchmarks_metrics_B283_manager_25", "CLUSTER-Workload_benchmarks_metrics_B283_manager_13", 
"CLUSTER-Workload_benchmarks_metrics_B283_manager_2", "CLUSTER-Workload_benchmarks_metrics_B283_manager_19", "CLUSTER-Workload_benchmarks_metrics_B283_manager_11", "CLUSTER-Workload_benchmarks_metrics_B283_manager_6", "CLUSTER-Workload_benchmarks_metrics_B283_manager_20", "CLUSTER-Workload_benchmarks_metrics_B283_manager_8", "CLUSTER-Workload_benchmarks_metrics_B283_manager_4", "CLUSTER-Workload_benchmarks_metrics_B283_manager_7", "CLUSTER-Workload_benchmarks_metrics_B283_manager_9", "CLUSTER-Workload_benchmarks_metrics_B283_manager_22", "CLUSTER-Workload_benchmarks_metrics_B283_manager_12", "CLUSTER-Workload_benchmarks_metrics_B283_manager_16", "CLUSTER-Workload_benchmarks_metrics_B283_manager_17", "CLUSTER-Workload_benchmarks_metrics_B283_manager_14", "CLUSTER-Workload_benchmarks_metrics_B283_manager_24", "CLUSTER-Workload_benchmarks_metrics_B283_manager_23", "CLUSTER-Workload_benchmarks_metrics_B283_manager_5", "CLUSTER-Workload_benchmarks_metrics_B283_manager_18", "CLUSTER-Workload_benchmarks_metrics_B283_manager_21", "CLUSTER-Workload_benchmarks_metrics_B283_manager_3"], "api_timeout": 10}'
2023/08/14 19:43:54 INFO: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [D API] Receiving request: restart from master (124633)
2023/08/14 19:43:54 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [D API] Receiving parameters {}
2023/08/14 19:43:54 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [D API] Starting to execute request locally
2023/08/14 19:43:54 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [D API] Finished executing request locally
2023/08/14 19:43:54 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [D API] Time calculating request result: 0.024s
2023/08/14 19:43:55 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Main] Command received: 'b'syn_m_a_e''
2023/08/14 19:43:55 INFO: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Agent-info sync] Finished in 5.318s. Updated 29 chunks.
2023/08/14 19:43:58 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Integrity check] Permission to synchronize granted.
2023/08/14 19:43:58 INFO: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Integrity check] Starting.
2023/08/14 19:43:58 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Integrity check] Compressing 'files_metadata.json' of 34 files.
2023/08/14 19:43:58 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Integrity check] Sending zip file to master.
2023/08/14 19:43:58 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Integrity check] Zip file sent to master.
2023/08/14 19:43:58 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Main] Command received: 'b'syn_m_c_ok''
2023/08/14 19:43:58 INFO: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Integrity check] Finished in 0.051s. Sync not required.
2023/08/14 19:43:59 INFO: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Main] The master closed the connection
2023/08/14 19:43:59 INFO: [Cluster] [Main] SIGNAL [(15)-(SIGTERM)] received. Exit...
2023/08/14 19:44:18 DEBUG: [Cluster] [Main] Removing '/var/ossec/queue/cluster/'.
2023/08/14 19:44:18 DEBUG: [Cluster] [Main] Removed '/var/ossec/queue/cluster/'.
2023/08/14 19:44:19 INFO: [Local Server] [Main] Serving on /var/ossec/queue/cluster/c-internal.sock
2023/08/14 19:44:19 DEBUG: [Local Server] [Keep alive] Calculating.
2023/08/14 19:44:19 DEBUG: [Local Server] [Keep alive] Calculated.
(wqa) ➜  wazuh-qa git:(fix/4385-unstable-realiability-tests) ✗ pytest -s --artifacts_path=/home/nstefani/Downloads/283/ tests/reliability/test_cluster/test_cluster_logs/test_cluster_worker_logs_order/test_cluster_worker_logs_order.py
================================================================================================ test session starts ================================================================================================
platform linux -- Python 3.9.16, pytest-7.1.2, pluggy-0.13.1
rootdir: /home/nstefani/git/wazuh-qa/tests, configfile: pytest.ini
plugins: html-3.1.1, testinfra-5.0.0, metadata-2.0.4
collected 1 item

tests/reliability/test_cluster/test_cluster_logs/test_cluster_worker_logs_order/test_cluster_worker_logs_order.py .

================================================================================================= 1 passed in 0.05s =================================================================================================

The next step is to generalize this solution for more cases.
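One possible way to generalize it (an assumption on my part, not necessarily what the eventual fix does) is to treat the other restart indicators visible in the sample above, not only the closed-connection line, as points where the per-task state is reset:

```python
# Hypothetical extension of the reset markers from the earlier sketch:
# any of these lines in the sample above indicates the node is restarting.
import re

RESET_MARKERS_RE = re.compile(
    r'The master closed the connection'
    r'|Connection reset'
    r'|SIGNAL \[\(15\)-\(SIGTERM\)\] received'
    r'|Serving on .*c-internal\.sock'
)
```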

@nico-stefani
Member

nico-stefani commented May 29, 2024

Update

I've been working on the fix for test_cluster_connection.

I still need to investigate the behavior of test_check_logs_order_workers with the changes introduced in wazuh/wazuh#19888.

@nico-stefani nico-stefani linked a pull request May 30, 2024 that will close this issue
@nico-stefani
Member

Update

I've been working on the requested changes.
