Merge pull request #48 from Constellation-Labs/adding-monitoring-service
Adding monitoring service
IPadawans authored Apr 24, 2024
2 parents 50b7b16 + c589a89 commit 2681411
Showing 51 changed files with 789 additions and 216 deletions.
9 changes: 5 additions & 4 deletions .gitignore
@@ -1,9 +1,10 @@
infra/docker/monitoring/grafana/config/
infra/docker/monitoring/prometheus/data/
infra/docker/monitoring/prometheus/monitoring/
infra/docker/grafana/grafana/config/
infra/docker/grafana/prometheus/data/
infra/docker/grafana/prometheus/monitoring/
.idea
infra/docker/shared/jars/**.jar
.vscode
source/project/*
source/metagraph-l0/genesis/genesis.address
source/metagraph-l0/genesis/genesis.snapshot
source/metagraph-l0/genesis/genesis.snapshot
source/*-monitoring-service
117 changes: 101 additions & 16 deletions README.md
@@ -67,20 +67,23 @@ you should see something like this:
USAGE: hydra <COMMAND>
COMMANDS:
install Installs a local framework and detaches project
install-template Installs a project from templates
build Build containers
start-genesis Start containers from the genesis snapshot (erasing history) [aliases: start_genesis]
start-rollback Start containers from the last snapshot (maintaining history) [aliases: start_rollback]
stop Stop containers
destroy Destroy containers
purge Destroy containers and images
status Check the status of the containers
remote-deploy Remotely deploy to cloud instances using Ansible [aliases: remote_deploy]
remote-start Remotely start the metagraph on cloud instances using Ansible [aliases: remote_start]
remote-status Check the status of the remote nodes
update Update Euclid
logs Get the logs from containers
install Installs a local framework and detaches project
install-template Installs a project from templates
build Build containers
start-genesis Start containers from the genesis snapshot (erasing history) [aliases: start_genesis]
start-rollback Start containers from the last snapshot (maintaining history) [aliases: start_rollback]
stop Stop containers
destroy Destroy containers
purge Destroy containers and images
status Check the status of the containers
remote-deploy Remotely deploy to cloud instances using Ansible [aliases: remote_deploy]
remote-start Remotely start the metagraph on cloud instances using Ansible [aliases: remote_start]
remote-status Check the status of the remote nodes
update Update Euclid
logs Get the logs from containers
install-monitoring-service Download the metagraph-monitoring-service (https://github.com/Constellation-Labs/metagraph-monitoring-service) [aliases: install_monitoring_service]
remote-deploy-monitoring-service Deploy the metagraph-monitoring-service to remote host [aliases: remote_deploy_monitoring_service]
remote-start-monitoring-service Start the metagraph-monitoring-service on remote host [aliases: remote_start_monitoring_service]
```

TIP: Pass `-h` to any command listed above to see its accepted parameters
@@ -223,8 +226,12 @@ You can also call the `hydra` option
./hydra status
```

## Monitoring
With the containers building/starting we also build a monitoring tool. You can access this tool at this URL: `http://localhost:3000/`. The initial login and password are:
## Grafana
We have a Grafana container that can monitor your nodes. To enable this feature, modify the following field in `euclid.json` to `true`:

`start_grafana_container=true`

After updating this field, a Grafana container will be constructed when you start the services. You can access this tool at the following URL: http://localhost:3000/. The initial login credentials are:
```
username: admin
password: admin
@@ -233,6 +240,7 @@ You'll be requested to update the password after your first login

This tool provides two dashboards, which you can access in the `Dashboard` section.

**NOTE: This monitoring feature is distinct from remote monitoring. It displays data from your nodes on Dashboards, allowing you to check various metrics. However, it does not perform restarts or any other operations.**

## Deployment

@@ -413,3 +421,80 @@ P2P port: :your_port
Peer id: :peerId
```

## Remote Monitoring

We have introduced a tool in version `v0.10.0` that can monitor your metagraph and restart it if necessary.

### Introduction
This service monitors your metagraph and performs restarts as necessary. It is deployed using Ansible on a remote Ubuntu host. Commands such as `install-monitoring-service`, `remote-deploy-monitoring-service`, and `remote-start-monitoring-service` will be further explained in subsequent sections.

The service is developed using `NodeJS`, and all necessary dependencies are installed on your remote instance during deployment.

Running in the background with PM2, the service initiates checks at intervals specified in the configuration under the field: `check_healthy_interval_in_minutes`. It evaluates the health of the metagraph based on predefined and customizable `restart-conditions`, detailed in the [metagraph-monitoring-service](https://github.com/Constellation-Labs/metagraph-monitoring-service) repository. For example, if an unhealthy node is detected, the service triggers a restart.

To restart a node or layer, the service first attempts to stop any running processes belonging to that layer. This operation requires sudo privileges without a password prompt (refer to [this document](https://gcore.com/learning/how-to-disable-password-for-sudo-command/) for instructions on setting up password-less sudo). In addition to terminating processes, the service moves the node's log files to the `code/restart_logs` directory, which may also require sudo privileges.

After these steps, the service restarts the node or layer and reintegrates it into the cluster.
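
The check described above can be pictured with a minimal Python sketch. It is illustrative only: the real service is written in NodeJS, and the `NodeStatus` and `MetagraphState` types, the function names, and the 5-minute snapshot threshold are assumptions made for this example, not part of the service's API. Only the condition names `SnapshotStopped` and `UnhealthyNodes` come from this README.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class NodeStatus:
    ip: str
    healthy: bool


@dataclass
class MetagraphState:
    nodes: List[NodeStatus]
    minutes_since_last_snapshot: float


def snapshot_stopped(state: MetagraphState, max_idle_minutes: float = 5.0) -> bool:
    # Hypothetical threshold: treat the metagraph as stalled after 5 idle minutes.
    return state.minutes_since_last_snapshot > max_idle_minutes


def unhealthy_nodes(state: MetagraphState) -> bool:
    # True as soon as any node reports an unhealthy status.
    return any(not node.healthy for node in state.nodes)


# Condition names mirror the README; the mapping itself is an assumption.
RESTART_CONDITIONS: Dict[str, Callable[[MetagraphState], bool]] = {
    "SnapshotStopped": snapshot_stopped,
    "UnhealthyNodes": unhealthy_nodes,
}


def triggered_conditions(state: MetagraphState, enabled: List[str]) -> List[str]:
    """Return the enabled restart conditions that currently hold, in order."""
    return [name for name in enabled if RESTART_CONDITIONS[name](state)]
```

In this sketch a scheduler would call `triggered_conditions` once per `check_healthy_interval_in_minutes` and start the stop, log-move, and restart sequence only when the returned list is non-empty.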

### Installation
This tool is not included in Euclid by default, so you need to install it. To do so, run:

`hydra install-monitoring-service`

This command creates a monitoring project named `metagraph-monitoring-service` in your source directory.

To use this feature, Euclid needs information about the remote host that will run the monitoring service.
Populate the `monitoring` section of the file `infra/ansible/remote/hosts.ansible.yml`.
You should provide a user that has sudo privileges without requiring a password. Refer to [this document](https://gcore.com/learning/how-to-disable-password-for-sudo-command/) to learn how to enable password-less sudo for a user.


### Monitoring Configuration

Before deploying to remote instances, you need to configure your monitoring by editing the file `config/config.json`. When you run the install command, some fields will be auto-populated based on the `euclid.json` file, which includes:

- `metagraph.id`: The unique identifier for your metagraph.
- `metagraph.name`: The name of your metagraph.
- `metagraph.version`: The version of your metagraph.
- `metagraph.default_restart_conditions`: Specifies conditions under which your metagraph should restart. These conditions are located in `src/jobs/restart/conditions`, including:
  - `SnapshotStopped`: Triggers if your metagraph stops producing snapshots.
  - `UnhealthyNodes`: Triggers if your metagraph nodes become unhealthy.
- `metagraph.layers`:
  - `ignore_layer`: Set to `true` to disable a specific layer.
  - `ports`: Specifies public, P2P, and CLI ports.
  - `additional_env_variables`: Lists additional environment variables needed upon restart, formatted as `["TEST=MY_VARIABLE, TEST_2=MY_VARIABLE_2"]`.
  - `seedlist`: Provides information about the layer seedlist, e.g., `{ base_url: ":your_url", file_name: ":your_file_name"}`.
- `metagraph.nodes`:
  - `ip`: IP address of the node.
  - `username`: Username for SSH access.
  - `privateKeyPath`: Path to the private SSH key, relative to the service's root directory. Example: `config/your_key_file.pem`.
  - `key_file`: Details of the `.p12` key file used for node startup, including `name`, `alias`, and `password`.
- `network.name`: The network your metagraph is part of, such as `integrationnet` or `mainnet`.
- `network.nodes`: Information about the GL0s nodes.
- `check_healthy_interval_in_minutes`: The interval, in minutes, for running the health check.

NOTE: You must provide your SSH key file that has access to each node. It is recommended to place this under the `config` directory. Ensure that this file has access to the node and that the user you've provided also has sudo privileges without a password.
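
Before deploying, it can help to confirm that none of the fields listed above is missing. The Python sketch below is a hypothetical helper, not something shipped with Euclid or the monitoring service. It assumes that the dotted names in the list map to nested JSON objects in `config/config.json`; check the actual file layout before relying on it.

```python
from typing import Any, Dict, List

# Dotted paths taken from the field list above; the nesting is an assumption.
REQUIRED_PATHS = [
    "metagraph.id",
    "metagraph.name",
    "metagraph.version",
    "metagraph.default_restart_conditions",
    "metagraph.layers",
    "metagraph.nodes",
    "network.name",
    "network.nodes",
    "check_healthy_interval_in_minutes",
]


def lookup(config: Dict[str, Any], dotted_path: str) -> Any:
    """Walk nested dicts along a dotted path; return None if any key is absent."""
    current: Any = config
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def missing_fields(config: Dict[str, Any]) -> List[str]:
    """Return the required paths that are absent or empty in the config."""
    missing = []
    for path in REQUIRED_PATHS:
        value = lookup(config, path)
        if value is None or value in ("", [], {}):
            missing.append(path)
    return missing
```

A typical use would be to load the file with `json.load` and stop the deployment when `missing_fields` returns a non-empty list.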

### Customize Monitoring

Learn how to customize your monitoring by checking the repositories:

- [metagraph-monitoring-service-package](https://github.com/Constellation-Labs/metagraph-monitoring-service-package)
- [metagraph-monitoring-service](https://github.com/Constellation-Labs/metagraph-monitoring-service)

### Deploying Monitoring

Once you've configured your metagraph monitoring, deploy it to the remote host with:

`hydra remote-deploy-monitoring-service`

This command copies your local monitoring service from Euclid to the remote instance and installs all necessary dependencies there.

### Starting Monitoring

After deployment, start your monitoring with:

`hydra remote-start-monitoring-service`

To force a complete restart of your metagraph, use:

`hydra remote-start-monitoring-service --force-restart`
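
The start playbook added in this commit reads a `force_restart` variable, defaults it to `false`, and then runs either `yarn start` or `yarn force-restart`. The Python sketch below mirrors that selection for illustration. The `to_bool` helper only approximates Ansible's `bool` filter, and how `hydra` passes `--force-restart` through to the playbook variable is an assumption not shown in this diff.

```python
from typing import Any


def to_bool(value: Any) -> bool:
    """Approximate Ansible's `bool` filter (assumption: not the exact implementation)."""
    if value is None or isinstance(value, bool):
        return bool(value)
    if isinstance(value, str):
        value = value.lower()
    return value in ("yes", "on", "1", "true", 1)


def start_command(force_restart: Any = None) -> str:
    """Pick the yarn script the playbook would run for a given force_restart value."""
    return "yarn force-restart" if to_bool(force_restart) else "yarn start"


print(start_command())        # yarn start
print(start_command("true"))  # yarn force-restart
```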
20 changes: 14 additions & 6 deletions euclid.json
@@ -1,14 +1,14 @@
{
"github_token": "",
"version": "0.10.0-SNAPSHOT",
"tessellation_version": "2.3.1",
"tessellation_version": "2.3.3",
"project_name": "custom-project",
"framework": {
"name": "currency",
"modules": [
"data"
],
"version": "v2.3.1",
"version": "v2.3.3",
"ref_type": "tag"
},
"layers": [
@@ -44,7 +44,7 @@
}
],
"docker": {
"start_monitoring_container": false
"start_grafana_container": false
},
"deploy": {
"network": {
@@ -57,9 +57,17 @@
},
"ansible": {
"hosts": "infra/ansible/remote/hosts.ansible.yml",
"playbooks": {
"deploy": "infra/ansible/remote/playbooks/deploy/deploy.ansible.yml",
"start": "infra/ansible/remote/playbooks/start/start.ansible.yml"
"nodes": {
"playbooks": {
"deploy": "infra/ansible/remote/nodes/playbooks/deploy/deploy.ansible.yml",
"start": "infra/ansible/remote/nodes/playbooks/start/start.ansible.yml"
}
},
"monitoring": {
"playbooks": {
"deploy": "infra/ansible/remote/monitoring/playbooks/deploy/deploy.ansible.yml",
"start": "infra/ansible/remote/monitoring/playbooks/start/start.ansible.yml"
}
}
}
}
11 changes: 10 additions & 1 deletion infra/ansible/remote/hosts.ansible.yml
@@ -24,4 +24,13 @@ nodes:
base_currency_l1_cli_port: 9202
base_data_l1_public_port: 9300
base_data_l1_p2p_port: 9301
base_data_l1_cli_port: 9302
base_data_l1_cli_port: 9302

monitoring:
  hosts:
    monitoring-1:
      ansible_host: #Your host IP
      ansible_user: #Your host User
      ansible_ssh_private_key_file: ~/.ssh/id_rsa
  vars:
    ansible_ssh_common_args: "-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null"
@@ -0,0 +1,66 @@
---
- name: Install monitoring dependencies
  hosts: monitoring
  become: true
  gather_facts: false
  tasks:
    - name: Update apt cache
      ansible.builtin.apt:
        update_cache: yes

    - name: Install Python and pip
      ansible.builtin.apt:
        name:
          - python3
          - python3-pip
        state: present

    - name: Install build essential tools
      ansible.builtin.apt:
        name: build-essential
        state: present

    - name: Install additional libraries
      ansible.builtin.apt:
        name:
          - gcc
          - g++
          - make
        state: present

    - name: Download and run NodeSource Node.js 20.x setup script
      ansible.builtin.get_url:
        url: https://deb.nodesource.com/setup_20.x
        dest: /tmp/setup_node_20.sh
        mode: '0755'
      register: download_nodesource_script

    - name: Execute NodeSource setup script
      ansible.builtin.shell: /tmp/setup_node_20.sh
      when: download_nodesource_script is succeeded

    - name: Install Node.js and npm
      ansible.builtin.apt:
        name:
          - nodejs
        state: present
        update_cache: yes

    - name: Install node-gyp globally
      ansible.builtin.npm:
        name: node-gyp
        global: yes

    - name: Check if Yarn is already installed
      ansible.builtin.command: which yarn
      register: yarn_installed
      ignore_errors: true

    - name: Install Yarn
      ansible.builtin.shell:
        cmd: npm install --global yarn
      when: yarn_installed.rc != 0

    - name: Install PM2
      ansible.builtin.shell:
        cmd: yarn global add pm2
@@ -0,0 +1,24 @@
---
- import_playbook: configure.ansible.yml

- name: Send metagraph-monitoring-service to remote host
  hosts: monitoring
  gather_facts: false
  tasks:
    - name: Create directory
      ansible.builtin.file:
        path: /home/{{ ansible_user }}/code
        state: directory

    - name: Sending metagraph-monitoring-service excluding node_modules
      ansible.builtin.synchronize:
        src: "{{ lookup('env', 'SOURCE_PATH') }}/metagraph-monitoring-service/"
        dest: "/home/{{ ansible_user }}/code/metagraph-monitoring-service"
        delete: yes
        rsync_opts:
          - "--exclude=node_modules/"

    - name: Install project dependencies
      shell: |
        cd "/home/{{ ansible_user }}/code/metagraph-monitoring-service"
        yarn
37 changes: 37 additions & 0 deletions infra/ansible/remote/monitoring/playbooks/start/start.ansible.yml
@@ -0,0 +1,37 @@
---
- name: Start monitoring service
  hosts: monitoring
  gather_facts: false
  become: false
  tasks:
    - name: Check if monitoring-service exists
      stat:
        path: /home/{{ ansible_user }}/code/metagraph-monitoring-service
      register: result_dir

    - name: Fail if monitoring-service does not exist
      fail:
        msg: "The metagraph-monitoring-service does not exist."
      when: not (result_dir.stat.exists and result_dir.stat.isdir)

    - name: Stop current process
      shell: |
        cd /home/{{ ansible_user }}/code/metagraph-monitoring-service
        yarn kill
      ignore_errors: true

    - name: Check if should force_restart
      set_fact:
        force_restart_bool: "{{ force_restart | default(false) | bool }}"

    - name: Start monitoring service
      shell: |
        cd /home/{{ ansible_user }}/code/metagraph-monitoring-service
        yarn start
      when: not force_restart_bool

    - name: Start monitoring service forcing restart
      shell: |
        cd /home/{{ ansible_user }}/code/metagraph-monitoring-service
        yarn force-restart
      when: force_restart_bool
@@ -12,7 +12,7 @@
ignore_errors: true

- name: Kill metagraph-l0 process
shell: "kill -9 {{ l0_process_id.stdout }}"
shell: "sudo kill -9 {{ l0_process_id.stdout }}"
ignore_errors: true
when: l0_process_id.stdout is defined

File renamed without changes.
Empty file.