Keeping master up to date #438

Merged
merged 275 commits into from
Sep 28, 2023

XaverStiensmeier

For more information, please take a look at the documentation. In most cases, the configuration documentation is the right place to start.

Large updates

Images Allow Regex

You can now use a regex instead of a specific image name. This is the only way a cluster can keep spawning worker nodes over time, because any specific image can be deactivated, and deactivated images can no longer be used to create workers. You are therefore advised to use a fitting regex.

Fallback On Other Image

You can now specify another image or another regex to fall back on if no active image is found for your regular image key.
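
The image selection described in the two sections above could be sketched roughly as follows. This is only an illustration, not BiBiGrid's actual implementation: the function name `select_image`, the dictionary layout of `images`, and the example image names are all assumptions made for this sketch.

```python
import re


def select_image(patterns, images):
    """Return the name of the first active image matching any pattern.

    `patterns` is tried in order (regular image first, then fallbacks).
    `images` is a hypothetical list of dicts with "name" and "status" keys.
    """
    for pattern in patterns:
        regex = re.compile(pattern)
        for image in images:
            # Deactivated images cannot be used to create workers, so skip them.
            if image["status"] == "active" and regex.fullmatch(image["name"]):
                return image["name"]
    return None


# Hypothetical image list: the older Ubuntu image has been deactivated.
images = [
    {"name": "Ubuntu 22.04 LTS (2023-06-01)", "status": "deactivated"},
    {"name": "Ubuntu 22.04 LTS (2023-09-01)", "status": "active"},
    {"name": "Debian 11 (2023-08-15)", "status": "active"},
]

# A regex keeps matching after the image is replaced by a newer build.
regular = select_image([r"Ubuntu 22\.04 LTS \(.*\)"], images)

# If the regular key finds nothing active, the fallback pattern is tried next.
with_fallback = select_image([r"Ubuntu 20\.04 LTS \(.*\)", r"Debian 11 \(.*\)"], images)

print(regular)
print(with_fallback)
```

Trying the patterns in order keeps the regular image key authoritative and only uses the fallback when nothing active matches it.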

Gateway

BiBiGrid can now use a Gateway for create, ide and update.
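
The commit log below mentions a `portFunction` for the gateway. As a rough illustration of the idea, the following sketch maps a node's private IP to a port on the gateway host. The concrete formula (base port plus the last octet), the function name `gateway_port`, and the default `base_port` are assumptions chosen for this example; they are not taken from BiBiGrid's code.

```python
import ipaddress


def gateway_port(private_ip, base_port=30000):
    """Hypothetical port mapping: base port plus the last octet of the private IPv4."""
    last_octet = ipaddress.IPv4Address(private_ip).packed[-1]
    return base_port + last_octet


# Each node behind the gateway becomes reachable through its own port.
port = gateway_port("192.168.0.42")
print(port)
```

With a mapping like this, the node is reached via the gateway's address and the computed port rather than via a public IP of its own.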

BiBiGrid Rest API prototype

This is still a prototype and will likely change in the future.

autoMount

Volumes can now be mounted automatically. Read the documentation before using this feature.

Smaller updates

Logging has been revised

Logging is no longer done via a global logger; instead, a created logger is passed to the code that needs it. This is necessary for the REST API.
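
A minimal sketch of the pattern, using only Python's standard `logging` module. The names `make_cluster_logger` and `create_cluster`, the logger naming scheme, and the in-memory streams are assumptions for illustration and do not reflect BiBiGrid's actual function signatures.

```python
import io
import logging


def make_cluster_logger(cluster_id, stream):
    """Build a dedicated logger for one cluster (call once per cluster id)."""
    logger = logging.getLogger(f"bibigrid.{cluster_id}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def create_cluster(cluster_id, log):
    """Hypothetical action that logs through the logger it was given."""
    log.info("starting create for %s", cluster_id)


# Two concurrent requests can each get their own log destination.
stream_a, stream_b = io.StringIO(), io.StringIO()
create_cluster("aaa", make_cluster_logger("aaa", stream_a))
create_cluster("bbb", make_cluster_logger("bbb", stream_b))

print(stream_a.getvalue(), end="")
print(stream_b.getvalue(), end="")
```

Because each action receives its logger as an argument, log output from different requests stays separate, which a single global logger cannot easily provide.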

Worker name change

Worker names no longer include a worker group number. This number is simply no longer relevant.

Fixed error regarding multiple subnets

Multiple subnets are now possible. This also fixes a bug that prevented BiBiGrid from starting if only a network was given.

Many more minor fixes

XaverStiensmeier and others added 24 commits May 26, 2023 12:53
Addendum: Removed worker groups and slightly streamlined id generation
… a list of the available configuration files
Improve Command Line Interface
* implemented switching to available similar named image when no exact image match is found.

* fixing images not existing at all.

* fixed minor mistakes and make difflib work.

* added regex implementation and documentation

* changed image selection. Now image can be a regex, too.

* updated everything to create_server.py

* updated documentation

* updated example configuration

* Downgraded a warning that occurs every time to an info message to avoid user irritation

* fixed naming error
* added exact versions for openstacksdk and python-openstackclient (#413)

* Keep master updated (#401)

* add apt.bi.denbi.de as package source

* update slurm tasks (now uses self-build slurm packages -> v22.05.7, restructure slurm files)

* add documentation to build newer Slurm package

* fixes

* slurmrestd uses openapi/v0.0.38

* Added check_nfs as a non-fatal evaluation (#366)

* Added "." and "-" cases for cid. This allows further rescuing and gives info messages. (#365)

* Added identifier for when no profile is defined to have a distinct identifier.

* Activated vpn setup

* Fixed example command

* Added logging info for file push and commands

* fix slurmrestd configuration

* Implementing wireguard

* update task order (slurm-server)

* fix default user chown settings

* Add an additional mariadb repository for Ubuntu 20.04. Zabbix 7.2 needs at least MariaDB 10.5 or higher and Focal comes with MariaDB 10.3.

* Extend slurm documentation.

* Extends documentation that BiBiGrid now supports Ubuntu 20.04/22.04 and Debian 11 (fixes #348).

* cleanup

* fix typos in documentation

* Updated wg0

* fix typos in documentation

* add workflow-job to lint python/ansible

* add more output

* add more output

* update runner working directory

* make ansible_lint happy

* rewrite linting workflow
add linting dependencies

* fix a typo

* fix pylintrc -> remove ignore-pattern=test/ (not needed, since pylint currently lints bibigrid folder)
make pylint happy

* fixing jinja

* changed jinja

* Fixed wrong when clause

* Removed unnecessary comments and added index implementation

* this_peer is now used

* Added configuration reload if necessary

* Moved restart to handlers

* Added missing handler

* Changed to systemd setup

* Fixed nfs

* Fixed a few bugs; more to come

* added some defaults

* Added vpn wkr without ip

* removed unnecessary print and fixed typo

* added vpn counter

* debugging bug

* debugging vpnwkr naming is wrong

* Commenting out worker creation

* Fixed bug making first worker and numberless

* fixed number order in deletion

* vpn workers added to instances.yml

* Added key generator for wireguard keys
Fixed minor bugs and added wireguard vpn support except subnets

* Added subnet cidr

* Fixing default value bugs

* added identifier

* added identifier as variable and changed providers to access all flavors

* reformatted

* slurm

* fixed ip assigning

* foreign workers are now included in compute nodes

* Added vpnwkrs to playbook start

* Fixed formatting. Added identifier instead of "Test" for wireguard configuration to improve debugging

* Larger rework of instances file

* fixing bugs caused by aforementioned rework

* fixing bugs caused by aforementioned rework

* fixing bugs caused by aforementioned rework

* fixing bugs caused by aforementioned rework

* cluster_dict no longer needed for ansible configuration

* Changed instances_yml so it allows grouping by cloud

* Renamed to match jinja extension of other files

* instances.master

* instances.master

* removed master from instances list and fixed minor bugs.

* Fixed slicing

* Removed empty vpnworkers list as there can be only one

* Removed no longer needed import

* minor reference fixes regarding master and vpn

* Changed ip to cidr as it should be in nfs exports

* removed faulty space in nfs export entry

* added vpnwkrs to list of nodes to run ansible-playbook on

* added missing vpnwkr

* Set default partition

* Removed default partition as this key doesn't exist

* default if cloud fits

* all credentials will now be stored. Not compatible with save script yet.

* fixed wrong parameter type due to ac handling multiple providers now instead of just one

* Fixed cidr bug

* changed cloud_specification to use identifier

* Fixed master not being filtered out due to buggy detection

* create is now cloud structured but badly implemented (needs asynchronous implementation)

* Removed master = none

* removed faulty bracket.

* Worker start follows cloud structure now

* fixed badly placed assignment of ac_cloud_yaml

* replaced no longer fitting regex by an actual exact check using slurm's hostname resolution

* fixed old variable name leading to hiccups

* Changed nfs exports to add all subnets. Currently not very nice looking, but working.

* Added comments and improved variable names.

* Added delete_server.py routine and connected it to fail.sh (untested).

* Further grouped code and simplified logging.

* fixed minor bugs and added a little bit of logging.

* patch for wait for post-launch services to stop

* Added private_v4 to configuration implementation. Bit dirty.

* Changed nfs for workers back to private_v4. Will crash with vpnwkr as long as security groups are not set correctly.

* Added missing instances

* add dnsmasq support ( #372 ) (#380)

* add dnsmasq support ( #372 )

* extend dnsmasq support ( #372 )

* bugfixes dnsmasq support ( #372 )

* fix ansible syntax
add all vpnworker to dnsmasq.hosts ( #372 )
change order of copying clouds.yaml
many changes

* Added wireguard_ip

* wireguard_ip increased by 1 to ignore master

* Added a print for private_v4 to symbolize the start of dns entry creation

* Add support for additional vars file : hosts.yml
Extend hosts.j2 template to support worker entries

* - extends instances configuration
- add worker_userdata template

* - remove unused wireguard-worker.yml
- add userdata support (create_server.py)
- enable ip forwarding and tcp mtu probing  on vpn gateways

* Fix program crash when image is not active (#382)

* Fixed function missing call

* Fixed linter that wasn't troubled before

* Fix ephemeral not working (#385)

* implemented usage of host_vars

* probably solved, but not best solution yet

* changed from host_vars to group_vars to have fewer files doing the same work

* update requirements.txt

* add ConfigurationException

* Provider and its implementation for OpenStack get another method to add allowed_addresses to an interface/port

* Remove no longer needed functions/code fragments. Add support for extended network configuration when creating a multi-cloud cluster.

* added hybrid cloud

* updating check documentation

* updating check documentation

* updating check documentation

* Removed artefact

* Filled text beyond headings

* Add security group support to provider and its implementing classes.

* Update create action:
- support for security groups
- slightly restructuring

* add wireguard network to list of allowed addresses

* fix wrong usage of jinja templating

* add usage of security groups when creating a worker

* fix wireguard systemd network configuration

* add firewall rules when running in a multi-cloud setup

* add termination of created security groups
fix a bug concerning adding allowed addresses

* fix "allowed addresses" when running with more than 2 providers

* pin openstacksdk to an older version to avoid deprecation warnings.

* Added host file solution for vpnwkrs. Moved wireguard to configuration.

* Added host vars to deletion process and fixed vpnwkrs using group vars instead of host vars bug.

* Fixing structural changes due to merge

* Fixed vpn workers getting lost

* fixed merge bug, improved data structure ansible/jinja

* Removed another bug regarding passing too many arguments.

* removed delay for now

* fixed worker count

* fixed wireguard

* Added reattempt for ConflictException still not perfect.

* Further fixed vpnwkr merge issues

* Adapted command to new group vpn that contains both master and vpnwkr

* Fixed wireguard ip bug

* fixed bug wireguard not installed on vpn-worker

* Changed "local" to "ssh" in order to avoid sudo right issue on master.

* fixed group name?

* adapted timeout to experiences

* fixed group name now using "-" instead of ":"

* fixed userdata being a list because readlines was used instead of read. It is now a string.

* group name cannot contain '-' therefore switched to underscores. Maybe change this in the node naming convention as well.

* Make all clouds default

* first draft add ip routes

* Added ip routes to main.yml

* Changed ip route registration to make use of linux network files

* Workers now save the gateway_ip (private_v4 of master or vpnwkr). Also fixed a counting error.

* now using common variable wireguard_common instead of group_var wireguard which is always missing on workers.

* Added rights.

* Disabling netplan and going full networkd

* Disabling cloud network changes after initialization

* Added netplan deactivation

* Fixed connection issues

* Added missing handler and added a task that updates the host file on worker

* Fixed minor bad namings and added missing ".yaml" extension to task file

* Added implementation of "bibiname" a short script that allows node name creation

* fixed name issue regarding slurm user executing ansible. Now master name is determined without user involvement.

* renamed task to "generate bibiname script"

* Adapted scripts to meet hybrid cloud solution

* Added delete_server.py script to bin copied files

* fixed fail and terminate script

* changed terminate script to timeout delete

* fixed minor code issues

* fixed linting issues delete_server.py

* fixed linting issues provider.py

* fixed linting issues startup_tests.py

* fixed linting issues

* fixed linting issues

* fixed typo

* fixed termination ConflictException not caught

* Added basic structure for multi_cloud.md

* Added elixir compute presentation as an additional light-weight read.

* added this file that - in the future - maybe should hold information regarding other projects that are using BiBiGrid. That makes it easier to keep an eye on all applications that might be affected by BiBiGrid's changes.

* Added basic wireguard.md documentation

* fixed grammar

* removed redundant warning

* added dnsmasq documentation structure

* removed encryption

* updated purpose description

* update DNS

* now creating empty hosts.yml file in order to allow ansible execution

* Remove entire vars folder

* fixed path

* changed provider.NAME provider.cloud_specification['identifier']

* Removed vpnwkr from slurm as it should only be used to establish connection and not for computing

* Decoupled for loop worker ansible host creation from vpnwkr host creation

* fixed vpnwkr still being added to the partition even though the node doesn't exist anymore

* Fixed bug in bibiname.j2 that gave master a number (master never has a number as there is only one)

* removed all references to the instances.master

* removed further references to instances.yml and fixed bugs appearing because of it. Needs rework where master access can be shortened.

* fixed slurm.conf creating NodeName duplicates. Still unordered.

* Added all partition

* Removed instances.yml from create_server.py

* Removed instances.yml from delete_server.py

* removed last remains of instance.yml

* Servers are now created asynchronously.

* Fixed rest error

* Added support for feature in slurm.conf

* Putting features into group_vars

* Updated configuration.md documentation to mention new feature "feature" for instances and configuration.

* Added merge information and updates bibigrid.yml accordingly

* added features to master and workergroups

* fixed features not added as string to slurm.conf

* added missing empty line

* Now a single string instead of a list of features is understood as well.

* Improved cloud_identifier selection and documented the new way: picking clouds.yaml key.

* updated configuration.md and removed many inaccuracies

* changed instances to instance for instance creation as workers are no longer created.

* Improved create.md

* Improved naming of subparagraph

* Fixed indentation, readability and documentation

* Improved logging information.

* Improved logging

* Added warning message when configuration is not list.

* added configuration list parameter

* Added logging when network or subnet couldn't be set

* Improved logging of ConfigurationExceptions

* Improved documentation. Removed unnecessary variable in ide

* Improved documentation.

* Added brief information regarding wireguard and zabbix

* changed vpnwkr to vpngtw

* Fixed security group deletion for not multi-cloud clusters.

---------

Co-authored-by: Jan Krüger <[email protected]>
Co-authored-by: Jan Krüger <[email protected]>

* Added option to generate cluster_id before create process

* Added rest api prototype

* reworked naming convention and added terminate command. Added basic replies.

* Converted global LOG to class attribute self.log to enable different logs per thread

* Reverted logging to global logging because using redirect might be more feasible

* Using contextlib to redirect prints

* Started rewriting prints to logging and make logging not global and thread-safe

* Fixed list_clusters needing log now.

* updated terminate.py and occurrences to local logging.

* changed logging to local for ansible configurator

* unfinished: started localizing logging in logging_path_handler.py

* updating ssh_handler.py now logging locally (and affected modules)

* updating ssh_handler.py now logging locally (and affected modules)

* improved variable names

* updated provider_handler.py to local logging

* changed global logging to local logging

* changed global logging to local logging

* Fixed many small logging mistakes and changed validation logging to local

* Fixed formatting

* Cleaned startup.py

* Fixed logging error and made use of logging for all commands

* Added cpu based worker selection

* Added new logging option 42 for "PRINT"

* Improved logger and added an explanation implementation

* Changed info to post and contains list now instead of single element

* Switched to main method.

* fixed many small things regarding log, added gateway mode for ssh_handler.py and fixed rest added get_log option

* Enabled multiple subnets for when network is given. Not fully operational yet.

* Fixed crash causing bug when using network instead of subnet

* Removed unnecessary debug warning

* made print nicer

* further fixed using network instead of subnet

* fixed issues regarding port calculation and gateway_ip

* Added check whether a cluster is running

* removed prints

* removed prints

* Added comments for docs

* Added pydantic base models

* Capitalized names

* added option to terminate with assume_true

* removed as docs fulfills this purpose now

* added option to not upload Credentials

* fixed minor bug causing bibigrid not finding private keys.

* removed print

* fixed name not being capitalized (ansible)

* fixed old linting error

* fixed old linting error

* implemented gateway with portFunction using sympy

* using gateway automatically deactivates public ip usage now.

* updated documentation

* update is now able to use gateway if given.

* ide is now able to use gateway if given.

* new version correctly integrated

* removed unnecessary add to stdout (already standard)

* removed unnecessary add to stdout (already standard) from startup_rest.py

* if regex is found, check will succeed now.

* fixed ssh not using gateway

---------

Co-authored-by: Jan Krüger <[email protected]>
Co-authored-by: Jan Krüger <[email protected]>
# Conflicts:
#	bibigrid.yml
#	bibigrid/core/actions/check.py
#	bibigrid/core/actions/create.py
#	bibigrid/core/actions/list_clusters.py
#	bibigrid/core/actions/terminate_cluster.py
#	bibigrid/core/actions/version.py
#	bibigrid/core/provider.py
#	bibigrid/core/startup.py
#	bibigrid/core/utility/ansible_configurator.py
#	bibigrid/core/utility/handler/configuration_handler.py
#	bibigrid/core/utility/handler/ssh_handler.py
#	bibigrid/core/utility/validate_configuration.py
#	bibigrid/models/exceptions.py
#	bibigrid/openstack/openstack_provider.py
#	documentation/markdown/features/check.md
#	documentation/markdown/features/configuration.md
#	requirements.txt
#	resources/playbook/roles/bibigrid/files/slurm/create_server.py
#	resources/playbook/roles/bibigrid/tasks/000-add-ip-routes.yml
#	resources/playbook/roles/bibigrid/tasks/003-dns.yml
#	resources/playbook/roles/bibigrid/tasks/010-bin-server.yml
#	resources/playbook/roles/bibigrid/tasks/025-nfs-server.yml
#	resources/playbook/roles/bibigrid/templates/bin/bibiname.j2
@XaverStiensmeier XaverStiensmeier merged commit 1978b9b into master Sep 28, 2023
1 check passed