This software is pre-production and should not be deployed to production servers.
The goal of the Orchestration-aware Workload Collocation Agent (OWCA) is to reduce interference between collocated tasks and increase task density while ensuring quality of service for high-priority tasks. The chosen approach is to enable real-time resource isolation management so that high-priority jobs meet their Service Level Objective (SLO) and best-effort jobs use as many idle resources as possible.
Resource usage can be increased by:
- collocating best-effort and high-priority tasks to exploit resources that are underutilized by high-priority applications,
- collocating tasks that do not compete for shared resources on the platform.
OWCA abstracts the compute node, workloads, monitoring and resource allocation. An externally provided algorithm implements the resource allocation or anomaly detection logic. OWCA and the algorithm exchange information about current resource usage, isolation actuations or detected anomalies. OWCA stores information about detected anomalies, resource allocation and platform utilization metrics in remote storage such as Kafka.
The diagram below puts OWCA in the context of an example Mesos cluster and monitoring infrastructure:
See OWCA Architecture 1.5.pdf for further details.
OWCA is targeted at and tested on CentOS 7.5.
Note: for a full production installation, please follow this detailed installation guide.
# Install required software.
sudo yum install epel-release -y
sudo yum install git python36 -y
python3.6 -m ensurepip --user
python3.6 -m pip install --user pipenv
# Clone the repository & build.
git clone https://github.com/intel/owca
cd owca
pipenv install --dev
pipenv shell
tox
# Run manually (alongside Mesos agent):
sudo dist/owca.pex --config configs/mesos_example.yaml --root
OWCA introduces a simple but extensible mechanism to inject dependencies into classes and build a complete software stack of components.
The OWCA main control loop is based on the Runner base class, which implements a single blocking run method. Depending on the Runner class used, OWCA runs in a different execution mode (e.g. detection, allocation).
Example runners:
- DetectionRunner implements the loop calling the detect function at regular, configurable intervals. See detection API for details.
- AllocationRunner (work in progress) implements the loop calling the allocate function at regular, configurable intervals. See allocation API for details.
Conceptually, a Runner reads the state of the system (both metrics and workloads), passes the information to an external component (an algorithm), logs the algorithm's input and output using an implementation of Storage, and allocates resources if instructed.
The following snippet is an example configuration of a runner:
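The read, call, log cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SketchRunner, InMemoryStorage, detect_high_cpu and all field names are hypothetical and are not OWCA's actual classes or signatures, and the loop is bounded so that the example terminates.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class InMemoryStorage:
    """Hypothetical stand-in for a Storage implementation; keeps records in a list."""
    records: List[dict] = field(default_factory=list)

    def store(self, record: dict) -> None:
        self.records.append(record)


@dataclass
class SketchRunner:
    """Illustrative loop only; not OWCA's actual Runner class."""
    read_state: Callable[[], Dict[str, float]]          # returns current metrics
    algorithm: Callable[[Dict[str, float]], List[str]]  # externally provided logic
    storage: InMemoryStorage
    interval: float = 1.0  # seconds between iterations
    iterations: int = 3    # bounded here so that the example terminates

    def run(self) -> None:
        """Blocking method: read state, call the algorithm, log input and output."""
        for _ in range(self.iterations):
            state = self.read_state()
            result = self.algorithm(state)
            self.storage.store({"input": state, "output": result})
            time.sleep(self.interval)


def detect_high_cpu(metrics: Dict[str, float]) -> List[str]:
    """Toy algorithm: report every metric whose value exceeds 0.9."""
    return [name for name, value in metrics.items() if value > 0.9]


storage = InMemoryStorage()
runner = SketchRunner(
    read_state=lambda: {"task_a_cpu": 0.95, "task_b_cpu": 0.20},
    algorithm=detect_high_cpu,
    storage=storage,
    interval=0.0,
)
runner.run()
```

Keeping the algorithm and the storage as injected fields mirrors the idea that the runner itself only orchestrates the loop, while detection or allocation logic is supplied from outside.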
runner: !SomeRunner
    node: !SomeNode
    callback_component: !ClassImplementingCallback
    storage: !SomeStorage
After starting OWCA with the above configuration, an instance of the class SomeRunner will be created. The instance's properties will be set to:
- node: an instance of SomeNode
- callback_component: an instance of ClassImplementingCallback
- storage: an instance of SomeStorage
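In plain Python, the resulting object graph would look roughly like the sketch below. It assumes that each YAML tag corresponds to a constructor call with the nested keys passed as keyword arguments; the four classes are the placeholder names from the configuration snippet, defined here as empty dataclasses purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class SomeNode:
    """Placeholder for a node implementation."""


@dataclass
class ClassImplementingCallback:
    """Placeholder for the externally provided algorithm."""


@dataclass
class SomeStorage:
    """Placeholder for a storage implementation."""


@dataclass
class SomeRunner:
    """Placeholder runner; its fields match the keys used in the configuration."""
    node: SomeNode
    callback_component: ClassImplementingCallback
    storage: SomeStorage


# Hand-written equivalent of what the configuration describes.
runner = SomeRunner(
    node=SomeNode(),
    callback_component=ClassImplementingCallback(),
    storage=SomeStorage(),
)
```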
The configuration mechanism allows you to:
- Create and configure complex Python objects (e.g. DetectionRunner, MesosNode, KafkaStorage) using YAML tags.
- Inject dependencies (with type checking support) into constructed objects using dataclasses annotations.
- Register external classes using the -r command line argument or the owca.config.register decorator API.
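To make the registration and type-checked injection ideas concrete, here is a minimal standalone sketch. It does not import or reproduce owca.config; the register decorator, the _REGISTRY dictionary, the build helper and both example classes are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass, fields
from typing import Dict, Type

_REGISTRY: Dict[str, Type] = {}  # hypothetical registry: tag name -> class


def register(cls):
    """Hypothetical decorator: makes a class available under the tag '!ClassName'."""
    _REGISTRY["!" + cls.__name__] = cls
    return cls


def build(tag: str, **kwargs):
    """Construct a registered class and check each field against its annotation."""
    cls = _REGISTRY[tag]
    obj = cls(**kwargs)
    for f in fields(cls):
        value = getattr(obj, f.name)
        # Only plain classes are checked; generic annotations are skipped.
        if isinstance(f.type, type) and not isinstance(value, f.type):
            raise TypeError(
                f"{cls.__name__}.{f.name}: expected {f.type.__name__}, "
                f"got {type(value).__name__}"
            )
    return obj


@register
@dataclass
class ExampleStorage:
    topic: str = "metrics"


@register
@dataclass
class ExampleDetector:
    storage: ExampleStorage  # dependency injected by the caller of build()
    threshold: float = 0.5


detector = build("!ExampleDetector", storage=build("!ExampleStorage"))
```

Passing a value of the wrong type, for example build("!ExampleDetector", storage="oops"), raises TypeError in this sketch, which is the kind of early failure that annotation-based checking is meant to provide.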
See external detector example for more details.
The following built-in components are available:
- MesosNode provides workload discovery on a Mesos cluster node where the Mesos containerizer is used.
- DetectionRunner implements the anomaly detection loop and encodes anomalies as metrics to enable alerting and analysis. See Detection API for more details.
- AllocationRunner implements the resource allocation loop. See Allocation API for more details (work in progress).
- NOPAnomalyDetector is a dummy "no operation" detector that returns neither metrics nor anomalies. See Detection API for more details.
- KafkaStorage logs metrics to the Kafka streaming platform using configurable topics.
- LogStorage logs metrics to standard error or to a file at a configurable location.
The project contains Dockerfiles and helper scripts for preparing reference workloads to be run on a Mesos cluster using the Aurora framework.
To enable validation of anomaly detection algorithms, the workloads are prepared to:
- provide a continuous stream of Application Performance Metrics using wrappers (all workloads),
- simulate varying load (patches that generate a sine-like pattern of requests per second are available for YCSB and rpc-perf).
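As an illustration of what a sine-like pattern of requests per second means, the sketch below computes a target rate over time. It is not taken from the YCSB or rpc-perf patches; the function name sine_rate and the base, amplitude and period parameters are assumptions chosen for the example.

```python
import math


def sine_rate(t: float, base: float = 1000.0, amplitude: float = 500.0,
              period: float = 60.0) -> float:
    """Target requests per second at time t (in seconds), following a sine wave."""
    return base + amplitude * math.sin(2 * math.pi * t / period)


# Sample the target rate every 15 seconds over one 60-second period.
rates = [round(sine_rate(t)) for t in range(0, 60, 15)]
print(rates)  # [1000, 1500, 1000, 500]
```

A load generator following such a curve exercises the detector at both low and high utilization within each period, which makes it easier to observe how interference changes with load.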
See the workloads directory for the list of supported applications and load generators.