
Kepler Operator Design Discussion

Sunyanan Choochotkaew edited this page Sep 26, 2022 · 7 revisions

What to configure, and how should it be operated?

   flowchart LR;
     machine-config-->integrated-operator-install-->kepler-install
     kepler-collected-metric
     kepler-exported-power

machine-config

node-selector: (default: all)
cgroup2:
 enable: (default: true)

operations:

  • deploy MachineConfigPool using node-selector (:warning: do we need to select nodes?)
  • deploy cgroupv2 MachineConfig
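
The machine-config defaults and operations above can be sketched as a small planning function. This is only an illustration: the class and function names are hypothetical, and it assumes that nothing needs to be deployed when cgroup2 is disabled, which the discussion above does not settle.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MachineConfigSpec:
    # "all" mirrors the documented default for node-selector.
    node_selector: str = "all"
    # cgroup2.enable defaults to true in the config sketch.
    cgroup2_enable: bool = True


def plan_machine_config(spec: MachineConfigSpec) -> List[str]:
    """Return the operations a controller might perform, in order."""
    if not spec.cgroup2_enable:
        # Assumption: the pool only exists to carry the cgroupv2 MachineConfig.
        return []
    return [
        f"deploy MachineConfigPool (node-selector={spec.node_selector})",
        "deploy cgroupv2 MachineConfig",
    ]
```

With the defaults, the plan contains both operations; if the answer to the open question is that node selection is unnecessary, the first entry could be dropped.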

integrated-operator-install

prometheus:
grafana:

operations:

kepler-install

scrape-interval:
daemon:
  exporter:
    image:
    port: (default: 9102)
  estimator-sidecar:
    enabled: (default: false)
    image:
    mnt-path: (default: /tmp)
model-server:
    enabled: (default: :warning:false)
    storage:
      type: (default: local? , values: local, hostpath, nfs, external (such as via s3))
      path: (default: models)
    sampling-period:

operations:

  • deploy model-server
  • deploy model-server-service
  • deploy rbac-related resources (serviceaccount, clusterrole, clusterrolebinding)
  • deploy the corresponding pv and pvc if the storage type is hostpath or nfs
  • deploy daemonset (with or without the estimator sidecar)
  • deploy exporter-service
  • deploy servicemonitor
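
As a rough sketch of how the kepler-install config could drive the operations listed above, the function below derives the resources to deploy from the spec. The class names, field names, and resource labels are hypothetical; the defaults follow the config sketch, and it assumes that the pv and pvc belong to the model-server storage and are only needed when the model server is enabled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ModelServerSpec:
    enabled: bool = False          # default still under discussion
    storage_type: str = "local"    # local, hostpath, nfs, external
    storage_path: str = "models"


@dataclass
class KeplerInstallSpec:
    exporter_port: int = 9102
    estimator_enabled: bool = False
    model_server: ModelServerSpec = field(default_factory=ModelServerSpec)


def plan_kepler_install(spec: KeplerInstallSpec) -> List[str]:
    """List the resources to deploy, in the order given in the operations list."""
    resources: List[str] = []
    ms = spec.model_server
    if ms.enabled:
        resources += ["model-server", "model-server-service"]
    resources += ["serviceaccount", "clusterrole", "clusterrolebinding"]
    # Assumption: only hostpath and nfs storage need a pv/pvc pair.
    if ms.enabled and ms.storage_type in ("hostpath", "nfs"):
        resources += ["pv", "pvc"]
    if spec.estimator_enabled:
        resources.append("daemonset(exporter+estimator)")
    else:
        resources.append("daemonset(exporter)")
    resources += ["exporter-service", "servicemonitor"]
    return resources
```

For example, enabling the model server with nfs storage adds the model-server resources plus the pv and pvc, while the default spec deploys only the rbac resources, the exporter daemonset, its service, and the servicemonitor.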

kepler-collected-metric

spec:
  counter:
  cgroup:
  kubelet:
  gpu:
  ...
status:
  counter:
    cpu_cycles: enabled/disabled/unavailable
    ...

operations:

  • apply metric configuration to kepler-ds env (:warning: or implement kepler-exporter to watch this CR)
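
One way to picture the first option (applying the metric configuration to the kepler-ds environment) is a mapping from the CR spec to environment variables, plus a helper that computes the per-metric status. This is a sketch under assumptions: the spec values are taken to be booleans, and the `ENABLE_..._METRIC` variable names are hypothetical, not Kepler's actual environment variables.

```python
from typing import Dict


def metric_env(spec: Dict[str, bool]) -> Dict[str, str]:
    """Map metric groups (counter, cgroup, kubelet, gpu, ...) to env vars."""
    return {
        f"ENABLE_{group.upper()}_METRIC": str(enabled).lower()
        for group, enabled in sorted(spec.items())
    }


def metric_status(requested: bool, available: bool) -> str:
    """Return one of the status values from the CR sketch."""
    if not available:
        # The node cannot provide this metric, whatever the spec asks for.
        return "unavailable"
    return "enabled" if requested else "disabled"
```

For instance, `metric_status(True, False)` reports `unavailable`, which covers the case where a metric such as `cpu_cycles` is requested but the node does not expose it.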

kepler-exported-power

spec:
  node:
  package:
  pod:
status:
  power-source:
    node:
    package:
    pod:  
  ...

operations:

  • apply power configuration to kepler-ds env (:warning: or implement kepler-exporter to watch this CR) --> similar to collected-metric
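
The alternative raised in the warning (having kepler-exporter watch the CR itself) would need some form of change detection so that the configuration is reapplied only when the spec actually changes. The class below is a minimal, hypothetical sketch of that idea; it assumes the spec is a JSON-serializable dict and says nothing about how the exporter would receive it from the API server.

```python
import json
from typing import Optional


class SpecWatcher:
    """Remember the last seen spec and report whether a new one differs."""

    def __init__(self) -> None:
        self._last: Optional[str] = None

    def observe(self, spec: dict) -> bool:
        # Serialize with sorted keys so key order does not count as a change.
        digest = json.dumps(spec, sort_keys=True)
        changed = digest != self._last
        self._last = digest
        return changed
```

The first call to `observe` always reports a change, which matches the initial configuration being applied at start-up.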

Reconcile Loop

⚠️ Which design choice is best?

  1. Single CR and Controller (merge metric and power to the kepler-ds and combine with other install config)
   flowchart LR;
     machine-config-->integrated-operator-install-->kepler-install-config
  2. Single install CR and Controller + metric CR and Controller + power CR and Controller
   flowchart TD;
     machine-config-->integrated-operator-install-->kepler-install
     kepler-collected-metric
     kepler-exported-power
  3. machine-config CR and Controller + integrated-operator-install CR and Controller + metric CR and Controller + power CR and Controller
   flowchart LR;
     machine-config; 
     integrated-operator-install;
     kepler-collected-metric
     kepler-exported-power
| reconcile choice | advantage | disadvantage |
| --- | --- | --- |
| single | single point of modification | frequently triggers unnecessary install logic, poor code readability |
| single install + metric + power | separates infrequent changes from relatively frequent changes, improved abstraction | |
| all separated | clean logic, separates different functionality, most ideal abstraction | |
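
To make the difference between the choices concrete, the sketch below models the second choice: one reconcile function per CR kind, selected by a dispatch table. The kind names and function names are hypothetical. The point it illustrates is that a change to the metric CR runs only the metric logic and never the install logic, which is the drawback listed for the single-controller choice.

```python
from typing import Callable, Dict


def reconcile_install(name: str) -> str:
    # Infrequent: machine-config, integrated operators, and kepler-install.
    return f"install:{name}"


def reconcile_metric(name: str) -> str:
    # Relatively frequent: update the collected-metric configuration only.
    return f"metric:{name}"


def reconcile_power(name: str) -> str:
    # Relatively frequent: update the exported-power configuration only.
    return f"power:{name}"


# Hypothetical CR kinds, one controller each.
CONTROLLERS: Dict[str, Callable[[str], str]] = {
    "KeplerInstall": reconcile_install,
    "KeplerCollectedMetric": reconcile_metric,
    "KeplerExportedPower": reconcile_power,
}


def dispatch(kind: str, name: str) -> str:
    """Run only the controller registered for the changed CR kind."""
    return CONTROLLERS[kind](name)
```

The third choice would extend the same table with separate entries for machine-config and integrated-operator-install instead of folding them into the install controller.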

⚠️ Should the integrated operators also be separated into their own controllers? This may be useful if users want to use different dashboards (not just Grafana)