
Advanced networking placement features and automate exposure #13608

Merged: 8 commits into ManageIQ:master on Feb 9, 2017

Conversation

@mansam (Contributor) commented on Jan 20, 2017

This PR is an attempt to get #4899 fixed up for the current codebase. The main thing that might still need to change is the state machine code, so that error conditions while provisioning OpenStack Cloud instances are handled without hanging until the provision request times out. @Ladas @gmcculloug

Depends on ManageIQ/manageiq-gems-pending#43

--

Adding methods for querying the utilization of networks and exposing them to automate.

These methods allow, for example, deployment to the least utilized private network, or deployment to the least utilized private network connected to the least utilized public network.

Implements BZ
https://bugzilla.redhat.com/show_bug.cgi?id=1205392
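
As a rough illustration of the placement this enables, here is a minimal automate-style sketch, not code from this PR: ip_address_total_count is exposed by the PR, while ip_address_used_count is an assumed counterpart (the discussion below also mentions ip_address_used_count_live).

# Hedged sketch: place onto the least utilized cloud network.
# ip_address_total_count comes from this PR; ip_address_used_count is assumed.
networks = $evm.vmdb(:cloud_network).all

least_utilized = networks.min_by do |net|
  total = net.ip_address_total_count.to_f
  total.zero? ? Float::INFINITY : net.ip_address_used_count / total
end

$evm.log(:info, "Placing on network #{least_utilized.name}") if least_utilized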

@chessbyte added the wip label on Jan 21, 2017
@@ -47,6 +47,68 @@ def self.class_by_ems(ext_management_system, external = false)
end
end

def ip_address_total_count
Contributor:

Hm, as I'm looking at it, this probably needs to go to the OpenStack STI subclass, since there are OpenStack-specific things here.

Or we separate the base class and the OpenStack class and add checks: e.g. allocation_pools must be defined, and ip_address_used_count_live does an OpenStack-specific query.
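
A rough sketch of that split, with method bodies that are assumptions based on this comment rather than the code in the PR:

require 'ipaddr'

# Base class: generic arithmetic only, relying on allocation_pools being populated.
class CloudNetwork < ApplicationRecord
  def ip_address_total_count
    cloud_subnets.sum do |subnet|
      subnet.allocation_pools.to_a.sum do |pool|
        IPAddr.new(pool["end"]).to_i - IPAddr.new(pool["start"]).to_i + 1
      end
    end
  end
end

# OpenStack STI subclass: provider-specific live query (illustrative Fog/Neutron call).
class ManageIQ::Providers::Openstack::NetworkManager::CloudNetwork < ::CloudNetwork
  def ip_address_used_count_live
    ext_management_system.with_provider_connection(:service => "Network") do |conn|
      conn.ports.count { |port| port.network_id == ems_ref }
    end
  end
end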

@@ -8,5 +8,13 @@ class MiqAeServiceCloudNetwork < MiqAeServiceModelBase
expose :floating_ips, :association => true
expose :network_ports, :association => true
expose :network_routers, :association => true

expose :ip_address_total_count
Contributor:

could you also expose this only on the OpenStack automate model?

@@ -12,5 +14,8 @@ class MiqAeServiceVmCloud < MiqAeServiceVm
expose :floating_ips, :association => true
expose :security_groups, :association => true
expose :key_pairs, :association => true
expose :associate_floating_ip
Contributor:

These 3 should only be on the OpenStack automate model; it should follow the real model placement.

@Ladas (Contributor) commented on Jan 24, 2017

Last 2 comments, otherwise looks good.

@gmcculloug for the automate part, not sure if you have something else in mind, but the continuation condition should include all terminal states, which are [:active, :error]. Or it could possibly go through the on_error branch, where we could put the code that resets the step and tries another network that might not fail.

@mansam (Contributor, author) commented on Jan 24, 2017

@Ladas I've moved those things over onto the OpenStack automate models, and I've also had to rename the associate_floating_ip method to associate_floating_ip_from_network to avoid a name conflict and to better indicate how it works compared to the other related method.

@@ -4,7 +4,7 @@ def do_clone_task_check(clone_task_ref)
instance = openstack.handled_list(:servers).detect { |s| s.id == clone_task_ref }
status = instance.state.downcase.to_sym

return true if status == :active
return true if [:active, :error].include?(status)
Member:

@mansam If we know there is an error you could raise an MiqException::MiqProvisionError, which would get processed by the state machine here (https://github.com/ManageIQ/manageiq/blob/master/app/models/miq_request_task/state_machine.rb#L23).

This would be better than returning true and then having poll_destination_in_vmdb move into the customize_destination state.
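
A minimal sketch of that suggestion applied to the check above; the trailing in-progress return and the message text are assumptions, not the exact code merged here:

def do_clone_task_check(clone_task_ref)
  instance = openstack.handled_list(:servers).detect { |s| s.id == clone_task_ref }
  status   = instance.state.downcase.to_sym

  # Hand the failure to the state machine instead of reporting success.
  if status == :error
    raise MiqException::MiqProvisionError,
          "An error occurred while provisioning instance #{clone_task_ref}"
  end

  return true if status == :active
  return false, "Instance #{clone_task_ref} is still #{status}"   # assumed in-progress return
end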

@mansam (Contributor, author) commented on Jan 24, 2017

Thanks @gmcculloug. So if I understand properly, it would be sufficient to raise the exception right here where the status is checked?

Member:

Correct, I think that will do what you want but would suggest testing it out to validate.

Contributor Author:

👍

Contributor:

@gmcculloug Right, so raising the exception should move us into the on_error state of the state machine? I think that looks cleaner; then we will need a method inside the on_error state that picks another network and retries the provisioning step. Can we do that?

Contributor Author:

@Ladas what code is responsible for picking a network in the first place?

@mansam (Contributor, author) commented on Jan 26, 2017

@Ladas Regarding network picking and retry logic in on_error: was any part of your original PR or any other PR doing that? I don't see any indication of where automatic IP allocation was happening.

@Ladas (Contributor) commented on Jan 26, 2017

@mansam So those are only custom automate state machines right now, attached to the BZ. Some of it could probably go into the default automate state machine for OpenStack.

@mansam (Contributor, author) commented on Jan 27, 2017

@Ladas @gmcculloug Is special handling in on_error necessary, or will a thrown exception automatically prompt the state machine to restart up to its max number of retries?

@Ladas (Contributor) commented on Jan 27, 2017

@mansam Right, so the current automate scripts were just checking whether the state was error, not going through the error state.

This was necessary when deploying many VMs at once, especially if you wanted to keep your networks utilized to the max. The VMs are deployed in parallel, so it will happen that, e.g., when deploying 10 VMs into the most utilized network that has 5 IPs left, 5 of them will fail, so those need to retry the step and look again for the most utilized network that is not full.

Also, you can decide that based on available floating IPs, because by picking a private network you also see how many floating IPs will be available. And it can happen that when you get to a state machine step that assigns a floating IP, there are none left, so you need to go several automate states back and pick a different private network.

So it's really only needed for more complex customer use cases, where you want a more complex network placement strategy and deploy many VMs at once. And you want them all to finish. :-)
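
As a sketch of the re-pick described above, assuming a used-count counterpart to ip_address_total_count and illustrative state-variable names (none of this is code from the PR):

# Hedged sketch: on retry, pick the most utilized network that still has room,
# skipping networks that already failed for this request.
failed   = Array($evm.get_state_var(:failed_network_ids))
networks = $evm.vmdb(:cloud_network).all.reject { |n| failed.include?(n.id) }

candidate = networks
            .select { |n| n.ip_address_used_count < n.ip_address_total_count }
            .max_by { |n| n.ip_address_used_count.to_f / n.ip_address_total_count }

if candidate
  $evm.set_state_var(:chosen_network_id, candidate.id)
else
  $evm.log(:warn, "No network with free addresses left; giving up")
end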

@mansam (Contributor, author) commented on Jan 27, 2017

@Ladas Alright, so if I understand correctly, it sounds like this PR is sufficient then, and anything else will have to be done in Automate.

@mansam changed the title from "[WIP] Advanced networking placement features and automate exposure" to "Advanced networking placement features and automate exposure" on Jan 27, 2017
@Ladas (Contributor) commented on Jan 27, 2017

@mansam I think yes.

@tzumainn (Contributor):

@lsmola sounds like this is good to go? your review on this is much better than mine

@tzumainn (Contributor):

@Ladas ^

@Ladas (Contributor) commented on Feb 1, 2017

looks good to me 👍 but to verify it actually works, you will need to test it with the custom automate :-D

@tzumainn (Contributor) commented on Feb 1, 2017

One minor comment: maybe it's worthwhile to alphabetize the order of the methods? Other than that, looks good to me!

@blomquisg (Member):

LGTM.

@gmcculloug anything else from the automate side you want to point out?

@gmcculloug (Member) left a comment

This PR gets us a step closer to the goal, but after the provision fails the task will have an error status and cannot be restarted.

While the automate state-machine does support jumping back to a previous state, there will still need to be logic to reset the task (and likely the request) to a non-error status before retrying.

That work can be the focus of a separate PR.

end
end

def destroy_if_failed
Member:

This method seems odd to me. Automate has access to the raw_power_state since it is a db column and the destroy method is also available to automate from the remove_from_disk method defined in the base class.

Is it really needed?
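
For reference, a hedged sketch of doing this directly from automate with what the base class already exposes (the 'error' power-state value checked here is an assumption):

# Minimal automate sketch: raw_power_state is a plain DB column and
# remove_from_disk is defined on the base service model, so no new
# destroy_if_failed method is needed.
vm = $evm.root['vm']
vm.remove_from_disk if vm && vm.raw_power_state.to_s.downcase == 'error'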

Contributor Author:

It sounds like it's no longer necessary based on what you are saying. I don't think I would have realized remove_from_disk would do the same thing prior to you mentioning it.

@mansam (Contributor, author) commented on Feb 7, 2017

When you say "cannot be restarted" do you mean that the "on_error" handler can't retry the job for some reason, or something else?

@gmcculloug (Member):

I have not tried it, but I am pretty sure that once the task has an error status we cannot jump back a few steps in the state machine, select different options, and re-run provisioning without resetting the state/status of the task and request. They will need to be reset as part of the logic to select new options and retry.

@mansam (Contributor, author) commented on Feb 8, 2017

If I am understanding the documentation right, I would just have to do something like the following inside the on_error handler:

# Provisioning failed, retry from the placement step
$evm.root['ae_result'] = 'restart' 
$evm.root['ae_next_state'] = 'Placement'

And again, as long as I understand correctly, there would only be changes in the automate domain (under OpenStack's CheckProvision on_error handler, for example), not in any of the core code here?
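
For completeness, a hedged sketch of how that could look as the body of an on_error method, with a simple retry cap; the state-variable name and the limit of 3 are illustrative, not from this PR:

# Illustrative on_error body: bounce back to Placement a limited number of times.
retries = $evm.get_state_var(:placement_retries).to_i

if retries < 3
  $evm.set_state_var(:placement_retries, retries + 1)
  $evm.root['ae_result']     = 'restart'
  $evm.root['ae_next_state'] = 'Placement'
else
  $evm.root['ae_result'] = 'error'
  $evm.log(:error, "Provisioning failed after #{retries} placement retries")
end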

@gmcculloug (Member):

@mansam You are correct about the restart/next_state logic. At this point I would suggest testing the approach since no one has done this yet. You might find there are other issues that need to be addressed to fully support this feature.

@mansam (Contributor, author) commented on Feb 8, 2017

@gmcculloug Okay, gotcha. I have actually tested this (I have some automate changes locally to go with this) and it does appear to work correctly. I didn't realize this was relatively new ground. Thanks for your assistance.

Ladas and others added 7 commits February 8, 2017 15:36
Adding methods for querying the utilization of networks and exposing them to automate.

These methods allow, for example, deployment to the least utilized private network, or deployment to the least utilized private network connected to the least utilized public network.

Implements BZ
https://bugzilla.redhat.com/show_bug.cgi?id=1205392
@miq-bot (Member) commented on Feb 8, 2017

Checked commits mansam/manageiq@bf8eb2d~...cf578db with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0
8 files checked, 2 offenses detected

app/models/manageiq/providers/openstack/network_manager/cloud_network.rb

@tzumainn (Contributor) commented on Feb 8, 2017

@mansam thanks for the updates, and @gmcculloug thanks for your comments! It seems that the issues for this particular PR are resolved - is this mergeable now, or is there something overlooked?

@gmcculloug (Member):

Just need to confirm with @bdunne that it is OK to merge in light of PR #13783, which is moving the automate engine and related files to a new repo. He is finalizing that work now.

@blomquisg merged commit 53dfa5f into ManageIQ:master on Feb 9, 2017
@blomquisg added this to the Sprint 54 Ending Feb 13, 2017 milestone on Feb 9, 2017
@tzumainn (Contributor):

@miq-bot add_label enhancement

@tzumainn (Contributor):

@miq-bot add_label provisioning,providers/openstack

@miq-bot (Member) commented on Feb 13, 2017

@tzumainn Cannot apply the following label because they are not recognized: providers/openstack

@tzumainn (Contributor):

@miq-bot add_label providers/openstack/cloud

@gmcculloug (Member):

@mansam @blomquisg @tzumainn This merged PR lists a dependency in its description on ManageIQ/manageiq-gems-pending#43, which is not merged.

Can someone take a look at this?

@tzumainn (Contributor):

Ah, I took a quick look, and it looks good to me! I can't merge however.
