
Advanced networking placement features and automate exposure #13608

Merged: 8 commits into ManageIQ:master on Feb 9, 2017

Conversation

@mansam (Contributor) commented on Jan 20, 2017

This PR is an attempt to get #4899 fixed up for the current codebase. The main thing that might still need to change is the state machine code, so that error conditions while provisioning OpenStack Cloud instances are handled without hanging until the provision request times out. @Ladas @gmcculloug

Depends on ManageIQ/manageiq-gems-pending#43

--

Adding methods for querying the utilization of networks and exposing them to automate.

These methods allow, for example, deployment to the least utilized private network, or deployment to the least utilized private network connected to the least utilized public network.

Implements BZ
https://bugzilla.redhat.com/show_bug.cgi?id=1205392
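
As a rough illustration of the placement this enables, here is a minimal automate-style sketch, not code from this PR: ip_address_total_count is exposed by the PR, while ip_address_used_count is an assumed counterpart (the discussion below also mentions ip_address_used_count_live).

# Hedged sketch: place onto the least utilized cloud network.
# ip_address_total_count comes from this PR; ip_address_used_count is assumed.
networks = $evm.vmdb(:cloud_network).all

least_utilized = networks.min_by do |net|
  total = net.ip_address_total_count.to_f
  total.zero? ? Float::INFINITY : net.ip_address_used_count / total
end

$evm.log(:info, "Placing on network #{least_utilized.name}") if least_utilized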

@chessbyte added the wip label on Jan 21, 2017
@@ -47,6 +47,68 @@ def self.class_by_ems(ext_management_system, external = false)
end
end

def ip_address_total_count
Contributor:

Hm, as I'm looking at it, this probably needs to go to the OpenStack STI subclass, since there are OpenStack-specific things here.

Or we separate the base class and the OpenStack class and add checks: e.g. allocation_pools must be defined, and ip_address_used_count_live does an OpenStack-specific query.
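
A rough sketch of that split, with method bodies that are assumptions based on this comment rather than the code in the PR:

require 'ipaddr'

# Base class: generic arithmetic only, relying on allocation_pools being populated.
class CloudNetwork < ApplicationRecord
  def ip_address_total_count
    cloud_subnets.sum do |subnet|
      subnet.allocation_pools.to_a.sum do |pool|
        IPAddr.new(pool["end"]).to_i - IPAddr.new(pool["start"]).to_i + 1
      end
    end
  end
end

# OpenStack STI subclass: provider-specific live query (illustrative Fog/Neutron call).
class ManageIQ::Providers::Openstack::NetworkManager::CloudNetwork < ::CloudNetwork
  def ip_address_used_count_live
    ext_management_system.with_provider_connection(:service => "Network") do |conn|
      conn.ports.count { |port| port.network_id == ems_ref }
    end
  end
end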

@@ -8,5 +8,13 @@ class MiqAeServiceCloudNetwork < MiqAeServiceModelBase
expose :floating_ips, :association => true
expose :network_ports, :association => true
expose :network_routers, :association => true

expose :ip_address_total_count
Contributor:

could you also expose this only on the OpenStack automate model?

@@ -12,5 +14,8 @@ class MiqAeServiceVmCloud < MiqAeServiceVm
expose :floating_ips, :association => true
expose :security_groups, :association => true
expose :key_pairs, :association => true
expose :associate_floating_ip
Contributor:

These 3 should only be on the OpenStack automate model; it should follow the real model placement.

@Ladas (Contributor) commented on Jan 24, 2017

Last 2 comments, otherwise looks good.

@gmcculloug for the automate part, not sure if you have something else in mind, but the continuation condition should include all terminal states, which are [:active, :error]. Or it could possibly go through the on_error branch, where we could put the code that resets the step and tries another network that might not fail.

@mansam (Contributor, author) commented on Jan 24, 2017

@Ladas I've moved those things over onto the OpenStack automate models, and I've also had to rename the associate_floating_ip method to associate_floating_ip_from_network to avoid a name conflict and to better indicate how it works compared to the other related method.

@@ -4,7 +4,7 @@ def do_clone_task_check(clone_task_ref)
instance = openstack.handled_list(:servers).detect { |s| s.id == clone_task_ref }
status = instance.state.downcase.to_sym

return true if status == :active
return true if [:active, :error].include?(status)
Member:

@mansam If we know there is an error you could raise an MiqException::MiqProvisionError, which would get processed by the state machine here (https://github.com/ManageIQ/manageiq/blob/master/app/models/miq_request_task/state_machine.rb#L23).

This would be better than returning true and then having poll_destination_in_vmdb move into the customize_destination state.
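
A minimal sketch of that suggestion applied to the check above; the trailing in-progress return and the message text are assumptions, not the exact code merged here:

def do_clone_task_check(clone_task_ref)
  instance = openstack.handled_list(:servers).detect { |s| s.id == clone_task_ref }
  status   = instance.state.downcase.to_sym

  # Hand the failure to the state machine instead of reporting success.
  if status == :error
    raise MiqException::MiqProvisionError,
          "An error occurred while provisioning instance #{clone_task_ref}"
  end

  return true if status == :active
  return false, "Instance #{clone_task_ref} is still #{status}"   # assumed in-progress return
end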

@mansam (Contributor, author) commented on Jan 24, 2017

Thanks @gmcculloug. So if I understand properly, it would be sufficient to raise the exception right here where the status is checked?

Member:

Correct, I think that will do what you want but would suggest testing it out to validate.

Contributor Author:

👍

Contributor:

@gmcculloug Right, so raising the exception should move us into the on_error state of the state machine? I think that looks cleaner; then we will need a method inside the on_error state that picks another network and retries the provisioning step. Can we do that?

Contributor Author:

@Ladas what code is responsible for picking a network in the first place?

@mansam (Contributor, author) commented on Jan 26, 2017

@Ladas Regarding network picking and retry logic in on_error: was any part of your original PR or any other PR doing that? I don't see any indication of where automatic IP allocation was happening.

@Ladas (Contributor) commented on Jan 26, 2017

@mansam So those are only custom automate state machines right now, attached to the BZ. Some of it could probably go into the default automate state machine for OpenStack.

@mansam (Contributor, author) commented on Jan 27, 2017

@Ladas @gmcculloug Is special handling in on_error necessary, or will a thrown exception automatically prompt the state machine to restart up to its max number of retries?

@Ladas (Contributor) commented on Jan 27, 2017

@mansam Right, so the current automate scripts were just checking whether the state was error, not going through the error state.

This was necessary when deploying many VMs at once, especially if you wanted to keep your networks utilized to the max. The VMs are deployed in parallel, so it will happen that, e.g., when deploying 10 VMs into the most utilized network that has 5 IPs left, 5 of them will fail, so those need to retry the step and look again for the most utilized network that is not full.

Also, you can decide that based on available floating IPs, because by picking a private network you also see how many floating IPs will be available. And it can happen that when you get to a state machine step that assigns a floating IP, there are none left, so you need to go several automate states back and pick a different private network.

So it's really only needed for more complex customer use cases, where you want a more complex network placement strategy and deploy many VMs at once. And you want them all to finish. :-)
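
As a sketch of the re-pick described above, assuming a used-count counterpart to ip_address_total_count and illustrative state-variable names (none of this is code from the PR):

# Hedged sketch: on retry, pick the most utilized network that still has room,
# skipping networks that already failed for this request.
failed   = Array($evm.get_state_var(:failed_network_ids))
networks = $evm.vmdb(:cloud_network).all.reject { |n| failed.include?(n.id) }

candidate = networks
            .select { |n| n.ip_address_used_count < n.ip_address_total_count }
            .max_by { |n| n.ip_address_used_count.to_f / n.ip_address_total_count }

if candidate
  $evm.set_state_var(:chosen_network_id, candidate.id)
else
  $evm.log(:warn, "No network with free addresses left; giving up")
end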

@mansam (Contributor, author) commented on Jan 27, 2017

@Ladas Alright, so if I understand correctly, it sounds like this PR is sufficient then, and anything else will have to be done in Automate.

@mansam changed the title from "[WIP] Advanced networking placement features and automate exposure" to "Advanced networking placement features and automate exposure" on Jan 27, 2017
@Ladas (Contributor) commented on Jan 27, 2017

@mansam I think yes.

@tzumainn (Contributor):

@lsmola sounds like this is good to go? your review on this is much better than mine

@tzumainn (Contributor):

@Ladas ^

@Ladas (Contributor) commented on Feb 1, 2017

looks good to me 👍 but to verify it actually works, you will need to test it with the custom automate :-D

@tzumainn (Contributor) commented on Feb 1, 2017

One minor comment: maybe it's worthwhile to alphabetize the order of the methods? Other than that, looks good to me!

@blomquisg (Member):

LGTM.

@gmcculloug anything else from the automate side you want to point out?

@gmcculloug (Member) left a comment

This PR gets us a step closer to the goal, but after the provision fails the task will have an error status and cannot be restarted.

While the automate state-machine does support jumping back to a previous state, there will still need to be logic to reset the task (and likely the request) to a non-error status before retrying.

That work can be the focus of a separate PR.

end
end

def destroy_if_failed
Member:

This method seems odd to me. Automate has access to the raw_power_state since it is a db column and the destroy method is also available to automate from the remove_from_disk method defined in the base class.

Is it really needed?
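
For reference, a hedged sketch of doing this directly from automate with what the base class already exposes (the 'error' power-state value checked here is an assumption):

# Minimal automate sketch: raw_power_state is a plain DB column and
# remove_from_disk is defined on the base service model, so no new
# destroy_if_failed method is needed.
vm = $evm.root['vm']
vm.remove_from_disk if vm && vm.raw_power_state.to_s.downcase == 'error'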

Contributor Author:

It sounds like it's no longer necessary based on what you are saying. I don't think I would have realized remove_from_disk would do the same thing prior to you mentioning it.

@mansam (Contributor, author) commented on Feb 7, 2017

When you say "cannot be restarted" do you mean that the "on_error" handler can't retry the job for some reason, or something else?

@gmcculloug (Member):

I have not tried it, but I am pretty sure that once the task has an error status we cannot jump back a few steps in the state machine, select different options, and re-run provisioning without resetting the state/status of the task and request. They will need to be reset as part of the logic to select new options and retry.

@mansam (Contributor, author) commented on Feb 8, 2017

If I am understanding the documentation right, I would just have to do something like the following inside the on_error handler:

# Provisioning failed, retry from the placement step
$evm.root['ae_result'] = 'restart' 
$evm.root['ae_next_state'] = 'Placement'

And again, as long as I understand correctly, there would only be changes in the automate domain (under OpenStack's CheckProvision on_error handler, for example), not in any of the core code here?
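
For completeness, a hedged sketch of how that could look as the body of an on_error method, with a simple retry cap; the state-variable name and the limit of 3 are illustrative, not from this PR:

# Illustrative on_error body: bounce back to Placement a limited number of times.
retries = $evm.get_state_var(:placement_retries).to_i

if retries < 3
  $evm.set_state_var(:placement_retries, retries + 1)
  $evm.root['ae_result']     = 'restart'
  $evm.root['ae_next_state'] = 'Placement'
else
  $evm.root['ae_result'] = 'error'
  $evm.log(:error, "Provisioning failed after #{retries} placement retries")
end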

@gmcculloug (Member):

@mansam You are correct about the restart/next_state logic. At this point I would suggest testing the approach since no one has done this yet. You might find there are other issues that need to be addressed to fully support this feature.

@mansam (Contributor, author) commented on Feb 8, 2017

@gmcculloug Okay, gotcha. I have actually tested this (I have some automate changes locally to go with this) and it does appear to work correctly. I didn't realize this was relatively new ground. Thanks for your assistance.

Ladas and others added 7 commits February 8, 2017 15:36
Adding methods for querying the utilization of networks and exposing them to automate.

These methods allow, for example, deployment to the least utilized private network, or deployment to the least utilized private network connected to the least utilized public network.

Implements BZ
https://bugzilla.redhat.com/show_bug.cgi?id=1205392
@miq-bot (Member) commented on Feb 8, 2017

Checked commits mansam/manageiq@bf8eb2d~...cf578db with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0
8 files checked, 2 offenses detected

app/models/manageiq/providers/openstack/network_manager/cloud_network.rb

@tzumainn (Contributor) commented on Feb 8, 2017

@mansam thanks for the updates, and @gmcculloug thanks for your comments! It seems that the issues for this particular PR are resolved - is this mergeable now, or is there something overlooked?

@gmcculloug (Member):

Just need to confirm with @bdunne that it is OK to merge in light of PR #13783, which is moving the automate engine and related files to a new repo. He is finalizing that work now.

@blomquisg merged commit 53dfa5f into ManageIQ:master on Feb 9, 2017
@blomquisg added this to the Sprint 54 Ending Feb 13, 2017 milestone on Feb 9, 2017
@tzumainn (Contributor):

@miq-bot add_label enhancement

@tzumainn (Contributor):

@miq-bot add_label provisioning,providers/openstack

@miq-bot (Member) commented on Feb 13, 2017

@tzumainn Cannot apply the following label because they are not recognized: providers/openstack

@tzumainn (Contributor):

@miq-bot add_label providers/openstack/cloud

@gmcculloug (Member):

@mansam @blomquisg @tzumainn This merged PR lists a dependency in its description on ManageIQ/manageiq-gems-pending#43, which is not merged.

Can someone take a look at this?

@tzumainn (Contributor):

Ah, I took a quick look, and it looks good to me! I can't merge however.
