Victoria ossa 2023 002 #22

Open
wants to merge 72 commits into base: stackhpc/victora

Conversation

JohnGarbutt
Member

No description provided.

sbauza and others added 30 commits October 14, 2020 21:00
Some GPU drivers like i915 don't provide a name attribute for mdev types.
As we don't use this attribute yet, let's just make sure we support the fact
it's optional.

Change-Id: Ia745ed7095c74e2bfba38379e623a3f81e7799eb
Closes-Bug: #1896741
(cherry picked from commit 416cd1a)
When migrating/resizing a VM to a destination
host whose VIF type differs from the source VIF
type, the operation fails due to an exception while
unplugging the VIF on the source host after the user performs a
confirmation action.

This change unplugs the vifs in resize_instance and
wraps the call to unplug in confirm with a try/except
block. The call to unplug_vifs in confirm is not removed,
to support rolling upgrades, but a TODO is added to remove
it after the Wallaby release.

Change-Id: I2c195df5fcf844c0587933b5b5995bdca1a3ebed
Closes-Bug: #1895220
(cherry picked from commit 66c7f00)
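A rough illustration of the tolerant unplug described above (a sketch with a placeholder driver object, not Nova's actual confirm_resize code):

    import logging

    LOG = logging.getLogger(__name__)


    def confirm_resize_unplug(driver, instance, network_info):
        # TODO: remove once every compute unplugs in resize_instance
        # (i.e. after the Wallaby release, per the commit message above).
        try:
            driver.unplug_vifs(instance, network_info)
        except Exception:
            # The vifs were most likely already unplugged in resize_instance
            # on the source host, so failure here is logged, not fatal.
            LOG.exception('Ignoring unplug_vifs failure during resize '
                          'confirm for instance %s', instance)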
Bug 1853009 describes a race condition involving multiple nova-compute
services with ironic. As the compute services start up, the hash ring
rebalances, and the compute services have an inconsistent view of which
is responsible for a compute node.

The sequence of actions here is adapted from a real world log [1], where
multiple nova-compute services were started simultaneously. In some
cases mocks are used to simulate race conditions.

There are three main issues with the behaviour:

* host2 deletes the orphan compute node after host1 has taken
  ownership of it.

* host1 assumes that another compute service will not delete its nodes.
  Once a node is in rt.compute_nodes, it is not removed again unless the
  node is orphaned. This prevents host1 from recreating the compute
  node.

* host1 assumes that another compute service will not delete its
  resource providers. Once an RP is in the provider tree, it is not
  removed.

This functional test documents the current behaviour, with the idea that
it can be updated as this behaviour is fixed.

[1] http://paste.openstack.org/show/786272/

Co-Authored-By: Matt Riedemann <[email protected]>

Change-Id: Ice4071722de54e8d20bb8c3795be22f1995940cd
Related-Bug: #1853009
Related-Bug: #1853159
(cherry picked from commit 4a894fefc4f99db81d967cae0c23454c5a62b8ba)
There is a race condition in nova-compute with the ironic virt driver as
nodes get rebalanced. It can lead to compute nodes being removed in the
DB and not repopulated. Ultimately this prevents these nodes from being
scheduled to.

The issue being addressed here is that if a compute node is deleted by a host
which thinks it is an orphan, then the compute host that actually owns the node
might not recreate it if the node is already in its resource tracker cache.

This change fixes the issue by clearing nodes from the resource tracker cache
for which a compute node entry does not exist. Then, when the available
resource for the node is updated, the compute node object is not found in the
cache and gets recreated.

Change-Id: I39241223b447fcc671161c370dbf16e1773b684a
Partial-Bug: #1853009
(cherry picked from commit 8f5a078dd7bbe5b6b38cf8e04d916281dc418409)
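A minimal sketch of the cache-clearing step described above; the data structures are simplified stand-ins for the resource tracker's internals:

    def remove_deleted_compute_nodes_from_cache(tracked_compute_nodes, db_nodenames):
        """tracked_compute_nodes: dict mapping nodename -> cached compute node.
        db_nodenames: set of nodenames that still have a DB record."""
        for nodename in list(tracked_compute_nodes):
            if nodename not in db_nodenames:
                # Another host deleted this compute node; drop the stale
                # cache entry so update_available_resource recreates the
                # record instead of assuming it still exists.
                del tracked_compute_nodes[nodename]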
There is a race condition in nova-compute with the ironic virt driver
as nodes get rebalanced. It can lead to compute nodes being removed in
the DB and not repopulated. Ultimately this prevents these nodes from
being scheduled to.

The issue being addressed here is that if a compute node is deleted by a
host which thinks it is an orphan, then the resource provider for that
node might also be deleted. The compute host that owns the node might
not recreate the resource provider if it exists in the provider tree
cache.

This change fixes the issue by clearing resource providers from the
provider tree cache for which a compute node entry does not exist. Then,
when the available resource for the node is updated, the resource
providers are not found in the cache and get recreated in placement.

Change-Id: Ia53ff43e6964963cdf295604ba0fb7171389606e
Related-Bug: #1853009
Related-Bug: #1841481
(cherry picked from commit 7dcc6bfa63c1ab06d86b07bcdd05838a8ad35dec)
There is a race condition in nova-compute with the ironic virt driver as
nodes get rebalanced. It can lead to compute nodes being removed in the
DB and not repopulated. Ultimately this prevents these nodes from being
scheduled to.

The main race condition involved is in update_available_resources in
the compute manager. When the list of compute nodes is queried, there is
a compute node belonging to the host that it does not expect to be
managing, i.e. it is an orphan. Between that time and deleting the
orphan, the real owner of the compute node takes ownership of it (in
the resource tracker). However, the node is still deleted as the first
host is unaware of the ownership change.

This change prevents this from occurring by filtering on the host when
deleting a compute node. If another compute host has taken ownership of
a node, it will have updated the host field and this will prevent
deletion from occurring. The first host sees this has happened via the
ComputeHostNotFound exception, and avoids deleting its resource
provider.

Closes-Bug: #1853009
Related-Bug: #1841481
Change-Id: I260c1fded79a85d4899e94df4d9036a1ee437f02
(cherry picked from commit 65a6756fa17ebfd644c76953c943ba8a4ad3787d)
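A conceptual sketch of the host-filtered delete described above; the SQLAlchemy model, session, and exception are simplified placeholders rather than Nova's real DB API:

    class ComputeHostNotFound(Exception):
        pass


    def destroy_compute_node(session, ComputeNode, compute_node_id, host):
        # Filtering on host makes the delete a no-op if another compute
        # service has since taken ownership of the node (and updated 'host').
        deleted = (session.query(ComputeNode)
                   .filter_by(id=compute_node_id, host=host)
                   .delete())
        if deleted == 0:
            # Signal the caller so it also skips deleting the node's
            # resource provider.
            raise ComputeHostNotFound(
                'compute node %s is no longer owned by %s'
                % (compute_node_id, host))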
In the fix for bug 1839560 [1][2], soft-deleted compute nodes may be
restored, to ensure we can reuse ironic node UUIDs as compute node
UUIDs. While this seems to largely work, it results in some nasty errors
being generated [3]:

    InvalidRequestError This session is in 'inactive' state, due to the
    SQL transaction being rolled back; no further SQL can be emitted
    within this transaction.

This happens because compute_node_create is decorated with
pick_context_manager_writer, which begins a transaction. While
_compute_node_get_and_update_deleted claims that calling a second
pick_context_manager_writer decorated function will begin a new
subtransaction, this does not appear to be the case.

This change removes pick_context_manager_writer from the
compute_node_create function, and adds a new _compute_node_create
function which ensures the transaction is finished if
_compute_node_get_and_update_deleted is called.

The new unit test added here fails without this change.

This change marks the removal of the final FIXME from the functional
test added in [4].

[1] https://bugs.launchpad.net/nova/+bug/1839560
[2] https://git.openstack.org/cgit/openstack/nova/commit/?id=89dd74ac7f1028daadf86cb18948e27fe9d1d411
[3] http://paste.openstack.org/show/786350/
[4] https://review.opendev.org/#/c/695012/

Change-Id: Iae119ea8776bc7f2e5dbe2e502a743217beded73
Closes-Bug: #1853159
Related-Bug: #1853009
(cherry picked from commit 54038b7f914d624a6684b5c0f168bdf84872a60c)
The nova-compute service fails to start if the hypervisor has PCI
addresses with a 32 bit domain.

Change-Id: I48dcb7faa17fe9f8346445a1746cff5845baf358
Related-Bug: #1897528
(cherry picked from commit 976ac72)
Nova and QEMU [1] support PCI devices with a PCI address that has a 16 bit
domain. However, there are hypervisors that report PCI addresses with a
32 bit domain. While today we cannot assign these to guests, this should
not prevent the nova-compute service from starting.

This patch changes the PCI manager to ignore such PCI devices.

Please note that this patch does not change the fact that nova does not
allow specifying PCI addresses with a 32 bit domain in the
[pci]/passthrough_whitelist configuration. Such configuration is still
rejected at nova-compute service startup.

Closes-Bug: #1897528

[1] https://github.com/qemu/qemu/blob/f2a1cf9180f63e88bb38ff21c169da97c3f2bad5/hw/core/qdev-properties.c#L993

Change-Id: I59a0746b864610b6a314078cf5661d3d2b84b1d4
(cherry picked from commit 8c9d6fc)
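An illustrative sketch of the filtering behaviour described above; the device address format and helper name are assumptions, not the real PCI manager code:

    MAX_16BIT_PCI_DOMAIN = 0xFFFF


    def filter_assignable_pci_devices(devices):
        """devices: iterable of dicts with an 'address' like '10000:00:02.0'."""
        assignable = []
        for dev in devices:
            domain = int(dev['address'].split(':')[0], 16)
            if domain > MAX_16BIT_PCI_DOMAIN:
                # Ignore rather than fail: nova-compute should still start
                # even though this device can never be assigned to a guest.
                continue
            assignable.append(dev)
        return assignable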
Our console proxies (novnc, serial, spice) run in a websockify server
whose request handler inherits from the python standard
SimpleHTTPRequestHandler. There is a known issue [1] in the
SimpleHTTPRequestHandler which allows open redirects by way of URLs
in the following format:

  http://vncproxy.my.domain.com//example.com/%2F..

which if visited, will redirect a user to example.com.

We can intercept such requests by overriding the
SimpleHTTPRequestHandler.send_head() method that contains the
vulnerability, and rejecting any request whose redirection URL begins
with "//" with a 400 Bad Request.

This code is copied from a patch suggested in one of the issue comments
[2].

Closes-Bug: #1927677

[1] https://bugs.python.org/issue32084
[2] https://bugs.python.org/issue32084#msg306545

Conflicts:
    nova/console/websocketproxy.py
    nova/tests/unit/console/test_websocketproxy.py

NOTE(melwitt): The conflicts are because the following changes are not
in Victoria:

  Ib2c406327fef2fb4868d8050fc476a7d17706e23 (Remove six.moves)
  I58b0382c86d4ef798572edb63d311e0e3e6937bb (Refactor and rename
    test_tcp_rst_no_compute_rpcapi)

Change-Id: Ie36401c782f023d1d5f2623732619105dc2cfa24
(cherry picked from commit 781612b)
(cherry picked from commit 4709256)
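A minimal standalone sketch of the guard described above, built directly on the standard library handler; the class and message names are illustrative, not Nova's:

    import http.server


    class RedirectGuardHandler(http.server.SimpleHTTPRequestHandler):
        def send_head(self):
            # Paths starting with '//' would otherwise trigger a redirect
            # that browsers treat as scheme-relative, i.e. an open redirect
            # to an attacker-controlled host.
            if self.path.startswith('//'):
                self.send_error(400, 'URI must not start with //')
                return None
            return super().send_head()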
A while back, change I25baf5edd25d9e551686b7ed317a63fd778be533 moved
rbd_utils out from the libvirt driver and into a central location under
nova/storage. This move missed adding a __init__.py file under
nova/tests/unit/storage, so unit test discovery wasn't picking up the
rbd_utils tests and couldn't run them.

This adds a __init__.py file under nova/tests/unit/storage to get the
tests running again.

This also fixes a small bug introduced by change
I3032bbe6bd2d6acc9ba0f0cac4d00ed4b4464ceb in RbdTestCase.setUp() that
passed nonexistent self.images_rbd_pool to self.flags. It should be
self.rbd_pool.

Closes-Bug: #1928007

Change-Id: Ic03a5336abdced883f62f395690c0feac12075c8
(cherry picked from commit 8b647f1)
(cherry picked from commit 8f018d7)
If2608406776e0d5a06b726e65b55881e70562d18 dropped the single node
grenade job from the integrated-gate-compute template as it duplicates
the existing grenade-multinode job. However, it didn't remove the
remaining single node grenade job still present in the Nova project.

This change replaces the dsvm based nova-grenade-multinode job with the
zuulv3 native grenade-multinode based job.

Various legacy playbooks and hook scripts are also removed as they are
no longer used. Note that this does result in a loss of coverage for
ceph that should be replaced as soon as a zuulv3 native ceph based
multinode job is available.

Change-Id: I02b2b851a74f24816d2f782a66d94de81ee527b0
(cherry picked from commit 91e53e4)
(cherry picked from commit c45bedd)
Change Ifb3afb13aff7e103c2e80ade817d0e63b624604a added a nova side
config option for specifying neutron client retries that maps to the
ksa connect_retries config option to provide parity with the cinder and
glance clients that have nova side config options.

That change missed passing CONF.neutron.http_retries to the manual
client used for calling the port binding API. This sets the
connect_retries attribute on the manual ksa client so http_retries
will be honored.

Closes-Bug: #1929886
Related-Bug: #1866937

Change-Id: I8296e4be9f0706fab043451b856efadbb7bd41f6
(cherry picked from commit 56eb253)
(cherry picked from commit 46aa3f4)
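A hedged sketch of wiring the retry option into a manually built keystoneauth1 adapter, mirroring what the fix above does for the port binding client; the function name and option plumbing are assumptions:

    from keystoneauth1 import adapter


    def get_port_binding_client(sess, http_retries):
        """sess: an already-authenticated keystoneauth1 Session."""
        return adapter.Adapter(
            session=sess,
            service_type='network',
            # Honor the nova-side [neutron]http_retries setting for
            # connection failures, just like the other neutron clients.
            connect_retries=http_retries)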
NOTE(melwitt): This is a combination of two changes to avoid
intermittent test failure that was introduced by the original bug fix,
and was fixed by change I2bd360dcc6501feea7baf02d4510b282205fc061.

We have discovered that if an exception is raised at any point during
the running of the init_application WSGI script in an apache/mod_wsgi
Daemon Mode environment, it will prompt apache/mod_wsgi to re-run the
script without starting a fresh python process. Because we initialize
global data structures during app init, subsequent runs of the script
blow up as some global data do *not* support re-initialization. In any
case, it is not safe to assume that initialization of global data can
safely be run multiple times.

This mod_wsgi behavior appears to be a special situation that does not
behave the same as a normal reload in Daemon Mode as the script file is
being reloaded upon failure instead of the daemon process being
shutdown and restarted as described in the documentation [1].

In order to handle this situation, we can move the initialization of
global data structures to a helper method that is decorated to run only
once per python interpreter instance. This way, we will not attempt to
re-initialize global data that are not safe to init more than once.

Co-Authored-By: Michele Baldessari <[email protected]>
Co-Authored-By: melanie witt <[email protected]>

Conflicts:
    nova/api/openstack/wsgi_app.py

NOTE(melwitt): The conflict is because change
If4783adda92da33d512d7c2834f0bb2e2a9b9654 (Support sys.argv in wsgi
app) is not in Victoria.

Closes-Bug: #1882094

[1] https://modwsgi.readthedocs.io/en/develop/user-guides/reloading-source-code.html#reloading-in-daemon-mode

Reset global wsgi app state in unit test

Since I2bd360dcc6501feea7baf02d4510b282205fc061 there is a global state
set during the wsgi_app init making our unit test cases
non-deterministic based on the order of them. This patch makes sure
that the global state is reset for each test case.

Closes-Bug: #1921098
(cherry picked from commit bc2c19b)

Change-Id: I2bd360dcc6501feea7baf02d4510b282205fc061
(cherry picked from commit 7c9edc0)
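A minimal sketch of the run-once helper described above (not Nova's actual implementation), showing how global init can be made safe against repeated invocations of the WSGI script:

    import functools
    import threading


    def run_once(fn):
        """Run fn at most once per Python interpreter; later calls return
        the cached result instead of re-initializing global state."""
        lock = threading.Lock()

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            with lock:
                if not wrapper.called:
                    wrapper.result = fn(*args, **kwargs)
                    wrapper.called = True
            return wrapper.result

        wrapper.called = False
        wrapper.result = None
        return wrapper


    @run_once
    def init_global_data():
        # e.g. parse config, set up logging, register objects ...
        return {'initialized': True}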
Error-out the migrations (cold and live) whenever the
anti-affinity policy is violated. This addresses
violations when multiple concurrent migrations are
requested.

Added detection on:
- prep_resize
- check_can_live_migration_destination
- pre_live_migration

The improved method of detection now locks based on group_id
and considers other migrations in-progress as well.

Closes-bug: #1821755
Change-Id: I32e6214568bb57f7613ddeba2c2c46da0320fabc
(cherry picked from commit 33c8af1)
(cherry picked from commit 8b62a4e)
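An illustrative sketch of the group-scoped locking idea described above; the counting inputs and the one-member-per-host default are assumptions, not the real scheduler/compute check:

    from oslo_concurrency import lockutils


    def anti_affinity_allows_one_more(group_uuid, members_on_host,
                                      migrations_in_progress_to_host,
                                      max_per_host=1):
        """Return True if adding one more group member to the host keeps the
        anti-affinity policy satisfied. Serializing on the group id prevents
        two concurrent migrations from both passing the check."""
        @lockutils.synchronized(group_uuid)
        def _check():
            occupied = members_on_host + migrations_in_progress_to_host
            return occupied + 1 <= max_per_host
        return _check()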
By default, we don't currently allow the Nova microversion
header for CORS requests.  It should be something that is
included out of the box because it's part of the core API.

Deployers can workaround this by overriding allow_headers,
but we should provide a better experience out of the box.

Closes-Bug: #1931908
Change-Id: Idf4650f36952331f02d7512580c21451f3ee3b63
(cherry picked from commit b02a95a)
This currently runs in the 'check' pipeline, as part of the pep8 job,
which causes otherwise perfectly valid backports to report as failing
CI. There's no reason a stable core shouldn't be encouraged to review
these patches: we simply want to prevent them *merging* before their
parent(s). Resolve this conflict by moving the check to separate voting
job in the 'gate' pipeline as well as a non-voting job in the 'check'
pipeline to catch more obvious issues.

NOTE(lyarwood): Conflict as I672904e9bfb45a66a82331063c7d49c4bc0439df
isn't present on stable/victoria.

Conflicts:
  .zuul.yaml

Change-Id: Id3e4452883f6a3cf44ff58b39ded82e882e28c23
Signed-off-by: Stephen Finucane <[email protected]>
(cherry picked from commit 98b01c9)
(cherry picked from commit fef0305)
During the VM booting process Nova asks Neutron for the security groups
of the project. If no fields are specified, Neutron will
prepare the list of security groups with all fields, including rules.
If the project has many SGs, this may take a long time as rules need to
be loaded separately for each SG on Neutron's side.

During booting of the VM, Nova really needs only the "id" and "name" of
the security groups, so this patch limits the request to only those 2 fields.

This lazy loading of the SG rules was introduced in Neutron in [1] and
[2].

[1] https://review.opendev.org/#/c/630401/
[2] https://review.opendev.org/#/c/637407/

Related-Bug: #1865223
Change-Id: I15c3119857970c38541f4de270bd561166334548
(cherry picked from commit 388498a)
(cherry picked from commit 4f49545)
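A sketch of the narrower query described above, asking Neutron for only the 'id' and 'name' fields; the endpoint and token handling here are simplified and assumed:

    import requests


    def list_security_group_summaries(neutron_url, token, project_id):
        resp = requests.get(
            neutron_url + '/v2.0/security-groups',
            headers={'X-Auth-Token': token},
            # Repeated 'fields' parameters tell Neutron to skip loading the
            # (potentially large) rule lists for every security group.
            params={'project_id': project_id, 'fields': ['id', 'name']},
        )
        resp.raise_for_status()
        return resp.json()['security_groups']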
If the ceph df command fails in the get_pool_info
method of RBD utils, the actual command executed
is seen by the users in the fault error message.

This hides the command behind a StorageError
exception and logs the exception instead of leaking
it to the users.

Change-Id: I6e3a73f2e04d1a7636daf96d5af73c9cf2fbe220
Closes-Bug: 1926978
(cherry picked from commit 86af7fe)
(cherry picked from commit 5ede75c)
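A rough sketch of the error-hiding pattern described above; StorageError and the command runner are placeholders for the real rbd_utils code:

    import logging

    LOG = logging.getLogger(__name__)


    class StorageError(Exception):
        """Generic storage failure surfaced to users instead of raw output."""


    def get_pool_info(run_ceph_df):
        """run_ceph_df: callable executing 'ceph df'; may raise with the full
        command line embedded in its message."""
        try:
            return run_ceph_df()
        except Exception:
            # Log the real failure (including the command) for operators ...
            LOG.exception('Could not get pool information from ceph')
            # ... but expose only a generic error to the user-facing fault.
            raise StorageError('Error getting the pool information')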
Consider the following situation:

- Using the Ironic virt driver
- Replacing (so removing and re-adding) all baremetal nodes
  associated with a single nova-compute service

The update resources periodic will have destroyed the compute node
records because they're no longer being reported by the virt driver.
If we then attempt to manually delete the compute service record, the
database layer will raise an exception, as there are no longer any
compute node records for the host. This exception gets bubbled up as
an error 500 in the API. This patch adds a unit test to demonstrate
this.

Related bug: 1860312
Change-Id: I03eec634b25582ec9643cacf3e5868c101176983
(cherry picked from commit 32257a2)
(cherry picked from commit e6cd23c)
Consider the following situation:

- Using the Ironic virt driver
- Replacing (so removing and re-adding) all baremetal nodes
  associated with a single nova-compute service

The update resources periodic will have destroyed the compute node
records because they're no longer being reported by the virt driver.
If we then attempt to manually delete the compute service record, the
database layer will raise an exception, as there are no longer any
compute node records for the host. Previously, this exception would
get bubbled up as an error 500 in the API. This patch catches it and
allows service deletion to complete successfully.

Closes bug: 1860312
Change-Id: I2f9ad3df25306e070c8c3538bfed1212d6d8682f
(cherry picked from commit 880611d)
(cherry picked from commit df5158b)
This is a followup for change Ie36401c782f023d1d5f2623732619105dc2cfa24
to reduce mocking in the unit test coverage for it.

While backporting the bug fix, it was found to be incompatible with
earlier versions of Python < 3.6 due to a difference in internal
implementation [1].

This reduces the mocking in the unit test to be more agnostic to the
internals of the StreamRequestHandler (ancestor of
SimpleHTTPRequestHandler) and work across Python versions >= 2.7.

Related-Bug: #1927677

[1] python/cpython@34eeed4

Change-Id: I546d376869a992601b443fb95acf1034da2a8f36
(cherry picked from commit 214cabe)
(cherry picked from commit 9c2f297)
Balazs Gibizer and others added 30 commits September 27, 2021 14:39
During resize, on the source host, in resize_instance(), the instance.host
and .node are updated to point to the destination host. This indicates to
the source host's resource tracker that the allocation of this instance
does not need to be tracked as an instance but as an outbound migration
instead. However, for the source host's resource tracker to do that, it
needs to use the instance.old_flavor. Unfortunately the
instance.old_flavor is only set during finish_resize() on the
destination host. (resize_instance casts to finish_resize.) So it is
possible that a running resize_instance() sets the instance.host to point
to the destination and then, before finish_resize could set the
old_flavor, an update_available_resources periodic runs on the source
host. This causes the allocation of this instance not to be tracked as
an instance, as the instance.host points to the destination, but it is not
tracked as a migration either, as the instance.old_flavor is not yet set.
So the allocation on the source host is simply dropped by the periodic
job.

When such a migration is confirmed, confirm_resize() tries to drop
the same resource allocation again but fails as the pinned CPUs of the
instance have already been freed.

When such a migration is reverted instead, the revert succeeds but the
source host resource allocation will not contain the resource allocation
of the instance until the next update_available_resources periodic runs
and corrects it.

This does not affect resources tracked exclusively in placement (e.g.
VCPU, MEMORY_MB, DISK_GB) but it does affect NUMA-related resources that
are still tracked in the resource tracker (e.g. huge pages, pinned
CPUs).

This patch moves the setting of instance.old_flavor on the source node
into the same transaction that sets the instance.host to point to the
destination host, hence solving the race condition.

Change-Id: Ic0d6c59147abe5e094e8f13e0d3523b178daeba9
Closes-Bug: #1944759
(cherry picked from commit b841e55)
(cherry picked from commit d4edcd6)
(cherry picked from commit c8b04d1)
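A conceptual sketch of the fix described above: set old_flavor in the same save that flips host/node to the destination, so the periodic task can never observe one without the other; the instance object API is assumed:

    def switch_instance_to_destination(instance, dest_host, dest_node):
        instance.old_flavor = instance.flavor
        instance.host = dest_host
        instance.node = dest_node
        # A single save() means a single DB transaction: the source resource
        # tracker either still sees the instance on the source host, or sees
        # a migration with old_flavor available - never the in-between state.
        instance.save()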
The libvirt driver's power on and hard reboot destroy the domain first
and unplug the vifs, then recreate the domain and replug the vifs.
However, nova does not wait for the network-vif-plugged event before
unpausing the domain. This can cause the domain to start running and
requesting an IP via DHCP before the networking backend has finished
plugging the vifs.

So this patch adds a workaround config option to nova to wait for
network-vif-plugged events during hard reboot, the same way as nova waits
for this event during new instance spawn.

This logic cannot be enabled unconditionally as not all neutron
networking backends send plug-time events to wait for. Also, the logic
needs to be vnic_type dependent as ml2/ovs and the in-tree sriov backend
are often deployed together on the same compute. While ml2/ovs sends the
plug-time event, the sriov backend does not send it reliably. So the
configuration is not just a boolean flag but a list of vnic_types
instead. This way, waiting for the plug-time event for a vif that is
handled by ml2/ovs is possible while the instance has other vifs handled
by the sriov backend where no event can be expected.

Conflicts:
      nova/virt/libvirt/driver.py both
      I73305e82da5d8da548961b801a8e75fb0e8c4cf1 and
      I0b93bdc12cdce591c7e642ab8830e92445467b9a are not in
      stable/victoria

The stable/victoria specific changes:

* The list of accepted vnic_type-s are adapted to what is supported by
  neutron on this release. So vdpa, accelerator-direct, and
  accelerator-direct-physical are removed as they are only added in
  stable/wallaby

Change-Id: Ie904d1513b5cf76d6d5f6877545e8eb378dd5499
Closes-Bug: #1946729
(cherry picked from commit 68c970e)
(cherry picked from commit 0c41bfb)
(cherry picked from commit 89c4ff5)
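A sketch of the vnic_type-scoped waiting described above: only collect network-vif-plugged events for vifs whose vnic_type appears in the (assumed) workaround option; the data shapes are illustrative:

    def hard_reboot_events_to_wait_for(network_info, wait_vnic_types):
        """network_info: list of dicts with 'id' and 'vnic_type' keys.
        wait_vnic_types: e.g. ['normal'] when only ml2/ovs sends plug events."""
        events = []
        for vif in network_info:
            if vif.get('vnic_type') in wait_vnic_types:
                events.append(('network-vif-plugged', vif['id']))
        # vifs handled by backends that do not send plug-time events (e.g.
        # the in-tree sriov backend) are simply not waited for.
        return events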
Currently neutron can report ports as having MAC addresses in upper case
when they're created like that. Meanwhile, the libvirt configuration
file always stores the MAC in lower case, which leads to a KeyError
while trying to retrieve migrate_vif.

Conflicts:
  nova/tests/unit/virt/libvirt/test_migration.py

Note: conflict is caused by not having six.text_type removal patch
I779bd1446dc1f070fa5100ccccda7881fa508d79 in stable/victoria.
Original assertion method was preserved to resolve this conflict.

Closes-Bug: #1945646
Change-Id: Ie3129ee395427337e9abcef2f938012608f643e1
(cherry picked from commit 6a15169)
(cherry picked from commit 63a6388)
(cherry picked from commit 6c3d5de)
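A minimal sketch of case-insensitive MAC matching as described above; the data shapes are assumptions:

    def find_migrate_vif(migrate_vifs, libvirt_mac):
        """migrate_vifs: list of dicts with an 'address' (MAC) key.
        libvirt_mac: MAC as stored by libvirt, always lower case."""
        wanted = libvirt_mac.lower()
        for vif in migrate_vifs:
            # Normalize both sides so an upper-case Neutron MAC still matches.
            if vif['address'].lower() == wanted:
                return vif
        raise KeyError(wanted)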
This patch adds a functional test that reproduces a race between an
incoming migration and the update_available_resource periodic.

Fixes:
    - Added more memory to the mock 'host_info', since the default would
      not fit the instance. The default was changed in later releases.

Change-Id: I4be429c56aaa15ee12f448978c38214e741eae63
Related-Bug: #1953359
(cherry picked from commit c59224d)
(cherry picked from commit f0a6d94)
(cherry picked from commit d8859e4)
This patch extends the original reproduction
I4be429c56aaa15ee12f448978c38214e741eae63 to cover
bug 1952915 as well, as they have a common root cause.

Change-Id: I57982131768d87e067d1413012b96f1baa68052b
Related-Bug: #1953359
Related-Bug: #1952915
(cherry picked from commit 9f296d7)
(cherry picked from commit 0411962)
(cherry picked from commit 94f17be)
There is a race condition between an incoming resize and the
update_available_resource periodic in the resource tracker. The race
window starts when the resize_instance RPC finishes and ends when the
finish_resize compute RPC finally applies the migration context on the
instance.

In the race window, if the update_available_resource periodic is run on
the destination node, then it will see the instance as being tracked on
this host as the instance.node is already pointing to the dest. But the
instance.numa_topology still points to the source host topology as the
migration context is not applied yet. This leads to a CPU pinning error
if the source topology does not fit the dest topology. It also stops the
periodic task and leaves the tracker in an inconsistent state. The
inconsistent state is only cleaned up after the periodic is run outside
of the race window.

This patch applies the migration context temporarily to the specific
instances during the periodic to keep resource accounting correct.

Change-Id: Icaad155e22c9e2d86e464a0deb741c73f0dfb28a
Closes-Bug: #1953359
Closes-Bug: #1952915
(cherry picked from commit 32c1044)
(cherry picked from commit 1235dc3)
(cherry picked from commit 5f2f283)
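A rough sketch of applying the migration context only for the duration of the periodic accounting, as described above; the instance object methods are assumed:

    import contextlib


    @contextlib.contextmanager
    def temporarily_applied_migration_context(instance):
        applied = False
        if getattr(instance, 'migration_context', None) is not None:
            # Make instance.numa_topology reflect the destination host while
            # the resource tracker does its accounting ...
            instance.apply_migration_context()
            applied = True
        try:
            yield instance
        finally:
            if applied:
                # ... then restore the unapplied state; finish_resize will
                # apply and persist the context for real later.
                instance.revert_migration_context()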
During the PTG the TC discussed the topic and decided to drop the job
completely. Since the latest job configuration broke all stable gates
for nova (older than yoga), this is needed there to unblock our gates.
For dropping the job on master let's wait for the resolution, as the
gate is not broken there; hence the patch is stable-only.

Conflicts:
  .zuul.yaml
  lower-constraints.txt

NOTE(elod.illes): conflict is due to branch specific settings (job
template names, lower constraints changes).

Change-Id: I514f6b337ffefef90a0ce9ab0b4afd083caa277e
(cherry picked from commit 15b7271)
(cherry picked from commit ba3c5b8)
(cherry picked from commit 327693a)
When the tox 'docs' target is called, it first installs the dependencies
(listed in 'deps') in the 'installdeps' phase, then it installs nova (with
its requirements) in the 'develop-inst' phase. In the latter case 'deps' is
not used, so the constraints defined in 'deps' are not applied.
This could lead to failures on stable branches when new packages are
released that break the build. To avoid this, the simplest solution is
to pre-install requirements, i.e. add requirements.txt to 'docs' tox
target.

Conflicts:
  tox.ini

NOTE(elod.illes): conflict is due to branch specific upper constraints
file link.

Change-Id: I4471d4488d336d5af0c23028724c4ce79d6a2031
(cherry picked from commit 494e8d7)
(cherry picked from commit 1ac0d69)
(cherry picked from commit 64cc084)
(cherry picked from commit f66a570)
We have the placement-nova-tox-functional-py38 job defined and run
on the placement gate [1] to run the nova functional tests excluding
api and notification _sample_tests, and db-related tests, but that
job skips those tests via tox_extra_args, which is not the right way
to do it, as we are currently facing an error when tox_extra_args is
included in the tox siblings task:
- https://opendev.org/zuul/zuul-jobs/commit/c02c28a982da8d5a9e7b4ca38d30967f6cd1531d

- https://zuul.openstack.org/build/a8c186b2c7124856ae32477f10e2b9a4

Let's define a new tox env which can exclude the required tests
in the stestr command itself.

Conflicts:
    tox.ini

NOTE(melwitt): The conflict is because change
I672904e9bfb45a66a82331063c7d49c4bc0439df (Add functional-py39 testing)
is not in Victoria.

[1] https://opendev.org/openstack/placement/src/commit/bd5b19c00e1ab293fc157f4699bc4f4719731c25/.zuul.yaml#L83

Change-Id: I20d6339a5203aed058f432f68e2ec1af57030401
(cherry picked from commit 7b063e4)
(cherry picked from commit 64f5c1c)
(cherry picked from commit baf0d93)
This patch is based upon a downstream patch which came up in discussion
amongst the ironic community when some operators began discussing a case
where resource providers had disappeared from a running deployment with
several thousand baremetal nodes.

Discussion amongst operators and developers ensued and we were able
to determine that this was still an issue in the current upstream code
and that the time difference between collecting data and then reconciling
the records was a source of the issue. Per Arun, they have been running
this change downstream and had not seen any recurrences of the issue
since the patch was applied.

This patch was originally authored by Arun S A G, and below is his
original commit message.

An instance could be launched and scheduled to a compute node between
the get_uuids_by_host() call and the _get_node_list() call. If that
happens, the ironic node.instance_uuid may not be None but the
instance_uuid will be missing from the instance list returned by the
get_uuids_by_host() method. This is possible because _get_node_list()
takes several minutes to return in large baremetal clusters and a lot
can happen in that time.

This causes the compute node to be orphaned and the associated resource
provider to be deleted from placement. Once the resource provider is
deleted it is never created again until the service restarts. Since the
resource provider is deleted, subsequent boots/rebuilds to the same
host will fail.

This behaviour is visible on VMbooter nodes because they constantly
launch and delete instances, thereby increasing the likelihood
of this race condition happening in large ironic clusters.

To reduce the chance of this race condition we call _get_node_list()
first, followed by the get_uuids_by_host() method.

Change-Id: I55bde8dd33154e17bbdb3c4b0e7a83a20e8487e8
Co-Authored-By: Arun S A G <[email protected]>
Related-Bug: #1841481
(cherry picked from commit f84d591)
(cherry picked from commit 0c36bd2)
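An order-of-operations sketch of the reordering described above; the two callables stand in for _get_node_list() and get_uuids_by_host():

    def find_orphan_nodes(get_node_list, get_uuids_by_host, host):
        nodes = get_node_list()  # slow call: do it first
        instance_uuids = set(get_uuids_by_host(host))  # fast call: second
        # Any instance recorded on a node snapshotted above must already
        # exist, so it is guaranteed to appear in the uuid list fetched
        # afterwards; instances scheduled in between can no longer be
        # mistaken for orphans.
        return [node for node in nodes
                if node.instance_uuid
                and node.instance_uuid not in instance_uuids]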
When the nova-compute service starts, by default it attempts to start
up instance configuration state for aspects such as networking.
This is fine in most cases, and makes a lot of sense if the
nova-compute service is just managing virtual machines on a hypervisor.

This is done, one instance at a time.

However, when the compute driver is ironic, the networking is managed
as part of the physical machine lifecycle potentially all the way into
committed switch configurations. As such, there is no need to attempt
to call ``plug_vifs`` on every single instance managed by the
nova-compute process which is backed by Ironic.

Additionally, when using ironic, far more physical machines tend to be
managed per nova-compute service instance than when operating co-installed
with a hypervisor. Often this means a cluster of a thousand machines,
with three controllers, will see thousands of un-needed API calls upon
service start, which elongates the entire process and negatively
impacts operations.

In essence, nova.virt.ironic's plug_vifs call now does nothing,
and merely issues a debug LOG entry when called.

Closes-Bug: #1777608
Change-Id: Iba87cef50238c5b02ab313f2311b826081d5b4ab
(cherry picked from commit 7f81cf2)
(cherry picked from commit eb6d70f)
(cherry picked from commit f210115)
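A sketch of the resulting behaviour; this is an illustration, not the actual ironic driver code:

    import logging

    LOG = logging.getLogger(__name__)


    class IronicDriverSketch:
        def plug_vifs(self, instance, network_info):
            # Nothing to do: Ironic attaches VIFs as part of the node's own
            # lifecycle, so skipping this avoids thousands of needless calls
            # when a large nova-compute service starts up.
            LOG.debug('VIF plug requested for instance %s; ignored because '
                      'Ironic manages VIF attachment itself.', instance)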
This change adds a reproducer test for evacuating
a VM in the powering-off state.

While backporting to stable/victoria,
fixed conflicts in functional.integrated_helpers:
1 - Added placeholder NOT_SPECIFIED as an object.
    It's the default parameter value for _evacuate_server.
2 - Updated the _evacuate_server definition
    to allow the function to wait until the expected
    server state is reached.
3 - Added _start_server and _stop_server.
4 - Updated the _evacuate_server arguments for test_evacuate
as per the updated _evacuate_server signature.

Related-Bug: #1978983
Change-Id: I5540df6c7497956219c06cff6f15b51c2c8bc299
(cherry picked from commit 5904c7f)
(cherry picked from commit 6bd0bf0)
(cherry picked from commit 1e0af92)
(cherry picked from commit b57b0ee)
Ignore the instance task state and continue with the VM evacuation.

Closes-Bug: #1978983
Change-Id: I5540df6c7497956219c06cff6f15b51c2c8bc29d
(cherry picked from commit db919aa)
(cherry picked from commit 6d61fcc)
(cherry picked from commit 8e9aa71)
(cherry picked from commit 0b8124b)
Nova's use of libvirt's compareCPU() API served its purpose
over the years, but its design limitations break live migration in
subtle ways.  For example, the compareCPU() API compares against the
host physical CPUID.  Some of the features from this CPUID are not
exposed by KVM, and then there are some features that KVM emulates that
are not in the host CPUID.  The latter can cause bogus live migration
failures.

With QEMU >=2.9 and libvirt >= 4.4.0, libvirt will do the right thing in
terms of CPU compatibility checks on the destination host during live
migration.  Nova satisfies these minimum version requirements by a good
margin.  So, provide a workaround to skip the CPU comparison check on
the destination host before migrating a guest, and let libvirt handle it
correctly.  This workaround will be removed once Nova replaces the older
libvirt APIs with their newer and improved counterparts[1][2].

                - - -

Note that Nova's libvirt driver calls compareCPU() in another method,
_check_cpu_compatibility(); I did not remove its usage yet.  As it needs
more careful combing of the code, and then:

  - where possible, remove the usage of compareCPU() altogether, and
    rely on libvirt doing the right thing under the hood; or

  - where Nova _must_ do the CPU comparison checks, switch to the better
    libvirt CPU APIs -- baselineHypervisorCPU() and
    compareHypervisorCPU() -- that are described here[1].  This is work
    in progress[2].

[1] https://opendev.org/openstack/nova-specs/commit/70811da221035044e27
[2] https://review.opendev.org/q/topic:bp%252Fcpu-selection-with-hypervisor-consideration

Change-Id: I444991584118a969e9ea04d352821b07ec0ba88d
Closes-Bug: #1913716
Signed-off-by: Kashyap Chamarthy <[email protected]>
Signed-off-by: Balazs Gibizer <[email protected]>
(cherry picked from commit 267a406)
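A hedged sketch of gating the pre-check behind the workaround flag described above; the option and callable names are illustrative:

    def maybe_check_destination_cpu(conf, compare_cpu, guest_cpu_xml):
        if conf.workarounds.skip_cpu_compare_on_dest:
            # Let libvirt/QEMU perform the authoritative compatibility check
            # at migration time instead of pre-checking against the host
            # CPUID.
            return
        compare_cpu(guest_cpu_xml)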
Adds a regression test or reproducer for post live migration
failing at the destination; possible causes include failing to get
the instance network info or block device info.

changes:
adds returning the server from _live_migrate in _integrated_helpers

Related-Bug: #1628606
Change-Id: I48dbe0aae8a3943fdde69cda1bd663d70ea0eb19
(cherry picked from commit a20baec)
(cherry picked from commit 74a618a)
(cherry picked from commit 71e5a1d)
(cherry picked from commit 5efcc3f)
(cherry picked from commit ed1ea71)
This change adds a new _post_live_migration_update_host
function that wraps _post_live_migration and just ensures
that if we exit due to an exception, instance.host is set
to the destination host.

When we are in _post_live_migration, the guest has already
started running on the destination host and we cannot revert.
Sometimes admins or users will hard reboot the instance, expecting
that to fix everything when the VM enters the error state after
the failed migration. Previously this would end up recreating the
instance on the source node, leading to possible data corruption if
the instance used shared storage.

Change-Id: Ibc4bc7edf1c8d1e841c72c9188a0a62836e9f153
Partial-Bug: #1628606
(cherry picked from commit 8449b7c)
(cherry picked from commit 643b0c7)
(cherry picked from commit 17ae907)
(cherry picked from commit 15502dd)
(cherry picked from commit 43c0e40)
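A conceptual sketch of the wrapper described above: however the cleanup fails, instance.host must end up pointing at the destination, because the guest is already running there; the names are illustrative:

    def post_live_migration_update_host(do_post_live_migration, instance,
                                        dest_host):
        try:
            do_post_live_migration(instance, dest_host)
        except Exception:
            # The guest runs on the destination; record that before
            # re-raising so a later hard reboot cannot recreate it on the
            # source node.
            instance.host = dest_host
            instance.save()
            raise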
In response to bug 1927677 we added a workaround to
NovaProxyRequestHandler to respond with a 400 Bad Request if an open
redirect is attempted:

  Ie36401c782f023d1d5f2623732619105dc2cfa24
  I95f68be76330ff09e5eabb5ef8dd9a18f5547866

Recently in python 3.10.6, a fix has landed in cpython to respond with
a 301 Moved Permanently to a sanitized URL that has had extra leading
'/' characters removed.

This breaks our existing unit tests which assume a 400 Bad Request as
the only expected response.

This adds handling of a 301 Moved Permanently response and asserts that
the redirect location is the expected sanitized URL. Doing this instead
of checking for a given python version will enable the tests to continue
to work if and when the cpython fix gets backported to older python
versions.

While updating the tests, the opportunity was taken to commonize the
code of two unit tests that were nearly identical.

Conflicts:
    nova/tests/unit/console/test_websocketproxy.py

NOTE(melwitt): The conflict is because change
I58b0382c86d4ef798572edb63d311e0e3e6937bb
(Refactor and rename test_tcp_rst_no_compute_rpcapi) is not in
Victoria.

Related-Bug: #1927677
Closes-Bug: #1986545

Change-Id: I27441d15cc6fa2ff7715ba15aa900961aadbf54a
(cherry picked from commit 15769b8)
(cherry picked from commit 4a2b44c)
(cherry picked from commit 0e4a257)
(cherry picked from commit 3023e16)
NOTE(sbauza): Stable policy allows us to proactively merge a backport without waiting for the parent patch to be merged (exception to rule #4 in [1]). Marking [stable-only] in order to silence nova-tox-validate-backport

[1] https://docs.openstack.org/project-team-guide/stable-branches.html#appropriate-fixes

Conflicts vs wallaby in:
	nova/conf/compute.py
	nova/tests/unit/virt/test_images.py

Related-Bug: #1996188
Change-Id: I5a399f1d3d702bfb76c067893e9c924904c8c360
(cherry picked from commit bf8b6f698c23c631c0a71d77f2716ca292c45e74)
…023-002

Change-Id: I55447f5e2c0a82a1ac986e48c06b231eeaa11922
This reverts commit 7a7a223.

That commit was added because - tl;dr - upon revert resize, Neutron
with the OVS backend and the iptables security group driver would send
us the network-vif-plugged event as soon as we updated the port
binding.

That behaviour has changed with commit 66c7f00. With that commit,
we started unplugging the vifs on the source compute host when doing a
resize. When reverting the resize, the vifs have to be re-plugged again,
regardless of the networking backend in use. This renders commit
7a7a223 pointless, and it can be
reverted.

Backport is clean from master, and the TODO that was present in the
commit message on master is removed, as it's a driver interface change
and can only be done on master.

Closes-bug: 1952003
Change-Id: I3cb39a9ec2c260f422b3c48122b9db512cdd799b
(cherry picked from commit 0b0f40d)
(cherry picked from commit c3ebe0f)
(cherry picked from commit 36378de)
    Conflicts:
        zuul.yaml