There are a lot of test failures due to test_vm_life_cycle.py in multiple PRs due to host not available for migration of VMs.
#8438 (comment)
#8433 (comment)
#7344 (comment)
While debugging I noticed that the hosts get stuck in Connecting state because MS is waiting for a response of the ReadyCommand from the agent. Since we take a lock on connection and disconnection, restarting the agent doesn't work. To fix this, we have to restart the MS or wait for ~1 hour (default timeout).
On the agent side, it gets stuck waiting for a response from the Script execution.
To reproduce, run smoke/test_vm_life_cycle.py (TestSecuredVmMigration test class to be specific). Once the tests are complete, you will notice that some hosts are stuck in Connecting state. And restarting the agent fails due to the named lock. Locks on DB can be checked using the below query.
SELECT *
FROM performance_schema.metadata_locks
INNER JOIN performance_schema.threads ON THREAD_ID = OWNER_THREAD_ID
WHERE PROCESSLIST_ID <> CONNECTION_ID() \G;
This PR adds a wait for the ready command and a timeout to the Script execution to ensure that the thread doesn't get stuck and the named lock from database is released.
This PR fixes a regression caused by #8465 on advanced zones, import fails with:
2024-01-10 12:13:33,234 DEBUG [o.a.c.e.o.NetworkOrchestrator] (API-Job-Executor-3:ctx-991bbe9f job-128 ctx-f49517d4) (logid:d7b8e716) Allocating nic for vm 142272e8-9e2e-407b-9d7e-e9a03b81653c in network Network {"id": 204, "name": "Isolated", "uuid": "9679fac5-e3ac-4694-a57b-beb635340f39", "networkofferingid": 10} during import
2024-01-10 12:13:33,239 ERROR [o.a.c.v.UnmanagedVMsManagerImpl] (API-Job-Executor-3:ctx-991bbe9f job-128 ctx-f49517d4) (logid:d7b8e716) Failed to import NICs while importing vm: i-2-31-VM
com.cloud.exception.InsufficientVirtualNetworkCapacityException: Unable to acquire Guest IP address for network Network {"id": 204, "name": "Isolated", "uuid": "9679fac5-e3ac-4694-a57b-beb635340f39", "networkofferingid": 10}Scope=interface com.cloud.dc.DataCenter; id=1
at org.apache.cloudstack.engine.orchestration.NetworkOrchestrator.importNic(NetworkOrchestrator.java:4582)
at org.apache.cloudstack.vm.UnmanagedVMsManagerImpl.importNic(UnmanagedVMsManagerImpl.java:859)
at org.apache.cloudstack.vm.UnmanagedVMsManagerImpl.importVirtualMachineInternal(UnmanagedVMsManagerImpl.java:1198)
at org.apache.cloudstack.vm.UnmanagedVMsManagerImpl.importUnmanagedInstanceFromHypervisor(UnmanagedVMsManagerImpl.java:1511)
at org.apache.cloudstack.vm.UnmanagedVMsManagerImpl.baseImportInstance(UnmanagedVMsManagerImpl.java:1342)
at org.apache.cloudstack.vm.UnmanagedVMsManagerImpl.importUnmanagedInstance(UnmanagedVMsManagerImpl.java:1282)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
Also, addresses the VNC password field set instead of a fixed string
1. Problem description
In Apache CloudStack (ACS), when a VM is deployed in a host with the KVM hypervisor, an XML file is created in the assigned host, which has a property shares that defines the weight of the VM to access the host CPU. The value of this property has no unit, and it is a relative measure to calculate how much CPU a given VM will have in the host. However, this value has a limit, which depends on the version of cgroup utilized by the host's kernel. The problem lies at the range value of shares that varies between both versions: [2, 264144] for cgroups version 1; and [1, 10000] for cgroups version 2. Currently, ACS calculates the value of shares using Equation 1, presented below, where CPU is the number of cores and speed is the CPU frequency; both specified in the VM's compute offering. Therefore, if a compute offering has, for example, 6 cores at 2 GHz, the shares value will be 12000 and an exception will be thrown by libvirt if the host utilizes cgroup v2. The second version is becoming the default one in current Linux distributions; thus, it is necessary to address this limitation.
Equation 1
shares = CPU * speed
Fixes: #6744
2. Proposed changes
To address the problem described, we propose to apply a scale conversion considering the max shares of the host. Using the same formula currently utilized by ACS, it is possible to calculate the maximum shares of a VM for a given host. In other words, using the number of cores and the nominal speed of the host's CPU as the upper limit of shares allowed to a VM. Then, this value will be scaled to the allowed interval of [1, 10000] of cgroup v2 by using a linear scale conversion.
The VM shares would be calculated as Equation 2, presented below, where VM requested shares is the requested shares value calculated using Equation 1, cgroup upper limit is fixed with a value of 10000 (cgroups v2 upper limit), and host max shares is the maximum shares value of the host, calculated using Equation 1. Using Equation 2, the only case where a VM passes the cgroup v2 limit is when the user requests more resources than the host has, which is not possible with the current implementation of ACS.
Equation 2
shares = (VM requested shares * cgroup upper limit)/host max shares
To implement the proposal, the following APIs will be updated: deployVirtualMachine, migrateVirtualMachine and scaleVirtualMachine. When a VM is being deployed, a new verification will be added to find a suitable host. The max shares of each host will be calculated, and the VM calculated shares will be verified if it does not surpass the host's value. Likewise, the migration of VMs will have a similar new verification. Lastly, the scale of VMs will also have the same verification for the VM's host.
To determine the max shares of a given host, we will use the same equation currently used in ACS for calculating the shares of VMs, presented in Section 1. When Equation 1 is used to determine the maximum shares of a host, CPU is the number of cores of the host, and speed is the nominal CPU speed, i.e., considering the CPU's base frequency.
It is important to note that these changes are only for hosts with the KVM hypervisor using cgroup v2 for now.
This PR provides a new primary storage volume type called "FiberChannel" that allows access to volumes connected to hosts over fiber channel connections. It requires Multipath to provide path discovery and failover. Second, the PR adds an AdaptivePrimaryDatastoreProvider that abstracts how volumes are managed/orchestrated from the connector to communicate with the primary storage provider, using a ProviderAdapter interface, allowing the code interacting with the primary storage provider API's to be simpler and have no direct dependencies on Cloudstack code. Lastly, the PR provides an implementation of the ProviderAdapter classes for the HP Enterprise Primera line of storage solutions and the Pure Flash Array line of storage solutions.
This PR adds the capability in CloudStack to convert VMware Instances disk(s) to KVM using virt-v2v and import them as CloudStack instances. It enables CloudStack operators to import VMware instances from vSphere into a KVM cluster managed by CloudStack. vSphere/VMware setup might be managed by CloudStack or be a standalone setup.
CloudStack will let the administrator select a VM from an existing VMware vCenter in the CloudStack environment or external vCenter requesting vCenter IP, Datacenter name and credentials.
The migrated VM will be imported as a KVM instance
The migration is done through virt-v2v: https://access.redhat.com/articles/1351473, https://www.ovirt.org/develop/release-management/features/virt/virt-v2v-integration.html
The migration process timeout can be set by the setting convert.instance.process.timeout
Before attempting the virt-v2v migration, CloudStack will create a clone of the source VM on VMware. The clone VM will be removed after the registration process finishes.
CloudStack will delegate the migration action to a KVM host and the host will attempt to migrate the VM invoking virt-v2v. In case the guest OS is not supported then CloudStack will handle the error operation as a failure
The migration process using virt-v2v may not be a fast process
CloudStack will not perform any check about the guest OS compatibility for the virt-v2v library as indicated on: https://access.redhat.com/articles/1351473.
This PR adds new functionality to copy snapshots across zones and take snapshots for multiple zones.
Copy functionality is similar to template copy. The source zone acts as the web server from where the destination zone(s) can download the snapshot files. For this purpose, a new API - `copySnapshot` has been added. The response for copySnapshot will be returning zone and download details from the first destination zone of the request. This behaviour is similar to the `copyTemplate` API.
In a similar manner, multiple zones can be selected while taking the snapshots or creating snapshot policies. For this snapshot will be taken in the base zone(in which volume is present) and then copied to the additional zones. A new parameter - `zoneids` has been added to `createSnapshot` and `createSnapshotPolicy` APIs.
As snapshots can be present on multiple zones (secondary stores), a new parameter `zoneid` has been added to delete the snapshot copy on a specific zone.
`listSnapshots` API has been updated to allow listing snapshot entries for different zones/datastores. New parameters - `showUnique`, `locationType` have been added.
Events generated during snapshot operations will now be linked to the snapshot itself rather than the volume of the snapshot.
`listSnapshotPolicies` and `createSnapshotPolicy` APIs will return zone details of the zones in which backup will be scheduled for the policy.
----
New API added
`copySnapshot`
Request and response params updated for APIs
```
- listSnapshots
- deleteSnapshot
- createTemplate
- listZones
- listSnapshotPolicies
- createSnapshotPolicy
```
UI updated for
- Snapshot detail view
- Create snapshot form
- Create snapshot policy form
- Create volume (from snapshot) form
- Create template (from snapshot) form
Doc PR: https://github.com/apache/cloudstack-documentation/pull/344
PR: https://github.com/apache/cloudstack/pull/7873
* Trigger out of band VM state update via libvirt event when VM stops
* Add License headers, refactor nested try
---------
Co-authored-by: Marcus Sorensen <mls@apple.com>
* Allow configkey to set 'cloud-name' cloud-init metadata
* Update engine/api/src/main/java/com/cloud/vm/VirtualMachineManager.java
Co-authored-by: Daniel Augusto Veronezi Salvador <38945620+GutoVeronezi@users.noreply.github.com>
* Update server/src/main/java/com/cloud/network/NetworkModelImpl.java
Co-authored-by: Daniel Augusto Veronezi Salvador <38945620+GutoVeronezi@users.noreply.github.com>
* Update server/src/main/java/com/cloud/network/router/CommandSetupHelper.java
Co-authored-by: Daniel Augusto Veronezi Salvador <38945620+GutoVeronezi@users.noreply.github.com>
* Revert "Update server/src/main/java/com/cloud/network/router/CommandSetupHelper.java"
This reverts commit 8abc3e38c4f973227e77253a8bcf3c94827d187f.
* Revert "Update server/src/main/java/com/cloud/network/NetworkModelImpl.java"
This reverts commit 7f239be919112913d764b141e943acca4473f755.
* Rework/Fix review code suggestions
---------
Co-authored-by: Marcus Sorensen <mls@apple.com>
Co-authored-by: Daniel Augusto Veronezi Salvador <38945620+GutoVeronezi@users.noreply.github.com>
This fixes the following cases in which Solidfire storage integration
caused issues when using Solidfire datadisks with VMware:
1. Take Volume Snapshot of Solidfire data disk
2. Delete an active Instance with Solidfire data disk attached
3. Attach used existing Solidfire data disk to a running/stopped VM
4. Stop and Start an instance with Solidfire data disks attached
5. Expand disk by resizing Solidfire data disk by providing size
6. Expand disk by changing disk offering for the Solidfire data disk
Additional changes:
- Use VMFS6 as managed datastore type if the host supports
- Refactor detection and splitting of managed storage ds name in storage
processor
- Restrict storage rescanning for managed datastore when resizing
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
* 4.18:
Storage and volumes statistics tasks for StorPool primary storage (#7404)
proper storage construction (#6797)
guarantee MAC uniqueness (#7634)
server: allow migration of all VMs with local storage on KVM (#7656)
Add L2 networks to Zones with SG (#7719)
When deploying a VM is failed during the allocation process it may leave the resources that have been already allocated before the failure. They will get removed from the database as the whole code block is wrapped inside a transaction twice but the server would not inform the network or storage plugins to clean up the allocated resources.
This PR removes Transactions during VM allocation which results in the allocated VM and its resource records being persisted in DB even during failures. When failure is encountered VM is moved to Error state. This helps VM and its resources to be properly deallocated when it is expunged either by a server task such as ExpungeTask or during manual expunge.
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
Fixes case of account/domain having negative storage count when there is a volume size difference while deploying volumes on certain stroages. In case of Powerflex volume size results in a multiple of 8. If user deploys a volume of 12GB it will result in 16GB in size. But currently CloudStack will not update resource count after deployment.
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* Live storage migration of volume in scaleIO within same storage scaleio cluster
* Added migrate command
* Recent changes of migration across clusters
* Fixed uuid
* recent changes
* Pivot changes
* working blockcopy api in libvirt
* Checking block copy status
* Formatting code
* Fixed failures
* code refactoring and some changes
* Removed unused methods
* removed unused imports
* Unit tests to check if volume belongs to same or different storage scaleio cluster
* Unit tests for volume livemigration in ScaleIOPrimaryDataStoreDriver
* Fixed offline volume migration case and allowed encrypted volume migration
* Added more integration tests
* Support for migration of encrypted volumes across different scaleio clusters
* Fix UI notifications for migrate volume
* Data volume offline migration: save encryption details to destination volume entry
* Offline storage migration for scaleio encrypted volumes
* Allow multiple Volumes to be migrated with migrateVirtualMachineWithVolume API
* Removed unused unittests
* Removed duplicate keys in migrate volume vue file
* Fix Unit tests
* Add volume secrets if does not exists during volume migrations. secrets are getting cleared on package upgrades.
* Fix secret UUID for encrypted volume migration
* Added a null check for secret before removing
* Added more unit tests
* Fixed passphrase check
* Add image options to the encypted volume conversion