This pull request (PR) implements a Distributed Resource Scheduler (DRS) for a CloudStack cluster. The primary objective of this feature is to enable automatic resource optimization and workload balancing within the cluster by live migrating the VMs as per configuration.
Administrators can also execute DRS manually for a cluster, using the UI or the API.
Adds support for two algorithms - condensed & balanced. Algorithms are pluggable allowing ACS Administrators to have customized control over scheduling.
Implementation
There are three top level components:
Scheduler
A timer task which:
Generate DRS plan for clusters
Process DRS plan
Remove old DRS plan records
DRS Execution
We go through each VM in the cluster and use the specified algorithm to check if DRS is required and to calculate cost, benefit & improvement of migrating that VM to another host in the cluster. On the basis of cost, benefit & improvement, the best migration is selected for the current iteration and the VM is migrated. The maximum number of iterations (live migrations) possible on the cluster is defined by drs.iterations which is defined as a percentage (as a value between 0 and 1) of total number of workloads.
Algorithm
Every algorithms implements two methods:
needsDrs - to check if drs is required for cluster
getMetrics - to calculate cost, benefit & improvement of a migrating a VM to another host.
Algorithms
Condensed - Packs all the VMs on minimum number of hosts in the cluster.
Balanced - Distributes the VMs evenly across hosts in the cluster.
Algorithms use drs.level to decide the amount of imbalance to allow in the cluster.
APIs Added
listClusterDrsPlan
id - ID of the DRS plan to list
clusterid - to list plans for a cluster id
generateClusterDrsPlan
id - cluster id
iterations - The maximum number of iterations in a DRS job defined as a percentage (as a value between 0 and 1) of total number of workloads. Defaults to value of cluster's drs.iterations setting.
executeClusterDrsPlan
id - ID of the cluster for which DRS plan is to be executed.
migrateto - This parameter specifies the mapping between a vm and a host to migrate that VM. Format of this parameter: migrateto[vm-index].vm=<uuid>&migrateto[vm-index].host=<uuid>.
Config Keys Added
ClusterDrsPlanExpireInterval
Key drs.plan.expire.interval
Scope Global
Default Value 30 days
Description The interval in days after which old DRS records will be cleaned up.
ClusterDrsEnabled
Key drs.automatic.enable
Scope Cluster
Default Value false
Description Enable/disable automatic DRS on a cluster.
ClusterDrsInterval
Key drs.automatic.interval
Scope Cluster
Default Value 60 minutes
Description The interval in minutes after which a periodic background thread will schedule DRS for a cluster.
ClusterDrsIterations
Key drs.max.migrations
Scope Cluster
Default Value 50
Description Maximum number of live migrations in a DRS execution.
ClusterDrsAlgorithm
Key drs.algorithm
Scope Cluster
Default Value condensed
Description DRS algorithm to execute on the cluster. This PR implements two algorithms - balanced & condensed.
ClusterDrsLevel
Key drs.imbalance
Scope Cluster
Default Value 0.5
Description Percentage (as a value between 0.0 and 1.0) of imbalance allowed in the cluster. 1.0 means no imbalance
is allowed and 0.0 means imbalance is allowed.
ClusterDrsMetric
Key drs.imbalance.metric
Scope Cluster
Default Value memory
Description The cluster imbalance metric to use when checking the drs.imbalance.threshold. Possible values are memory and cpu.
This PR adds new functionality to copy snapshots across zones and take snapshots for multiple zones.
Copy functionality is similar to template copy. The source zone acts as the web server from where the destination zone(s) can download the snapshot files. For this purpose, a new API - `copySnapshot` has been added. The response for copySnapshot will be returning zone and download details from the first destination zone of the request. This behaviour is similar to the `copyTemplate` API.
In a similar manner, multiple zones can be selected while taking the snapshots or creating snapshot policies. For this snapshot will be taken in the base zone(in which volume is present) and then copied to the additional zones. A new parameter - `zoneids` has been added to `createSnapshot` and `createSnapshotPolicy` APIs.
As snapshots can be present on multiple zones (secondary stores), a new parameter `zoneid` has been added to delete the snapshot copy on a specific zone.
`listSnapshots` API has been updated to allow listing snapshot entries for different zones/datastores. New parameters - `showUnique`, `locationType` have been added.
Events generated during snapshot operations will now be linked to the snapshot itself rather than the volume of the snapshot.
`listSnapshotPolicies` and `createSnapshotPolicy` APIs will return zone details of the zones in which backup will be scheduled for the policy.
----
New API added
`copySnapshot`
Request and response params updated for APIs
```
- listSnapshots
- deleteSnapshot
- createTemplate
- listZones
- listSnapshotPolicies
- createSnapshotPolicy
```
UI updated for
- Snapshot detail view
- Create snapshot form
- Create snapshot policy form
- Create volume (from snapshot) form
- Create template (from snapshot) form
Doc PR: https://github.com/apache/cloudstack-documentation/pull/344
PR: https://github.com/apache/cloudstack/pull/7873
* marvin,test: fix directdownload template checksum test
During failure while deploying a VM with wrong checksum template, VM may be left in Error state. This PR adds code to delete such VM.
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* remove unnecessary logs
---------
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
Fixes#8034
Adds the following test for a backed-up snapshot (original template and VM deleted beforehand):
- Create volume from snapshot
- Create a template from the snapshot and deploy a VM using it
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
This PR fixes few issues:
- check ip range of new network instead of network cidr, so that the two networks can use same cidr but no IP conflicts.
- Private gateways: return vlan number only for root admins
- when update isolated network, check new guest vm cidr and IPs of neworks/vpc gateways associated to it
* Don't allow inadvertent deletion of hidden details via API
* Update VM details unit test ensuring system/hidden details not removed
* Update test/integration/component/test_update_vm.py
---------
Co-authored-by: Marcus Sorensen <mls@apple.com>
Co-authored-by: dahn <daan.hoogland@gmail.com>
* VMware: add support for 8.0b (8.0.0.2)
* VMware 8: add new guest os mappings in VirtualMachineGuestOsIdentifier
The full list can be found at https://developer.vmware.com/apis/1355/vsphere
* VMware: get guest os mappings of parent version
* VMware8: remove guest os mappings for 8.0.0.2
* VMware8: fix code smells
* vmware: remove annotations in VmwareVmImplementerTest which caused 0.0% code coverage
* VMware8: add a unit test case
* VMware: add support for 8.0c (8.0.0.3)
* VMware8: move to CloudStackVersion.getVMwareParentVersion
* VMware: add support for 8.0u1 (8.0.1.0)
* Copy engine/schema/src/main/java/com/cloud/upgrade/GuestOsMapper.java from PR 6979
* Copy engine/schema/src/main/java/com/cloud/storage/dao/GuestOSHypervisorDao.java from PR 6979
* VMware: ignore the last number in VMware versions
* VMware: copy guest os mapping from 8.0 to 8.0.1
* VMware: add unit tests in VmwareVmImplementerTest.java
* Copy engine/schema/src/test/java/com/cloud/upgrade/GuestOsMapperTest.java from PR 6979
* VMware8: retry vm poweron if fails due to exception "File system specific implementation of Ioctl[file] failed"
This fixes a weird issue on vmware8. When power on a vm, sometimes it fails due to error
2023-04-27 07:04:43,207 ERROR [c.c.h.v.r.VmwareResource] (DirectAgent-442:ctx-cdd42b03 10.0.32.133, job-105/job-106, cmd: StartCommand) (logid:8a24a607) StartCommand failed due to [Exception: java.lang.RuntimeException
Message: File system specific implementation of Ioctl[file] failed
].
java.lang.RuntimeException: File system specific implementation of Ioctl[file] failed
at com.cloud.hypervisor.vmware.util.VmwareClient.waitForTask(VmwareClient.java:426)
at com.cloud.hypervisor.vmware.mo.VirtualMachineMO.powerOn(VirtualMachineMO.java:288)
in vmware.log on ESXi host, it shows
2023-04-27T09:20:41.713Z In(05)+ vmx - Power on failure messages: File system specific implementation of Ioctl[file] failed
2023-04-27T09:20:41.713Z In(05)+ vmx - File system specific implementation of Ioctl[file] failed
2023-04-27T09:20:41.713Z In(05)+ vmx - File system specific implementation of LookupAndOpen[file] failed
2023-04-27T09:20:41.713Z In(05)+ vmx - File system specific implementation of LookupAndOpen[file] failed
2023-04-27T09:20:41.713Z In(05)+ vmx - File system specific implementation of LookupAndOpen[file] failed
2023-04-27T09:20:41.713Z In(05)+ vmx - File system specific implementation of LookupAndOpen[file] failed
2023-04-27T09:20:41.713Z In(05)+ vmx - File system specific implementation of LookupAndOpen[file] failed
2023-04-27T09:20:41.713Z In(05)+ vmx - File system specific implementation of LookupAndOpen[file] failed
2023-04-27T09:20:41.713Z In(05)+ vmx - File system specific implementation of LookupAndOpen[file] failed
2023-04-27T09:20:41.713Z In(05)+ vmx - File system specific implementation of LookupAndOpen[file] failed
2023-04-27T09:20:41.713Z In(05)+ vmx - File system specific implementation of LookupAndOpen[file] failed
2023-04-27T09:20:41.713Z In(05)+ vmx - File system specific implementation of LookupAndOpen[file] failed
2023-04-27T09:20:41.713Z In(05)+ vmx - File system specific implementation of LookupAndOpen[file] failed
2023-04-27T09:20:41.713Z In(05)+ vmx - File system specific implementation of LookupAndOpen[file] failed
2023-04-27T09:20:41.713Z In(05)+ vmx - File system specific implementation of Ioctl[file] failed
2023-04-27T09:20:41.713Z In(05)+ vmx - File system specific implementation of Ioctl[file] failed
2023-04-27T09:20:41.713Z In(05)+ vmx - Failed to lock the file
2023-04-27T09:20:41.713Z In(05)+ vmx - Cannot open the disk '/vmfs/volumes/7b29c876-ac102328/i-2-167-VM/ROOT-167.vmdk' or one of the snapshot disks it depends on.
2023-04-27T09:20:41.713Z In(05)+ vmx - Module 'Disk' power on failed.
2023-04-27T09:20:41.713Z In(05)+ vmx - Failed to start the virtual machine.
There is a KB article for it, but I still do not know why and how to fix it.
https://kb.vmware.com/s/article/1004232
* VMware: extract to method powerOnVM
* vmware: fix mistake in logs
* vmware8: use curl instead of wget to fix test failures
Traceback (most recent call last):
File "/root/test_internal_lb.py", line 555, in test_01_internallb_roundrobin_1VPC_3VM_HTTP_port80
self.execute_internallb_roundrobin_tests(vpc_offering)
File "/root/test_internal_lb.py", line 641, in execute_internallb_roundrobin_tests
client_vm, applb.sourceipaddress, max_http_requests)
File "/root/test_internal_lb.py", line 497, in run_ssh_test_accross_hosts
(e, clienthost.public_ip))
AssertionError: list index out of range: SSH failed for VM with IP Address: 10.0.52.187
and
sshClient: DEBUG: {Cmd: /usr/bin/wget -T3 -qO- --user=admin --password=password http://10.1.2.253:8081/admin?stats via Host: 10.0.52.188} {returns: ["/usr/bin/wget: '/usr/lib/libpcre.so.1' is not an ELF file", "/usr/bin/wget: can't load library 'libpcre.so.1'"]}
* VMware: correct guest OS names in hypervisor mappings for VMware 8.0
el9 and variants were introduced by https://github.com/apache/cloudstack/pull/7059
they are supported with guest os identifiers since VMware 8.0
see https://vdc-repo.vmware.com/vmwb-repository/dcr-public/c476b64b-c93c-4b21-9d76-be14da0148f9/04ca12ad-59b9-4e1c-8232-fd3d4276e52c/SDK/vsphere-ws/docs/ReferenceGuide/vim.vm.GuestOsDescriptor.GuestOsIdentifier.html
* VMware: add Ubuntu 20.04 and 22.04 support for vmware 7.0+
* PR7380: only add guest os mappings for Ubuntu 20.04
* PR7380: Correct RHEL9 guest os names and others for VMware 8.0
* PR7380: correct guest os names on 8.0.0.1 as well
* PR7380: remove Windows 12 and Windows Server 2025 which are not released yet
* 4.18:
server: remove registered userdata when cleanup an account (#7777)
server: Use max secondary storage defined on the account during upload (#7441)
test: upgrade kubernetes versions to 1.25.0/1.26.0 (#7685)
kvm: Added VNI Devices as normal bridge slave devs (#7836)
noVNC: fix JP keyboard on vmware7+ which uses websocket URL (#7694)
* 4.18:
UI: allow new keys for VM details (#7793)
Refactoring StorPool's smoke tests (#7392)
UI: decode userdata in EditVM dialog (#7796)
packaging: unalias cp before package upgrade (#7722)
make NoopDbUpgrade do a systemvm template check (#7564)
UI unit test: fix expected values (#7792)
* Removed the hardcoded StorPool endpoint from tests
- removed the hardcoded enpoint of StorPool primary storage from tests
- added the git commit information into the maven build
* Convert indents to spaces
* update git-commit-id-plugin version
Fixes case of appending userdata when both template and vm data are either shellscript or cloudconfig
Fixes error when appending gzip userdata
Fixes case when userdata manual text from VM is not getting decoded-encoded correctly.
Fixes case of appending multipart data when both template and vm data contain same format types.
Refactor - moved validateUserData method to UserDataManager class
Refactor userdata test to check resultant multipart userdata thoroughly
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
There are tools like cluster-api which create and manage kubernetes cluster on CloudStack. This PR adds the option to add unmanaged kubernetes cluster which are not managed by CKS plugin. This helps provide a consolidated view of unmanaged clusters on CloudStack. The changes done make sure that operations for managed clusters are not executed for unmanaged clusters.
Two new APIs have also been added:
1. addVirtualMachinesToKubernetesCluster - to add VMs to unmanaged clusters.
2. removeVirtualMachinesFromKubernetesCluster - to remove VMs to unmanaged clusters.
Two APIs have been updated:
1. createKubernetesCluster - made KUBERNETES_VERSION_ID, SERVICE_OFFERING_ID, SIZE as not required for unmanaged clusters. Add an additional parameter, managed, which is true by default.
2. listKubernetesClusters - Add a parameter managed to filter on managed field.
Co-authored-by: Pearl Dsilva <pearl1594@gmail.com>
Co-authored-by: dahn <daan.hoogland@gmail.com>
* Guest OS mapping improvements
- Checks the OS mapping name in hypervisor (VMware, XenServer)
- Displays guest OS mappings in UI
* Added API getHypervisorGuestOsNames to list the guest OS names in the hypervisor, and code improvements
* Some static analysis fixes
* Removed commented code in listview
* Guest OS list
* UI changes for adding guest os and mappings
* Added guest os mappings in guest os form
* Added new filter to guest os mapping
* Name and description changes
* VMWare Host and cluster MO unit tests
* CheckGuestOsMapping command and answer unit tests
* GetHypervisorGuestOsNames command and answer unit tests
* VmwareResource unitests
* GuestOsMapper unittests
* icon changes
* Addressed review comments
* Renaming fixes
* Removed comments
* marvin tests for guest os operations
* Added marvin tests for OS mappings
* Document links and UI improvements
* Added deduplication for the list guest OS API
* Fixed linter failure
* Few bug fixes and UI changes
* Few improvements
* Addressed code smells
* Fixed UI issues after rebase
---------
Co-authored-by: Suresh Kumar Anaparti <sureshkumar.anaparti@gmail.com>
Co-authored-by: Harikrishna Patnala <harikrishna.patnala@gmail.com>