* CKS: retry if unable to drain node or unable to upgrade k8s node
I tried CKS upgrade 16 times, 11 of 16 upgrades succeeded.
2 of 16 upgrades failed due to
```
error: unable to drain node "testcluster-of7974-node-18c8c33c2c3" due to error:[error when evicting pods/"cloud-controller-manager-5b8fc87665-5nwlh" -n "kube-system": Post "https://10.0.66.18:6443/api/v1/namespaces/kube-system/pods/cloud-controller-manager-5b8fc87665-5nwlh/eviction": unexpected EOF, error when evicting pods/"coredns-5d78c9869d-h5nkz" -n "kube-system": Post "https://10.0.66.18:6443/api/v1/namespaces/kube-system/pods/coredns-5d78c9869d-h5nkz/eviction": unexpected EOF], continuing command...
```
3 of 16 upgrades failed due to
```
Error from server: error when retrieving current configuration of:
Resource: "rbac.authorization.k8s.io/v1, Resource=roles", GroupVersionKind: "rbac.authorization.k8s.io/v1, Kind=Role"
Name: "kubernetes-dashboard", Namespace: "kubernetes-dashboard"
from server for: "/mnt/k8sdisk//dashboard.yaml": etcdserver: leader changed
```
* CKS: remove tests of creating/deleting HA clusters as they are covered by the upgrade test
* Update PR 8402 as suggested
* test: remove CKS cluster if fail to create or verify
* veeam: detach only the restored volume during backup restore
Steps to reproduce the issue
1. create a VM (A) with ROOT and DATA disk
2. assign to a backup offering
3. create backup
4. create another VM (B)
5. restore the DATA disk of VM A, and attach to VM B
6. When operation is done, check the datastore
Without this change, the ROOT image is not removed and left over on the datastore.
```
[root@ref-trl-5933-v-Mr8-wei-zhou-esxi2:/vmfs/volumes/5f60667d-18d828eb] ls -l /vmfs/volumes/5f60667d-18d828eb/CS-RSTR-dfb6f21c-a941-49db-9963-4f0286a17dac
total 1784840
-rw------- 1 root root 5242880000 Jan 24 09:23 ROOT-722_2-flat.vmdk
-rw------- 1 root root 499 Jan 24 09:23 ROOT-722_2.vmdk
```
With this change, the whole temporary vm has been destroyed.
```
[root@ref-trl-5933-v-Mr8-wei-zhou-esxi2:/vmfs/volumes/5f60667d-18d828eb] ls -l /vmfs/volumes/5f60667d-18d828eb/CS-RSTR-734bee3b-640c-4ff0-a34b-bc45358565b2
ls: /vmfs/volumes/5f60667d-18d828eb/CS-RSTR-734bee3b-640c-4ff0-a34b-bc45358565b2: No such file or directory
```
* veeam: fix wrong disk size in debug message
* veeam: sync backup repository after operations are done
got exception of some operations which succeeds due to the following error
```
2024-01-19 10:59:52,846 DEBUG [o.a.c.b.v.VeeamClient] (API-Job-Executor-42:ctx-716501bb job-4373 ctx-2359b76d) (logid:b5e19a17) Veeam response for PowerShell commands [PowerShell Import-Module Veeam.Backup.PowerShell -WarningAction SilentlyContinue;$restorePoint = Get-VBRRestorePoint ^| Where-Object { $_.Id -eq '1d99106a-b5c8-4a1e-958d-066a987caa5f' };if ($restorePoint) { Remove-VBRRestorePoint -Oib $restorePoint -Confirm:$false;$repo = Get-VBRBackupRepository;Sync-VBRBackupRepository -Repository $repo;} else { ; Write-Output 'Failed to delete'; Exit 1;}] is: [^M
Restore Type Job Name State Start Time End Time Description ^M
------------ -------- ----- ---------- -------- ----------- ^M
ConfResynchronize Configuration Dat... Starting 19/01/2024 10:59:52 01/01/1900 00:00:00 ^M
^M
^M
Remove-VBRRestorePoint : Win32 internal error "Access is denied" 0x5 occurred while reading the console output buffer. ^M
Contact Microsoft Customer Support Services.^M
At line:1 char:196^M
+ ... orePoint) { Remove-VBRRestorePoint -Oib $restorePoint -Confirm:$false ...^M
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^M
+ CategoryInfo : ReadError: (:) [Remove-VBRRestorePoint], HostException^M
+ FullyQualifiedErrorId : ReadConsoleOutput,Veeam.Backup.PowerShell.Cmdlets.RemoveVBRRestorePoint^M
^M
].
```
* veeam: fix unable to detach volume when restore backup and attach to vm then detach the volume
It also happened when destroy the original or backup VM
```
2024-01-24 10:10:03,401 ERROR [c.c.s.r.VmwareStorageProcessor] (DirectAgent-74:ctx-95b24ac7 10.0.35.53, job-25995/job-25996, cmd: DettachCommand) (logid:7260ffb8) Failed to detach volume!
java.lang.RuntimeException: Unable to access file [de52fdd3386b3d67b27b3960ecdb08f4] i-2-723-VM/7c2197c129464035bab062edec536a09-flat.vmdk
at com.cloud.hypervisor.vmware.util.VmwareClient.waitForTask(VmwareClient.java:426)
at com.cloud.hypervisor.vmware.mo.DatastoreMO.moveDatastoreFile(DatastoreMO.java:290)
at com.cloud.storage.resource.VmwareStorageLayoutHelper.syncVolumeToRootFolder(VmwareStorageLayoutHelper.java:241)
at com.cloud.storage.resource.VmwareStorageProcessor.attachVolume(VmwareStorageProcessor.java:2150)
at com.cloud.storage.resource.VmwareStorageProcessor.dettachVolume(VmwareStorageProcessor.java:2408)
at com.cloud.storage.resource.StorageSubsystemCommandHandlerBase.execute(StorageSubsystemCommandHandlerBase.java:174)
at com.cloud.storage.resource.StorageSubsystemCommandHandlerBase.handleStorageCommands(StorageSubsystemCommandHandlerBase.java:71)
at com.cloud.hypervisor.vmware.resource.VmwareResource.executeRequest(VmwareResource.java:589)
at com.cloud.agent.manager.DirectAgentAttache$Task.runInContext(DirectAgentAttache.java:315)
at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:48)
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55)
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:102)
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52)
at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:45)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
2024-01-24 10:10:03,402 INFO [c.c.h.v.u.VmwareHelper] (DirectAgent-74:ctx-95b24ac7 10.0.35.53, job-25995/job-25996, cmd: DettachCommand) (logid:7260ffb8) [ignored]failed to get message for exception: Unable to access file [de52fdd3386b3d67b27b3960ecdb08f4] i-2-723-VM/7c2197c129464035bab062edec536a09-flat.vmdk
```
* vmware: create restored volume with new UUID and attach to VM
This PR fixes several issues in the testing of Veeam 11 and Veeam12
- Import Veeam.Backup.PowerShell and silently ignore the warning messages
- Fix issue when assign vm to backup offerings, which caused by separator (\r\n)
- Fix authorization failure in veeam 12a, which is because v1_4 is not supported in veeam 12a any more
- Fix exception if backup name has space
- Fix backup metrics in veeam12, which is because powershell command does not return the values needed
- Fix Incorrect datetime value, which is because powershell command returns a datetime which is not supported in Java
- Fix issue during backup restoration if VM has both ROOT and DATA disks.
This PR also has the following update
- Add integration test test/integration/smoke/test_backup_recovery_veeam.py
- Make some UI changes
- Add zone setting backup.plugin.veeam.version. If it is not set, CloudStack will get veeam version via powershell commands.
- Add zone setting backup.plugin.veeam.task.poll.interval and backup.plugin.veeam.task.poll.max.retry
1. Problem description
In Apache CloudStack (ACS), when a VM is deployed in a host with the KVM hypervisor, an XML file is created in the assigned host, which has a property shares that defines the weight of the VM to access the host CPU. The value of this property has no unit, and it is a relative measure to calculate how much CPU a given VM will have in the host. However, this value has a limit, which depends on the version of cgroup utilized by the host's kernel. The problem lies at the range value of shares that varies between both versions: [2, 264144] for cgroups version 1; and [1, 10000] for cgroups version 2. Currently, ACS calculates the value of shares using Equation 1, presented below, where CPU is the number of cores and speed is the CPU frequency; both specified in the VM's compute offering. Therefore, if a compute offering has, for example, 6 cores at 2 GHz, the shares value will be 12000 and an exception will be thrown by libvirt if the host utilizes cgroup v2. The second version is becoming the default one in current Linux distributions; thus, it is necessary to address this limitation.
Equation 1
shares = CPU * speed
Fixes: #6744
2. Proposed changes
To address the problem described, we propose to apply a scale conversion considering the max shares of the host. Using the same formula currently utilized by ACS, it is possible to calculate the maximum shares of a VM for a given host. In other words, using the number of cores and the nominal speed of the host's CPU as the upper limit of shares allowed to a VM. Then, this value will be scaled to the allowed interval of [1, 10000] of cgroup v2 by using a linear scale conversion.
The VM shares would be calculated as Equation 2, presented below, where VM requested shares is the requested shares value calculated using Equation 1, cgroup upper limit is fixed with a value of 10000 (cgroups v2 upper limit), and host max shares is the maximum shares value of the host, calculated using Equation 1. Using Equation 2, the only case where a VM passes the cgroup v2 limit is when the user requests more resources than the host has, which is not possible with the current implementation of ACS.
Equation 2
shares = (VM requested shares * cgroup upper limit)/host max shares
To implement the proposal, the following APIs will be updated: deployVirtualMachine, migrateVirtualMachine and scaleVirtualMachine. When a VM is being deployed, a new verification will be added to find a suitable host. The max shares of each host will be calculated, and the VM calculated shares will be verified if it does not surpass the host's value. Likewise, the migration of VMs will have a similar new verification. Lastly, the scale of VMs will also have the same verification for the VM's host.
To determine the max shares of a given host, we will use the same equation currently used in ACS for calculating the shares of VMs, presented in Section 1. When Equation 1 is used to determine the maximum shares of a host, CPU is the number of cores of the host, and speed is the nominal CPU speed, i.e., considering the CPU's base frequency.
It is important to note that these changes are only for hosts with the KVM hypervisor using cgroup v2 for now.
This PR fixes the test failures with CKS HA-cluster upgrade.
In production, the CKS HA cluster should have at least 3 control VMs as well.
The etcd cluster requires 3 members to achieve reliable HA. The etcd daemon in control VMs uses RAFT protocol to determine the roles of nodes. During upgrade of CKS with HA, the etcd become unreliable if there are only 2 control VMs.
Making a diskful resource was meant as an optimization,
but cannot work on non hyperconverged setups,
as the storage nodes (diskful) are not part of the cloudstack cluster.
A TODO was overseen and never implemented,
which could trigger the following bug:
If Linstor didn't create a resource (diskless or diskfull) on
the cloudstack choosen node, it would not be able to copy the
template data there, it even seems no error was
triggered and the new template file silently just became
empty/corrupt.
This fixes the following cases in which Solidfire storage integration
caused issues when using Solidfire datadisks with VMware:
1. Take Volume Snapshot of Solidfire data disk
2. Delete an active Instance with Solidfire data disk attached
3. Attach used existing Solidfire data disk to a running/stopped VM
4. Stop and Start an instance with Solidfire data disks attached
5. Expand disk by resizing Solidfire data disk by providing size
6. Expand disk by changing disk offering for the Solidfire data disk
Additional changes:
- Use VMFS6 as managed datastore type if the host supports
- Refactor detection and splitting of managed storage ds name in storage
processor
- Restrict storage rescanning for managed datastore when resizing
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
* VMware: add support for 8.0b (8.0.0.2)
* VMware 8: add new guest os mappings in VirtualMachineGuestOsIdentifier
The full list can be found at https://developer.vmware.com/apis/1355/vsphere
* VMware: get guest os mappings of parent version
* VMware8: remove guest os mappings for 8.0.0.2
* VMware8: fix code smells
* vmware: remove annotations in VmwareVmImplementerTest which caused 0.0% code coverage
* VMware8: add a unit test case
* VMware: add support for 8.0c (8.0.0.3)
* VMware8: move to CloudStackVersion.getVMwareParentVersion
* VMware: add support for 8.0u1 (8.0.1.0)
* Copy engine/schema/src/main/java/com/cloud/upgrade/GuestOsMapper.java from PR 6979
* Copy engine/schema/src/main/java/com/cloud/storage/dao/GuestOSHypervisorDao.java from PR 6979
* VMware: ignore the last number in VMware versions
* VMware: copy guest os mapping from 8.0 to 8.0.1
* VMware: add unit tests in VmwareVmImplementerTest.java
* Copy engine/schema/src/test/java/com/cloud/upgrade/GuestOsMapperTest.java from PR 6979
* VMware8: retry vm poweron if fails due to exception "File system specific implementation of Ioctl[file] failed"
This fixes a weird issue on vmware8. When power on a vm, sometimes it fails due to error
2023-04-27 07:04:43,207 ERROR [c.c.h.v.r.VmwareResource] (DirectAgent-442:ctx-cdd42b03 10.0.32.133, job-105/job-106, cmd: StartCommand) (logid:8a24a607) StartCommand failed due to [Exception: java.lang.RuntimeException
Message: File system specific implementation of Ioctl[file] failed
].
java.lang.RuntimeException: File system specific implementation of Ioctl[file] failed
at com.cloud.hypervisor.vmware.util.VmwareClient.waitForTask(VmwareClient.java:426)
at com.cloud.hypervisor.vmware.mo.VirtualMachineMO.powerOn(VirtualMachineMO.java:288)
in vmware.log on ESXi host, it shows
2023-04-27T09:20:41.713Z In(05)+ vmx - Power on failure messages: File system specific implementation of Ioctl[file] failed
2023-04-27T09:20:41.713Z In(05)+ vmx - File system specific implementation of Ioctl[file] failed
2023-04-27T09:20:41.713Z In(05)+ vmx - File system specific implementation of LookupAndOpen[file] failed
2023-04-27T09:20:41.713Z In(05)+ vmx - File system specific implementation of LookupAndOpen[file] failed
2023-04-27T09:20:41.713Z In(05)+ vmx - File system specific implementation of LookupAndOpen[file] failed
2023-04-27T09:20:41.713Z In(05)+ vmx - File system specific implementation of LookupAndOpen[file] failed
2023-04-27T09:20:41.713Z In(05)+ vmx - File system specific implementation of LookupAndOpen[file] failed
2023-04-27T09:20:41.713Z In(05)+ vmx - File system specific implementation of LookupAndOpen[file] failed
2023-04-27T09:20:41.713Z In(05)+ vmx - File system specific implementation of LookupAndOpen[file] failed
2023-04-27T09:20:41.713Z In(05)+ vmx - File system specific implementation of LookupAndOpen[file] failed
2023-04-27T09:20:41.713Z In(05)+ vmx - File system specific implementation of LookupAndOpen[file] failed
2023-04-27T09:20:41.713Z In(05)+ vmx - File system specific implementation of LookupAndOpen[file] failed
2023-04-27T09:20:41.713Z In(05)+ vmx - File system specific implementation of LookupAndOpen[file] failed
2023-04-27T09:20:41.713Z In(05)+ vmx - File system specific implementation of LookupAndOpen[file] failed
2023-04-27T09:20:41.713Z In(05)+ vmx - File system specific implementation of Ioctl[file] failed
2023-04-27T09:20:41.713Z In(05)+ vmx - File system specific implementation of Ioctl[file] failed
2023-04-27T09:20:41.713Z In(05)+ vmx - Failed to lock the file
2023-04-27T09:20:41.713Z In(05)+ vmx - Cannot open the disk '/vmfs/volumes/7b29c876-ac102328/i-2-167-VM/ROOT-167.vmdk' or one of the snapshot disks it depends on.
2023-04-27T09:20:41.713Z In(05)+ vmx - Module 'Disk' power on failed.
2023-04-27T09:20:41.713Z In(05)+ vmx - Failed to start the virtual machine.
There is a KB article for it, but I still do not know why and how to fix it.
https://kb.vmware.com/s/article/1004232
* VMware: extract to method powerOnVM
* vmware: fix mistake in logs
* vmware8: use curl instead of wget to fix test failures
Traceback (most recent call last):
File "/root/test_internal_lb.py", line 555, in test_01_internallb_roundrobin_1VPC_3VM_HTTP_port80
self.execute_internallb_roundrobin_tests(vpc_offering)
File "/root/test_internal_lb.py", line 641, in execute_internallb_roundrobin_tests
client_vm, applb.sourceipaddress, max_http_requests)
File "/root/test_internal_lb.py", line 497, in run_ssh_test_accross_hosts
(e, clienthost.public_ip))
AssertionError: list index out of range: SSH failed for VM with IP Address: 10.0.52.187
and
sshClient: DEBUG: {Cmd: /usr/bin/wget -T3 -qO- --user=admin --password=password http://10.1.2.253:8081/admin?stats via Host: 10.0.52.188} {returns: ["/usr/bin/wget: '/usr/lib/libpcre.so.1' is not an ELF file", "/usr/bin/wget: can't load library 'libpcre.so.1'"]}
* VMware: correct guest OS names in hypervisor mappings for VMware 8.0
el9 and variants were introduced by https://github.com/apache/cloudstack/pull/7059
they are supported with guest os identifiers since VMware 8.0
see https://vdc-repo.vmware.com/vmwb-repository/dcr-public/c476b64b-c93c-4b21-9d76-be14da0148f9/04ca12ad-59b9-4e1c-8232-fd3d4276e52c/SDK/vsphere-ws/docs/ReferenceGuide/vim.vm.GuestOsDescriptor.GuestOsIdentifier.html
* VMware: add Ubuntu 20.04 and 22.04 support for vmware 7.0+
* PR7380: only add guest os mappings for Ubuntu 20.04
* PR7380: Correct RHEL9 guest os names and others for VMware 8.0
* PR7380: correct guest os names on 8.0.0.1 as well
* PR7380: remove Windows 12 and Windows Server 2025 which are not released yet
steps to reproduce the issue:
- git clone https://github.com/apache/cloudstack.git
- cd cloudstack
- rm -rf .git/
- run `mvn -P developer,systemvm clean install`
Without this PR, it fails with error
```
> [ 8/10] RUN mvn -Pdeveloper -Dsimulator -DskipTests clean install:
668.1 [ERROR] Failed to execute goal pl.project13.maven:git-commit-id-plugin:4.9.10:revision (get-the-git-infos) on project cloud-plugin-storage-volume-storpool: .git directory is not found! Please specify a valid [dotGitDirectory] in your pom.xml -> [Help 1]
```
* Removed the hardcoded StorPool endpoint from tests
- removed the hardcoded enpoint of StorPool primary storage from tests
- added the git commit information into the maven build
* Convert indents to spaces
* update git-commit-id-plugin version