This PR fixes bug introduced in #8502. Timeout for script execution was set to 60 ms instead of 60s which resulted in host not getting UEFI enabled. This is a blocker for 4.19 release.
We do this by introducing a new agent parameter `agent.script.timeout` (default - 60 seconds) to use as a timeout for the script checking host's UEFI status.
We also externalize the timeout for the ReadyCommand by introducing a new global setting `ready.command.wait` (default - 60 seconds).
For ModifyStoragePoolCommand, we don't externalize the timeout to avoid confusion for the user. Since, the required timeout can vary depending on the provider in use and we are only setting the wait for default host listener for now. Instead, we reuse the global `wait` setting by dividing it by `5` making the default value of 6 minutes (1800/5 = 360s) for ModifyStoragePoolCommand.
Note: the actual time, the MS waits is twice the wait set for a Command. Check reference code below.
19250403e6/engine/orchestration/src/main/java/com/cloud/agent/manager/AgentAttache.java (L406-L442)
This PR fixes moves resources stuck in transition state during async job cleanup
Problem:
During maintenance of the management server, other servers in the cluster or the same server after a restart initiate async job cleanup. However, this process leaves resources in a transitional state. The only recovery option currently available is to make direct database changes.
Solution:
This PR introduces a resolution by changing Volume, Virtual Machine, and Network resources from their transitional states. This adjustment enables the reattempt of failed operations without the need for manual database modifications.
This PR marks the multipath scripts as executable.
This fixes the issue that in 4.19.0.0-RC2, vms can not be stopped in ubuntu hosts.
2024-01-17 12:56:26,061 ERROR [c.c.v.VmWorkJobHandlerProxy] (Work-Job-Executor-4:ctx-e3503563 job-38/job-39 ctx-42706275) (logid:81ede4e9) Invocation exception, caused by: com.cloud.utils.exception.CloudRuntimeException: Unable to stop the virtual machine due to java.lang.NullPointerException
at com.cloud.utils.script.Script.getExitValue(Script.java:74)
at com.cloud.hypervisor.kvm.storage.MultipathSCSIAdapterBase.runScript(MultipathSCSIAdapterBase.java:476)
at com.cloud.hypervisor.kvm.storage.MultipathSCSIAdapterBase.disconnectPhysicalDiskByPath(MultipathSCSIAdapterBase.java:226)
at com.cloud.hypervisor.kvm.storage.KVMStoragePoolManager.disconnectPhysicalDiskByPath(KVMStoragePoolManager.java:205)
at com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.cleanupDisk(LibvirtComputingResource.java:3335)
at com.cloud.hypervisor.kvm.resource.wrapper.LibvirtStopCommandWrapper.execute(LibvirtStopCommandWrapper.java:101)
at com.cloud.hypervisor.kvm.resource.wrapper.LibvirtStopCommandWrapper.execute(LibvirtStopCommandWrapper.java:49)
at com.cloud.hypervisor.kvm.resource.wrapper.LibvirtRequestWrapper.execute(LibvirtRequestWrapper.java:78)
at com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.executeRequest(LibvirtComputingResource.java:1903)
There are a lot of test failures due to test_vm_life_cycle.py in multiple PRs due to host not available for migration of VMs.
#8438 (comment)
#8433 (comment)
#7344 (comment)
While debugging I noticed that the hosts get stuck in Connecting state because MS is waiting for a response of the ReadyCommand from the agent. Since we take a lock on connection and disconnection, restarting the agent doesn't work. To fix this, we have to restart the MS or wait for ~1 hour (default timeout).
On the agent side, it gets stuck waiting for a response from the Script execution.
To reproduce, run smoke/test_vm_life_cycle.py (TestSecuredVmMigration test class to be specific). Once the tests are complete, you will notice that some hosts are stuck in Connecting state. And restarting the agent fails due to the named lock. Locks on DB can be checked using the below query.
SELECT *
FROM performance_schema.metadata_locks
INNER JOIN performance_schema.threads ON THREAD_ID = OWNER_THREAD_ID
WHERE PROCESSLIST_ID <> CONNECTION_ID() \G;
This PR adds a wait for the ready command and a timeout to the Script execution to ensure that the thread doesn't get stuck and the named lock from database is released.
This PR fixes a regression caused by #8465 on advanced zones, import fails with:
2024-01-10 12:13:33,234 DEBUG [o.a.c.e.o.NetworkOrchestrator] (API-Job-Executor-3:ctx-991bbe9f job-128 ctx-f49517d4) (logid:d7b8e716) Allocating nic for vm 142272e8-9e2e-407b-9d7e-e9a03b81653c in network Network {"id": 204, "name": "Isolated", "uuid": "9679fac5-e3ac-4694-a57b-beb635340f39", "networkofferingid": 10} during import
2024-01-10 12:13:33,239 ERROR [o.a.c.v.UnmanagedVMsManagerImpl] (API-Job-Executor-3:ctx-991bbe9f job-128 ctx-f49517d4) (logid:d7b8e716) Failed to import NICs while importing vm: i-2-31-VM
com.cloud.exception.InsufficientVirtualNetworkCapacityException: Unable to acquire Guest IP address for network Network {"id": 204, "name": "Isolated", "uuid": "9679fac5-e3ac-4694-a57b-beb635340f39", "networkofferingid": 10}Scope=interface com.cloud.dc.DataCenter; id=1
at org.apache.cloudstack.engine.orchestration.NetworkOrchestrator.importNic(NetworkOrchestrator.java:4582)
at org.apache.cloudstack.vm.UnmanagedVMsManagerImpl.importNic(UnmanagedVMsManagerImpl.java:859)
at org.apache.cloudstack.vm.UnmanagedVMsManagerImpl.importVirtualMachineInternal(UnmanagedVMsManagerImpl.java:1198)
at org.apache.cloudstack.vm.UnmanagedVMsManagerImpl.importUnmanagedInstanceFromHypervisor(UnmanagedVMsManagerImpl.java:1511)
at org.apache.cloudstack.vm.UnmanagedVMsManagerImpl.baseImportInstance(UnmanagedVMsManagerImpl.java:1342)
at org.apache.cloudstack.vm.UnmanagedVMsManagerImpl.importUnmanagedInstance(UnmanagedVMsManagerImpl.java:1282)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
Also, addresses the VNC password field set instead of a fixed string
This PR fixes reorder/list pools when cluster details are not set, while deploying vm / attaching volume.
Problem:
Attach volume to a VM fails, on infra with zone-wide pools & vm.allocation.algorithm=userdispersing as the cluster details are not set (passed as null) while reordering / listing pools by volumes.
Solution:
Ignore cluster details when not set, while reordering / listing pools by volumes.
This PR makes changes to use cluster's free metrics instead of used while computing imbalance for the cluster. This allows DRS to run for clusters where hosts doesn't have the same amount of metrics.
Currently, if a host is on alert, the option to reconnect it is not presented in the UI.
This PR addresses this issue by adding a reconnect button to the host if it is in an alert state.
To prevent errors during multi-user access, use account UUID to create/access user on the provider side. Also, update the existing secret key for a user that already exists.
This PR enables support for user data content upto 1048576 bytes - updates jetty maxFormContentSize value to 1048576 bytes (default is 200000 bytes).
CloudStack can support max user data content to 1048576 bytes (the size can be configurable through vm.userdata.max.length setting, max 1048576), but it's limited due to the default Jetty max content size, which is 200000 bytes.
Configuration Reference from jetty doc:
https://eclipse.dev/jetty/documentation/jetty-9/index.html#configuring-specific-webapp-deployment (check with maxFormContentSize here)
This PR fixes KVM manage/unmanage functionality on 4.19.0 RC1 - was introduced on #7976 but the latest merge commits on the PR removed the execution for KVM
Fixes#8412
Add support for 8.0.0.2 explicitly to prevent falling over to the parent version
Adds log when hypervisor capabilities fail over to the parent version
---------
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
Non-admin account may not have access to the readyForShutdown API therefore UI should not schedule the job.
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
GuestOS mappings are retrieved from the parent hypervisor version when a minor, patch hypervisor version doesn't exist.
Fixes#8412
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
This PR fixes the template selection regression for VMware Ingestion in the UI on 4.19.0 RC1 and adds back the default template selection for VMware
Fixes: #8428#8432
During management server initialization, some EmptyStackException and BeansException exceptions are always thrown; These exceptions do not affect ACS, however, they can cause confusion when observing the management server startup logs with DEBUG level on. The causes of the exceptions have been addressed.
Co-authored-by: João Jandre <joao@scclouds.com.br>
This PR changes the password.policy.regex default value to empty. With an empty value for the configuration, it is skipped during the password policy check, only when the configuration is set to something different than a blank string, the regex will get checked.
This way, when creating a user on org.apache.cloudstack.ldap.LdapAuthenticator#authenticate() we won't get an error by default, as an empty value for the password is passed.
With this change, a fix is added for failures seen with test_08_migrate_vm or other migration-related tests because the target host is in `Connecting` state,
#8356 (comment)
#8374 (comment)
and more
This simply improves the log statement that prints debug statements
during beginning of a stats collector run for hosts or VMs.
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
Failures seen in #7344 were debugged and it was seen since one of the
host is in Alert state. VM deployment fails with affinity group.
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>