This PR allows attaching of GPU devices via PCI, mdev or VF to an Instance for KVM.
It allows the operator to discover the GPU devices on the KVM host and create a Compute Offering with GPU support based on the available GPU devices on the host. Once the operator has created the Compute offering, it can be used by users to launch Instances with GPU devices.
* Cumulative enhancements fix for ScaleIO: MDM add/remove, Host prepare/unprepare, validate Storage Pool can be created in Agent.
- Implemented validation to fail Host disconnect from Storage Pool if there are Volumes attached and SDC client MDM removal requires scini service to be restarted
- Implemented Storage Pool validation by checking whether MDM addresses from configuration file and from memory (using CLI) matches, otherwise file ModifyStoragePool command.
- Introduced configuration key to apply timeout after making MDM changes for ScaleIO: powerflex.mdm.change.apply.timeout.ms (default 1000ms)
- Implemented logic to apply timeout after making MDM changes for ScaleIO in prepare and unprepare logic
- Added detection of MDM removal support via CLI
- If MDM removal support via CLI supported then use CLI, fall back to edit drv_cfg.txt and restart scini instead
Co-authored-by: Suresh Kumar Anaparti <suresh.anaparti@shapeblue.com>
Co-authored-by: mprokopchuk <mprokopchuk@apple.com>
* Management Server - Prepare for Maintenance and Cancel Maintenance improvements:
- Added new setting 'management.server.maintenance.ignore.maintenance.hosts' to ignore hosts in maintenance states while preparing management server for maintenance. This skips agent transfer and agents count check for hosts in maintenance.
- Rebalance indirect agents after cancel maintenance, using rebalance parameter in cancelMaintenance API
- Force maintenance after maintenance window timeout, using forced parameter in prepareForMaintenance API.
- Propagate 'indirect.agent.lb.check.interval' setting change to the host agents.
* rebases fixes
* code improvements, cleanup
* [UI] Set rebalance true by default in cancel maintenance dialog
* Update MS state after executing cluster cmd in the target MS, and some code improvements
* code improvements
* Ensure the host lb algorithm 'shuffle' is applied once before disabling the indirect agent lb check background task
* KVM incremental snapshot feature
* fix log
* fix merge issues
* fix creation of folder
* fix snapshot update
* Check for hypervisor type during parent search
* fix some small bugs
* fix tests
* Address reviews
* do not remove storPool snapshots
* add support for downloading diff snaps
* Add multiple zones support
* make copied snapshots have normal names
* address reviews
* Fix in progress
* continue fix
* Fix bulk delete
* change log to trace
* Start fix on multiple secondary storages for a single zone
* Fix multiple secondary storages for a single zone
* Fix tests
* fix log
* remove bitmaps when deleting snapshots
* minor fixes
* update sql to new file
* Fix merge issues
* Create new snap chain when changing configuration
* add verification
* Fix snapshot operation selector
* fix bitmap removal
* fix chain on different storages
* address reviews
* fix small issue
* fix test
---------
Co-authored-by: João Jandre <joao@scclouds.com.br>
* cloudstack: add support for EL10
This adds support for Fedora 40 and (upcoming) EL10 distro to be used
as mgmt/usage server, mysql/nfs & KVM host. Python3 version has changed
to 3.12.9 which isn't automatically determining the python-path.
* python: WIP code, this fails right now
Need to discuss/check if we can skip this code. Where/how is cgroup
setup used with KVM agent.
* prep cloudutils to be EL10 ready
Fixes issue for Fedora, it was running old EL6 hooks which isn't
applicable for modern Fedora version that are closer to EL8/9/10
* Update last agents during ms maintenance, and some code improvements
* Send 503 (Service Unavailable) response status when maintenance or shutdown is initiated
[Any load balancer in the clustered environment can avoid routing requests to this MS node]
* Migrate systemvm agents before routing host agents, and some code improvements
* Added events for ms maintenance and shutdown operations
* Added the following ms maintenance and shutdown improvements
- block new agent connections during prepare for maintenance of ms
- maintain avoids ms list
- propagate updated management servers list and lb algorithm in host and indirect.agent.lb.algorithm settings respectively, to systemvm (non-routing) agents
- updated setup ms list and migrate agent connections to executor service
- migrate agent connection through executor, and send the answer to the ms host that initiated the migration
- re-initialize ssl handshake executor if it is shutdown
- don't allow prepare for maintenance or shutdown when other management server nodes are in preparing states
- don't allow trigger shutdown when management server is up and other management server nodes are in preparing states
- stop agent connections monitor on ms maintenance
- update avoid ms list in ready command
- updated connected host from the client connection
- update last agents in ms metrics from the database
- updated some agent config descriptions
- update last management server in the hosts during shutdown
- added agents and lastagents in management server response
- updated management server maintenance & shutdown unit tests
- some code improvements
* refactored code / addressed comments
* removed shutdown testcase (maybe, calling System.exit)
* Revert "removed shutdown testcase (maybe, calling System.exit)"
This reverts commit e14b0717152ef6c8be102d61c80f42803a53172e.
* avoid system.exit during shutdown test
* code improvements
* testcase fix
* Fix cutoff time in agent connections monitor thread
* api,agent,server,engine-schema: scalability improvements
Following changes and improvements have been added:
- Improvements in handling of PingRoutingCommand
1. Added global config - `vm.sync.power.state.transitioning`, default value: true, to control syncing of power states for transitioning VMs. This can be set to false to prevent computation of transitioning state VMs.
2. Improved VirtualMachinePowerStateSync to allow power state sync for host VMs in a batch
3. Optimized scanning stalled VMs
- Added option to set worker threads for capacity calculation using config - `capacity.calculate.workers`
- Added caching framework based on Caffeine in-memory caching library, https://github.com/ben-manes/caffeine
- Added caching for account/use role API access with expiration after write can be configured using config - `dynamic.apichecker.cache.period`. If set to zero then there will be no caching. Default is 0.
- Added caching for account/use role API access with expiration after write set to 60 seconds.
- Added caching for some recurring DB retrievals
1. CapacityManager - listing service offerings - beneficial in host capacity calculation
2. LibvirtServerDiscoverer existing host for the cluster - beneficial for host joins
3. DownloadListener - hypervisors for zone - beneficial for host joins
5. VirtualMachineManagerImpl - VMs in progress- beneficial for processing stalled VMs during PingRoutingCommands
- Optimized MS list retrieval for agent connect
- Optimize finding ready systemvm template for zone
- Database retrieval optimisations - fix and refactor for cases where only IDs or counts are used mainly for hosts and other infra entities. Also similar cases for VMs and other entities related to host concerning background tasks
- Changes in agent-agentmanager connection with NIO client-server classes
1. Optimized the use of the executor service
2. Refactore Agent class to better handle connections.
3. Do SSL handshakes within worker threads
5. Added global configs to control the behaviour depending on the infra. SSL handshake could be a bottleneck during agent connections. Configs - `agent.ssl.handshake.min.workers` and `agent.ssl.handshake.max.workers` can be used to control number of new connections management server handles at a time. `agent.ssl.handshake.timeout` can be used to set number of seconds after which SSL handshake times out at MS end.
6. On agent side backoff and sslhandshake timeout can be controlled by agent properties. `backoff.seconds` and `ssl.handshake.timeout` properties can be used.
- Improvements in StatsCollection - minimize DB retrievals.
- Improvements in DeploymentPlanner allow for the retrieval of only desired host fields and fewer retrievals.
- Improvements in hosts connection for a storage pool. Added config - `storage.pool.host.connect.workers` to control the number of worker threads that can be used to connect hosts to a storage pool. Worker thread approach is followed currently only for NFS and ScaleIO pools.
- Minor improvements in resource limit calculations wrt DB retrievals
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
Co-authored-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
Co-authored-by: Rohit Yadav <rohit.yadav@shapeblue.com>
* test1, domaindetails, capacitymanager fix
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* test2 - agent tests
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* capacitymanagertest fix
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* change
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* fix missing changes
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* address comments
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* revert marvin/setup.py
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* fix indent
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* use space in sql
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* address duplicate
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* update host logs
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* revert e36c6a5d07
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* fix npe in capacity calculation
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* move schema changes to 4.20.1 upgrade
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* build fix
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* address comments
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* fix build
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* add some more tests
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* checkstyle fix
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* remove unnecessary mocks
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* build fix
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* replace statics
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* engine/orchestration,utils: limit number of concurrent new agent
connections
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* refactor - remove unused
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* unregister closed connections, monitor & cleanup
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* add check for outdated vm filter in power sync
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* agent: synchronize sendRequest wait
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
---------
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
Co-authored-by: Rohit Yadav <rohit.yadav@shapeblue.com>
* Support for Management Server Maintenance
- New APIs: prepareForMaintenance and cancelMaintenance, with required parameter - managementserverid.
- New management server states for maintenance: PreparingForMaintenance, Maintenance.
- listHosts API with optional parameter – managementserverid, to list the hosts connected to the management server.
- Support management server maintenance when more than one active management servers available.
- Triggers transfer agents to other available management servers for maintenance, new agent command MigrateAgentConnectionCommand to initiate transfer of indirect agents.
- New global config 'management.server.maintenance.timeout', to set the timeout (in mins) for the management server maintenance window, default: 60 mins.
- UI changes: Prepare and Cancel Maintenance in Management Server section, Connected Agents tab, New fields for hosts and management servers.
* Updated pending jobs check timer task with ScheduledExecutorService
* keep maintenance state on trigger shutdown call when ms is in maintenance
* add pending jobs count to ms response
* during ms heartbeat, update state to up only when it's down
* allow vm work jobs of async job created before prepare for maintenance
* Revert "keep maintenance state on trigger shutdown call when ms is in maintenance"
This reverts commit 607e13364679eac897f4d146bb3325ea7a61ba17.
* skip maintenance test when multiple management servers are not available, and not configured in host setting for kvm
* 4.20:
merge errors fixed
Restrict the migration of volumes attached to VMs in Starting state (#9725)
server, plugin: enhance storage stats for IOPS (#10034)
Introducing granular command timeouts global setting (#9659)
Improve logging to include more identifiable information (#9873)
* Improve logging to include more identifiable information for kvm plugin
* Update logging for scaleio plugin
* Improve logging to include more identifiable information for default volume storage plugin
* Improve logging to include more identifiable information for agent managers
* Improve logging to include more identifiable information for Listeners
* Replace ids with objects or uuids
* Improve logging to include more identifiable information for engine
* Improve logging to include more identifiable information for server
* Fixups in engine
* Improve logging to include more identifiable information for plugins
* Improve logging to include more identifiable information for Cmd classes
* Fix toString method for StorageFilterTO.java
This introduces the multi-arch zones, allowing users to select the VM arch upon deployment.
Multi-arch zone support in CloudStack can allow admins to mix x86_64 & arm64 hosts within the same zone with the following changes proposed:
- All hosts in a clusters need to be homogenous, wrt host CPU type (amd64 vs arm64) and hypevisor
- Arch-aware templates & ISOs:
- Add support for a new arch field (default set of: amd64 and arm64), when unspecified defaults to amd64 and for existing templates & iso
- Allow admins to edit the arch type of the registered template & iso
- Arch-aware clusters and host:
- Add new attribute field for cluster and hosts (kvm host agents can automatically report this, arch of the first host of the cluster is cluster's architecture), defaults to amd64 when not specified
- Allow admins to edit the arch of an existing cluster
- VM deployment form (UI):
- In a multi-arch zone/env, the VM deployment form can allow some kind of template/iso filtration in the UI
- Users should be able to select arch: amd64 & arm64; but this is shown only in a multi-arch zone (env)
- VM orchestration and lifecycle operations:
- Use of VM/template's arch to correctly decide where to provision the VM (on the correct strictly arch-matching host/clusters) & other lifecycle operations (such as migration from/to arch-matching hosts)
Co-authored-by: Rohit Yadav <rohit.yadav@shapeblue.com>
* Create/Export OVA file of the VM on external vCenter host, to temporary conversion location (NFS)
* Fixed ova issue on untar/extract ovf from ova file
"tar -xf" cmd on ova fails with "ovf: Not found in archive" while extracting ovf file
* Updated VMware to KVM instance migration using OVA
* Refactoring and cleanup
* test fixes
* Consider zone wide pools in the destination cluster for instance conversion
* Remove local storage pool support as temporary conversion location
- OVA export not possible as the pool is not accessible outside host, NFS pools are supported.
* cleanup unused code
* some improvements, and refactoring
* import nic unit tests
* vmware guru unit tests
* Separate clone VM and create template file for VMware migration
- Export OVA (of the cloned VM) to the conversion location takes time.
- Do any validations with cloned VM before creating the template (and fail early).
- Updated unit tests.
* Check conversion support on host before clone vm / create template on vmware (and fail early)
* minor code improvements
* Auto select the host with instance conversion capability
* Skip instance conversion supported response param for non-KVM hosts
* Show supported conversion hosts in the UI
* Skip persistence map update if network doesn't exist
* Added support to export OVA from KVM host, through ovftool (when installed in KVM host)
* Updated importvm api param 'usemsforovaexport' to 'forcemstodownloadvmfiles', to be generic
* Updated hardcoded UI messages with message labels
* Updated UI to support importvm api param - forcemstodownloadvmfiles
* Improved instance conversion support checks on ubuntu hosts, and for windows guest vms
* Use OVF template (VM disks and spec files) for instance conversion from VMware, instead of OVA file
- this would further increase the migration performance (as it reduces the time for OVA preparation / archiving of the VM files into a single file)
* OVF export tool parallel threads code improvements
* Updated 'convert.vmware.instance.to.kvm.timeout' config default value to 3 hrs
* Config values check & code improvements
* Updated import log, with time taken and vm details
* Support for parallel downloads of VMware VM disk files while exporting OVF from MS, and other changes below.
- Skip clone for powered off VMs
- Fixes to support standalone host (with its default datacenter)
- Some code improvements
* rebase fixes
* rebase fixes
* minor improvement
* code improvements - threads configuration, and api parameter changes to import vm files
* typo fix in error msg
* Normalize logs
All classes that could have their loggers inherited from their fathers had their own loggers deleted;
Most loggers didn't have to be static, so most of them were normalized so that they wouldn't be;
All loggers are protected now;
Static logger's name are now 'LOGGER';
Non-static logger's name are now 'logger';
New class DbUpgradeAbstractImpl created so that all Upgraders extend it and inherit its logger
* Upgrade log4j
* fix errors caused by the merge
* Refactor cglibThrowableRenderer functionality to log4j2 and upgrade the last configuration files
* fix sonarcloud bug
* Fix errors caused by merge, remove some unused loggers, and rename a variable that was mistakenly renamed on the normalization commit
* Readd snmpTrapAppender, remove TestAppender
* Regenerate changes
* regenerate changes
* refactor last custom appender
* fix systemvm configuration xml
* Regenerate changes
* Regenerate changes
* regenerate changes
* Regenerate changes
* regenerate changes
* regenerate changes
* regenerate changes
* Fix utils pom
* fix some tests
* regenerate changes
* Fix jar being printed on exception
* fix logging in system VMs, fix commands not having log4j2 classpath.
* regenerate changes
* Fix some unwanted renomeations
* fix end of file
* regenerate changes
* regenerate changes
* fix merge error
* regenerate changes
* fix tests
* regenerate changes
* regenerate changes
* regenerate changes
* regenerate changes
* regenerate changes
* regenerate changes
* regenerate changes
* readd reload4j to tungsten as juniper depends on it
* Regenerate changes
* regenerate changes
* regenerate changes
* regenerate changes
* regenerate changes
* re-add reload4j dependency to network-contrail, as juniper depends on it
* regenerate changes
* regenerate changes
* regenerate changes
* fix typo
* regenerate changes
* regenerate changes
* Fix end of files
* regenerate changes
* add logj42 to cloud-utils-SHADED.jar
* regenerate changes
* regenerate changes
* regenerate changes
* regenerate changes
* regenerate changes
* regenerate changes
* regenerate changes
* regenerate changes
* Regenerate changes
* Regenerate changes
* Regenerate changes
* regenerate changes
* Regenerate changes
* regenerate changes
* Regenerate changes
* Regenerate changes
* Regenerate changes
* regenerate changes
* Regenerate changes
* Regenerate changes
* fix some tests
* Regenerate changes
* Regenerate changes
* fix test
* Regenerate changes
* Regenerate changes
This PR fixes bug introduced in #8502. Timeout for script execution was set to 60 ms instead of 60s which resulted in host not getting UEFI enabled. This is a blocker for 4.19 release.
We do this by introducing a new agent parameter `agent.script.timeout` (default - 60 seconds) to use as a timeout for the script checking host's UEFI status.
We also externalize the timeout for the ReadyCommand by introducing a new global setting `ready.command.wait` (default - 60 seconds).
For ModifyStoragePoolCommand, we don't externalize the timeout to avoid confusion for the user. Since, the required timeout can vary depending on the provider in use and we are only setting the wait for default host listener for now. Instead, we reuse the global `wait` setting by dividing it by `5` making the default value of 6 minutes (1800/5 = 360s) for ModifyStoragePoolCommand.
Note: the actual time, the MS waits is twice the wait set for a Command. Check reference code below.
19250403e6/engine/orchestration/src/main/java/com/cloud/agent/manager/AgentAttache.java (L406-L442)
This PR adds the capability in CloudStack to convert VMware Instances disk(s) to KVM using virt-v2v and import them as CloudStack instances. It enables CloudStack operators to import VMware instances from vSphere into a KVM cluster managed by CloudStack. vSphere/VMware setup might be managed by CloudStack or be a standalone setup.
CloudStack will let the administrator select a VM from an existing VMware vCenter in the CloudStack environment or external vCenter requesting vCenter IP, Datacenter name and credentials.
The migrated VM will be imported as a KVM instance
The migration is done through virt-v2v: https://access.redhat.com/articles/1351473, https://www.ovirt.org/develop/release-management/features/virt/virt-v2v-integration.html
The migration process timeout can be set by the setting convert.instance.process.timeout
Before attempting the virt-v2v migration, CloudStack will create a clone of the source VM on VMware. The clone VM will be removed after the registration process finishes.
CloudStack will delegate the migration action to a KVM host and the host will attempt to migrate the VM invoking virt-v2v. In case the guest OS is not supported then CloudStack will handle the error operation as a failure
The migration process using virt-v2v may not be a fast process
CloudStack will not perform any check about the guest OS compatibility for the virt-v2v library as indicated on: https://access.redhat.com/articles/1351473.
Co-authored-by: Stephan Krug <stephan.krug@scclouds.com.br>
Co-authored-by: GaOrtiga <49285692+GaOrtiga@users.noreply.github.com>
Co-authored-by: dahn <daan.hoogland@gmail.com>
This PR allows an admin to reserve some hypervisor host CPUs for system use. Another way to think of it is limiting the number of CPUs allocatable to VMs. This can be useful if the admin wants to do other things with the hypervisor's CPU, for example reserve some cores for running hyperconverged storage processes.
Co-authored-by: Marcus Sorensen <mls@apple.com>
* Trigger out of band VM state update via libvirt event when VM stops
* Add License headers, refactor nested try
---------
Co-authored-by: Marcus Sorensen <mls@apple.com>