* api,agent,server,engine-schema: scalability improvements
Following changes and improvements have been added:
- Improvements in handling of PingRoutingCommand
1. Added global config - `vm.sync.power.state.transitioning`, default value: true, to control syncing of power states for transitioning VMs. This can be set to false to prevent computation of transitioning state VMs.
2. Improved VirtualMachinePowerStateSync to allow power state sync for host VMs in a batch
3. Optimized scanning stalled VMs
- Added option to set worker threads for capacity calculation using config - `capacity.calculate.workers`
- Added caching framework based on Caffeine in-memory caching library, https://github.com/ben-manes/caffeine
- Added caching for account/use role API access with expiration after write can be configured using config - `dynamic.apichecker.cache.period`. If set to zero then there will be no caching. Default is 0.
- Added caching for account/use role API access with expiration after write set to 60 seconds.
- Added caching for some recurring DB retrievals
1. CapacityManager - listing service offerings - beneficial in host capacity calculation
2. LibvirtServerDiscoverer existing host for the cluster - beneficial for host joins
3. DownloadListener - hypervisors for zone - beneficial for host joins
5. VirtualMachineManagerImpl - VMs in progress- beneficial for processing stalled VMs during PingRoutingCommands
- Optimized MS list retrieval for agent connect
- Optimize finding ready systemvm template for zone
- Database retrieval optimisations - fix and refactor for cases where only IDs or counts are used mainly for hosts and other infra entities. Also similar cases for VMs and other entities related to host concerning background tasks
- Changes in agent-agentmanager connection with NIO client-server classes
1. Optimized the use of the executor service
2. Refactore Agent class to better handle connections.
3. Do SSL handshakes within worker threads
5. Added global configs to control the behaviour depending on the infra. SSL handshake could be a bottleneck during agent connections. Configs - `agent.ssl.handshake.min.workers` and `agent.ssl.handshake.max.workers` can be used to control number of new connections management server handles at a time. `agent.ssl.handshake.timeout` can be used to set number of seconds after which SSL handshake times out at MS end.
6. On agent side backoff and sslhandshake timeout can be controlled by agent properties. `backoff.seconds` and `ssl.handshake.timeout` properties can be used.
- Improvements in StatsCollection - minimize DB retrievals.
- Improvements in DeploymentPlanner allow for the retrieval of only desired host fields and fewer retrievals.
- Improvements in hosts connection for a storage pool. Added config - `storage.pool.host.connect.workers` to control the number of worker threads that can be used to connect hosts to a storage pool. Worker thread approach is followed currently only for NFS and ScaleIO pools.
- Minor improvements in resource limit calculations wrt DB retrievals
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
Co-authored-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
Co-authored-by: Rohit Yadav <rohit.yadav@shapeblue.com>
* test1, domaindetails, capacitymanager fix
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* test2 - agent tests
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* capacitymanagertest fix
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* change
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* fix missing changes
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* address comments
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* revert marvin/setup.py
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* fix indent
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* use space in sql
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* address duplicate
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* update host logs
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* revert e36c6a5d07
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* fix npe in capacity calculation
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* move schema changes to 4.20.1 upgrade
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* build fix
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* address comments
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* fix build
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* add some more tests
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* checkstyle fix
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* remove unnecessary mocks
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* build fix
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* replace statics
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* engine/orchestration,utils: limit number of concurrent new agent
connections
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* refactor - remove unused
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* unregister closed connections, monitor & cleanup
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* add check for outdated vm filter in power sync
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* agent: synchronize sendRequest wait
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
---------
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
Co-authored-by: Rohit Yadav <rohit.yadav@shapeblue.com>
* Delete local storage properties in agent.properties during delete pool
* Fix stale entry when add local storage failed
* Smaller methods
* Comment added
Adds framework layer change to allow retrieving and storing IOPS stats for storage pools. Custom `PrimaryStoreDriver` can implement method - `getStorageIopsStats` for returning IOPS stats. Existing method `getUsedIops` can also be overridden by such plugins when only used IOPS is returned.
For testing purpose, implementation has been added for simulator hypervisor plugin to return capacity and used IOPS for a pool.
For local storage pool, implementation has been added using iostat to return currently used IOPS.
StoragePoolResponse class has been updated to return IOPS values which allows showing IOPS values in UI for different storage pool related views and APIs.
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* Improve logging to include more identifiable information for kvm plugin
* Update logging for scaleio plugin
* Improve logging to include more identifiable information for default volume storage plugin
* Improve logging to include more identifiable information for agent managers
* Improve logging to include more identifiable information for Listeners
* Replace ids with objects or uuids
* Improve logging to include more identifiable information for engine
* Improve logging to include more identifiable information for server
* Fixups in engine
* Improve logging to include more identifiable information for plugins
* Improve logging to include more identifiable information for Cmd classes
* Fix toString method for StorageFilterTO.java
Particular Linstor needs can use this information to only allow
dual volume access for live migration and not enable it in general,
which can and will lead to data corruption if for some reason
2 VMs get started on 2 different hosts.
If a secondary storage pool is used by e.g.
2 concurrent snapshot->template actions,
if the first action finished it removed the netfs mount
point for the other action.
Now the storage pools are usage ref-counted and will only
deleted if there are no more users.
This fixes the issue when create a ovs network
```
2024-10-29 16:02:45,089 WARN [resource.wrapper.LibvirtOvsFetchInterfaceCommandWrapper] (agentRequest-Handler-2:null) (logid:e716722e) Network interface: ''cloudbr1'' not found
```
This is a regression of a previous security release
see "framework/cluster: improve cluster service, integration API server"
since we now use NetworkInterface.getByName to get network interface, we should NOT add single quotes before/after the label.
qemu has a bug versions prior 7.0 with discard enabled and using the IDE bus.
It would crash the qemu process and kill the virtual machine,
this is most noticeable on installing a windows guest from the
Windows ISO installer.
* linstor: enable discard for Linstor storage pools
All Linstor storage backends support discard, so it can be safely enabled.
* linstor: enable discard for Linstor storage pools CHANGELOG.md
* Make volume attachment disk controller selection consistent with VM creation and start
* Update vmware-base/src/main/java/com/cloud/hypervisor/vmware/util/VmwareHelper.java
Co-authored-by: dahn <daan.hoogland@gmail.com>
* Choose disk controllers after converting osdefault
* Rename function
---------
Co-authored-by: dahn <daan.hoogland@gmail.com>
* Add logs to LibvirtComputingResource's metrics collecting process
* Apply Joao's suggestions
Co-authored-by: João Jandre <48719461+JoaoJandre@users.noreply.github.com>
* Adjust some logs
* Print memory statistics log in one line
---------
Co-authored-by: João Jandre <48719461+JoaoJandre@users.noreply.github.com>
This introduces the multi-arch zones, allowing users to select the VM arch upon deployment.
Multi-arch zone support in CloudStack can allow admins to mix x86_64 & arm64 hosts within the same zone with the following changes proposed:
- All hosts in a clusters need to be homogenous, wrt host CPU type (amd64 vs arm64) and hypevisor
- Arch-aware templates & ISOs:
- Add support for a new arch field (default set of: amd64 and arm64), when unspecified defaults to amd64 and for existing templates & iso
- Allow admins to edit the arch type of the registered template & iso
- Arch-aware clusters and host:
- Add new attribute field for cluster and hosts (kvm host agents can automatically report this, arch of the first host of the cluster is cluster's architecture), defaults to amd64 when not specified
- Allow admins to edit the arch of an existing cluster
- VM deployment form (UI):
- In a multi-arch zone/env, the VM deployment form can allow some kind of template/iso filtration in the UI
- Users should be able to select arch: amd64 & arm64; but this is shown only in a multi-arch zone (env)
- VM orchestration and lifecycle operations:
- Use of VM/template's arch to correctly decide where to provision the VM (on the correct strictly arch-matching host/clusters) & other lifecycle operations (such as migration from/to arch-matching hosts)
Co-authored-by: Rohit Yadav <rohit.yadav@shapeblue.com>