* api,agent,server,engine-schema: scalability improvements
Following changes and improvements have been added:
- Improvements in handling of PingRoutingCommand
1. Added global config - `vm.sync.power.state.transitioning`, default value: true, to control syncing of power states for transitioning VMs. This can be set to false to prevent computation of transitioning state VMs.
2. Improved VirtualMachinePowerStateSync to allow power state sync for host VMs in a batch
3. Optimized scanning stalled VMs
- Added option to set worker threads for capacity calculation using config - `capacity.calculate.workers`
- Added caching framework based on Caffeine in-memory caching library, https://github.com/ben-manes/caffeine
- Added caching for account/use role API access with expiration after write can be configured using config - `dynamic.apichecker.cache.period`. If set to zero then there will be no caching. Default is 0.
- Added caching for account/use role API access with expiration after write set to 60 seconds.
- Added caching for some recurring DB retrievals
1. CapacityManager - listing service offerings - beneficial in host capacity calculation
2. LibvirtServerDiscoverer existing host for the cluster - beneficial for host joins
3. DownloadListener - hypervisors for zone - beneficial for host joins
5. VirtualMachineManagerImpl - VMs in progress- beneficial for processing stalled VMs during PingRoutingCommands
- Optimized MS list retrieval for agent connect
- Optimize finding ready systemvm template for zone
- Database retrieval optimisations - fix and refactor for cases where only IDs or counts are used mainly for hosts and other infra entities. Also similar cases for VMs and other entities related to host concerning background tasks
- Changes in agent-agentmanager connection with NIO client-server classes
1. Optimized the use of the executor service
2. Refactore Agent class to better handle connections.
3. Do SSL handshakes within worker threads
5. Added global configs to control the behaviour depending on the infra. SSL handshake could be a bottleneck during agent connections. Configs - `agent.ssl.handshake.min.workers` and `agent.ssl.handshake.max.workers` can be used to control number of new connections management server handles at a time. `agent.ssl.handshake.timeout` can be used to set number of seconds after which SSL handshake times out at MS end.
6. On agent side backoff and sslhandshake timeout can be controlled by agent properties. `backoff.seconds` and `ssl.handshake.timeout` properties can be used.
- Improvements in StatsCollection - minimize DB retrievals.
- Improvements in DeploymentPlanner allow for the retrieval of only desired host fields and fewer retrievals.
- Improvements in hosts connection for a storage pool. Added config - `storage.pool.host.connect.workers` to control the number of worker threads that can be used to connect hosts to a storage pool. Worker thread approach is followed currently only for NFS and ScaleIO pools.
- Minor improvements in resource limit calculations wrt DB retrievals
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
Co-authored-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
Co-authored-by: Rohit Yadav <rohit.yadav@shapeblue.com>
* test1, domaindetails, capacitymanager fix
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* test2 - agent tests
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* capacitymanagertest fix
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* change
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* fix missing changes
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* address comments
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* revert marvin/setup.py
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* fix indent
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* use space in sql
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* address duplicate
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* update host logs
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* revert e36c6a5d07
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* fix npe in capacity calculation
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* move schema changes to 4.20.1 upgrade
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* build fix
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* address comments
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* fix build
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* add some more tests
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* checkstyle fix
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* remove unnecessary mocks
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* build fix
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* replace statics
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* engine/orchestration,utils: limit number of concurrent new agent
connections
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* refactor - remove unused
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* unregister closed connections, monitor & cleanup
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* add check for outdated vm filter in power sync
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* agent: synchronize sendRequest wait
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
---------
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
Co-authored-by: Rohit Yadav <rohit.yadav@shapeblue.com>
* Improve logging to include more identifiable information for kvm plugin
* Update logging for scaleio plugin
* Improve logging to include more identifiable information for default volume storage plugin
* Improve logging to include more identifiable information for agent managers
* Improve logging to include more identifiable information for Listeners
* Replace ids with objects or uuids
* Improve logging to include more identifiable information for engine
* Improve logging to include more identifiable information for server
* Fixups in engine
* Improve logging to include more identifiable information for plugins
* Improve logging to include more identifiable information for Cmd classes
* Fix toString method for StorageFilterTO.java
* Prevent addition of duplicate PF rules on scale up and no rules left behind on scale down (#32)
* fix missing dependency injection
* NSX: Fix concurrency issues on port forwarding rules deletion (#37)
* Fix concurrency issues on port forwarding rules deletion
* Refactor objectExists
* Fix unit test
* Fix test
* Small fixes
* CKS: Externalize control and worker node setup wait time and installation attempts (#38)
* NSX: Add shared network support (#41)
* NSX: Fix number of physical networks for Guest traffic checks and leftover rules on CKS cluster deletion (#45)
* Fix pf rules removal on CKS cluster deletion
* Fix check for number of physical networks for guest traffic
* Fix unit test
* fix logger
* NSX: Handle CheckHealthCommand to avoid host disconnection and errors on APIs
* NSX: Handle CheckHealthCommand to avoid host disconnection and errors on APIs
* Remove unused string
* fix logger
* Update UDP active monitor to ICMP
* Fix NPE on restarting VPC with additional public IPs
* NSX / VPC: Reuse Source NAT IP from systemVM range on restarts
* CKS: Public IP not found for VPC networks
* Externalize retries and inverval for NSX segment deletion (#67)
* remove unused import
* remove duplicate imports
* remove unused import
* revert externalizing cks settings
* fix test
* Refactor log messages
* Address comments
* Fix issue caused due to forward merge: 90fe1d
---------
Co-authored-by: Nicolas Vazquez <nicovazquez90@gmail.com>
Co-authored-by: Rohit Yadav <rohit.yadav@shapeblue.com>
this fixes the exception in smoke test test_affinity_groups
```
2024-08-19T08:34:15,132 ERROR [c.c.a.ApiAsyncJobDispatcher] (API-Job-Executor-87:[ctx-f7804a8e, job-9232]) (logid:b71ddec8) Unexpected exception while executing org.apache.cloudstack.api.command.admin.vm.DeployVMCmdByAdmin com.cloud.utils.exception.CloudRuntimeException: Unable to find on DB, due to: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ') FOR UPDATE' at line 1
at com.cloud.utils.db.GenericDaoBase.searchIncludingRemoved(GenericDaoBase.java:441)
at com.cloud.utils.db.GenericDaoBase.searchIncludingRemoved(GenericDaoBase.java:368)
at com.cloud.utils.db.GenericDaoBase.search(GenericDaoBase.java:357)
at com.cloud.utils.db.GenericDaoBase.lockRows(GenericDaoBase.java:343)
at org.apache.cloudstack.affinity.dao.AffinityGroupDaoImpl.listByIds(AffinityGroupDaoImpl.java:171)
```
* Normalize logs
All classes that could have their loggers inherited from their fathers had their own loggers deleted;
Most loggers didn't have to be static, so most of them were normalized so that they wouldn't be;
All loggers are protected now;
Static logger's name are now 'LOGGER';
Non-static logger's name are now 'logger';
New class DbUpgradeAbstractImpl created so that all Upgraders extend it and inherit its logger
* Upgrade log4j
* fix errors caused by the merge
* Refactor cglibThrowableRenderer functionality to log4j2 and upgrade the last configuration files
* fix sonarcloud bug
* Fix errors caused by merge, remove some unused loggers, and rename a variable that was mistakenly renamed on the normalization commit
* Readd snmpTrapAppender, remove TestAppender
* Regenerate changes
* regenerate changes
* refactor last custom appender
* fix systemvm configuration xml
* Regenerate changes
* Regenerate changes
* regenerate changes
* Regenerate changes
* regenerate changes
* regenerate changes
* regenerate changes
* Fix utils pom
* fix some tests
* regenerate changes
* Fix jar being printed on exception
* fix logging in system VMs, fix commands not having log4j2 classpath.
* regenerate changes
* Fix some unwanted renomeations
* fix end of file
* regenerate changes
* regenerate changes
* fix merge error
* regenerate changes
* fix tests
* regenerate changes
* regenerate changes
* regenerate changes
* regenerate changes
* regenerate changes
* regenerate changes
* regenerate changes
* readd reload4j to tungsten as juniper depends on it
* Regenerate changes
* regenerate changes
* regenerate changes
* regenerate changes
* regenerate changes
* re-add reload4j dependency to network-contrail, as juniper depends on it
* regenerate changes
* regenerate changes
* regenerate changes
* fix typo
* regenerate changes
* regenerate changes
* Fix end of files
* regenerate changes
* add logj42 to cloud-utils-SHADED.jar
* regenerate changes
* regenerate changes
* regenerate changes
* regenerate changes
* regenerate changes
* regenerate changes
* regenerate changes
* regenerate changes
* Regenerate changes
* Regenerate changes
* Regenerate changes
* regenerate changes
* Regenerate changes
* regenerate changes
* Regenerate changes
* Regenerate changes
* Regenerate changes
* regenerate changes
* Regenerate changes
* Regenerate changes
* fix some tests
* Regenerate changes
* Regenerate changes
* fix test
* Regenerate changes
* Regenerate changes
This pull request (PR) implements a Distributed Resource Scheduler (DRS) for a CloudStack cluster. The primary objective of this feature is to enable automatic resource optimization and workload balancing within the cluster by live migrating the VMs as per configuration.
Administrators can also execute DRS manually for a cluster, using the UI or the API.
Adds support for two algorithms - condensed & balanced. Algorithms are pluggable allowing ACS Administrators to have customized control over scheduling.
Implementation
There are three top level components:
Scheduler
A timer task which:
Generate DRS plan for clusters
Process DRS plan
Remove old DRS plan records
DRS Execution
We go through each VM in the cluster and use the specified algorithm to check if DRS is required and to calculate cost, benefit & improvement of migrating that VM to another host in the cluster. On the basis of cost, benefit & improvement, the best migration is selected for the current iteration and the VM is migrated. The maximum number of iterations (live migrations) possible on the cluster is defined by drs.iterations which is defined as a percentage (as a value between 0 and 1) of total number of workloads.
Algorithm
Every algorithms implements two methods:
needsDrs - to check if drs is required for cluster
getMetrics - to calculate cost, benefit & improvement of a migrating a VM to another host.
Algorithms
Condensed - Packs all the VMs on minimum number of hosts in the cluster.
Balanced - Distributes the VMs evenly across hosts in the cluster.
Algorithms use drs.level to decide the amount of imbalance to allow in the cluster.
APIs Added
listClusterDrsPlan
id - ID of the DRS plan to list
clusterid - to list plans for a cluster id
generateClusterDrsPlan
id - cluster id
iterations - The maximum number of iterations in a DRS job defined as a percentage (as a value between 0 and 1) of total number of workloads. Defaults to value of cluster's drs.iterations setting.
executeClusterDrsPlan
id - ID of the cluster for which DRS plan is to be executed.
migrateto - This parameter specifies the mapping between a vm and a host to migrate that VM. Format of this parameter: migrateto[vm-index].vm=<uuid>&migrateto[vm-index].host=<uuid>.
Config Keys Added
ClusterDrsPlanExpireInterval
Key drs.plan.expire.interval
Scope Global
Default Value 30 days
Description The interval in days after which old DRS records will be cleaned up.
ClusterDrsEnabled
Key drs.automatic.enable
Scope Cluster
Default Value false
Description Enable/disable automatic DRS on a cluster.
ClusterDrsInterval
Key drs.automatic.interval
Scope Cluster
Default Value 60 minutes
Description The interval in minutes after which a periodic background thread will schedule DRS for a cluster.
ClusterDrsIterations
Key drs.max.migrations
Scope Cluster
Default Value 50
Description Maximum number of live migrations in a DRS execution.
ClusterDrsAlgorithm
Key drs.algorithm
Scope Cluster
Default Value condensed
Description DRS algorithm to execute on the cluster. This PR implements two algorithms - balanced & condensed.
ClusterDrsLevel
Key drs.imbalance
Scope Cluster
Default Value 0.5
Description Percentage (as a value between 0.0 and 1.0) of imbalance allowed in the cluster. 1.0 means no imbalance
is allowed and 0.0 means imbalance is allowed.
ClusterDrsMetric
Key drs.imbalance.metric
Scope Cluster
Default Value memory
Description The cluster imbalance metric to use when checking the drs.imbalance.threshold. Possible values are memory and cpu.
JaCoCo used for code coverage calculation in the project doesn't support PowerMockito classes.
This PR attemps to reduce usage of PowerMockito.
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
* Cleaup and code-formatting POM files
* Remove obsolete mycila license-maven-plugin
* Remove obsolete console-proxy/plugin project
* Move console-proxy-rdbconsole under console-proxy parent
* Use correct parent path for rdpconsole
* Order alphabetally items in setnextversion.sh
* Unifiy License header in POMs
* Alphabetic order of modules definition
* Extract all defined versions into parent pom
* Remove obsolete files: version-info.in, configure-info.in
* Remove redundant defaultGoal
* Remove useless checkstyle plugin from checkstyle project
* Order alphabetally items in pom.xml
* Add aditional SPACEs to fix debian build
* Don't execute checkstyle on parent projects
* Use UTF-8 encoding in building checkstyle project
* Extract plugin versions into properties
* Execute PMD plugin on all the projects with -Penablefindbugs
* Upgrade maven plugins to latest version
* Make sure to always look for apache parent pom from repository
* Fix incorrect version grep in debian packaging
* Fix rebase conflicts
* Fix rebase conflicts
* Remove PMD for now to be fixed on another PR