31 Commits

Author SHA1 Message Date
Abhishek Kumar
0b5a5e8043
api,agent,server,engine-schema: scalability improvements (#9840)
* api,agent,server,engine-schema: scalability improvements

Following changes and improvements have been added:

- Improvements in handling of PingRoutingCommand

    1. Added global config - `vm.sync.power.state.transitioning`, default value: true, to control syncing of power states for transitioning VMs. This can be set to false to prevent computation of transitioning state VMs.
    2. Improved VirtualMachinePowerStateSync to allow power state sync for host VMs in a batch
    3. Optimized scanning stalled VMs

- Added option to set worker threads for capacity calculation using config - `capacity.calculate.workers`

- Added caching framework based on Caffeine in-memory caching library, https://github.com/ben-manes/caffeine

- Added caching for account/use role API access with expiration after write can be configured using config - `dynamic.apichecker.cache.period`. If set to zero then there will be no caching. Default is 0.

- Added caching for account/use role API access with expiration after write set to 60 seconds.

- Added caching for some recurring DB retrievals

    1. CapacityManager - listing service offerings - beneficial in host capacity calculation
    2. LibvirtServerDiscoverer existing host for the cluster - beneficial for host joins
    3. DownloadListener - hypervisors for zone - beneficial for host joins
    5. VirtualMachineManagerImpl - VMs in progress- beneficial for processing stalled VMs during PingRoutingCommands

- Optimized MS list retrieval for agent connect

- Optimize finding ready systemvm template for zone

- Database retrieval optimisations - fix and refactor for cases where only IDs or counts are used mainly for hosts and other infra entities. Also similar cases for VMs and other entities related to host concerning background tasks

- Changes in agent-agentmanager connection with NIO client-server classes

    1. Optimized the use of the executor service
    2. Refactore Agent class to better handle connections.
    3. Do SSL handshakes within worker threads
    5. Added global configs to control the behaviour depending on the infra. SSL handshake could be a bottleneck during agent connections. Configs - `agent.ssl.handshake.min.workers` and `agent.ssl.handshake.max.workers` can be used to control number of new connections management server handles at a time. `agent.ssl.handshake.timeout` can be used to set number of seconds after which SSL handshake times out at MS end.
    6. On agent side backoff and sslhandshake timeout can be controlled by agent properties. `backoff.seconds` and `ssl.handshake.timeout` properties can be used.

- Improvements in StatsCollection - minimize DB retrievals.

- Improvements in DeploymentPlanner allow for the retrieval of only desired host fields and fewer retrievals.

- Improvements in hosts connection for a storage pool. Added config - `storage.pool.host.connect.workers` to control the number of worker threads that can be used to connect hosts to a storage pool. Worker thread approach is followed currently only for NFS and ScaleIO pools.

- Minor improvements in resource limit calculations wrt DB retrievals

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>

Co-authored-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
Co-authored-by: Rohit Yadav <rohit.yadav@shapeblue.com>

* test1, domaindetails, capacitymanager fix

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>

* test2 - agent tests

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>

* capacitymanagertest fix

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>

* change

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>

* fix missing changes

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>

* address comments

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>

* revert marvin/setup.py

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>

* fix indent

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>

* use space in sql

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>

* address duplicate

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>

* update host logs

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>

* revert e36c6a5d07

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>

* fix npe in capacity calculation

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>

* move schema changes to 4.20.1 upgrade

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>

* build fix

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>

* address comments

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>

* fix build

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>

* add some more tests

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>

* checkstyle fix

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>

* remove unnecessary mocks

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>

* build fix

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>

* replace statics

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>

* engine/orchestration,utils: limit number of concurrent new agent
connections

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>

* refactor - remove unused

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>

* unregister closed connections, monitor & cleanup

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>

* add check for outdated vm filter in power sync

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>

* agent: synchronize sendRequest wait

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>

---------

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
Co-authored-by: Rohit Yadav <rohit.yadav@shapeblue.com>
2025-02-01 12:28:41 +05:30
Vishesh
a4224e58cc
Improve logging to include more identifiable information (#9873)
* Improve logging to include more identifiable information for kvm plugin

* Update logging for scaleio plugin

* Improve logging to include more identifiable information for default volume storage plugin

* Improve logging to include more identifiable information for agent managers

* Improve logging to include more identifiable information for Listeners

* Replace ids with objects or uuids


* Improve logging to include more identifiable information for engine

* Improve logging to include more identifiable information for server

* Fixups in engine

* Improve logging to include more identifiable information for plugins

* Improve logging to include more identifiable information for Cmd classes

* Fix toString method for StorageFilterTO.java
2025-01-06 16:42:37 +05:30
Alain-Christian Courtines
afdf4d7d46
Add limit configuration for number of projects (#9172) 2024-07-15 08:47:14 +02:00
Vishesh
cc8dc84f64
server: fix resource reservation leakage (#9169)
Co-authored-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
2024-06-10 12:30:06 +05:30
Vishesh
21af134087
Fix exceeding of resource limits with powerflex (#9008)
* Fix exceeding of resource limits with powerflex

* Add e2e tests

* Update server/src/main/java/com/cloud/vm/UserVmManagerImpl.java

Co-authored-by: Suresh Kumar Anaparti <sureshkumar.anaparti@gmail.com>

* fixup

---------

Co-authored-by: Suresh Kumar Anaparti <sureshkumar.anaparti@gmail.com>
2024-05-08 20:55:19 +05:30
Vishesh
cfdb33a052
Fixup resource limit checks (#8935) 2024-04-25 12:59:35 +02:00
Vishesh
63a0797b18
Introduce scheduled executor wrapper with dynamic interval (#8916)
* Introduce scheduled executor wrapper with dynamic interval

* Add validation for configkey
2024-04-17 15:15:37 +05:30
Vishesh
ebaf5a47b9
Speedup resource count calculation (#8903)
* Speed up resource count calculation

* Refactor resource count calculation

* Start transaction for updateCountByDeltaForIds
2024-04-17 14:21:30 +05:30
Vishesh
e87c6cfcb1
Fix resource count discrepancies (#8302)
* Fix resource count discrepancies

* Fixup while removing vm

* Fix discrepancies when starting VMs

* Fixup tests

* Fix failing tests

* Don't take lock when amount is negative

---------

Co-authored-by: dahn <daan@onecht.net>
2024-03-13 18:22:44 +05:30
Abhishek Kumar
592038a304
api,server,ui: granular resource limit management (#8362)
Feature spec: https://cwiki.apache.org/confluence/display/CLOUDSTACK/Granular+Resource+Limit+Management

Introduces the concept of tagged resource limits for granular resource limit management. Limits can be enforced on accounts and domains for the deployment of entities for a tagged resource. Current tagged resource limits can be used for the following resource types,

Host limits
- user_vm
- cpu
- memory

Storage limits
- volume
- primary_storage

Following global settings can used to specify tags for which limit needs to be enforced,

Host: `resource.limit.host.tags`
Storage: `resource.limit.storage.tags`

Option for specifying tagged resource limits and viewing tagged resource usage are made available in the UI.

Enhances the use of templatetag for VM deployment and template creation

Adds option to list service/compute offerings that can be used with a given template. A new parameter named templateid has been added.

Adds option to list disk offering with suitability flag for a virtual machine. A new parameter named virtualmachineid has been added to the listDiskOfferings API which when passed returns suitableforvirtualmachine param in the response.
2024-02-19 14:17:34 +05:30
João Jandre
49cecaed06
Normalize loggers and upgrade log4j 1.2 to log4j 2.19 (#7131)
* Normalize logs

All classes that could have their loggers inherited from their fathers had their own loggers deleted;
Most loggers didn't have to be static, so most of them were normalized so that they wouldn't be;
All loggers are protected now;
Static logger's name are now 'LOGGER';
Non-static logger's name are now 'logger';
New class DbUpgradeAbstractImpl created so that all Upgraders extend it and inherit its logger

* Upgrade log4j

* fix errors caused by the merge

* Refactor cglibThrowableRenderer functionality to log4j2 and upgrade the last configuration files

* fix sonarcloud bug

* Fix errors caused by merge, remove some unused loggers, and rename a variable that was mistakenly renamed on the normalization commit

* Readd snmpTrapAppender, remove TestAppender

* Regenerate changes

* regenerate changes

* refactor last custom appender

* fix systemvm configuration xml

* Regenerate changes

* Regenerate changes

* regenerate changes

* Regenerate changes

* regenerate changes

* regenerate changes

* regenerate changes

* Fix utils pom

* fix some tests

* regenerate changes

* Fix jar being printed on exception

* fix logging in system VMs, fix commands not having log4j2 classpath.

* regenerate changes

* Fix some unwanted renomeations

* fix end of file

* regenerate changes

* regenerate changes

* fix merge error

* regenerate changes

* fix tests

* regenerate changes

* regenerate changes

* regenerate changes

* regenerate changes

* regenerate changes

* regenerate changes

* regenerate changes

* readd reload4j to tungsten as juniper depends on it

* Regenerate changes

* regenerate changes

* regenerate changes

* regenerate changes

* regenerate changes

* re-add reload4j dependency to network-contrail, as juniper depends on it

* regenerate changes

* regenerate changes

* regenerate changes

* fix typo

* regenerate changes

* regenerate changes

* Fix end of files

* regenerate changes

* add logj42 to cloud-utils-SHADED.jar

* regenerate changes

* regenerate changes

* regenerate changes

* regenerate changes

* regenerate changes

* regenerate changes

* regenerate changes

* regenerate changes

* Regenerate changes

* Regenerate changes

* Regenerate changes

* regenerate changes

* Regenerate changes

* regenerate changes

* Regenerate changes

* Regenerate changes

* Regenerate changes

* regenerate changes

* Regenerate changes

* Regenerate changes

* fix some tests

* Regenerate changes

* Regenerate changes

* fix test

* Regenerate changes

* Regenerate changes
2024-02-08 09:55:41 -03:00
Nicolas Vazquez
dccd37af50
Run ResourceCountCheckTask only in the longest running management server (#7977)
* Run recalculation recurrent task only in the longest running management server

* Address review comments
2023-10-12 14:21:39 +05:30
GaOrtiga
819dd7b75c
server: remove supportedOwner from Resource.ResourceType (#7416) 2023-08-30 11:29:16 +02:00
Daan Hoogland
7b64236469 Merge release branch 4.18 to main
* 4.18:
  server: remove registered userdata when cleanup an account (#7777)
  server: Use max secondary storage defined on the account during upload  (#7441)
  test: upgrade kubernetes versions to 1.25.0/1.26.0 (#7685)
  kvm: Added VNI Devices as normal bridge slave devs (#7836)
  noVNC: fix JP keyboard on vmware7+ which uses websocket URL (#7694)
2023-08-10 14:50:46 +02:00
João Jandre
fdb23dae40
server: Use max secondary storage defined on the account during upload (#7441) 2023-08-10 11:39:40 +02:00
Rohit Yadav
62a8f4ef72 Merge remote-tracking branch 'origin/4.18' 2023-07-24 15:57:37 +05:30
Abhishek Kumar
cee7a713aa
server: clear resource reservation and increment resource count in a transaction (#7724)
This PR addresses rare case of potential overlap of resource reservation and resource count.
For different resource types there could be some delay between incrementing of the resource count and clearing of the earlier done reservation. This may result in failures when there are parallel deployments happening.

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
2023-07-21 10:55:51 +05:30
mprokopchuk
70d5470f48
If ResourceCountCheckTask throws an exception the scheduled task is not going to run again until the management servers are restarted. (#7670)
Co-authored-by: Maxim Prokopchuk <mprokopchuk@apple.com>
2023-07-04 08:45:15 +02:00
Oscar Sandoval
b6443a2b1f
increase log detail for limit checking, fix getDomainReservation() (#7506)
In troubleshooting ops issues we see logs like:

Maximum domain resource limits of Type 'user_vm' for Domain Id = 763 is exceeded: Domain Resource Limit = (1 bytes) 1, Current Domain Resource Amount = (0 bytes) 0, Requested Resource Amount = (1 bytes) 1."

However there is one missing value (currentResourceReservation) that is used in the calculation of limit check but it is not logged, which leads to confusion. Above we see we are using “0” and requested 1, with our limit being 1, but was rejected. Without logging all the values used in the calculation we don’t understand why it failed.

Additionally, if we had this log above it would be clearer that a second bug is occurring. When we query for domain level resource reservations in “getDomainReservation” the actual SearchBuilder is the listAccountAndTypeSearch, not the listDomainAndTypeSearch. As a result, when we call getDomainReservation the query returns any outstanding domain reservation for any account, as domain ID is not a valid filter for the account search.

This PR:

Increases detailed information in log for checking resource limit to include reservations information for functions: checkDomainResourceLimit() and checkAccountResourceLimit

Fixes getDomainReservation() to use listDomainAndTypeSearch instead of listAccountAndTypeSearch

Co-authored-by: Oscar Sandoval <osandovalocana@apple.com>
2023-05-12 12:53:18 +05:30
dahn
bbc1260576
Resource reservation framework (#6694)
This PR addresses parallel resource allocation as a generalization of the problem and solution described in #6644. Instead of the Global lock on the resources a reservation record is created which is added in the resource check count in the ResourceLimitService/ResourceLimitManagerImpl. As a convenience a CheckedReservation is created. This is an implementation of AutoClosable and can be used as a guard in a try-with-resource fashion. The close method of the CheckedReservation wil delete the reservation record.

Co-authored-by: Boris Stoyanov - a.k.a Bobby <bss.stoyanov@gmail.com>
2022-09-16 15:44:35 +05:30
Suresh Kumar Anaparti
75da982d73
Updated resource counter to include correct size after volume creation/resize and other improvements (#6587)
* Updated resource counter to include correct size after volume creation/resize and other improvements
- Recalculate resource counters for root domain in the periodic task
- Update correct size in the primary_storage resource counter after volume creation/resize
- Some code improvements

* review and sonarcloud issues

Co-authored-by: Suresh Kumar Anaparti <suresh.anaparti@shapeblue.com>
Co-authored-by: Daan Hoogland <daan@onecht.net>
2022-08-16 10:41:42 +02:00
Harikrishna
089e9647f1
Fix global setting reference for max secondary storage (#6496)
* Fix global setting reference for max secondary storage usage based on account or project

* Changed a variable naming

* Replaced config enum usage with configkey class for global settings

* Fixed grammar mistake

* Fixed code smells
2022-06-30 11:42:58 +05:30
JoaoJandre
5f07ddaca9
Refactor account type (#6048)
* Refactor account type

* Added license.

* Address reviews

* Address review.

Co-authored-by: João Paraquetti <joao@scclouds.com.br>
Co-authored-by: Joao <JoaoJandre@gitlab.com>
2022-03-09 11:14:19 -03:00
Pearl Dsilva
74bb80687d
resource limit: Fix resource limit check on VM start (#5428)
* resource limit: Fix resource limit check on VM start

* add check to validate if cpu/memory are within limits for custom offering + exception handling

* unit tests

Co-authored-by: utchoang <hoangnm@unitech.vn>
2021-09-24 09:51:16 +05:30
Suresh Kumar Anaparti
958182481e cloudstack: make code more inclusive
Inclusivity changes for CloudStack

- Change default git branch name from 'master' to 'main' (post renaming/changing default git branch to 'main' in git repo)
- Rename some offensive words/terms as appropriate for inclusiveness.

This PR updates the default git branch to 'main', as part of #4887.

Signed-off-by: Suresh Kumar Anaparti <suresh.anaparti@shapeblue.com>
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
2021-06-08 15:47:20 +05:30
Spaceman1984
b586eb22f1
Human readable sizes in logs (#4207)
This PR adds outputting human readable byte sizes in the management server logs, agent logs, and usage records. A non-dynamic global variable is added (display.human.readable.sizes) to control switching this feature on and off. This setting is sent to the agent on connection and is only read from the database when the management server is started up. The setting is kept in memory by the use of a static field on the NumbersUtil class and is available throughout the codebase.

Instead of seeing things like:
2020-07-23 15:31:58,593 DEBUG [c.c.a.t.Request] (AgentManager-Handler-12:null) (logid:) Seq 8-1863645820801253428: Processing: { Ans: , MgmtId: 52238089807, via: 8, Ver: v1, Flags: 10, [{"com.cloud.agent.api.NetworkUsageAnswer":{"routerName":"r-224-VM","bytesSent":"106496","bytesReceived":"0","result":"true","details":"","wait":"0",}}] }

The KB MB and GB values will be printed out:

2020-07-23 15:31:58,593 DEBUG [c.c.a.t.Request] (AgentManager-Handler-12:null) (logid:) Seq 8-1863645820801253428: Processing: { Ans: , MgmtId: 52238089807, via: 8, Ver: v1, Flags: 10, [{"com.cloud.agent.api.NetworkUsageAnswer":{"routerName":"r-224-VM","bytesSent":"(104.00 KB) 106496","bytesReceived":"(0 bytes) 0","result":"true","details":"","wait":"0",}}] }

FS: https://cwiki.apache.org/confluence/display/CLOUDSTACK/Human+Readable+Byte+sizes
2020-08-13 15:55:16 +05:30
Wei Zhou
ac581d1546
New feature: Resource count (CPU/RAM) take only running vms into calculation (#3760)
* marvin: check resource count of more types

* New feature: add flag resource.count.running.vms.only to count resource consumption of only running vms

Stopped VMs do not use CPU/RAM actually.
A new global configuration resource.count.running.vms.only is added to determine whether resource (cpu/memory) of only running vms (including Starting/Stopping) will be taken into calculation of resource consumption.

* Add integration test for resource count of only running vms
2020-01-30 10:36:50 +01:00
Rohit Yadav
323d381767 Merge remote-tracking branch 'origin/4.11'
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
2018-10-29 16:27:08 +05:30
Rafael Weingärtner
7f934c0e86 Formatting to make checkstyle happy 2018-02-01 11:41:56 -02:00
Rafael Weingärtner
5d545023fc [CLOUDSTACK-9338] ACS not accounting resources of VMs with custom service offering
ACS is accounting the resources properly when deploying VMs with custom service offerings. However, there are other methods (such as updateResourceCount) that do not execute the resource accounting properly, and these methods update the resource count for an account in the database. Therefore, if a user deploys VMs with custom service offerings, and later this user calls the “updateResourceCount” method, it (the method) will only account for VMs with normal service offerings, and update this as the number of resources used by the account. This will result in a smaller number of resources to be accounted for the given account than the real used value. The problem becomes worse because if the user starts to delete these VMs, it is possible to reach negative values of resources allocated (breaking all of the resource limiting for accounts). This is a very serious attack vector for public cloud providers!
2018-02-01 10:59:16 -02:00
Marc-Aurèle Brothier
893a88d225 CLOUDSTACK-10105: Use maven standard project structure in all projects (#2283)
Remove maven standard module (which only a few were using) and get ride of maven customization for the projects structure.

- moved all directories to src/main/java, src/main/resources, src/main/scripts, src/test/java, src/test/resources
- grep scan to search for src/com and src/org left over
- grep for <project>/scripts to fix pom.xml configuration
- remove custom <build> configuration in pom.xml

Signed-off-by: Marc-Aurèle Brothier <m@brothier.org>
2018-01-20 03:19:27 +05:30