The watchdog timer adds functionality where the Hypervisor can detect if an
instance has crashed or stopped functioning.
The watchdog timer adds functionality where the Hypervisor can detect if an
instance has crashed or stopped functioning.
When the Instance has the 'watchdog' daemon running it will send heartbeats
to the /dev/watchdog device.
If these heartbeats are no longer received by the HV it will reset the Instance.
If the Instance never sends the heartbeats the HV does not take action. It only
takes action if it stops sending heartbeats.
This is supported since Libvirt 0.7.3 and can be defined in the XML format as
described in the docs: https://libvirt.org/formatdomain.html#elementsWatchdog
To the 'devices' section this will be added:
In the agent.properties the action to be taken can be defined:
vm.watchdog.action=reset
The same goes for the model. The Intel i6300esb is however the most commonly used.
vm.watchdog.model=i6300esb
When the Instance has the 'watchdog' daemon running it will send heartbeats
to the /dev/watchdog device.
If these heartbeats are no longer received by the HV it will reset the Instance.
If the Instance never sends the heartbeats the HV does not take action. It only
takes action if it stops sending heartbeats.
This is supported since Libvirt 0.7.3 and can be defined in the XML format as
described in the docs: https://libvirt.org/formatdomain.html#elementsWatchdog
To the 'devices' section this will be added:
<watchdog model='i6300esb' action='reset'/>
In the agent.properties the action to be taken can be defined:
vm.watchdog.action=reset
The same goes for the model. The Intel i6300esb is however the most commonly used.
vm.watchdog.model=i6300esb
Signed-off-by: Wido den Hollander <wido@widodh.nl>
This causes VM deployment failure on the host that was disabled while adding the storage repository.
In the attachCluster function of the PrimaryDataStoreLifeCycle, we were only selecting hosts that are up and are in enabled state. Here if we select all up hosts, it will populate the DB properly and will fix this issue. Also added a unit test for attachCluster function.
If there are multiple files with the same name on vmware datastore, search operation may select any one file during volume related operations. This involves volume attach/detach, volume download, volume snapshot etc.
While using NetApp as the backup solution. This has .snapshot folder on the datastore and sometimes files from this folder gets selected during volume operations and the operation fails. Because of wrong selection of file following exception can be observed while volume deletion.
2017-02-23 19:39:05,750 ERROR [c.c.s.r.VmwareStorageProcessor] (DirectAgent-304:ctx-a1dbf5d8 ac.local) delete volume failed due to Exception: java.lang.RuntimeException
Message: Cannot delete file [4cbcd46d44c53f5c8244c0aad26a97e1] .snapshot/hourly.2017-02-23_1605/r-97-VM/ROOT-97.vmdk
To fix this behavior I have added a global configuration by name vmware.search.exclude.folders which can be comma separated list of folder paths.
I have also added a unit test to test the new method.
- All tests should pass on KVM, Simulator
- Add test cases covering FSM state transitions and actions
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
- Removed three bg thread tasks, uses FSM event-trigger based scheduling
- On successful recovery, kicks VM HA
- Improves overall HA scheduling and task submission, lower DB access
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
Host-HA offers investigation, fencing and recovery mechanisms for host that for
any reason are malfunctioning. It uses Activity and Health checks to determine
current host state based on which it may degrade a host or try to recover it. On
failing to recover it, it may try to fence the host.
The core feature is implemented in a hypervisor agnostic way, with two separate
implementations of the driver/provider for Simulator and KVM hypervisors. The
framework also allows for implementation of other hypervisor specific provider
implementation in future.
The Host-HA provider implementation for KVM hypervisor uses the out-of-band
management sub-system to issue IPMI calls to reset (recover) or poweroff (fence)
a host.
The Host-HA provider implementation for Simulator provides a means of testing
and validating the core framework implementation.
Signed-off-by: Abhinandan Prateek <abhinandan.prateek@shapeblue.com>
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
This introduces a new certificate authority framework that allows
pluggable CA provider implementations to handle certificate operations
around issuance, revocation and propagation. The framework injects
itself to `NioServer` to handle agent connections securely. The
framework adds assumptions in `NioClient` that a keystore if available
with known name `cloud.jks` will be used for SSL negotiations and
handshake.
This includes a default 'root' CA provider plugin which creates its own
self-signed root certificate authority on first run and uses it for
issuance and provisioning of certificate to CloudStack agents such as
the KVM, CPVM and SSVM agents and also for the management server for
peer clustering.
Additional changes and notes:
- Comma separate list of management server IPs can be set to the 'host'
global setting. Newly provisioned agents (KVM/CPVM/SSVM etc) will get
radomized comma separated list to which they will attempt connection
or reconnection in provided order. This removes need of a TCP LB on
port 8250 (default) of the management server(s).
- All fresh deployment will enforce two-way SSL authentication where
connecting agents will be required to present certificates issued
by the 'root' CA plugin.
- Existing environment on upgrade will continue to use one-way SSL
authentication and connecting agents will not be required to present
certificates.
- A script `keystore-setup` is responsible for initial keystore setup
and CSR generation on the agent/hosts.
- A script `keystore-cert-import` is responsible for import provided
certificate payload to the java keystore file.
- Agent security (keystore, certificates etc) are setup initially using
SSH, and later provisioning is handled via an existing agent connection
using command-answers. The supported clients and agents are limited to
CPVM, SSVM, and KVM agents, and clustered management server (peering).
- Certificate revocation does not revoke an existing agent-mgmt server
connection, however rejects a revoked certificate used during SSL
handshake.
- Older `cloudstackmanagement.keystore` is deprecated and will no longer
be used by mgmt server(s) for SSL negotiations and handshake. New
keystores will be named `cloud.jks`, any additional SSL certificates
should not be imported in it for use with tomcat etc. The `cloud.jks`
keystore is stricly used for agent-server communications.
- Management server keystore are validated and renewed on start up only,
the validity of them are same as the CA certificates.
New APIs:
- listCaProviders: lists all available CA provider plugins
- listCaCertificate: lists the CA certificate(s)
- issueCertificate: issues X509 client certificate with/without a CSR
- provisionCertificate: provisions certificate to a host
- revokeCertificate: revokes a client certificate using its serial
Global settings for the CA framework:
- ca.framework.provider.plugin: The configured CA provider plugin
- ca.framework.cert.keysize: The key size for certificate generation
- ca.framework.cert.signature.algorithm: The certificate signature algorithm
- ca.framework.cert.validity.period: Certificate validity in days
- ca.framework.cert.automatic.renewal: Certificate auto-renewal setting
- ca.framework.background.task.delay: CA background task delay/interval
- ca.framework.cert.expiry.alert.period: Days to check and alert expiring certificates
Global settings for the default 'root' CA provider:
- ca.plugin.root.private.key: (hidden/encrypted) CA private key
- ca.plugin.root.public.key: (hidden/encrypted) CA public key
- ca.plugin.root.ca.certificate: (hidden/encrypted) CA certificate
- ca.plugin.root.issuer.dn: The CA issue distinguished name
- ca.plugin.root.auth.strictness: Are clients required to present certificates
- ca.plugin.root.allow.expired.cert: Are clients with expired certificates allowed
UI changes:
- Button to download/save the CA certificates.
Misc changes:
- Upgrades bountycastle version and uses newer classes
- Refactors SAMLUtil to use new CertUtils
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
Since libvirt 1.2.2 libvirt will properly create volumes
using RBD format 2.
We can use libvirt to creates the volumes which strips a bit of
code from the CloudStack Agent's responsbility.
RBD format 2 is already used by all volumes created by CloudStack.
This format is the most recent format of RBD and is still actively
being developed.
This removes the support for Ubuntu 12.04 as that does not have the
proper libvirt version available.
Signed-off-by: Wido den Hollander wido@widodh.nl
We can use libvirt to creates the volumes which strips a bit of
code from the CloudStack Agent's responsbility.
RBD format 2 is already used by all volumes created by CloudStack.
This format is the most recent format of RBD and is still actively
being developed.
This removes the support for Ubuntu 12.04 as that does not have the
proper libvirt version available.
Signed-off-by: Wido den Hollander <wido@widodh.nl>
Consider the CPU and memory overcommit ratios with total cpu/ram values
or thresholds for host metrics. This will fix incorrect notification
(cells turning yellow/red) in the metrics view.
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
Libvirt / Qemu (KVM) does not collect statistics about these either.
On some systems it might even yield a 'internal error' from libvirt
when attempting to gather block statistics from such devices.
For example Ubuntu 16.04 (Xenial) has a issue with this.
Skip them when looping through all devices.
Signed-off-by: Wido den Hollander <wido@widodh.nl>
The 'force' option provided with the stopVirtualMachine API command is
often assumed to be a hard shutdown sent to the hypervisor, when in fact
it is for CloudStacks' internal use. CloudStack should be able to send
the 'hard' power-off request to the hosts.
When forced parameter on the stopVM API is true, power off (hard shutdown)
a VM. This uses initial changes from #1635 to pass the forced parameter
to hypervisor plugin via the StopCommand, and fixes force stop (poweroff)
handling for KVM, VMware and XenServer.
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>