12 Commits

Author SHA1 Message Date
Remi Bergsma
7bce656b40 make sure sync cannot block reboot
The recent discussed improvement has the risk that if 'sync' hangs, the reboot may be delayed in the same way as the 'reboot' command would do. To work around, we're adding a 5 second timeout. If it cannot sync in 5 seconds, it will not succeed anyway and we should proceed the reset.

@snuf: Could we use your OVM3 heartbeat script for other hypervisors as well? One way to do it seems like a nice idea :-)
2015-04-09 12:18:21 +02:00
Remi Bergsma
c59308b0ee write logfile just before rebooting the host
As discussed with @wido @pyr and @nuxro added an extra log line.

Tested it and it logs fine (tested to local disk) when syncing first:
Apr  3 15:31:23 mcctest2 heartbeat: kvmheartbeat.sh system because it was unable to write the heartbeat to the storage

By the way, it did also log to the agent.log but this extra log has the benefit of ending up in the system log so you'll probably find it easier there. Existing logs:
2015-04-03 15:27:23,943 WARN  [kvm.resource.KVMHAMonitor] (Thread-24:null) write heartbeat failed: timeout, retry: 0
2015-04-03 15:28:23,944 WARN  [kvm.resource.KVMHAMonitor] (Thread-24:null) write heartbeat failed: timeout, retry: 1
2015-04-03 15:29:23,946 WARN  [kvm.resource.KVMHAMonitor] (Thread-24:null) write heartbeat failed: timeout, retry: 2
2015-04-03 15:30:23,948 WARN  [kvm.resource.KVMHAMonitor] (Thread-24:null) write heartbeat failed: timeout, retry: 3
2015-04-03 15:31:23,950 WARN  [kvm.resource.KVMHAMonitor] (Thread-24:null) write heartbeat failed: timeout, retry: 4
2015-04-03 15:31:23,950 WARN  [kvm.resource.KVMHAMonitor] (Thread-24:null) write heartbeat failed: timeout; reboot the host

This closes #145

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
2015-04-04 14:17:37 +05:30
Remi Bergsma
2b41f98346 reboot much faster in case of storage failure
When storage cannot be reached, it does not make sense to reboot as it will try to flush buffers, umount NFS mounts, etc. This will not work and thus cause a long delay. With this change, the box will reboot immediately (like pressing the reset button).
2015-04-01 19:45:16 +02:00
Edison Su
f497c7c031 Bug: HA takes a lot of time to migrate VMs (trigger HA) to another KVM
host if there are multiple storage pools in a cluster.

The issue is as follows:
1. When CloudStack detects that a host is not responding to ping
requests it'll send a fence command for this host to another host in the
cluster.
2. The agent takes a long time to respond to this check if the storage
is fenced. This is because the agent checks if the first host is writing
to its heartbeat file on all pools in the cluster. It is doing this in a
sequential manner on all storage pool.

Making a fix to get rid of sleep, wait during HA. The behavior is now
similar to Xenserver.

RB: https://reviews.apache.org/r/6133/
Send-by:devdeep.singh@citrix.com
2012-07-25 10:17:09 -07:00
David Nalley
d630fa8697 license header changes for scripts folder from Chip Childers 2012-06-23 00:58:00 -04:00
frank
2f634c0913 Switch to Apache license 2012-04-03 04:50:05 -07:00
frank
52610ffcb3 add copyright header to shell scripts 2012-01-11 18:41:53 -08:00
Edison Su
47380dc20e fix add host 2011-05-12 15:03:15 -04:00
Edison Su
d8ee7d9fc3 if storage network disconnected, reboot the host 2011-04-14 17:46:54 -04:00
Frank
92155522f2 Add license header to files 2011-04-14 11:23:14 -07:00
edison
007783f6cf add more logs when taking heartbeat, and make ha enabled even in oss 2010-11-10 09:49:03 -08:00
edison
4bc63e5c32 Enable KVM HA on nfs storage 2010-11-09 22:03:22 -08:00