[ClusterLabs] Antw: Re: SLES11 SP4:SBD fencing problem with Xen (NMI not handled)?

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Wed Aug 1 07:59:12 EDT 2018


Hi!

One thing I found out in the meantime is that hpwdt ("HP iLO2+ HW Watchdog
Timer") calls panic() in hpwdt_pretimeout(). However, panic() never returns, so
the notify_die() call from do_nmi() never finishes. Possibly this never worked,
with or without Xen.
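
To illustrate the control flow described above, here is a toy model in Python.
It is not kernel code: run_chain(), hpwdt_like(), later_handler() and the Panic
class are hypothetical stand-ins for notify_die(), hpwdt_pretimeout(), any
handler registered after it, and panic(). It only shows that a handler which
never returns keeps the rest of the chain from running.

```python
# Toy model only (not kernel code): a notifier chain in which one handler
# never returns, standing in for hpwdt_pretimeout() calling panic().

class Panic(BaseException):
    """Hypothetical stand-in for panic(): control never returns to the caller."""


def run_chain(handlers, log):
    """Call each handler in order, like a much simplified notify_die()."""
    for handler in handlers:
        handler(log)
    # Only reached if every handler returned.
    log.append("chain finished")


def hpwdt_like(log):
    """Stand-in for hpwdt_pretimeout(): logs, then 'panics'."""
    log.append("hpwdt pretimeout")
    raise Panic("panic() called")  # does not return normally


def later_handler(log):
    """Any handler registered after the watchdog handler."""
    log.append("later handler")


def simulate():
    """Run the chain and record what actually executed."""
    log = []
    try:
        run_chain([hpwdt_like, later_handler], log)
    except Panic:
        log.append("halted in panic")
    return log
```

In this model simulate() records the pretimeout handler and the panic, but
neither "later handler" nor "chain finished".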

The interesting thing is that the HP hardware watchdog is the one that
triggered the NMI AFAIK.

As the server should also have some IPMI watchdog, I wonder whether I could
use that instead.
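
Before switching, it may help to see which watchdog drivers are loaded at all.
The following is a minimal sketch, assuming the usual /proc/modules layout
(module name as the first field on each line). The helper name
loaded_watchdogs() and the list of candidate modules are my own choices for
illustration, not something SBD provides.

```python
# Sketch: report which well-known watchdog modules appear in /proc/modules text.
# The candidate list is an assumption; extend it for other hardware.
KNOWN_WATCHDOGS = ("hpwdt", "ipmi_watchdog", "xen_wdt", "softdog")


def loaded_watchdogs(proc_modules_text, known=KNOWN_WATCHDOGS):
    """Return the known watchdog modules found in the given /proc/modules text."""
    names = {
        line.split()[0]  # first field is the module name
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    # Keep the order of the candidate list for stable output.
    return [name for name in known if name in names]
```

On a live node one would pass it the contents of /proc/modules, for example
loaded_watchdogs(open("/proc/modules").read()).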

Regards,
Ulrich

>>> Edwin Török <edvin.torok at citrix.com> wrote on 30.07.2018 at 11:20 in
message <44d67d56-d7a7-3af3-64ef-4f24ed0aba6e at citrix.com>:
> On 30/07/18 08:24, Ulrich Windl wrote:
>> Hi!
>> 
>> We have a strange problem on one cluster node running Xen PV VMs (SLES11
>> SP4): After updating the kernel and adding new SBD devices (to replace an
>> old storage system), the system just seems to freeze.
> 
> Hi,
> 
> Which version of Xen are you using and what Linux distribution is run in
> Dom0?
> 
>> Closer inspection showed that SBD seems to send an NMI (for reasons still
>> to be examined), and the current Xen/Kernel seems to be unable to handle the
>> NMI in a way that forces a restart of the server (see attached
>> screenshot).
> 
> Can you show us your kernel boot cmdline, and loaded modules?
> Which watchdog module did you load? Have you tried xen_wdt?
> See https://www.suse.com/support/kb/doc/?id=7016880 
> 
> Best regards,
> --Edwin
> 
>> 
>> The last message I see in the node's cluster log is this:
>> Jul 27 11:33:32 [15731] h01        cib:     info:
>> cib_file_write_with_digest:      Reading cluster configuration file
>> /var/lib/pacemaker/cib/cib.YESngs (digest:
>> /var/lib/pacemaker/cib/cib.Yutv8O)
>> 
>> Other nodes have these messages:
>> Jul 27 11:33:32 h05 dlm_controld.pcmk[15810]: dlm_process_node: Skipped
>> active node 739512330: born-on=3864, last-seen=3936, this-event=3936,
>> last-event=3932
>> 
>> Jul 27 11:33:32 h10 dlm_controld.pcmk[20397]: dlm_process_node: Skipped
>> active node 739512325: born-on=3856, last-seen=3936, this-event=3936,
>> last-event=3932
>> 
>> Can anybody shed some light on this issue?
>> 1) Under what circumstances is an NMI sent by SBD?
>> 2) What is the reaction expected after receiving an NMI?
>> 3) If it did work before, what could have gone wrong?
>> 
>> I wanted to get some feedback from here before asking SLES support...
>> 
>> Regards,
>> Ulrich
>> 
>> 
>> 
>> 
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org 
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>> 
>> Project Home: http://www.clusterlabs.org 
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
>> Bugs: http://bugs.clusterlabs.org 
>> 