[ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

Klaus Wenninger kwenning at redhat.com
Tue Apr 23 04:34:14 EDT 2024


On Tue, Apr 23, 2024 at 9:53 AM NOLIBOS Christophe <
christophe.nolibos at thalesgroup.com> wrote:

> Classified as: {OPEN}
>
>
>
> Another strange thing.
>
> On RHEL 7, corosync is restarted even though the "Restart=on-failure"
> line is commented out.
>
> I also think that something changed in pacemaker's behavior, or somewhere
> else.
>

That is how it worked before the reconnection to corosync was introduced.
Previously pacemaker would fail, systemd would restart it and, in doing so,
check the services pacemaker depends on; finding corosync not running,
systemd would restart it as well.
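
For reference, a rough sketch of the pacemaker.service directives that make
that chain work (an illustration, not the complete shipped unit file; check
the unit installed on your system for the exact contents):

  # /usr/lib/systemd/system/pacemaker.service (excerpt, approximate)
  [Unit]
  # pacemaker is ordered after corosync and pulls it in
  After=corosync.service
  Requires=corosync.service

  [Service]
  # if pacemakerd dies abnormally, systemd restarts the unit,
  # and starting it again also starts the required corosync.service
  Restart=on-failure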

Klaus


>
>
> *From:* Klaus Wenninger <kwenning at redhat.com>
> *Sent:* Monday, April 22, 2024 12:41
> *To:* NOLIBOS Christophe <christophe.nolibos at thalesgroup.com>
> *Cc:* Cluster Labs - All topics related to open-source clustering
> welcomed <users at clusterlabs.org>
> *Subject:* Re: [ClusterLabs] "pacemakerd: recover properly from Corosync
> crash" fix
>
>
>
>
>
>
>
> On Mon, Apr 22, 2024 at 12:32 PM NOLIBOS Christophe <
> christophe.nolibos at thalesgroup.com> wrote:
>
> Classified as: {OPEN}
>
>
>
> You are right: the "Restart=on-failure" line is commented out and so
> disabled by default.
>
> Uncommenting it resolves my issue.
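>
> For what it's worth, a minimal sketch of enabling that without editing the
> vendor unit file (so the change survives package updates) would be a
> systemd drop-in; the file name here is just an example:
>
>   # /etc/systemd/system/corosync.service.d/restart.conf
>   # (can be created with 'systemctl edit corosync')
>   [Service]
>   Restart=on-failure
>
>   # make systemd pick up the new configuration
>   systemctl daemon-reload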
>
>
>
> Maybe pacemaker changed behavior here without syncing enough with
> corosync's behavior.
>
> We'll look into which approach is better: restarting corosync on failure,
> or having pacemaker restarted by systemd, which should in turn restart
> corosync as well.
>
>
>
> Klaus
>
>
>
> Thanks a lot.
>
> Christophe.
>
>
>
> *From:* Klaus Wenninger <kwenning at redhat.com>
> *Sent:* Monday, April 22, 2024 11:06
> *To:* NOLIBOS Christophe <christophe.nolibos at thalesgroup.com>
> *Cc:* Cluster Labs - All topics related to open-source clustering
> welcomed <users at clusterlabs.org>
> *Subject:* Re: [ClusterLabs] "pacemakerd: recover properly from Corosync
> crash" fix
>
>
>
>
>
>
>
> On Mon, Apr 22, 2024 at 9:51 AM NOLIBOS Christophe <
> christophe.nolibos at thalesgroup.com> wrote:
>
> Classified as: {OPEN}
>
>
>
> I used the ‘kill -9’ command.
>
> Is that a graceful exit?
>
>
>
> Looks as if the corosync unit file has Restart=on-failure disabled by
> default. I'm not aware of another mechanism that would restart corosync,
> and I think the default behavior is not to restart.
>
> The comments suggest enabling it only when using a watchdog, but that
> might just refer to the RestartSec value being used to provoke a watchdog
> reboot instead of a restart via systemd.
>
> Any signal that isn't handled by the process - a handled signal could let
> the exit code be set to 0 - should be fine.
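>
> To illustrate that with commands (a sketch, not taken from the report
> here): an unhandled signal leaves the unit in a failed state, which is
> what Restart=on-failure reacts to, while a clean stop does not:
>
>   # SIGKILL cannot be caught: the unit ends up "failed (Result: signal)"
>   kill -9 $(pidof corosync)
>   systemctl status corosync
>
>   # a clean stop exits with status 0, does not count as a failure,
>   # and so would not trigger Restart=on-failure
>   systemctl stop corosync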
>
>
>
> Klaus
>
>
>
>
>
> *From:* Klaus Wenninger <kwenning at redhat.com>
> *Sent:* Thursday, April 18, 2024 20:17
> *To:* NOLIBOS Christophe <christophe.nolibos at thalesgroup.com>
> *Cc:* Cluster Labs - All topics related to open-source clustering
> welcomed <users at clusterlabs.org>
> *Subject:* Re: [ClusterLabs] "pacemakerd: recover properly from Corosync
> crash" fix
>
>
>
>
>
> NOLIBOS Christophe <christophe.nolibos at thalesgroup.com> wrote on Thu,
> Apr 18, 2024, 19:01:
>
> Classified as: {OPEN}
>
>
>
> Hummm… my RHEL 8.8 OS has been hardened.
>
> I am wondering whether the problem comes from that.
>
>
>
> On the other hand, I get the same issue (i.e. corosync not restarted by
> systemd) with Pacemaker 2.1.5-8 deployed on RHEL 8.4 (not hardened).
>
>
>
> I’m checking.
>
>
>
> How did you kill corosync? If it exits gracefully it might not be
> restarted. Check the journal. Sorry, I can't try it myself, I'm on my
> mobile at the moment. Klaus
>
>
>
>
>
> *From:* Users <users-bounces at clusterlabs.org> *On Behalf Of* NOLIBOS
> Christophe via Users
> *Sent:* Thursday, April 18, 2024 18:34
> *To:* Klaus Wenninger <kwenning at redhat.com>; Cluster Labs - All topics
> related to open-source clustering welcomed <users at clusterlabs.org>
> *Cc:* NOLIBOS Christophe <christophe.nolibos at thalesgroup.com>
> *Subject:* Re: [ClusterLabs] "pacemakerd: recover properly from Corosync
> crash" fix
>
>
>
> Classified as: {OPEN}
>
>
>
> So, the issue is with systemd?
>
>
>
> If I run the same test on RHEL 7 (3.10.0-693.11.1.el7) with pacemaker
> 1.1.13-10, corosync is correctly restarted by systemd.
>
>
>
> [RHEL7 ~]# journalctl -f
>
> -- Logs begin at Wed 2024-01-03 13:15:41 UTC. --
>
> Apr 18 16:26:55 - systemd[1]: corosync.service failed.
>
> Apr 18 16:26:55 - systemd[1]: pacemaker.service holdoff time over,
> scheduling restart.
>
> Apr 18 16:26:55 - systemd[1]: Starting Corosync Cluster Engine...
>
> Apr 18 16:26:55 - corosync[12179]: Starting Corosync Cluster Engine
> (corosync): [  OK  ]
>
> Apr 18 16:26:55 - systemd[1]: Started Corosync Cluster Engine.
>
> Apr 18 16:26:55 - systemd[1]: Started Pacemaker High Availability Cluster
> Manager.
>
> Apr 18 16:26:55 - systemd[1]: Starting Pacemaker High Availability Cluster
> Manager...
>
> Apr 18 16:26:55 - pacemakerd[12192]:   notice: Additional logging
> available in /var/log/pacemaker.log
>
> Apr 18 16:26:55 - pacemakerd[12192]:   notice: Switching to
> /var/log/cluster/corosync.log
>
> Apr 18 16:26:55 - pacemakerd[12192]:   notice: Additional logging
> available in /var/log/cluster/corosync.log
>
>
>
> *From:* Klaus Wenninger <kwenning at redhat.com>
> *Sent:* Thursday, April 18, 2024 18:12
> *To:* NOLIBOS Christophe <christophe.nolibos at thalesgroup.com>; Cluster
> Labs - All topics related to open-source clustering welcomed <
> users at clusterlabs.org>
> *Subject:* Re: [ClusterLabs] "pacemakerd: recover properly from Corosync
> crash" fix
>
>
>
>
>
>
>
> On Thu, Apr 18, 2024 at 6:09 PM Klaus Wenninger <kwenning at redhat.com>
> wrote:
>
>
>
>
>
> On Thu, Apr 18, 2024 at 6:06 PM NOLIBOS Christophe <
> christophe.nolibos at thalesgroup.com> wrote:
>
> Classified as: {OPEN}
>
>
>
> Well… why do you say that "if corosync isn't there then this is to be
> expected and pacemaker won't recover corosync"?
>
> In my mind, Corosync is managed by Pacemaker like any other cluster
> resource, and the "pacemakerd: recover properly from Corosync crash" fix
> implemented in version 2.1.2 seems to confirm that.
>
>
>
> Nope. Startup of the stack is done by systemd, and pacemaker is just
> started after corosync is up; systemd should be responsible for keeping
> the stack up.
>
> For completeness: if you have sbd in the mix, it is started by systemd as
> well, but kind of in parallel with corosync, as part of it (in systemd
> terminology).
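>
> A quick way to see that wiring on a given host (a sketch; the exact output
> depends on the installed unit files):
>
>   # units pacemaker requires and is ordered after - expect corosync.service
>   systemctl show pacemaker -p Requires -p After
>
>   # dependency tree from pacemaker's point of view
>   systemctl list-dependencies pacemaker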
>
>
>
> The "recover" above is referring to pacemaker recovering from corosync
> going away and coming back.
>
>
>
>
>
> Klaus
>
>
>
>
>
>
> *From:* NOLIBOS Christophe
> *Sent:* Thursday, April 18, 2024 17:56
> *To:* 'Klaus Wenninger' <kwenning at redhat.com>; Cluster Labs - All topics
> related to open-source clustering welcomed <users at clusterlabs.org>
> *Cc:* Ken Gaillot <kgaillot at redhat.com>
> *Subject:* RE: [ClusterLabs] "pacemakerd: recover properly from Corosync
> crash" fix
>
>
>
> Classified as: {OPEN}
>
>
>
>
>
> [~]$ systemctl status corosync
>
> ● corosync.service - Corosync Cluster Engine
>
>    Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled;
> vendor preset: disabled)
>
>    Active: failed (Result: signal) since Thu 2024-04-18 14:58:42 UTC;
> 53min ago
>
>      Docs: man:corosync
>
>            man:corosync.conf
>
>            man:corosync_overview
>
>   Process: 2027251 ExecStop=/usr/sbin/corosync-cfgtool -H --force
> (code=exited, status=0/SUCCESS)
>
>   Process: 1324906 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS
> (code=killed, signal=KILL)
>
> Main PID: 1324906 (code=killed, signal=KILL)
>
>
>
> Apr 18 13:16:04 - corosync[1324906]:   [QUORUM] Sync joined[1]: 1
>
> Apr 18 13:16:04 - corosync[1324906]:   [TOTEM ] A new membership (1.1c8)
> was formed. Members joined: 1
>
> Apr 18 13:16:04 - corosync[1324906]:   [VOTEQ ] Waiting for all cluster
> members. Current votes: 1 expected_votes: 2
>
> Apr 18 13:16:04 - corosync[1324906]:   [VOTEQ ] Waiting for all cluster
> members. Current votes: 1 expected_votes: 2
>
> Apr 18 13:16:04 - corosync[1324906]:   [VOTEQ ] Waiting for all cluster
> members. Current votes: 1 expected_votes: 2
>
> Apr 18 13:16:04 - corosync[1324906]:   [QUORUM] Members[1]: 1
>
> Apr 18 13:16:04 - corosync[1324906]:   [MAIN  ] Completed service
> synchronization, ready to provide service.
>
> Apr 18 13:16:04 - systemd[1]: Started Corosync Cluster Engine.
>
> Apr 18 14:58:42 - systemd[1]: corosync.service: Main process exited,
> code=killed, status=9/KILL
>
> Apr 18 14:58:42 - systemd[1]: corosync.service: Failed with result
> 'signal'.
>
> [~]$
>
>
>
>
>
> *From:* Klaus Wenninger <kwenning at redhat.com>
> *Sent:* Thursday, April 18, 2024 17:43
> *To:* Cluster Labs - All topics related to open-source clustering
> welcomed <users at clusterlabs.org>
> *Cc:* Ken Gaillot <kgaillot at redhat.com>; NOLIBOS Christophe <
> christophe.nolibos at thalesgroup.com>
> *Subject:* Re: [ClusterLabs] "pacemakerd: recover properly from Corosync
> crash" fix
>
>
>
>
>
>
>
> On Thu, Apr 18, 2024 at 5:07 PM NOLIBOS Christophe via Users <
> users at clusterlabs.org> wrote:
>
> Classified as: {OPEN}
>
> I'm using RedHat 8.8 (4.18.0-477.21.1.el8_8.x86_64).
> When I kill Corosync, no new corosync process is created and pacemaker
> ends up in a failed state.
> The only solution is to restart the pacemaker service.
>
> [~]$ pcs status
> Error: unable to get cib
> [~]$
>
> [~]$systemctl status pacemaker
> ● pacemaker.service - Pacemaker High Availability Cluster Manager
>    Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled;
> vendor preset: disabled)
>    Active: active (running) since Thu 2024-04-18 13:16:04 UTC; 1h 43min ago
>      Docs: man:pacemakerd
>            https://clusterlabs.org/pacemaker/doc/
>  Main PID: 1324923 (pacemakerd)
>     Tasks: 91
>    Memory: 132.1M
>    CGroup: /system.slice/pacemaker.service
> ...
> Apr 18 14:59:02 - pacemakerd[1324923]:  crit: Could not connect to
> Corosync CFG: CS_ERR_LIBRARY
> Apr 18 14:59:03 - pacemakerd[1324923]:  crit: Could not connect to
> Corosync CFG: CS_ERR_LIBRARY
> Apr 18 14:59:04 - pacemakerd[1324923]:  crit: Could not connect to
> Corosync CFG: CS_ERR_LIBRARY
> Apr 18 14:59:05 - pacemakerd[1324923]:  crit: Could not connect to
> Corosync CFG: CS_ERR_LIBRARY
> Apr 18 14:59:06 - pacemakerd[1324923]:  crit: Could not connect to
> Corosync CFG: CS_ERR_LIBRARY
> Apr 18 14:59:07 - pacemakerd[1324923]:  crit: Could not connect to
> Corosync CFG: CS_ERR_LIBRARY
> Apr 18 14:59:08 - pacemakerd[1324923]:  crit: Could not connect to
> Corosync CFG: CS_ERR_LIBRARY
> Apr 18 14:59:09 - pacemakerd[1324923]:  crit: Could not connect to
> Corosync CFG: CS_ERR_LIBRARY
> Apr 18 14:59:10 - pacemakerd[1324923]:  crit: Could not connect to
> Corosync CFG: CS_ERR_LIBRARY
> Apr 18 14:59:11 - pacemakerd[1324923]:  crit: Could not connect to
> Corosync CFG: CS_ERR_LIBRARY
> [~]$
>
> Well, if corosync isn't there then this is to be expected, and pacemaker
> won't recover corosync.
>
> Can you check what systemd thinks about corosync (status/journal)?
>
>
>
> Klaus
>
>
>
> -----Original Message-----
> From: Ken Gaillot <kgaillot at redhat.com>
> Sent: Thursday, April 18, 2024 16:40
> To: Cluster Labs - All topics related to open-source clustering welcomed <
> users at clusterlabs.org>
> Cc: NOLIBOS Christophe <christophe.nolibos at thalesgroup.com>
> Subject: Re: [ClusterLabs] "pacemakerd: recover properly from Corosync
> crash" fix
>
> What OS are you using? Does it use systemd?
>
> What does happen when you kill Corosync?
>
> On Thu, 2024-04-18 at 13:13 +0000, NOLIBOS Christophe via Users wrote:
> > Classified as: {OPEN}
> >
> > Dear All,
> >
> > I have a question about the "pacemakerd: recover properly from
> > Corosync crash" fix implemented in version 2.1.2.
> > I have observed the issue when testing pacemaker version 2.0.5, just
> > by killing the ‘corosync’ process: Corosync was not recovered.
> >
> > I am using now pacemaker version 2.1.5-8.
> > Doing the same test, I have the same result: Corosync is still not
> > recovered.
> >
> > Please confirm the "pacemakerd: recover properly from Corosync crash"
> > fix implemented in version 2.1.2 covers this scenario.
> > If it does, did I miss something in the configuration of my cluster?
> >
> > Best regards.
> >
> > Christophe.
> >
> >
> >
> > _______________________________________________
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > ClusterLabs home: https://www.clusterlabs.org/
> --
> Ken Gaillot <kgaillot at redhat.com>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
>
>
>
>

