[ClusterLabs] Problem with the cluster becoming mostly unresponsive

Fri May 14 15:04:21 EDT 2021

Hi all,

  I've run into an issue a couple of times now, and I'm not really sure
what's causing it. I've got a RHEL 8 cluster that, after a while, will
show one or more resources as 'FAILED'. When I try to do a cleanup, it
marks the resources as stopped, despite them still running. After that,
all attempts to manage the resources cause no change. The pcs command
seems to have no effect, and in some cases refuses to return.

The logs from the nodes (filtered for 'pcs' and 'pacem' since boot) are
here (resources running on node 2):

- https://www.alteeve.com/files/an-a02n01.pacemaker_hang.2021-05-14.txt
- https://www.alteeve.com/files/an-a02n02.pacemaker_hang.2021-05-14.txt

  For example, it took 20 minutes for 'pcs cluster stop' to
complete. (Note that I tried restarting the pcsd daemon while waiting.)

  BTW, I see the errors about fence_delay metadata; those will be fixed,
and I don't believe they're related.

  Any advice on what happened, how to avoid it, and how to clean up
without a full cluster restart, should it happen again?

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould