[ClusterLabs] Antw: Re: Continuous master monitor failure of a resource in case some other resource is being promoted

Wed Feb 27 01:58:15 EST 2019

>>> Ken Gaillot <kgaillot at redhat.com> wrote on 26.02.2019 at 16:27 in message
<a55b93efe2a218a7a41e090f1a292e9e066ae493.camel at redhat.com>:
[...]
> 
> Actions that have been *scheduled* but not *initiated* can be aborted.
> But anytime a resource agent has been invoked, we wait for that process
> to complete.

I guess that's so the regular exit code can be received.

> 
[...]
> 
> With the current design, the only time pacemaker kills an already
> running process is if its timeout is reached. Scheduled actions can be
> cancelled, but not in-flight actions. That makes sense because killing
> a resource agent in the middle of a start/stop/promote/etc. could leave
> things in a problematic state that would require recovery.
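
To make sure I read that right, here is a minimal sketch of the "wait for completion, kill only on timeout" behaviour in Python. It is only an illustration and not pacemaker's actual code; run_action and its parameters are names I made up.

```python
import subprocess
import sys


def run_action(argv, timeout_s):
    """Run an already-initiated action; kill it only if the timeout expires.

    Hypothetical helper, not part of pacemaker.
    Returns (exit_code, timed_out); exit_code is None if the process was killed.
    """
    proc = subprocess.Popen(argv)
    try:
        # Normal case: wait for the process and collect its regular exit code.
        return proc.wait(timeout=timeout_s), False
    except subprocess.TimeoutExpired:
        # Timeout reached: the only case in which the running process is killed.
        proc.kill()
        proc.wait()  # reap the child so no zombie is left behind
        return None, True


if __name__ == "__main__":
    # A stand-in "agent" that exits with code 7 well within the timeout.
    print(run_action([sys.executable, "-c", "import sys; sys.exit(7)"], 5.0))
```

Note that there is no code path that interrupts the child before the timeout; that would be the missing "cancel".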

Reading that, I was wondering about two things:
1) Do RAs have to be reentrant? I.e., is it allowed to call "monitor" while a "start" is still in progress? AFAIK the docs don't say anything about it, and most users assume the calling sequence is strictly sequential.
2) Given 1), one could add an asynchronous "cancel" operation that tries to stop any current action, leaving the state as clean as possible. Of course a kill signal handler could try something similar, but I guess very few RAs do that.
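
A rough sketch of the signal handler variant from 2), in Python rather than shell to keep it short. Everything here is an assumption for illustration: start(), the step/undo tables and demo() are invented names, and no existing RA is known to me to work this way. Only the value 1 for OCF_ERR_GENERIC comes from the OCF return codes.

```python
import os
import signal
import time

OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1


class Cancelled(Exception):
    """Raised from the SIGTERM handler to unwind a running action."""


def _on_term(signum, frame):
    raise Cancelled()


def start(steps, undo):
    """Run (name, callable) steps in order; on SIGTERM, undo completed steps."""
    done = []
    old = signal.signal(signal.SIGTERM, _on_term)
    try:
        for name, step in steps:
            step()
            done.append(name)
        return OCF_SUCCESS
    except Cancelled:
        # Leave the state "as clean as possible": roll back in reverse order.
        for name in reversed(done):
            undo[name]()
        return OCF_ERR_GENERIC
    finally:
        signal.signal(signal.SIGTERM, old)  # restore the previous handler


def demo():
    """Simulate a cancel arriving in the middle of the second step."""
    log = []

    def mount():
        log.append("mount")

    def interrupted():
        os.kill(os.getpid(), signal.SIGTERM)  # stands in for the "cancel"
        time.sleep(1)
        log.append("never reached")

    rc = start([("mount", mount), ("import", interrupted)],
               {"mount": lambda: log.append("umount")})
    return rc, log
```

In demo() the completed "mount" step is undone and the generic error code is returned. The step that was interrupted halfway is the hard part, and the sketch does nothing to clean it up.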

An ocf-tester that does reentrant testing while producing readable logs is another challenge ;-)

> 
>> I understand that operations *on the same resource* need
>> serialization,
>> but between completely independent resources?
> 
> Not within a single transition, but a new transition can't be done
> (with the current model) until in-flight actions have completed.
> 
> Thinking about it some more, it would be easier to get around the
> problem if we made record-pending permanently true (which is the
> default in 2.0 but not 1.1). The scheduler could then be sure it knew
> about all in-flight actions, and calculate a new transition where
> actions that depend on that one are properly ordered. We'd have to add
> the concept of waiting for an action that isn't scheduled in the
> current transition.
> 
> This jogged my memory that we already have a BZ for this aspect of the
> issue:
> 
> https://bugs.clusterlabs.org/show_bug.cgi?id=5208 
> -- 
> Ken Gaillot <kgaillot at redhat.com>
> 