[ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact

Vladislav Bogdanov bubble at hoster-ok.com
Fri Jan 1 11:34:39 CET 2016


31.12.2015 15:33:45 CET, Bogdan Dobrelya <bdobrelia at mirantis.com> wrote:
>On 31.12.2015 14:48, Vladislav Bogdanov wrote:
>> Blackbox tracing inside pacemaker (USR1, USR2 and TRAP signals, iirc);
>> a quick Google search should point you to Andrew's blog with all the
>> information about that feature.
>> Next, if you use ocf-shellfuncs in your RA, you could enable tracing
>> for the resource itself: just add 'trace_ra=1' to every operation
>> config (start and monitor).
>
>Thank you, I will try to play with these things once I have the issue
>reproduced again. Cannot provide CIB as I don't have the env now.
>
>But still, let me ask again: does anyone know of any known or fixed
>bugs where corosync with pacemaker stops running monitor actions for a
>resource at some point, while notifications are still logged?
>
>Here is an example:
>node-16 crmd:
>2015-12-29T13:16:49.113679+00:00 notice:    notice: process_lrm_event:
>Operation p_rabbitmq-server_monitor_27000: unknown error
>(node=node-16.test.domain.local, call=254, rc=1, cib-update=1454,
>confirmed=false)
>node-17:
>2015-12-29T13:16:57.603834+00:00 notice:    notice: process_lrm_event:
>Operation p_rabbitmq-server_monitor_103000: unknown error
>(node=node-17.test.domain.local, call=181, rc=1, cib-update=297,
>confirmed=false)
>node-18:
>2015-12-29T13:20:16.870619+00:00 notice:    notice: process_lrm_event:
>Operation p_rabbitmq-server_monitor_103000: not running
>(node=node-18.test.domain.local, call=187, rc=7, cib-update=306,
>confirmed=false)
>node-20:
>2015-12-29T13:20:51.486219+00:00 notice:    notice: process_lrm_event:
>Operation p_rabbitmq-server_monitor_30000: not running
>(node=node-20.test.domain.local, call=180, rc=7, cib-update=308,
>confirmed=false)
>
>After that point, only notifications were logged for the affected
>nodes, for example:
>Operation p_rabbitmq-server_notify_0: ok
>(node=node-20.test.domain.local, call=287, rc=0, cib-update=0,
>confirmed=true)
>
>Node-19, however, was not affected: its monitor/stop/start/notify
>actions were logged the whole time, for example:
>2015-12-29T14:30:00.973561+00:00 notice:    notice: process_lrm_event:
>Operation p_rabbitmq-server_monitor_30000: not running
>(node=node-19.test.domain.local, call=423, rc=7, cib-update=438,
>confirmed=false)
>2015-12-29T14:30:01.631609+00:00 notice:    notice: process_lrm_event:
>Operation p_rabbitmq-server_notify_0: ok
>(node=node-19.test.domain.local, call=424, rc=0, cib-update=0,
>confirmed=true)
>2015-12-29T14:31:19.084165+00:00 notice:    notice: process_lrm_event:
>Operation p_rabbitmq-server_stop_0: ok (node=node-19.test.domain.local,
>call=427, rc=0, cib-update=439, confirmed=true)
>2015-12-29T14:32:53.120157+00:00 notice:    notice: process_lrm_event:
>Operation p_rabbitmq-server_start_0: unknown error
>(node=node-19.test.domain.local, call=428, rc=1, cib-update=441,
>confirmed=true)

Well, not running and not logged are not the same thing. I do not have access to the code right now, but I'm pretty sure that successful recurring monitors are not logged after the first run. Enabling trace_ra for the monitor op should prove that. If the trace shows the monitor really is not being run, then it should be a bug. I recall something was fixed in that area recently.
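
To compare what crmd logged with what a trace_ra trace shows, it may help to first summarize the logged operations per node. Below is a rough sketch in Python, not anything shipped with pacemaker: the regular expression is modelled only on the log excerpts quoted above, and the names `parse_events` and `last_monitor_per_node` are made up for illustration. It only shows what was logged; it cannot tell whether an unlogged monitor actually ran.

```python
import re

# Matches the "Operation ..." part of a crmd process_lrm_event message.
# Assumption: the format is exactly as in the excerpts quoted in this thread.
EVENT_RE = re.compile(
    r"Operation (?P<rsc>\S+)_(?P<op>[a-z]+)_(?P<interval>\d+): "
    r"(?P<status>[^(]+?) "
    r"\(node=(?P<node>[^,]+), call=(?P<call>\d+), rc=(?P<rc>\d+), "
    r"cib-update=(?P<cib_update>\d+), confirmed=(?P<confirmed>true|false)\)"
)


def parse_events(text):
    """Return one dict per logged operation found in text, in log order."""
    # Drop mail quote markers and re-join wrapped lines before matching.
    lines = (line.lstrip("> ").strip() for line in text.splitlines())
    flat = " ".join(line for line in lines if line)
    events = []
    for match in EVENT_RE.finditer(flat):
        event = match.groupdict()
        for key in ("interval", "call", "rc", "cib_update"):
            event[key] = int(event[key])
        events.append(event)
    return events


def last_monitor_per_node(events):
    """Map each node to the last recurring monitor event logged for it."""
    last = {}
    for event in events:
        # interval 0 means a one-shot operation, not a recurring monitor
        if event["op"] == "monitor" and event["interval"] > 0:
            last[event["node"]] = event
    return last


# Two entries taken from the excerpts above, with line wraps repaired.
SAMPLE = """\
>Operation p_rabbitmq-server_monitor_103000: not running
>(node=node-18.test.domain.local, call=187, rc=7, cib-update=306,
>confirmed=false)
>Operation p_rabbitmq-server_notify_0: ok
>(node=node-20.test.domain.local, call=287, rc=0, cib-update=0,
>confirmed=true)
"""

if __name__ == "__main__":
    for node, event in last_monitor_per_node(parse_events(SAMPLE)).items():
        print(node, event["interval"], event["rc"], event["status"])
```

A node that appears in the parsed events but has no entry in `last_monitor_per_node` (node-20 in the sample) only had non-monitor operations logged in the analyzed window, which is the case where a trace_ra run would be worth checking.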

Best,
Vladislav



