<div dir="ltr">Just to add a possibly helpful observation: either the &quot;cib&quot; or the &quot;pengine&quot; process goes to ~100% CPU when these remote node errors happen.<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Sep 27, 2016 at 2:36 PM, Radoslaw Garbacz <span dir="ltr">&lt;<a href="mailto:radoslaw.garbacz@xtremedatainc.com" target="_blank">radoslaw.garbacz@xtremedatainc.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div><div>Hi,<br><br></div>I encountered the same problem with Pacemaker built from the GitHub source as of around August 22.<br><br></div>Remote nodes occasionally go offline and stay offline; their logs show the same errors. The cluster runs on AWS EC2 instances; the network works and is an unlikely cause.<br><br></div><div>Have there been any commits on GitHub recently (after August 22) addressing this issue?<br><br><br></div>Logs:<br>[...]<br>Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error: crm_abort:        crm_remote_header: Triggered assert at remote.c:119 : endian == ENDIAN_LOCAL<br>Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error: crm_remote_header:        Invalid message detected, endian mismatch: badadbbd is neither 63646330 nor the swab&#39;d 30636463<br>Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error: crm_abort:        crm_remote_header: Triggered assert at remote.c:119 : endian == ENDIAN_LOCAL<br>Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error: crm_remote_header:        Invalid message detected, endian mismatch: badadbbd is neither 63646330 nor the swab&#39;d 30636463<br>Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error: crm_abort:        crm_remote_header: Triggered assert at remote.c:119 : endian == ENDIAN_LOCAL<br>Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error: crm_remote_header:        Invalid message detected, 
endian mismatch: badadbbd is neither 63646330 nor the swab&#39;d 30636463<br>Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:     info: lrmd_remote_client_msg:   Client disconnect detected in tls msg dispatcher.<br>Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:     info: ipc_proxy_remove_provider:    <wbr>    ipc proxy connection for client ca8df213-6da7-4c42-8cb3-<wbr>b8bc0887f2ce pid 21815 destroyed because cluster node disconnected.<br>Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:     info: cancel_recurring_action:  Cancelling ocf operation monitor_all_monitor_191000<br>Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error: crm_send_tls:     Connection terminated rc = -53<br>Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error: crm_send_tls:     Connection terminated rc = -10<br>Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error: crm_remote_send:  Failed to send remote msg, rc = -10<br>Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error: lrmd_tls_send_msg:        Failed to send remote lrmd tls msg, rc = -10<br>Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:  warning: send_client_notify:       Notification of client remote-lrmd-ip-10-237-223-67:<wbr>3121/b6034d3a-e296-492f-b296-<wbr>725735d17e22 failed<br>Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:   notice: lrmd_remote_client_destroy:   <wbr>    LRMD client disconnecting remote client - name: remote-lrmd-ip-10-237-223-67:<wbr>3121 id: b6034d3a-e296-492f-b296-<wbr>725735d17e22<br>Sep 27 17:19:35 [19626] ip-10-237-223-67 pacemaker_remoted:    error: ipc_proxy_accept: No ipc providers available for uid 0 gid 0<br>Sep 27 17:19:35 [19626] ip-10-237-223-67 pacemaker_remoted:    error: handle_new_connection:    Error in connection setup (19626-21815-14): Remote I/O error (121)<br>Sep 27 17:19:50 [19626] ip-10-237-223-67 pacemaker_remoted:    error: ipc_proxy_accept: No ipc providers 
available for uid 0 gid 0<br>Sep 27 17:19:50 [19626] ip-10-237-223-67 pacemaker_remoted:    error: handle_new_connection:    Error in connection setup (19626-21815-14): Remote I/O error (121)<br>[...]<br><div><br><br><br></div></div><div class="gmail_extra"><div><div class="h5"><br><div class="gmail_quote">On Thu, Jun 9, 2016 at 12:24 AM, Narayanamoorthy Srinivasan <span dir="ltr">&lt;<a href="mailto:narayanamoorthys@gmail.com" target="_blank">narayanamoorthys@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Don&#39;t see any issues in network traffic.<div><br></div><div>Some more logs where the XML tags are incomplete:</div><div><br></div><div><div>2016-06-09T03:06:03.096449+05:<wbr>30 d18-fb-7b-18-f1-8e pacemaker_remoted[6153]:    error: Partial                       &lt;lrm_rsc_op id=&quot;fs-postgresql_last_0&quot; operation_key=&quot;fs-postgresql_s<wbr>top_0&quot; operation=&quot;stop&quot; crm-debug-origin=&quot;do_update_re<wbr>source&quot; crm_feature_set=&quot;3.0.10&quot; transition-key=&quot;225:116:0:8fbf<wbr>83fd-241b-4623-8bbe-31d92e4dfc<wbr>e1&quot; transition-magic=&quot;0:0;225:116:<wbr>0:8fbf83fd-241b-4623-8bbe-31d9<wbr>2e4dfce1&quot; on_node=&quot;d00-50-56-94-24-dd&quot; call-id=&quot;489&quot; rc-code=&quot;0&quot; op-status=&quot;0&quot; interval=&quot;0&quot; last-run=&quot;1459491026&quot; last-rc-change=&quot;1459491026&quot; exec-time=&quot;158&quot; queue-time=&quot;0&quot; op-digest=&quot;dfb0c861</div><div>2016-06-09T03:06:03.097136+05:<wbr>30 d18-fb-7b-18-f1-8e pacemaker_remoted[6153]:    error: Partial                       &lt;lrm_rsc_op id=&quot;fs-postgresql_last_failure<wbr>_0&quot; operation_key=&quot;fs-postgresql_m<wbr>onitor_0&quot; operation=&quot;monitor&quot; crm-debug-origin=&quot;do_update_re<wbr>source&quot; crm_feature_set=&quot;3.0.10&quot; transition-key=&quot;41:4:7:8fbf83f<wbr>d-241b-4623-8bbe-31d92e4dfce1&quot; 
transition-magic=&quot;0:0;41:4:7:8<wbr>fbf83fd-241b-4623-8bbe-31d92e4<wbr>dfce1&quot; on_node=&quot;d00-50-56-94-24-dd&quot; call-id=&quot;5&quot; rc-code=&quot;0&quot; op-status=&quot;0&quot; interval=&quot;0&quot; last-run=&quot;1459429072&quot; last-rc-change=&quot;1459429072&quot; exec-time=&quot;315&quot; queue-time=&quot;0&quot; op-digest=&quot;df</div><div>2016-06-09T03:06:03.097361+05:<wbr>30 d18-fb-7b-18-f1-8e pacemaker_remoted[6153]:    error: Partial                       &lt;lrm_rsc_op id=&quot;fs-postgresql_monitor_1000<wbr>0&quot; operation_key=&quot;fs-postgresql_m<wbr>onitor_10000&quot; operation=&quot;monitor&quot; crm-debug-origin=&quot;do_update_re<wbr>source&quot; crm_feature_set=&quot;3.0.10&quot; transition-key=&quot;224:107:0:8fbf<wbr>83fd-241b-4623-8bbe-31d92e4dfc<wbr>e1&quot; transition-magic=&quot;0:0;224:107:<wbr>0:8fbf83fd-241b-4623-8bbe-31d9<wbr>2e4dfce1&quot; on_node=&quot;d00-50-56-94-24-dd&quot; call-id=&quot;365&quot; rc-code=&quot;0&quot; op-status=&quot;0&quot; interval=&quot;10000&quot; last-rc-change=&quot;1459490849&quot; exec-time=&quot;185&quot; queue-time=&quot;0&quot; op-digest=&quot;cd8d3642c</div><div>2016-06-09T03:06:03.097582+05:<wbr>30 d18-fb-7b-18-f1-8e pacemaker_remoted[6153]:    error: Partial                     &lt;/lrm_resource&gt;</div><div>2016-06-09T03:06:03.097690+05:<wbr>30 d18-fb-7b-18-f1-8e pacemaker_remoted[6153]:    error: Partial                     &lt;lrm_resource id=&quot;vip-admin-database-default<wbr>-proposal-controller&quot; type=&quot;IPaddr2&quot; class=&quot;ocf&quot; provider=&quot;heartbeat&quot;&gt;</div><div>2016-06-09T03:06:03.097797+05:<wbr>30 d18-fb-7b-18-f1-8e pacemaker_remoted[6153]:    error: Partial                       &lt;lrm_rsc_op id=&quot;vip-admin-database-default<wbr>-proposal-controller_last_0&quot; operation_key=&quot;vip-admin-datab<wbr>ase-default-proposal-controlle<wbr>r_stop_0&quot; operation=&quot;stop&quot; crm-debug-origin=&quot;do_update_re<wbr>source&quot; 
crm_feature_set=&quot;3.0.10&quot; transition-key=&quot;228:116:0:8fbf<wbr>83fd-241b-4623-8bbe-31d92e4dfc<wbr>e1&quot; transition-magic=&quot;0:0;228:116:<wbr>0:8fbf83fd-241b-4623-8bbe-31d9<wbr>2e4dfce1&quot; on_node=&quot;d00-50-56-94-24-dd&quot; call-id=&quot;487&quot; rc-code=&quot;0&quot; op-status=&quot;0&quot; interval=&quot;0&quot; last-run=&quot;1459491026&quot; last-rc-chan</div><div>2016-06-09T03:06:03.098013+05:<wbr>30 d18-fb-7b-18-f1-8e pacemaker_remoted[6153]:    error: Partial                       &lt;lrm_rsc_op id=&quot;vip-admin-database-default<wbr>-proposal-controller_monitor_<wbr>10000&quot; operation_key=&quot;vip-admin-datab<wbr>ase-default-proposal-controlle<wbr>r_monitor_10000&quot; operation=&quot;monitor&quot; crm-debug-origin=&quot;do_update_re<wbr>source&quot; crm_feature_set=&quot;3.0.10&quot; transition-key=&quot;227:107:0:8fbf<wbr>83fd-241b-4623-8bbe-31d92e4dfc<wbr>e1&quot; transition-magic=&quot;0:0;227:107:<wbr>0:8fbf83fd-241b-4623-8bbe-31d9<wbr>2e4dfce1&quot; on_node=&quot;d00-50-56-94-24-dd&quot; call-id=&quot;369&quot; rc-code=&quot;0&quot; op-status=&quot;0&quot; interval=&quot;10000&quot; last-rc-chang</div><div>2016-06-09T03:06:03.098230+05:<wbr>30 d18-fb-7b-18-f1-8e pacemaker_remoted[6153]:    error: Partial                     &lt;/lrm_resource&gt;</div><div>2016-06-09T03:06:03.098337+05:<wbr>30 d18-fb-7b-18-f1-8e pacemaker_remoted[6153]:    error: Partial                     &lt;lrm_resource id=&quot;postgresql&quot; type=&quot;pgsql&quot; class=&quot;ocf&quot; provider=&quot;heartbeat&quot;&gt;</div><div>2016-06-09T03:06:03.098468+05:<wbr>30 d18-fb-7b-18-f1-8e pacemaker_remoted[6153]:    error: Partial                       &lt;lrm_rsc_op id=&quot;postgresql_last_0&quot; operation_key=&quot;postgresql_stop<wbr>_0&quot; operation=&quot;stop&quot; crm-debug-origin=&quot;do_update_re<wbr>source&quot; crm_feature_set=&quot;3.0.10&quot; transition-key=&quot;231:116:0:8fbf<wbr>83fd-241b-4623-8bbe-31d92e4dfc<wbr>e1&quot; 
transition-magic=&quot;0:0;231:116:<wbr>0:8fbf83fd-241b-4623-8bbe-31d9<wbr>2e4dfce1&quot; on_node=&quot;d00-50-56-94-24-dd&quot; call-id=&quot;481&quot; rc-code=&quot;0&quot; op-status=&quot;0&quot; interval=&quot;0&quot; last-run=&quot;1459491025&quot; last-rc-change=&quot;1459491025&quot; exec-time=&quot;1334&quot; queue-time=&quot;0&quot; op-digest=&quot;f2317cad3d54c</div><div>2016-06-09T03:06:03.099061+05:<wbr>30 d18-fb-7b-18-f1-8e pacemaker_remoted[6153]:    error: Partial                       &lt;lrm_rsc_op id=&quot;postgresql_monitor_10000&quot; operation_key=&quot;postgresql_moni<wbr>tor_10000&quot; operation=&quot;monitor&quot; crm-debug-origin=&quot;do_update_re<wbr>source&quot; crm_feature_set=&quot;3.0.10&quot; transition-key=&quot;230:107:0:8fbf<wbr>83fd-241b-4623-8bbe-31d92e4dfc<wbr>e1&quot; transition-magic=&quot;0:0;230:107:<wbr>0:8fbf83fd-241b-4623-8bbe-31d9<wbr>2e4dfce1&quot; on_node=&quot;d00-50-56-94-24-dd&quot; call-id=&quot;372&quot; rc-code=&quot;0&quot; op-status=&quot;0&quot; interval=&quot;10000&quot; last-rc-change=&quot;1459490852&quot; exec-time=&quot;424&quot; queue-time=&quot;0&quot; op-digest=&quot;873ed4f07792aa8</div></div><div><br></div></div><div><div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Jun 8, 2016 at 10:28 PM, Narayanamoorthy Srinivasan <span dir="ltr">&lt;<a href="mailto:narayanamoorthys@gmail.com" target="_blank">narayanamoorthys@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div>No recent network changes. I will check for abnormal traffic using Wireshark.<br><br></div>I also notice that the XML lines in the logs are partial (no ending &#39;&gt;&#39;, no closing &quot;, and sometimes partial words). Any line longer than 472 characters is truncated to 472 characters. I wonder whether this is due to some other limitation. 
<br><br></div>I can post some lines tomorrow when I am back at work.<br><br></div><div><div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Jun 8, 2016 at 8:00 PM, Ken Gaillot <span dir="ltr">&lt;<a href="mailto:kgaillot@redhat.com" target="_blank">kgaillot@redhat.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div>On 06/08/2016 06:14 AM, Narayanamoorthy Srinivasan wrote:<br>
&gt; I have a Pacemaker cluster with two Pacemaker Remote nodes. Recently the<br>
&gt; remote nodes started logging the errors below, and SBD started self-fencing.<br>
&gt; I would appreciate it if someone could shed light on the cause and the fix.<br>
&gt;<br>
&gt; OS - SLES 12 SP1<br>
&gt; Pacemaker Remote version - pacemaker-remote-1.1.13-14.7.x<wbr>86_64<br>
&gt;<br>
&gt; 2016-06-08T14:11:46.009073+05:<wbr>30 d18-fb-7b-18-f1-8e<br>
&gt; pacemaker_remoted[6190]:    error: XML Error: Entity: line 1: parser<br>
&gt; error : AttValue: &#39; expected<br>
&gt; 2016-06-08T14:11:46.009314+05:<wbr>30 d18-fb-7b-18-f1-8e<br>
&gt; pacemaker_remoted[6190]:    error: XML Error:<br>
&gt; key=&quot;neutron-ha-tool_monitor_0<wbr>&quot; operation=&quot;monitor&quot;<br>
&gt; crm-debug-origin=&quot;do_update_<br>
&gt; 2016-06-08T14:11:46.009443+05:<wbr>30 d18-fb-7b-18-f1-8e<br>
&gt; pacemaker_remoted[6190]:    error: XML Error:<br>
&gt;                                                      ^<br>
&gt; 2016-06-08T14:11:46.009567+05:<wbr>30 d18-fb-7b-18-f1-8e<br>
&gt; pacemaker_remoted[6190]:    error: XML Error: Entity: line 1: parser<br>
&gt; error : attributes construct error<br>
&gt; 2016-06-08T14:11:46.009697+05:<wbr>30 d18-fb-7b-18-f1-8e<br>
&gt; pacemaker_remoted[6190]:    error: XML Error:<br>
&gt; key=&quot;neutron-ha-tool_monitor_0<wbr>&quot; operation=&quot;monitor&quot;<br>
&gt; crm-debug-origin=&quot;do_update_<br>
&gt; 2016-06-08T14:11:46.009824+05:<wbr>30 d18-fb-7b-18-f1-8e<br>
&gt; pacemaker_remoted[6190]:    error: XML Error:<br>
&gt;                                                      ^<br>
&gt; 2016-06-08T14:11:46.009948+05:<wbr>30 d18-fb-7b-18-f1-8e<br>
&gt; pacemaker_remoted[6190]:    error: XML Error: Entity: line 1: parser<br>
&gt; error : Couldn&#39;t find end of Start Tag lrm_rsc_op line 1<br>
&gt; 2016-06-08T14:11:46.010070+05:<wbr>30 d18-fb-7b-18-f1-8e<br>
&gt; pacemaker_remoted[6190]:    error: XML Error:<br>
&gt; key=&quot;neutron-ha-tool_monitor_0<wbr>&quot; operation=&quot;monitor&quot;<br>
&gt; crm-debug-origin=&quot;do_update_<br>
&gt; 2016-06-08T14:11:46.010191+05:<wbr>30 d18-fb-7b-18-f1-8e<br>
&gt; pacemaker_remoted[6190]:    error: XML Error:<br>
&gt;                                                      ^<br>
&gt; 2016-06-08T14:11:46.010460+05:<wbr>30 d18-fb-7b-18-f1-8e<br>
&gt; pacemaker_remoted[6190]:    error: XML Error: Entity: line 1: parser<br>
&gt; error : Premature end of data in tag lrm_resource line 1<br>
&gt; 2016-06-08T14:11:46.010718+05:<wbr>30 d18-fb-7b-18-f1-8e<br>
&gt; pacemaker_remoted[6190]:    error: XML Error:<br>
&gt; key=&quot;neutron-ha-tool_monitor_0<wbr>&quot; operation=&quot;monitor&quot;<br>
&gt; crm-debug-origin=&quot;do_update_<br>
&gt; 2016-06-08T14:11:46.010977+05:<wbr>30 d18-fb-7b-18-f1-8e<br>
&gt; pacemaker_remoted[6190]:    error: XML Error:<br>
&gt;                                                      ^<br>
&gt; 2016-06-08T14:11:46.011234+05:<wbr>30 d18-fb-7b-18-f1-8e<br>
&gt; pacemaker_remoted[6190]:    error: XML Error: Entity: line 1: parser<br>
&gt; error : Premature end of data in tag lrm_resources line 1<br>
&gt;<br>
&gt;<br>
&gt; --<br>
&gt; Thanks &amp; Regards<br>
&gt; Moorthy<br>
<br>
</div></div>This sounds like the network traffic between the cluster nodes and the<br>
remote nodes is being corrupted. Have there been any network changes<br>
lately? Switch/firewall/etc. equipment/settings? MTU?<br>
<br>
You could try using a packet sniffer such as wireshark to see if the<br>
traffic looks abnormal in some way. The payload is XML so it should be<br>
more or less readable.<br>
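<br>
If the XML text of a message is available (for example copied from the logs, or from a capture that is not encrypted), a quick well-formedness check can separate truncated payloads from intact ones. The following is only a minimal sketch: the helper name is_well_formed and both sample strings are illustrative assumptions modelled on the log excerpts in this thread, not Pacemaker code.<br>

```python
import xml.etree.ElementTree as ET


def is_well_formed(payload: str) -> bool:
    """Return True if the payload parses as a complete XML document."""
    try:
        ET.fromstring(payload)
        return True
    except ET.ParseError:
        # Raised for unterminated attributes, missing end tags, etc.
        return False


# Hypothetical samples shaped like the lrm_resource / lrm_rsc_op log lines.
complete = (
    '<lrm_resource id="postgresql" type="pgsql" class="ocf" '
    'provider="heartbeat"></lrm_resource>'
)
truncated = (
    '<lrm_rsc_op id="postgresql_last_0" operation="stop" '
    'crm-debug-origin="do_update_'
)

print(is_well_formed(complete))
print(is_well_formed(truncated))
```

A payload cut off in the middle of an attribute value, like the second sample, fails the check in the same way the "AttValue" and "Couldn't find end of Start Tag" parser errors do in the quoted logs.<br>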
<br>
<br>
______________________________<wbr>_________________<br>
Users mailing list: <a href="mailto:Users@clusterlabs.org" target="_blank">Users@clusterlabs.org</a><br>
<a href="http://clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">http://clusterlabs.org/mailman<wbr>/listinfo/users</a><br>
<br>
Project Home: <a href="http://www.clusterlabs.org" rel="noreferrer" target="_blank">http://www.clusterlabs.org</a><br>
Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" rel="noreferrer" target="_blank">http://www.clusterlabs.org/doc<wbr>/Cluster_from_Scratch.pdf</a><br>
Bugs: <a href="http://bugs.clusterlabs.org" rel="noreferrer" target="_blank">http://bugs.clusterlabs.org</a><br>
</blockquote></div><br><br clear="all"><br>-- <br><div data-smartmail="gmail_signature">Thanks &amp; Regards<br>Moorthy</div>
</div>
</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div data-smartmail="gmail_signature">Thanks &amp; Regards<br>Moorthy</div>
</div>
</div></div><br>
<br></blockquote></div><br><br clear="all"><br>-- <br></div></div><div data-smartmail="gmail_signature"><div dir="ltr"><div>Best Regards,<br><br>Radoslaw Garbacz<br></div>XtremeData Incorporation<br></div></div>
</div>
</blockquote></div><br><br clear="all"><br>-- <br><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div>Best Regards,<br><br>Radoslaw Garbacz<br></div>XtremeData Incorporation<br></div></div>
</div>
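<div dir="ltr">A note on reading the "endian mismatch" log lines earlier in this thread: 63646330 and 30636463 are byte-swaps of each other, so badadbbd appears to be the marker the receiver expected, and neither byte order of the received word matches it. The sketch below reproduces that arithmetic. It is an illustration under that reading of the log message, not Pacemaker's implementation; the names EXPECTED_MARKER, swab32 and classify_marker are made up for the example.<br></div>

```python
# Value printed first in the log line; assumed to be the expected marker.
EXPECTED_MARKER = 0xBADADBBD


def swab32(value: int) -> int:
    """Reverse the byte order of a 32-bit unsigned integer."""
    return int.from_bytes(value.to_bytes(4, "big"), "little")


def classify_marker(received: int) -> str:
    """Classify a received 32-bit marker against the expected one."""
    if received == EXPECTED_MARKER:
        return "native"      # sender uses the same byte order
    if swab32(received) == EXPECTED_MARKER:
        return "swapped"     # sender uses the opposite byte order
    return "invalid"         # not the marker in either byte order


received = 0x63646330  # second value in the log line

print(hex(swab32(received)))        # third value in the log line
print(classify_marker(received))
print(received.to_bytes(4, "big"))  # the same four bytes viewed as ASCII
```

<div dir="ltr">The four received bytes are printable ASCII ("cdc0"), which would be consistent with the reader interpreting message text as a header, for example after losing its position in the stream. That is a guess from the logs alone, not a confirmed diagnosis.<br></div>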