[Linux-ha-jp] STONITH error during split brain


Masamichi Fukuda - elf-systems masamichi_fukud****@elf-s*****
Tue, 17 Mar 2015 14:38:47 JST


Yamauchi-san

Hello, this is Fukuda.

So I should add -x to the shebang line of stonith-helper?
I changed the first line of stonith-helper to #!/bin/bash -x and started the cluster.
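For reference, bash's `-x` flag echoes each command to stderr before executing it, which is what makes the shebang change useful for tracing the plugin. A minimal, standalone demonstration (not the stonith-helper script itself):

```shell
# bash -x prints a '+'-prefixed trace of every command to stderr.
# Capture stderr separately to see the trace apart from normal output.
trace=$(bash -xc 'echo hello' 2>&1 >/dev/null)
echo "$trace"    # the trace line, e.g. "+ echo hello"
```

When stonithd invokes the plugin, this trace goes wherever the daemon's stderr goes, so it may end up in ha-log/ha-debug rather than on a terminal.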

crm_mon shows no change from before.

# crm_mon -rfA
Last updated: Tue Mar 17 14:14:39 2015
Last change: Tue Mar 17 14:01:43 2015
Stack: heartbeat
Current DC: lbv2.beta.com (82ffc36f-1ad8-8686-7db0-35686465c624) - partition with quorum
Version: 1.1.12-561c4cf
2 Nodes configured
8 Resources configured

Online: [ lbv1.beta.com lbv2.beta.com ]

Full list of resources:

 Resource Group: HAvarnish
     vip_208    (ocf::heartbeat:IPaddr2):       Started lbv1.beta.com
     varnishd   (lsb:varnish):  Started lbv1.beta.com
 Resource Group: grpStonith1
     Stonith1-1 (stonith:external/stonith-helper):      Stopped
     Stonith1-2 (stonith:external/xen0):        Stopped
 Resource Group: grpStonith2
     Stonith2-1 (stonith:external/stonith-helper):      Stopped
     Stonith2-2 (stonith:external/xen0):        Stopped
 Clone Set: clone_ping [ping]
     Started: [ lbv1.beta.com lbv2.beta.com ]

Node Attributes:
* Node lbv1.beta.com:
    + default_ping_set                  : 100
* Node lbv2.beta.com:
    + default_ping_set                  : 100

Migration summary:
* Node lbv2.beta.com:
   Stonith1-1: migration-threshold=1 fail-count=1000000 last-failure='Tue Mar 17 14:12:16 2015'
* Node lbv1.beta.com:
   Stonith2-1: migration-threshold=1 fail-count=1000000 last-failure='Tue Mar 17 14:12:21 2015'

Failed actions:
    Stonith1-1_start_0 on lbv2.beta.com 'unknown error' (1): call=31, status=Error, last-rc-change='Tue Mar 17 14:12:14 2015', queued=0ms, exec=1065ms
    Stonith2-1_start_0 on lbv1.beta.com 'unknown error' (1): call=26, status=Error, last-rc-change='Tue Mar 17 14:12:19 2015', queued=0ms, exec=1081ms

I looked through some other logs.
This is from heartbeat startup.

# less /var/log/pm_logconv.out
Mar 17 14:11:28 lbv1.beta.com info: Starting Heartbeat 3.0.6.
Mar 17 14:11:33 lbv1.beta.com info: Link lbv2.beta.com:eth1 is up.
Mar 17 14:11:34 lbv1.beta.com info: Start "ccm" process. (pid=13264)
Mar 17 14:11:34 lbv1.beta.com info: Start "lrmd" process. (pid=13267)
Mar 17 14:11:34 lbv1.beta.com info: Start "attrd" process. (pid=13268)
Mar 17 14:11:34 lbv1.beta.com info: Start "stonithd" process. (pid=13266)
Mar 17 14:11:34 lbv1.beta.com info: Start "cib" process. (pid=13265)
Mar 17 14:11:34 lbv1.beta.com info: Start "crmd" process. (pid=13269)

# less /var/log/error
Mar 17 14:12:20 lbv1 crmd[13269]:    error: process_lrm_event: Operation Stonith2-1_start_0 (node=lbv1.beta.com, call=26, status=4, cib-update=19, confirmed=true) Error

Here is syslog grepped for stonith:

Mar 17 14:11:34 lbv1 heartbeat: [13255]: info: Starting child client "/usr/local/heartbeat/libexec/pacemaker/stonithd" (0,0)
Mar 17 14:11:34 lbv1 heartbeat: [13266]: info: Starting "/usr/local/heartbeat/libexec/pacemaker/stonithd" as uid 0  gid 0 (pid 13266)
Mar 17 14:11:34 lbv1 stonithd[13266]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: heartbeat
Mar 17 14:11:34 lbv1 heartbeat: [13255]: info: the send queue length from heartbeat to client stonithd is set to 1024
Mar 17 14:11:40 lbv1 stonithd[13266]:   notice: setup_cib: Watching for stonith topology changes
Mar 17 14:11:40 lbv1 stonithd[13266]:   notice: unpack_config: On loss of CCM Quorum: Ignore
Mar 17 14:11:40 lbv1 stonithd[13266]:  warning: handle_startup_fencing: Blind faith: not fencing unseen nodes
Mar 17 14:11:40 lbv1 stonithd[13266]:  warning: handle_startup_fencing: Blind faith: not fencing unseen nodes
Mar 17 14:11:41 lbv1 stonithd[13266]:   notice: stonith_device_register: Added 'Stonith2-1' to the device list (1 active devices)
Mar 17 14:11:41 lbv1 stonithd[13266]:   notice: stonith_device_register: Added 'Stonith2-2' to the device list (2 active devices)
Mar 17 14:12:04 lbv1 stonithd[13266]:   notice: xml_patch_version_check: Versions did not change in patch 0.5.0
Mar 17 14:12:20 lbv1 stonithd[13266]:   notice: log_operation: Operation 'monitor' [13386] for device 'Stonith2-1' returned: -201 (Generic Pacemaker error)
Mar 17 14:12:20 lbv1 stonithd[13266]:  warning: log_operation: Stonith2-1:13386 [ Performing: stonith -t external/stonith-helper -S ]
Mar 17 14:12:20 lbv1 stonithd[13266]:  warning: log_operation: Stonith2-1:13386 [ failed to exec "stonith" ]
Mar 17 14:12:20 lbv1 stonithd[13266]:  warning: log_operation: Stonith2-1:13386 [ failed:  2 ]
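The last lines show stonithd running `stonith -t external/stonith-helper -S` and failing with "failed to exec" and status 2, which usually means the `stonith` binary itself could not be found or executed from the daemon's environment. That diagnosis is an assumption on my part, but it is quick to check; the `/usr/local/heartbeat/sbin` path below is a guess based on the install prefix in this thread, not something the logs confirm:

```shell
# Check whether the stonith CLI is visible and executable from this shell.
# stonithd may run with a narrower PATH than an interactive root shell,
# so also probe the assumed install prefix directly.
for candidate in "$(command -v stonith 2>/dev/null)" /usr/local/heartbeat/sbin/stonith; do
    [ -n "$candidate" ] && [ -x "$candidate" ] && echo "executable: $candidate"
done
echo "PATH=$PATH"
```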


Thank you in advance.

Regards,




On 17 Mar 2015 at 13:32, <renay****@ybb*****> wrote:

> Fukuda-san
>
> Hello, this is Yamauchi.
>
> So the problem appears to be in the start of stonith-helper.
>
> If you put
>
> #!/bin/bash -x
>
>
> at the top of stonith-helper and start the cluster, it may tell us something.
>
> Incidentally, the stonith-helper log should also be written out somewhere...
>
>
>
> That's all.
>
> ----- Original Message -----
> >From: Masamichi Fukuda - elf-systems <masamichi_fukud****@elf-s*****>
> >To: Hideo Yamauchi <renay****@ybb*****>; "linux****@lists*****" <linux****@lists*****>
> >Date: 2015/3/17, Tue 12:31
> >Subject: Re: [Linux-ha-jp] STONITH error during split brain
> >
> >
> >Yamauchi-san
> >cc: Matsushima-san
> >
> >Hello, this is Fukuda.
> >
> >xen0 was in the same directory.
> >
> ># pwd
> >/usr/local/heartbeat/lib/stonith/plugins/external
> >
> ># ls
> >drac5           ibmrsa          kdumpcheck  riloe          vmware
> >dracmc-telnet  ibmrsa-telnet  libvirt      ssh          xen0
> >hetzner        ipmi          nut      stonith-helper  xen0-ha
> >hmchttp        ippower9258    rackpdu      vcenter
> >
> >Thank you in advance.
> >
> >Regards,
> >
> >
> >
> >2015-03-17 10:53 GMT+09:00 <renay****@ybb*****>:
> >
> >Fukuda-san
> >>cc: Matsushima-san
> >>
> >>Hello, this is Yamauchi.
> >>
> >>>There was no stdout or stderr output.
> >>>
> >>>Is something wrong with stonith-helper?
> >>>Since stonith-helper is a shell script, I had not paid much attention to how it was installed.
> >>>stonith-helper is located here:
> >>>/usr/local/heartbeat/lib/stonith/plugins/external/stonith-helper
> >>
> >>Is xen0 also in this directory?
> >>If it is not, that is a problem; try copying the stonith-helper file, with its attributes left as they are, into the same directory as xen0.
> >>
> >>If it works after that, the problem is in the pm_extras installation.
> >>
> >>That's all.
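The "copy with its attributes left as they are" step quoted above amounts to `cp -p`. A safe, self-contained illustration on temporary files (the real plugin lives at the path quoted above):

```shell
# cp -p preserves mode, ownership and timestamps -- what "keeping the
# attributes" means when relocating a stonith plugin script.
tmpdir=$(mktemp -d)
printf '#!/bin/sh\nexit 0\n' > "$tmpdir/stonith-helper"
chmod 755 "$tmpdir/stonith-helper"
cp -p "$tmpdir/stonith-helper" "$tmpdir/stonith-helper.copy"
mode=$(stat -c '%a' "$tmpdir/stonith-helper.copy")
echo "copied mode: $mode"    # → copied mode: 755
rm -r "$tmpdir"
```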
> >>
> >>----- Original Message -----
> >>>From: Masamichi Fukuda - elf-systems <masamichi_fukud****@elf-s*****>
> >>>To: Hideo Yamauchi <renay****@ybb*****>; "linux****@lists*****" <linux****@lists*****>
> >>
> >>>Date: 2015/3/17, Tue 10:31
> >>>Subject: Re: [Linux-ha-jp] STONITH error during split brain
> >>>
> >>>
> >>>Yamauchi-san
> >>>cc: Matsushima-san
> >>>
> >>>Good morning, this is Fukuda.
> >>>Thank you for the crm example.
> >>>
> >>>I went ahead and adapted it to our environment.
> >>>
> >>>$ cat test.crm
> >>>### Cluster Option ###
> >>>property \
> >>>    no-quorum-policy="ignore" \
> >>>    stonith-enabled="true" \
> >>>    startup-fencing="false" \
> >>>    stonith-timeout="710s" \
> >>>    crmd-transition-delay="2s"
> >>>
> >>>### Resource Default ###
> >>>rsc_defaults \
> >>>    resource-stickiness="INFINITY" \
> >>>    migration-threshold="1"
> >>>
> >>>### Group Configuration ###
> >>>group HAvarnish \
> >>>    vip_208 \
> >>>    varnishd
> >>>
> >>>group grpStonith1 \
> >>>    Stonith1-1 \
> >>>    Stonith1-2
> >>>
> >>>group grpStonith2 \
> >>>    Stonith2-1 \
> >>>    Stonith2-2
> >>>
> >>>### Clone Configuration ###
> >>>clone clone_ping \
> >>>    ping
> >>>
> >>>### Fencing Topology ###
> >>>fencing_topology \
> >>>    lbv1.beta.com: Stonith1-1 Stonith1-2 \
> >>>    lbv2.beta.com: Stonith2-1 Stonith2-2
> >>>
> >>>### Primitive Configuration ###
> >>>primitive vip_208 ocf:heartbeat:IPaddr2 \
> >>>    params \
> >>>        ip="192.168.17.208" \
> >>>        nic="eth0" \
> >>>        cidr_netmask="24" \
> >>>    op start interval="0s" timeout="90s" on-fail="restart" \
> >>>    op monitor interval="5s" timeout="60s" on-fail="restart" \
> >>>    op stop interval="0s" timeout="100s" on-fail="fence"
> >>>
> >>>primitive varnishd lsb:varnish \
> >>>    op start interval="0s" timeout="90s" on-fail="restart" \
> >>>    op monitor interval="10s" timeout="60s" on-fail="restart" \
> >>>    op stop interval="0s" timeout="100s" on-fail="fence"
> >>>
> >>>primitive ping ocf:pacemaker:ping \
> >>>    params \
> >>>        name="default_ping_set" \
> >>>        host_list="192.168.17.254" \
> >>>        multiplier="100" \
> >>>        dampen="1" \
> >>>    op start interval="0s" timeout="90s" on-fail="restart" \
> >>>    op monitor interval="10s" timeout="60s" on-fail="restart" \
> >>>    op stop interval="0s" timeout="100s" on-fail="fence"
> >>>
> >>>primitive Stonith1-1 stonith:external/stonith-helper \
> >>>    params \
> >>>        pcmk_reboot_retries="1" \
> >>>        pcmk_reboot_timeout="40s" \
> >>>        hostlist="lbv1.beta.com" \
> >>>        dead_check_target="192.168.17.132 10.0.17.132" \
> >>>        standby_check_command="/usr/local/sbin/crm_resource -r varnishd -W | grep -q `hostname`" \
> >>>        run_online_check="yes" \
> >>>    op start interval="0s" timeout="60s" on-fail="restart" \
> >>>    op stop interval="0s" timeout="60s" on-fail="ignore"
> >>>
> >>>primitive Stonith1-2 stonith:external/xen0 \
> >>>    params \
> >>>        pcmk_reboot_timeout="60s" \
> >>>        hostlist="lbv1.beta.com:/etc/xen/lbv1.cfg" \
> >>>        dom0="xen0.beta.com" \
> >>>    op start interval="0s" timeout="60s" on-fail="restart" \
> >>>    op monitor interval="3600s" timeout="60s" on-fail="restart" \
> >>>    op stop interval="0s" timeout="60s" on-fail="ignore"
> >>>
> >>>primitive Stonith2-1 stonith:external/stonith-helper \
> >>>    params \
> >>>        pcmk_reboot_retries="1" \
> >>>        pcmk_reboot_timeout="40s" \
> >>>        hostlist="lbv2.beta.com" \
> >>>        dead_check_target="192.168.17.133 10.0.17.133" \
> >>>        standby_check_command="/usr/local/sbin/crm_resource -r varnishd -W | grep -q `hostname`" \
> >>>        run_online_check="yes" \
> >>>    op start interval="0s" timeout="60s" on-fail="restart" \
> >>>    op stop interval="0s" timeout="60s" on-fail="ignore"
> >>>
> >>>primitive Stonith2-2 stonith:external/xen0 \
> >>>    params \
> >>>        pcmk_reboot_timeout="60s" \
> >>>        hostlist="lbv2.beta.com:/etc/xen/lbv2.cfg" \
> >>>        dom0="xen0.beta.com" \
> >>>    op start interval="0s" timeout="60s" on-fail="restart" \
> >>>    op monitor interval="3600s" timeout="60s" on-fail="restart" \
> >>>    op stop interval="0s" timeout="60s" on-fail="ignore"
> >>>
> >>>### Resource Location ###
> >>>location HA_location-1 HAvarnish \
> >>>    rule 200: #uname eq lbv1.beta.com \
> >>>    rule 100: #uname eq lbv2.beta.com
> >>>
> >>>location HA_location-2 HAvarnish \
> >>>    rule -INFINITY: not_defined default_ping_set or default_ping_set lt 100
> >>>
> >>>location HA_location-3 grpStonith1 \
> >>>    rule -INFINITY: #uname eq lbv1.beta.com
> >>>
> >>>location HA_location-4 grpStonith2 \
> >>>    rule -INFINITY: #uname eq lbv2.beta.com
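The `standby_check_command` in the config above relies on `crm_resource -r varnishd -W` printing which node runs the resource, and on `grep -q \`hostname\`` succeeding only on that node. The shell pattern can be sketched with a mocked command (the mocked output format is an assumption, not the tool's exact wording):

```shell
# Mock of: crm_resource -r varnishd -W | grep -q `hostname`
# grep -q exits 0 (quietly) on a match and non-zero otherwise, so the
# exit status alone tells the helper whether this node holds the resource.
mock_crm_resource() { echo "resource varnishd is running on: lbv1.beta.com"; }
this_node="lbv1.beta.com"   # stand-in for `hostname`
if mock_crm_resource | grep -q "$this_node"; then
    echo "active node"
else
    echo "standby node"
fi
```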
> >>>
> >>>
> >>>After loading this, the messages differ from yesterday's.
> >>>The ping messages were gone.
> >>>
> >>># crm_mon -rfA
> >>>Last updated: Tue Mar 17 10:21:28 2015
> >>>Last change: Tue Mar 17 10:21:09 2015
> >>>Stack: heartbeat
> >>>Current DC: lbv2.beta.com (82ffc36f-1ad8-8686-7db0-35686465c624) - partition with quorum
> >>>Version: 1.1.12-561c4cf
> >>>2 Nodes configured
> >>>8 Resources configured
> >>>
> >>>
> >>>Online: [ lbv1.beta.com lbv2.beta.com ]
> >>>
> >>>Full list of resources:
> >>>
> >>> Resource Group: HAvarnish
> >>>     vip_208    (ocf::heartbeat:IPaddr2):       Started lbv1.beta.com
> >>>     varnishd   (lsb:varnish):  Started lbv1.beta.com
> >>> Resource Group: grpStonith1
> >>>     Stonith1-1 (stonith:external/stonith-helper):      Stopped
> >>>     Stonith1-2 (stonith:external/xen0):        Stopped
> >>> Resource Group: grpStonith2
> >>>     Stonith2-1 (stonith:external/stonith-helper):      Stopped
> >>>     Stonith2-2 (stonith:external/xen0):        Stopped
> >>> Clone Set: clone_ping [ping]
> >>>     Started: [ lbv1.beta.com lbv2.beta.com ]
> >>>
> >>>Node Attributes:
> >>>* Node lbv1.beta.com:
> >>>    + default_ping_set                  : 100
> >>>* Node lbv2.beta.com:
> >>>    + default_ping_set                  : 100
> >>>
> >>>Migration summary:
> >>>* Node lbv2.beta.com:
> >>>   Stonith1-1: migration-threshold=1 fail-count=1000000 last-failure='Tue Mar 17 10:21:17 2015'
> >>>* Node lbv1.beta.com:
> >>>   Stonith2-1: migration-threshold=1 fail-count=1000000 last-failure='Tue Mar 17 10:21:17 2015'
> >>>
> >>>Failed actions:
> >>>    Stonith1-1_start_0 on lbv2.beta.com 'unknown error' (1): call=31, status=Error, last-rc-change='Tue Mar 17 10:21:15 2015', queued=0ms, exec=1082ms
> >>>    Stonith2-1_start_0 on lbv1.beta.com 'unknown error' (1): call=31, status=Error, last-rc-change='Tue Mar 17 10:21:16 2015', queued=0ms, exec=1079ms
> >>>
> >>>
> >>>This is the /var/log/ha-debug log:
> >>>
> >>>IPaddr2(vip_208)[7851]: 2015/03/17_10:21:22 INFO: Adding inet address 192.168.17.208/24 with broadcast address 192.168.17.255 to device eth0
> >>>IPaddr2(vip_208)[7851]: 2015/03/17_10:21:22 INFO: Bringing device eth0 up
> >>>IPaddr2(vip_208)[7851]: 2015/03/17_10:21:22 INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-192.168.17.208 eth0 192.168.17.208 auto not_used not_used
> >>>
> >>>There was no stdout or stderr output.
> >>>
> >>>Is something wrong with stonith-helper?
> >>>Since stonith-helper is a shell script, I had not paid much attention to how it was installed.
> >>>stonith-helper is located here:
> >>>/usr/local/heartbeat/lib/stonith/plugins/external/stonith-helper
> >>>
> >>>
> >>>
> >>>Thank you in advance.
> >>>
> >>>Regards,
> >>>
> >>>
> >>>
> >>>2015-03-17 9:45 GMT+09:00 <renay****@ybb*****>:
> >>>
> >>>Fukuda-san
> >>>>
> >>>>Good morning. This is Yamauchi.
> >>>>
> >>>>Just in case, here is an excerpt from an example I have at hand that uses multiple stonith resources.
> >>>>(In practice, be careful with the line breaks.)
> >>>>
> >>>>The example below is a configuration for the PM 1.1 series:
> >>>>On nodea, stonith runs prmStonith1-1 first, then prmStonith1-2.
> >>>>On nodeb, stonith runs prmStonith2-1 first, then prmStonith2-2.
> >>>>
> >>>>The stonith agents themselves are helper and ssh.
> >>>>
> >>>>
> >>>>(snip)
> >>>>### Group Configuration ###
> >>>>group grpStonith1 \
> >>>>prmStonith1-1 \
> >>>>prmStonith1-2
> >>>>
> >>>>group grpStonith2 \
> >>>>prmStonith2-1 \
> >>>>prmStonith2-2
> >>>>
> >>>>### Fencing Topology ###
> >>>>fencing_topology \
> >>>>nodea: prmStonith1-1 prmStonith1-2 \
> >>>>nodeb: prmStonith2-1 prmStonith2-2
> >>>>(snp)
> >>>>primitive prmStonith1-1 stonith:external/stonith-helper \
> >>>>params \
> >>>>
> >>>>pcmk_reboot_retries="1" \
> >>>>pcmk_reboot_timeout="40s" \
> >>>>hostlist="nodea" \
> >>>>dead_check_target="192.168.28.60 192.168.28.70" \
> >>>>standby_check_command="/usr/sbin/crm_resource -r prmRES -W | grep -qi `hostname`" \
> >>>>run_online_check="yes" \
> >>>>op start interval="0s" timeout="60s" on-fail="restart" \
> >>>>op stop interval="0s" timeout="60s" on-fail="ignore"
> >>>>
> >>>>primitive prmStonith1-2 stonith:external/ssh \
> >>>>params \
> >>>>pcmk_reboot_timeout="60s" \
> >>>>hostlist="nodea" \
> >>>>op start interval="0s" timeout="60s" on-fail="restart" \
> >>>>op monitor interval="3600s" timeout="60s" on-fail="restart" \
> >>>>op stop interval="0s" timeout="60s" on-fail="ignore"
> >>>>
> >>>>primitive prmStonith2-1 stonith:external/stonith-helper \
> >>>>params \
> >>>>pcmk_reboot_retries="1" \
> >>>>pcmk_reboot_timeout="40s" \
> >>>>hostlist="nodeb" \
> >>>>dead_check_target="192.168.28.61 192.168.28.71" \
> >>>>standby_check_command="/usr/sbin/crm_resource -r prmRES -W | grep -qi `hostname`" \
> >>>>run_online_check="yes" \
> >>>>op start interval="0s" timeout="60s" on-fail="restart" \
> >>>>op stop interval="0s" timeout="60s" on-fail="ignore"
> >>>>
> >>>>primitive prmStonith2-2 stonith:external/ssh \
> >>>>params \
> >>>>pcmk_reboot_timeout="60s" \
> >>>>hostlist="nodeb" \
> >>>>op start interval="0s" timeout="60s" on-fail="restart" \
> >>>>op monitor interval="3600s" timeout="60s" on-fail="restart" \
> >>>>op stop interval="0s" timeout="60s" on-fail="ignore"
> >>>>(snip)
> >>>>location rsc_location-grpStonith1-2 grpStonith1 \
> >>>>rule -INFINITY: #uname eq nodea
> >>>>location rsc_location-grpStonith2-3 grpStonith2 \
> >>>>rule -INFINITY: #uname eq nodeb
> >>>>
> >>>>
> >>>>That's all.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>>--
> >>>
> >>>ELF Systems
> >>>Masamichi Fukuda
> >>>mail to: masamichi_fukud****@elf-s*****
> >>>
> >>>
> >>
> >>
> >>_______________________________________________
> >>Linux-ha-japan mailing list
> >>Linux****@lists*****
> >>http://lists.sourceforge.jp/mailman/listinfo/linux-ha-japan
> >>
> >
> >
> >--
> >
> >ELF Systems
> >Masamichi Fukuda
> >mail to: masamichi_fukud****@elf-s*****
> >
> >
>
>



-- 
ELF Systems
Masamichi Fukuda
mail to: masamichi_fukud****@elf-s*****


