OKADA Satoshi
okada****@oss*****
2008年 11月 18日 (火) 12:27:07 JST
島田さん 岡田です。こんにちは。 ログからV2モードで動作していると推定します。 #VM環境は詳しくないので、はずしてたらすみません。 V2モードではノード間でやり取りするデータ量が大きくなっており、 シリアルケーブルでの通信では処理しきれないことがあるようです。 通常のLAN接続の設定で動作させてみてはいかがでしょうか。 watchdogとの関連はちょっと分かりません。 ご質問の内容からそれますが、ログから気がついた点です。 もしかしたら、ipfailを使ってらっしゃるかもしれませんが、 V2モードではpingdがipfailの代わりとなっていますので、 そちらの利用も検討してもらえれば良いかと思います。 http://www.linux-ha.org/ja/pingd_ja 設定ファイル(ha.cf, cib.xml)を可能な範囲で教えていただける ともう少し、原因に近づけるかもしれません。 また、バージョンによって動作が違う可能性がありますので、 Heartbeatのバージョンも教えていただければと思います。 島田雄太 さんは書きました: > 島田と申します > > linux-haの挙動で質問があります。 > > 現在、VMwareのホストOS2台を用いて、その上にそれぞれRedhatES4をゲストOSとして > 立ち上げています。 > > heartbeatとDRBDを用いて、その上にpostgresを使って構築をしております。 > VM上でシリアルケーブルを認識させているのですが、通常運用時に以下のようなエラーが出ており > WatchDog上で障害だと判断し、切り替わりが発生してしまったように見受けられます > > なぜ、そうなったのか、調査したいと考えております。 > > > そもそも設定がおかしいというのも考えられるのでしょうか。 > どういった情報をお伝えすればよいか分からない為、不適切なご相談になってしまうようでしたら > 必要な情報をご教授いただけますでしょうか。 > > ご迷惑をお掛け致しますが、宜しくお願い致します > > 1.コンソール上において以下のようなメッセージが出ていました > 「1xxxdb01 kernel: ttyS0: 1 input overrun(s)」 > というメッセージが表示されており、最後に > 「SoftDog: Initiaing system reboot.」 > というメッセージが表示されていた > > 2./var/log/messageについて > > Nov 17 05:37:03 1mctdb01 kernel: drbd0: resync bitmap: bits=12799789 > words=399994 > Nov 17 05:37:03 1mctdb01 kernel: klogd 1.4.1, ---------- state change ---------- > Nov 17 05:37:03 1mctdb01 kernel: drbd0: size = 48 GB (51199155 KB) > Nov 17 05:37:03 1mctdb01 kernel: drbd0: 0 KB marked out-of-sync by on > disk bit-map. > Nov 17 05:37:03 1mctdb01 kernel: drbd0: Found 4 transactions (192 > active extents) in activity log. > Nov 17 05:37:03 1mctdb01 kernel: drbd0: Marked additional 507 MB as > out-of-sync based on AL. > Nov 17 05:37:03 1mctdb01 kernel: drbd0: drbdsetup [4700]: cstate > Unconfigured --> StandAlone > Nov 17 05:37:03 1mctdb01 drbd: s0 > Nov 17 05:37:03 1mctdb01 drbd: n0 > Nov 17 05:37:03 1mctdb01 kernel: drbd0: drbdsetup [4713]: cstate > StandAlone --> Unconnected > Nov 17 05:37:03 1mctdb01 kernel: drbd0: drbd0_receiver [4714]: cstate > Unconnected --> WFConnection > Nov 17 05:37:03 1mctdb01 drbd: ] > Nov 17 05:37:03 1mctdb01 drbd: . > Nov 17 05:37:04 1mctdb01 drbd: WARN: stdin/stdout is not a TTY; using > /dev/console > Nov 17 05:37:05 1mctdb01 kernel: drbd0: drbd0_receiver [4714]: cstate > WFConnection --> WFReportParams > Nov 17 05:37:05 1mctdb01 kernel: drbd0: Handshake successful: DRBD > Network Protocol version 74 > Nov 17 05:37:05 1mctdb01 kernel: drbd0: Connection established. > Nov 17 05:37:05 1mctdb01 kernel: drbd0: I am(S): > 1:00000002:00000001:0000001e:0000000e:11 > Nov 17 05:37:05 1mctdb01 kernel: drbd0: Peer(P): > 1:00000002:00000001:0000001e:0000000f:10 > Nov 17 05:37:05 1mctdb01 kernel: drbd0: drbd0_receiver [4714]: cstate > WFReportParams --> WFBitMapT > Nov 17 05:37:05 1mctdb01 kernel: drbd0: Secondary/Unknown --> Secondary/Primary > Nov 17 05:37:05 1mctdb01 rc: Starting drbd: succeeded > Nov 17 05:37:05 1mctdb01 kernel: drbd0: drbd0_receiver [4714]: cstate > WFBitMapT --> SyncTarget > Nov 17 05:37:05 1mctdb01 kernel: drbd0: Resync started as SyncTarget > (need to sync 519348 KB [129837 bits set]). > Nov 17 05:37:06 1mctdb01 logd: [4739]: info: logd started with default > configuration. > Nov 17 05:37:06 1mctdb01 logd: [4739]: info: G_main_add_SignalHandler: > Added signal handler for signal 15 > Nov 17 05:37:06 1mctdb01 logd: [4748]: info: G_main_add_SignalHandler: > Added signal handler for signal 15 > Nov 17 05:37:06 1mctdb01 kernel: Software Watchdog Timer: 0.07 > initialized. soft_noboot=0 soft_margin=60 sec (nowayout= 0) > Nov 17 05:37:09 1mctdb01 sendmail[4833]: sql_select option missing > Nov 17 05:37:09 1mctdb01 sendmail[4833]: auxpropfunc error no > mechanism available > Nov 17 05:37:09 1mctdb01 sendmail: sendmail startup succeeded > Nov 17 05:37:09 1mctdb01 sendmail: sm-client startup succeeded > Nov 17 05:37:09 1mctdb01 gpm[4855]: *** info [startup.c(95)]: > Nov 17 05:37:09 1mctdb01 gpm[4855]: Started gpm successfully. Entered > daemon mode. > Nov 17 05:37:09 1mctdb01 gpm[4855]: *** info [mice.c(1766)]: > Nov 17 05:37:09 1mctdb01 gpm[4855]: imps2: Auto-detected intellimouse PS/2 > Nov 17 05:37:10 1mctdb01 lrmd: [4858]: info: G_main_add_SignalHandler: > Added signal handler for signal 15 > Nov 17 05:37:10 1mctdb01 attrd: [4860]: info: > G_main_add_SignalHandler: Added signal handler for signal 15 > Nov 17 05:37:10 1mctdb01 ccm: [4856]: info: Hostname: 1mctdb01.tdc-asp.jp > Nov 17 05:37:10 1mctdb01 attrd: [4860]: info: register_with_ha: > Hostname: 1mctdb01.tdc-asp.jp > Nov 17 05:37:10 1mctdb01 gpm: gpm startup succeeded > Nov 17 05:37:10 1mctdb01 cib: [4857]: info: G_main_add_SignalHandler: > Added signal handler for signal 15 > Nov 17 05:37:10 1mctdb01 cib: [4857]: info: G_main_add_TriggerHandler: > Added signal manual handler > Nov 17 05:37:10 1mctdb01 cib: [4857]: info: G_main_add_SignalHandler: > Added signal handler for signal 17 > Nov 17 05:37:10 1mctdb01 cib: [4857]: info: main: Retrieval of a > per-action CIB: disabled > Nov 17 05:37:10 1mctdb01 ipfail: [4862]: ERROR: No managed resources > Nov 17 05:37:10 1mctdb01 lrmd: [4858]: info: G_main_add_SignalHandler: > Added signal handler for signal 17 > Nov 17 05:37:10 1mctdb01 ccm: [4856]: info: G_main_add_SignalHandler: > Added signal handler for signal 15 > Nov 17 05:37:10 1mctdb01 lrmd: [4858]: info: G_main_add_SignalHandler: > Added signal handler for signal 10 > Nov 17 05:37:10 1mctdb01 lrmd: [4858]: info: G_main_add_SignalHandler: > Added signal handler for signal 12 > Nov 17 05:37:10 1mctdb01 cib: [4857]: ERROR: crm_is_writable: > /var/lib/heartbeat/crm/cib.xml must be owned and r/w by user hacluster > Nov 17 05:37:10 1mctdb01 cib: [4857]: info: retrieveCib: Reading > cluster configuration from: /var/lib/heartbeat/crm/cib.xml (digest: > /var/lib/heartbeat/crm/cib.xml.sig) > Nov 17 05:37:10 1mctdb01 attrd: [4860]: info: register_with_ha: UUID: > c73aec4b-6947-40a2-b1e2-853d10c4d22d > Nov 17 05:37:10 1mctdb01 stonithd: [4859]: info: > G_main_add_SignalHandler: Added signal handler for signal 10 > Nov 17 05:37:10 1mctdb01 stonithd: [4859]: info: > G_main_add_SignalHandler: Added signal handler for signal 12 > Nov 17 05:37:10 1mctdb01 lrmd: [4858]: info: Started. > Nov 17 05:37:10 1mctdb01 crmd: [4861]: info: main: CRM Hg Version: > node: 552305612591183b1628baa5bc6e903e0f1e26a3 > Nov 17 05:37:10 1mctdb01 crmd: [4861]: info: crmd_init: Starting crmd > Nov 17 05:37:10 1mctdb01 crmd: [4861]: info: G_main_add_SignalHandler: > Added signal handler for signal 15 > Nov 17 05:37:10 1mctdb01 crmd: [4861]: info: > G_main_add_TriggerHandler: Added signal manual handler > Nov 17 05:37:10 1mctdb01 crmd: [4861]: info: G_main_add_SignalHandler: > Added signal handler for signal 17 > > 3./var/ha-log > heartbeat[4806]: 2008/11/17_05:37:07 WARN: Logging daemon is disabled > --enabling logging daemon is recommended > heartbeat[4806]: 2008/11/17_05:37:07 info: ************************** > heartbeat[4806]: 2008/11/17_05:37:07 info: Configuration validated. > Starting heartbeat 2.1.3 > heartbeat[4807]: 2008/11/17_05:37:07 info: heartbeat: version 2.1.3 > heartbeat[4807]: 2008/11/17_05:37:07 info: Heartbeat generation: 1211968635 > heartbeat[4807]: 2008/11/17_05:37:08 info: glib: Starting serial > heartbeat on tty /dev/ttyS0 (19200 baud) > heartbeat[4807]: 2008/11/17_05:37:08 info: glib: ucast: write socket > priority set to IPTOS_LOWDELAY on eth1 > heartbeat[4807]: 2008/11/17_05:37:08 info: glib: ucast: bound send > socket to device: eth1 > heartbeat[4807]: 2008/11/17_05:37:08 info: glib: ucast: bound receive > socket to device: eth1 > heartbeat[4807]: 2008/11/17_05:37:08 info: glib: ucast: started on > port 694 interface eth1 to 192.168.201.2 > heartbeat[4807]: 2008/11/17_05:37:08 info: glib: ping heartbeat started. > heartbeat[4807]: 2008/11/17_05:37:08 info: G_main_add_TriggerHandler: > Added signal manual handler > heartbeat[4807]: 2008/11/17_05:37:08 info: G_main_add_TriggerHandler: > Added signal manual handler > heartbeat[4807]: 2008/11/17_05:37:08 notice: Using watchdog device: > /dev/watchdog > heartbeat[4807]: 2008/11/17_05:37:08 info: G_main_add_SignalHandler: > Added signal handler for signal 17 > heartbeat[4807]: 2008/11/17_05:37:08 info: Local status now set to: 'up' > heartbeat[4807]: 2008/11/17_05:37:09 info: Link > 192.168.70.254:192.168.70.254 up. > heartbeat[4807]: 2008/11/17_05:37:09 info: Status update for node > 192.168.70.254: status ping > heartbeat[4807]: 2008/11/17_05:37:09 info: Link > 1mctdb02.xxx-asp.jp:/dev/ttyS0 up. > heartbeat[4807]: 2008/11/17_05:37:09 info: Status update for node > 1mctdb02.xxx-asp.jp: status active > heartbeat[4807]: 2008/11/17_05:37:09 info: Link 1mctdb02.xxx-asp.jp:eth1 up. > heartbeat[4807]: 2008/11/17_05:37:09 info: Comm_now_up(): updating > status to active > heartbeat[4807]: 2008/11/17_05:37:09 info: Local status now set to: 'active' > heartbeat[4807]: 2008/11/17_05:37:09 info: Starting child client > "/usr/lib/heartbeat/ccm" (400,400) > heartbeat[4807]: 2008/11/17_05:37:09 info: Starting child client > "/usr/lib/heartbeat/cib" (400,400) > heartbeat[4807]: 2008/11/17_05:37:09 info: Starting child client > "/usr/lib/heartbeat/lrmd -r" (0,0) > heartbeat[4857]: 2008/11/17_05:37:09 info: Starting > "/usr/lib/heartbeat/cib" as uid 400 gid 400 (pid 4857) > heartbeat[4807]: 2008/11/17_05:37:09 info: Starting child client > "/usr/lib/heartbeat/stonithd" (0,0) > heartbeat[4807]: 2008/11/17_05:37:09 info: Starting child client > "/usr/lib/heartbeat/attrd" (400,400) > heartbeat[4859]: 2008/11/17_05:37:09 info: Starting > "/usr/lib/heartbeat/stonithd" as uid 0 gid 0 (pid 4859) > heartbeat[4807]: 2008/11/17_05:37:09 info: Starting child client > "/usr/lib/heartbeat/crmd" (400,400) > heartbeat[4856]: 2008/11/17_05:37:09 info: Starting > "/usr/lib/heartbeat/ccm" as uid 400 gid 400 (pid 4856) > heartbeat[4807]: 2008/11/17_05:37:09 info: Starting child client > "/usr/lib/heartbeat/ipfail" (400,400) > heartbeat[4858]: 2008/11/17_05:37:09 info: Starting > "/usr/lib/heartbeat/lrmd -r" as uid 0 gid 0 (pid 4858) > heartbeat[4860]: 2008/11/17_05:37:09 info: Starting > "/usr/lib/heartbeat/attrd" as uid 400 gid 400 (pid 4860) > heartbeat[4807]: 2008/11/17_05:37:09 WARN: G_CH_dispatch_int: Dispatch > function for read child took too long to execute: 80 ms (> 50 ms) > (GSource: 0x9148158) > heartbeat[4862]: 2008/11/17_05:37:09 info: Starting > "/usr/lib/heartbeat/ipfail" as uid 400 gid 400 (pid 4862) > heartbeat[4861]: 2008/11/17_05:37:10 info: Starting > "/usr/lib/heartbeat/crmd" as uid 400 gid 400 (pid 4861) > lrmd[4858]: 2008/11/17_05:37:10 info: G_main_add_SignalHandler: Added > signal handler for signal 15 > attrd[4860]: 2008/11/17_05:37:10 info: G_main_add_SignalHandler: Added > signal handler for signal 15 > ccm[4856]: 2008/11/17_05:37:10 info: Hostname: 1mctdb01.xxx-asp.jp > attrd[4860]: 2008/11/17_05:37:10 info: register_with_ha: Hostname: > 1mctdb01.xxx-asp.jp > cib[4857]: 2008/11/17_05:37:10 info: G_main_add_SignalHandler: Added > signal handler for signal 15 > cib[4857]: 2008/11/17_05:37:10 info: G_main_add_TriggerHandler: Added > signal manual handler > cib[4857]: 2008/11/17_05:37:10 info: G_main_add_SignalHandler: Added > signal handler for signal 17 > heartbeat[4807]: 2008/11/17_05:37:10 ERROR: api_process_request: bad > request [getrsc] > cib[4857]: 2008/11/17_05:37:10 info: main: Retrieval of a per-action > CIB: disabled > heartbeat[4807]: 2008/11/17_05:37:10 ERROR: MSG: Dumping message with 5 fields > heartbeat[4807]: 2008/11/17_05:37:10 ERROR: MSG[0] : [t=hbapi-req] > heartbeat[4807]: 2008/11/17_05:37:10 ERROR: MSG[1] : [reqtype=getrsc] > heartbeat[4807]: 2008/11/17_05:37:10 ERROR: MSG[2] : [dest=1mctdb01.xxx-asp.jp] > heartbeat[4807]: 2008/11/17_05:37:10 ERROR: MSG[3] : [pid=4862] > heartbeat[4807]: 2008/11/17_05:37:10 ERROR: MSG[4] : [from_id=ipfail] > ipfail[4862]: 2008/11/17_05:37:10 ERROR: No managed resources > heartbeat[4807]: 2008/11/17_05:37:10 WARN: Managed > /usr/lib/heartbeat/ipfail process 4862 exited with return code 100. > lrmd[4858]: 2008/11/17_05:37:10 info: G_main_add_SignalHandler: Added > signal handler for signal 17 > lrmd[4858]: 2008/11/17_05:37:10 info: G_main_add_SignalHandler: Added > signal handler for signal 10 > ccm[4856]: 2008/11/17_05:37:10 info: G_main_add_SignalHandler: Added > signal handler for signal 15 > lrmd[4858]: 2008/11/17_05:37:10 info: G_main_add_SignalHandler: Added > signal handler for signal 12 > cib[4857]: 2008/11/17_05:37:10 ERROR: crm_is_writable: > /var/lib/heartbeat/crm/cib.xml must be owned and r/w by user hacluster > cib[4857]: 2008/11/17_05:37:10 info: retrieveCib: Reading cluster > configuration from: /var/lib/heartbeat/crm/cib.xml (digest: > /var/lib/heartbeat/crm/cib.xml.sig) > stonithd[4859]: 2008/11/17_05:37:10 info: G_main_add_SignalHandler: > Added signal handler for signal 10 > stonithd[4859]: 2008/11/17_05:37:10 info: G_main_add_SignalHandler: > Added signal handler for signal 12 > attrd[4860]: 2008/11/17_05:37:10 info: register_with_ha: UUID: > c73aec4b-6947-40a2-b1e2-853d10c4d22d > lrmd[4858]: 2008/11/17_05:37:10 info: Started. > crmd[4861]: 2008/11/17_05:37:10 info: main: CRM Hg Version: node: > 552305612591183b1628baa5bc6e903e0f1e26a3 > > crmd[4861]: 2008/11/17_05:37:10 info: crmd_init: Starting crmd > crmd[4861]: 2008/11/17_05:37:10 info: G_main_add_SignalHandler: Added > signal handler for signal 15 > crmd[4861]: 2008/11/17_05:37:10 info: G_main_add_TriggerHandler: Added > signal manual handler > crmd[4861]: 2008/11/17_05:37:10 info: G_main_add_SignalHandler: Added > signal handler for signal 17 > cib[4857]: 2008/11/17_05:37:10 WARN: validate_cib_digest: No on-disk > digest present > > > あとこういったエラーログも出ていました > > heartbeat[4807]: 2008/11/17_05:37:59 ERROR: ha_msg_addraw_ll: illegal field > heartbeat[4807]: 2008/11/17_05:37:59 ERROR: ha_msg_addraw(): > ha_msg_addraw_ll failed > heartbeat[4807]: 2008/11/17_05:37:59 ERROR: NV failure (string2msg_ll): > heartbeat[4807]: 2008/11/17_05:37:59 ERROR: Input string: [>>> > t=cib > cib_clientid=5977 > cib_callopt=1052672 > cib_callid=4 > cib_op=cib_apply_diff > cib_section=status > cib_delegated_from=1mctdb01.xxx-asp.jp > cib_clientname=5977 > (4)cib_calldata=eJyV0EEOwiAQBdC912ia2AQSmE5bbcJZCC2zIFE0QD2/QrqnfuDb6XWhS2CyKsKQWaCB5vJwESWo/aZYlArFILlqOJVfLG3/qUJJS3o8/1FmEhFgj8cXAPxpoiYfzC4234a2 > hg=483d2dbb > ts=49208487 > ld=0.42 0.13 0.03<< > t=cib > cib_clientid=ab2880a6-3d82-472f-8275-2c9846356fcb > cited_from=1mctdb01.xxx-asp.jp > cib_clientname=4861 > (4)cib_calldata=eJyVUtlu4yAUfe9vRJFaCSJW28mIb0Es1xk6trEAV8rfF9vpVG1nHvKCdLez3MvhcHhi7PJMXrSezAhaqyl60LmYAkhsleCVa7kBJyxuzqLFghiGLQWGO8k9JU54xjxi3dq9rDC > Kjq54S+ipeIdNnk+vM+Jb3aURe7DLFccUrmFSPupl9pVOJ8hxSQ5QS/jlmb0MaVSNRvMKlYMKU6/NScnY7zzs1+cm9VxMVXZj3HXPQUS+hv2sMY64KrnPPaFWdIpoQ4qb3+wGkasvlJZsphRdB/4KZk > e6m+HO+sBS8wZ+6MhaMcd9AYLKnvJXBOO5Coab7Nj+YanKqCfz2Asq/ZmWHQ1TGlaN9LNaD7urwl1R8MRbHTxwGS067+a1VjsW9g++JLXjP7zaYC6c0MNSHkvcWHK+SiHAFCO9vyvkpgzDbOey4lUCZ > NTyWg4/H49K/nHeAYBD8= > cib_isreplyto=1mctdb01.xxx-asp.jp > cib_update=true > (4)cib_update_diff=eJydU+1uozAQ/N/XiCI1kjl5/YXhxLNYxl5yqAEi41Tq258NtGma6KTLHwt2ZtezO+vdbvcCqn6lB2NGO6Axje+7joCq6ld2yN9FwGF6R9/sEpeJO+4nTmSqIw+ubxcmyBtm > ChPQOTReBnM5extxbiSlZL/fv5BVg/VDPxo8T+5PQwlAjq1/QP+VXVXll1rr/adWdq91QYnW5RNSgWi2tDhHGy/zmqpvUleElBoyb5w8mhzCR3quKFGaZf4pDI/0pDCRVbkxTMB5ugSHqwB2x70SiOT > 6Z9YjF7/jRCjxlTM7M50fab+ihC+1et/40HpaU3Oe5mjGKfbdh/FpNSKaZF6VWdMZg439NDYrTvgyPheGwmN7ORZT6I/92PhpG/tVlqKZGYMd5z5XMG/40ciyBlY7rtsWvSg4c1UhHPBCo7KFBN9J5B > w0SqLUj/zBHnvXJMG//6MK8EWvPZ1M6hiArHNJDZgOk/UhOYqxYb9SxwsSnHHJ57zNYp2A2baHbsX6MWJ4t6cUEHKj+P6Ic2wcRQq6LXmXJDDWKuc9lxKBSduBRMI3V/L9eb1TU+ENQzJo2fM6Jn/W5 > 8VvTVqjyyHYBrmSW3SiLVQlykJQy4oWkBVacg/UCc+Yfzrlejz50uH71dt8RMudsJ1T1noG0tOSMc4sp9RqqNSa8hfq6W3F > oseq=62 > _compression_algorithm=zlib > from_id=cib > to_id=cib > client_gen=3 > src=1mctdb02.xxx-asp.jp > (1)srcuuid=PhGaPNvdTVSt7wPintDUbg== > seq=4234a3 > hg=483d2dbb > ts=49208488 > ld=0.42 0.13 0.03 2/89 24799 > ttl=3 > auth=1 2a06b826 > <<< > ] > > _______________________________________________ > Linux-ha-japan mailing list > Linux****@lists***** > http://lists.sourceforge.jp/mailman/listinfo/linux-ha-japan > -- NTTオープンソースソフトウェアセンタ 岡田 敏 <OKADA Satoshi>