[postgresforest-users 100] Re: オンラインリカバリ中エラー applyLogRecord() failed.について

Back to archive index

Satoshi.Nagatsuma nagat****@nttda*****
2007年 12月 18日 (火) 10:40:46 JST


長妻です。

>> オンラインリカバリを開始した際、phase1,phase2においてエラー
>> org.postgresforest.tool.recovery.RecoveryException: applyLogRecord()
>> failed.が発生しました。

> そちらの調査では、何か進展がありましたでしょうか。

リカバリーに関連するソースコード上の怪しそうな部分に
ざっと目は通したのですが、特に上記のようなエラーが
起きそうな部分は見当たりませんでした。

リカバリーマネージャの改善は今後も行っていく予定なので
その中でソースコード側から継続調査してみようと思っています。


永津 さんは書きました:
> お世話になっております。
> 永津と申します。
> 
> 以前下記のような形で質問させて頂きました。
> 
>> オンラインリカバリ実行中のエラーについて質問があります。
>>
>> マシン2台を用いて運用テストを行っています。
> 補足として、GSCをそれぞれのマシンに配置して冗長化の構成をとっております。
> 
>> それぞれにPostgresSQL4.0.2とPostgresSQL8.1.6をインストールしています。
>>
>> 1つのデータベースに対して、毎秒約9000件のレコードをInsertしている状態で
>> オンラインリカバリを開始した際、phase1,phase2においてエラー
>> org.postgresforest.tool.recovery.RecoveryException: applyLogRecord()
>> failed.が発生しました。
>> この後、start_vmまでオンラインリカバリを継続、終了いたしましたが、
>> データの差異は改善されませんでした。
>>
>> エラーは同じ状況で再現します。
>> また、phase1終了後の時点でInsertを開始した場合は、
>> 正常にリカバリを終了します。
> 
> 上記の質問について次のようなご回答を頂きました。
> 
>> 長妻です。
>>
>> 基本的にオンラインリカバリは、トランザクション(特に更新)が
>> 少ない時間帯に行っていただくことを想定しています。
>> 例えばWEBアプリケーションのバックエンドであれば、
>> 夜間帯のトランザクションが少ない時間帯、
>> なおかつバッチなどが走っていない状態で・・・
>> といった具合です。
>> 秒間9000件の更新はなかなか激しいですね(^^;
>> ちなみに、Phase1の時間に更新量が多いと、リカバリー最中に
>> vmが停止している時間が長くなってしまいます。
>>
>>
>> さて、それはともかく、このエラーについてはこちらでも
>> 調べてみましたが、今のところ根本的な原因がわかりません。
>> 追加で頂いたログも含め、同じ内容を2度INSERTしているという
>> 現象面だけは見えているのですが・・・
>> リカバリマネージャの不具合の可能性もあるため
>> もう少しこちらでも調べてみるつもりです。
>>
>> もし永津さんのほうでもなにかわかりましたら、
>> 情報をいただけると助かります。
> 
> チューニングをしてPostgresSQLパフォーマンスの向上を行う、
> レコードのinsert件数をデータベース全体で約1200件/秒まで下げる、
> などの方法を試みましたが、同様のエラーが発生しています。
> 
> そちらの調査では、何か進展がありましたでしょうか。
> 
> -debugオプションを追加してリカバリコマンドを実行したときに出力された
> メッセージログを今回追記いたします。
> 
> $ /usr/local/forest40/bin/ForestRecovery.sh -user forest -pass forest
>  -gsc_url jdbc:postgresql://192.168.0.165:5432/gsc -phase1 0 1 forest_db -debug
> 2007-11-28 20:50:03.722 DEBUG: 2 GSC(s) found.
> 2007-11-28 20:50:03.729 NOTICE: RECOVERING:
> 2007-11-28 20:50:03.729 NOTICE:   FROM
> [0]jdbc:postgresql://192.168.0.165:5432/forest_db
> 2007-11-28 20:50:03.729 NOTICE:     TO
> [1]jdbc:postgresql://192.168.0.134:5432/forest_db
> 2007-11-28 20:50:03.729 DEBUG: jdbc:postgresql://192.168.0.165:5432/forest_db
> 2007-11-28 20:50:03.729 DEBUG: jdbc:postgresql://192.168.0.134:5432/forest_db
> 2007-11-28 20:50:03.736 DEBUG: Backend PID=17602
> 2007-11-28 20:50:03.736 NOTICE: SERVER STATUS:
> 2007-11-28 20:50:03.737 NOTICE:   Src: Server 0, Status 1
> 2007-11-28 20:50:03.738 NOTICE:   Dst: Server 1, Status -1
> 2007-11-28 20:50:03.738 NOTICE: RECOVERY_MODE = 1
> 2007-11-28 20:50:03.739 NOTICE: RECOVERY / INITIALIZATION
> 2007-11-28 20:50:03.74 NOTICE: TABLE(S) TO COPY:
> 2007-11-28 20:50:03.74 NOTICE:   test_table1
> 2007-11-28 20:50:03.74 NOTICE:   test_table2
> 2007-11-28 20:50:03.74 NOTICE:   test_table3
> 2007-11-28 20:50:03.74 NOTICE:   test_table4
> 2007-11-28 20:50:03.74 NOTICE: RECOVERY / PHASE 1: start.
> 2007-11-28 20:50:03.741 NOTICE: GSC CHECK:
> 2007-11-28 20:50:03.742 NOTICE:   forest_servdb, forest_db: OK
> 2007-11-28 20:50:03.743 NOTICE:   forest_tablepartdtl, test_table1: OK
> 2007-11-28 20:50:03.744 NOTICE:   forest_tablepartdtl, test_table2: OK
> 2007-11-28 20:50:03.744 NOTICE:   forest_tablepartdtl, test_table3: OK
> 2007-11-28 20:50:03.745 NOTICE:   forest_tablepartdtl, test_table4: OK
> 2007-11-28 20:50:08.332 NOTICE: Log table already exists.
> 2007-11-28 20:50:08.333 NOTICE: Log table already exists.
> 2007-11-28 20:50:08.333 NOTICE: Log table already exists.
> 2007-11-28 20:50:08.334 DEBUG: beginRecoveryLogging: test_table1
> 2007-11-28 20:50:08.334 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db has been created.
> 2007-11-28 20:50:08.335 DEBUG: CREATE TRIGGER test_table1_logger AFTER
> INSERT OR UPDATE OR DELETE ON test_table1 FOR EACH ROW EXECUTE
> PROCEDURE postgresforest.logger()
> 2007-11-28 20:50:08.335 DEBUG: CancelThread is going to wait 60 secs.
> 2007-11-28 20:50:08.336 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db disabled.
> 2007-11-28 20:50:08.336 DEBUG:   Logger started, table => test_table1
> 2007-11-28 20:50:08.336 DEBUG: beginRecoveryLogging: test_table2
> 2007-11-28 20:50:08.336 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db has been created.
> 2007-11-28 20:50:08.336 DEBUG: CREATE TRIGGER test_table2_logger AFTER
> INSERT OR UPDATE OR DELETE ON test_table2 FOR EACH ROW EXECUTE
> PROCEDURE postgresforest.logger()
> 2007-11-28 20:50:08.336 DEBUG: CancelThread is going to wait 60 secs.
> 2007-11-28 20:50:08.337 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db disabled.
> 2007-11-28 20:50:08.337 DEBUG:   Logger started, table => test_table2
> 2007-11-28 20:50:08.338 DEBUG: beginRecoveryLogging: test_table3
> 2007-11-28 20:50:08.338 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db has been created.
> 2007-11-28 20:50:08.338 DEBUG: CREATE TRIGGER test_table3_logger AFTER
> INSERT OR UPDATE OR DELETE ON test_table3 FOR EACH ROW EXECUTE
> PROCEDURE postgresforest.logger()
> 2007-11-28 20:50:08.338 DEBUG: CancelThread is going to wait 60 secs.
> 2007-11-28 20:50:08.339 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db disabled.
> 2007-11-28 20:50:08.339 DEBUG:   Logger started, table => test_table3
> 2007-11-28 20:50:08.339 DEBUG: beginRecoveryLogging: test_table4
> 2007-11-28 20:50:08.339 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db has been created.
> 2007-11-28 20:50:08.339 DEBUG: CREATE TRIGGER test_table4_logger AFTER
> INSERT OR UPDATE OR DELETE ON test_table4 FOR EACH ROW EXECUTE
> PROCEDURE postgresforest.logger()
> 2007-11-28 20:50:08.339 DEBUG: CancelThread is going to wait 60 secs.
> 2007-11-28 20:50:08.34 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db disabled.
> 2007-11-28 20:50:08.34 DEBUG:   Logger started, table => test_table4
> 2007-11-28 20:50:08.34 NOTICE: COPYING ALL TABLES: start.
> 2007-11-28 20:50:08.341 NOTICE:   CREATE SNAPSHOT: start.
> 2007-11-28 20:50:08.341 DEBUG: getSnapshotXid: test_table1
> 2007-11-28 20:50:08.341 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db has been created.
> 2007-11-28 20:50:08.341 DEBUG: CancelThread is going to wait 60 secs.
> 2007-11-28 20:50:08.385 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db disabled.
> 2007-11-28 20:50:08.385 NOTICE:   COPYING test_table1
> 2007-11-28 20:50:08.386 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db has been created.
> 2007-11-28 20:50:08.386 DEBUG: copyTableToFile(): test_table1 to
> /tmp/test_table1.db
> 2007-11-28 20:50:08.386 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db has been finished
> without a cancel command.
> 2007-11-28 20:50:08.386 DEBUG: CancelThread is going to wait 60 secs.
> 2007-11-28 20:50:20.459 DEBUG: copyTableToFile(): done.
> 2007-11-28 20:50:20.459 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db disabled.
> 2007-11-28 20:50:20.459 NOTICE:     done.
> 2007-11-28 20:50:20.459 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db has been finished
> without a cancel command.
> 2007-11-28 20:50:20.459 NOTICE:   COPYING test_table2
> 2007-11-28 20:50:20.459 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db has been created.
> 2007-11-28 20:50:20.459 DEBUG: copyTableToFile(): test_table2 to
> /tmp/test_table2.db
> 2007-11-28 20:50:20.459 DEBUG: CancelThread is going to wait 60 secs.
> 2007-11-28 20:50:32.554 DEBUG: copyTableToFile(): done.
> 2007-11-28 20:50:32.554 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db disabled.
> 2007-11-28 20:50:32.554 NOTICE:     done.
> 2007-11-28 20:50:32.554 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db has been finished
> without a cancel command.
> 2007-11-28 20:50:32.554 NOTICE:   COPYING test_table3
> 2007-11-28 20:50:32.554 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db has been created.
> 2007-11-28 20:50:32.554 DEBUG: copyTableToFile(): test_table3 to
> /tmp/test_table3.db
> 2007-11-28 20:50:32.554 DEBUG: CancelThread is going to wait 60 secs.
> 2007-11-28 20:50:44.606 DEBUG: copyTableToFile(): done.
> 2007-11-28 20:50:44.606 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db disabled.
> 2007-11-28 20:50:44.606 NOTICE:     done.
> 2007-11-28 20:50:44.606 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db has been finished
> without a cancel command.
> 2007-11-28 20:50:44.607 NOTICE:   COPYING test_table4
> 2007-11-28 20:50:44.607 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db has been created.
> 2007-11-28 20:50:44.607 DEBUG: copyTableToFile(): test_table4 to
> /tmp/test_table4.db
> 2007-11-28 20:50:44.607 DEBUG: CancelThread is going to wait 60 secs.
> 2007-11-28 20:50:56.662 DEBUG: copyTableToFile(): done.
> 2007-11-28 20:50:56.662 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db disabled.
> 2007-11-28 20:50:56.662 NOTICE:     done.
> 2007-11-28 20:50:56.662 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db has been finished
> without a cancel command.
> 2007-11-28 20:50:56.662 DEBUG:    Snapshot xid=71776536
> 2007-11-28 20:50:56.662 NOTICE:   CREATE SNAPSHOT: done.
> 2007-11-28 20:50:56.662 NOTICE:   REBUILD FROM SNAPSHOT: start.
> 2007-11-28 20:51:01.677 DEBUG: rebuildTableFromFile(): test_table1
> from /tmp/test_table1.db
> 2007-11-28 20:51:01.677 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db has been created.
> 2007-11-28 20:51:01.677 DEBUG: CancelThread is going to wait 60 secs.
> 2007-11-28 20:51:08.339 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db has been finished
> without a cancel command.
> 2007-11-28 20:51:08.342 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db has been finished
> without a cancel command.
> 2007-11-28 20:51:08.344 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db has been finished
> without a cancel command.
> 2007-11-28 20:51:08.35 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db has been finished
> without a cancel command.
> 2007-11-28 20:51:28.678 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db disabled.
> 2007-11-28 20:51:28.678 DEBUG: rebuildTableFromFile(): done.
> 2007-11-28 20:51:28.678 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db has been finished
> without a cancel command.
> 2007-11-28 20:51:33.002 DEBUG: rebuildTableFromFile(): test_table2
> from /tmp/test_table2.db
> 2007-11-28 20:51:33.002 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db has been created.
> 2007-11-28 20:51:33.002 DEBUG: CancelThread is going to wait 60 secs.
> 2007-11-28 20:52:07.916 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db disabled.
> 2007-11-28 20:52:07.917 DEBUG: rebuildTableFromFile(): done.
> 2007-11-28 20:52:07.917 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db has been finished
> without a cancel command.
> 2007-11-28 20:52:12.239 DEBUG: rebuildTableFromFile(): test_table3
> from /tmp/test_table3.db
> 2007-11-28 20:52:12.239 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db has been created.
> 2007-11-28 20:52:12.239 DEBUG: CancelThread is going to wait 60 secs.
> 2007-11-28 20:52:58.378 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db disabled.
> 2007-11-28 20:52:58.378 DEBUG: rebuildTableFromFile(): done.
> 2007-11-28 20:52:58.378 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db has been finished
> without a cancel command.
> 2007-11-28 20:53:02.695 DEBUG: rebuildTableFromFile(): test_table4
> from /tmp/test_table4.db
> 2007-11-28 20:53:02.696 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db has been created.
> 2007-11-28 20:53:02.696 DEBUG: CancelThread is going to wait 60 secs.
> 2007-11-28 20:53:44.636 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db disabled.
> 2007-11-28 20:53:44.636 DEBUG: rebuildTableFromFile(): done.
> 2007-11-28 20:53:44.636 DEBUG: CancelThread: pid=17602,
> url=jdbc:postgresql://192.168.0.165:5432/forest_db has been finished
> without a cancel command.
> 2007-11-28 20:53:45.046 NOTICE:   REBUILD FROM SNAPSHOT: done.
> 2007-11-28 20:53:45.046 NOTICE: COPYING ALL TABLES: done.
> 2007-11-28 20:53:45.046 NOTICE: APPLYING LOG RECORD(S): start.
> 2007-11-28 20:53:45.245 DEBUG: getMaxMinTimestamp(): Xid
> [71776538...71842194] (input)
> 2007-11-28 20:53:45.392 DEBUG: getMaxMinTimestamp(): Timestamp
> [2007-11-28 20:50:04.853027...2007-11-28 20:50:04.853027] (result)
> 2007-11-28 20:53:45.392 DEBUG: openLogRecords(): xid=[71776538...71842194]
> 2007-11-28 20:53:46.241 DEBUG: # of log records = 65460
> org.postgresforest.tool.recovery.RecoveryException: applyLogRecord()
> failed. - /* 71776540/2007-11-28 20:50:04.853027 */ INSERT INTO
> test_table4(measure,login_id,ph_id,chs,aia_v1,conv_v1,aia_v2,conv_v2,rc,rc_ms,updt,correct,v_exe)
> values ('2006-05-15 02:00:00','testsah28       ','testsah28       ','1
>       ',null,'0',null,null,'2006-05-15 02:00:00','0','2006-05-15
> 02:00:00','0','0')
>         at org.postgresforest.tool.recovery.RecoveryManager.applyLogRecord(Unknown
> Source)
>         at org.postgresforest.tool.recovery.ForestRecovery.applyAllRecords(Unknown
> Source)
>         at org.postgresforest.tool.recovery.ForestRecovery.RecoveryPhase1(Unknown
> Source)
>         at org.postgresforest.tool.recovery.ForestRecovery.main(Unknown Source)
> $
> 
> また、障害インスタンス(オンラインリカバリ実行インスタンス)のサーバログには
> 以下のようなメッセージが出力されていました。
> 
> LOG:  unexpected EOF on client connection
> ERROR:  duplicate key violates unique constraint "test_table4_pkey"
> STATEMENT:  /* 71776540/2007-11-28 20:50:04.853027 */ INSERT INTO
> test_table4(measure,login_id,ph_id,chs,aia_v1,conv_v1,aia_v2,conv_v2,rc,rc_ms,updt,correct,v_exe)
> values ('2006-05-15 02:00:00','testsah28       ','testsah28       ','1
>       ',null,'0',n
> ull,null,'2006-05-15 02:00:00','0','2006-05-15 02:00:00','0','0')
> LOG:  unexpected EOF on client connection
> LOG:  unexpected EOF on client connection
> 
> _______________________________________________
> postgresforest-users mailing list
> postg****@lists*****
> http://lists.sourceforge.jp/mailman/listinfo/postgresforest-users
> 
> 




postgresforest-users メーリングリストの案内
Back to archive index