DurableExecutionを実装検証から理解する.pdf

Understanding Durable Execution through Implementation and Verification
JAWS-UG Nagoya, Saturday, January 31, 2026
Lambda durable functions

Self-introduction
Hiroshi Kato (加藤寛士), NEC Solution Innovators, Ltd.
• Cloud engineer
• JAWS-UG Nagoya organizer
@Hircha12

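What is Lambda Durable Functions?
• Long-running: workflows of up to one year can run on Lambda
• Resilience: execution results are retained even if processing is interrupted, so subsequent processing can continue
• Cost efficiency: no compute charges accrue while waiting

Common questions
• If an error occurs, does it retry and recover automatically?
• Has Lambda's 15-minute limit been extended to one year?
Let's check the actual behavior to deepen our understanding.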

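How Lambda Durable Functions works
The key points are replay and the management of processing state.
[Figure: a timeline from start to up to one year, made up of repeated Lambda runs, each within 15 minutes and connected by replay]
Durable Execution is an execution model that continues processing by replaying the Lambda function while referring to the state saved as the execution history of Durable Operations.
Note: replay is triggered not only by errors but also by various other conditions.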

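Replay and Durable Operation execution model
[Figure: a Durable Execution runs normal processing and Durable Operations from start to end; on replay it is re-executed]
A Durable Operation is one of the following:
• An operation whose result is kept as Execution history
• An operation that suspends and resumes the Execution
Points to note:
• The SDK provides the means to define and control Durable Operations
• A Durable Operation that has already completed is not executed again
• When a Durable Execution is suspended and its resume condition is met, the Execution is resumed by replay
• The retention period for Durable Operation state history can be set to a maximum of 90 days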

Replay and Durable Operation execution model (annotated with a 10-second wait)
① An operation that leaves its result in the Execution
② Wait 10 seconds (suspends the Execution)
③ Re-executed after 10 seconds
④ Executed again on replay
⑤ Not executed on replay
⑥ Not executed on replay
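The annotated flow above can be mimicked with a small toy model. This is a sketch for intuition only, not the AWS SDK: ToyContext, Suspended, and run_to_completion are hypothetical names, the history is a plain dict, and the toy "timer" fires immediately instead of after 10 seconds.

```python
# Toy model of replay. NOT the AWS SDK; every name here is hypothetical.

class Suspended(Exception):
    """Ends the current run until a resume condition is met."""


class ToyContext:
    def __init__(self, history):
        self.history = history  # survives across runs: operation name -> result

    def step(self, name, fn):
        if name in self.history:      # completed in an earlier run: do not run again
            return self.history[name]
        result = fn()                 # first time: run it and record the result
        self.history[name] = result
        return result

    def wait(self, name):
        if name in self.history:      # the wait already elapsed: continue
            return
        raise Suspended(name)         # otherwise suspend the execution


calls = []  # records what actually ran, to make the replay visible


def work():
    calls.append("step")
    return 42


def handler(ctx):
    calls.append("normal code")       # plain code runs again on every replay
    value = ctx.step("work", work)    # runs once, then is served from history
    ctx.wait("wait-10s")              # suspends the first run
    return value


def run_to_completion(history):
    """Replay the handler until it finishes; returns (result, number of runs)."""
    runs = 0
    while True:
        runs += 1
        try:
            return handler(ToyContext(history)), runs
        except Suspended as suspended:
            history[suspended.args[0]] = None  # toy timer: mark the wait as elapsed


result, runs = run_to_completion({})
```

Here the handler runs twice, yet work() runs only once, while the plain code at the top of the handler runs on both passes.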

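Key points of Durable Execution
• Durable Execution is an execution model premised on re-executing the Lambda function by replay
• Replay is not error handling; it is a control mechanism built into the execution model to keep a Durable Execution going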

SDK and runtimes
• Supported runtimes (as of 2026/1/31)
  Node.js: nodejs22.x, nodejs24.x
  Python: python3.13, python3.14
• SDK
  https://github.com/aws/aws-durable-execution-sdk-js
  https://github.com/aws/aws-durable-execution-sdk-python

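Basic operations
Core operations provided by the SDK, grouped by category:
Operations that save execution results
• Steps: keep the execution result as Execution history and form a boundary that is not re-executed on replay
• Parallel / Map: keep the results of multiple Durable Operations together as history
• Child contexts: group operations and organize the history management of the Durable Operations inside
Operations with a resume condition
• Wait: suspends the Execution and resumes it once the wait time has elapsed
• Callback: suspends the Execution and resumes it on a response from outside
• Invoke: invokes another Execution and resumes with its completion result
Log output for durable execution
• Logger: automatically removes log duplication caused by replay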

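Basic implementation (python)
Creating the function from the AWS Management Console
• Check the option that enables durable execution. Note that it cannot be changed afterwards.
• A durable execution tab is then displayed.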

Basic implementation (python)
Running the sample code that comes pre-implemented is enough to understand the basic ideas.
[Figure: the sample code]

@durable_execution
• A decorator that replaces the handler's execution model and context
• The context becomes a DurableContext
• The DurableContext Logger is covered later

@durable_step
• A decorator that defines a function as a Step
• The context becomes a StepContext
• The StepContext Logger is covered later
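To make the role of the two decorators more concrete, here is a minimal sketch of how decorators of this kind could swap in a different context. It is a toy analogue under assumptions, not the SDK's implementation: toy_durable_execution, toy_durable_step, ToyDurableContext, and ToyStepContext are all hypothetical names, and no checkpointing is done.

```python
import functools


class ToyStepContext:
    """Hypothetical stand-in for a step-level context."""
    def __init__(self, name):
        self.name = name


class ToyDurableContext:
    """Hypothetical stand-in for the context handed to the handler."""
    def __init__(self, lambda_context):
        self.lambda_context = lambda_context

    def step(self, deferred):
        # The step body receives its own step-level context.
        return deferred(ToyStepContext(deferred.__name__))


def toy_durable_execution(handler):
    """Wrap a handler so it receives a ToyDurableContext instead of the raw context."""
    @functools.wraps(handler)
    def wrapper(event, lambda_context):
        return handler(event, ToyDurableContext(lambda_context))
    return wrapper


def toy_durable_step(fn):
    """Turn fn(step_ctx, *args) into a factory that returns a deferred step."""
    @functools.wraps(fn)
    def factory(*args, **kwargs):
        def deferred(step_ctx):
            return fn(step_ctx, *args, **kwargs)
        deferred.__name__ = fn.__name__
        return deferred
    return factory


@toy_durable_step
def work(step_ctx, x):
    return f"{step_ctx.name}:{x * 2}"


@toy_durable_execution
def handler(event, context):
    # context is now the toy durable context, not the value passed by the caller
    return type(context).__name__, context.step(work(event["x"]))


outcome = handler({"x": 21}, None)
```

The point of the sketch is only the shape: the handler decorator controls what the handler sees as its context, and the step decorator defers the function body so the context can decide whether and how to run it.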

Let's look at the actual behavior ① Normal case
Verification code: to dig deeper, logging, error-path handling, and similar additions were made to the sample code.
Outline of the verification code, from start to end:
① print output
② DurableContextLogger output
③ step function execution
④ StepContextLogger output
⑤ print output
⑥ DurableContextLogger output
⑦ wait function execution
⑧ print output
⑨ DurableContextLogger output

Let's look at the actual behavior ① Normal case
CloudWatch Logs when the verification code runs normally

First run: ① print output, ② DurableContextLogger output, ③ step function execution, ④ StepContextLogger output, ⑤ print output, ⑥ DurableContextLogger output, ⑦ wait function execution
• The log shows the flow through the log output, the step function, and the wait function.

After replay: ① print output, ⑤ print output, ⑧ print output, ⑨ DurableContextLogger output
• The step function and the wait function are not executed on replay.
• Duplicate DurableContextLogger output is suppressed on replay.
• Output that is not a duplicate is also suppressed.
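The difference between print and the durable logger seen in these logs can be reproduced with a toy model. The sketch below assumes one plausible mechanism, namely that the logger stays silent until the run has passed every operation already recorded in the history; whether the SDK works exactly this way is an assumption, and ToyContext and Suspended are hypothetical names.

```python
# Toy model of replay-aware logging. NOT the AWS SDK; names are hypothetical.

class Suspended(Exception):
    pass


class ToyContext:
    def __init__(self, history, log_lines):
        self.history = history          # completed operations from earlier runs
        self.log_lines = log_lines
        self.unreplayed = len(history)  # recorded operations not yet passed in this run

    def info(self, message):
        if self.unreplayed == 0:        # log only once replay has caught up
            self.log_lines.append(message)

    def _replayed(self, name):
        if name in self.history:
            self.unreplayed -= 1
            return True
        return False

    def step(self, name, fn):
        if not self._replayed(name):
            self.history[name] = fn()
        return self.history[name]

    def wait(self, name):
        if not self._replayed(name):
            raise Suspended(name)


printed, logged = [], []


def handler(ctx):
    printed.append("1 print")
    ctx.info("2 logger")
    ctx.step("work", lambda: "done")
    printed.append("5 print")
    ctx.info("6 logger")
    ctx.wait("wait")
    printed.append("8 print")
    ctx.info("9 logger")


history = {}
try:
    handler(ToyContext(history, logged))   # first run suspends at the wait
except Suspended as suspended:
    history[suspended.args[0]] = None      # toy timer: the wait has now elapsed
handler(ToyContext(history, logged))       # replay
```

In this toy, print-style output appears again on replay (1 and 5 twice, then 8), while each logger message appears exactly once across both runs.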

Let's look at the actual behavior ① Normal case
[Figure: normal case. The first run covers ① to ⑦; the re-execution by replay covers ① to ⑨ and runs to the end.]

Let's look at the actual behavior ① Normal case
When tracing logs, focus on requestId and operationId.
• DurableContextLogger

Implementation:
    context.logger.info("LOGGER: start execution", extra={"mode": mode})

Output:
    {
      "timestamp": "2026-01-28T06:16:45Z",
      "level": "INFO",
      "message": "LOGGER: start execution",
      "logger": "root",
      "requestId": "0c657120-8529-44a2-acc2-f80b1d059215",
      "executionArn": "arn:aws:lambda:us-east…(omitted)",
      "mode": "success"
    }

requestId
• An ID assigned per Lambda invocation
• Differs on every replay

Let's look at the actual behavior ① Normal case
When tracing logs, focus on requestId and operationId.
• StepContextLogger

Implementation:
    step_ctx.logger.info("STEP: work() started", extra={"mode": mode})

Output:
    {
      "timestamp": "2026-01-28T06:16:45Z",
      "level": "INFO",
      "message": "STEP: work() started",
      "logger": "root",
      "requestId": "0c657120-8529-44a2-acc2-f80b1d059215",
      "executionArn": "arn:aws:lambda:us-east…(omitted)",
      "operationName": "work-step",
      "attempt": 1,
      "operationId": "1ced8f5be2db23a6…(omitted)",
      "mode": "success"
    }

requestId
• An ID assigned per Lambda invocation
• Differs on every replay
operationId
• An ID assigned per step
• Stays the same across replays
attempt
• The number of times the step has been executed
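As a rough aid for this kind of tracing, step log lines can be grouped by operationId so that the attempts and the requestId of each invocation line up. The sketch below uses only the Python standard library; the log lines and the IDs in them (req-A, req-B, op-1) are made up for illustration and only imitate the field names shown in the output above.

```python
import json
from collections import defaultdict

# Hypothetical log lines using the field names from the StepContextLogger output.
log_lines = [
    '{"requestId": "req-A", "operationId": "op-1", "attempt": 1, "message": "STEP: work() started"}',
    '{"requestId": "req-B", "operationId": "op-1", "attempt": 2, "message": "STEP: work() started"}',
    '{"requestId": "req-B", "message": "LOGGER: start execution"}',
]


def group_by_operation(lines):
    """Map operationId -> list of (attempt, requestId), ordered by attempt."""
    grouped = defaultdict(list)
    for line in lines:
        record = json.loads(line)
        # Lines without an operationId (like the DurableContextLogger sample) are skipped.
        if "operationId" in record:
            grouped[record["operationId"]].append((record["attempt"], record["requestId"]))
    return {operation: sorted(entries) for operation, entries in grouped.items()}


trace = group_by_operation(log_lines)
```

Reading the result for one operationId shows at a glance how many attempts a step needed and which Lambda invocation each attempt ran in.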

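Tips
• From the durable execution tab, you can trace the operations and the event history.
• If an execution never finishes, for example because of a bug in the code, you can stop it from this screen.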

Let's look at the actual behavior ② Error case
A Runtime Error is raised in the verification code.
[Figure: the verification code flow with the point where the Runtime Error is raised]

Let's look at the actual behavior ② Error case
CloudWatch Logs when the verification code raises a Runtime Error

First run: ① print output, ② DurableContextLogger output, ③ step function execution, ④ StepContextLogger output
After replay: ① print output, ③ step function execution, ④ StepContextLogger output
• The execution is replayed and the step function is executed again.
• attempt counts up with each replay.
• operationId does not change across replays.
• requestId changes with each replay.

Let's look at the actual behavior ② Error case
[Figure: error case. Each run performs ① to ④ and hits the Runtime Error, and the run is then repeated.]
How did it end?

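Let's look at the actual behavior ② Error case
Checking the operations from the durable execution tab:
• The execution ended after 5 retries (6 attempts in total, counting the initial attempt).
• The retry intervals look like exponential backoff.
The default retry algorithm depends on the SDK implementation, so configuring it explicitly is recommended.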
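The relationship between attempts and retries can be checked with a plain retry loop. This is a generic sketch, not the SDK's retry logic: run_step_with_retries is a hypothetical helper, delays are left out, and the value 6 simply mirrors the count observed above rather than a documented default.

```python
def run_step_with_retries(step, max_attempts):
    """Call step(attempt) until it succeeds or max_attempts tries are used up."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return step(attempt)
        except RuntimeError as error:
            last_error = error        # remember the failure and try again
    raise last_error                  # every attempt failed


attempts_seen = []


def always_fails(attempt):
    attempts_seen.append(attempt)
    raise RuntimeError("simulated failure")


try:
    run_step_with_retries(always_fails, max_attempts=6)
    outcome = "succeeded"
except RuntimeError:
    outcome = "failed"

retries = len(attempts_seen) - 1      # the first attempt is not a retry
```

With 6 attempts allowed and a step that always fails, the loop makes the initial attempt plus 5 retries and then gives up, which matches the shape of the observed behavior.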

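Let's look at the actual behavior ③ Retry control
Retry strategies
• Configure the retry behavior of a step (controlled through StepConfig)
• Sample on GitHub: https://github.com/aws/aws-durable-execution-sdk-python/blob/main/docs/advanced/error-handling.md#retry-strategies

Parameters (name: description)
• max_attempts: maximum number of attempts (including the initial attempt)
• initial_delay_seconds: initial delay before the first retry
• max_delay_seconds: maximum delay between retries
• backoff_rate: multiplier for exponential backoff
• jitter_strategy: jitter strategy that adds randomness to the delay
• retryable_errors: list of error message patterns to retry
• retryable_error_types: list of exception types to retry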
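To see how these parameters interact, the delay schedule can be computed by hand. The sketch below assumes the common exponential backoff formula (initial delay multiplied by the backoff rate for each further retry, capped at the maximum delay) and ignores jitter; whether the SDK computes delays exactly this way is an assumption. ToyRetryConfig and its default values are hypothetical and only reuse the parameter names from the list above.

```python
from dataclasses import dataclass


@dataclass
class ToyRetryConfig:
    """Hypothetical container; the defaults are illustrative, not SDK defaults."""
    max_attempts: int = 3
    initial_delay_seconds: float = 1.0
    max_delay_seconds: float = 60.0
    backoff_rate: float = 2.0


def retry_delays(config):
    """Delay before each retry: initial * rate**n, capped at max_delay_seconds."""
    retries = config.max_attempts - 1  # max_attempts includes the initial attempt
    return [
        min(config.initial_delay_seconds * config.backoff_rate ** n,
            config.max_delay_seconds)
        for n in range(retries)
    ]


delays = retry_delays(
    ToyRetryConfig(max_attempts=5, initial_delay_seconds=1,
                   max_delay_seconds=5, backoff_rate=2)
)
default_delays = retry_delays(ToyRetryConfig())
```

In the first call the delays double from 1 second until the cap of 5 seconds takes over on the fourth retry; with max_attempts=3 there are only two delays, because only two retries follow the initial attempt.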

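Let's look at the actual behavior ③ Retry control
Running with a modified retry strategy:
• max_attempts (maximum number of attempts): 3
• backoff_rate (exponential backoff multiplier): 2
Result with max_attempts=3:
• The execution ended after 2 retries (3 attempts in total, counting the initial attempt).
• The exponential backoff intervals changed.
Depending on the SDK version and implementation differences, some parameters may not be available.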

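Summary: using Durable Execution correctly
• Durable Execution is an execution model premised on replay.
• Replay is not error handling; it is part of the execution model.
• Lambda's 15-minute limit does not change. The Execution is split up and reconstructed so that it can continue.
• The state of Durable Operations is persisted as Execution history.
• Confine processing that affects the outside world to a step, the boundary that is not re-executed on replay.
• The shortcut to understanding is reading the logs and the event history.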
