<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Durable-Execution | Roy Gabriel</title><link>https://roygabriel.dev/tags/durable-execution/</link><description>Roy Gabriel: DevOps Architect &amp; Applied AI Engineer. Technical blog on Go, MCP servers, Kubernetes, and production AI systems.</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Fri, 27 Feb 2026 03:18:04 +0000</lastBuildDate><atom:link href="https://roygabriel.dev/tags/durable-execution/index.xml" rel="self" type="application/rss+xml"/><item><title>Durable Agents with Temporal: Retries, Idempotency, and Long-Running State</title><link>https://roygabriel.dev/blog/durable-agents-with-temporal/</link><pubDate>Sat, 06 Dec 2025 12:00:00 -0500</pubDate><guid>https://roygabriel.dev/blog/durable-agents-with-temporal/</guid><description>&lt;h2 id="why-this-matters"&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;Agents are often framed as &amp;ldquo;reason + tools.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;In production, the actual problem is &lt;strong&gt;execution&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;calls fail&lt;/li&gt;
&lt;li&gt;networks flake&lt;/li&gt;
&lt;li&gt;credentials expire&lt;/li&gt;
&lt;li&gt;humans need to approve steps&lt;/li&gt;
&lt;li&gt;tasks take hours/days&lt;/li&gt;
&lt;li&gt;systems restart&lt;/li&gt;
&lt;li&gt;you need a forensic trail of what happened&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If your agent runtime is &amp;ldquo;one process with a loop,&amp;rdquo; you will eventually lose state and perform the same side effect twice.&lt;/p&gt;
&lt;p&gt;This is why workflow engines exist.&lt;/p&gt;</description><content:encoded>&lt;h2 id="why-this-matters"&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;Agents are often framed as &amp;ldquo;reason + tools.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;In production, the actual problem is &lt;strong&gt;execution&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;calls fail&lt;/li&gt;
&lt;li&gt;networks flake&lt;/li&gt;
&lt;li&gt;credentials expire&lt;/li&gt;
&lt;li&gt;humans need to approve steps&lt;/li&gt;
&lt;li&gt;tasks take hours/days&lt;/li&gt;
&lt;li&gt;systems restart&lt;/li&gt;
&lt;li&gt;you need a forensic trail of what happened&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If your agent runtime is &amp;ldquo;one process with a loop,&amp;rdquo; you will eventually lose state and perform the same side effect twice.&lt;/p&gt;
&lt;p&gt;This is why workflow engines exist.&lt;/p&gt;
&lt;p&gt;Temporal&amp;rsquo;s model - durable workflows with deterministic execution and event history - maps incredibly well to tool-using agents. Temporal explicitly requires workflow code to be deterministic and provides APIs for versioning long-running workflows. [1][2]&lt;/p&gt;
&lt;p&gt;This article is a production pattern: &lt;strong&gt;use Temporal to make agents durable.&lt;/strong&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="tldr"&gt;TL;DR&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Represent an agent run as a &lt;strong&gt;Temporal Workflow&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Make tool calls &lt;strong&gt;Activities&lt;/strong&gt; (retryable, timeout-bounded).&lt;/li&gt;
&lt;li&gt;Put side-effecting tools behind:
&lt;ul&gt;
&lt;li&gt;idempotency keys&lt;/li&gt;
&lt;li&gt;preview -&amp;gt; apply&lt;/li&gt;
&lt;li&gt;durable &amp;ldquo;exactly-once&amp;rdquo; semantics (from the workflow&amp;rsquo;s perspective)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Use Temporal&amp;rsquo;s retry policies for Activities and explicit failure handling. [3]&lt;/li&gt;
&lt;li&gt;Use event history and replay for forensics (Temporal events are first-class). [4]&lt;/li&gt;
&lt;li&gt;Use workflow versioning for safe evolution of long-running agents. [2]&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="contents"&gt;Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#why-agents-need-durable-execution"&gt;Why agents need durable execution&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#mapping-an-agent-to-temporal"&gt;Mapping an agent to Temporal&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#determinism-and-why-it-matters"&gt;Determinism and why it matters&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#retries-timeouts-and-idempotency"&gt;Retries, timeouts, and idempotency&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#human-in-the-loop-as-a-first-class-step"&gt;Human-in-the-loop as a first-class step&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#replay-audit-and-debugging"&gt;Replay, audit, and debugging&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#versioning-evolving-agents-safely"&gt;Versioning: evolving agents safely&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#a-production-checklist"&gt;A production checklist&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#references"&gt;References&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="why-agents-need-durable-execution"&gt;Why agents need durable execution&lt;/h2&gt;
&lt;p&gt;A few failure modes you&amp;rsquo;ll recognize:&lt;/p&gt;
&lt;h3 id="partial-side-effects"&gt;Partial side effects&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;agent creates a ticket&lt;/li&gt;
&lt;li&gt;process dies before storing the ticket ID&lt;/li&gt;
&lt;li&gt;agent retries and creates a duplicate&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="long-running-waits"&gt;Long-running waits&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&amp;ldquo;wait for PR approvals&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;wait for a CI pipeline&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;wait for a meeting to complete&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If your agent can&amp;rsquo;t wait durably, it becomes a polling daemon.&lt;/p&gt;
&lt;h3 id="human-approval"&gt;Human approval&lt;/h3&gt;
&lt;p&gt;Some steps should not be automated:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&amp;ldquo;apply to prod&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;send email&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;delete resources&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You need durable pause/resume with a clean audit trail.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="mapping-an-agent-to-temporal"&gt;Mapping an agent to Temporal&lt;/h2&gt;
&lt;h3 id="workflow--agent-run"&gt;Workflow = agent run&lt;/h3&gt;
&lt;p&gt;One agent run becomes a single Temporal Workflow Execution. Temporal workflows are designed for long-running, durable coordination. [5]&lt;/p&gt;
&lt;p&gt;Inside the workflow you model steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;interpret goal&lt;/li&gt;
&lt;li&gt;choose tools&lt;/li&gt;
&lt;li&gt;call tools&lt;/li&gt;
&lt;li&gt;react to results&lt;/li&gt;
&lt;li&gt;request approvals&lt;/li&gt;
&lt;li&gt;finalize output&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="activities--tool-calls-and-external-io"&gt;Activities = tool calls and external IO&lt;/h3&gt;
&lt;p&gt;All external calls should be Activities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;MCP tool calls&lt;/li&gt;
&lt;li&gt;HTTP calls&lt;/li&gt;
&lt;li&gt;DB writes&lt;/li&gt;
&lt;li&gt;notifications&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Why? Activities are where retries and timeouts belong. Temporal defines retry policies as configuration for how and when to retry failures. [3]&lt;/p&gt;
&lt;h3 id="signals--external-events"&gt;Signals = external events&lt;/h3&gt;
&lt;p&gt;Use signals for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;human approvals&lt;/li&gt;
&lt;li&gt;&amp;ldquo;cancel&amp;rdquo;&lt;/li&gt;
&lt;li&gt;updated user intent&lt;/li&gt;
&lt;li&gt;out-of-band events (&amp;ldquo;incident resolved&amp;rdquo;)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="queries--introspection"&gt;Queries = introspection&lt;/h3&gt;
&lt;p&gt;Expose workflow state:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;current step&lt;/li&gt;
&lt;li&gt;last tool call&lt;/li&gt;
&lt;li&gt;pending approvals&lt;/li&gt;
&lt;li&gt;budget remaining&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="determinism-and-why-it-matters"&gt;Determinism and why it matters&lt;/h2&gt;
&lt;p&gt;Temporal requires workflow code to be deterministic. [1] Determinism is what allows Temporal to replay history and rebuild state after worker crashes.&lt;/p&gt;
&lt;p&gt;Practical consequence:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Don&amp;rsquo;t do IO in workflow code.&lt;/li&gt;
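&lt;li&gt;Don&amp;rsquo;t read the current time directly in workflow code; use the SDK&amp;rsquo;s workflow-aware clock (for example, &lt;code&gt;workflow.Now&lt;/code&gt; in the Go SDK).&lt;/li&gt;
&lt;li&gt;Don&amp;rsquo;t call random generators without deterministic control (for example, wrap the call in &lt;code&gt;workflow.SideEffect&lt;/code&gt; in the Go SDK so the value is recorded in history).&lt;/li&gt;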
&lt;li&gt;Keep workflow logic as &amp;ldquo;orchestration,&amp;rdquo; not execution.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you violate determinism, replay can fail with non-determinism errors, because the code no longer produces the commands recorded in history. Temporal&amp;rsquo;s docs emphasize this constraint and the need for careful changes. [1][2]&lt;/p&gt;
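&lt;p&gt;To make the replay idea concrete, here is a minimal toy sketch in plain Go. It does not use the Temporal SDK: &lt;code&gt;Run&lt;/code&gt;, &lt;code&gt;Step&lt;/code&gt;, and &lt;code&gt;agent&lt;/code&gt; are hypothetical names invented for illustration, and real event histories are far richer. It only shows why orchestration code that is deterministic can be re-run against recorded results without repeating the real work.&lt;/p&gt;

```go
package main

import "fmt"

// Run is a hypothetical, toy stand-in for a workflow execution. It records
// each step result the first time and returns the recorded value on replay.
type Run struct {
	history []string // recorded step results, in order
	cursor  int      // index of the next step in the current pass
	calls   int      // how many times real work actually ran
}

// Step runs fn only if no recorded result exists for this position.
func (r *Run) Step(fn func() string) string {
	if r.cursor == len(r.history) {
		r.history = append(r.history, fn())
		r.calls++
	}
	result := r.history[r.cursor]
	r.cursor++
	return result
}

// agent is the orchestration logic: it only sequences steps and does no IO.
func agent(r *Run) string {
	plan := r.Step(func() string { return "plan:create-ticket" })
	ticket := r.Step(func() string { return "TICKET-1" })
	return plan + " then " + ticket
}

func main() {
	run := new(Run)
	first := agent(run)

	// Simulate a worker restart: same history, fresh pass over the code.
	run.cursor = 0
	replayed := agent(run)

	// The replayed pass rebuilds the same state without re-running any step.
	fmt.Println(first == replayed, run.calls)
}
```

&lt;p&gt;If &lt;code&gt;agent&lt;/code&gt; branched on the wall clock or a random number, the second pass could ask for steps in a different order than the history recorded, which is the situation the determinism rules are meant to prevent.&lt;/p&gt;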
&lt;hr&gt;
&lt;h2 id="retries-timeouts-and-idempotency"&gt;Retries, timeouts, and idempotency&lt;/h2&gt;
&lt;h3 id="retry-policies-activities"&gt;Retry policies (Activities)&lt;/h3&gt;
&lt;p&gt;Temporal retry policies control backoff and retry behavior for activity failures. [3]&lt;/p&gt;
&lt;p&gt;Use them intentionally:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;retries for transient failures (rate limits, timeouts)&lt;/li&gt;
&lt;li&gt;limited retries for &amp;ldquo;probably broken&amp;rdquo; failures&lt;/li&gt;
&lt;li&gt;exponential backoff with jitter (avoid thundering herd)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="timeouts-are-not-optional"&gt;Timeouts are not optional&lt;/h3&gt;
&lt;p&gt;Set explicit timeouts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;ScheduleToStart&lt;/li&gt;
&lt;li&gt;StartToClose&lt;/li&gt;
&lt;li&gt;ScheduleToClose&lt;/li&gt;
&lt;/ul&gt;
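&lt;p&gt;Temporal&amp;rsquo;s default retry policy does not cap the number of attempts, so without a bounding timeout (or an explicit attempt limit) a failing Activity can retry &amp;ldquo;forever&amp;rdquo; in practice.&lt;/p&gt;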
&lt;h3 id="idempotency-keys-for-side-effects"&gt;Idempotency keys for side effects&lt;/h3&gt;
&lt;p&gt;Your workflow can be retried/replayed. Your Activity can be retried. Upstream systems can time out after performing the operation.&lt;/p&gt;
&lt;p&gt;For side-effecting tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;generate an idempotency key in the workflow, deterministically (for example, from the workflow ID and the step number)&lt;/li&gt;
&lt;li&gt;pass it into the tool Activity&lt;/li&gt;
&lt;li&gt;store &amp;ldquo;operation result&amp;rdquo; in workflow state&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When the Activity retries, it reuses the key so the upstream system deduplicates.&lt;/p&gt;
&lt;p&gt;This is the difference between &amp;ldquo;retries&amp;rdquo; and &amp;ldquo;duplicates.&amp;rdquo;&lt;/p&gt;
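&lt;p&gt;A minimal sketch of that flow, in plain Go with no Temporal dependency. &lt;code&gt;TicketAPI&lt;/code&gt;, &lt;code&gt;CreateTicket&lt;/code&gt;, and &lt;code&gt;IdempotencyKey&lt;/code&gt; are hypothetical names, and the sketch assumes the upstream system stores keys and returns the original result for a repeated key. Not every real API does this, so check before relying on it.&lt;/p&gt;

```go
package main

import "fmt"

// TicketAPI is a hypothetical upstream system that remembers idempotency keys.
type TicketAPI struct {
	tickets map[string]string // idempotency key to ticket ID
}

func NewTicketAPI() TicketAPI {
	return TicketAPI{tickets: map[string]string{}}
}

// CreateTicket creates a ticket once per key; a repeated key returns the
// ticket that was created the first time instead of creating another.
func (api TicketAPI) CreateTicket(key, title string) string {
	if id, seen := api.tickets[key]; seen {
		return id
	}
	id := fmt.Sprintf("TICKET-%d", len(api.tickets)+1)
	api.tickets[key] = id
	return id
}

// IdempotencyKey derives a stable key from values that do not change
// across Activity retries or workflow replay.
func IdempotencyKey(workflowID string, step int) string {
	return fmt.Sprintf("%s/step-%d", workflowID, step)
}

func main() {
	api := NewTicketAPI()
	key := IdempotencyKey("agent-run-42", 3)

	first := api.CreateTicket(key, "Rotate credentials")
	// Simulate a retry after the first response was lost: same key, same ticket.
	retried := api.CreateTicket(key, "Rotate credentials")

	fmt.Println(first, retried, len(api.tickets)) // prints "TICKET-1 TICKET-1 1"
}
```

&lt;p&gt;The key is derived rather than randomly generated on purpose: a random key produced inside the Activity would change on every retry and defeat the deduplication.&lt;/p&gt;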
&lt;hr&gt;
&lt;h2 id="human-in-the-loop-as-a-first-class-step"&gt;Human-in-the-loop as a first-class step&lt;/h2&gt;
&lt;p&gt;For dangerous operations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;pause&lt;/li&gt;
&lt;li&gt;ask for approval with the plan summary&lt;/li&gt;
&lt;li&gt;resume when approved&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Temporal workflows can wait for signals without holding threads like a traditional process would.&lt;/p&gt;
&lt;p&gt;This is one of the cleanest ways to build a &amp;ldquo;preview -&amp;gt; approve -&amp;gt; apply&amp;rdquo; flow without building a bunch of custom state machinery.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="replay-audit-and-debugging"&gt;Replay, audit, and debugging&lt;/h2&gt;
&lt;p&gt;Temporal events are recorded as part of the workflow&amp;rsquo;s event history. [4]&lt;/p&gt;
&lt;p&gt;This yields production superpowers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;reconstruct exactly what happened&lt;/li&gt;
&lt;li&gt;understand why a step was taken&lt;/li&gt;
&lt;li&gt;replay a run to test a bug fix&lt;/li&gt;
&lt;li&gt;implement &amp;ldquo;reset&amp;rdquo; patterns (carefully)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For agents, this is the difference between &amp;ldquo;the model did something weird&amp;rdquo; and &amp;ldquo;step 7 called tool X with args Y after tool Z returned response R.&amp;rdquo;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="versioning-evolving-agents-safely"&gt;Versioning: evolving agents safely&lt;/h2&gt;
&lt;p&gt;Agent logic will change. Prompts will change. Tool contracts will change.&lt;/p&gt;
&lt;p&gt;If you have long-running agents, you need a strategy that doesn&amp;rsquo;t break in-flight executions.&lt;/p&gt;
&lt;p&gt;Temporal provides workflow versioning mechanisms because determinism means you can&amp;rsquo;t simply change workflow logic without thought. [2]&lt;/p&gt;
&lt;p&gt;Production approach:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;keep existing executions on old code paths&lt;/li&gt;
&lt;li&gt;route new executions to new paths&lt;/li&gt;
&lt;li&gt;migrate intentionally&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This prevents &amp;ldquo;deploy broke every running workflow.&amp;rdquo;&lt;/p&gt;
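&lt;p&gt;The routing idea can be sketched with a toy version marker. This is a simplified model loosely inspired by the Go SDK&amp;rsquo;s &lt;code&gt;workflow.GetVersion&lt;/code&gt;, not its implementation: the &lt;code&gt;History&lt;/code&gt; type, the &lt;code&gt;notify-via-slack&lt;/code&gt; change ID, and the step names are all hypothetical. Runs whose history predates the change stay on the old path; new runs record the new version and take the new path.&lt;/p&gt;

```go
package main

import "fmt"

// DefaultVersion marks a run whose history predates the change.
const DefaultVersion = -1

// History is a hypothetical, toy stand-in for a workflow event history.
type History struct {
	versions  map[string]int // change ID to recorded version
	replaying bool           // true when re-running an existing history
}

// GetVersion returns the recorded version for a change. A replayed history
// with no marker gets DefaultVersion; a new run records maxSupported.
func (h History) GetVersion(changeID string, maxSupported int) int {
	if v, ok := h.versions[changeID]; ok {
		return v
	}
	if h.replaying {
		return DefaultVersion
	}
	h.versions[changeID] = maxSupported
	return maxSupported
}

// notifyStep picks the code path based on the version the run is pinned to.
func notifyStep(h History) string {
	if h.GetVersion("notify-via-slack", 1) == DefaultVersion {
		return "send-email" // original behavior, kept for in-flight runs
	}
	return "send-slack" // new behavior for runs started after the deploy
}

func main() {
	inFlight := History{versions: map[string]int{}, replaying: true}
	fresh := History{versions: map[string]int{}}
	fmt.Println(notifyStep(inFlight), notifyStep(fresh))
}
```

&lt;p&gt;Once no in-flight run still depends on the old branch, it can be removed in a later deploy, which is the &amp;ldquo;migrate intentionally&amp;rdquo; step.&lt;/p&gt;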
&lt;hr&gt;
&lt;h2 id="a-production-checklist"&gt;A production checklist&lt;/h2&gt;
&lt;h3 id="architecture"&gt;Architecture&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Agent runs modeled as workflows; tool calls as activities.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; External events modeled as signals; state exposed via queries.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="determinism"&gt;Determinism&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; No IO in workflow code (only orchestration).&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Workflow changes use versioning strategy. [2]&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="reliability"&gt;Reliability&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Retry policies defined for Activities. [3]&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Timeouts defined and bounded.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Idempotency keys used for side-effecting actions.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="governance"&gt;Governance&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Human approval gates exist for dangerous operations.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Audit trails include plan summaries and results.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="operability"&gt;Operability&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Event history used for debugging and incident analysis. [4]&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="references"&gt;References&lt;/h2&gt;
&lt;p&gt;[1] Temporal - Workflow Definition (determinism requirement): &lt;a href="https://docs.temporal.io/workflow-definition" target="_blank" rel="noopener noreferrer"&gt;https://docs.temporal.io/workflow-definition&lt;/a&gt;&lt;br&gt;
[2] Temporal Go SDK - Versioning (evolving deterministic workflows safely): &lt;a href="https://docs.temporal.io/develop/go/versioning" target="_blank" rel="noopener noreferrer"&gt;https://docs.temporal.io/develop/go/versioning&lt;/a&gt;&lt;br&gt;
[3] Temporal - Retry Policies (how and when retries happen): &lt;a href="https://docs.temporal.io/encyclopedia/retry-policies" target="_blank" rel="noopener noreferrer"&gt;https://docs.temporal.io/encyclopedia/retry-policies&lt;/a&gt;&lt;br&gt;
[4] Temporal - Events reference (event history): &lt;a href="https://docs.temporal.io/references/events" target="_blank" rel="noopener noreferrer"&gt;https://docs.temporal.io/references/events&lt;/a&gt;&lt;br&gt;
[5] Temporal - Workflows overview: &lt;a href="https://docs.temporal.io/workflows" target="_blank" rel="noopener noreferrer"&gt;https://docs.temporal.io/workflows&lt;/a&gt;
&lt;/p&gt;</content:encoded></item></channel></rss>