Durable Agents with Temporal: Retries, Idempotency, and Long-Running State

December 6, 2025 · 5 min read

Why this matters

Agents are often framed as “reason + tools.”

In production, the actual problem is execution:

  • calls fail
  • networks flake
  • credentials expire
  • humans need to approve steps
  • tasks take hours/days
  • systems restart
  • you need a forensic trail of what happened

If your agent runtime is “one process with a loop,” you will eventually lose state and repeat a side effect that should have happened once.

This is why workflow engines exist.

Temporal’s model - durable workflows with deterministic execution and event history - maps incredibly well to tool-using agents. Temporal explicitly requires workflow code to be deterministic and provides APIs for versioning long-running workflows. [1][2]

This article is a production pattern: use Temporal to make agents durable.


TL;DR

  • Represent an agent run as a Temporal Workflow.
  • Make tool calls Activities (retryable, timeout-bounded).
  • Put side-effecting tools behind:
      • idempotency keys
      • preview -> apply
      • durable “exactly-once” semantics (from the workflow’s perspective)
  • Use Temporal’s retry policies for Activities and explicit failure handling. [3]
  • Use event history and replay for forensics (Temporal events are first-class). [4]
  • Use workflow versioning for safe evolution of long-running agents. [2]


Why agents need durable execution

A few failure modes you’ll recognize:

Partial side effects

  • agent creates a ticket
  • process dies before storing the ticket ID
  • agent retries and creates a duplicate

Long-running waits

  • “wait for PR approvals”
  • “wait for a CI pipeline”
  • “wait for a meeting to complete”

If your agent can’t wait durably, it becomes a polling daemon.

Human approval

Some steps should not be automated:

  • “apply to prod”
  • “send email”
  • “delete resources”

You need durable pause/resume with a clean audit trail.

Mapping an agent to Temporal

Workflow = agent run

One agent run becomes a single Temporal Workflow Execution. Temporal workflows are designed for long-running, durable coordination. [5]

Inside the workflow you model steps:

  • interpret goal
  • choose tools
  • call tools
  • react to results
  • request approvals
  • finalize output

Activities = tool calls and external IO

All external calls should be Activities:

  • MCP tool calls
  • HTTP calls
  • DB writes
  • notifications

Why? Activities are where retries and timeouts belong. Temporal defines retry policies as configuration for how and when to retry failures. [3]
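To make the split concrete, here is a minimal sketch in plain Go, with no Temporal SDK involved. The names `Activity`, `Step`, `RunAgent`, and `brokenTool` are made up for illustration. The point is only the shape: the “workflow” function sequences steps and records results, and everything that touches the outside world sits behind an `Activity`.

```go
package main

import (
	"errors"
	"fmt"
)

// Activity is any call that touches the outside world (hypothetical type).
type Activity func(input string) (string, error)

// Step pairs a human-readable name with the Activity that performs it.
type Step struct {
	Name string
	Run  Activity
}

// RunAgent plays the role of the workflow: it only sequences steps and
// records their results. It performs no IO itself.
func RunAgent(goal string, steps []Step) ([]string, error) {
	var log []string
	input := goal
	for _, s := range steps {
		out, err := s.Run(input)
		if err != nil {
			return log, fmt.Errorf("step %q failed: %w", s.Name, err)
		}
		log = append(log, s.Name+" -> "+out)
		input = out
	}
	return log, nil
}

// brokenTool is a hypothetical Activity that always fails.
func brokenTool(string) (string, error) {
	return "", errors.New("tool unavailable")
}

func main() {
	steps := []Step{
		{"interpret goal", func(in string) (string, error) { return "plan(" + in + ")", nil }},
		{"call tool", func(in string) (string, error) { return "result(" + in + ")", nil }},
		{"notify", brokenTool},
	}
	log, err := RunAgent("close stale tickets", steps)
	fmt.Println(log)
	fmt.Println(err)
}
```

In a real Temporal workflow each `s.Run` call would be an Activity invocation with its own retry policy and timeouts, so a failure like the one from `brokenTool` would be retried before it ever surfaced to the orchestration code.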

Signals = external events

Use signals for:

  • human approvals
  • “cancel”
  • updated user intent
  • out-of-band events (“incident resolved”)

Queries = introspection

Expose workflow state:

  • current step
  • last tool call
  • pending approvals
  • budget remaining
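A query handler is essentially a read-only snapshot of run state. The sketch below models that in plain Go; `RunState`, `AgentRun`, `Update`, and `Query` are illustrative names, not Temporal APIs. In the Temporal Go SDK the equivalent hook would be registered with `workflow.SetQueryHandler`, which this sketch does not use.

```go
package main

import (
	"fmt"
	"sync"
)

// RunState is the introspectable state of one agent run.
// Field names are illustrative and follow the list above.
type RunState struct {
	CurrentStep      string
	LastToolCall     string
	PendingApprovals []string
	BudgetRemaining  int
}

// AgentRun guards its state with a mutex so queries can read it
// while the run keeps making progress.
type AgentRun struct {
	mu    sync.Mutex
	state RunState
}

// Update applies a mutation to the state under the lock.
func (r *AgentRun) Update(f func(*RunState)) {
	r.mu.Lock()
	defer r.mu.Unlock()
	f(&r.state)
}

// Query returns a copy of the state, including a copy of the slice,
// so callers cannot mutate the run through the snapshot.
func (r *AgentRun) Query() RunState {
	r.mu.Lock()
	defer r.mu.Unlock()
	snap := r.state
	snap.PendingApprovals = append([]string(nil), r.state.PendingApprovals...)
	return snap
}

func main() {
	run := &AgentRun{}
	run.Update(func(s *RunState) {
		s.CurrentStep = "call tools"
		s.LastToolCall = "search_tickets"
		s.PendingApprovals = []string{"apply to prod"}
		s.BudgetRemaining = 12
	})
	fmt.Printf("%+v\n", run.Query())
}
```

Returning a copy matters: a query should never be a back door for changing what the run does next.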

Determinism and why it matters

Temporal requires workflow code to be deterministic. [1] Determinism is what allows Temporal to replay history and rebuild state after worker crashes.

Practical consequence:

  • Don’t do IO in workflow code.
  • Don’t read the current time directly in workflow code (use Temporal APIs).
  • Don’t call random generators without deterministic control.
  • Keep workflow logic as “orchestration,” not execution.

If you violate determinism, you can hit non-deterministic errors on replay. Temporal’s docs and community discussions emphasize this constraint and the need for careful changes. [1][2]
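Here is a toy model of why replay needs determinism, in plain Go. `History`, `SideEffect`, `Rewind`, and `workflow` are names invented for this sketch; it is not how Temporal is implemented internally. The idea it illustrates: every non-deterministic input is recorded the first time, and on replay the recorded value is returned instead of calling the source again. The Temporal Go SDK exposes the same idea through calls such as `workflow.Now` and `workflow.SideEffect`.

```go
package main

import "fmt"

// History stores the recorded result of each non-deterministic call, in order.
type History struct {
	results []int
	next    int
}

// SideEffect runs f the first time and records its result. On replay it
// returns the recorded value and does not call f again.
func (h *History) SideEffect(f func() int) int {
	if h.next < len(h.results) {
		v := h.results[h.next]
		h.next++
		return v
	}
	v := f()
	h.results = append(h.results, v)
	h.next++
	return v
}

// Rewind resets the read cursor, simulating a worker restart that
// replays the history from the beginning.
func (h *History) Rewind() { h.next = 0 }

// workflow is deterministic given the history, because every
// non-deterministic input goes through SideEffect.
func workflow(h *History, source func() int) int {
	a := h.SideEffect(source)
	b := h.SideEffect(source)
	return a*10 + b
}

func main() {
	counter := 0
	// source stands in for the clock, a random generator, or any IO.
	source := func() int { counter++; return counter }

	h := &History{}
	first := workflow(h, source)
	h.Rewind()
	replayed := workflow(h, source) // source is not called during replay

	fmt.Println(first, replayed, counter) // prints "12 12 2"
}
```

If `workflow` read `source` directly instead of going through `SideEffect`, the replay would compute a different value than the original run, which is the kind of divergence that shows up as a non-determinism error.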


Retries, timeouts, and idempotency

Retry policies (Activities)

Temporal retry policies control backoff and retry behavior for activity failures. [3]

Use them intentionally:

  • retries for transient failures (rate limits, timeouts)
  • limited retries for “probably broken” failures
  • exponential backoff with jitter (avoid thundering herd)
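The bullets above can be sketched as a small retry loop. This is plain Go, not the Temporal SDK: the `RetryPolicy` struct here is a local type whose field names are modeled on Temporal's retry policy options, and `Sleeper`, `NoSleep`, `ErrTransient`, `ErrNonRetryable`, and `DefaultToolPolicy` are hypothetical. Jitter is left out so the delays stay predictable; the sketch shows only capped exponential backoff with an attempt limit and a non-retryable escape hatch.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// RetryPolicy is a local sketch; 0 MaximumAttempts means unlimited here.
type RetryPolicy struct {
	InitialInterval    time.Duration
	BackoffCoefficient float64
	MaximumInterval    time.Duration
	MaximumAttempts    int
}

// Sleeper lets callers swap real sleeping for a no-op in tests.
type Sleeper func(time.Duration)

// NoSleep skips waiting entirely.
var NoSleep Sleeper = func(time.Duration) {}

var (
	ErrTransient    = errors.New("transient failure")
	ErrNonRetryable = errors.New("non-retryable failure")
)

// DefaultToolPolicy is an example configuration, not a recommendation.
var DefaultToolPolicy = RetryPolicy{
	InitialInterval:    time.Second,
	BackoffCoefficient: 2.0,
	MaximumInterval:    10 * time.Second,
	MaximumAttempts:    5,
}

// Delay returns the wait before retry number `retry` (1-based),
// growing exponentially and capped at MaximumInterval.
func (p RetryPolicy) Delay(retry int) time.Duration {
	d := float64(p.InitialInterval)
	limit := float64(p.MaximumInterval)
	for i := 1; i < retry; i++ {
		d *= p.BackoffCoefficient
		if d >= limit {
			return p.MaximumInterval
		}
	}
	return time.Duration(d)
}

// Do calls op until it succeeds, returns a non-retryable error,
// or the attempt limit is reached. It reports the attempts used.
func (p RetryPolicy) Do(op func(attempt int) error, sleep Sleeper) (int, error) {
	for attempt := 1; ; attempt++ {
		err := op(attempt)
		if err == nil {
			return attempt, nil
		}
		if errors.Is(err, ErrNonRetryable) || (p.MaximumAttempts > 0 && attempt >= p.MaximumAttempts) {
			return attempt, err
		}
		sleep(p.Delay(attempt))
	}
}

func main() {
	for retry := 1; retry <= 5; retry++ {
		fmt.Println(retry, DefaultToolPolicy.Delay(retry))
	}
	n, err := DefaultToolPolicy.Do(func(attempt int) error {
		if attempt < 3 {
			return ErrTransient // e.g. a rate limit
		}
		return nil
	}, NoSleep)
	fmt.Println(n, err)
}
```

The non-retryable branch is the part people tend to forget: a validation error or a permissions failure will not fix itself on attempt five, so it should fail fast.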

Timeouts are not optional

Set explicit timeouts:

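  • ScheduleToStart: how long a task may wait in the queue before a worker picks it up
  • StartToClose: how long a single attempt may run
  • ScheduleToClose: the overall budget for the Activity, retries included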

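Temporal’s default retry policy for Activities does not cap the number of attempts, so without a ScheduleToClose timeout or a maximum attempt count, a failing Activity can keep retrying “forever” in practice.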
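As a rough illustration of a per-attempt bound (the StartToClose idea), here is a plain Go sketch built on `context.WithTimeout`. `CallWithTimeout`, `sleepyTool`, `TimedOut`, and `demo` are hypothetical helpers for this example. Temporal enforces its timeouts on the server side, so this models only the behavior, not the mechanism.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// CallWithTimeout runs one attempt of a tool call and gives up
// when the timeout elapses.
func CallWithTimeout(parent context.Context, timeout time.Duration,
	tool func(context.Context) (string, error)) (string, error) {

	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()

	type result struct {
		out string
		err error
	}
	// Buffered so the goroutine can always send, even after a timeout.
	done := make(chan result, 1)
	go func() {
		out, err := tool(ctx)
		done <- result{out, err}
	}()

	select {
	case r := <-done:
		return r.out, r.err
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// sleepyTool simulates a tool that takes d to respond and
// stops early if its context is cancelled.
func sleepyTool(d time.Duration) func(context.Context) (string, error) {
	return func(ctx context.Context) (string, error) {
		select {
		case <-time.After(d):
			return "ok", nil
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
}

// TimedOut reports whether err was caused by the deadline.
func TimedOut(err error) bool { return errors.Is(err, context.DeadlineExceeded) }

// demo runs one fast call and one call that exceeds its budget.
func demo() (fastErr, slowErr error) {
	_, fastErr = CallWithTimeout(context.Background(), 200*time.Millisecond, sleepyTool(time.Millisecond))
	_, slowErr = CallWithTimeout(context.Background(), 20*time.Millisecond, sleepyTool(2*time.Second))
	return fastErr, slowErr
}

func main() {
	fastErr, slowErr := demo()
	fmt.Println("fast call error:", fastErr)
	fmt.Println("slow call timed out:", TimedOut(slowErr))
}
```

A timeout turns “hung” into “failed,” and a failure is something a retry policy can act on. A call that never returns gives the retry logic nothing to work with.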

Idempotency keys for side effects

Your workflow can be retried/replayed. Your Activity can be retried. Upstream systems can time out after performing the operation.

For side-effecting tools:

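  • generate an idempotency key in the workflow, deterministically (for example, derived from the workflow ID and step number) so replay produces the same key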
  • pass it into the tool Activity
  • store “operation result” in workflow state

When the Activity retries, it reuses the key so the upstream system deduplicates.

This is the difference between “retries” and “duplicates.”
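The duplicate-ticket scenario from earlier makes a good worked example. The sketch below is plain Go with an in-memory stand-in for the upstream system; `TicketAPI`, `IdempotencyKey`, and `CreateTicketActivity` are invented for illustration, and it assumes the upstream deduplicates by key, which not every API does.

```go
package main

import "fmt"

// TicketAPI is a stand-in for an upstream system that
// deduplicates requests by idempotency key.
type TicketAPI struct {
	byKey   map[string]string // idempotency key -> ticket ID
	created int               // number of tickets actually created
}

func NewTicketAPI() *TicketAPI { return &TicketAPI{byKey: map[string]string{}} }

// Create returns the existing ticket for a known key instead of
// creating a second one.
func (a *TicketAPI) Create(key, title string) string {
	if id, ok := a.byKey[key]; ok {
		return id
	}
	a.created++
	id := fmt.Sprintf("TICKET-%d", a.created)
	a.byKey[key] = id
	return id
}

// IdempotencyKey derives a stable key from the run and step, so every
// retry of the same step sends the same key.
func IdempotencyKey(runID string, step int) string {
	return fmt.Sprintf("%s/step-%d", runID, step)
}

// CreateTicketActivity retries until it sees a response. When
// loseFirstResponse is true, the first response is dropped after the
// upstream has already done the work, which forces a retry.
func CreateTicketActivity(api *TicketAPI, key, title string, loseFirstResponse bool) (string, int) {
	attempts := 0
	for {
		attempts++
		id := api.Create(key, title)
		if loseFirstResponse && attempts == 1 {
			continue // the caller never saw the ID, so it tries again
		}
		return id, attempts
	}
}

func main() {
	api := NewTicketAPI()
	key := IdempotencyKey("run-42", 3)
	id, attempts := CreateTicketActivity(api, key, "clean up stale branches", true)
	fmt.Println(id, attempts, api.created) // prints "TICKET-1 2 1"
}
```

Two attempts, one ticket. Without the key, the second attempt would have created `TICKET-2`, and nothing in the agent's state would say the first one exists.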


Human-in-the-loop as a first-class step

For dangerous operations:

  • pause
  • ask for approval with the plan summary
  • resume when approved

Temporal workflows can wait for signals without holding threads like a traditional process would.

This is one of the cleanest ways to build “preview -> approve -> apply” without building a bunch of custom state machinery.
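A minimal sketch of the gate, in plain Go with a channel standing in for the signal. `Decision` and `PreviewApproveApply` are hypothetical names. In the Temporal Go SDK the blocking receive would be a read from `workflow.GetSignalChannel`, and the wait would survive worker restarts, which a Go channel on its own does not.

```go
package main

import "fmt"

// Decision is the payload of an approval signal (hypothetical shape).
type Decision struct {
	Approved bool
	By       string
}

// PreviewApproveApply computes a plan, shares it, blocks until a
// decision arrives, and applies the plan only if it was approved.
func PreviewApproveApply(
	preview func() string,
	notify func(plan string),
	decisions <-chan Decision,
	apply func(plan string) string,
) (string, bool) {
	plan := preview()
	notify(plan)

	d := <-decisions // the run is parked here until a human responds
	if !d.Approved {
		return "rejected by " + d.By, false
	}
	return apply(plan), true
}

func main() {
	decisions := make(chan Decision, 1)
	decisions <- Decision{Approved: true, By: "oncall"}

	out, ok := PreviewApproveApply(
		func() string { return "delete 3 unused buckets" },
		func(plan string) { fmt.Println("approval requested for:", plan) },
		decisions,
		func(plan string) string { return "applied: " + plan },
	)
	fmt.Println(out, ok)
}
```

Note that the plan is computed once, before the wait, and the same plan is what gets applied. The approver signs off on the exact thing that runs, not on whatever the agent decides after it wakes up.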

Replay, audit, and debugging

Temporal events are recorded as part of the workflow’s event history. [4]

This yields production superpowers:

  • reconstruct exactly what happened
  • understand why a step was taken
  • replay a run to test a bug fix
  • implement “reset” patterns (carefully)

For agents, this is the difference between:

  • “the model did something weird” and
  • “step 7 called tool X with args Y after tool Z returned response R”
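To show what that second kind of answer looks like in code, here is a toy append-only log in plain Go. `Event`, `EventLog`, `Append`, and `Explain` are illustrative names, and the event shape is far simpler than Temporal's real event types. It only demonstrates that once every tool call is recorded in order, explaining a step is a lookup rather than guesswork.

```go
package main

import (
	"fmt"
	"strings"
)

// Event records one tool call and its outcome.
type Event struct {
	Step   int
	Tool   string
	Args   string
	Result string
}

// EventLog is append-only: events are never edited or removed.
type EventLog struct{ events []Event }

// Append records a tool call as the next step and returns the event.
func (l *EventLog) Append(tool, args, result string) Event {
	e := Event{Step: len(l.events) + 1, Tool: tool, Args: args, Result: result}
	l.events = append(l.events, e)
	return e
}

// Explain describes a step in terms of the call it made and the
// previous result it followed.
func (l *EventLog) Explain(step int) string {
	if step < 1 || step > len(l.events) {
		return fmt.Sprintf("step %d: no such event", step)
	}
	e := l.events[step-1]
	var b strings.Builder
	fmt.Fprintf(&b, "step %d called %s with args %s", e.Step, e.Tool, e.Args)
	if step > 1 {
		prev := l.events[step-2]
		fmt.Fprintf(&b, " after %s returned %s", prev.Tool, prev.Result)
	}
	return b.String()
}

func main() {
	log := &EventLog{}
	log.Append("search_tickets", "query=stale", "0 hits")
	log.Append("create_ticket", "title=cleanup", "TICKET-1")
	fmt.Println(log.Explain(2))
}
```

With a real event history you get this for free, plus timestamps, retries, and signal deliveries, without writing your own logging layer around every tool.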

Versioning: evolving agents safely

Agent logic will change. Prompts will change. Tool contracts will change.

If you have long-running agents, you need a strategy that doesn’t break in-flight executions.

Temporal provides workflow versioning mechanisms because determinism means you can’t simply change workflow logic without thought. [2]

Production approach:

  • keep existing executions on old code paths
  • route new executions to new paths
  • migrate intentionally

This prevents “deploy broke every running workflow.”
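The routing rule above can be modeled in a few lines of plain Go. This sketch is loosely patterned after the Go SDK's `workflow.GetVersion` and `workflow.DefaultVersion`, but `Execution`, its `replaying` flag, and `notifyStep` are simplifications invented here; the real call also takes a minimum supported version and reads markers from the event history.

```go
package main

import "fmt"

// DefaultVersion marks the code path that existed before the change.
const DefaultVersion = -1

// Execution holds the version markers recorded for one run.
type Execution struct {
	markers   map[string]int
	replaying bool // true when rebuilding a run that started on older code
}

// GetVersion returns the recorded version for changeID if one exists.
// A run replaying history without a marker gets DefaultVersion.
// Otherwise the run records maxSupported and uses it from then on.
func (e *Execution) GetVersion(changeID string, maxSupported int) int {
	if v, ok := e.markers[changeID]; ok {
		return v
	}
	if e.replaying {
		return DefaultVersion
	}
	if e.markers == nil {
		e.markers = map[string]int{}
	}
	e.markers[changeID] = maxSupported
	return maxSupported
}

// notifyStep is a workflow step whose behavior changed in a deploy.
func notifyStep(e *Execution) string {
	if e.GetVersion("notify-via-slack", 1) == DefaultVersion {
		return "send email" // original behavior, kept for in-flight runs
	}
	return "post to Slack" // new behavior for runs started after the deploy
}

func main() {
	inFlight := &Execution{replaying: true} // started before the deploy
	fresh := &Execution{}                   // started after the deploy

	fmt.Println(notifyStep(inFlight)) // prints "send email"
	fmt.Println(notifyStep(fresh))    // prints "post to Slack"
}
```

The old branch looks like dead code, but it has to stay until every execution that could still replay through it has finished. Removing it early is exactly how a deploy breaks running workflows.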


A production checklist

Architecture

  • Agent runs modeled as workflows; tool calls as activities.
  • External events modeled as signals; state exposed via queries.

Determinism

  • No IO in workflow code (only orchestration).
  • Workflow changes use versioning strategy. [2]

Reliability

  • Retry policies defined for Activities. [3]
  • Timeouts defined and bounded.
  • Idempotency keys used for side-effecting actions.

Governance

  • Human approval gates exist for dangerous operations.
  • Audit trails include plan summaries and results.

Operability

  • Event history used for debugging and incident analysis. [4]

References

[1] Temporal - Workflow Definition (determinism requirement): https://docs.temporal.io/workflow-definition
[2] Temporal Go SDK - Versioning (evolving deterministic workflows safely): https://docs.temporal.io/develop/go/versioning
[3] Temporal - Retry Policies (how and when retries happen): https://docs.temporal.io/encyclopedia/retry-policies
[4] Temporal - Events reference (event history): https://docs.temporal.io/references/events
[5] Temporal - Workflows overview: https://docs.temporal.io/workflows

Authors
DevOps Architect · Applied AI Engineer
I’ve spent 20 years building systems across embedded systems, microcontrollers, PLCs, security platforms, fintech, SRE, and platform architecture. Today I focus on production AI systems in Go: multi-agent orchestration, MCP server ecosystems, and the DevOps platforms that keep them running. I care about systems that work under pressure: observable, recoverable, and built to last.