Durable Agents with Temporal: Retries, Idempotency, and Long-Running State

Why this matters
Agents are often framed as “reason + tools.”
In production, the actual problem is execution:
- calls fail
- networks flake
- credentials expire
- humans need to approve steps
- tasks take hours/days
- systems restart
- you need a forensic trail of what happened
If your agent runtime is “one process with a loop,” you will eventually lose state and do the wrong side effect twice.
This is why workflow engines exist.
Temporal’s model (durable workflows with deterministic execution and a recorded event history) maps well to tool-using agents. Temporal requires workflow code to be deterministic and provides APIs for versioning long-running workflows. [1][2]
This article is a production pattern: use Temporal to make agents durable.
TL;DR
- Represent an agent run as a Temporal Workflow.
- Make tool calls Activities (retryable, timeout-bounded).
- Put side-effecting tools behind:
  - idempotency keys
  - preview -> apply
  - durable “exactly-once” semantics (from the workflow’s perspective)
- Use Temporal’s retry policies for Activities and explicit failure handling. [3]
- Use event history and replay for forensics (Temporal events are first-class). [4]
- Use workflow versioning for safe evolution of long-running agents. [2]
Contents
- Why agents need durable execution
- Mapping an agent to Temporal
- Determinism and why it matters
- Retries, timeouts, and idempotency
- Human-in-the-loop as a first-class step
- Replay, audit, and debugging
- Versioning: evolving agents safely
- A production checklist
- References
Why agents need durable execution
A few failure modes you’ll recognize:
Partial side effects
- agent creates a ticket
- process dies before storing the ticket ID
- agent retries and creates a duplicate
Long-running waits
- “wait for PR approvals”
- “wait for a CI pipeline”
- “wait for a meeting to complete”
If your agent can’t wait durably, it becomes a polling daemon.
Human approval
Some steps should not be automated:
- “apply to prod”
- “send email”
- “delete resources”
You need durable pause/resume with a clean audit trail.
Mapping an agent to Temporal
Workflow = agent run
One agent run becomes a single Temporal Workflow Execution. Temporal workflows are designed for long-running, durable coordination. [5]
Inside the workflow you model steps:
- interpret goal
- choose tools
- call tools
- react to results
- request approvals
- finalize output
Activities = tool calls and external IO
All external calls should be Activities:
- MCP tool calls
- HTTP calls
- DB writes
- notifications
Why? Activities are where retries and timeouts belong. Temporal defines retry policies as configuration for how and when to retry failures. [3]
Signals = external events
Use signals for:
- human approvals
- “cancel”
- updated user intent
- out-of-band events (“incident resolved”)
Queries = introspection
Expose workflow state:
- current step
- last tool call
- pending approvals
- budget remaining
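To make the signal/query split concrete, here is a minimal sketch in plain Python. It is not the Temporal SDK: `AgentRun`, `AgentRunState`, and every method name are hypothetical stand-ins. Signals mutate the run's state from outside; a query returns a read-only snapshot of it.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class AgentRunState:
    """The state a query would expose (field names are illustrative)."""
    current_step: str = "interpret_goal"
    last_tool_call: Optional[str] = None
    pending_approvals: List[str] = field(default_factory=list)
    budget_remaining: int = 10
    cancelled: bool = False


class AgentRun:
    """Toy stand-in for one workflow execution."""

    def __init__(self) -> None:
        self.state = AgentRunState()

    def record_tool_call(self, tool: str, needs_approval: bool = False) -> None:
        # Called by the orchestration logic each time a tool is invoked.
        self.state.current_step = "call_tools"
        self.state.last_tool_call = tool
        self.state.budget_remaining -= 1
        if needs_approval:
            self.state.pending_approvals.append(tool)

    def signal_approve(self, tool: str) -> None:
        # External event: a human approved a pending step.
        self.state.pending_approvals.remove(tool)

    def signal_cancel(self) -> None:
        # External event: the run should stop.
        self.state.cancelled = True

    def query(self) -> Dict:
        # Introspection: a copy of the state, so callers cannot mutate the run.
        return asdict(self.state)
```

In a real deployment the engine delivers signals and serves queries; the point here is only the shape of the interface.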
Determinism and why it matters
Temporal requires workflow code to be deterministic. [1] Determinism is what allows Temporal to replay history and rebuild state after worker crashes.
Practical consequence:
- Don’t do IO in workflow code.
- Don’t read the current time directly in workflow code (use Temporal APIs).
- Don’t call random generators without deterministic control.
- Keep workflow logic as “orchestration,” not execution.
If you violate determinism, you can hit non-deterministic errors on replay. Temporal’s docs and community discussions emphasize this constraint and the need for careful changes. [1][2]
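A toy illustration of why determinism enables replay, in plain Python rather than the Temporal SDK. `History`, `Context`, `create_ticket`, and `agent_workflow` are hypothetical names invented for this sketch. The idea: results of external calls are recorded once, and a second pass over the same orchestration code reads them back instead of re-running the side effects.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class History:
    """Toy event history: an ordered list of recorded activity results."""
    results: List[Any] = field(default_factory=list)


class Context:
    """Returns recorded results on replay; runs the activity only past the end of history."""

    def __init__(self, history: History) -> None:
        self.history = history
        self.cursor = 0
        self.executed = 0  # number of activities that really ran in this pass

    def activity(self, fn: Callable[..., Any], *args: Any) -> Any:
        if self.cursor < len(self.history.results):
            result = self.history.results[self.cursor]  # replay: no side effect
        else:
            result = fn(*args)  # first execution: do the work and record it
            self.history.results.append(result)
            self.executed += 1
        self.cursor += 1
        return result


def create_ticket(title: str) -> str:
    # Stand-in for a side-effecting tool call.
    return f"TICKET-{len(title)}"


def agent_workflow(ctx: Context, goal: str) -> str:
    # Orchestration only: the same inputs and history lead to the same decisions.
    ticket = ctx.activity(create_ticket, goal)
    return ctx.activity(str.upper, ticket)
```

If `agent_workflow` branched on the wall clock or a random number, the second pass could request activities in a different order than the history recorded, which is the toy version of a non-determinism error.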
Retries, timeouts, and idempotency
Retry policies (Activities)
Temporal retry policies control backoff and retry behavior for activity failures. [3]
Use them intentionally:
- retries for transient failures (rate limits, timeouts)
- limited retries for “probably broken” failures
- exponential backoff with jitter (avoid thundering herd)
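A small sketch of how an exponential backoff schedule plays out. The parameter names mirror the fields of a Temporal retry policy (initial interval, backoff coefficient, maximum interval, maximum attempts), but `retry_delays` itself is a hypothetical helper, and jitter is left out to keep the arithmetic visible.

```python
from typing import List


def retry_delays(
    initial_interval: float,
    backoff_coefficient: float,
    maximum_interval: float,
    maximum_attempts: int,
) -> List[float]:
    """Seconds to wait before each retry.

    maximum_attempts counts the first try, so N attempts means N - 1 retries.
    """
    delays = []
    for retry in range(maximum_attempts - 1):
        delay = initial_interval * (backoff_coefficient ** retry)
        delays.append(min(delay, maximum_interval))  # cap the growth
    return delays


# Six attempts, doubling from 1s, capped at 10s.
print(retry_delays(1.0, 2.0, 10.0, 6))  # [1.0, 2.0, 4.0, 8.0, 10.0]
```

Printing the schedule for your real settings is a cheap way to check that the total wait fits inside the Activity's overall timeout.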
Timeouts are not optional
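Set explicit timeouts:
- ScheduleToStart: how long the task may wait in the task queue before a worker picks it up
- StartToClose: how long a single attempt may run
- ScheduleToClose: how long the Activity may take overall, retries included
Without an overall bound (a ScheduleToClose timeout or a maximum attempt count), retries can run “forever” in practice.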
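The interaction between a per-attempt bound and an overall bound can be sketched with `asyncio`. This is an in-process analogy, not how Temporal implements timeouts: `run_with_timeouts` and its parameters are hypothetical, with `start_to_close` limiting each attempt and `schedule_to_close` limiting the whole operation.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional


async def run_with_timeouts(
    make_attempt: Callable[[int], Awaitable[str]],
    start_to_close: float,
    schedule_to_close: float,
    max_attempts: int,
) -> str:
    """Retry make_attempt, bounding each attempt and the total elapsed time."""
    deadline = time.monotonic() + schedule_to_close
    last_error: Optional[BaseException] = None
    for attempt in range(1, max_attempts + 1):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # overall budget exhausted: stop retrying
        try:
            # A single attempt may not outlive either bound.
            return await asyncio.wait_for(
                make_attempt(attempt), timeout=min(start_to_close, remaining)
            )
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc  # treated as transient: try again
    raise TimeoutError("activity did not complete within its bounds") from last_error
```

Removing either bound shows the failure mode: with no per-attempt limit one hung call blocks forever, and with no overall limit or attempt cap the loop never gives up.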
Idempotency keys for side effects
Your workflow can be retried/replayed. Your Activity can be retried. Upstream systems can time out after performing the operation.
For side-effecting tools:
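- generate an idempotency key in the workflow, deterministically (for example, derived from the workflow ID and the step), so that replay produces the same key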
- pass it into the tool Activity
- store “operation result” in workflow state
When the Activity retries, it reuses the key so the upstream system deduplicates.
This is the difference between “retries” and “duplicates.”
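A minimal sketch of the pattern, assuming the upstream system deduplicates on a caller-supplied key. `idempotency_key` and `TicketService` are hypothetical; the key is derived from the workflow ID and step name so that a retry or a replay computes the same value.

```python
import hashlib
from typing import Dict


def idempotency_key(workflow_id: str, step: str) -> str:
    """Deterministic key: the same workflow step always yields the same key."""
    return hashlib.sha256(f"{workflow_id}:{step}".encode()).hexdigest()[:16]


class TicketService:
    """Stand-in upstream system that deduplicates on the idempotency key."""

    def __init__(self) -> None:
        self._by_key: Dict[str, str] = {}
        self.created = 0  # number of tickets actually created

    def create_ticket(self, key: str, title: str) -> str:
        if key in self._by_key:
            return self._by_key[key]  # retry: return the original result
        self.created += 1
        ticket_id = f"TICKET-{self.created}"
        self._by_key[key] = ticket_id
        return ticket_id


service = TicketService()
key = idempotency_key("agent-run-42", "create_ticket")
first = service.create_ticket(key, "Fix login")
retry = service.create_ticket(key, "Fix login")  # e.g. after a timed-out response
```

Here `first` and `retry` are the same ticket and `service.created` stays at 1. A random key generated inside the Activity would defeat this, because each retry would look like a new request.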
Human-in-the-loop as a first-class step
For dangerous operations:
- pause
- ask for approval with the plan summary
- resume when approved
A Temporal workflow can wait for a signal without holding a thread or process open the way a traditional long-lived process would; the wait is part of its durable state.
This is one of the cleanest ways to build “preview -> approve -> apply” without building a bunch of custom state machinery.
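The control flow of an approval gate can be sketched with `asyncio`. This in-process version is not durable (a restart loses the wait), which is exactly the gap a workflow engine fills; `ApprovalGate` and `preview_approve_apply` are hypothetical names, and the `signal` method stands in for an external approval signal.

```python
import asyncio
from typing import Callable, List


class ApprovalGate:
    """Blocks the run until an external decision arrives."""

    def __init__(self) -> None:
        self._decided = asyncio.Event()
        self.approved = False

    def signal(self, approved: bool) -> None:
        # Stand-in for an approval signal sent by a human or another system.
        self.approved = approved
        self._decided.set()

    async def wait(self) -> bool:
        await self._decided.wait()
        return self.approved


async def preview_approve_apply(
    gate: ApprovalGate, plan: str, apply: Callable[[str], str]
) -> List[str]:
    log = [f"preview: {plan}"]  # the plan summary shown to the approver
    if await gate.wait():  # pause here until a decision is signaled
        log.append(f"applied: {apply(plan)}")
    else:
        log.append("rejected")
    return log
```

The returned log doubles as a small audit trail: what was proposed, and whether it was applied or rejected.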
Replay, audit, and debugging
Temporal events are recorded as part of the workflow’s event history. [4]
This yields production superpowers:
- reconstruct exactly what happened
- understand why a step was taken
- replay a run to test a bug fix
- implement “reset” patterns (carefully)
For agents, this is the difference between “the model did something weird” and “step 7 called tool X with args Y after tool Z returned response R.”
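As a rough sketch of turning recorded events into that kind of step-by-step account: the event shape below (`type`, `call_id`, `tool`, `args`, `result`) is invented for illustration and is not Temporal's event schema, and `describe_steps` is a hypothetical helper.

```python
from typing import Dict, List


def describe_steps(events: List[Dict]) -> List[str]:
    """Pair each tool call with its result and emit one line per completed step."""
    pending: Dict[str, Dict] = {}
    lines: List[str] = []
    for event in events:
        if event["type"] == "tool_called":
            pending[event["call_id"]] = event
        elif event["type"] == "tool_returned":
            call = pending.pop(event["call_id"])
            step = len(lines) + 1
            lines.append(
                f"step {step}: {call['tool']}({call['args']}) -> {event['result']}"
            )
    return lines


events = [
    {"type": "tool_called", "call_id": "a", "tool": "get_status", "args": {"env": "prod"}},
    {"type": "tool_returned", "call_id": "a", "result": "degraded"},
    {"type": "tool_called", "call_id": "b", "tool": "open_incident", "args": {"sev": 2}},
    {"type": "tool_returned", "call_id": "b", "result": "INC-7"},
]
for line in describe_steps(events):
    print(line)
```

Calls that never returned stay in `pending`, which is often the most interesting part of an incident review.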
Versioning: evolving agents safely
Agent logic will change. Prompts will change. Tool contracts will change.
If you have long-running agents, you need a strategy that doesn’t break in-flight executions.
Temporal provides workflow versioning mechanisms because replay requires that changed workflow code still match the recorded history of in-flight executions; you can’t simply edit workflow logic and redeploy. [2]
Production approach:
- keep existing executions on old code paths
- route new executions to new paths
- migrate intentionally
This prevents “deploy broke every running workflow.”
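A toy sketch of the version-marker idea behind this approach, loosely modeled on the versioning mechanisms cited in [2] but not an SDK call: `get_version`, `plan_steps`, and the change ID are hypothetical. A new execution records which version of a change it started with; an execution whose history predates the change keeps the old path.

```python
from typing import Dict, List

DEFAULT_VERSION = 0  # executions that started before the change was deployed


def get_version(
    markers: Dict[str, int], change_id: str, max_supported: int, replaying: bool
) -> int:
    """Return the version this execution is pinned to for the given change."""
    if change_id in markers:
        return markers[change_id]  # already recorded: stay on that path
    if replaying:
        return DEFAULT_VERSION  # history predates the change: keep old behavior
    markers[change_id] = max_supported  # new execution: record the new version
    return max_supported


def plan_steps(markers: Dict[str, int], replaying: bool) -> List[str]:
    version = get_version(markers, "add-approval-gate", 1, replaying)
    if version == DEFAULT_VERSION:
        return ["call_tool", "finalize"]  # old code path
    return ["call_tool", "request_approval", "finalize"]  # new code path
```

Once no executions pinned to the default version remain, the old branch can be deleted as a deliberate migration step.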
A production checklist
Architecture
- Agent runs modeled as workflows; tool calls as activities.
- External events modeled as signals; state exposed via queries.
Determinism
- No IO in workflow code (only orchestration).
- Workflow changes use versioning strategy. [2]
Reliability
- Retry policies defined for Activities. [3]
- Timeouts defined and bounded.
- Idempotency keys used for side-effecting actions.
Governance
- Human approval gates exist for dangerous operations.
- Audit trails include plan summaries and results.
Operability
- Event history used for debugging and incident analysis. [4]
References
[1] Temporal - Workflow Definition (determinism requirement): https://docs.temporal.io/workflow-definition
[2] Temporal Go SDK - Versioning (evolving deterministic workflows safely): https://docs.temporal.io/develop/go/versioning
[3] Temporal - Retry Policies (how and when retries happen): https://docs.temporal.io/encyclopedia/retry-policies
[4] Temporal - Events reference (event history): https://docs.temporal.io/references/events
[5] Temporal - Workflows overview: https://docs.temporal.io/workflows