Durable Agents with Temporal: Retries, Idempotency, and Long-Running State

Why this matters

Agents are often framed as “reason + tools.”

In production, the actual problem is execution:

calls fail
networks flake
credentials expire
humans need to approve steps
tasks take hours/days
systems restart
you need a forensic trail of what happened

If your agent runtime is “one process with a loop,” you will eventually lose state and perform the same side effect twice.

This is why workflow engines exist.

Temporal’s model - durable workflows with deterministic execution and event history - maps incredibly well to tool-using agents. Temporal explicitly requires workflow code to be deterministic and provides APIs for versioning long-running workflows. [1][2]

This article is a production pattern: use Temporal to make agents durable.

TL;DR

Represent an agent run as a Temporal Workflow.
Make tool calls Activities (retryable, timeout-bounded).
Put side-effecting tools behind:
idempotency keys
preview -> apply
durable “exactly-once” semantics from the workflow’s perspective (an Activity may execute more than once, but its result is recorded once)
Use Temporal’s retry policies for Activities and explicit failure handling. [3]
Use event history and replay for forensics (Temporal events are first-class). [4]
Use workflow versioning for safe evolution of long-running agents. [2]

Why agents need durable execution
Mapping an agent to Temporal
Determinism and why it matters
Retries, timeouts, and idempotency
Human-in-the-loop as a first-class step
Replay, audit, and debugging
Versioning: evolving agents safely
A production checklist
References

Why agents need durable execution

A few failure modes you’ll recognize:

Partial side effects

agent creates a ticket
process dies before storing the ticket ID
agent retries and creates a duplicate

Long-running waits

“wait for PR approvals”
“wait for a CI pipeline”
“wait for a meeting to complete”

If your agent can’t wait durably, it becomes a polling daemon.

Human approval

Some steps should not be automated:

“apply to prod”
“send email”
“delete resources”

You need durable pause/resume with a clean audit trail.

Mapping an agent to Temporal

Workflow = agent run

One agent run becomes a single Temporal Workflow Execution. Temporal workflows are designed for long-running, durable coordination. [5]

Inside the workflow you model steps:

interpret goal
choose tools
call tools
react to results
request approvals
finalize output

Activities = tool calls and external IO

All external calls should be Activities:

MCP tool calls
HTTP calls
DB writes
notifications

Why? Activities are where retries and timeouts belong. Temporal defines retry policies as configuration for how and when to retry failures. [3]

Signals = external events

Use signals for:

human approvals
“cancel”
updated user intent
out-of-band events (“incident resolved”)

Queries = introspection

Expose workflow state:

current step
last tool call
pending approvals
budget remaining
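
As a rough illustration of the signal and query roles, the sketch below keeps the agent's state in a plain struct, folds incoming signals into it, and hands out copies for queries. It uses only the Go standard library, not the Temporal SDK; `Signal`, `RunState`, and the signal kinds are hypothetical names chosen for this example.

```go
package main

// Signal is a hypothetical external event delivered to a running agent.
type Signal struct {
	Kind    string // "approve", "cancel", or "intent" (illustrative kinds)
	Payload string // approval ID or updated goal text
}

// RunState holds the fields a query handler would expose.
type RunState struct {
	CurrentStep      string
	LastToolCall     string
	PendingApprovals []string
	BudgetRemaining  int
	Goal             string
	Cancelled        bool
}

// Apply folds one signal into the state. It performs no IO, so the same
// logic is safe to run inside deterministic orchestration code.
func (s *RunState) Apply(sig Signal) {
	switch sig.Kind {
	case "approve":
		var kept []string
		for _, id := range s.PendingApprovals {
			if id != sig.Payload {
				kept = append(kept, id)
			}
		}
		s.PendingApprovals = kept
	case "cancel":
		s.Cancelled = true
	case "intent":
		s.Goal = sig.Payload
	}
}

// Snapshot returns a copy so a query caller cannot mutate live state.
func (s *RunState) Snapshot() RunState {
	out := *s
	out.PendingApprovals = append([]string(nil), s.PendingApprovals...)
	return out
}
```

In a real workflow the signal would arrive through the engine and the snapshot would be returned from a registered query handler; the point here is only that signals mutate state and queries read it without side effects.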

Determinism and why it matters

Temporal requires workflow code to be deterministic. [1] Determinism is what allows Temporal to replay history and rebuild state after worker crashes.

Practical consequence:

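Don’t do IO in workflow code. Model calls are network IO with non-deterministic output, so they belong in Activities too.
Don’t read the current time directly in workflow code; use the SDK’s workflow clock (workflow.Now in the Go SDK).
Don’t call random generators directly; capture the value through the SDK so it is recorded in history (workflow.SideEffect in the Go SDK).
Keep workflow logic as “orchestration,” not execution.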

If you violate determinism, replay fails with a non-determinism error, because the commands the code produces no longer match the recorded event history. This is why Temporal’s docs stress both the constraint and the need for careful changes to workflow code. [1][2]
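
To make the replay idea concrete, here is a minimal sketch of record-and-replay, assuming a toy `Recorder` that stands in for the engine. It is not Temporal's implementation: `Recorder`, `Do`, and `PlanStep` are hypothetical names, and the history is just a slice of strings. The first execution runs the non-deterministic function and records its result; a replay returns the recorded value without running it again.

```go
package main

import "fmt"

// Recorder stands in for the engine: it records each non-deterministic
// value on first execution and returns the recorded value on replay.
type Recorder struct {
	History []string // recorded values, in call order
	next    int      // position of the next call within History
}

// Do runs fn only when no recorded value exists for this call position.
func (r *Recorder) Do(fn func() string) string {
	if r.next < len(r.History) {
		v := r.History[r.next]
		r.next++
		return v
	}
	v := fn()
	r.History = append(r.History, v)
	r.next++
	return v
}

// PlanStep plays the role of workflow code: its only inputs are its
// arguments and values obtained through the Recorder.
func PlanStep(r *Recorder, goal string, pickTool func() string) string {
	tool := r.Do(pickTool)
	return fmt.Sprintf("goal=%s tool=%s", goal, tool)
}
```

If `PlanStep` called `pickTool` directly, a replay after a crash could pick a different tool than the one whose side effects already happened. Routing the call through the recorder is what keeps the rebuilt state consistent with history.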

Retries, timeouts, and idempotency

Retry policies (Activities)

Temporal retry policies control backoff and retry behavior for activity failures. [3]

Use them intentionally:

retries for transient failures (rate limits, timeouts)
limited retries for “probably broken” failures
exponential backoff with jitter (avoid thundering herd)
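
The backoff arithmetic behind such a policy can be sketched in a few lines. The field names below follow the attributes described in Temporal's retry policy documentation [3] (initial interval, backoff coefficient, maximum interval, maximum attempts), but this is a standalone stdlib sketch, not the SDK type. `WithJitter` is a hypothetical helper; the caller supplies a uniform sample `u` in [0, 1), for example from `rand.Float64()`.

```go
package main

import (
	"math"
	"time"
)

// RetryPolicy is an illustrative stand-in, not the Temporal SDK struct.
type RetryPolicy struct {
	InitialInterval    time.Duration
	BackoffCoefficient float64
	MaximumInterval    time.Duration // 0 means no cap in this sketch
	MaximumAttempts    int           // 0 means unlimited
}

// Backoff returns the capped exponential delay before retry number
// `retry`, where retry 1 is the wait after the first failure.
func (p RetryPolicy) Backoff(retry int) time.Duration {
	d := float64(p.InitialInterval) * math.Pow(p.BackoffCoefficient, float64(retry-1))
	if limit := float64(p.MaximumInterval); p.MaximumInterval > 0 && d > limit {
		d = limit
	}
	return time.Duration(d)
}

// WithJitter spreads a delay over [d/2, d) given u in [0, 1), so that
// many clients failing together do not all retry at the same instant.
func WithJitter(d time.Duration, u float64) time.Duration {
	half := d / 2
	return half + time.Duration(u*float64(d-half))
}

// ShouldRetry reports whether another attempt is allowed after
// `attempts` tries have already been made.
func (p RetryPolicy) ShouldRetry(attempts int) bool {
	return p.MaximumAttempts == 0 || attempts < p.MaximumAttempts
}
```

With a 1s initial interval, coefficient 2, and a 10s cap, the delays run 1s, 2s, 4s, 8s, 10s, 10s. A small `MaximumAttempts` is the natural setting for the “probably broken” class of failures.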

Timeouts are not optional

Set explicit timeouts:

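ScheduleToStart: how long a task may wait in the task queue before a worker picks it up
StartToClose: the maximum duration of a single attempt
ScheduleToClose: the maximum total duration, including all retries

Without a ScheduleToClose timeout or a cap on attempts, a failing Activity can keep retrying indefinitely, because the default retry policy does not limit the number of attempts.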
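
The interaction between a per-attempt timeout and an overall budget can be sketched with plain contexts. This is an analogy built on the standard library, not how Temporal enforces timeouts: `RunBounded`, `Tool`, and `Hang` are hypothetical names, and the loop retries immediately (no backoff) to keep the example short.

```go
package main

import (
	"context"
	"errors"
	"time"
)

// ErrDeadline is returned when the overall budget is exhausted.
var ErrDeadline = errors.New("overall deadline exceeded")

// Tool is the shape of an external call that honors cancellation.
type Tool func(ctx context.Context) error

// Hang simulates a tool that never answers on its own; it returns only
// when its context ends.
func Hang(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// RunBounded calls fn until it succeeds or the budget runs out. Each
// attempt gets its own perAttempt timeout (the StartToClose analogue);
// all attempts share the overall budget (the ScheduleToClose analogue).
func RunBounded(overall, perAttempt time.Duration, fn Tool) (attempts int, err error) {
	ctx, cancel := context.WithTimeout(context.Background(), overall)
	defer cancel()
	for {
		if ctx.Err() != nil {
			return attempts, ErrDeadline
		}
		attempts++
		actx, acancel := context.WithTimeout(ctx, perAttempt)
		err = fn(actx)
		acancel()
		if err == nil {
			return attempts, nil
		}
	}
}
```

Calling `RunBounded(40*time.Millisecond, 10*time.Millisecond, Hang)` makes a handful of attempts and then returns `ErrDeadline`. Remove the overall budget and the same loop would never stop, which is the failure mode the timeouts exist to prevent.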

Idempotency keys for side effects

Your workflow can be retried/replayed. Your Activity can be retried. Upstream systems can time out after performing the operation.

For side-effecting tools:

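generate an idempotency key in the workflow, deterministically (for example, derived from the workflow ID and step number), so a replay produces the same key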
pass it into the tool Activity
store “operation result” in workflow state

When the Activity retries, it reuses the key so the upstream system deduplicates.

This is the difference between “retries” and “duplicates.”
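
A minimal sketch of the pattern, assuming a hypothetical upstream `TicketService` that deduplicates by key. The key is derived from a run ID and step number so it stays stable across retries and replays; that derivation is one reasonable choice among several, not something the source prescribes. The service simulates the awkward case where the write succeeds but the response is lost.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

// IdempotencyKey derives a stable key from the run ID and step number,
// so a replayed workflow or retried Activity produces the same key.
func IdempotencyKey(runID string, step int) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s/%d", runID, step)))
	return hex.EncodeToString(sum[:16])
}

// ErrTimeout simulates a response lost after the write was performed.
var ErrTimeout = errors.New("timeout after write")

// TicketService is a hypothetical upstream that deduplicates by key.
type TicketService struct {
	byKey            map[string]string // idempotency key -> ticket ID
	Created          int               // number of tickets actually created
	DropNextResponse bool              // perform the write, then lose the reply
}

func NewTicketService() *TicketService {
	return &TicketService{byKey: make(map[string]string)}
}

// Create returns the existing ticket when the key has been seen before.
func (s *TicketService) Create(key, title string) (string, error) {
	id, seen := s.byKey[key]
	if !seen {
		s.Created++
		id = fmt.Sprintf("TICKET-%d", s.Created)
		s.byKey[key] = id
	}
	if s.DropNextResponse {
		s.DropNextResponse = false
		return "", ErrTimeout
	}
	return id, nil
}

// CreateWithRetry retries with the same key, so the upstream creates at
// most one ticket no matter how many attempts are needed.
func CreateWithRetry(s *TicketService, key, title string, maxAttempts int) (string, error) {
	var err error
	for i := 0; i < maxAttempts; i++ {
		var id string
		if id, err = s.Create(key, title); err == nil {
			return id, nil
		}
	}
	return "", err
}
```

With `DropNextResponse` set, the first attempt creates the ticket and times out, and the second attempt returns the same ticket ID instead of creating a duplicate. This only works when the upstream system honors the key; where it does not, the Activity has to look up existing state before writing.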

Human-in-the-loop as a first-class step

For dangerous operations:

pause
ask for approval with the plan summary
resume when approved

Temporal workflows can wait for signals without holding threads like a traditional process would.

This is one of the cleanest ways to build “preview -> approve -> apply” without building a bunch of custom state machinery.
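
The control flow of an approval gate can be sketched with a channel and a timer. This in-memory version is not durable: a Go channel does not survive a process restart, which is exactly what a workflow engine's signal and timer add. `Decision`, `AwaitApproval`, and `ApplyIfApproved` are hypothetical names for this example.

```go
package main

import (
	"errors"
	"time"
)

// Decision is a hypothetical approval message sent by a human reviewer.
type Decision struct {
	Approved bool
	By       string
}

// ErrApprovalTimeout is returned when nobody decides within the window.
var ErrApprovalTimeout = errors.New("approval timed out")

// AwaitApproval blocks until a decision arrives or the window closes.
func AwaitApproval(decisions <-chan Decision, window time.Duration) (Decision, error) {
	timer := time.NewTimer(window)
	defer timer.Stop()
	select {
	case d := <-decisions:
		return d, nil
	case <-timer.C:
		return Decision{}, ErrApprovalTimeout
	}
}

// ApplyIfApproved implements preview -> approve -> apply: the plan is
// applied only after an explicit approval, and the returned status
// string records who decided.
func ApplyIfApproved(plan string, decisions <-chan Decision, window time.Duration, apply func(plan string) error) (string, error) {
	d, err := AwaitApproval(decisions, window)
	if err != nil {
		return "expired", err
	}
	if !d.Approved {
		return "rejected by " + d.By, nil
	}
	if err := apply(plan); err != nil {
		return "failed", err
	}
	return "applied, approved by " + d.By, nil
}
```

Treating a timeout as “expired” rather than as approval is a deliberate choice: for dangerous operations, silence should never count as consent.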

Replay, audit, and debugging

Temporal events are recorded as part of the workflow’s event history. [4]

This yields production superpowers:

reconstruct exactly what happened
understand why a step was taken
replay a run to test a bug fix
implement “reset” patterns (carefully)

For agents, this is the difference between:

“the model did something weird” and
“step 7 called tool X with args Y after tool Z returned response R”
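
As a toy illustration of what an append-only history buys you, the sketch below rebuilds state and renders a readable trail from a list of events. The `Event` shape and the event type names are hypothetical and much simpler than Temporal's real event types [4].

```go
package main

import "fmt"

// Event is a hypothetical, simplified history entry.
type Event struct {
	Seq    int
	Type   string // "ToolScheduled" or "ToolCompleted" in this sketch
	Tool   string
	Detail string // arguments for a scheduled call, result for a completed one
}

// Rebuild folds the history into current state: the result of every
// finished tool call and the tool that is still in flight, if any.
func Rebuild(history []Event) (results map[string]string, inFlight string) {
	results = make(map[string]string)
	for _, e := range history {
		switch e.Type {
		case "ToolScheduled":
			inFlight = e.Tool
		case "ToolCompleted":
			results[e.Tool] = e.Detail
			if inFlight == e.Tool {
				inFlight = ""
			}
		}
	}
	return results, inFlight
}

// Explain renders the history as one human-readable line per event.
func Explain(history []Event) []string {
	lines := make([]string, 0, len(history))
	for _, e := range history {
		lines = append(lines, fmt.Sprintf("step %d: %s %s %s", e.Seq, e.Type, e.Tool, e.Detail))
	}
	return lines
}
```

Because state is a pure function of the history, the same fold serves crash recovery, incident review, and regression tests that replay a captured run against new code.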

Versioning: evolving agents safely

Agent logic will change. Prompts will change. Tool contracts will change.

If you have long-running agents, you need a strategy that doesn’t break in-flight executions.

Temporal provides workflow versioning mechanisms because determinism means you can’t simply change workflow logic without thought. [2]

Production approach:

keep existing executions on old code paths
route new executions to new paths
migrate intentionally

This prevents “deploy broke every running workflow.”
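
The idea behind version markers can be sketched without the SDK. The Go SDK exposes this as `workflow.GetVersion` [2]; the code below is a simplified stdlib model of the same idea, with hypothetical `Run` and `PlannerFor` names. An execution whose recorded history already passed the change point has no marker and stays on the old path; an execution reaching the point for the first time records the newest version.

```go
package main

import "fmt"

// DefaultVersion marks code that ran before any version marker existed.
const DefaultVersion = -1

// Run holds the version markers recorded in one execution's history.
type Run struct {
	Markers   map[string]int
	Replaying bool // true while rebuilding state from existing history
}

// GetVersion returns the version this execution must follow for changeID.
func (r *Run) GetVersion(changeID string, minSupported, maxSupported int) (int, error) {
	if r.Markers == nil {
		r.Markers = make(map[string]int)
	}
	v, recorded := r.Markers[changeID]
	switch {
	case recorded:
		// Keep the version this execution already committed to.
	case r.Replaying:
		// History passed this point before the change was deployed.
		v = DefaultVersion
	default:
		// First time through: record the newest version.
		v = maxSupported
		r.Markers[changeID] = v
	}
	if v < minSupported || v > maxSupported {
		return v, fmt.Errorf("change %q: version %d outside supported range [%d, %d]",
			changeID, v, minSupported, maxSupported)
	}
	return v, nil
}

// PlannerFor picks a code path based on the recorded version.
func PlannerFor(r *Run) (string, error) {
	v, err := r.GetVersion("planner-prompt", DefaultVersion, 1)
	if err != nil {
		return "", err
	}
	if v == DefaultVersion {
		return "legacy planner", nil
	}
	return "planner v2", nil
}
```

Raising `minSupported` above `DefaultVersion` is how the old path is eventually retired: any execution still pinned to it fails loudly instead of silently taking the wrong branch.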

A production checklist

Architecture

Agent runs modeled as workflows; tool calls as activities.
External events modeled as signals; state exposed via queries.

Determinism

No IO in workflow code (only orchestration).
Workflow changes use versioning strategy. [2]

Reliability

Retry policies defined for Activities. [3]
Timeouts defined and bounded.
Idempotency keys used for side-effecting actions.

Governance

Human approval gates exist for dangerous operations.
Audit trails include plan summaries and results.

Operability

Event history used for debugging and incident analysis. [4]

References

[1] Temporal - Workflow Definition (determinism requirement): https://docs.temporal.io/workflow-definition
[2] Temporal Go SDK - Versioning (evolving deterministic workflows safely): https://docs.temporal.io/develop/go/versioning
[3] Temporal - Retry Policies (how and when retries happen): https://docs.temporal.io/encyclopedia/retry-policies
[4] Temporal - Events reference (event history): https://docs.temporal.io/references/events
[5] Temporal - Workflows overview: https://docs.temporal.io/workflows


Authors

Roy Gabriel

DevOps Architect · Applied AI Engineer

I’ve spent 20 years building systems across embedded systems, microcontrollers, PLCs, security platforms, fintech, SRE, and platform architecture. Today I focus on production AI systems in Go: multi-agent orchestration, MCP server ecosystems, and the DevOps platforms that keep them running. I care about systems that work under pressure: observable, recoverable, and built to last.
