Pause / Resume / Checkpoint for Agents

Execution memory lets agents save their task state, pause for human input or external events, resume from exactly where they left off, and handle failures gracefully. Build reliable long-running agent tasks that survive interruptions.

Overview

Long-running agent tasks -- migrations, large code reviews, multi-step deployments -- need persistence. If the agent crashes or the user closes their laptop, work should not be lost. Execution memory provides:

  • Checkpoints -- Snapshots of task progress that can be restored
  • Pause/Resume -- Gracefully pause a task and resume later
  • Failure recovery -- Automatically retry or resume from the last checkpoint on failure
  • State persistence -- Arbitrary JSON state stored alongside task metadata

Creating Execution State

Start by creating an execution state entry for your task. This establishes the task identity and initial configuration.

Create execution state
import Rekall from '@rekall/agent-sdk';

const rekall = new Rekall({
  apiKey: 'rk_your_key',
  agentId: 'agent_abc123',
});

// Create an execution state for a long-running task
const execution = await rekall.execution.create({
  name: 'Migrate database schema',
  description: 'Apply 15 migration files to production database',
  task: {
    type: 'database-migration',
    totalSteps: 15,
    params: {
      database: 'production',
      migrations: ['001_users.sql', '002_posts.sql', '...'],
    },
  },
  config: {
    checkpointInterval: 'per-step',
    maxRetries: 3,
    timeout: 1800000, // 30 minutes
    onFailure: 'pause',
  },
});

console.log(`Execution ID: ${execution.id}`); // exec_xyz789
console.log(`Status: ${execution.status}`); // 'created'

Saving Checkpoints

Checkpoints capture the current state of a task so it can be restored later. Save checkpoints at meaningful boundaries -- after each migration file, after processing a batch, etc.

Save and manage checkpoints
// The same migration list that was passed to create() above
const migrations = ['001_users.sql', '002_posts.sql', '...'];

// Record the start time so the total duration can be reported at the end
const startTime = Date.now();

// Start the execution
await rekall.execution.start({ executionId: execution.id });

// Process each migration step
for (let i = 0; i < migrations.length; i++) {
  const migration = migrations[i];

  // Apply the migration
  await applyMigration(migration);

  // Save checkpoint after each step
  await rekall.execution.checkpoint({
    executionId: execution.id,
    step: i + 1,
    state: {
      completedMigrations: migrations.slice(0, i + 1),
      lastApplied: migration,
      databaseVersion: i + 1,
      rollbackAvailable: true,
    },
    message: `Applied migration ${i + 1}/${migrations.length}: ${migration}`,
  });
  console.log(`Checkpoint saved: step ${i + 1}`);
}

// Mark execution as complete
await rekall.execution.complete({
  executionId: execution.id,
  result: {
    migrationsApplied: migrations.length,
    finalVersion: migrations.length,
    duration: Date.now() - startTime, // elapsed milliseconds
  },
});

Checkpoint Strategies

Different checkpoint strategies
// Per-step: checkpoint after every step (safest, more storage)
config: { checkpointInterval: 'per-step' }

// Periodic: checkpoint every N milliseconds
config: { checkpointInterval: 'periodic', checkpointPeriod: 30000 } // 30s

// Manual: only checkpoint when you explicitly call checkpoint()
config: { checkpointInterval: 'manual' }

// Custom: checkpoint based on a condition
config: {
  checkpointInterval: 'custom',
  shouldCheckpoint: (step, state) => {
    // Checkpoint every 5 steps or when processing large items
    return step % 5 === 0 || state.currentItemSize > 1000000;
  },
}
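
To see how a custom predicate like the one above behaves, here is a minimal standalone sketch that runs it over a handful of steps. It does not call the Rekall SDK, and the `itemSizes` values are made up purely for illustration.

```javascript
// Same predicate as in the 'custom' strategy above.
const shouldCheckpoint = (step, state) =>
  step % 5 === 0 || state.currentItemSize > 1000000;

// Hypothetical item sizes in bytes, one per step (illustrative data only).
const itemSizes = [200, 5000, 2500000, 800, 1200, 900, 400];

// Collect the 1-based step numbers at which a checkpoint would be taken.
const checkpointedSteps = [];
itemSizes.forEach((size, index) => {
  const step = index + 1;
  if (shouldCheckpoint(step, { currentItemSize: size })) {
    checkpointedSteps.push(step);
  }
});

console.log(checkpointedSteps); // [ 3, 5 ]
```

Step 3 is checkpointed because its item exceeds 1,000,000 bytes, and step 5 because it is a multiple of five.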

Pausing Long-Running Tasks

A running task can be paused explicitly (for example, to wait for human approval), or it can pause automatically on failure. In both cases the state is preserved as of the last checkpoint.

Pause an execution
// Explicitly pause for human review
await rekall.execution.pause({
  executionId: execution.id,
  reason: 'Awaiting approval before applying migration 008_drop_table.sql',
  preserveState: true,
});

// Check the paused state
const state = await rekall.execution.getState({
  executionId: execution.id,
});
console.log(`Status: ${state.status}`); // 'paused'
console.log(`Paused at step: ${state.currentStep}`); // 7
console.log(`Reason: ${state.pauseReason}`);
console.log(`Last checkpoint: ${state.lastCheckpoint.message}`);

// List all paused executions
const paused = await rekall.execution.list({
  status: 'paused',
  sort: 'pausedAt',
  order: 'desc',
});
for (const exec of paused.items) {
  console.log(`[${exec.name}] paused at step ${exec.currentStep}: ${exec.pauseReason}`);
}

Waiting for External Processes

Agents often need to wait for external systems -- CI/CD pipelines, third-party APIs, database migrations, deployment rollouts, or approval workflows in other tools. The external_dependency pause reason is designed for this. The agent pauses, records what it is waiting for, and resumes automatically or manually once the dependency completes.

Pause reasons

Rekall supports six pause reasons: human_approval, human_input, scheduled_wait, rate_limit, external_dependency, and checkpoint. Each one tags the paused state so dashboards and automations can react appropriately.
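
As an illustration of how a dashboard or automation might react to these tags, the sketch below validates a pause reason and routes it. The routing rule (which reasons need a person) and the function and label names are assumptions made for this example, not part of the SDK.

```javascript
// The six pause reasons listed above.
const PAUSE_REASONS = new Set([
  'human_approval',
  'human_input',
  'scheduled_wait',
  'rate_limit',
  'external_dependency',
  'checkpoint',
]);

// Hypothetical routing rule: reasons that need a person vs. ones a job can handle.
const NEEDS_HUMAN = new Set(['human_approval', 'human_input']);

function routePausedExecution(pauseReason) {
  if (!PAUSE_REASONS.has(pauseReason)) {
    throw new Error(`Unknown pause reason: ${pauseReason}`);
  }
  return NEEDS_HUMAN.has(pauseReason) ? 'notify-human' : 'automation-queue';
}

console.log(routePausedExecution('human_approval'));      // notify-human
console.log(routePausedExecution('external_dependency')); // automation-queue
```

Rejecting unknown reasons up front keeps a typo in a tag from silently landing in the wrong queue.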

Polling Pattern

The simplest approach: pause the agent with details about the external process, then poll until the dependency resolves. This works well for CI builds, deployment health checks, or any system that exposes a status endpoint.

Pause for external dependency (polling)
// Agent kicked off a CI build and needs to wait for it
const buildId = await triggerCIBuild({ branch: 'feature/auth-v2' });

// Pause with external_dependency reason
await rekall.execution.pause({
  executionId: execution.id,
  reason: 'external_dependency',
  message: `Waiting for CI build ${buildId} to complete`,
  preserveState: true,
  metadata: {
    dependencyType: 'ci_build',
    externalId: buildId,
    externalUrl: `https://ci.example.com/builds/${buildId}`,
    startedAt: new Date().toISOString(),
    pollInterval: 30000, // suggested poll interval in ms
  },
});

// --- Later: a separate process or cron job polls and resumes ---

// Check all executions waiting on external dependencies
const waiting = await rekall.execution.list({
  status: 'paused',
  pauseReason: 'external_dependency',
});

for (const exec of waiting.items) {
  const { dependencyType, externalId } = exec.metadata;
  if (dependencyType === 'ci_build') {
    const build = await checkCIBuild(externalId);
    if (build.status === 'success') {
      await rekall.execution.resume({
        executionId: exec.id,
        stateOverrides: {
          buildResult: build,
          buildArtifacts: build.artifacts,
        },
      });
    } else if (build.status === 'failed') {
      await rekall.execution.fail({
        executionId: exec.id,
        error: {
          message: `CI build ${externalId} failed`,
          buildLogs: build.logUrl,
        },
      });
    }
    // If still running, do nothing -- will check again next poll
  }
}
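
The example above leaves the poll loop itself to a cron job. If you would rather poll in-process, a generic helper along the following lines could work. This is a sketch: `pollUntil`, its options, and `fakeCheck` are hypothetical names, and the fake check stands in for a real status call such as `checkCIBuild`.

```javascript
// Generic polling helper: calls `check` until it returns a non-null result
// or the attempt budget runs out. Names here are illustrative, not SDK APIs.
async function pollUntil(check, { intervalMs, maxAttempts }) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = await check(attempt);
    if (result !== null && result !== undefined) {
      return { done: true, attempts: attempt, result };
    }
    // Sleep between attempts, but not after the final one.
    if (attempt < maxAttempts) {
      await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
  }
  return { done: false, attempts: maxAttempts, result: null };
}

// Stand-in for a CI status check that succeeds on the third call.
const fakeCheck = async (attempt) => (attempt >= 3 ? { status: 'success' } : null);

const polling = pollUntil(fakeCheck, { intervalMs: 10, maxAttempts: 5 });
polling.then((outcome) => console.log(outcome));
// { done: true, attempts: 3, result: { status: 'success' } }
```

Capping the number of attempts keeps a stuck dependency from holding the poller forever; the caller decides whether `done: false` should fail the execution or leave it paused.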

Webhook / Callback Pattern

For systems that support webhooks, you can register a callback URL that resumes the agent automatically when the external process finishes. This is more efficient than polling and provides faster resume times.

Pause for external dependency (webhook)
// Start a deployment and register a webhook for completion
const deployment = await startDeployment({
  service: 'api-gateway',
  version: 'v2.4.0',
  callbackUrl: `https://api.rekall.ai/v1/execution/${execution.id}/webhook`,
});

// Pause until the webhook fires
await rekall.execution.pause({
  executionId: execution.id,
  reason: 'external_dependency',
  message: `Waiting for deployment ${deployment.id} to roll out`,
  preserveState: true,
  metadata: {
    dependencyType: 'deployment',
    externalId: deployment.id,
    service: 'api-gateway',
    targetVersion: 'v2.4.0',
    resumeOn: 'webhook', // signals this will auto-resume
  },
});

// The webhook handler (server-side) receives the callback
// and automatically resumes the execution with the payload:
//
//   POST /v1/execution/exec_xyz789/webhook
//   { "status": "healthy", "instances": 3, "version": "v2.4.0" }
//
// Rekall merges the webhook payload into the execution state
// and transitions the execution back to 'running'.

Combine with timeouts

Set a timeout on external dependency pauses to avoid indefinite waits. If the external process does not complete within the timeout, the execution transitions to failed with a timeout error, which you can catch with the onFailure handler.
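
This page does not show how the timeout is enforced on the Rekall side. If your own poller needs a client-side deadline, for instance based on a `startedAt` timestamp stored in the pause metadata, a check like the following sketch is one option. `checkPauseDeadline` and its parameters are hypothetical names for this example.

```javascript
// Decide whether a paused execution has waited longer than allowed.
// `pausedAt` is an ISO timestamp; `timeoutMs` is the maximum allowed wait.
function checkPauseDeadline(pausedAt, timeoutMs, now = Date.now()) {
  const elapsedMs = now - Date.parse(pausedAt);
  return elapsedMs >= timeoutMs
    ? { expired: true, overdueMs: elapsedMs - timeoutMs }
    : { expired: false, remainingMs: timeoutMs - elapsedMs };
}

const pausedAt = '2024-01-01T00:00:00.000Z';
const tenMinutes = 10 * 60 * 1000;

// Five minutes after pausing: still within the timeout.
console.log(checkPauseDeadline(pausedAt, tenMinutes, Date.parse('2024-01-01T00:05:00.000Z')));
// { expired: false, remainingMs: 300000 }

// Fifteen minutes after pausing: past the timeout.
console.log(checkPauseDeadline(pausedAt, tenMinutes, Date.parse('2024-01-01T00:15:00.000Z')));
// { expired: true, overdueMs: 300000 }
```

Passing `now` as a parameter keeps the function deterministic, which makes the deadline logic easy to test.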

Resuming from Checkpoint

Resume a paused or failed execution from its last checkpoint. The task picks up exactly where it left off with the full state restored.

Resume an execution
// Resume from the last checkpoint
await rekall.execution.resume({
  executionId: execution.id,
});

// Resume from a specific checkpoint (roll back to earlier state)
const checkpoints = await rekall.execution.listCheckpoints({
  executionId: execution.id,
});
console.log('Available checkpoints:');
for (const cp of checkpoints.items) {
  console.log(`  Step ${cp.step}: ${cp.message} (${cp.createdAt})`);
}

// Resume from step 5 instead of step 7.
// Look the checkpoint up by step number rather than by array position,
// so the result does not depend on the order in which items are returned.
const stepFive = checkpoints.items.find((cp) => cp.step === 5);
await rekall.execution.resume({
  executionId: execution.id,
  fromCheckpoint: stepFive.id,
});

Partial Resume

You can modify the state before resuming, which is useful for fixing whatever caused the pause.

Resume with modified state
// Get the current state
const state = await rekall.execution.getState({
  executionId: execution.id,
});

// Modify state to fix the issue
await rekall.execution.resume({
  executionId: execution.id,
  stateOverrides: {
    ...state.lastCheckpoint.state,
    // Fix the issue: skip the problematic migration
    skipMigrations: ['008_drop_table.sql'],
    // Or provide corrected parameters
    databaseUrl: 'postgresql://corrected-host:5432/mydb',
  },
});
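
The `stateOverrides` object above is built with JavaScript object spread. The standalone sketch below shows what that expression produces, using made-up checkpoint values. It covers only the language semantics; how Rekall applies the overrides on its side is not shown here.

```javascript
// State from the last checkpoint (illustrative values).
const checkpointState = {
  completedMigrations: ['001_users.sql', '002_posts.sql'],
  databaseVersion: 2,
  databaseUrl: 'postgresql://old-host:5432/mydb',
};

// Keys listed after the spread replace same-named keys from the checkpoint;
// keys that exist only in the checkpoint are carried over unchanged.
const stateOverrides = {
  ...checkpointState,
  skipMigrations: ['008_drop_table.sql'],
  databaseUrl: 'postgresql://corrected-host:5432/mydb',
};

console.log(stateOverrides.databaseUrl);     // postgresql://corrected-host:5432/mydb
console.log(stateOverrides.databaseVersion); // 2

// The spread is shallow: nested arrays and objects are shared, not copied.
console.log(stateOverrides.completedMigrations === checkpointState.completedMigrations); // true
```

Because the copy is shallow, mutating a nested array in the overrides would also change the original checkpoint object, so build new arrays instead of pushing into existing ones.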

Handling Failures

Configure how failures are handled -- automatic retry, pause for human review, abort with cleanup, or a custom callback.

Retry Strategies

Failure handling configuration
const execution = await rekall.execution.create({
  name: 'Data Processing Pipeline',
  config: {
    // On failure: 'retry' | 'pause' | 'abort' | 'callback'
    onFailure: 'retry',

    // Retry configuration
    maxRetries: 3,
    retryDelay: 5000, // 5 seconds before the first retry
    retryBackoff: 'exponential', // 'fixed' | 'exponential' | 'linear'
    retryBackoffMultiplier: 2, // 5s, 10s, 20s

    // After all retries exhausted
    onRetriesExhausted: 'pause', // Falls back to pause for human review

    // Timeout per step (ms)
    stepTimeout: 60000,

    // Total execution timeout (ms)
    timeout: 3600000,

    // Cleanup on abort
    onAbort: async (state) => {
      // Roll back partial changes
      await rollbackMigrations(state.completedMigrations);
    },
  },
});

// Register a failure callback (for onFailure: 'callback')
await rekall.execution.onFailure({
  executionId: execution.id,
  callback: async (error, state) => {
    if (error.code === 'TIMEOUT') {
      // Extend timeout and retry
      return { action: 'retry', newTimeout: state.timeout * 2 };
    }
    if (error.code === 'AUTH_EXPIRED') {
      // Refresh credentials and retry
      await refreshCredentials();
      return { action: 'retry' };
    }
    // Unknown error - pause for human review
    return { action: 'pause', reason: error.message };
  },
});
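
To make the backoff options concrete, the following sketch computes the delay before each retry for the three `retryBackoff` modes. The exponential case reproduces the 5s, 10s, 20s sequence noted in the configuration above. The `linear` formula is an assumption, since the page does not define it, and `retryDelays` is a hypothetical helper rather than an SDK function.

```javascript
// Delay before each retry, in milliseconds.
// Assumption: 'linear' grows by one base delay per attempt.
function retryDelays({ maxRetries, retryDelay, retryBackoff, retryBackoffMultiplier = 2 }) {
  const delays = [];
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    if (retryBackoff === 'exponential') {
      delays.push(retryDelay * retryBackoffMultiplier ** attempt);
    } else if (retryBackoff === 'linear') {
      delays.push(retryDelay * (attempt + 1));
    } else {
      delays.push(retryDelay); // 'fixed'
    }
  }
  return delays;
}

const base = { maxRetries: 3, retryDelay: 5000 };

console.log(retryDelays({ ...base, retryBackoff: 'exponential', retryBackoffMultiplier: 2 }));
// [ 5000, 10000, 20000 ]
console.log(retryDelays({ ...base, retryBackoff: 'linear' }));
// [ 5000, 10000, 15000 ]
console.log(retryDelays({ ...base, retryBackoff: 'fixed' }));
// [ 5000, 5000, 5000 ]
```

With these settings the exponential schedule spends 35 seconds waiting in total before `onRetriesExhausted` applies.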

Idempotent steps

Design your steps to be idempotent (safe to retry). If a step partially completed before failing, retrying it should not cause duplicate work or data corruption. Use the checkpoint state to track what has already been processed.
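
A minimal way to picture this is shown below, using in-memory stand-ins for the database and the checkpoint store; none of it calls the SDK and all names are illustrative. The runner consults the checkpointed list of completed migrations and skips anything already applied, so a retry after a failure does not repeat earlier work.

```javascript
// In-memory stand-ins for a database and a checkpoint store (illustrative only).
const appliedToDatabase = [];
let checkpointState = { completedMigrations: [] };

function runMigrations(migrations, failAt = null) {
  const done = new Set(checkpointState.completedMigrations);
  for (const migration of migrations) {
    if (done.has(migration)) continue; // already applied in an earlier run
    if (migration === failAt) throw new Error(`Failed on ${migration}`);
    appliedToDatabase.push(migration);
    done.add(migration);
    checkpointState = { completedMigrations: [...done] }; // "save checkpoint"
  }
}

const migrations = ['001_users.sql', '002_posts.sql', '003_tags.sql'];

try {
  runMigrations(migrations, '003_tags.sql'); // first run fails on the last file
} catch (error) {
  console.log(error.message); // Failed on 003_tags.sql
}

runMigrations(migrations); // retry: skips the two files already applied
console.log(appliedToDatabase);
// [ '001_users.sql', '002_posts.sql', '003_tags.sql' ]
```

Tracking progress narrows the problem but does not remove it: a crash between applying a step and saving its checkpoint would still cause that one step to run again, so the step itself should also be safe to repeat.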

Execution Lifecycle

Lifecycle States

| State | Description | Transitions To |
|-----------|-------------|----------------|
| created | Execution created but not started | running, cancelled |
| running | Actively processing steps | paused, completed, failed |
| paused | Stopped, waiting for resume. Tagged with a pause reason: human_approval, human_input, external_dependency, scheduled_wait, rate_limit, or checkpoint. | running, cancelled |
| completed | All steps finished successfully | (terminal) |
| failed | Failed after all retries exhausted | running (via resume) |
| cancelled | Explicitly cancelled by user | (terminal) |

Monitor execution lifecycle
// Subscribe to execution state changes
const unsubscribe = rekall.execution.onStateChange({
  executionId: execution.id,
  callback: (event) => {
    console.log(`[${event.timestamp}] ${event.previousStatus} -> ${event.status}`);
    if (event.status === 'paused') {
      console.log(`  Reason: ${event.reason}`);
      // Notify the user via Slack, email, etc.
      notifyUser(`Task paused: ${event.reason}`);
    }
    if (event.checkpoint) {
      console.log(`  Checkpoint: step ${event.checkpoint.step}`);
    }
  },
});

// Cancel an execution
await rekall.execution.cancel({
  executionId: execution.id,
  reason: 'No longer needed',
  cleanup: true, // Run cleanup handlers
});
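
If you want to validate transitions on the client before calling the API, the lifecycle table above can be transcribed into a small lookup. This is a sketch of that idea; `canTransition` and `isTerminal` are hypothetical helpers, not SDK functions.

```javascript
// Allowed transitions, transcribed from the lifecycle table above.
const TRANSITIONS = new Map([
  ['created', ['running', 'cancelled']],
  ['running', ['paused', 'completed', 'failed']],
  ['paused', ['running', 'cancelled']],
  ['completed', []],
  ['failed', ['running']], // via resume
  ['cancelled', []],
]);

// True when the table lists `to` as a valid next state for `from`.
function canTransition(from, to) {
  const next = TRANSITIONS.get(from);
  return next !== undefined && next.includes(to);
}

// True for known states with no outgoing transitions.
function isTerminal(state) {
  const next = TRANSITIONS.get(state);
  return next !== undefined && next.length === 0;
}

console.log(canTransition('paused', 'running'));    // true
console.log(canTransition('completed', 'running')); // false
console.log(canTransition('failed', 'running'));    // true
console.log(isTerminal('cancelled'));               // true
```

Unknown state names return `false` from both helpers instead of throwing, which keeps the check safe to use on data from older or newer API versions.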
