Pause / Resume / Checkpoint for Agents
Execution memory lets agents save their task state, pause for human input or external events, resume from exactly where they left off, and handle failures gracefully. Build reliable long-running agent tasks that survive interruptions.
Overview
Long-running agent tasks -- migrations, large code reviews, multi-step deployments -- need persistence. If the agent crashes or the user closes their laptop, work should not be lost. Execution memory provides:
- Checkpoints -- Snapshots of task progress that can be restored
- Pause/Resume -- Gracefully pause a task and resume later
- Failure recovery -- Automatically retry or resume from the last checkpoint on failure
- State persistence -- Arbitrary JSON state stored alongside task metadata
Creating Execution State
Start by creating an execution state entry for your task. This establishes the task identity and initial configuration.
    import Rekall from '@rekall/agent-sdk';

    const rekall = new Rekall({
      apiKey: 'rk_your_key',
      agentId: 'agent_abc123',
    });

    // Create an execution state for a long-running task
    const execution = await rekall.execution.create({
      name: 'Migrate database schema',
      description: 'Apply 15 migration files to production database',
      task: {
        type: 'database-migration',
        totalSteps: 15,
        params: {
          database: 'production',
          migrations: ['001_users.sql', '002_posts.sql', '...'],
        },
      },
      config: {
        checkpointInterval: 'per-step',
        maxRetries: 3,
        timeout: 1800000, // 30 minutes
        onFailure: 'pause',
      },
    });

    console.log(`Execution ID: ${execution.id}`); // exec_xyz789
    console.log(`Status: ${execution.status}`); // 'created'
Saving Checkpoints
Checkpoints capture the current state of a task so it can be restored later. Save checkpoints at meaningful boundaries -- after each migration file, after processing a batch, etc.
    // Same migration list that was passed to create() above
    const migrations = ['001_users.sql', '002_posts.sql', '...'];

    // Record the start time so the total duration can be reported on completion
    const startTime = Date.now();

    // Start the execution
    await rekall.execution.start({ executionId: execution.id });

    // Process each migration step
    for (let i = 0; i < migrations.length; i++) {
      const migration = migrations[i];

      // Apply the migration
      await applyMigration(migration);

      // Save checkpoint after each step
      await rekall.execution.checkpoint({
        executionId: execution.id,
        step: i + 1,
        state: {
          completedMigrations: migrations.slice(0, i + 1),
          lastApplied: migration,
          databaseVersion: i + 1,
          rollbackAvailable: true,
        },
        message: `Applied migration ${i + 1}/${migrations.length}: ${migration}`,
      });

      console.log(`Checkpoint saved: step ${i + 1}`);
    }

    // Mark execution as complete
    await rekall.execution.complete({
      executionId: execution.id,
      result: {
        migrationsApplied: migrations.length,
        finalVersion: migrations.length,
        duration: Date.now() - startTime,
      },
    });
Checkpoint Strategies
    // Per-step: checkpoint after every step (safest, more storage)
    config: { checkpointInterval: 'per-step' }

    // Periodic: checkpoint at a fixed time interval (checkpointPeriod is in milliseconds)
    config: { checkpointInterval: 'periodic', checkpointPeriod: 30000 } // 30s

    // Manual: only checkpoint when you explicitly call checkpoint()
    config: { checkpointInterval: 'manual' }

    // Custom: checkpoint based on a condition
    config: {
      checkpointInterval: 'custom',
      shouldCheckpoint: (step, state) => {
        // Checkpoint every 5 steps or when processing large items
        return step % 5 === 0 || state.currentItemSize > 1000000;
      },
    }
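To make the custom strategy concrete, the sketch below shows one way a runner might evaluate a `shouldCheckpoint` predicate after each step. Only the predicate comes from the config above; the `stepsToCheckpoint` helper and the `itemSizes` input are hypothetical and do not describe how Rekall schedules checkpoints internally.

```javascript
// Predicate copied from the 'custom' strategy above.
const shouldCheckpoint = (step, state) =>
  step % 5 === 0 || state.currentItemSize > 1000000;

// Hypothetical helper: returns the step numbers at which a checkpoint would be saved.
// `itemSizes[i]` is the size in bytes of the item processed at step i + 1.
function stepsToCheckpoint(itemSizes, predicate) {
  const checkpointed = [];
  itemSizes.forEach((size, index) => {
    const step = index + 1; // steps are 1-based, as in the examples above
    if (predicate(step, { currentItemSize: size })) {
      checkpointed.push(step);
    }
  });
  return checkpointed;
}

// Twelve steps; step 3 processes a 2 MB item, the rest are small.
const itemSizes = Array.from({ length: 12 }, (_, i) => (i === 2 ? 2000000 : 1024));
console.log(stepsToCheckpoint(itemSizes, shouldCheckpoint)); // [ 3, 5, 10 ]
```

Step 3 is checkpointed because of the large item, and steps 5 and 10 because of the every-fifth-step rule.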
Pausing Long-Running Tasks
You can pause a running task explicitly (for example, to wait for human approval), or let it pause automatically on failure. In both cases, the state is preserved as of the last checkpoint.
    // Explicitly pause for human review
    await rekall.execution.pause({
      executionId: execution.id,
      reason: 'Awaiting approval before applying migration 008_drop_table.sql',
      preserveState: true,
    });

    // Check the paused state
    const state = await rekall.execution.getState({
      executionId: execution.id,
    });

    console.log(`Status: ${state.status}`); // 'paused'
    console.log(`Paused at step: ${state.currentStep}`); // 7
    console.log(`Reason: ${state.pauseReason}`);
    console.log(`Last checkpoint: ${state.lastCheckpoint.message}`);

    // List all paused executions
    const paused = await rekall.execution.list({
      status: 'paused',
      sort: 'pausedAt',
      order: 'desc',
    });

    for (const exec of paused.items) {
      console.log(`[${exec.name}] paused at step ${exec.currentStep}: ${exec.pauseReason}`);
    }
Waiting for External Processes
Agents often need to wait for external systems -- CI/CD pipelines, third-party APIs, database migrations, deployment rollouts, or approval workflows in other tools. The external_dependency pause reason is designed for this. The agent pauses, records what it is waiting for, and resumes automatically or manually once the dependency completes.
Pause reasons
Rekall supports six pause reasons: human_approval, human_input, scheduled_wait, rate_limit, external_dependency, and checkpoint. Each one tags the paused state so dashboards and automations can react appropriately.
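As an illustration of how an automation could react to these tags, here is a minimal sketch that routes each pause reason to a follow-up action. The six reason strings come from the text above; the action names and the `actionForPause` helper are hypothetical and not part of the Rekall SDK.

```javascript
// The six pause reasons listed above.
const PAUSE_REASONS = [
  'human_approval', 'human_input', 'scheduled_wait',
  'rate_limit', 'external_dependency', 'checkpoint',
];

// Hypothetical routing table: what an automation might do for each reason.
const ACTIONS = {
  human_approval: 'notify_approver',
  human_input: 'prompt_user',
  scheduled_wait: 'schedule_resume',
  rate_limit: 'retry_after_backoff',
  external_dependency: 'poll_dependency',
  checkpoint: 'no_action',
};

function actionForPause(pauseReason) {
  if (!PAUSE_REASONS.includes(pauseReason)) {
    throw new Error(`Unknown pause reason: ${pauseReason}`);
  }
  return ACTIONS[pauseReason];
}

console.log(actionForPause('external_dependency')); // poll_dependency
```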
Polling Pattern
The simplest approach: pause the agent with details about the external process, then poll until the dependency resolves. This works well for CI builds, deployment health checks, or any system that exposes a status endpoint.
    // Agent kicked off a CI build and needs to wait for it
    const buildId = await triggerCIBuild({ branch: 'feature/auth-v2' });

    // Pause with external_dependency reason
    await rekall.execution.pause({
      executionId: execution.id,
      reason: 'external_dependency',
      message: `Waiting for CI build ${buildId} to complete`,
      preserveState: true,
      metadata: {
        dependencyType: 'ci_build',
        externalId: buildId,
        externalUrl: `https://ci.example.com/builds/${buildId}`,
        startedAt: new Date().toISOString(),
        pollInterval: 30000, // suggested poll interval in ms
      },
    });

    // --- Later: a separate process or cron job polls and resumes ---

    // Check all executions waiting on external dependencies
    const waiting = await rekall.execution.list({
      status: 'paused',
      pauseReason: 'external_dependency',
    });

    for (const exec of waiting.items) {
      const { dependencyType, externalId } = exec.metadata;

      if (dependencyType === 'ci_build') {
        const build = await checkCIBuild(externalId);

        if (build.status === 'success') {
          await rekall.execution.resume({
            executionId: exec.id,
            stateOverrides: {
              buildResult: build,
              buildArtifacts: build.artifacts,
            },
          });
        } else if (build.status === 'failed') {
          await rekall.execution.fail({
            executionId: exec.id,
            error: {
              message: `CI build ${externalId} failed`,
              buildLogs: build.logUrl,
            },
          });
        }
        // If still running, do nothing -- will check again next poll
      }
    }
Webhook / Callback Pattern
For systems that support webhooks, you can register a callback URL that resumes the agent automatically when the external process finishes. This is more efficient than polling and provides faster resume times.
    // Start a deployment and register a webhook for completion
    const deployment = await startDeployment({
      service: 'api-gateway',
      version: 'v2.4.0',
      callbackUrl: `https://api.rekall.ai/v1/execution/${execution.id}/webhook`,
    });

    // Pause until the webhook fires
    await rekall.execution.pause({
      executionId: execution.id,
      reason: 'external_dependency',
      message: `Waiting for deployment ${deployment.id} to roll out`,
      preserveState: true,
      metadata: {
        dependencyType: 'deployment',
        externalId: deployment.id,
        service: 'api-gateway',
        targetVersion: 'v2.4.0',
        resumeOn: 'webhook', // signals this will auto-resume
      },
    });

    // The webhook handler (server-side) receives the callback
    // and automatically resumes the execution with the payload:
    //
    // POST /v1/execution/exec_xyz789/webhook
    // { "status": "healthy", "instances": 3, "version": "v2.4.0" }
    //
    // Rekall merges the webhook payload into the execution state
    // and transitions the execution back to 'running'.
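The merge-and-resume behavior described in the comments above can be modeled in a few lines. This is a sketch under an assumption: it uses a shallow merge in which payload keys overwrite existing state keys, since the source does not say whether the real merge is shallow or deep. The `applyWebhook` function and the sample execution object are hypothetical.

```javascript
// Hypothetical model of the webhook resume step: shallow-merge the payload
// into the saved state and move the execution from 'paused' to 'running'.
function applyWebhook(execution, payload) {
  if (execution.status !== 'paused') {
    throw new Error(`Cannot resume from status '${execution.status}'`);
  }
  return {
    ...execution,
    status: 'running',
    state: { ...execution.state, ...payload }, // payload keys win on conflict
  };
}

const pausedExecution = {
  id: 'exec_xyz789',
  status: 'paused',
  state: { service: 'api-gateway', targetVersion: 'v2.4.0' },
};

const resumed = applyWebhook(pausedExecution, {
  status: 'healthy',
  instances: 3,
  version: 'v2.4.0',
});

console.log(resumed.status); // running
console.log(resumed.state.instances); // 3
```

Note that the payload's own `status` field lands inside `state` and does not affect the execution status, which is tracked separately in this model.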
Combine with timeouts
Set a timeout on external dependency pauses to avoid indefinite waits. If the external process does not complete within the timeout, the execution transitions to failed with a timeout error, which you can catch with the onFailure handler.
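The source does not show which option sets this timeout, so the sketch below only illustrates the underlying deadline check that a poller could apply on its own side. The `hasTimedOut` helper is hypothetical; `startedAt` follows the ISO timestamp format used in the pause metadata earlier.

```javascript
// Hypothetical check: has a paused execution waited longer than its timeout?
// `startedAt` is an ISO timestamp, as in the metadata of the pause examples.
function hasTimedOut(startedAt, timeoutMs, now = Date.now()) {
  const elapsedMs = now - Date.parse(startedAt);
  return elapsedMs > timeoutMs;
}

const startedAt = '2024-01-01T00:00:00.000Z';
const thirtyMinutes = 30 * 60 * 1000;

// 29 minutes after the pause started: still within the timeout.
console.log(hasTimedOut(startedAt, thirtyMinutes, Date.parse('2024-01-01T00:29:00.000Z'))); // false
// 31 minutes after the pause started: past the timeout.
console.log(hasTimedOut(startedAt, thirtyMinutes, Date.parse('2024-01-01T00:31:00.000Z'))); // true
```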
Resuming from Checkpoint
Resume a paused or failed execution from its last checkpoint. The task picks up exactly where it left off with the full state restored.
    // Resume from the last checkpoint
    await rekall.execution.resume({
      executionId: execution.id,
    });

    // Resume from a specific checkpoint (roll back to earlier state)
    const checkpoints = await rekall.execution.listCheckpoints({
      executionId: execution.id,
    });

    console.log('Available checkpoints:');
    for (const cp of checkpoints.items) {
      console.log(`  Step ${cp.step}: ${cp.message} (${cp.createdAt})`);
    }

    // Resume from step 5 instead of step 7
    await rekall.execution.resume({
      executionId: execution.id,
      // Step 5 checkpoint, assuming items are listed in step order
      fromCheckpoint: checkpoints.items[4].id,
    });
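Selecting a checkpoint by array index only works if the list is returned in step order, which the source does not state. A more defensive option is to look the checkpoint up by its `step` field, as in this sketch. The `findCheckpointByStep` helper and the sample data are hypothetical; the `id`, `step`, and `message` fields match those used above.

```javascript
// Hypothetical helper: look up a checkpoint by step number rather than by
// array position, so the result does not depend on how the list is ordered.
function findCheckpointByStep(items, step) {
  const match = items.find((cp) => cp.step === step);
  if (!match) {
    throw new Error(`No checkpoint recorded for step ${step}`);
  }
  return match;
}

// Sample data in the shape used above (id, step, message); values are made up.
const sampleCheckpoints = [
  { id: 'cp_7', step: 7, message: 'Applied migration 7/15' },
  { id: 'cp_5', step: 5, message: 'Applied migration 5/15' },
  { id: 'cp_6', step: 6, message: 'Applied migration 6/15' },
];

console.log(findCheckpointByStep(sampleCheckpoints, 5).id); // cp_5
```

The returned `id` could then be passed as `fromCheckpoint` when resuming.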
Partial Resume
You can modify the state before resuming, which is useful for fixing the issue that caused the pause.
    // Get the current state
    const state = await rekall.execution.getState({
      executionId: execution.id,
    });

    // Modify state to fix the issue
    await rekall.execution.resume({
      executionId: execution.id,
      stateOverrides: {
        ...state.lastCheckpoint.state,
        // Fix the issue: skip the problematic migration
        skipMigrations: ['008_drop_table.sql'],
        // Or provide corrected parameters
        databaseUrl: 'postgresql://corrected-host:5432/mydb',
      },
    });
Handling Failures
Configure how failures are handled -- automatic retry, pause for human review, or abort with cleanup.
Retry Strategies
    const execution = await rekall.execution.create({
      name: 'Data Processing Pipeline',
      config: {
        // On failure: 'retry' | 'pause' | 'abort' | 'callback'
        onFailure: 'retry',

        // Retry configuration
        maxRetries: 3,
        retryDelay: 5000, // 5 seconds between retries
        retryBackoff: 'exponential', // 'fixed' | 'exponential' | 'linear'
        retryBackoffMultiplier: 2, // 5s, 10s, 20s

        // After all retries exhausted
        onRetriesExhausted: 'pause', // Falls back to pause for human review

        // Timeout per step
        stepTimeout: 60000,

        // Total execution timeout
        timeout: 3600000,

        // Cleanup on abort
        onAbort: async (state) => {
          // Roll back partial changes
          await rollbackMigrations(state.completedMigrations);
        },
      },
    });

    // Register a failure callback (for onFailure: 'callback')
    await rekall.execution.onFailure({
      executionId: execution.id,
      callback: async (error, state) => {
        if (error.code === 'TIMEOUT') {
          // Extend timeout and retry
          return { action: 'retry', newTimeout: state.timeout * 2 };
        }

        if (error.code === 'AUTH_EXPIRED') {
          // Refresh credentials and retry
          await refreshCredentials();
          return { action: 'retry' };
        }

        // Unknown error - pause for human review
        return { action: 'pause', reason: error.message };
      },
    });
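To show how the retry settings interact, the sketch below computes the delay before each retry for the three backoff modes. The exponential branch reproduces the 5s, 10s, 20s sequence from the config comment. The `retryDelayMs` helper is hypothetical, and the linear formula (base delay times attempt number) is an assumption, since the source does not define it.

```javascript
// Delay before retry number `attempt` (1-based), in milliseconds.
// 'exponential' reproduces the 5s, 10s, 20s sequence from the config comment.
// The 'linear' formula is an assumption; the source does not define it.
function retryDelayMs(attempt, { retryDelay, retryBackoff, retryBackoffMultiplier = 2 }) {
  switch (retryBackoff) {
    case 'fixed':
      return retryDelay;
    case 'linear':
      return retryDelay * attempt;
    case 'exponential':
      return retryDelay * retryBackoffMultiplier ** (attempt - 1);
    default:
      throw new Error(`Unknown backoff: ${retryBackoff}`);
  }
}

const retryConfig = { retryDelay: 5000, retryBackoff: 'exponential', retryBackoffMultiplier: 2 };
const delays = [1, 2, 3].map((attempt) => retryDelayMs(attempt, retryConfig));
console.log(delays); // [ 5000, 10000, 20000 ]
```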
Idempotent steps
Design your steps to be idempotent (safe to retry). If a step partially completed before failing, retrying it should not cause duplicate work or data corruption. Use the checkpoint state to track what has already been processed.
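One way to apply this advice is to consult the checkpoint state before each step and skip anything already recorded there. The sketch below assumes the checkpoint state carries a `completedMigrations` array, as in the checkpoint example earlier. The `applyPending` runner and the file names are hypothetical, and the runner is synchronous only for brevity; a real apply step would likely be async.

```javascript
// Hypothetical runner: applies only the migrations not already recorded in
// the checkpoint state, so re-running after a failure does no duplicate work.
// Synchronous for brevity; a real `apply` would likely be async.
function applyPending(migrations, checkpointState, apply) {
  const completed = new Set(checkpointState.completedMigrations);
  for (const migration of migrations) {
    if (completed.has(migration)) continue; // already applied in an earlier run
    apply(migration);
    completed.add(migration);
  }
  return { completedMigrations: [...completed] };
}

const applied = [];
const allMigrations = ['001_users.sql', '002_posts.sql', '003_tags.sql'];

// The first run failed after step 1, so the checkpoint lists only the first file.
const afterRetry = applyPending(
  allMigrations,
  { completedMigrations: ['001_users.sql'] },
  (migration) => applied.push(migration),
);

console.log(applied); // [ '002_posts.sql', '003_tags.sql' ]
console.log(afterRetry.completedMigrations.length); // 3
```

Running `applyPending` again with the returned state applies nothing, which is the property that makes the retry safe.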
Execution Lifecycle
Lifecycle States
| State | Description | Transitions To |
|---|---|---|
| created | Execution created but not started | running, cancelled |
| running | Actively processing steps | paused, completed, failed |
| paused | Stopped, waiting for resume. Tagged with a pause reason: human_approval, human_input, external_dependency, scheduled_wait, rate_limit, or checkpoint. | running, cancelled |
| completed | All steps finished successfully | (terminal) |
| failed | Failed after all retries exhausted | running (via resume) |
| cancelled | Explicitly cancelled by user | (terminal) |
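The transition rules in the table can be encoded as a small lookup, which is handy for validating state changes in tests or client code. The mapping below is transcribed from the table; the `canTransition` and `isTerminal` helpers are hypothetical and not part of the SDK.

```javascript
// Allowed transitions, transcribed from the lifecycle table above.
const TRANSITIONS = {
  created: ['running', 'cancelled'],
  running: ['paused', 'completed', 'failed'],
  paused: ['running', 'cancelled'],
  completed: [],
  failed: ['running'],
  cancelled: [],
};

function canTransition(from, to) {
  if (!Object.prototype.hasOwnProperty.call(TRANSITIONS, from)) {
    throw new Error(`Unknown state: ${from}`);
  }
  return TRANSITIONS[from].includes(to);
}

// A state is terminal when it has no outgoing transitions.
const isTerminal = (state) => TRANSITIONS[state].length === 0;

console.log(canTransition('paused', 'running')); // true
console.log(canTransition('completed', 'running')); // false
console.log(isTerminal('cancelled')); // true
```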
    // Subscribe to execution state changes
    const unsubscribe = rekall.execution.onStateChange({
      executionId: execution.id,
      callback: (event) => {
        console.log(`[${event.timestamp}] ${event.previousStatus} -> ${event.status}`);

        if (event.status === 'paused') {
          console.log(`  Reason: ${event.reason}`);
          // Notify the user via Slack, email, etc.
          notifyUser(`Task paused: ${event.reason}`);
        }

        if (event.checkpoint) {
          console.log(`  Checkpoint: step ${event.checkpoint.step}`);
        }
      },
    });

    // Cancel an execution
    await rekall.execution.cancel({
      executionId: execution.id,
      reason: 'No longer needed',
      cleanup: true, // Run cleanup handlers
    });
Next Steps
- Human-in-the-Loop -- Combine execution state with human approval gates
- Workflow Automation -- Integrate execution memory with procedural workflows
- Execution Memory Concepts -- Deep dive into the execution memory model
- Execution API Reference -- Full endpoint documentation
