# Building Resilient n8n Workflows: Error Handling and Recovery Patterns
Production workflows fail. APIs time out, services go down, and data formats change unexpectedly. The difference between a hobbyist workflow and a production-grade automation is how it handles these failures.
This guide covers essential patterns for building resilient n8n workflows.
## Understanding n8n Error Handling
n8n provides several mechanisms for handling errors:
- Error Trigger Node - Catches workflow-level errors
- Try/Catch in Code Nodes - Handles errors within code (sketched after this list)
- Retry Mechanism - Automatic retries for failed nodes
- Error Workflows - Dedicated workflows for error processing
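Of these, try/catch inside a Code node is the easiest to adopt incrementally: the node catches the failure and emits a structured error item instead of aborting the execution. Below is a minimal sketch of that idea; the `payload` field name is an illustrative assumption, not part of any specific workflow.

```javascript
// Code node (Run Once for All Items): wrap risky parsing in try/catch
// so one bad item doesn't abort the whole execution.
// NOTE: the `payload` field name is an illustrative placeholder.
const items = $input.all();

return items.map(item => {
  try {
    const parsed = JSON.parse(item.json.payload); // may throw on malformed input
    return { json: { success: true, data: parsed } };
  } catch (error) {
    // Emit a structured error item a downstream IF/Switch node can route on
    return { json: { success: false, error: error.message, raw: item.json.payload } };
  }
});
```

A downstream IF or Switch node can then route on the `success` flag — the same routing idea Pattern 5 below uses for its dead letter queue.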
## Pattern 1: Global Error Handler
### Error Trigger Workflow
{
"name": "Global Error Handler",
"nodes": [
{
"name": "Error Trigger",
"type": "n8n-nodes-base.errorTrigger",
"position": [250, 300]
},
{
"name": "Parse Error",
"type": "n8n-nodes-base.code",
"parameters": {
"jsCode": "const error = $input.first().json;\n\n// Extract error details\nconst errorInfo = {\n workflow_name: error.workflow.name,\n workflow_id: error.workflow.id,\n execution_id: error.execution.id,\n error_message: error.execution.error?.message || 'Unknown error',\n error_stack: error.execution.error?.stack,\n failed_node: error.execution.lastNodeExecuted,\n timestamp: new Date().toISOString(),\n mode: error.execution.mode,\n retry_of: error.execution.retryOf\n};\n\n// Classify error severity\nconst criticalPatterns = ['database', 'authentication', 'rate limit', 'quota'];\nconst isCritical = criticalPatterns.some(p => \n errorInfo.error_message.toLowerCase().includes(p)\n);\n\nerrorInfo.severity = isCritical ? 'critical' : 'warning';\nerrorInfo.should_alert = isCritical || errorInfo.mode === 'production';\n\nreturn [{ json: errorInfo }];"
}
},
{
"name": "Route by Severity",
"type": "n8n-nodes-base.switch",
"parameters": {
"dataType": "string",
"value1": "={{ $json.severity }}",
"rules": {
"rules": [
{ "value2": "critical" },
{ "value2": "warning" }
]
}
}
},
{
"name": "Alert Critical",
"type": "n8n-nodes-base.slack",
"parameters": {
"channel": "#critical-alerts",
"text": "🚨 CRITICAL: {{ $json.workflow_name }} failed\n\nError: {{ $json.error_message }}\nNode: {{ $json.failed_node }}\nExecution: {{ $json.execution_id }}"
}
},
{
"name": "Log Warning",
"type": "n8n-nodes-base.postgres",
"parameters": {
"operation": "insert",
"table": "workflow_errors",
"columns": "workflow_id,workflow_name,error_message,failed_node,severity,timestamp"
}
},
{
"name": "Check Retry Eligibility",
"type": "n8n-nodes-base.code",
"parameters": {
"jsCode": "const error = $input.first().json;\n\n// Define retry-eligible errors\nconst retryablePatterns = [\n 'timeout',\n 'ECONNREFUSED',\n 'rate limit',\n '503',\n '502',\n '429',\n 'temporarily unavailable'\n];\n\nconst isRetryable = retryablePatterns.some(p =>\n error.error_message.toLowerCase().includes(p.toLowerCase())\n);\n\n// Check retry count (max 3)\nconst retryCount = error.retry_of ? 1 : 0; // Simplified\nconst shouldRetry = isRetryable && retryCount < 3;\n\nreturn [{ json: { ...error, should_retry: shouldRetry, retry_count: retryCount } }];"
}
},
{
"name": "Retry Decision",
"type": "n8n-nodes-base.if",
"parameters": {
"conditions": {
"boolean": [{ "value1": "={{ $json.should_retry }}", "value2": true }]
}
}
},
{
"name": "Schedule Retry",
"type": "n8n-nodes-base.code",
"parameters": {
"jsCode": "// Exponential backoff delay\nconst retryCount = $input.first().json.retry_count;\nconst delayMs = Math.min(1000 * Math.pow(2, retryCount), 30000);\n\n// In production, you'd trigger the workflow via API after delay\n// This is a placeholder for the retry logic\nreturn [{\n json: {\n action: 'retry_scheduled',\n workflow_id: $input.first().json.workflow_id,\n execution_id: $input.first().json.execution_id,\n delay_ms: delayMs,\n retry_attempt: retryCount + 1\n }\n}];"
}
}
]
}
## Pattern 2: Node-Level Retry Configuration
### Configuring Retries
{
"name": "API Call with Retry",
"type": "n8n-nodes-base.httpRequest",
"parameters": {
"url": "https://api.example.com/data",
"method": "GET"
},
"retryOnFail": true,
"maxTries": 3,
"waitBetweenTries": 1000
}
### Smart Retry Logic in Code Nodes
// Code node: HTTP request with intelligent retry
const makeRequestWithRetry = async (url, options, maxRetries = 3) => {
const delays = [1000, 2000, 5000]; // Exponential backoff
for (let attempt = 0; attempt <= maxRetries; attempt++) {
try {
const response = await $http.request({
url,
...options,
timeout: 30000
});
return {
success: true,
data: response,
attempts: attempt + 1
};
} catch (error) {
const isRetryable = isRetryableError(error);
const isLastAttempt = attempt === maxRetries;
if (!isRetryable || isLastAttempt) {
return {
success: false,
error: error.message,
attempts: attempt + 1,
retryable: isRetryable
};
}
// Wait before retry
await sleep(delays[attempt] || delays[delays.length - 1]);
}
}
};
const isRetryableError = (error) => {
const retryableCodes = [408, 429, 500, 502, 503, 504];
const retryableMessages = ['ETIMEDOUT', 'ECONNRESET', 'ECONNREFUSED'];
if (error.response?.status && retryableCodes.includes(error.response.status)) {
return true;
}
return retryableMessages.some(msg => error.message.includes(msg));
};
const sleep = (ms) => new Promise(resolve => setTimeout(resolve, ms));
// Usage
const result = await makeRequestWithRetry(
'https://api.example.com/data',
{ method: 'GET', headers: { 'Authorization': 'Bearer token' } }
);
return [{ json: result }];
## Pattern 3: Fallback Strategies
### Primary/Fallback Service Pattern
// Code node: Service with fallback
const callWithFallback = async (primaryConfig, fallbackConfig) => {
// Try primary service
try {
const primaryResult = await $http.request({
url: primaryConfig.url,
method: primaryConfig.method,
headers: primaryConfig.headers,
timeout: 10000
});
return {
source: 'primary',
data: primaryResult,
fallback_used: false
};
} catch (primaryError) {
console.log(`Primary service failed: ${primaryError.message}`);
// Try fallback service
try {
const fallbackResult = await $http.request({
url: fallbackConfig.url,
method: fallbackConfig.method,
headers: fallbackConfig.headers,
timeout: 10000
});
return {
source: 'fallback',
data: fallbackResult,
fallback_used: true,
primary_error: primaryError.message
};
} catch (fallbackError) {
// Both services failed
throw new Error(`Both services failed. Primary: ${primaryError.message}, Fallback: ${fallbackError.message}`);
}
}
};
// Example usage
const result = await callWithFallback(
{
url: 'https://primary-api.example.com/data',
method: 'GET',
headers: { 'Authorization': 'Bearer primary-token' }
},
{
url: 'https://backup-api.example.com/data',
method: 'GET',
headers: { 'Authorization': 'Bearer backup-token' }
}
);
return [{ json: result }];
### Cached Data Fallback
// Code node: API call with cache fallback
const getDataWithCacheFallback = async (cacheKey, apiUrl) => {
const cache = await $getWorkflowStaticData('global');
try {
// Try fresh API call
const response = await $http.request({
url: apiUrl,
method: 'GET',
timeout: 5000
});
// Update cache on success
cache[cacheKey] = {
data: response,
timestamp: Date.now(),
ttl: 3600000 // 1 hour
};
return {
source: 'api',
data: response,
cached: false
};
} catch (apiError) {
// Check cache
const cached = cache[cacheKey];
if (cached && (Date.now() - cached.timestamp) < cached.ttl) {
return {
source: 'cache',
data: cached.data,
cached: true,
cache_age_ms: Date.now() - cached.timestamp,
api_error: apiError.message
};
}
// No valid cache, throw error
throw new Error(`API failed and no valid cache: ${apiError.message}`);
}
};
const result = await getDataWithCacheFallback(
'exchange-rates',
'https://api.exchangerate.host/latest'
);
return [{ json: result }];
## Pattern 4: Circuit Breaker
// Code node: Circuit breaker pattern
const circuitBreaker = {
state: 'CLOSED', // CLOSED, OPEN, HALF_OPEN
failureCount: 0,
successCount: 0,
lastFailureTime: null,
failureThreshold: 5,
resetTimeout: 30000, // 30 seconds
halfOpenSuccessThreshold: 2
};
const getCircuitState = async () => {
const staticData = await $getWorkflowStaticData('global');
return staticData.circuitBreaker || { ...circuitBreaker };
};
const saveCircuitState = async (state) => {
const staticData = await $getWorkflowStaticData('global');
staticData.circuitBreaker = state;
};
const callWithCircuitBreaker = async (requestFn) => {
const state = await getCircuitState();
// Check if circuit should be reset from OPEN to HALF_OPEN
if (state.state === 'OPEN') {
if (Date.now() - state.lastFailureTime > state.resetTimeout) {
state.state = 'HALF_OPEN';
state.successCount = 0;
} else {
// Circuit is open, fail fast
return {
success: false,
error: 'Circuit breaker is OPEN',
circuit_state: state.state,
retry_after_ms: state.resetTimeout - (Date.now() - state.lastFailureTime)
};
}
}
try {
const result = await requestFn();
// Success - update circuit state
if (state.state === 'HALF_OPEN') {
state.successCount++;
if (state.successCount >= state.halfOpenSuccessThreshold) {
state.state = 'CLOSED';
state.failureCount = 0;
}
} else {
state.failureCount = 0;
}
await saveCircuitState(state);
return {
success: true,
data: result,
circuit_state: state.state
};
} catch (error) {
// Failure - update circuit state
state.failureCount++;
state.lastFailureTime = Date.now();
if (state.state === 'HALF_OPEN') {
state.state = 'OPEN';
} else if (state.failureCount >= state.failureThreshold) {
state.state = 'OPEN';
}
await saveCircuitState(state);
return {
success: false,
error: error.message,
circuit_state: state.state,
failure_count: state.failureCount
};
}
};
// Usage
const result = await callWithCircuitBreaker(async () => {
return await $http.request({
url: 'https://api.example.com/data',
method: 'GET',
timeout: 5000
});
});
return [{ json: result }];
## Pattern 5: Dead Letter Queue
### Failed Item Processing
{
"name": "Process with DLQ",
"nodes": [
{
"name": "Get Items to Process",
"type": "n8n-nodes-base.postgres",
"parameters": {
"operation": "executeQuery",
"query": "SELECT * FROM items_queue WHERE status = 'pending' LIMIT 100"
}
},
{
"name": "Split Items",
"type": "n8n-nodes-base.splitInBatches",
"parameters": {
"batchSize": 1
}
},
{
"name": "Process Item",
"type": "n8n-nodes-base.code",
"parameters": {
"jsCode": "const item = $input.first().json;\n\ntry {\n // Process the item\n const result = await processItem(item);\n \n return [{\n json: {\n ...item,\n status: 'success',\n result: result\n }\n }];\n} catch (error) {\n return [{\n json: {\n ...item,\n status: 'failed',\n error: error.message,\n failed_at: new Date().toISOString(),\n retry_count: (item.retry_count || 0) + 1\n }\n }];\n}"
},
"continueOnFail": true
},
{
"name": "Route by Status",
"type": "n8n-nodes-base.switch",
"parameters": {
"dataType": "string",
"value1": "={{ $json.status }}",
"rules": {
"rules": [
{ "value2": "success" },
{ "value2": "failed" }
]
}
}
},
{
"name": "Mark Success",
"type": "n8n-nodes-base.postgres",
"parameters": {
"operation": "update",
"table": "items_queue",
"updateKey": "id",
"columns": "status,processed_at"
}
},
{
"name": "Check Retry Limit",
"type": "n8n-nodes-base.if",
"parameters": {
"conditions": {
"number": [
{
"value1": "={{ $json.retry_count }}",
"operation": "smaller",
"value2": 3
}
]
}
}
},
{
"name": "Schedule Retry",
"type": "n8n-nodes-base.postgres",
"parameters": {
"operation": "update",
"table": "items_queue",
"updateKey": "id",
"columns": "status,retry_count,next_retry_at",
"additionalFields": {
"status": "pending",
"next_retry_at": "={{ DateTime.now().plus({ minutes: $json.retry_count * 5 }).toISO() }}"
}
}
},
{
"name": "Move to DLQ",
"type": "n8n-nodes-base.postgres",
"parameters": {
"operation": "executeQuery",
"query": "INSERT INTO dead_letter_queue (original_id, data, error, retry_count, moved_at) VALUES ('{{ $json.id }}', '{{ JSON.stringify($json) }}', '{{ $json.error }}', {{ $json.retry_count }}, NOW()); UPDATE items_queue SET status = 'dead_lettered' WHERE id = '{{ $json.id }}';"
}
}
]
}
## Pattern 6: Graceful Degradation
// Code node: Graceful degradation with feature flags
const executeWithDegradation = async () => {
const features = {
enrichment: { enabled: true, timeout: 5000 },
notification: { enabled: true, timeout: 3000 },
analytics: { enabled: true, timeout: 2000 }
};
const input = $input.first().json;
const result = { base_data: input, enrichments: {} };
// Core processing (must succeed)
result.core = await processCoreLogic(input);
// Optional enrichment (graceful failure)
if (features.enrichment.enabled) {
try {
result.enrichments.extra_data = await Promise.race([
fetchEnrichmentData(input),
timeout(features.enrichment.timeout)
]);
} catch (error) {
result.enrichments.extra_data = null;
result.degraded = result.degraded || [];
result.degraded.push({ feature: 'enrichment', error: error.message });
}
}
// Optional notification (graceful failure)
if (features.notification.enabled) {
try {
await Promise.race([
sendNotification(result),
timeout(features.notification.timeout)
]);
result.notification_sent = true;
} catch (error) {
result.notification_sent = false;
result.degraded = result.degraded || [];
result.degraded.push({ feature: 'notification', error: error.message });
}
}
// Optional analytics (fire and forget)
if (features.analytics.enabled) {
trackEvent(result).catch(() => {}); // Ignore failures
}
return [{ json: result }];
};
const timeout = (ms) => new Promise((_, reject) =>
setTimeout(() => reject(new Error('Timeout')), ms)
);
return await executeWithDegradation();
## Pattern 7: Health Checks and Monitoring
### Workflow Health Check
{
"name": "System Health Check",
"nodes": [
{
"name": "Schedule",
"type": "n8n-nodes-base.scheduleTrigger",
"parameters": {
"rule": {
"interval": [{ "field": "minutes", "minutesInterval": 5 }]
}
}
},
{
"name": "Check All Services",
"type": "n8n-nodes-base.code",
"parameters": {
"jsCode": "const services = [\n { name: 'api', url: 'https://api.example.com/health', timeout: 5000 },\n { name: 'database', url: 'https://db.example.com/health', timeout: 3000 },\n { name: 'cache', url: 'https://cache.example.com/health', timeout: 2000 }\n];\n\nconst results = [];\n\nfor (const service of services) {\n try {\n const start = Date.now();\n const response = await $http.request({\n url: service.url,\n method: 'GET',\n timeout: service.timeout\n });\n const latency = Date.now() - start;\n \n results.push({\n service: service.name,\n status: 'healthy',\n latency_ms: latency,\n response_code: response.statusCode\n });\n } catch (error) {\n results.push({\n service: service.name,\n status: 'unhealthy',\n error: error.message\n });\n }\n}\n\nconst allHealthy = results.every(r => r.status === 'healthy');\n\nreturn [{\n json: {\n timestamp: new Date().toISOString(),\n overall_status: allHealthy ? 'healthy' : 'degraded',\n services: results\n }\n}];"
}
},
{
"name": "Check Status",
"type": "n8n-nodes-base.if",
"parameters": {
"conditions": {
"string": [
{
"value1": "={{ $json.overall_status }}",
"value2": "degraded"
}
]
}
}
},
{
"name": "Alert Unhealthy",
"type": "n8n-nodes-base.slack",
"parameters": {
"channel": "#monitoring",
"text": "⚠️ System Health Degraded\n\n{{ $json.services.filter(s => s.status === 'unhealthy').map(s => `${s.service}: ${s.error}`).join('\\n') }}"
}
},
{
"name": "Log Health",
"type": "n8n-nodes-base.postgres",
"parameters": {
"operation": "insert",
"table": "health_checks",
"columns": "timestamp,overall_status,services_json"
}
}
]
}
## Best Practices Summary
## Error Handling Checklist
### Design Phase
- [ ] Identify failure points in workflow
- [ ] Define retry strategies per node type
- [ ] Plan fallback options
- [ ] Design dead letter queue handling
### Implementation
- [ ] Configure global error handler
- [ ] Set appropriate timeouts
- [ ] Implement circuit breakers for external services
- [ ] Add graceful degradation for optional features
- [ ] Use continueOnFail where appropriate (settings sketched after this list)
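For the retry and continueOnFail items, the node-level properties shown earlier in this guide can be combined on a single node. A minimal sketch, reusing only `retryOnFail`, `maxTries`, `waitBetweenTries`, and `continueOnFail` as they appear above (the node name and URL are placeholders):

```json
{
  "name": "Call Flaky API",
  "type": "n8n-nodes-base.httpRequest",
  "parameters": {
    "url": "https://api.example.com/data",
    "method": "GET"
  },
  "retryOnFail": true,
  "maxTries": 3,
  "waitBetweenTries": 2000,
  "continueOnFail": true
}
```

With continueOnFail enabled, an exhausted retry produces an error item on the node's output instead of stopping the workflow, so a downstream Switch can route it to a dead letter queue as in Pattern 5.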
### Monitoring
- [ ] Log all errors with context
- [ ] Set up alerting for critical failures
- [ ] Implement health checks
- [ ] Track error rates and patterns (see the aggregation sketch after this list)
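For error-rate tracking, the rows written by the global error handler (the `workflow_errors` table in Pattern 1) can be aggregated into per-workflow counts that an IF node alerts on. A sketch, assuming a preceding Postgres node selected recent rows with `workflow_name` and `severity` columns:

```javascript
// Code node: aggregate logged errors into per-workflow counts.
// ASSUMPTION: a previous node (e.g. a Postgres SELECT on workflow_errors)
// returned one item per error row with workflow_name and severity fields.
const rows = $input.all().map(item => item.json);

const byWorkflow = {};
for (const row of rows) {
  const key = row.workflow_name || 'unknown';
  byWorkflow[key] = byWorkflow[key] || { total: 0, critical: 0 };
  byWorkflow[key].total++;
  if (row.severity === 'critical') byWorkflow[key].critical++;
}

// Emit one item per workflow so a downstream IF node can alert on thresholds
return Object.entries(byWorkflow).map(([workflow, counts]) => ({
  json: { workflow, ...counts }
}));
```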
### Recovery
- [ ] Document recovery procedures
- [ ] Implement automated retry logic
- [ ] Create DLQ processing workflows
- [ ] Plan for manual intervention scenarios
## Conclusion
Building resilient n8n workflows requires thinking about failure from the start. The patterns in this guide provide a toolkit for handling errors gracefully, from simple retries to sophisticated circuit breakers.
Key takeaways:
- Expect failure - Design for it from the start
- Fail fast - Use circuit breakers and timeouts
- Degrade gracefully - Keep core functionality working
- Monitor everything - You can't fix what you can't see
- Automate recovery - Reduce manual intervention needs
At DeviDevs, we help organizations build production-grade n8n automations with enterprise-level resilience. Contact us to discuss your automation needs.