Automation Error Handling: When Workflows Break

6 min read · AI & Automation

Your automation works perfectly in testing. Then it hits production and encounters a malformed email, an API timeout, or a field that is blank when it should not be. What happens next determines whether automation saves you time or creates emergencies.

Why Automations Fail: The Common Causes

Understanding failure modes helps you design for resilience before problems occur.

API changes and outages. The services your automation connects to update their APIs, change rate limits, or experience downtime. Stripe pushes a breaking change. HubSpot has a 20-minute outage. These events are not rare: even well-run SaaS APIs have unplanned outages, and they change often enough that your integration assumptions can expire without notice.

Unexpected data formats. Your automation was built assuming invoices arrive as PDFs with consistent layouts. Then a supplier sends a Word document. Or a CSV. Or a screenshot. The input is perfectly valid to a human, but the automation has no rule for it and fails silently.
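One way to avoid the silent part is to check the input format up front and raise a named error for anything unexpected. This is a minimal sketch, not a full validator: the supported-format set, the exception name, and the function name are all assumptions made for illustration.

```python
from pathlib import Path

# Hypothetical: the formats this automation actually has a parser for.
SUPPORTED_SUFFIXES = {".pdf"}


class UnsupportedFormatError(Exception):
    """Raised when an input arrives in a format the automation has no rule for."""


def check_invoice_format(filename: str) -> str:
    """Return the normalised file suffix, or raise instead of failing silently."""
    suffix = Path(filename).suffix.lower()
    if suffix not in SUPPORTED_SUFFIXES:
        raise UnsupportedFormatError(
            f"No rule for {suffix or 'no extension'!r}: {filename}"
        )
    return suffix
```

The caller can catch UnsupportedFormatError and send the file to a fallback path instead of dropping it.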

Authentication token expiry. OAuth tokens expire. API keys get rotated. When authentication silently fails, automations stop processing without obvious error messages. This is one of the most common causes of automation “going dark” for hours or days before anyone notices.

Rate limits. Send 500 API calls in a minute when the limit is 100 and your automation backs up, fails, or triggers abuse detection. Rate limits vary by plan tier, can change with little notice, and are rarely well documented for edge cases.
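A client-side throttle keeps the automation under the limit instead of discovering it through errors. The sketch below uses a sliding window. The class name and the injectable clock are illustrative assumptions, and the real limit has to come from your provider's documentation.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_calls within any window_seconds span."""

    def __init__(self, max_calls, window_seconds, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock      # injectable so the limiter can be tested
        self.calls = deque()    # timestamps of recent calls, oldest first

    def wait_time(self) -> float:
        """Seconds to wait before the next call is allowed (0.0 if allowed now)."""
        now = self.clock()
        # Discard timestamps that have left the window.
        while self.calls and now - self.calls[0] >= self.window_seconds:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            return 0.0
        return self.window_seconds - (now - self.calls[0])

    def record_call(self) -> None:
        """Note that a call was just made."""
        self.calls.append(self.clock())
```

Before each API call, sleep for wait_time() seconds, make the call, then call record_call().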

Cascading failures. Automation A writes to a database. Automation B reads from that database. If Automation A fails quietly, Automation B processes stale or absent data and produces wrong outputs. Nobody notices until a downstream consequence surfaces.
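A cheap guard is for the downstream automation to check when the upstream one last succeeded before trusting its data. A minimal sketch, assuming Automation A records a last-success timestamp somewhere Automation B can read it; the function and exception names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


class StaleUpstreamError(Exception):
    """Raised when upstream data is too old (or absent) to process safely."""


def assert_upstream_fresh(last_success, max_age, now=None):
    """Raise unless the upstream automation succeeded within max_age."""
    now = now or datetime.now(timezone.utc)
    if last_success is None:
        raise StaleUpstreamError("upstream has never completed successfully")
    age = now - last_success
    if age > max_age:
        raise StaleUpstreamError(f"upstream data is {age} old (limit {max_age})")
```

Automation B calls this first and stops loudly on StaleUpstreamError rather than producing wrong outputs from stale data.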

Retry Logic: When and How

Not all failures warrant immediate retrying. Some do. The distinction matters.

When retrying is safe: transient failures such as API timeouts, momentary network issues, and temporary rate limit hits. These typically resolve within seconds or minutes, so retry with a delay.

When retrying is dangerous: Operations that are not idempotent. If your automation sends an email or creates a record, retrying a failed action may create duplicates. A payment trigger is the obvious example: retrying a failed charge without checking whether the first attempt actually succeeded can double-charge your customer.
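One common way to make such an operation safe to retry is an idempotency key: derive a stable key from the inputs and skip the action if that key has already completed. The sketch below uses an in-memory dictionary as a stand-in for a durable store. The class and function names are hypothetical. It does not cover the case where the action succeeds remotely but the process crashes before recording it, which is why passing the key to the provider matters when the provider supports it.

```python
import hashlib


def idempotency_key(customer_id: str, invoice_id: str, amount_cents: int) -> str:
    """Build a stable key: the same logical charge always yields the same key."""
    raw = f"{customer_id}:{invoice_id}:{amount_cents}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class OperationLedger:
    """In-memory stand-in (assumption) for a durable record of completed operations."""

    def __init__(self):
        self._done = {}

    def run_once(self, key, action):
        """Run action only if this key has not already completed."""
        if key in self._done:
            return self._done[key]   # already done: return the earlier result
        result = action()
        self._done[key] = result
        return result
```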

Exponential backoff is the standard approach for safe retries: wait 1 second, then 2 seconds, then 4 seconds, then 8, up to a maximum. This prevents hammering a struggling API and avoids contributing to rate limit issues.

Maximum retry limits prevent infinite loops. Define a ceiling: if after 5 retries the action has not succeeded, declare failure and route to the fallback path.
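The backoff schedule and the retry ceiling can be sketched together. TransientError is a hypothetical marker for failures you have judged safe to retry, and the 30-second cap is an assumed value, not a recommendation.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def backoff_delays(max_retries=5, base=1.0, cap=30.0):
    """Delays before each retry: 1, 2, 4, 8, ... seconds, capped at `cap`."""
    return [min(cap, base * 2 ** i) for i in range(max_retries)]


def call_with_retry(action, max_retries=5, base=1.0, cap=30.0, sleep=time.sleep):
    """Try action once, then retry up to max_retries times with backoff."""
    for delay in backoff_delays(max_retries, base, cap):
        try:
            return action()
        except TransientError:
            sleep(delay)   # wait before the next attempt
    # Final attempt: if this also fails, the error propagates to the caller,
    # which should route the item to its fallback path.
    return action()
```

Only TransientError is retried here. Any other exception surfaces immediately, which is the behaviour you want for non-idempotent or permanent failures.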

Fallback Strategies

What happens when an automation cannot complete its task? You need a defined answer for every automation you run.

Human-in-the-loop queues are the most robust fallback for business-critical processes. When the automation cannot proceed, it drops the item into a review queue with relevant context: what it tried to do, what error occurred, what information it has gathered so far. A human reviews and either resolves manually or provides the missing input that allows the automation to proceed.
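The context a reviewer needs can be captured in a small record. The field names below are assumptions that mirror the three items above, and the plain list stands in for whatever queue or ticketing tool you actually use.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ReviewItem:
    automation: str          # which automation gave up
    attempted_action: str    # what it tried to do
    error: str               # what error occurred
    gathered: dict = field(default_factory=dict)   # information collected so far
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


# Stand-in (assumption) for a real review queue.
review_queue = []


def send_to_review(automation, attempted_action, error, gathered):
    """Drop a failed item into the review queue with its context attached."""
    item = ReviewItem(automation, attempted_action, str(error), dict(gathered))
    review_queue.append(item)
    return item
```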

Graceful degradation means the automation completes a partial version of its job rather than failing entirely. An invoice processing automation that cannot extract the line items can still extract vendor and total amount, flag the line items as requiring manual entry, and route the partial record to the accounts team with a clear note.
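The invoice example might look roughly like this. The dictionary keys and the extract_line_items callable are hypothetical. The broad except is deliberate for this sketch, since any extraction failure should degrade rather than abort.

```python
def process_invoice(raw: dict, extract_line_items) -> dict:
    """Build a record; degrade to a partial record if line items cannot be read."""
    record = {
        "vendor": raw["vendor"],
        "total": raw["total"],
        "line_items": None,
        "needs_manual_entry": [],
        "note": "",
    }
    try:
        record["line_items"] = extract_line_items(raw)
    except Exception as exc:   # any extraction failure: keep the partial record
        record["needs_manual_entry"].append("line_items")
        record["note"] = (
            f"Line items could not be extracted ({exc}); please enter manually."
        )
    return record
```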

Default values and notifications work for lower-stakes processes. If a field cannot be populated automatically, apply a sensible default and flag it for review rather than blocking the whole process.

Dead letter queues are a technical pattern where failed items are stored in a separate queue for later investigation rather than simply dropped. Nothing is lost. The operations team reviews the dead letter queue periodically and resolves items individually.
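In its simplest form the pattern is a loop that sets failed items aside instead of dropping them or stopping the batch. This sketch keeps the dead letters in a list. A real system would persist them, and the function name is an assumption.

```python
def process_batch(items, handler):
    """Process every item; park failures in a dead letter list with the error."""
    processed, dead_letters = [], []
    for item in items:
        try:
            processed.append(handler(item))
        except Exception as exc:   # keep the item and the reason it failed
            dead_letters.append({"item": item, "error": repr(exc)})
    return processed, dead_letters
```

One bad item no longer blocks the rest of the batch, and the stored error gives the operations team a starting point for each dead letter.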

Alerting That Does Not Cry Wolf

Bad alerting is almost as damaging as no alerting. If your team receives 50 alerts per day, they will stop reading them. The 51st alert about the failure that matters gets ignored along with the noise.

Categorise errors before alerting:

  • Critical: Automation has stopped completely, data is not being processed, business impact is immediate. Alert immediately, escalate to the right person.
  • Warning: Error rate is elevated but processing continues, or an unusual pattern has been detected. Alert during business hours.
  • Info: Individual failures that were handled by retry or fallback, normal-range exceptions. Log but do not alert. Include in daily digest if needed.
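The three levels above can be encoded so that every error is classified the same way. This is a sketch under simple assumptions: the 5% warning rate and the two input signals are illustrative, and a real classifier would likely consider more.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"


# What each level means operationally, following the list above.
ACTIONS = {
    Severity.CRITICAL: "alert immediately and escalate",
    Severity.WARNING: "alert during business hours",
    Severity.INFO: "log only; include in daily digest if needed",
}


def classify(stopped: bool, error_rate: float, warn_rate: float = 0.05) -> Severity:
    """Map the automation's current state to a severity level."""
    if stopped:
        return Severity.CRITICAL   # nothing is being processed
    if error_rate > warn_rate:
        return Severity.WARNING    # elevated errors, processing continues
    return Severity.INFO           # normal-range, already-handled failures
```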

Set thresholds, not triggers. Do not alert on every single failure. Alert when error rate exceeds 5% in a 15-minute window. Alert when an automation has not processed any items for 2 hours during business hours. Alert when a queue depth exceeds a defined threshold. Threshold-based alerting reflects actual operational impact rather than normal noise.
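The error-rate rule can be sketched as a rolling window over recent outcomes. The defaults mirror the 5% and 15-minute figures above. The class name is an assumption, and the sketch covers only the error-rate rule, not the idle-time or queue-depth checks.

```python
from collections import deque


class ErrorRateMonitor:
    """Alert when the failure rate in a rolling window exceeds a threshold."""

    def __init__(self, window_seconds=900, threshold=0.05):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.events = deque()   # (timestamp, ok) pairs, oldest first

    def record(self, timestamp: float, ok: bool) -> None:
        """Record the outcome of one processed item."""
        self.events.append((timestamp, ok))

    def should_alert(self, now: float) -> bool:
        """True if the failure rate within the window is above the threshold."""
        # Drop events that have fallen out of the window.
        while self.events and now - self.events[0][0] > self.window_seconds:
            self.events.popleft()
        if not self.events:
            return False
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / len(self.events) > self.threshold
```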

Route alerts to the right person. Not every automation error warrants waking someone up. Define: who owns this automation, what level of error requires them immediately, what can wait for the morning review.
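Ownership and urgency can live in a small routing table rather than in people's heads. Every name below (automations, teams, channels) is hypothetical, and the default route for unowned automations is an assumption.

```python
# Hypothetical ownership table: who owns each automation, and which
# severities justify interrupting them immediately.
OWNERS = {
    "invoice-intake": {"owner": "accounts-ops", "page_on": {"critical"}},
    "crm-sync": {"owner": "sales-ops", "page_on": {"critical", "warning"}},
}

# Fallback for automations nobody has claimed (assumption).
DEFAULT_ROUTE = {"owner": "unowned-automations", "page_on": {"critical"}}


def route_alert(automation: str, severity: str, owners=OWNERS):
    """Return (owner, channel): page now, or hold for the morning review."""
    entry = owners.get(automation, DEFAULT_ROUTE)
    channel = "page" if severity in entry["page_on"] else "morning-review"
    return entry["owner"], channel
```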

Building Error Handling From Day One

Every new automation should answer these questions before it goes live:

What are the realistic failure modes? For each integration point, for each data input, for each external dependency: what happens when it fails?

What is the blast radius? If this automation fails silently for 24 hours, what is the business impact? How many records are affected? What downstream processes depend on it?

Who gets notified? Name a specific person or team who owns each automation. Define their escalation path.

What is the manual fallback? If the automation is down, how does the work get done? Document this explicitly so it can be executed without the person who built the automation.

Is the operation idempotent? If the automation runs twice on the same input, does it produce a duplicate or does it recognise the duplication and handle it safely?

This checklist adds a few hours to every build. It prevents emergencies that cost days.

Our managed systems service covers ongoing error monitoring, alerting configuration, and incident response for automations we build and maintain. We build these resilience patterns in from the start, not as an afterthought.

Want automations built to withstand real-world conditions? Talk to our team or read our monitoring guide for what ongoing oversight looks like.