Daily Workflow Health Audit
Check scheduled jobs and workflows for repeated failures, stale runs, and silent drift.
Schedule
{
  "cron": "0 8 * * *",
  "timezone": "UTC",
  "agentRole": "lead",
  "enabled": true
}
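As a quick sanity check on the schedule, the cron expression `0 8 * * *` with timezone UTC fires once a day at 08:00 UTC. The Python sketch below computes the next fire time for that specific pattern only; `next_daily_fire` is a hypothetical helper, not part of the scheduler, and it does not parse general cron syntax.

```python
from datetime import datetime, timedelta, timezone


def next_daily_fire(now: datetime, hour: int = 8) -> datetime:
    """Next fire time (UTC) of a daily `0 <hour> * * *` schedule, strictly after `now`."""
    now = now.astimezone(timezone.utc)
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if candidate <= now:
        # Today's slot has already passed (or is exactly now), so use tomorrow's.
        candidate += timedelta(days=1)
    return candidate


if __name__ == "__main__":
    print(next_daily_fire(datetime.now(timezone.utc)).isoformat())
```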
Scheduled Task
This is the full task prompt the schedule runs on each fire. Adapt the channel IDs, mentions, app URLs, and escalation rules to your environment before enabling. As you learn from real incidents, expand this prompt with your own local failure modes and recovery notes.
Task Type: Daily Workflow + Schedule Health Audit
You are Lead. Run this audit and post a single Slack digest. Cadence: daily at 08:00 UTC. Purpose: surface any workflow run or scheduled-task fire from the last 24h that hard-failed or silently failed (completed but produced nothing useful) so the team catches broken cron/workflow plumbing before it ages out.
Phase 1 — Query the six failure modes
Use db-query for each.
1A. Hard-failed workflow runs (last 24h)
SELECT wr.id, w.name AS workflowName, wr.status,
wr.finishedAt, wr.lastUpdatedAt,
SUBSTR(COALESCE(wr.error, ''), 1, 220) AS errSnippet
FROM workflow_runs wr
JOIN workflows w ON w.id = wr.workflowId
WHERE wr.status = 'failed'
AND datetime(COALESCE(wr.finishedAt, wr.lastUpdatedAt, wr.startedAt)) > datetime('now', '-24 hours')
ORDER BY wr.lastUpdatedAt DESC;
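To try the 24h window logic before pointing it at real data, the query can be run against a throwaway in-memory SQLite database. The sketch below is illustrative only: the table definitions are an assumption that includes just the columns query 1A touches (with `finishedAt` omitted from the SELECT list for brevity), and the sample rows are made up.

```python
import sqlite3

# Query 1A, trimmed to the columns the demo schema defines.
QUERY_1A = """
SELECT wr.id, w.name AS workflowName, wr.status,
       SUBSTR(COALESCE(wr.error, ''), 1, 220) AS errSnippet
FROM workflow_runs wr
JOIN workflows w ON w.id = wr.workflowId
WHERE wr.status = 'failed'
  AND datetime(COALESCE(wr.finishedAt, wr.lastUpdatedAt, wr.startedAt)) > datetime('now', '-24 hours')
ORDER BY wr.lastUpdatedAt DESC;
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with an assumed, minimal schema and sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE workflows (id TEXT PRIMARY KEY, name TEXT);
        CREATE TABLE workflow_runs (
            id TEXT PRIMARY KEY, workflowId TEXT, status TEXT, error TEXT,
            startedAt TEXT, finishedAt TEXT, lastUpdatedAt TEXT
        );
        INSERT INTO workflows VALUES ('w1', 'nightly-sync');
        INSERT INTO workflow_runs VALUES
            -- failed 2 hours ago: inside the window
            ('r1', 'w1', 'failed', 'timeout',
             datetime('now', '-3 hours'), datetime('now', '-2 hours'), datetime('now', '-2 hours')),
            -- failed 49 hours ago: outside the window
            ('r2', 'w1', 'failed', 'old failure',
             datetime('now', '-50 hours'), datetime('now', '-49 hours'), datetime('now', '-49 hours')),
            -- recent but successful: wrong status
            ('r3', 'w1', 'completed', NULL,
             datetime('now', '-1 hours'), datetime('now', '-1 hours'), datetime('now', '-1 hours'));
    """)
    return conn


rows = build_demo_db().execute(QUERY_1A).fetchall()
for row in rows:
    print(row)
```

Only the run that failed inside the window should come back; the older failure and the completed run are filtered out.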
1B. Hard-failed schedule-spawned tasks (last 24h)
SELECT t.id, s.name AS scheduleName, t.status,
SUBSTR(COALESCE(t.failureReason, ''), 1, 220) AS reasonSnippet,
SUBSTR(COALESCE(t.output, ''), 1, 220) AS outSnippet,
t.lastUpdatedAt
FROM agent_tasks t
LEFT JOIN scheduled_tasks s ON s.id = t.scheduleId
WHERE t.status = 'failed'
AND t.scheduleId IS NOT NULL
AND datetime(t.lastUpdatedAt) > datetime('now', '-24 hours')
ORDER BY t.lastUpdatedAt DESC;
1C. Halted >24h workflow runs (silent stuck)
SELECT wr.id, w.name AS workflowName, wr.status, wr.lastUpdatedAt
FROM workflow_runs wr
JOIN workflows w ON w.id = wr.workflowId
WHERE wr.status IN ('running', 'waiting')
AND datetime(wr.lastUpdatedAt) < datetime('now', '-24 hours')
ORDER BY wr.lastUpdatedAt ASC;
1D. Silent: schedule-spawned task completed with empty/sentinel output
SELECT t.id, s.name AS scheduleName, t.status,
SUBSTR(COALESCE(t.output, ''), 1, 220) AS outSnippet,
LENGTH(TRIM(COALESCE(t.output, ''))) AS outLen,
t.lastUpdatedAt
FROM agent_tasks t
LEFT JOIN scheduled_tasks s ON s.id = t.scheduleId
WHERE t.status = 'completed'
AND t.scheduleId IS NOT NULL
AND datetime(t.lastUpdatedAt) > datetime('now', '-24 hours')
AND (
t.output IS NULL
OR TRIM(t.output) = ''
OR TRIM(t.output) = '⚡ Running shell command'
OR LENGTH(TRIM(t.output)) < 10
)
ORDER BY t.lastUpdatedAt DESC;
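The "silent" predicate in 1D can also be expressed outside SQL, which is handy for unit-testing new sentinel strings before adding them to the query. This is a sketch under assumptions: `is_silent_output`, `SENTINEL_OUTPUTS`, and `MIN_USEFUL_LENGTH` are hypothetical names, and the function mirrors only the four conditions listed in the query.

```python
from typing import Optional

# Outputs that mean "the task started but never reported anything useful".
SENTINEL_OUTPUTS = {"⚡ Running shell command"}
MIN_USEFUL_LENGTH = 10


def is_silent_output(output: Optional[str]) -> bool:
    """Mirror of the 1D WHERE clause: NULL, empty, sentinel, or shorter than 10 chars."""
    if output is None:
        return True
    # SQLite's one-argument TRIM() strips spaces only, so strip(" ") matches it.
    trimmed = output.strip(" ")
    if trimmed == "" or trimmed in SENTINEL_OUTPUTS:
        return True
    return len(trimmed) < MIN_USEFUL_LENGTH


if __name__ == "__main__":
    for sample in (None, "   ", "ok", "⚡ Running shell command", "Synced 42 records"):
        print(repr(sample), is_silent_output(sample))
```

The sentinel check is not redundant with the length check: the sentinel string is longer than 10 characters, so it would otherwise pass as real output.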
1E. Cron didn't fire (nextRunAt more than 1 hour in the past)
SELECT s.id, s.name, s.cronExpression, s.lastRunAt, s.nextRunAt, s.consecutiveErrors,
SUBSTR(COALESCE(s.lastErrorMessage, ''), 1, 220) AS lastErrSnippet
FROM scheduled_tasks s
WHERE s.enabled = 1
AND s.scheduleType = 'recurring'
AND s.nextRunAt IS NOT NULL
AND datetime(s.nextRunAt) < datetime('now', '-1 hour')
ORDER BY s.nextRunAt ASC;
1F. Schedules with consecutive errors (defensive)
SELECT s.id, s.name, s.cronExpression, s.consecutiveErrors, s.lastErrorAt,
SUBSTR(COALESCE(s.lastErrorMessage, ''), 1, 220) AS lastErrSnippet
FROM scheduled_tasks s
WHERE s.enabled = 1
AND s.consecutiveErrors >= 3
ORDER BY s.consecutiveErrors DESC;
1G. Totals (for the "all clear" denominator)
SELECT
(SELECT COUNT(*) FROM workflow_runs WHERE datetime(lastUpdatedAt) > datetime('now','-24 hours')) AS workflowRuns24h,
(SELECT COUNT(*) FROM agent_tasks WHERE scheduleId IS NOT NULL AND datetime(lastUpdatedAt) > datetime('now','-24 hours')) AS scheduledFires24h;
Phase 2 — Render the digest
Each bullet must include a clickable URL.
- Workflow run URL: `https://app.agent-swarm.dev/workflow-runs/<id>` → Slack format: `<https://app.agent-swarm.dev/workflow-runs/<id>|workflow:<workflowName>>`
- Task URL: `https://app.agent-swarm.dev/tasks/<id>` → Slack format: `<https://app.agent-swarm.dev/tasks/<id>|schedule:<scheduleName>>`
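The two link formats above can be produced by one small helper. This is a minimal sketch: `slack_link` and `APP_BASE_URL` are hypothetical names, and escaping `&`, `<`, and `>` in the label follows Slack's general rule for message text rather than anything this template specifies.

```python
APP_BASE_URL = "https://app.agent-swarm.dev"  # adapt to your environment

# URL path segment for each kind of item in the digest.
PATHS = {"workflow": "workflow-runs", "schedule": "tasks"}


def slack_link(kind: str, item_id: str, name: str) -> str:
    """Build `<url|kind:name>` for a workflow run or a schedule-spawned task."""
    if kind not in PATHS:
        raise ValueError(f"unknown kind: {kind!r}")
    url = f"{APP_BASE_URL}/{PATHS[kind]}/{item_id}"
    # Escape the three characters Slack treats as control characters in text.
    label = name.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
    return f"<{url}|{kind}:{label}>"


if __name__ == "__main__":
    print(slack_link("workflow", "abc123", "nightly-sync"))
    print(slack_link("schedule", "t-42", "daily-report"))
```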
Truncate error/output snippets to 200 chars + … if longer. Replace newlines with ⏎.
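The truncation rule is easy to get subtly wrong (for example, counting the ellipsis toward the limit), so here is a minimal sketch of one reading of it. `render_snippet` is a hypothetical helper; it replaces newlines first and then cuts at 200 characters, appending `…` only when something was actually dropped.

```python
from typing import Optional


def render_snippet(text: Optional[str], limit: int = 200) -> str:
    """Flatten newlines to ⏎ and truncate to `limit` chars, adding … if cut."""
    # Normalize Windows line endings so each line break becomes a single ⏎.
    flat = (text or "").replace("\r\n", "\n").replace("\n", "⏎")
    if len(flat) > limit:
        return flat[:limit] + "…"
    return flat


if __name__ == "__main__":
    print(render_snippet("Traceback (most recent call last):\n  File ..."))
    print(len(render_snippet("x" * 220)))
```

Because the queries return up to 220 characters per snippet, a result longer than 200 characters tells the renderer that the ellipsis is needed.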
Template
If TOTAL issues across 1A–1F is zero:
:white_check_mark: *Daily Workflow + Schedule Health Audit* — <YYYY-MM-DD>
<OWNER_OR_TEAM_MENTION> All clear — <workflowRuns24h> workflow runs + <scheduledFires24h> scheduled fires in the last 24h, all produced expected output.
Otherwise:
:stethoscope: *Daily Workflow + Schedule Health Audit* — <YYYY-MM-DD>
<OWNER_OR_TEAM_MENTION> Audit window: last 24h. Totals: <workflowRuns24h> workflow runs · <scheduledFires24h> scheduled fires · *<TOTAL_ISSUES> issues*
*Hard failures — workflow runs* (<N1A>)
• <url|workflow:name> — failed <relative-time>
↳ <errSnippet>
*Hard failures — scheduled tasks* (<N1B>)
• <url|schedule:name> — failed <relative-time>
↳ <reasonSnippet OR outSnippet OR "(no failureReason set)">
*Silent: halted >24h* (<N1C>)
• <url|workflow:name> — status=<status>, no progress since <timestamp>
*Silent: empty output* (<N1D>)
• <url|schedule:name> — completed, output=<"empty" | first-N-chars>
*Cron didn't fire on time* (<N1E>)
• schedule:<name> (cron `<expr>`) — nextRunAt=<past-timestamp>, lastRunAt=<timestamp or "never">
*Schedules with ≥3 consecutive errors* (<N1F>)
• schedule:<name> — consecutiveErrors=<n>, last error: <lastErrSnippet>
Omit any section whose count is 0. Cap the message at 4000 chars (Slack's recommended maximum for message text). If it is longer, keep the top 5 per section and add "…and <K> more" lines.
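The omit-and-cap rules can be sketched as a two-pass render: build the full digest body first, and fall back to the top 5 bullets per section only if it exceeds the cap. `render_sections` is a hypothetical helper, the overflow marker is read here as "…and <K> more", and the header lines of the template are left out for brevity.

```python
from typing import List, Optional, Tuple

SLACK_CHAR_CAP = 4000
TOP_N = 5


def render_sections(
    sections: List[Tuple[str, List[str]]],
    cap: int = SLACK_CHAR_CAP,
    top_n: int = TOP_N,
) -> str:
    """Render (title, bullets) pairs, skipping empty sections and trimming if too long."""

    def build(limit: Optional[int]) -> str:
        parts = []
        for title, bullets in sections:
            if not bullets:
                continue  # omit any section whose count is 0
            shown = bullets if limit is None else bullets[:limit]
            lines = [f"*{title}* ({len(bullets)})"] + shown
            hidden = len(bullets) - len(shown)
            if hidden > 0:
                lines.append(f"…and {hidden} more")
            parts.append("\n".join(lines))
        return "\n".join(parts)

    full = build(None)
    # The trimmed version can still exceed the cap if individual bullets are very long.
    return full if len(full) <= cap else build(top_n)


if __name__ == "__main__":
    demo = [
        ("Hard failures — workflow runs", ["• run-a — failed 2h ago"]),
        ("Silent: empty output", []),
    ]
    print(render_sections(demo))
```

The section header keeps the true count even when bullets are trimmed, so the reader still sees how many items each failure mode produced.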
Phase 3 — Post to Slack and complete
- Call `slack-post` with your configured channel ID and `message=<rendered digest>`. Prefer a top-level daily fire unless your team's convention is to thread recurring audit messages.
- Call `store-progress` with `status: "completed"` and a one-paragraph `output` summary:
  Issues found: hard-fail-wf=<N1A>, hard-fail-task=<N1B>, halted-24h=<N1C>, silent-empty=<N1D>, cron-stuck=<N1E>, consec-err=<N1F>.
  Totals: workflowRuns24h=<X>, scheduledFires24h=<Y>.
  Slack message ts: <ts from slack-post response>.
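For reference, the summary paragraph passed to `store-progress` can be assembled from the six counts, the two totals, and the Slack timestamp. The sketch below only builds the string; `progress_summary` is a hypothetical helper, and the calls to `slack-post` and `store-progress` themselves are not shown because their exact interfaces depend on your environment.

```python
from typing import Dict

# Summary labels in the order the template lists them, keyed by query id.
LABELS = [
    ("1A", "hard-fail-wf"),
    ("1B", "hard-fail-task"),
    ("1C", "halted-24h"),
    ("1D", "silent-empty"),
    ("1E", "cron-stuck"),
    ("1F", "consec-err"),
]


def progress_summary(
    counts: Dict[str, int],
    workflow_runs_24h: int,
    scheduled_fires_24h: int,
    slack_ts: str,
) -> str:
    """Build the one-paragraph `output` summary; missing counts default to 0."""
    issues = ", ".join(f"{label}={counts.get(key, 0)}" for key, label in LABELS)
    return (
        f"Issues found: {issues}. "
        f"Totals: workflowRuns24h={workflow_runs_24h}, "
        f"scheduledFires24h={scheduled_fires_24h}. "
        f"Slack message ts: {slack_ts}."
    )


if __name__ == "__main__":
    # The ts value here is a made-up placeholder, not a real Slack response.
    print(progress_summary({"1A": 2, "1D": 1}, 14, 37, "1700000000.000100"))
```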
Anti-patterns
- ❌ Posting a separate Slack message per failure mode — ONE digest.
- ❌ Raw IDs without clickable URLs.
- ❌ Dumping full `error`/`output` content: truncate every item (the queries cap snippets at 220 chars; the digest shows at most 200 chars + …).
- ❌ Threading the daily digest somewhere your team will not scan.
- ❌ Skipping the "all clear" message when zero issues — the heartbeat itself is the signal that the audit ran.