Incidents
An incident is the story of an outage. Produl opens one automatically when a service fails, or you can post one by hand for things the prober can't see.
Anatomy of an incident
Every incident has five fields:
- Title — short human summary, e.g. “API returning 502 in eu-west”.
- Status — one of investigating, identified, monitoring, resolved.
- Impact — one of none, minor, major, critical.
- Service (optional) — which monitored service this affects. Leave blank for page-wide events (e.g. office internet, DNS provider).
- Timeline — ordered list of updates, each with its own status and message. This is what visitors actually read.
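As a rough illustration, the five fields could be modelled as below. This is a minimal sketch, not Produl's actual schema: the class names, the `post_update` helper, and the `posted_at` attribute are assumptions made for the example; only the field names and the allowed status and impact values come from the list above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional


class Status(str, Enum):
    INVESTIGATING = "investigating"
    IDENTIFIED = "identified"
    MONITORING = "monitoring"
    RESOLVED = "resolved"


class Impact(str, Enum):
    NONE = "none"
    MINOR = "minor"
    MAJOR = "major"
    CRITICAL = "critical"


@dataclass
class TimelineEntry:
    status: Status
    message: str
    # Hypothetical attribute: when the update was posted (UTC).
    posted_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class Incident:
    title: str
    status: Status
    impact: Impact
    service: Optional[str] = None  # None stands for a page-wide event
    timeline: List[TimelineEntry] = field(default_factory=list)

    def post_update(self, message: str, status: Optional[Status] = None) -> None:
        """Append a timeline entry; if a status is given, the incident adopts it."""
        if status is not None:
            self.status = status
        self.timeline.append(TimelineEntry(status=self.status, message=message))


incident = Incident(
    title="API returning 502 in eu-west",
    status=Status.INVESTIGATING,
    impact=Impact.MAJOR,
)
incident.post_update("We're investigating elevated 502 responses in eu-west.")
incident.post_update("Cause identified; rolling out a fix.", Status.IDENTIFIED)
```

Each timeline entry carries its own status, so the history of the investigation stays readable even after the incident's current status has moved on.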
Automatic incidents
Produl opens an incident automatically when a monitored service fails two consecutive checks, which rules out a one-off flake. The auto-incident:
- Is titled “<service name> is down”.
- Starts with status investigating and impact major.
- Is tagged AUTO in the management UI so you can distinguish it from manual ones.
- Posts an opening timeline entry: “Automated health checks report <service> is failing. Investigation started.”
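To make the defaults concrete, here is a small sketch of what an auto-incident might look like as data. The function name `build_auto_incident` and the dictionary keys (including the `auto` flag standing in for the AUTO tag) are hypothetical; the title, status, impact, and opening message follow the list above.

```python
def build_auto_incident(service_name: str) -> dict:
    """Assemble the default fields of an auto-incident for one service."""
    return {
        "title": f"{service_name} is down",
        "status": "investigating",
        "impact": "major",
        "auto": True,  # hypothetical flag representing the AUTO tag
        "service": service_name,
        "timeline": [
            {
                "status": "investigating",
                "message": (
                    f"Automated health checks report {service_name} is failing. "
                    "Investigation started."
                ),
            }
        ],
    }


auto_incident = build_auto_incident("Checkout API")
```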
Failure threshold
Two consecutive failures, not one. This is deliberate: a single failed check is usually a transient blip (DNS hiccup, TCP reset, the probe hit mid-deploy). Two in a row indicates a real problem. At 5-minute intervals, that's a detection window of up to ~10 minutes, which trades some alert latency for a much lower false-positive rate.
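The detection window can be worked out with a few lines of arithmetic. The sketch below assumes the outage is total and persistent and that each check completes instantly; the function name is illustrative. If the outage begins just before a check, the threshold is reached after (threshold - 1) intervals; if it begins just after a successful check, it takes a full threshold × interval.

```python
from typing import Tuple


def detection_window(interval_min: int, threshold: int) -> Tuple[int, int]:
    """Return (best_case, worst_case) minutes from outage start to detection."""
    best_case = (threshold - 1) * interval_min  # outage starts right before a check
    worst_case = threshold * interval_min       # outage starts right after a check
    return best_case, worst_case


two_checks = detection_window(5, 2)    # (5, 10)
three_checks = detection_window(5, 3)  # (10, 15)
```

With the 5-minute interval used above, a threshold of two detects an outage within 5 to 10 minutes, and each extra required failure adds one more interval.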
Why two, not three
Recovery threshold
Two consecutive successful checks auto-resolve an auto-incident. The resolution posts a closing timeline entry (“<service> is back online”) and flips the public page's banner back to “All systems operational”, assuming no other services are still down.
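The failure and recovery thresholds together behave like a small state machine. The following is a sketch of that logic under the rules described above, not Produl's implementation; the class name, attribute names, and the returned event strings are assumptions.

```python
from dataclasses import dataclass

FAILURE_THRESHOLD = 2   # consecutive failed checks that open an auto-incident
RECOVERY_THRESHOLD = 2  # consecutive successful checks that resolve it


@dataclass
class ServiceMonitor:
    consecutive_failures: int = 0
    consecutive_successes: int = 0
    incident_open: bool = False

    def record_check(self, ok: bool) -> str:
        """Feed one check result; return "open", "resolve", or "none"."""
        if ok:
            self.consecutive_successes += 1
            self.consecutive_failures = 0
            if self.incident_open and self.consecutive_successes >= RECOVERY_THRESHOLD:
                self.incident_open = False
                return "resolve"
        else:
            self.consecutive_failures += 1
            self.consecutive_successes = 0
            if not self.incident_open and self.consecutive_failures >= FAILURE_THRESHOLD:
                self.incident_open = True
                return "open"
        return "none"


monitor = ServiceMonitor()
results = [True, False, True, False, False, True, False, True, True]
events = [monitor.record_check(ok) for ok in results]
```

In this run the isolated failure in second position opens nothing, the two failures in a row open the incident, and the single success that follows does not resolve it; only the final pair of successes does.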
Manual incidents don't auto-resolve
Manual incidents
Not every outage is observable by HTTP probe. Manual incidents cover:
- Planned maintenance — announcing a scheduled window before you start.
- Third-party outages — Stripe/Slack/GitHub being down, degrading your service.
- Partial degradation — features that respond slowly but still return 200.
- Regional issues — an AZ outage your probe missed because the probe hit a different region.
To post a manual incident:
1. Open the manage page. From the sidebar, click Status → your page.
2. Click Post incident. It's at the top-right of the Incidents section.
3. Write a clear title. Prefer “API returning 502 in eu-west” over “API down”. Specificity signals that you're actually on it.
4. Choose service (optional) and impact. “Minor” is a good default for slow-but-working; reserve “Critical” for “the product is inaccessible”.
5. Write the opening update. One or two sentences. What's the user-visible symptom? What are you doing about it?
6. Click Post incident. The incident appears immediately on the public page, and the banner updates if needed.
Status & impact taxonomy
Status describes where you are in the investigation. Impact describes how badly users are affected. They're independent — a critical-impact incident can be in any status.
| Status | Meaning | Typical duration |
|---|---|---|
| investigating | We see the problem but don't know the cause yet. | 5–30 min |
| identified | Cause is known, fix is being prepared/rolled out. | 10–60 min |
| monitoring | Fix is deployed, we're watching metrics to confirm. | 10–20 min |
| resolved | Incident is over. Public page returns to green. | — |

| Impact | Banner colour | When to use |
|---|---|---|
| none | Green | Informational only — e.g. a maintenance announcement before anything breaks. |
| minor | Yellow | Slight slowdown or a rarely-used feature is broken. Most users notice nothing. |
| major | Orange | A core feature is broken for a significant subset of users. Default for auto-incidents. |
| critical | Red | The product is effectively unusable for most users. |
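One way to picture how impact drives the banner: take every unresolved incident and show the colour of the most severe one. That "worst impact wins" rule is an assumption made for this sketch, as are the function and key names; the colour mapping itself comes from the table above.

```python
from typing import Dict, List

# Severity order, least to most severe (from the impact table).
IMPACT_ORDER = ["none", "minor", "major", "critical"]
BANNER_COLOUR = {
    "none": "green",
    "minor": "yellow",
    "major": "orange",
    "critical": "red",
}


def page_banner(incidents: List[Dict[str, str]]) -> str:
    """Return the banner colour for a page, given its incidents."""
    active = [i for i in incidents if i["status"] != "resolved"]
    if not active:
        return "green"  # nothing open: the page is back to green
    # Assumption: the most severe active incident decides the colour.
    worst = max(active, key=lambda i: IMPACT_ORDER.index(i["impact"]))
    return BANNER_COLOUR[worst["impact"]]


incidents = [
    {"status": "resolved", "impact": "critical"},
    {"status": "monitoring", "impact": "minor"},
    {"status": "investigating", "impact": "major"},
]
banner = page_banner(incidents)
```

Because status and impact are independent, the resolved critical incident has no effect here; the banner follows the active major incident.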
Timeline updates
Every status change you make posts a new timeline entry. You can also post a plain text update without changing status — useful when you have progress to share but nothing's changed at the investigation level (“Identified the cause — a long-running migration blocking writes. Working on a rollback.”).
The public page shows timeline entries newest-first inside each incident card, with a timestamp on each line. Keep each update under ~200 characters so it stays readable.
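A short sketch of the ordering and the length guideline, assuming each entry is a dictionary with hypothetical `posted_at` and `message` keys. The 200-character figure is treated as a soft limit to warn on, not something the product is known to enforce.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List

SOFT_LIMIT_CHARS = 200  # guideline from the text, not an enforced maximum


def too_long(message: str) -> bool:
    """True when an update exceeds the readability guideline."""
    return len(message) > SOFT_LIMIT_CHARS


def newest_first(entries: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return timeline entries ordered as the public page shows them."""
    return sorted(entries, key=lambda e: e["posted_at"], reverse=True)


entries = [
    {
        "posted_at": datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
        "message": "Investigating slow dashboard loads.",
    },
    {
        "posted_at": datetime(2024, 1, 1, 12, 20, tzinfo=timezone.utc),
        "message": "Rollback in progress.",
    },
]
ordered = newest_first(entries)
```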
Incident-writing playbook
A status page you never update is worse than no status page. Here's the minimum cadence that actually reassures users:
- Open fast. Post the incident within 60 seconds of realising something's wrong, even if you have nothing else to say.
- First update: ≤ 5 minutes. “We're investigating reports of X. We'll update again within 15 minutes.”
- Regular cadence. Update every 15–30 minutes during an active incident, even if the message is “still investigating”. Silence feels like abandonment.
- Name the symptom, not the tech. “Dashboard loads slowly” beats “Redis latency elevated”.
- Resolve clearly. Last timeline entry should say what was wrong, what you fixed, and whether there's any remediation users need to do on their end.
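The cadence above can be turned into a simple reminder check. This is a sketch under the playbook's numbers (first update within 5 minutes, then at most 30 minutes between updates); the function names and constants are illustrative and not part of Produl.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

FIRST_UPDATE_WITHIN = timedelta(minutes=5)
MAX_UPDATE_GAP = timedelta(minutes=30)  # upper end of the 15-30 minute cadence


def next_update_due(opened_at: datetime, last_update_at: Optional[datetime] = None) -> datetime:
    """Return the latest time by which the next update should be posted."""
    if last_update_at is None:
        return opened_at + FIRST_UPDATE_WITHIN
    return last_update_at + MAX_UPDATE_GAP


def is_overdue(now: datetime, opened_at: datetime, last_update_at: Optional[datetime] = None) -> bool:
    """True when the incident has gone quiet for longer than the cadence allows."""
    return now > next_update_due(opened_at, last_update_at)


opened = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
first_update = opened + timedelta(minutes=4)
quiet_too_long = is_overdue(opened + timedelta(minutes=40), opened, first_update)
```

Here the first update went out at 12:04, so the next one is due by 12:34; checking at 12:40 flags the incident as overdue.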
Post-mortems