Incidents

An incident is the story of an outage. Produl opens one automatically when a service fails, or you can post one by hand for things the prober can't see.

Anatomy of an incident

Every incident has five fields:

  • Title — short human summary, e.g. “API returning 502 in eu-west”.
  • Status — one of investigating, identified, monitoring, resolved.
  • Impact — one of none, minor, major, critical.
  • Service (optional) — which monitored service this affects. Leave blank for page-wide events (e.g. office internet, DNS provider).
  • Timeline — ordered list of updates, each with its own status and message. This is what visitors actually read.
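
As a rough illustration, the five fields above could be modelled like this. This is a minimal sketch, not Produl's actual schema: the class names, the `posted_at` field and the use of `None` for a blank service are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Status(Enum):
    INVESTIGATING = "investigating"
    IDENTIFIED = "identified"
    MONITORING = "monitoring"
    RESOLVED = "resolved"


class Impact(Enum):
    NONE = "none"
    MINOR = "minor"
    MAJOR = "major"
    CRITICAL = "critical"


@dataclass
class TimelineEntry:
    # Each update carries its own status and message.
    status: Status
    message: str
    # Hypothetical field: when the update was posted (UTC).
    posted_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class Incident:
    title: str
    status: Status
    impact: Impact
    # Assumption: None stands for "left blank", i.e. a page-wide event.
    service: Optional[str] = None
    timeline: list[TimelineEntry] = field(default_factory=list)


incident = Incident(
    title="API returning 502 in eu-west",
    status=Status.INVESTIGATING,
    impact=Impact.MAJOR,
    service="api",
)
incident.timeline.append(
    TimelineEntry(Status.INVESTIGATING, "We're investigating elevated 502s.")
)
```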

Automatic incidents

Produl opens an incident automatically when a monitored service fails two consecutive checks, which is enough to rule out a flake. The auto-incident:

  • Is titled “<service name> is down”.
  • Starts with status investigating and impact major.
  • Is tagged AUTO in the management UI so you can distinguish it from manual ones.
  • Posts an opening timeline entry: “Automated health checks report <service> is failing. Investigation started.”
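
The defaults listed above can be summarised in a small sketch. The function name `build_auto_incident`, the dict layout and the `auto` flag are hypothetical, chosen only to make the four properties concrete; the title and message wording come from the list.

```python
def build_auto_incident(service_name: str) -> dict:
    """Return the fields an auto-incident starts with, per the list above."""
    return {
        "title": f"{service_name} is down",
        "status": "investigating",
        "impact": "major",
        "auto": True,  # hypothetical flag standing in for the AUTO tag
        "service": service_name,
        "timeline": [
            {
                "status": "investigating",
                "message": (
                    f"Automated health checks report {service_name} is failing. "
                    "Investigation started."
                ),
            }
        ],
    }
```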

Failure threshold

Two consecutive failures, not one. This is deliberate: a single failed check is usually a transient blip (a DNS hiccup, a TCP reset, a probe that landed mid-deploy). Two in a row indicates a real problem. At 5-minute intervals, detection takes up to about 10 minutes from the start of an outage, which trades some alert latency for a far lower false-positive rate.

Why two, not three

Three would push detection latency out to 15 minutes for 5-minute intervals. In practice two is the sweet spot — it catches every real incident I've personally debugged and ignores essentially all network-layer flakes.
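
The 10- and 15-minute figures follow from simple arithmetic, sketched below. Both function names are hypothetical, and the latency figure assumes the worst case, where the outage begins right after a passing check.

```python
from typing import Optional


def worst_case_detection_minutes(threshold: int, interval_minutes: int) -> int:
    # Worst case: the outage starts just after a passing check, so
    # `threshold` full intervals pass before the last required failure.
    return threshold * interval_minutes


def first_trigger_index(results: list[bool], threshold: int = 2) -> Optional[int]:
    """Index of the check that completes `threshold` consecutive failures.

    `results` holds one bool per check (True = passed). Returns None if the
    threshold is never reached.
    """
    streak = 0
    for i, ok in enumerate(results):
        streak = 0 if ok else streak + 1
        if streak >= threshold:
            return i
    return None
```

With a threshold of two, an isolated failure between two passing checks resets the streak and never opens an incident, which is the flake-filtering behaviour described above.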

Recovery threshold

Two consecutive successful checks auto-resolve an auto-incident. The resolution posts a closing timeline entry (“<service> is back online”) and flips the public page's banner back to “All systems operational”, assuming no other services are still down.

Manual incidents don't auto-resolve

If you post an incident yourself, Produl will never close it automatically — even if the associated service is healthy. You have to mark it resolved when you're ready. This prevents “we rebooted and it looked healthy for 10 seconds” from prematurely closing an incident while your team is still investigating.
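
The open and resolve rules can be sketched as a small streak counter. This is an illustration under assumptions, not Produl's implementation: `ServiceMonitor`, `record` and `can_auto_resolve` are hypothetical names. The monitor only tracks the auto-incident it opened itself, so a manual incident is never touched by it.

```python
from typing import Optional

FAILURES_TO_OPEN = 2
SUCCESSES_TO_RESOLVE = 2


def can_auto_resolve(incident_is_auto: bool, ok_streak: int) -> bool:
    """Manual incidents never auto-resolve, however healthy the service looks."""
    return incident_is_auto and ok_streak >= SUCCESSES_TO_RESOLVE


class ServiceMonitor:
    """Tracks check streaks for one service and its auto-incident."""

    def __init__(self) -> None:
        self.fail_streak = 0
        self.ok_streak = 0
        self.auto_incident_open = False

    def record(self, ok: bool) -> Optional[str]:
        """Feed one check result; return "open", "resolve", or None."""
        if ok:
            self.ok_streak += 1
            self.fail_streak = 0
        else:
            self.fail_streak += 1
            self.ok_streak = 0

        if not self.auto_incident_open and self.fail_streak >= FAILURES_TO_OPEN:
            self.auto_incident_open = True
            return "open"
        if self.auto_incident_open and self.ok_streak >= SUCCESSES_TO_RESOLVE:
            self.auto_incident_open = False
            return "resolve"
        return None
```

Feeding the monitor one pass, three failures and then three passes opens the incident on the second failure and resolves it on the second pass after recovery.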

Manual incidents

Not every outage is observable by HTTP probe. Manual incidents cover:

  • Planned maintenance — announcing a scheduled window before you start.
  • Third-party outages — Stripe/Slack/GitHub being down, degrading your service.
  • Partial degradation — features that respond slowly but still return 200.
  • Regional issues — an AZ outage your probe missed because the probe hit a different region.

To post a manual incident:

  1. Open the manage page. From the sidebar, click Status → your page.
  2. Click Post incident. It's at the top-right of the Incidents section.
  3. Write a clear title. Prefer “API returning 502 in eu-west” over “API down”. Specificity signals that you're actually on it.
  4. Choose service (optional) and impact. “Minor” is a good default for slow-but-working; reserve “Critical” for “the product is inaccessible”.
  5. Write the opening update. One or two sentences. What's the user-visible symptom? What are you doing about it?
  6. Click Post incident. The incident appears immediately on the public page, and the banner updates if needed.

Status & impact taxonomy

Status describes where you are in the investigation. Impact describes how badly users are affected. They're independent — a critical-impact incident can be in any status.

Status         Meaning                                              Typical duration
investigating  We see the problem but don't know the cause yet.     5–30 min
identified     Cause is known, fix is being prepared/rolled out.    10–60 min
monitoring     Fix is deployed, we're watching metrics to confirm.  10–20 min
resolved       Incident is over. Public page returns to green.

Impact    Banner colour  When to use
none      Green          Informational only, e.g. a maintenance announcement before anything breaks.
minor     Yellow         Slight slowdown or a rarely-used feature is broken. Most users notice nothing.
major     Orange         A core feature is broken for a significant subset of users. Default for auto-incidents.
critical  Red            The product is effectively unusable for most users.
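
The impact-to-colour mapping is easy to express in code. The sketch below also has to decide what happens when several incidents are open at once; the rule used here (the worst unresolved impact wins) is an assumption for illustration, not documented Produl behaviour, and `banner_colour` is a hypothetical name.

```python
# Colours per impact level, as listed above.
BANNER_COLOUR = {
    "none": "green",
    "minor": "yellow",
    "major": "orange",
    "critical": "red",
}
# Impact levels ordered from least to most severe.
SEVERITY = ["none", "minor", "major", "critical"]


def banner_colour(incidents: list[dict]) -> str:
    """Pick a banner colour from the unresolved incidents.

    Assumption: the most severe open impact decides the colour, and a page
    with no open incidents is green.
    """
    open_impacts = [i["impact"] for i in incidents if i["status"] != "resolved"]
    if not open_impacts:
        return "green"
    worst = max(open_impacts, key=SEVERITY.index)
    return BANNER_COLOUR[worst]
```

Because status and impact are independent, a critical incident in the monitoring status still yields a red banner in this sketch; only the resolved status takes it out of consideration.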

Timeline updates

Every status change you make posts a new timeline entry. You can also post a plain text update without changing status — useful when you have progress to share but nothing's changed at the investigation level (“Identified the cause — a long-running migration blocking writes. Working on a rollback.”).

The public page shows timeline entries newest-first inside each incident card, with a timestamp on each line. Keep each update under ~200 chars and it'll stay readable.
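
A minimal sketch of the newest-first ordering and the length guideline follows. The names `render_timeline`, `is_readable_length` and `MAX_UPDATE_CHARS`, the entry dict layout and the `HH:MM` timestamp format are all assumptions made for the example.

```python
from datetime import datetime, timezone

MAX_UPDATE_CHARS = 200  # soft guideline from the text, not a hard limit


def render_timeline(entries: list[dict]) -> list[str]:
    """Return one display line per entry, newest first, each with a timestamp."""
    ordered = sorted(entries, key=lambda e: e["posted_at"], reverse=True)
    return [
        f'{e["posted_at"]:%H:%M} [{e["status"]}] {e["message"]}' for e in ordered
    ]


def is_readable_length(message: str) -> bool:
    """True if the update stays within the suggested length."""
    return len(message) <= MAX_UPDATE_CHARS


entries = [
    {
        "posted_at": datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc),
        "status": "investigating",
        "message": "We're investigating elevated 502s.",
    },
    {
        "posted_at": datetime(2024, 1, 1, 10, 15, tzinfo=timezone.utc),
        "status": "identified",
        "message": "A long-running migration is blocking writes.",
    },
]
lines = render_timeline(entries)
```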

Incident-writing playbook

A status page you never update is worse than no status page. Here's the minimum cadence that actually reassures users:

  1. Open fast. Post the incident within 60 seconds of realising something's wrong, even if you have nothing else to say.
  2. First update: ≤ 5 minutes. “We're investigating reports of X. We'll update again within 15 minutes.”
  3. Regular cadence. Update every 15–30 minutes during an active incident, even if the message is “still investigating”. Silence feels like abandonment.
  4. Name the symptom, not the tech. “Dashboard loads slowly” beats “Redis latency elevated”.
  5. Resolve clearly. Last timeline entry should say what was wrong, what you fixed, and whether there's any remediation users need to do on their end.
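
The cadence rule in step 3 lends itself to a simple reminder check. The sketch below is hypothetical tooling rather than a Produl feature: `update_overdue` and `MAX_SILENCE` are made-up names, and the 30-minute cutoff is the upper end of the 15–30 minute range given above.

```python
from datetime import datetime, timedelta, timezone

# Upper end of the suggested 15-30 minute update cadence.
MAX_SILENCE = timedelta(minutes=30)


def update_overdue(last_update_at: datetime, now: datetime, status: str) -> bool:
    """True if an active incident has gone longer than MAX_SILENCE without an update."""
    if status == "resolved":
        return False  # resolved incidents need no further updates
    return now - last_update_at > MAX_SILENCE
```

A team could run a check like this on a timer and nudge whoever is on call when it returns True.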

Post-mortems

After a real incident, link the post-mortem in your final timeline update once it's written. “Full root-cause analysis here: [link]” — it's the closest thing to “I'm sorry” that goes over well with technical users.