Telemetry.run is an observability platform for backend engineers. I redesigned the incident-triage flow so on-call engineers could go from pager to root cause in 71% less time, measured through synthetic incident drills and live MTTI.
Incident triage is a hostile design context. The user is half awake, often on a phone, scared they are going to make it worse, and looking at a tool they only open when something is broken. Familiarity has decayed since their last incident.
Telemetry.run had powerful primitives: fast queries and deep traces. But the incident flow asked the on-call engineer to be a power user at the moment they were least equipped to be one.
Users open the product when something is broken. There is no warm-up, no learning curve, no second chance. The first ten seconds matter more than the next ten minutes.
Power users want raw query languages and shortcuts. Sleep-deprived users want a guided path. The product had to be both.
Half of all pages are acknowledged from a phone. The triage flow had to work on a six-inch screen, in dark mode, with one thumb.
Telemetry's incident flow dropped the on-call engineer onto a dashboard. Nine widgets, no obvious next step. The implicit message was "browse around, you will find it". A fine message for a workday, a hostile one at 3am.
Engineers compensated by writing runbooks, but the runbooks lived in Notion, the alerts in PagerDuty, and the dashboards in Telemetry. Incident response became a tab-switching exercise in which each tab assumed you remembered the last one.
The incident landing page was a dashboard. Useful for monitoring, useless for triage. No clear "what now".
Linking from PagerDuty into Telemetry did not carry the alert context. Engineers retyped queries from screenshots.
Default views assumed the user knew their service topology. New on-call rotators had no on-ramp.
"I open Telemetry and immediately Cmd-K my way to the runbook in another tab. The product is the runbook."
I owned the incident-triage workstream end to end, paired with one designer focused on dashboards. We worked closely with the on-call rotation team (the backend engineers who carried the pager) and ran biweekly synthetic incident drills with them.
My remit covered the full incident loop: alert handoff, triage landing, root-cause exploration, and post-incident artefacts.
Three changes turned the incident landing from a dashboard into a guided path, without removing anything power users relied on.
The triage landing replaces the dashboard for any user arriving from a page. It opens with the alert's context loaded, the relevant service in focus, and three suggested next moves: see the trace, see recent deploys, or see comparable past incidents.
It is deliberately opinionated. Every suggestion is a one-tap detour with a "back to triage" gesture. Power users can drop to the raw query in two keystrokes. Tired users can follow the path.
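One way to picture the context handoff is a deep link that carries the alert's scope into the triage landing, so nobody retypes a query from a screenshot. The sketch below is illustrative only: the base URL, the `AlertContext` type, and the parameter names are assumptions, not Telemetry.run's actual link format.

```python
from dataclasses import dataclass
from urllib.parse import urlencode, urlparse, parse_qs

# Hypothetical base URL; the real deep-link format is not described here.
TRIAGE_URL = "https://telemetry.example/triage"


@dataclass(frozen=True)
class AlertContext:
    """Alert scope to carry from the pager into the triage landing (assumed fields)."""
    service: str
    window_start: str  # e.g. "02:47"
    window_end: str    # e.g. "02:49"
    region: str


def build_triage_link(ctx: AlertContext) -> str:
    """Encode the alert context as query parameters on the triage URL."""
    params = {
        "service": ctx.service,
        "from": ctx.window_start,
        "to": ctx.window_end,
        "region": ctx.region,
    }
    return f"{TRIAGE_URL}?{urlencode(params)}"


def parse_triage_link(url: str) -> AlertContext:
    """Recover the alert context on the landing side."""
    query = parse_qs(urlparse(url).query)
    return AlertContext(
        service=query["service"][0],
        window_start=query["from"][0],
        window_end=query["to"][0],
        region=query["region"][0],
    )


ctx = AlertContext("checkout-api", "02:47", "02:49", "us-east-1")
link = build_triage_link(ctx)
restored = parse_triage_link(link)
```

The round trip matters more than the encoding: whatever the alert source knows about scope should arrive intact, so the landing can open with the right service and time window already in focus.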
The triage view suggests next steps based on heuristics: recent deploys, similar past incidents, and error spike sources. It never auto-jumps. The engineer chooses; the system narrows the option space.
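The "narrow, never choose" rule can be sketched as a ranking step that returns a short list and nothing else. This is a minimal illustration under stated assumptions: the `Suggestion` type, the relevance scores, and the `suggest_next_moves` function are hypothetical, and the real heuristics are not specified in this case study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    label: str
    score: float  # hypothetical relevance score, higher means more relevant


def suggest_next_moves(signals: dict[str, float], limit: int = 3) -> list[Suggestion]:
    """Rank candidate next steps and return the top few.

    The function only returns options. It never navigates on the
    engineer's behalf, which mirrors the "never auto-jumps" rule.
    """
    # Drop signals with no evidence behind them.
    candidates = [Suggestion(label, score) for label, score in signals.items() if score > 0]
    # Highest score first, then cap the list so the option space stays small.
    ranked = sorted(candidates, key=lambda s: s.score, reverse=True)
    return ranked[:limit]


# Example scores are made up for illustration.
signals = {
    "recent deploys": 0.8,
    "similar past incidents": 0.5,
    "error spike sources": 0.6,
    "unrelated signal": 0.0,
}
moves = suggest_next_moves(signals)
labels = [m.label for m in moves]
```

Capping the list at three keeps the decision tractable on a phone, while leaving the choice itself with the engineer.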
Every suggested action opens a focused view (trace, deploy log, etc.) with a persistent "back to triage" gesture. Engineers stop losing their place when they explore.
I designed the triage flow mobile-first and dark-mode-first. Half of all pages are acknowledged from a phone, so designing for the desk first would have meant designing for the wrong moment.
service=checkout-api  window=02:47–02:49  region=us-east-1

| Before | → | After |
|---|---|---|
| Dashboard as incident landing | → | Triage landing with guided path |
| Alerts drop user without context | → | Context handoff carries the query |
| Mobile usable but unloved | → | Mobile-first triage flow |
| Seven tab switches per incident | → | Two tab switches per incident |
| Runbooks lived in Notion | → | Runbook patterns embedded in product |
| Power-user defaults everywhere | → | Guided path with power-user shortcuts |
Biweekly drills with real on-call engineers were the highest-signal research method I have used. Cheaper than ride-alongs, more realistic than usability tests.
Designing for the phone forced ruthless prioritisation that improved the desktop UX too. The constraint was a gift.
In a domain where being wrong is expensive, narrowing the option space without choosing for the user was the right line. It is now a team principle.
The context handoff needed cooperation from the alert source. Earlier conversations would have unblocked deeper integration sooner.
Time to first insight was easy to game. A sharper, externally verifiable metric would have made the result harder to dispute.
I undersold the redesign to power users initially. They perceived it as training wheels. Better launch comms would have closed that gap.
"The product's job is to make the worst moment of your week 71% shorter. Everything else is decoration."