Dev tools ✦ −71% MTTI ✦ 5 months
(01) CASE STUDY
⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯
TELEMETRY.RUN · 2022
PRODUCT / UX DESIGN

Designing for
a 3am pager

Telemetry.run is an observability platform for backend engineers. I redesigned the incident-triage flow so on-call engineers could go from pager to root cause in 71% less time, measured through synthetic incident drills and live MTTI.

−71%
Time to first insight in
the incident triage flow
ROLE Product designer
SCOPE End to end
INDUSTRY Developer tools, observability
YEAR 2022
TEAM Two designers, eight engineers
app.telemetry.run/incidents/inc-4821/triage · after
incidents / INC-4821 · production
SEV-2 · checkout-api · us-east-1 · acknowledged
Elevated 5xx on checkout-api
PAGED 02:47 · OWNER payments-oncall · IMPACT ~3.1% of checkouts
Start triage → · Drop to query ⌘K
Error rate · checkout-api · +1,840% vs 1h
Chart axis: 02:30 · 02:47 (deploy #2231) · now
5xx rate 6.2% · p99 latency 1.84s · throughput 412 rps · saturation 61%
Suggested next moves
Recent deploy correlates with the spike · deploy #2231 shipped 02:47, 14s before first error · inspect
Trace the failing requests · 38 traces sampled, DB pool exhaustion suspected · open
3 similar past incidents · INC-4102 resolved by rollback, 8 min MTTR · compare
Signals
5xx error ratio: 6.2% ↑
DB connection pool: 98 / 100
upstream payments-gw: healthy
CDN / edge: healthy
Incident timeline
02:47 deploy #2231 to checkout-api
02:47 first 5xx observed
02:49 alert fired · SEV-2
02:51 paged payments-oncall
02:52 acknowledged from mobile
(02) Context

Designing for the
worst moment of
someone's week.

Incident triage is a hostile design context. The user is half awake, often on a phone, scared they are going to make it worse, and looking at a tool they only open when something is broken. Familiarity has decayed since their last incident.

Telemetry.run had powerful primitives: fast queries, deep traces. But the incident flow asked the on-call engineer to be a power user at the moment they were least equipped to be one.

Company snapshot
STAGE Series A dev tool
USERS Backend engineers, SREs
PRIMITIVES Logs, traces, metrics
PRIMARY UX Query and dashboard
INCIDENT VOLUME Variable, bursty
(01)
Cold-start design

Users open the product when something is broken. There is no warm-up, no learning curve, no second chance. The first ten seconds matter more than the next ten minutes.

(02)
Power and accessibility

Power users want raw query languages and shortcuts. Sleep-deprived users want a guided path. The product had to be both.

(03)
Phone-first triage

Half of all pages are acknowledged from a phone. The triage flow had to work on a six-inch screen, in dark mode, with one thumb.

(03) The problem

Dashboards are
made for browsing.
Incidents need a path.

Telemetry's incident flow dropped the on-call engineer onto a dashboard. Fourteen widgets, no obvious next step. The implicit message was "browse around, you will find it". A fine message for a workday, a hostile one at 3am.

Engineers compensated by writing runbooks, but the runbooks lived in Notion, the alerts in PagerDuty, the dashboards in Telemetry. The incident response was a tab-switching exercise where each tab assumed you remembered the last one.

7 tabs
Average context switches
per incident, before the redesign
Failure mode 01
Dashboard as default

The incident landing page was a dashboard. Useful for monitoring, useless for triage. No clear "what now".

Failure mode 02
Lost context across tabs

Linking from PagerDuty into Telemetry did not carry the alert context. Engineers retyped queries from screenshots.

Failure mode 03
Power-user defaults

Default views assumed the user knew their service topology. New on-call rotators had no on-ramp.

"I open Telemetry and immediately Cmd-K my way to the runbook in another tab. The product is the runbook."

On-call SRE, research interview
app.telemetry.run/dashboards/prod-overview · before
dashboards / Production overview
14 widgets · last 1h · auto-refresh 30s
Widgets shown: request volume (rps) · error rate (%) · p99 latency (ms) 1.84s · cpu (cluster) · memory (cluster) 71% · db pool (conns) 98 · queue depth (msgs)
FIG. 01 — The original incident landing: a monitoring dashboard. Fourteen equal-weight widgets, no recommended action, no “what now”.
(04) Role

Product designer.
Five months.

I owned the incident-triage workstream end to end, paired with one designer focused on dashboards. We worked closely with the on-call rotation team (the backend engineers who carried the pager) and ran biweekly synthetic incident drills with them.

My remit covered the full incident loop. Alert handoff, triage landing, root-cause exploration, and post-incident artefacts.

Responsibilities
RESEARCH Pager observation, drill protocol
DESIGN Triage flow, mobile, dark UX
VALIDATION Synthetic drills, MTTI tracking
GTM Docs, office hours, field rollout
M1 On-call shadowing
M1 Drill protocol
M2 Triage flow design
M3 Mobile + dark mode
M4 Drills + iteration
M5 Rollout + MTTI tracking
(05) Initiatives

Three bets.
One direction.

Three changes turned the incident landing from a dashboard into a guided path, without removing anything power users relied on.

(01)
Triage landing
A single-page guided path replacing the default dashboard for paged engineers.
→ 71% faster MTTI
(02)
Context handoff
Alerts carry their full query context into Telemetry. No retyping.
→ 7 to 2 tab switches
(03)
Mobile triage
A first-class mobile flow for the half of pages acknowledged from a phone.
→ Phone triage viable
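One way the context handoff could work is a deep link that carries the alert's context as URL parameters, so the triage landing opens preloaded. The sketch below is illustrative only: the `AlertContext` fields, the parameter names, and the query syntax are assumptions, not the shipped integration. Only the URL shape `/incidents/<id>/triage` comes from the mockup.

```python
from dataclasses import dataclass
from urllib.parse import parse_qs, urlencode, urlsplit


@dataclass(frozen=True)
class AlertContext:
    """Hypothetical bundle of what the alert already knows."""
    incident_id: str
    service: str
    region: str
    query: str     # the query behind the alert (syntax is an assumption)
    paged_at: str  # time of the page, kept as a plain string here


def build_triage_link(ctx: AlertContext, base: str = "https://app.telemetry.run") -> str:
    """Encode the alert context into a link the pager can open."""
    params = urlencode({
        "service": ctx.service,
        "region": ctx.region,
        "q": ctx.query,
        "paged_at": ctx.paged_at,
    })
    return f"{base}/incidents/{ctx.incident_id}/triage?{params}"


def parse_triage_link(url: str) -> AlertContext:
    """Recover the context on arrival, so nothing has to be retyped."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    if len(segments) != 3 or segments[0] != "incidents" or segments[2] != "triage":
        raise ValueError(f"not a triage link: {url}")
    qs = parse_qs(parts.query)  # raises KeyError below if a field is missing
    return AlertContext(
        incident_id=segments[1],
        service=qs["service"][0],
        region=qs["region"][0],
        query=qs["q"][0],
        paged_at=qs["paged_at"][0],
    )
```

The point of the round trip is that the link alone is enough to rebuild the view: the engineer taps it on a phone and the same service, region, and query are already in focus.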
Deep dive
Triage landing

From dashboard
to decision.

The triage landing replaces the dashboard for any user arriving from a page. It opens with the alert's context loaded, the relevant service in focus, and three suggested next moves. See the trace, see recent deploys, see comparable past incidents.

It is deliberately opinionated. Every suggestion is a one-tap detour with a "back to triage" gesture. Power users can drop to the raw query in two keystrokes. Tired users can follow the path.

The research
6 PAGER OBSERVATIONS Real incidents, with consent
12 SYNTHETIC DRILLS Biweekly, controlled
30 POST-INCIDENT REVIEWS Over three months
MTTI TRACKING Pre and post comparison
01
Suggest, don't solve

The triage view suggests next steps based on heuristics: recent deploys, similar past incidents, error spike sources. It never auto-jumps. The engineer chooses; the system narrows the option space.

Tradeoff. Some users wanted auto-resolve. We declined: incidents are one of a kind, and the cost of being wrong is high. Heuristics inform, they do not decide.
02
One-tap detours, one-tap return

Every suggested action opens a focused view (trace, deploy log, etc.) with a persistent "back to triage" gesture. Engineers stop losing their place when they explore.

Outcome. Tab switching dropped from seven to two in synthetic drills. The product became the runbook.
03
Mobile as a first-class surface

I designed the triage flow mobile-first and dark-mode-first. Half of pages are acknowledged from a phone. Designing for the desk first would have meant designing for the wrong moment.

Tradeoff. Some power features, like multi-pane query, were desktop only. We accepted the gap and called it out in docs.
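The "suggest, don't solve" stance from principle 01 can be sketched as a pure function: it scores heuristics and returns a short ranked list, and it has no way to navigate or act on the engineer's behalf. This is a minimal sketch under stated assumptions. The signal fields, the fixed scores, and the ten-minute deploy window are hypothetical, not the product's actual heuristics.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical: deploys within this window before the first error count as "recent".
DEPLOY_WINDOW_SECONDS = 600


@dataclass(frozen=True)
class IncidentSignals:
    """Hypothetical inputs the triage view already has at page time."""
    seconds_from_deploy_to_first_error: Optional[float]
    sampled_failing_traces: int
    similar_past_incidents: int


@dataclass(frozen=True)
class Suggestion:
    title: str
    detail: str
    action: str   # label of the one-tap detour
    score: float  # used only for ordering


def suggest_next_moves(signals: IncidentSignals, limit: int = 3) -> List[Suggestion]:
    """Return ranked suggestions. Nothing here navigates or resolves anything."""
    candidates: List[Suggestion] = []

    gap = signals.seconds_from_deploy_to_first_error
    if gap is not None and 0 <= gap <= DEPLOY_WINDOW_SECONDS:
        # The closer the deploy is to the first error, the higher it ranks.
        candidates.append(Suggestion(
            "Recent deploy correlates with the spike",
            f"{gap:.0f}s before first error",
            "inspect",
            1.0 - gap / DEPLOY_WINDOW_SECONDS,
        ))
    if signals.sampled_failing_traces > 0:
        candidates.append(Suggestion(
            "Trace the failing requests",
            f"{signals.sampled_failing_traces} traces sampled",
            "open",
            0.6,  # assumed fixed weight
        ))
    if signals.similar_past_incidents > 0:
        candidates.append(Suggestion(
            f"{signals.similar_past_incidents} similar past incidents",
            "compare with how they were resolved",
            "compare",
            0.4,  # assumed fixed weight
        ))

    return sorted(candidates, key=lambda s: s.score, reverse=True)[:limit]
```

Called with the values from the INC-4821 mockup, `suggest_next_moves(IncidentSignals(14, 38, 3))` returns the deploy, trace, and past-incident suggestions in that order. Choosing one remains the engineer's tap.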
Before · monitoring dashboard as incident landing (app.telemetry.run/dashboards/prod-overview) · "where do I even start?" · 14 widgets, no recommended action · 7 tabs / incident
After · guided triage landing, context preloaded (app.telemetry.run/incidents/inc-4821/triage) · one path, three suggested moves: inspect the recent deploy, trace the failing requests, compare similar past incidents · 2 tabs / incident
(07) Impact

The work,
measured.

−71%
Time to first insight
Triage landing
SYNTHETIC DRILL
PRE AND POST AVG
7→2
Tab switches per incident
CONTEXT HANDOFF
AND ONE-TAP DETOURS
+22 NPS
On-call engineer NPS
POST ROLLOUT
THREE-MONTH SURVEY
50%
Pages acknowledged on mobile
NOW SUPPORTED
FIRST CLASS
12
Synthetic drills run
BIWEEKLY
WITH ROTATORS
NOTE. Some metrics are directional, based on the data available at the time. Where possible, figures reflect controlled comparisons; otherwise they are stakeholder-validated estimates.
(08) Outcomes

What changed.
What it earned.

Before → After
Dashboard as incident landing → Triage landing with guided path
Alerts drop user without context → Context handoff carries the query
Mobile usable but unloved → Mobile-first triage flow
Seven tab switches per incident → Two tab switches per incident
Runbooks lived in Notion → Runbook patterns embedded in product
Power-user defaults everywhere → Guided path with power-user shortcuts
−71%
Time to first insight in the incident triage flow
(09) Reflection

What I would do
differently.

What worked
Synthetic drills as a research mode

Biweekly drills with real on-call engineers were the highest-signal research method I have used. Cheaper than ride-alongs, more realistic than usability tests.

Mobile first as a design constraint

Designing for the phone forced ruthless prioritisation that improved the desktop UX too. The constraint was a gift.

Suggest, don't solve as a stance

In a domain where being wrong is expensive, narrowing the option space without choosing for the user was the right line. It is now a team principle.

What I would do differently
Earlier alignment with PagerDuty

The context handoff needed cooperation from the alert source. Earlier conversations would have unblocked deeper integration sooner.

Tighter MTTI definition

Time to first insight was easy to game. A sharper, externally verifiable metric would have made the result harder to dispute.

Power-user docs in parallel

I undersold the redesign to power users initially. They perceived it as training wheels. Better launch comms would have closed that gap.

"The product's job is to make the worst moment of your week 71% shorter. Everything else is decoration."

Artur Lopez Zarytskyi, Telemetry.run, 2022