Dev tools ✦ −71% MTTI ✦ 5 months

(01) CASE STUDY
⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯
TELEMETRY.RUN · 2022
PRODUCT / UX DESIGN

Designing for
a 3am pager

Telemetry.run is an observability platform for backend engineers. I redesigned the incident-triage flow so on-call engineers could go from pager to root cause in 71% less time, measured through synthetic incident drills and live MTTI.

−71%

Time to first insight in
the incident triage flow

ROLE Product designer
SCOPE End to end
INDUSTRY Developer tools, observability
YEAR 2022
TEAM Two designers, eight engineers

app.telemetry.run/incidents/inc-4821/triage after

SEV-2 checkout-api us-east-1 acknowledged

Elevated 5xx on checkout-api

PAGED 02:47 · OWNER payments-oncall · IMPACT ~3.1% of checkouts

Start triage → Drop to query ⌘K

Error rate · checkout-api +1,840% vs 1h

02:30 · 02:47 (deploy #2231) · now

5xx rate

6.2%

p99 latency

1.84s

throughput

412 rps

saturation

61%

Suggested next moves

⤴ Recent deploy correlates with the spike · deploy #2231 shipped 02:47 · 14s before first error → inspect

≡ Trace the failing requests · 38 traces sampled · DB pool exhaustion suspected → open

↺ 3 similar past incidents · INC-4102 resolved by rollback · 8 min MTTR → compare

Signals

5xx error ratio: 6.2% ↑

DB connection pool: 98 / 100

upstream: payments-gw · healthy

CDN / edge: healthy

Incident timeline

02:47 deploy #2231 to checkout-api

02:47 first 5xx observed

02:49 alert fired · SEV-2

02:51 paged payments-oncall

02:52 acknowledged from mobile

(02) Context

Designing for the
worst moment of
someone's week.

Incident triage is a hostile design context. The user is half awake, often on a phone, scared they are going to make it worse, and looking at a tool they only open when something is broken. Familiarity has decayed since their last incident.

Telemetry.run had powerful primitives: fast queries, deep traces. But the incident flow asked the on-call engineer to be a power user at the moment they were least equipped to be one.

Company snapshot

STAGE Series A dev tool
USERS Backend engineers, SREs
PRIMITIVES Logs, traces, metrics
PRIMARY UX Query and dashboard
INCIDENT VOLUME Variable, bursty

(01)

Cold-start design

Users open the product when something is broken. There is no warm-up, no learning curve, no second chance. The first ten seconds matter more than the next ten minutes.

(02)

Power and accessibility

Power users want raw query languages and shortcuts. Sleep-deprived users want a guided path. The product had to be both.

(03)

Phone-first triage

Half of all pages are acknowledged from a phone. The triage flow had to work on a six-inch screen, in dark mode, with one thumb.

(03) The problem

Dashboards are
made for browsing.
Incidents need a path.

Telemetry's incident flow dropped the on-call engineer onto a dashboard. Fourteen widgets, no obvious next step. The implicit message was "browse around, you will find it". A fine message for a workday, a hostile one at 3am.

Engineers compensated by writing runbooks, but the runbooks lived in Notion, the alerts in PagerDuty, the dashboards in Telemetry. The incident response was a tab-switching exercise where each tab assumed you remembered the last one.

7 tabs

Average context switches
per incident, before the redesign

Failure mode 01

Dashboard as default

The incident landing page was a dashboard. Useful for monitoring, useless for triage. No clear "what now".

Failure mode 02

Lost context across tabs

Linking from PagerDuty into Telemetry did not carry the alert context. Engineers retyped queries from screenshots.

Failure mode 03

Power-user defaults

Default views assumed the user knew their service topology. New on-call rotators had no on-ramp.

"I open Telemetry and immediately Cmd-K my way to the runbook in another tab. The product is the runbook."

On-call SRE, research interview

app.telemetry.run/dashboards/prod-overview before

14 widgets · last 1h · auto-refresh 30s

request volume · rps

error rate · %

p99 latency · ms

1.84s

cpu · cluster

memory · cluster

71%

db pool · conns

queue depth · msgs

FIG. 01 — The original incident landing: a monitoring dashboard. Fourteen equal-weight widgets, no recommended action, no “what now”.

(04) Role

Product designer.
Five months.

I owned the incident-triage workstream end to end, paired with one designer focused on dashboards. We worked closely with the on-call rotation team (the backend engineers who carried the pager) and ran biweekly synthetic incident drills with them.

My remit covered the full incident loop: alert handoff, triage landing, root-cause exploration, and post-incident artefacts.

Responsibilities

RESEARCH Pager observation, drill protocol
DESIGN Triage flow, mobile, dark UX
VALIDATION Synthetic drills, MTTI tracking
GTM Docs, office hours, field rollout

On-call shadowing

Drill protocol

Triage flow design

Mobile + dark mode

Drills + iteration

Rollout + MTTI tracking

(05) Initiatives

Three bets.
One direction.

Three changes turned the incident landing from a dashboard into a guided path, without removing anything power users relied on.

(01)

Triage landing

A single-page guided path replacing the default dashboard for paged engineers.

→ 71% faster MTTI

(02)

Context handoff

Alerts carry their full query context into Telemetry. No retyping.

→ 7 to 2 tab switches

(03)

Mobile triage

A first-class mobile flow for the half of pages acknowledged from a phone.

→ Phone triage viable

(06) Deep dive

Triage landing

From dashboard
to decision.

The triage landing replaces the dashboard for any user arriving from a page. It opens with the alert's context loaded, the relevant service in focus, and three suggested next moves: see the trace, see recent deploys, see comparable past incidents.

It is deliberately opinionated. Every suggestion is a one-tap detour with a "back to triage" gesture. Power users can drop to the raw query in two keystrokes. Tired users can follow the path.

The research

6 PAGER OBSERVATIONS Real incidents, with consent
12 SYNTHETIC DRILLS Biweekly, controlled
30 POST-INCIDENT REVIEWS Over three months
MTTI TRACKING Pre and post comparison

Suggest, don't solve

The triage view suggests next steps based on heuristics: recent deploys, similar past incidents, error spike sources. It never auto-jumps. The engineer chooses; the system narrows the option space.

Tradeoff. Some users wanted auto-resolve. We argued against it: incidents are one of a kind, and the cost of being wrong is high. Heuristics inform; they do not decide.
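One way to picture the "recent deploy" heuristic is the sketch below. It is illustrative only, not Telemetry.run's implementation: the `Deploy` and `Suggestion` types, the `suggest_recent_deploy` function, the 15-minute look-back window, and the calendar date are all assumptions made for the example. It returns a suggestion for the engineer to tap and never acts on its own.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass(frozen=True)
class Deploy:
    deploy_id: str
    service: str
    shipped_at: datetime


@dataclass(frozen=True)
class Suggestion:
    title: str
    detail: str
    action: str  # label of the one-tap detour; nothing is executed automatically


def suggest_recent_deploy(
    deploys: Iterable[Deploy],
    service: str,
    first_error_at: datetime,
    window: timedelta = timedelta(minutes=15),  # hypothetical look-back window
) -> Optional[Suggestion]:
    """Suggest the latest deploy to `service` that shipped within `window`
    before the first error, or return None if there is no such deploy."""
    candidates = [
        d for d in deploys
        if d.service == service
        and timedelta(0) <= first_error_at - d.shipped_at <= window
    ]
    if not candidates:
        return None
    latest = max(candidates, key=lambda d: d.shipped_at)
    gap_seconds = int((first_error_at - latest.shipped_at).total_seconds())
    return Suggestion(
        title="Recent deploy correlates with the spike",
        detail=f"deploy {latest.deploy_id} · {gap_seconds}s before first error",
        action="inspect",
    )


if __name__ == "__main__":
    # Times mirror the mockup; the calendar date is arbitrary.
    deploys = [Deploy("#2231", "checkout-api", datetime(2022, 1, 1, 2, 47, 0))]
    print(suggest_recent_deploy(deploys, "checkout-api", datetime(2022, 1, 1, 2, 47, 14)))
```

The function only narrows the option space: it hands back a labelled suggestion, and the decision to inspect or roll back stays with the engineer.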

One-tap detours, one-tap return

Every suggested action opens a focused view (trace, deploy log, etc.) with a persistent "back to triage" gesture. Engineers stop losing their place when they explore.

Outcome. Tab switching dropped from seven to two in synthetic drills. The product became the runbook.

Mobile as a first-class surface

I designed the triage flow mobile-first and dark-mode-first, because half of pages are acknowledged from a phone. Designing for the desk first would have meant designing for the wrong moment.

Tradeoff. Some power features, like multi-pane query, were desktop only. We accepted the gap and called it out in docs.

app.telemetry.run/dashboards/prod-overview before

where do I even start? · 14 widgets · no recommended action

request volume · rps

error rate · %

p99 latency · ms

1.84s

db pool · conns

queue depth · msgs

Before · monitoring dashboard as incident landing · 7 tabs / incident

app.telemetry.run/incidents/inc-4821/triage after

SEV-2 checkout-api

Elevated 5xx on checkout-api

PAGED 02:47 · IMPACT ~3.1% checkouts

Start triage →

One path. Three suggested moves.

⤴ Recent deploy correlates with the spike · deploy #2231 · 14s before first error → inspect

≡ Trace the failing requests · DB pool exhaustion suspected → open

↺ 3 similar past incidents · INC-4102 resolved by rollback · 8 min → compare

After · guided triage landing, context preloaded · 2 tabs / incident

The flow, screen by screen

04 surfaces · dark, mobile-first

telemetry.run · ios mobile

2:52 · 5G · ●●●

SEV-2

Elevated 5xx

checkout-api · PAGED 02:47

⤴ Recent deploy · #2231 · 14s before

≡ Trace requests · 38 sampled

Acknowledge · take triage

Mobile triage · the same path, one thumb

alert → telemetry.run handoff

pagerduty · alert

checkout-api 5xx > 5%
service=checkout-api
window=02:47–02:49
region=us-east-1

→ carries context

telemetry · preloaded

Triage opens with the right service, time window and query already loaded. Zero retyping.

Context handoff · the alert carries the query
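The handoff can be pictured as a deep link that encodes the alert's fields, so the triage page can restore them without retyping. The sketch below is an assumption about one possible mechanism, not the shipped integration: the query parameter names and the `build_triage_link` and `read_context` helpers are hypothetical, and only the URL path and the field values come from the mockup above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_triage_link(incident_id: str, context: dict) -> str:
    """Encode the alert context into the triage URL (hypothetical parameter names)."""
    base = f"https://app.telemetry.run/incidents/{incident_id}/triage"
    return f"{base}?{urlencode(context)}"


def read_context(link: str) -> dict:
    """Recover the alert context on the receiving side."""
    query = urlsplit(link).query
    # parse_qs returns a list per key; each key appears once here.
    return {key: values[0] for key, values in parse_qs(query).items()}


if __name__ == "__main__":
    alert_context = {
        "service": "checkout-api",
        "window": "02:47–02:49",
        "region": "us-east-1",
        "query": "checkout-api 5xx > 5%",
    }
    link = build_triage_link("inc-4821", alert_context)
    print(link)
    print(read_context(link) == alert_context)
```

Keeping the context in the link itself means it survives the jump from the pager app to the browser, including on a phone, without any shared session between the two tools.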

app.telemetry.run/incidents/inc-4821/trace/7f2a detour

Trace 7f2a · POST /checkout · 1,842ms · errored

← back to triage

checkout-api POST /checkout · 1,842ms

auth verify · 41ms

cart-svc load · 63ms

db acquire conn · 1.41s

db SELECT order · 38ms

payments-gw charge · 29ms

One-tap detour · the waterfall makes the DB-pool wait obvious, with a return gesture always present

db · acquire conn 1.41s

(07) Impact

The work,
measured.

−71%

Time to first insight
Triage landing

SYNTHETIC DRILL
PRE AND POST AVG

7→2

Tab switches per incident

CONTEXT HANDOFF
AND ONE-TAP DETOURS

+22 NPS

On-call engineer NPS

POST ROLLOUT
THREE-MONTH SURVEY

50%

Pages acknowledged on mobile

NOW SUPPORTED
FIRST CLASS

Synthetic drills run

BIWEEKLY
WITH ROTATORS

NOTE. Some metrics are directional, based on the data available at the time. Where possible, figures reflect controlled comparisons; otherwise they are stakeholder-validated estimates.

(08) Outcomes

What changed.
What it earned.

Before	→	After
Dashboard as incident landing	→	Triage landing with guided path
Alerts drop user without context	→	Context handoff carries the query
Mobile usable but unloved	→	Mobile-first triage flow
Seven tab switches per incident	→	Two tab switches per incident
Runbooks lived in Notion	→	Runbook patterns embedded in product
Power-user defaults everywhere	→	Guided path with power-user shortcuts

(09) Reflection

What I would do
differently.

Synthetic drills as a research mode

Biweekly drills with real on-call engineers were the highest-signal research method I have used. Cheaper than ride-alongs, more realistic than usability tests.

Mobile first as a design constraint

Designing for the phone forced ruthless prioritisation that improved the desktop UX too. The constraint was a gift.

Suggest, don't solve as a stance

In a domain where being wrong is expensive, narrowing the option space without choosing for the user was the right line. It is now a team principle.

Earlier alignment with PagerDuty

The context handoff needed cooperation from the alert source. Earlier conversations would have unblocked deeper integration sooner.

Tighter MTTI definition

Time to first insight was easy to game. A sharper, externally verifiable metric would have made the result harder to dispute.

Power-user docs in parallel

I undersold the redesign to power users initially. They perceived it as training wheels. Better launch comms would have closed that gap.
