Operators were running the panel on faith.
Most clusters came with an AI-generated summary. In early research, operators either trusted it without verifying, or dismissed the whole cluster because they had no way to check it. The gap wasn't the summary — it was the absence of verifiable evidence behind it.
Research surfaced three questions operators asked before acting on any cluster: Is anything in the environment behaving unusually? Is this the same problem across all these tickets? Is this still happening? None of those questions were answered anywhere in the product. The inspect panel was designed to answer them — one component per question.
Component 1 — Environment Signal
The wrong question was "what do these tickets have in common?"
Tickets are already grouped by similarity — confirming that similarity is tautological. The right question: is anything in the environment behaving unusually? The system has a model of the full org, pulled from CrowdStrike, Jamf, and Okta. That makes it possible to compare the cluster against org-wide baseline distributions, not just against itself.
In testing, 5 of 8 operators asked the same thing: "Is 96% high? Is that normal for that gateway?" The original design couldn't answer them. A 96% match on VPN gateway sounds alarming until you learn that gateway handles 95% of all company VPN traffic. Without a baseline, every percentage is noise.
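A minimal sketch of the comparison that question implies, checking a cluster's attribute concentration against the org-wide baseline. The field names, data shape, and 2x threshold here are illustrative assumptions, not the shipped logic:

```python
from collections import Counter

def environment_signal(cluster_tickets, org_devices, attribute):
    """Compare one attribute's distribution in the cluster against the org-wide baseline.

    cluster_tickets / org_devices are lists of dicts like {"vpn_gateway": "gw-east-1"};
    the field names and the 2x threshold are illustrative assumptions.
    """
    cluster = Counter(t[attribute] for t in cluster_tickets)
    org = Counter(d[attribute] for d in org_devices)

    rows = []
    for value, count in cluster.most_common():
        cluster_pct = count / len(cluster_tickets)
        org_pct = org.get(value, 0) / len(org_devices)
        rows.append({
            "value": value,
            "cluster_pct": round(cluster_pct * 100, 1),
            "org_pct": round(org_pct * 100, 1),
            # concentration only reads as a signal when it clearly exceeds the baseline
            "anomalous": cluster_pct > 2 * org_pct,
        })
    return rows
```

Run against the example above, the gateway row comes back unremarkable: 96% in the cluster against a 95% org baseline is noise, not signal.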
"A bar showing 96% concentration is only meaningful if the operator knows whether 96% is normal. Without a baseline, every bar is noise."
— Design Process, Environment Signal

The rename from "Common Attributes" to "Environment Signal" was a framing shift, not a label change. Common Attributes asked what the tickets shared internally. Environment Signal asked what the environment was doing that it shouldn't. Three rounds of copy testing ran alongside the visual iteration: Round 1 flagged terms like "baseline distribution" as jargon operators didn't use. Round 2 introduced relative language — "higher than usual" — which was ambiguous without knowing the reference population. Round 3 shipped with explicit scope and raw percentages. Slower to read, but verifiable.
The two-population bar design made the anomaly visible before reading a number: org-wide (muted) on one row, this cluster (orange) below. Below-baseline attributes rendered in blue-gray — absence is a different failure mode than concentration.
The final design replaced per-attribute snapshots with a shared time axis. All three signals — VPN gateway routing percentage, client version adoption curve, OS version prevalence — on one chart. One marker at 09:03 cuts through all three lines. Gateway and Client Version bend at it. OS Version crosses it flat. That contrast encodes the diagnosis before anything is read. The attribute rows below act as a reference layer: value, cluster%, org%.
Final — three org-wide signals on a shared time axis. Gateway and Client Version bend at 09:03; OS Version is unaffected. Hover to read values at any point.
Component 2 — Ticket Rate
Without time, every cluster feels equally urgent.
The inspect panel could tell operators what the cluster was and what was anomalous in the environment. What it couldn't tell them: is this still happening? A cluster that peaked 40 minutes ago needs a different response than one still climbing, and operators can't calibrate a response to something they can't place in time.
The bar chart was invisible at panel width. Smooth curves were worse: they imply continuous data between discrete arrival buckets — a smooth line between minute 3 and minute 4 invents data that doesn't exist. Step interpolation is honest about bucket boundaries. Each horizontal segment is a minute; nothing is smoothed between them.
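A minimal sketch of the bucketing behind that choice, assuming per-ticket arrival timestamps; the one-minute bucket width mirrors the description above:

```python
from collections import Counter
from datetime import datetime, timedelta

def minute_buckets(arrivals: list[datetime], start: datetime, end: datetime):
    """Count ticket arrivals per one-minute bucket.

    Each (minute, count) pair becomes one horizontal segment of the step line;
    empty minutes stay at zero rather than being smoothed over, so the chart
    never implies arrivals the data doesn't contain.
    """
    counts = Counter(ts.replace(second=0, microsecond=0) for ts in arrivals)
    buckets, t = [], start.replace(second=0, microsecond=0)
    while t <= end:
        buckets.append((t, counts.get(t, 0)))
        t += timedelta(minutes=1)
    return buckets
```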
"The step line was the only variant where participants spontaneously asked 'what's this one?' pointing at the ghost — indicating the shape was clear enough that the second line registered as meaningful rather than noise."
— Design Process, Ticket Rate · Round 1 testing (n=8)

The final component overlays a ghost cluster from the most comparable prior incident — not a number, not a label, but a shape. Operators could see whether the current cluster was tracking above or below a pattern they'd already resolved. Across five rounds and 42 participants, operators without the chart made wrong decisions on escalating clusters at three times the rate of operators who had it. Median time to decision dropped from 54 seconds to 38.
Final — step-interpolated line with ghost overlay from the most comparable prior cluster. Tapering state shown.
Component 3 — Outlier Surfacing
Three actions was too many.
Before any workflow runs, some tickets in the cluster diverge from the dominant attribute pattern — not because they don't belong (they passed clustering), but because one detail is different: a different Okta tenant, a different region. The component's job is to surface that before the operator acts, without alarming anyone. The core tension: how quiet is too quiet?
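A sketch of what "diverges from the dominant pattern" could mean in practice. The attribute names, data shape, and 80% dominance threshold are assumptions for illustration, not the shipped logic:

```python
from collections import Counter

def surface_outliers(tickets: list[dict], attributes: list[str], dominance: float = 0.8):
    """Flag tickets that diverge from the cluster's dominant value on any attribute.

    For each attribute (e.g. "identity_provider", "region"), find the value most
    of the cluster shares; tickets holding a different value are surfaced, keyed
    by the attribute that diverges.
    """
    outliers: dict[str, list[dict]] = {}
    for attr in attributes:
        value, count = Counter(t.get(attr) for t in tickets).most_common(1)[0]
        if count / len(tickets) < dominance:
            continue  # no dominant pattern on this attribute, so nothing to diverge from
        divergent = [t for t in tickets if t.get(attr) != value]
        if divergent:
            outliers[attr] = divergent
    return outliers
```

In this shape the output is keyed by the diverging attribute rather than reduced to a count, which lines up with the collapsed-row finding below.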
In testing, "Route separately" had zero interaction rate: "Route separately — where? I don't have another workflow in front of me right now." The action model was designed for a threat level the component didn't carry.
The signal problem ran deeper than the actions. Early iterations used a numeric badge — "3 outliers" — to indicate divergence in the collapsed row. In testing, 50% of operators expanded the section when they saw only the badge. When attribute names appeared in the collapsed row instead — "Identity Provider, Region" — that rate jumped to 75%. The number didn't mean anything. The attribute name told operators whether the divergence was relevant to their cluster.
"The action model should match the signal weight. A low-urgency pre-run signal doesn't need three decisions — it needs one easy out and one meaningful escalation path."
— Design Process, Outlier Surfacing

Four iterations later: two actions — Got it and Remove from group — with diff data moved from tooltips into inline rows and attribute names visible in the collapsed state. "Got it" reached 8/8 comprehension; "Noted" had been 2/6. Time to first action dropped from 18 seconds to 11.
Final — two actions, attribute names visible in collapsed state. Toggle to see a smaller or larger outlier set.
Evidence layer, not decision layer.
Environment Signal shows what's anomalous in the infrastructure. Ticket Rate shows whether it's still happening. Outlier Surfacing flags what sits outside the dominant pattern before the run. None of these components tell the operator what to do — that's intentional. Their job is to make the available evidence visible and legible, not to make the decision for the person holding it.
What I learned.
The question drives the design. "What do these tickets have in common?" was our original framing for the commonality component. The right question — "what is the environment doing that it shouldn't?" — produces a completely different component. Getting there required exploration, not refinement. Those aren't the same thing.
Testing exposed gaps that couldn't be designed around. The single bar with a tick mark failed for 7 of 10 participants on an encoding I thought was obvious. A smooth curve read as "a trend" instead of discrete ticket arrivals. "Noted" had 2/6 comprehension; "Got it" had 8/8. You can't see those gaps from the inside.
Engineering and ML reviews were design inputs, not implementation checkpoints. The ML lead's note — "surfacing our uncertainty builds trust in the cases where we are confident" — restored a confidence indicator that had been cut. The engineering walkthrough of Outlier Surfacing produced four open questions that sharpened the spec. The best design feedback came from people who weren't UX designers.