Wide-Events-Query-First-Debugging-And-Tail-Sampling-As-A-Replacement-Pattern
Sources: 1 • Confidence: Medium • Updated: 2026-04-13 04:03
Key takeaways
- The speaker claims OpenTelemetry standardizes collection/export but does not decide what to log or add business context unless engineers instrument it, so adopting OpenTelemetry alone does not automatically fix observability.
- The speaker claims that logging as commonly practiced today is fundamentally broken and does not reliably reveal what happened during real incidents.
- The speaker claims high-cardinality logging data is expensive and slow only on legacy logging systems optimized for low-cardinality string search, not inherently.
- The speaker reports the main logging/OpenTelemetry implementation work in T3 Chat was done by teammates Julius and Mark (mostly Julius), rather than by the speaker.
- The speaker claims many logging systems are misaligned with debugging needs because they charge by volume and struggle with high-cardinality fields even though high-cardinality identifiers are essential for debugging.
Sections
Wide-Events-Query-First-Debugging-And-Tail-Sampling-As-A-Replacement-Pattern
- The speaker claims OpenTelemetry standardizes collection/export but does not decide what to log or add business context unless engineers instrument it, so adopting OpenTelemetry alone does not automatically fix observability.
- The speaker reports that in T3 Chat, spans are annotated with many contextual attributes across the request flow (e.g., validation errors, thread IDs, message IDs, attachment metadata) instead of passing a large context object through function calls.
- The speaker claims naive random sampling can miss rare but critical failures and proposes tail sampling that keeps all errors, keeps requests slower than a threshold (e.g., above P99), always keeps specific important users/sessions, and randomly samples only a small fraction of fast, successful requests.
- The speaker proposes building a single wide, canonical event per request per service hop by accumulating context throughout the request lifecycle and emitting once at the end.
- The speaker claims wide events enable debugging via structured queries (including SQL-style aggregation) rather than text search.
- The speaker claims that, with wide events, teams can query checkout failures by user segment and feature flag, group by error code, and get sub-second results to identify root cause in one query.
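The wide-event and tail-sampling ideas above can be sketched together: accumulate one canonical event per request and decide whether to keep it only after the outcome is known. This is a minimal illustration, not T3 Chat's implementation; all names (`RequestEvent`, `shouldKeep`) and thresholds are hypothetical.

```typescript
// One wide event per request: context is accumulated across the request
// lifecycle and emitted exactly once at the end.
type WideEvent = Record<string, string | number | boolean>;

class RequestEvent {
  private fields: WideEvent = {};
  private readonly start = Date.now();

  // Attach business context as the request flows through handlers.
  set(key: string, value: string | number | boolean): void {
    this.fields[key] = value;
  }

  // Emit once, when the request finishes; outcome fields are now known.
  finish(status: number): WideEvent {
    this.fields["status"] = status;
    this.fields["duration_ms"] = Date.now() - this.start;
    return this.fields;
  }
}

// Tail sampling: the keep/drop decision happens AFTER completion, so errors
// and slow requests are never dropped. Thresholds are made-up examples.
function shouldKeep(
  event: WideEvent,
  opts = { p99Ms: 2000, vipUsers: new Set<string>(), sampleRate: 0.01 },
): boolean {
  if ((event["status"] as number) >= 500) return true;            // keep all errors
  if ((event["duration_ms"] as number) > opts.p99Ms) return true; // keep slow requests
  if (opts.vipUsers.has(String(event["user_id"]))) return true;   // keep important users
  return Math.random() < opts.sampleRate;                          // sample the rest
}
```

Deciding at the tail rather than at the head is what preserves the rare failures that head-based random sampling would discard.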
Failure-Modes-Of-Traditional-Logging-In-Modern-Systems
- The speaker claims that logging as commonly practiced today is fundamentally broken and does not reliably reveal what happened during real incidents.
- The speaker reports they habitually leave excessive debug logs in code and that over a thousand log statements had to be cleaned up in the T3 Chat codebase.
- The speaker claims traditional logs fail in modern distributed systems because a single user request can traverse many services and infrastructure components while logs still behave like single-server output.
- The speaker claims concurrent requests interleave log lines and that, without strong correlation context, this makes it difficult to reconstruct what happened for a single user request in production.
- The speaker claims string-searching logs is unreliable because identifiers are inconsistently logged, many events omit needed context (e.g., user ID), and downstream services may log different identifiers (e.g., order ID).
- The speaker claims typical logs are optimized for easy emission by developers rather than for efficient querying during investigations, which makes low-effort log statements less useful during outages.
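The interleaving and correlation problems above can be shown in a few lines: log lines from concurrent requests are only reconstructable if every line carries a shared identifier. The data and names here are illustrative, not from T3 Chat.

```typescript
// Each line carries a correlation identifier; without it, a string search
// over messages cannot tell concurrent requests apart.
interface LogLine { request_id: string; service: string; message: string }

// Two concurrent checkouts interleave their lines in production output.
const interleaved: LogLine[] = [
  { request_id: "r1", service: "api",      message: "checkout started" },
  { request_id: "r2", service: "api",      message: "checkout started" },
  { request_id: "r1", service: "payments", message: "charge declined" },
  { request_id: "r2", service: "payments", message: "charge ok" },
];

// With a consistent identifier, one filter recovers a single request's story.
function reconstruct(lines: LogLine[], requestId: string): string[] {
  return lines
    .filter((l) => l.request_id === requestId)
    .map((l) => `${l.service}: ${l.message}`);
}
```

If the payments service had logged an order ID instead of `request_id`, as the last bullet describes, this filter would silently return an incomplete story — which is the failure mode being claimed.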
Economic-And-Platform-Misalignment-Cardinality-And-Volume-Pricing
- The speaker claims high-cardinality logging data is expensive and slow only on legacy logging systems optimized for low-cardinality string search, not inherently.
- The speaker claims many logging systems are misaligned with debugging needs because they charge by volume and struggle with high-cardinality fields even though high-cardinality identifiers are essential for debugging.
- The speaker reports T3 Chat production logging contains about 5.8 billion records and 7.3 terabytes of raw text logs, excluding user messages and model responses.
- The speaker reports their OpenTelemetry dashboard shows about 12,000 spans per minute and that image generation spans average around 60 seconds with a P95 of about 1.25 minutes.
- The speaker recommends that engineering logging practices evolve to take advantage of improved tooling for high-cardinality data.
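On columnar stores like ClickHouse, high-cardinality fields are ordinary filterable columns, which is what makes the one-query root-cause workflow claimed earlier plausible. The query below is a hedged sketch of that kind of aggregation; the table and field names (`wide_events`, `user_segment`, `feature_flag_new_payment`, `error_code`) are hypothetical, not T3 Chat's schema.

```typescript
// A ClickHouse-style aggregation over wide events: filter by business context
// (segment, feature flag), group failures by error code. Columnar storage
// scans only the referenced columns, so high-cardinality fields stay cheap.
const checkoutFailureQuery = `
  SELECT error_code, count() AS failures
  FROM wide_events
  WHERE route = '/checkout'
    AND status >= 500
    AND user_segment = 'enterprise'
    AND feature_flag_new_payment = true
  GROUP BY error_code
  ORDER BY failures DESC
`;
```

The same query against text logs would require grepping every line and hand-parsing context out of free-form messages, which is the mismatch the bullets above describe.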
End-To-End-Correlation-And-Implementation-Reference-Points
- The speaker reports the main logging/OpenTelemetry implementation work in T3 Chat was done by teammates Julius and Mark (mostly Julius), rather than by the speaker.
- The speaker reports a stack review form asks for details including team size, languages, compute platforms, databases, collected data, observability tools, coding agents, monthly observability spend, log volume, and number of services.
- The speaker claims T3 Chat correlates client and server activity by passing span identifiers from the frontend to the backend, and that client errors can surface a span ID so support can retrieve the trace for that request.
- The speaker reports they store their logs in ClickHouse for analytics workloads.
- The speaker reports Axiom allows up to 4096 fields on spans, that T3 Chat uses 59 fields, and that their Stripe logging schema includes 791 fields.
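The client-to-server correlation described above is commonly done by sending trace/span identifiers in the W3C Trace Context `traceparent` header, so backend spans can be joined to the frontend's. A minimal sketch, assuming that mechanism; the helper names are illustrative, not T3 Chat's implementation.

```typescript
// W3C Trace Context `traceparent` format: version-traceId-spanId-flags,
// e.g. 00-<32 hex chars>-<16 hex chars>-01
interface TraceContext { traceId: string; spanId: string }

// The frontend attaches this header to its backend requests.
function buildTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-01`;
}

// The backend parses it and records the IDs on its own spans, joining the
// server-side trace to the client's.
function parseTraceparent(header: string): TraceContext | null {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}$/.exec(header);
  return m ? { traceId: m[1], spanId: m[2] } : null;
}
```

With the span ID surfaced in a client error message, support can paste it into the tracing backend and retrieve the full trace for that exact request — the workflow the speaker describes.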
Unknowns
- What measurable incident-response improvements (e.g., time-to-diagnosis, MTTR, number of pages, support escalation time) resulted from adopting wide events, richer span attributes, and the described correlation workflow?
- What is the total observability cost profile (ingest/storage/query) at the stated scale, and how much does tail sampling reduce it while preserving diagnostic coverage?
- How accurate are the reported scale and performance numbers (record counts, terabytes, spans per minute, latency percentiles), and over what time window were they measured?
- What specific schema fields are treated as mandatory in practice (especially business-context fields) and how are they governed to avoid drift across services?
- What are the operational and security implications of increasing event width and field cardinality (e.g., accidental sensitive data capture, access control, retention policies)?