Rosa Del Mar

Daily Brief

Issue 103 2026-04-13

Wide Events, Query-First Debugging, and Tail Sampling as a Replacement Pattern

General
Sources: 1 • Confidence: Medium • Updated: 2026-04-13 04:03

Key takeaways

  • The speaker claims OpenTelemetry standardizes collection/export but does not decide what to log or add business context unless engineers instrument it, so adopting OpenTelemetry alone does not automatically fix observability.
  • The speaker claims that logging as commonly practiced today is fundamentally broken and does not reliably reveal what happened during real incidents.
  • The speaker claims high-cardinality logging data is expensive and slow only on legacy logging systems optimized for low-cardinality text search, not inherently.
  • The speaker reports the main logging/OpenTelemetry implementation work in T3 Chat was done by teammates Julius and Mark (mostly Julius), rather than by the speaker.
  • The speaker claims many logging systems are misaligned with debugging needs because they charge by volume and struggle with high-cardinality fields even though high-cardinality identifiers are essential for debugging.

Sections

Wide Events, Query-First Debugging, and Tail Sampling as a Replacement Pattern

  • The speaker claims OpenTelemetry standardizes collection/export but does not decide what to log or add business context unless engineers instrument it, so adopting OpenTelemetry alone does not automatically fix observability.
  • The speaker reports that in T3 Chat, spans are annotated with many contextual attributes across the request flow (e.g., validation errors, thread IDs, message IDs, attachment metadata) instead of passing a large context object through function calls.
  • The speaker claims naive random sampling can miss rare but critical failures and proposes tail sampling that keeps all errors, keeps slow requests above a threshold (e.g., above P99), always keeps specific important users/sessions, and only randomly samples a small fraction of successful fast requests.
  • The speaker proposes building a single wide, canonical event per request per service hop by accumulating context throughout the request lifecycle and emitting once at the end.
  • The speaker claims wide events enable debugging via structured queries (including SQL-style aggregation) rather than text search.
  • The speaker claims that, with wide events, teams can query checkout failures by user segment and feature flag, group by error code, and get sub-second results that identify the root cause in a single query.
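The tail-sampling policy summarized above can be sketched as a post-request decision function. This is a minimal illustration, not the speaker's implementation: the field names (`status`, `duration_ms`, `user_id`), the latency threshold, the VIP set, and the 1% sample rate are all assumptions.

```python
import random

P99_MS = 2000            # assumed latency threshold (e.g., the service's P99)
VIP_USERS = {"user_42"}  # users/sessions that are always kept (illustrative)
SAMPLE_RATE = 0.01       # fraction of fast, successful requests to keep

def keep_trace(trace: dict, rng=random) -> bool:
    """Decide, after the request completes, whether to keep its trace."""
    if trace.get("status") == "error":        # keep every failure
        return True
    if trace.get("duration_ms", 0) > P99_MS:  # keep slow outliers
        return True
    if trace.get("user_id") in VIP_USERS:     # keep important users
        return True
    return rng.random() < SAMPLE_RATE         # sample the boring rest

keep_trace({"status": "error", "duration_ms": 12})  # True: errors always kept
```

Because the decision runs after the request finishes (tail, not head, sampling), rare failures and slow outliers are never lost to the random sampler.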
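The "one wide canonical event per request per hop" pattern can be sketched as an accumulator that is annotated throughout the request lifecycle and emitted exactly once at the end. The class and field names here are illustrative assumptions, not the T3 Chat schema.

```python
import json
import time

class WideEvent:
    """Accumulates context during a request; emits one record at the end."""

    def __init__(self, request_id: str):
        self.fields = {"request_id": request_id, "start_ts": time.time()}

    def add(self, **kwargs):
        # Annotate from anywhere in the request flow, instead of
        # threading a large context object through function calls.
        self.fields.update(kwargs)

    def emit(self) -> str:
        # One structured, queryable record per request per hop.
        self.fields["duration_ms"] = round(
            (time.time() - self.fields["start_ts"]) * 1000, 2)
        return json.dumps(self.fields)

evt = WideEvent("req-123")
evt.add(user_id="u-9", feature_flag="new_checkout", plan="pro")
evt.add(status="error", error_code="card_declined")
record = evt.emit()  # single wide event, queryable by any field
```

Once such events land in a columnar store, the checkout-failure investigation described above becomes a single aggregation, e.g. `SELECT error_code, count() FROM events WHERE feature_flag = 'new_checkout' AND status = 'error' GROUP BY error_code` (illustrative schema).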

Failure Modes of Traditional Logging in Modern Systems

  • The speaker claims that logging as commonly practiced today is fundamentally broken and does not reliably reveal what happened during real incidents.
  • The speaker reports they habitually leave excessive debug logs in code and that over a thousand log statements had to be cleaned up in the T3 Chat codebase.
  • The speaker claims traditional logs fail in modern distributed systems because a single user request can traverse many services and infrastructure components while logs still behave like single-server output.
  • The speaker claims concurrent requests interleave log lines and that, without strong correlation context, this makes it difficult to reconstruct what happened for a single user request in production.
  • The speaker claims string-searching logs is unreliable because identifiers are inconsistently logged, many events omit needed context (e.g., user ID), and downstream services may log different identifiers (e.g., order ID).
  • The speaker claims typical logs are optimized for easy emission by developers rather than efficient querying during investigations, which makes low-effort log statements less useful during outages.
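The interleaving and correlation problems above disappear once every record carries a consistent correlation field. A minimal sketch, with illustrative field names and values:

```python
# Logs from two concurrent requests, interleaved in arrival order.
# Every record carries request_id, including the downstream hop that
# otherwise only knows its own order_id.
records = [
    {"request_id": "req-1", "service": "api",      "step": "validate"},
    {"request_id": "req-2", "service": "api",      "step": "validate"},
    {"request_id": "req-1", "service": "payments", "step": "charge",
     "order_id": "ord-77"},
    {"request_id": "req-2", "service": "payments", "step": "charge"},
]

def request_timeline(request_id: str) -> list:
    """Reconstruct one request's story with a filter, not a text search."""
    return [r for r in records if r["request_id"] == request_id]

request_timeline("req-1")  # both hops for req-1, in emission order
```

Without that shared identifier, the same reconstruction requires grepping free text for whichever identifiers each service happened to log.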

Economic and Platform Misalignment: Cardinality and Volume Pricing

  • The speaker claims high-cardinality logging data is expensive and slow only on legacy logging systems optimized for low-cardinality text search, not inherently.
  • The speaker claims many logging systems are misaligned with debugging needs because they charge by volume and struggle with high-cardinality fields even though high-cardinality identifiers are essential for debugging.
  • The speaker reports T3 Chat production logging contains about 5.8 billion records and 7.3 terabytes of raw text logs, excluding user messages and model responses.
  • The speaker reports their OpenTelemetry dashboard shows about 12,000 spans per minute and that image generation spans average around 60 seconds with a P95 of about 1.25 minutes.
  • The speaker recommends that engineering logging practices evolve to take advantage of improved tooling for high-cardinality data.

End-to-End Correlation and Implementation Reference Points

  • The speaker reports the main logging/OpenTelemetry implementation work in T3 Chat was done by teammates Julius and Mark (mostly Julius), rather than by the speaker.
  • The speaker reports a stack review form asks for details including team size, languages, compute platforms, databases, collected data, observability tools, coding agents, monthly observability spend, log volume, and number of services.
  • The speaker claims T3 Chat correlates client and server activity by passing span identifiers from the frontend to the backend, and that client errors can surface a span ID so support can retrieve the trace for that request.
  • The speaker reports they store their logs in ClickHouse to support analytics workloads.
  • The speaker reports Axiom allows up to 4096 fields on spans, that T3 Chat uses 59 fields, and that their Stripe logging schema includes 791 fields.
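The client-to-server correlation described above is commonly carried in the W3C Trace Context `traceparent` header (`version-traceid-spanid-flags`); the header format below is the W3C standard, but the support workflow around it is a sketch of the pattern, not T3 Chat's implementation.

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four dash-separated parts."""
    version, trace_id, span_id, flags = header.split("-")
    return {"trace_id": trace_id, "span_id": span_id, "flags": flags}

# A client error dialog could surface span_id so support can retrieve the
# exact trace for that request. Example value is from the W3C spec:
tp = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
tp["span_id"]  # "00f067aa0ba902b7"
```

The frontend sends this header with each request; the backend attaches its spans under the same trace ID, so one identifier links client activity to every server hop.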

Unknowns

  • What measurable incident-response improvements (e.g., time-to-diagnosis, MTTR, number of pages, support escalation time) resulted from adopting wide events, richer span attributes, and the described correlation workflow?
  • What is the total observability cost profile (ingest/storage/query) at the stated scale, and how much does tail sampling reduce it while preserving diagnostic coverage?
  • How accurate are the reported scale and performance numbers (record counts, terabytes, spans per minute, latency percentiles), and over what time window were they measured?
  • What specific schema fields are treated as mandatory in practice (especially business-context fields) and how are they governed to avoid drift across services?
  • What are the operational and security implications of increasing event width and field cardinality (e.g., accidental sensitive data capture, access control, retention policies)?

Investor overlay

Read-throughs

  • Demand shift toward analytics-first observability stacks that handle high-cardinality, wide structured events efficiently, away from legacy text-search logging products that price by volume and struggle with cardinality.
  • Increased adoption of tail sampling and schema-governed instrumentation as cost controls, benefiting vendors and tools that support policy-driven sampling, correlation workflows, and queryable structured telemetry.
  • Competitive advantage for log and trace backends positioned for large-scale structured queries and aggregations, as teams treat observability as an analytics problem with canonical event schemas and correlation across hops.

What would confirm

  • Customer references and benchmarks showing fast queries and stable costs at high cardinality and wide event schemas, including workflows that retrieve exact traces via correlation identifiers.
  • Case studies reporting measurable incident-response improvements after adopting canonical wide events, richer span attributes, and tail sampling, such as reduced time-to-diagnosis or MTTR.
  • Product roadmaps and usage metrics emphasizing schema governance, structured query patterns, and tail sampling defaults, plus reduced reliance on ad hoc text logs for debugging.

What would kill

  • Data showing wide events and high-cardinality fields materially increase cost or degrade performance even on modern backends, making the approach impractical at scale.
  • Lack of measurable operational improvements from the approach, such as unchanged MTTR or investigation time despite richer context and correlation workflows.
  • Security or governance failures from increased event width, such as sensitive data leakage or inability to enforce retention and access controls, leading teams to limit attributes and revert to narrower telemetry.

Sources

  1. youtube.com