Rosa Del Mar

Daily Brief

Issue 61 2026-03-02

Implementation Examples And Operational Workflows

General
Sources: 1 • Confidence: Medium • Updated: 2026-03-02 20:30

Key takeaways

  • The speaker attributes the main logging and OpenTelemetry implementation work in T3 Chat to teammates Julius and Mark, mostly Julius.
  • The speaker asserts that logging as commonly practiced today is fundamentally broken and does not reliably reveal what happened during real incidents.
  • Wide events shift debugging from text search to structured queries, including SQL-style aggregation, enabling analysis of production traffic.
  • The speaker asserts that high-cardinality logging data is expensive and slow mainly on legacy logging systems optimized for low-cardinality string search, not inherently so.
  • Adopting OpenTelemetry does not automatically fix observability because it standardizes collection/export but does not decide what to log or add business context unless engineers instrument it.

Sections

Implementation Examples And Operational Workflows

  • The speaker attributes the main logging and OpenTelemetry implementation work in T3 Chat to teammates Julius and Mark, mostly Julius.
  • The stack review form asks for team size, languages, compute platforms, databases, collected data, observability tools, coding agents, monthly observability spend, log volume, and number of services.
  • In T3 Chat, spans are annotated with many contextual attributes across the request flow (e.g., validation errors and IDs) rather than passing a large context object through function calls.
  • The speaker reports T3 Chat production logging contains about 5.8 billion records and 7.3 terabytes of raw text logs, excluding user messages and model responses.
  • T3 Chat correlates client and server activity by passing span identifiers from the frontend to the backend, and client errors can surface a span ID so support can retrieve the exact trace for that request.
  • Naive random sampling can miss rare but critical failures, and a tail-sampling approach should keep all errors, keep slow requests above a latency threshold, always keep specific important users/sessions, and only randomly sample a small fraction of successful fast requests.
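The client-server correlation described above can be sketched as follows. This is a simplified illustration, not T3 Chat's implementation: the `traceparent` header name follows the W3C convention, but real OpenTelemetry propagation uses the full `version-traceid-spanid-flags` format rather than the bare ID shown here.

```python
import uuid

# Frontend side: mint a span identifier and attach it to the outgoing request
# so the backend's telemetry shares the same ID (simplified ID format; real
# W3C traceparent is "00-<trace-id>-<span-id>-<flags>").
def client_request_headers():
    span_id = uuid.uuid4().hex[:16]
    return {"traceparent": span_id}, span_id

# Backend side: adopt the incoming ID, and surface it in error responses so
# support can retrieve the exact trace for a user's failed request.
def handle(headers):
    span_id = headers.get("traceparent", "unknown")
    return {"error": "upstream timeout", "span_id": span_id}

headers, client_span = client_request_headers()
response = handle(headers)
```

Because the error payload carries the span ID, a support ticket that quotes it leads directly to the one trace that matters, with no string search involved.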
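The tail-sampling policy in the last bullet can be sketched as a single decision function evaluated after the request completes, when outcome and latency are known. The threshold, baseline rate, and `VIP_USERS` set are illustrative assumptions, not figures from the talk.

```python
import random

VIP_USERS = {"user_support_case_123"}   # assumed: sessions to always keep
SLOW_MS = 2000                          # assumed latency threshold
BASELINE_RATE = 0.01                    # keep ~1% of fast, successful requests

def keep_trace(status: int, duration_ms: float, user_id: str,
               rng=random.random) -> bool:
    if status >= 500:             # keep all errors
        return True
    if duration_ms > SLOW_MS:     # keep all slow requests
        return True
    if user_id in VIP_USERS:      # always keep specific users/sessions
        return True
    return rng() < BASELINE_RATE  # randomly sample the boring majority
```

Deciding at the tail rather than at the head is what lets the policy guarantee rare failures are never dropped while still discarding most routine traffic.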

Failure Modes Of Line-Oriented Logging

  • The speaker asserts that logging as commonly practiced today is fundamentally broken and does not reliably reveal what happened during real incidents.
  • Traditional logs fail in modern distributed systems because a single user request can traverse many services and infrastructure components while logs still behave like single-server output.
  • Concurrent requests interleave log lines, making it difficult to reconstruct what happened for any single user request in production without strong correlation context.
  • String-searching logs is unreliable because identifiers are inconsistently logged, many events omit needed context such as user ID, and downstream services may log different identifiers that require multiple manual searches.
  • The speaker claims logs are typically optimized for easy emission by developers rather than efficient querying during investigations.
  • Structured logging (e.g., JSON/key-value) is necessary but insufficient because useful debugging still requires deliberate inclusion of high-value context fields.

Wide Events And Query-First Debugging

  • Wide events shift debugging from text search to structured queries, including SQL-style aggregation, enabling analysis of production traffic.
  • The speaker proposes building a single wide, canonical event per request per service hop by accumulating context throughout the request lifecycle and emitting once at the end.
  • The speaker describes a target capability where teams can query checkout failures by user segment and feature flag, group by error code, and get sub-second results to identify root cause in one query.
  • The speaker states that a wide event should include request, user, business, infrastructure, error, and performance context fields.
  • The speaker claims that better structured logging makes logs more trustworthy and less misleading during debugging.
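The canonical wide-event pattern above can be sketched as one mutable event per request that accumulates context through the lifecycle and is emitted once at the end, after which the checkout-failure question becomes a single aggregation. The field names (`user_segment`, `feature_flag_new_checkout`, `error_code`) are illustrative assumptions, not the speaker's schema.

```python
from collections import Counter

class WideEvent:
    """One canonical event per request: accumulate fields, emit once."""
    def __init__(self, request_id: str):
        self.fields = {"request_id": request_id}
    def set(self, **attrs):
        self.fields.update(attrs)   # add context wherever it becomes known
    def emit(self):
        return dict(self.fields)    # in production: write one structured record

def handle_checkout(request_id, user_segment, flag_on, error_code=None):
    ev = WideEvent(request_id)
    ev.set(route="/checkout", user_segment=user_segment)   # request/user context
    ev.set(feature_flag_new_checkout=flag_on)              # business context
    ev.set(outcome="error" if error_code else "ok", error_code=error_code)
    return ev.emit()

events = [
    handle_checkout("r1", "pro", True, "CARD_DECLINED"),
    handle_checkout("r2", "pro", True, "CARD_DECLINED"),
    handle_checkout("r3", "free", False, "TIMEOUT"),
    handle_checkout("r4", "pro", True),
]

# Query-first debugging: the SQL-style question
#   SELECT error_code, count(*) FROM events
#   WHERE outcome = 'error' AND user_segment = 'pro'
#     AND feature_flag_new_checkout GROUP BY error_code
# expressed over the in-memory events:
failures = Counter(
    e["error_code"] for e in events
    if e["outcome"] == "error"
    and e["user_segment"] == "pro"
    and e["feature_flag_new_checkout"]
)
```

Because every event carries the segment and flag as first-class fields, the root-cause question is one query over structured data instead of several speculative text searches.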

Tooling And Incentives: High Cardinality And Cost

  • The speaker asserts that high-cardinality logging data is expensive and slow mainly on legacy logging systems optimized for low-cardinality string search, not inherently so.
  • The speaker asserts that many logging systems are misaligned with debugging needs because they charge by volume and struggle with high-cardinality fields, even though high-cardinality identifiers are essential for debugging.
  • The speaker says they use ClickHouse for log storage because the workload is analytics-shaped.
  • The speaker states Axiom allows up to 4096 fields on spans, that T3 Chat uses 59 fields, and that their Stripe logging schema includes 791 fields.
  • The speaker expects that because tooling for high-cardinality data has caught up, engineering logging practices should evolve to take advantage of it.

OpenTelemetry Limits And Instrumentation Responsibility

  • Adopting OpenTelemetry does not automatically fix observability because it standardizes collection/export but does not decide what to log or add business context unless engineers instrument it.
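The gap described above can be made concrete with a stand-in span type (deliberately not the real OpenTelemetry SDK, though `set_attribute` mirrors its API shape): auto-instrumentation records generic HTTP attributes for free, but business context appears only if engineers add it explicitly. The `app.*` field names are assumptions for illustration.

```python
class Span:
    """Stand-in for an OTel span; set_attribute mirrors the real API shape."""
    def __init__(self):
        self.attributes = {}
    def set_attribute(self, key, value):
        self.attributes[key] = value

def auto_instrumented_request(span: Span):
    # What a typical HTTP auto-instrumentation layer records by default.
    span.set_attribute("http.request.method", "POST")
    span.set_attribute("http.response.status_code", 402)

def handle_payment(span: Span):
    auto_instrumented_request(span)
    # Without these manual calls, the span never learns *why* the request
    # failed; OTel standardizes the pipe, not the payload.
    span.set_attribute("app.user.plan", "pro")                  # assumed names
    span.set_attribute("app.payment.error_code", "CARD_DECLINED")

span = Span()
handle_payment(span)
```

Adopting the standard gets you the first two attributes everywhere; the last two, the ones that actually explain an incident, remain the engineers' responsibility.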

Unknowns

  • What measurable incident-response outcomes (e.g., MTTD/MTTR, pages per engineer, time-to-trace from support ticket) changed after adopting wider context on spans/events and query-first workflows?
  • What are the actual ingest, storage, and query costs under the proposed approach (wide events + high-cardinality fields + tail sampling), and how do they compare to prior logging setups?
  • What specific schema fields (and which are mandatory) most strongly predict faster diagnosis, versus adding noise and cost?
  • How is context accumulated safely and consistently across async boundaries, retries, fan-out, and partial failures, and what failure modes remain for wide-event emission at the end of a request?
  • To what extent are the scale numbers (records/terabytes, spans per minute, latency percentiles) stable over time and representative of typical load rather than a point-in-time snapshot?

Investor overlay

Read-throughs

  • Shift from line-oriented logs to wide events and query-first debugging could increase demand for observability platforms optimized for structured, high-cardinality telemetry and fast SQL-style aggregation.
  • OpenTelemetry adoption alone is insufficient; value may accrue to vendors and internal tooling that provide opinionated schemas, context-propagation patterns, and incident-response workflows rather than just collection and export.
  • Tail sampling plus high-context spans may support cost control while improving traceability, potentially benefiting tools that make sampling policies, trace lookup, and support-ticket-to-trace correlation easy to operate.

What would confirm

  • Published or shared metrics showing improved incident response after adopting wide events and query-first workflows, such as reduced time-to-trace from a support ticket, lower MTTD or MTTR, or fewer pages per engineer.
  • Clear, stable schema guidance identifying a small set of mandatory high-value fields on spans or events that materially speeds diagnosis without excessive noise or cost.
  • Transparent cost and performance data for ingest, storage, and query under wide events with high-cardinality fields and tail sampling, including comparisons to prior logging setups.

What would kill

  • No measurable improvement in incident response outcomes despite adopting wider context, structured queries, and trace correlation workflows.
  • Cost or latency of ingesting and querying high-cardinality wide events proves materially worse than prior logging, even with tail sampling and modern tooling.
  • Persistent context-propagation failures across async boundaries, retries, and fan-out that lead to missing or unreliable canonical events, undermining query-first debugging during real incidents.

Sources

  1. youtube.com