Implementation Examples And Operational Workflows
Sources: 1 • Confidence: Medium • Updated: 2026-03-02 20:30
Key takeaways
- The speaker attributes the main logging and OpenTelemetry implementation work in T3 Chat to teammates Julius and Mark, mostly Julius.
- The speaker asserts that logging as commonly practiced today is fundamentally broken and does not reliably reveal what happened during real incidents.
- Wide events shift debugging from text search to structured queries, including SQL-style aggregation, enabling analysis of production traffic.
- The speaker asserts that high-cardinality logging data is expensive and slow mainly on legacy logging systems optimized for low-cardinality string search, not inherently so.
- Adopting OpenTelemetry does not automatically fix observability because it standardizes collection/export but does not decide what to log or add business context unless engineers instrument it.
Sections
Implementation Examples And Operational Workflows
- The speaker attributes the main logging and OpenTelemetry implementation work in T3 Chat to teammates Julius and Mark, mostly Julius.
- The stack review form asks for team size, languages, compute platforms, databases, collected data, observability tools, coding agents, monthly observability spend, log volume, and number of services.
- In T3 Chat, spans are annotated with many contextual attributes across the request flow (e.g., validation errors and IDs) rather than passing a large context object through function calls.
- The speaker reports T3 Chat production logging contains about 5.8 billion records and 7.3 terabytes of raw text logs, excluding user messages and model responses.
- T3 Chat correlates client and server activity by passing span identifiers from the frontend to the backend, and client errors can surface a span ID so support can retrieve the exact trace for that request.
- Naive random sampling can miss rare but critical failures; a tail-sampling policy should keep all errors, keep slow requests above a latency threshold, always keep designated important users/sessions, and randomly sample only a small fraction of fast, successful requests.
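The tail-sampling policy above can be sketched as a decision function applied after a request completes, once its outcome is known. The threshold, VIP set, and sample rate below are illustrative assumptions, not values from the talk:

```typescript
// Tail sampling: decide keep/drop AFTER the request finishes, when the
// error status and latency are known. All constants are assumptions.
interface CompletedRequest {
  hasError: boolean;
  durationMs: number;
  userId?: string;
}

const SLOW_THRESHOLD_MS = 1000;                      // assumed latency cutoff
const ALWAYS_KEEP_USERS = new Set(["user_vip_1"]);   // assumed important users
const SUCCESS_SAMPLE_RATE = 0.01;                    // keep ~1% of fast successes

function shouldKeep(
  req: CompletedRequest,
  rand: () => number = Math.random  // injectable for deterministic testing
): boolean {
  if (req.hasError) return true;                                     // keep all errors
  if (req.durationMs >= SLOW_THRESHOLD_MS) return true;              // keep slow requests
  if (req.userId && ALWAYS_KEEP_USERS.has(req.userId)) return true;  // keep key users
  return rand() < SUCCESS_SAMPLE_RATE;                               // sample the rest
}
```

Because the decision runs at the tail, rare failures are never lost to the random sampler; only the high-volume, low-information success traffic is thinned.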
Failure Modes Of Line Oriented Logging
- The speaker asserts that logging as commonly practiced today is fundamentally broken and does not reliably reveal what happened during real incidents.
- Traditional logs fail in modern distributed systems because a single user request can traverse many services and infrastructure components while logs still behave like single-server output.
- Concurrent requests interleave log lines, making it difficult to reconstruct what happened for any single user request in production without strong correlation context.
- String-searching logs is unreliable: identifiers are inconsistently logged, many events omit needed context such as user ID, and downstream services may log different identifiers, forcing multiple manual searches.
- The speaker claims logs are typically optimized for easy emission by developers rather than efficient querying during investigations.
- Structured logging (e.g., JSON/key-value) is necessary but insufficient because useful debugging still requires deliberate inclusion of high-value context fields.
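The emission-optimized versus query-optimized contrast can be made concrete with a small sketch; the field names are assumptions, not a schema from the talk:

```typescript
// Easy to emit, hard to query (which user? which request? which flag?):
//   console.log(`checkout failed for ${email}`);
//
// A structured event deliberately carries the high-value context fields:
interface CheckoutFailureEvent {
  event: string;
  userId: string;        // high-cardinality identifier, essential for correlation
  requestId: string;     // lets support retrieve the exact request
  errorCode: string;     // groupable in aggregate queries
  featureFlags: string[];
}

function buildCheckoutFailure(
  userId: string,
  requestId: string,
  errorCode: string,
  featureFlags: string[]
): CheckoutFailureEvent {
  return { event: "checkout.failed", userId, requestId, errorCode, featureFlags };
}
```

The structured form is not sufficient on its own: every field here had to be chosen and wired in by an engineer, which is the "deliberate inclusion" the bullet above describes.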
Wide Events And Query First Debugging
- Wide events shift debugging from text search to structured queries, including SQL-style aggregation, enabling analysis of production traffic.
- The speaker proposes building a single wide, canonical event per request per service hop by accumulating context throughout the request lifecycle and emitting once at the end.
- The speaker describes a target capability where teams can query checkout failures by user segment and feature flag, group by error code, and get sub-second results to identify root cause in one query.
- The speaker states that a wide event should include request, user, business, infrastructure, error, and performance context fields.
- The speaker claims that better structured logging makes logs more trustworthy and less misleading during debugging.
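A minimal sketch of the one-wide-event-per-request pattern described above: context is accumulated across the request lifecycle and emitted exactly once at the end. The class shape and field names are illustrative assumptions:

```typescript
// One wide, canonical event per request per service hop: each layer adds
// context as it runs; the event is emitted once, e.g. in a finally block.
type Attributes = Record<string, string | number | boolean>;

class WideEvent {
  private fields: Attributes = {};
  private readonly startedAt = Date.now();

  set(key: string, value: string | number | boolean): void {
    this.fields[key] = value; // validation, auth, business logic all contribute
  }

  emit(sink: (e: Attributes) => void = (e) => console.log(JSON.stringify(e))): Attributes {
    const event = { ...this.fields, duration_ms: Date.now() - this.startedAt };
    sink(event); // exactly one emission per request
    return event;
  }
}
```

Once such events land in a columnar store, the checkout-failure investigation becomes a single aggregation rather than a text search, along the lines of this hypothetical query: `SELECT error_code, count() FROM events WHERE event = 'checkout.failed' AND feature_flag = 'new_checkout' GROUP BY error_code`.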
Tooling And Incentives: High Cardinality And Cost
- The speaker asserts that high-cardinality logging data is expensive and slow mainly on legacy logging systems optimized for low-cardinality string search, not inherently so.
- The speaker asserts that many logging systems are misaligned with debugging needs because they charge by volume and struggle with high-cardinality fields, even though high-cardinality identifiers are essential for debugging.
- The speaker says they use ClickHouse for logs because it suits analytics-style query workloads.
- The speaker states Axiom allows up to 4096 fields on spans, that T3 Chat uses 59 fields, and that their Stripe logging schema includes 791 fields.
- The speaker expects that because tooling for high-cardinality data has caught up, engineering logging practices should evolve to take advantage of it.
OpenTelemetry Limits And Instrumentation Responsibility
- Adopting OpenTelemetry does not automatically fix observability because it standardizes collection/export but does not decide what to log or add business context unless engineers instrument it.
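The point can be made concrete: auto-instrumentation produces spans with transport-level attributes (HTTP method, route, status), but business context appears only when application code attaches it. This sketch uses a minimal stand-in for the span interface; with the real `@opentelemetry/api` package the equivalent call is `trace.getActiveSpan()?.setAttribute(...)`, and the attribute names here are assumptions:

```typescript
// Minimal stand-in for an OpenTelemetry-style span; the real interface
// lives in @opentelemetry/api. OTel standardizes how attributes are
// collected and exported, but it cannot know your business context.
interface SpanLike {
  setAttribute(key: string, value: string | number | boolean): void;
}

// This enrichment step is the part OTel does NOT do for you:
function annotateBusinessContext(
  span: SpanLike,
  userId: string,
  plan: string,
  featureFlag: string
): void {
  span.setAttribute("app.user.id", userId);        // assumed attribute names
  span.setAttribute("app.user.plan", plan);
  span.setAttribute("app.feature_flag", featureFlag);
}
```

Without a call like this in the request path, an OTel rollout yields standardized but context-poor telemetry, which is the gap the bullet above describes.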
Unknowns
- What measurable incident-response outcomes (e.g., MTTD/MTTR, pages per engineer, time-to-trace from support ticket) changed after adopting wider context on spans/events and query-first workflows?
- What are the actual ingest, storage, and query costs under the proposed approach (wide events + high-cardinality fields + tail sampling), and how do they compare to prior logging setups?
- What specific schema fields (and which are mandatory) most strongly predict faster diagnosis, versus adding noise and cost?
- How is context accumulated safely and consistently across async boundaries, retries, fan-out, and partial failures, and what failure modes remain for wide-event emission at the end of a request?
- To what extent are the scale numbers (records/terabytes, spans per minute, latency percentiles) stable over time and representative of typical load rather than a point-in-time snapshot?