Reference-Implementation-T3-Chat-Scale-And-Workflows
Sources: 1 • Confidence: Medium • Updated: 2026-04-12 10:36
Key takeaways
- The speaker attributes the main T3 Chat logging/OTEL implementation work to teammates Julius and Mark (mostly Julius).
- Traditional line-oriented logs do not reliably support reconstructing what happened during an incident in modern distributed systems, where a single request traverses many services and components.
- A proposed logging pattern is to accumulate context throughout a request lifecycle and emit a single wide, canonical event per request per service hop at the end of the request.
- Adopting OpenTelemetry standardizes collection and export, but it does not decide what to instrument or which business context to attach; observability is only as useful as those engineering instrumentation decisions.
- High-cardinality logging data is described as expensive and slow primarily on legacy logging systems optimized for searching low-cardinality strings, not as inherently expensive and slow.
Sections
Reference-Implementation-T3-Chat-Scale-And-Workflows
- The speaker attributes the main T3 Chat logging/OTEL implementation work to teammates Julius and Mark (mostly Julius).
- A codebase referenced as T3 Chat required cleanup of over a thousand debug log statements that had accumulated from habitual excessive logging.
- In T3 Chat, spans are annotated with contextual attributes across the request flow (e.g., validation errors, thread IDs, message IDs, attachment metadata) rather than passing a large context object through function calls.
- The T3 Chat production logging volume is reported as about 5.8 billion records and 7.3 TB of raw text logs, excluding user messages and model responses.
- T3 Chat correlates client and server activity by passing span identifiers from the frontend to the backend, and client errors can surface a span ID to support direct trace retrieval for a specific request.
- The speakers report using ClickHouse for log analytics workloads.
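
The span-annotation approach described above can be sketched as follows. This is a minimal stand-in for a tracing span, not the OpenTelemetry API or T3 Chat's actual code; the attribute names (`thread.id`, `message.id`, `attachment.count`) are illustrative assumptions. The point is that handlers annotate the active span as they learn things, instead of threading a growing context object through every function signature.

```typescript
// Minimal stand-in for a tracing span (the real code would use the
// OpenTelemetry API's active span). Attribute names are illustrative.
type AttrValue = string | number | boolean;

class Span {
  readonly attributes: Record<string, AttrValue> = {};
  setAttribute(key: string, value: AttrValue): this {
    this.attributes[key] = value;
    return this;
  }
}

// Handlers annotate the active span rather than passing a context object.
// For client/server correlation, the frontend would send its span ID in a
// request header so the backend can attach its work to the same trace.
function handleMessage(span: Span, threadId: string, messageId: string): void {
  span.setAttribute("thread.id", threadId);
  span.setAttribute("message.id", messageId);
  // ...deeper code adds validation errors, attachment metadata, etc.
  span.setAttribute("attachment.count", 2);
}

const span = new Span();
handleMessage(span, "thr_123", "msg_456");
console.log(span.attributes);
```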
Failure-Modes-Of-Traditional-Logging-In-Distributed-Systems
- Traditional line-oriented logs do not reliably support reconstructing what happened during an incident in modern distributed systems, where a single request traverses many services and components.
- Concurrent request handling causes log-line interleaving that makes it difficult to reconstruct the sequence of events for a single user request without strong correlation context.
- Grepping/string-searching logs is unreliable during investigations because identifiers and required context fields are inconsistently emitted across events and across services.
- Common logging practices tend to optimize for easy log emission by developers rather than for efficient querying and investigation during incidents.
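
The grep-unreliability point above can be made concrete with a toy example (log lines and field names are invented for illustration): if one service emits the correlating identifier and another does not, a string search for that identifier silently misses the line that matters.

```typescript
// Illustrative only: two services logging the same user request.
// Service A includes the request ID; service B omits it.
const serviceALog = `{"level":"info","request_id":"req_42","msg":"checkout started"}`;
const serviceBLog = `{"level":"error","msg":"payment declined"}`; // no request_id

// Grepping for the request ID finds service A's line but not the failure
// that actually matters, because the correlating field was never emitted.
const matches = [serviceALog, serviceBLog].filter((line) => line.includes("req_42"));
console.log(matches); // only service A's line matches
```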
Wide-Canonical-Events-And-Query-Based-Debugging
- A proposed logging pattern is to accumulate context throughout a request lifecycle and emit a single wide, canonical event per request per service hop at the end of the request.
- Wide events shift debugging from text search toward structured querying and aggregation (including SQL-style workflows) over production traffic.
- With wide events, the described target workflow is a single query over production traffic: filter checkout failures by user segment and feature flag, group by error code, and get sub-second results that identify the root cause.
- A wide canonical event should include request, user, business, infrastructure, error, and performance context fields to support debugging and product questions without multiple log searches.
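
The accumulate-then-emit-once pattern above can be sketched as a small sketch in TypeScript. Everything here is an illustrative assumption, not T3 Chat's schema or implementation: field names, the `CanonicalEvent` class, and the `console.log` destination stand in for whatever the real pipeline ships to its backend.

```typescript
// Sketch of the wide-canonical-event pattern: accumulate context across
// the request lifecycle, emit one wide event per request per service hop.
type WideEvent = Record<string, string | number | boolean | null>;

class CanonicalEvent {
  private event: WideEvent;
  private readonly start = Date.now();

  constructor(requestId: string, route: string) {
    this.event = { "request.id": requestId, "request.route": route };
  }

  // Any code on the request path can add context as it learns it.
  set(key: string, value: string | number | boolean | null): void {
    this.event[key] = value;
  }

  // Called exactly once, in a finally block, so failures still emit.
  emit(): WideEvent {
    this.event["duration_ms"] = Date.now() - this.start;
    console.log(JSON.stringify(this.event)); // one wide line per request
    return this.event;
  }
}

// Usage: request, user, business, flag, and error context on one event.
const ev = new CanonicalEvent("req_42", "/api/checkout");
let emitted: WideEvent = {};
try {
  ev.set("user.id", "u_9");
  ev.set("user.segment", "pro");
  ev.set("flag.new_checkout", true);
  ev.set("error.code", "card_declined"); // set only on failure paths
} finally {
  emitted = ev.emit();
}
```

With events shaped like this, the checkout investigation in the bullet above becomes a single aggregation: filter on `user.segment` and `flag.new_checkout`, group by `error.code`, instead of several rounds of log searches.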
Instrumentation-And-Context-Design-Over-Plumbing
- Adopting OpenTelemetry standardizes collection and export, but it does not decide what to instrument or which business context to attach; observability is only as useful as those engineering instrumentation decisions.
- Structured logging (e.g., key-value or JSON) is necessary but does not, by itself, produce useful debugging data without deliberate inclusion of high-value context fields.
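
The "structured but not useful" distinction above can be illustrated with two invented log payloads. Both are valid JSON, but only the second carries the high-value context (error code, user, business identifiers) that auto-instrumentation cannot know to add; the field names are assumptions for illustration.

```typescript
// Both payloads are "structured"; only one answers debugging questions.
const lowValue = { level: "error", msg: "request failed" };

const highValue = {
  level: "error",
  msg: "request failed",
  "error.code": "upstream_timeout",
  "user.id": "u_9",
  "model.provider": "openai", // business context an auto-instrumenter
  "thread.id": "thr_123",     // cannot know to add
};

// The first can only be counted; the second can be segmented and grouped.
console.log(Object.keys(lowValue).length, Object.keys(highValue).length); // prints: 2 6
```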
Economics-And-Platform-Constraints-Cardinality-And-Volume
- High-cardinality logging data is described as expensive and slow primarily on legacy logging systems optimized for searching low-cardinality strings, not as inherently expensive and slow.
- Some logging systems are economically and technically misaligned with debugging because they charge by volume and struggle with high-cardinality fields that are important for debugging (e.g., user/session/trace IDs).
Unknowns
- What measured changes (if any) occurred in MTTR, incident frequency, or on-call time after adopting span annotation, wide-event thinking, and query-based debugging?
- What are the actual observability costs (ingest, storage, query) for the reported T3 Chat logging scale, and how do those costs change under tail sampling?
- How are wide canonical events implemented in practice across services (e.g., how context is accumulated, schema governance, and how the final 'emit once at end' interacts with early failures and partial execution paths)?
- What query latency and completeness is achieved under real incident loads for the described segmentation/group-by workflows (e.g., checkout failures by feature flag and error code)?
- What are the constraints and trade-offs of high field-count schemas (e.g., field explosion, naming consistency, PII handling, schema evolution, and downstream compatibility)?