Post-Migration Production Instability And Architectural Mitigation Via Nitro
Sources: 1 • Confidence: Medium • Updated: 2026-03-02 20:04
Key takeaways
- There is an active investigation involving Pouya and others to determine the true root cause of the TanStack route-loading failures.
- T3Chat was migrated off Next.js to TanStack Start.
- Cloudflare Workers bundle-size limits (about 3MB free and 10MB paid, per the speaker) were exceeded by the team's server-side code, making Cloudflare impractical without significant deployment complexity.
- TanStack Start's routing approach uses code generation plus TypeScript inference to provide end-to-end type-safe route parameters and loaders that gate rendering on required data being fetched.
- On Vercel, the hard maximum request duration was 800 seconds; setting the chat API route's maxDuration to 799 seconds in Next.js forced Fluid compute to provision that route separately from the 800-second routes, isolating long-running chat requests from short ones.
Sections
Post-Migration Production Instability And Architectural Mitigation Via Nitro
- There is an active investigation involving Pouya and others to determine the true root cause of the TanStack route-loading failures.
- The solution required patching TanStack Start server core to expose a Nitro/H3 event binding (getH3EventBinding) so the team could bind and handle Nitro events directly for mixed Nitro and TanStack routes.
- Deployments hit a recurring root-route error in which Node's internal undici-based fetch failed at roughly 60% rollout, producing server errors and clients that failed to render.
- The team suspects that TanStack's lazy route bundling retained many route bundles in memory and, under high concurrency, caused overload consistent with a 'too many open files' (EMFILE) failure mode.
- As a mitigation, the team removed API endpoints from TanStack routing and instead relied on Nitro's route handling for API resolution while keeping TanStack for browser route experience.
- Configuring Nitro with a serverDir in Vite enables defining server functions in a Nitro-preferred structure outside TanStack route definitions while still operating within TanStack Start.
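The serverDir setup described above can be sketched roughly as follows. This is a hypothetical illustration based on the description, not the team's actual configuration; the plugin import path and the exact shape of the `nitro`/`serverDir` option are assumptions:

```typescript
// vite.config.ts -- hypothetical sketch, not confirmed T3Chat config.
// Assumes TanStack Start's Vite plugin accepts a Nitro section whose
// serverDir points at a Nitro-style server/ directory.
import { defineConfig } from 'vite'
import { tanstackStart } from '@tanstack/react-start/plugin/vite'

export default defineConfig({
  plugins: [
    tanstackStart({
      // Server functions and API handlers live in ./server (Nitro layout),
      // outside the TanStack route definitions under ./src/routes, while
      // the app still runs inside TanStack Start.
      nitro: { serverDir: './server' },
    }),
  ],
})
```

The design intent is separation of concerns: Nitro resolves API handlers from its own directory convention, while TanStack Start keeps ownership of browser route definitions.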
Framework-Router Misfit As Primary Migration Driver
- T3Chat was migrated off Next.js to TanStack Start.
- At launch, T3Chat used Next.js but replaced the Next router with a hacked-in React Router setup using rewrites to a static app shell.
- The team intentionally used Next.js in a client-first way, avoided Server Components, and targeted SPA-like navigation speed after the initial JS load.
- tRPC traffic initially broke under the rewrite strategy and required a custom header to keep tRPC requests from being routed to the static app shell.
- After migrating the data layer to Convex, moving auth to WorkOS, and rewriting backend logic with Effect for observability, the speaker attributes remaining major bugs largely to the Next.js-plus-React-Router integration.
- The move off Next.js was driven by a desire for a better SPA experience and a framework that keeps front end and back end deployed together under one package.json, not by a belief that Next.js or Vercel are inherently bad.
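The rewrite-to-app-shell setup with a header escape hatch for tRPC can be sketched as below. The header name, paths, and shell destination are illustrative assumptions, not the actual T3Chat configuration; the sketch only shows the mechanism Next.js provides (`rewrites` with `missing` header conditions):

```javascript
// next.config.js -- hypothetical sketch of the rewrite strategy described above.
module.exports = {
  async rewrites() {
    return {
      beforeFiles: [
        {
          // Send navigations to the static app shell so React Router
          // takes over on the client...
          source: '/:path*',
          destination: '/app-shell.html',
          // ...but skip the rewrite when the custom tRPC marker header
          // is present, so tRPC requests still reach their real API route.
          missing: [{ type: 'header', key: 'x-trpc-source' }],
        },
      ],
    }
  },
}
```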
Platform And Org Constraints Shaping Viable Architectures
- Cloudflare Workers bundle-size limits (about 3MB free and 10MB paid, per the speaker) were exceeded by the team's server-side code, making Cloudflare impractical without significant deployment complexity.
- The team explored exits from Next.js including Remix, React Router's server approach, and a Vite+Hono rewrite targeting Cloudflare, but encountered blockers related to platform and documentation complexity.
- Vercel's Fluid compute changed the economics of long-running AI generation requests by making scaling cheaper than a prior model where each user chat effectively consumed a dedicated Lambda.
- Because the team was very small and lacked dedicated infrastructure staff, managed deployment via Vercel was a practical necessity to preserve engineering velocity.
TanStack Start Routing Guarantees And DX Improvements
- TanStack Start's routing approach uses code generation plus TypeScript inference to provide end-to-end type-safe route parameters and loaders that gate rendering on required data being fetched.
- TanStack Router provides type-safe route parameters by inferring parameter names from the route path string and propagating them into loader/component types.
- The app uses a generated and committed route tree file (*.gen.ts) that should not be manually edited because it will be overwritten by code generation.
- Post-migration, the codebase is described as easier to understand and debug, routing is significantly improved, and implementation ownership shifted from the speaker's personal hacks to the team (notably Mark and Julius).
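The path-string inference described above can be illustrated with a simplified, self-contained sketch. This is not TanStack Router's actual implementation; `ExtractParams` and `extractParams` are invented names showing how TypeScript template literal types can derive `$param` names from a route path:

```typescript
// Compile-time: derive a params object type from a route path string,
// in the spirit of TanStack Router inferring `$postId` -> { postId: string }.
type ExtractParams<Path extends string> =
  Path extends `${string}$${infer Param}/${infer Rest}`
    ? { [K in Param]: string } & ExtractParams<Rest>
    : Path extends `${string}$${infer Param}`
      ? { [K in Param]: string }
      : {};

// '/posts/$postId/comments/$commentId' -> { postId: string; commentId: string }
type Demo = ExtractParams<'/posts/$postId/comments/$commentId'>;

// Runtime counterpart: match URL segments against `$`-prefixed path segments.
function extractParams<P extends string>(path: P, url: string): ExtractParams<P> {
  const params: Record<string, string> = {};
  const pathSegs = path.split('/');
  const urlSegs = url.split('/');
  pathSegs.forEach((seg, i) => {
    if (seg.startsWith('$')) params[seg.slice(1)] = urlSegs[i] ?? '';
  });
  return params as unknown as ExtractParams<P>;
}
```

Because the params type is derived from the same path string the router matches at runtime, a loader or component receiving `ExtractParams<P>` cannot reference a parameter the path does not define, which is the end-to-end guarantee the bullet points describe.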
Request-Duration Isolation And Cost/Perf Tuning Regressions
- On Vercel, the hard maximum request duration was 800 seconds; setting the chat API route's maxDuration to 799 seconds in Next.js forced Fluid compute to provision that route separately from the 800-second routes, isolating long-running chat requests from short ones.
- Vercel's Fluid compute changed the economics of long-running AI generation requests by making scaling cheaper than a prior model where each user chat effectively consumed a dedicated Lambda.
- In the current TanStack/Nitro setup, maxDuration appears to be configurable only at a broader level that groups multiple API routes together, removing the prior ability to split the chat endpoint from other API endpoints.
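In Next.js App Router terms, the prior isolation tactic looked roughly like the following route segment config; the file path is illustrative, and this is a sketch of the technique rather than T3Chat's actual handler:

```typescript
// app/api/chat/route.ts -- sketch of the per-route isolation tactic.
// 799s sits just under the 800s plan maximum, so Fluid compute provisions
// this route in its own function instead of pooling it with 800s routes.
export const maxDuration = 799;

export async function POST(req: Request): Promise<Response> {
  // Long-running AI generation/streaming would happen here.
  return new Response('ok');
}
```

It is this per-file `maxDuration` export, evaluated per route, that the current TanStack/Nitro setup reportedly lacks an equivalent for.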
Watchlist
- There is an active investigation involving Pouya and others to determine the true root cause of the TanStack route-loading failures.
Unknowns
- What objective reliability changes occurred after the migration (incident rate, error budgets, rollback frequency), especially for routing and API transport issues previously attributed to Next.js+React Router?
- What is the confirmed root cause of the rollout-correlated undici/fetch failures and route-loading instability, and what upstream or local fix resolves it without ongoing patch maintenance?
- Does the TanStack/Nitro-on-Vercel deployment support per-route (or per-handler) duration/concurrency configuration comparable to the prior Next.js maxDuration isolation tactic for chat generation?
- What are the measured performance outcomes of the new architecture (TTFB, LCP, click-to-render navigation latency), and how do SSR and loader-gated rendering contribute to or harm those metrics?
- What is the long-term plan for mixed data access (Convex hydration plus the retained tRPC client), and what portion of server-side needs still require tRPC?