Scaling Agents: Orchestration, Composition, And Toolset Growth Management
Sources: 1 • Confidence: Medium • Updated: 2026-04-15 04:15
Key takeaways
- Simon Last stated that a “manager agent” layer can supervise dozens of specialized agents and reduce noisy notifications (example given: from about 70/day to about 5), while helping debug failures.
- Simon Last stated that he is bullish on CLIs over MCP in some contexts because CLIs provide progressive disclosure in the terminal and let agents debug and fix their own toolchain in the same environment when failures occur.
- Sarah Sachs stated that Notion observes latency and quality variability across model providers and even across vendors serving what is presented as the same model, including evidence consistent with undisclosed quantization differences.
- Sarah Sachs stated that Notion prefers to partner with external wearable/hardware companies for in-person meeting capture rather than building the wearable itself.
- Sarah Sachs stated that Notion involves security review very early in AI feature development because late involvement causes more slowdown, tension, and weaker product outcomes.
Sections
Scaling Agents: Orchestration, Composition, And Toolset Growth Management
- Simon Last stated that a “manager agent” layer can supervise dozens of specialized agents and reduce noisy notifications (example given: from about 70/day to about 5), while helping debug failures.
- Simon Last stated that as Notion’s agent grew to 100+ tools, the full tool list became token-expensive and a quality risk, which drove the move toward progressive disclosure and tool search.
- Simon Last stated that managing a fleet of AI agents can be treated as a rigorous technical system with explicit states (e.g., blocked) and explicit flows.
- Simon Last stated that a low-intervention agent system needs a human-readable specification layer (e.g., committed markdown or Notion pages/databases) plus a strong evaluation and testing loop.
- Simon Last stated that Notion intentionally has no special built-in “memory” concept for agents, and instead uses pages and databases as the memory substrate that both humans and agents can edit.
- Sarah Sachs stated that Notion’s shift from few-shot prompt examples to goal-described tool definitions let Notion distribute tool ownership across many teams instead of relying on a centralized prompt “center of excellence.”
Platform Posture: Protocol/Tool Interface Tradeoffs And Model Optionality
- Simon Last stated that he is bullish on CLIs over MCP in some contexts because CLIs provide progressive disclosure in the terminal and let agents debug and fix their own toolchain in the same environment when failures occur.
- Sarah Sachs stated that Notion leadership does not view training a foundation model as a necessary core competency unless forced by a market gap.
- Sarah Sachs stated that Notion prefers building integrations natively first and then generalizing them, and that native apps like Notion Mail enable purpose-built tools optimized for latency, performance, and quality.
- Sarah Sachs stated that Notion sometimes avoids using MCP for key capabilities like cross-tool search in order to maintain tighter control over quality and the agent’s search trajectory.
- Simon Last stated that Notion maintains internal abstractions for tools, agents, completion calls, and chat archetypes so it can swap underlying AI and integration components as the ecosystem changes.
- Simon Last stated that MCP is best suited for narrow, lightweight agents that need tightly permissioned tool-only access, while CLIs remain broadly valuable for agent workflows.
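The internal abstractions Simon Last described (tools, agents, completion calls) amount to programming against interfaces so providers can be swapped as the ecosystem shifts. A minimal sketch under assumed names (`CompletionBackend`, `EchoBackend`, `Agent` are all hypothetical):

```python
from typing import Protocol

class CompletionBackend(Protocol):
    """Interface every model provider adapter must satisfy."""
    def complete(self, prompt: str) -> str: ...

class EchoBackend:
    # Stand-in for a real provider client; swapping vendors means swapping
    # this class, not the agent code that calls it.
    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"

class Agent:
    def __init__(self, backend: CompletionBackend, tools: dict):
        self.backend = backend
        self.tools = tools

    def run(self, task: str) -> str:
        return self.backend.complete(task)

agent = Agent(EchoBackend(), tools={})
print(agent.run("summarize"))  # → echo: summarize
```

Because `Agent` depends only on the `CompletionBackend` protocol, a new provider (or a re-hosted version of the same model) drops in without touching calling code, which is what makes the model optionality described above cheap to exercise.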
Reliability Operations: Distributed Eval Ownership And Provider Variance
- Sarah Sachs stated that Notion observes latency and quality variability across model providers and even across vendors serving what is presented as the same model, including evidence consistent with undisclosed quantization differences.
- Sarah Sachs stated that Notion designs “frontier/headroom” evals targeting roughly a 30% pass rate and has dedicated staff working on “Notion’s last exam” to keep evals sensitive to future improvements.
- Simon Last stated that Notion centralizes the evaluation framework but requires each team to own and run its own evals (often in CI or nightly), with automated triggers to review major failures when models or harnesses change.
- Sarah Sachs stated that Notion avoids fine-tuning models on Notion’s rapidly changing internal tools because retraining for each tool change would slow shipping velocity and likely lose to frontier model improvements during the build-to-train lag.
- Sarah Sachs stated that Notion collaborates with frontier labs using multiple pre-release model snapshots and has seen cases where the shipped model was not the preferred snapshot.
- Simon Last stated that Notion’s engineering approach emphasizes improving the agent outer loop (tools, harness, verification) because most agent failures are attributed to tool bugs rather than to problems that would require model training.
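The ~30% target pass rate for headroom evals can be made concrete with a small harness sketch. This is an illustrative reconstruction, not Notion’s framework; the function names and thresholds (`run_suite`, `classify`, the 0.05 regression margin, the ±0.15 headroom band) are assumptions:

```python
def run_suite(cases, model) -> float:
    """Run (input, expected) pairs through a model callable; return pass rate."""
    passed = sum(1 for inp, want in cases if model(inp) == want)
    return passed / len(cases)

def classify(pass_rate: float, baseline: float,
             headroom_target: float = 0.30, tol: float = 0.15) -> str:
    # A "frontier" suite is tuned so current models pass roughly 30%:
    # too easy and it saturates, too hard and it cannot detect progress.
    if pass_rate < baseline - 0.05:
        return "regression: trigger failure review"
    if abs(pass_rate - headroom_target) <= tol:
        return "healthy headroom eval"
    return "recalibrate suite difficulty"

# Each team owns its own suite and runs this in CI or nightly.
suite = [("a", "A"), ("b", "B"), ("c", "X")]
rate = run_suite(suite, model=str.upper)   # 2 of 3 pass
print(classify(rate, baseline=0.60))
```

A drop below the baseline margin maps to the automated failure-review trigger described above; a suite that drifts far from the headroom target gets recalibrated rather than celebrated.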
Agent-Driven Retrieval And Meeting Notes As A Data And Workload Driver
- Sarah Sachs stated that Notion prefers to partner with external wearable/hardware companies for in-person meeting capture rather than building the wearable itself.
- Sarah Sachs stated that Notion is reinvesting in retrieval and ranking because most search traffic now comes from agents rather than humans.
- Sarah Sachs stated that Notion’s retrieval work is shifting away from optimizing vector embeddings toward agentic query generation and parallel query diversity, treating ranking, query generation, and retrieval as a single journey.
- Sarah Sachs stated that Notion Meeting Notes has become a major growth lever (virality, adoption, retention) and increases content volume, driving scaling needs for search and agents.
- Sarah Sachs stated that Notion’s internal teams run meetings with agents that generate pre-reads from Slack/GitHub, create meeting notes, and trigger follow-up tasks and Slack messages via calendar-integrated automation.
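The shift toward agentic query generation with parallel query diversity can be sketched with a standard rank-fusion step. This is a generic illustration, not Notion’s retrieval stack: `generate_queries` stands in for an LLM rewriting step, and reciprocal rank fusion is one common way (an assumption here) to merge parallel result lists into a single ranking:

```python
from collections import defaultdict

def generate_queries(task: str) -> list[str]:
    # Stand-in for an LLM step that expands one task into diverse queries.
    return [task, f"{task} decision", f"{task} owner"]

def rrf_merge(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: combine rankings from parallel queries."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def search(index: dict[str, list[str]], task: str) -> list[str]:
    # Query generation, retrieval, and ranking treated as one journey.
    queries = generate_queries(task)
    return rrf_merge([index.get(q, []) for q in queries])

toy_index = {
    "launch plan": ["d1", "d2"],
    "launch plan decision": ["d2", "d3"],
    "launch plan owner": ["d3", "d2"],
}
print(search(toy_index, "launch plan"))  # → ['d2', 'd3', 'd1']
```

Documents that appear across several diverse queries (here `d2`) accumulate score from each list, which is why query diversity rather than a single embedding lookup drives the ranking.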
Enterprise Constraints: Permissions And Security As Primary Bottlenecks
- Sarah Sachs stated that Notion involves security review very early in AI feature development because late involvement causes more slowdown, tension, and weaker product outcomes.
- Simon Last stated that custom agents can be set up and debugged through the same chat used to run them, and that they cannot edit their own permissions unless a human enters an explicit confirmation mode.
- Sarah Sachs stated that custom agents required multiple redesigns because permissioning is complex when an agent is shared across channels/groups and document visibility differs across audiences.
- Simon Last stated that MCP is attractive for narrow or lightweight agents because it has a strong permission model, while CLI-based approaches introduce token and exfiltration risks.
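The rule that an agent cannot edit its own permissions without an explicit human confirmation mode can be sketched as a one-shot guard. A hypothetical illustration (class and method names are assumptions, not Notion’s API):

```python
class AgentPermissions:
    """Scopes an agent holds; self-edits require explicit human confirmation."""

    def __init__(self, scopes: set[str]):
        self.scopes = set(scopes)
        self._human_confirmed = False

    def enter_confirmation_mode(self) -> None:
        # Reachable only from a human-facing UI action, never from the
        # agent's own tool calls.
        self._human_confirmed = True

    def grant(self, scope: str, requested_by_agent: bool) -> None:
        if requested_by_agent and not self._human_confirmed:
            raise PermissionError("agent may not edit its own permissions")
        self.scopes.add(scope)
        self._human_confirmed = False  # confirmation is one-shot

perms = AgentPermissions({"read:pages"})
perms.enter_confirmation_mode()          # human explicitly approves
perms.grant("write:pages", requested_by_agent=True)
print(sorted(perms.scopes))
```

Making the confirmation one-shot means a single human approval cannot be replayed by the agent for later, unreviewed permission changes.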
Watchlist
- Sarah Sachs stated that Notion observes latency and quality variability across model providers and even across vendors serving what is presented as the same model, including evidence consistent with undisclosed quantization differences.
Unknowns
- What are the actual post-trial retention and paid conversion rates for custom agents after the three-month free period ends?
- What is the concrete definition and measured percentage behind “most search traffic now comes from agents,” and how has it changed over time?
- What is the exact shipped behavior of direct agent-to-agent invocation (availability date, permission model, auditability), and is the recursion limit configurable per workspace?
- What are the observed quantitative impacts of progressive disclosure/tool search on token cost, latency, and tool-misfire rate in production?
- How does Notion validate and attribute provider-to-provider model variance (e.g., controlled A/Bs across vendors, snapshot pinning, regression attribution)?