Evaluation, Observability, And Harness Quality Become Differentiators
Sources: 1 • Confidence: Medium • Updated: 2026-03-08 21:28
Key takeaways
- Agents can repeat known-bad actions if those mistakes remain in the context trace, and context pruning can reduce repeated failure loops.
- Box's first investor connection originated at a TechCrunch house party; Emily Melton later brought Box into DFJ for its Series A.
- Enterprise file repositories can shift from passive storage to a continuously queried and transformed knowledge source when agents can search and synthesize their contents.
- Most enterprise knowledge work is harder to agentify than coding because data access is restricted and fragmented, important inputs are non-text (e.g., calls and in-person interactions), and documentation practices are weak.
- Box is hiring for a DevRel role to support its push around the message that "every agent needs a box."
Sections
Evaluation, Observability, And Harness Quality Become Differentiators
- Agents can repeat known-bad actions if those mistakes remain in the context trace, and context pruning can reduce repeated failure loops.
- Box uses a held-out multi-industry document evaluation scored by a rubric and has shifted from one-shot model testing to more agentic testing that measures both the harness and the model.
- Agent observability and evaluation tooling is expected to become a massive market because enterprises will need evaluations for many internal pipelines.
- Frontier models generally do not match humans' explore–exploit behavior when searching.
- Box partnered with Apex by sharing data about how different professions structure document workspaces (e.g., legal and banking).
- Box's private evaluation is held out from model labs and not based on public data, reducing vendors' ability to train against it.
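The context-pruning point above can be made concrete with a minimal sketch. This is not Box's implementation; the `Step` record, the `failed` flag, and the keep-one-recent-failure policy are all illustrative assumptions about how a harness might drop known-bad actions from the trace so the agent stops imitating them.

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str        # e.g. a tool call the agent attempted
    observation: str   # what came back
    failed: bool       # whether this step is a known-bad action

def prune_context(trace: list[Step], keep_recent_failures: int = 1) -> list[Step]:
    """Drop older failed steps so the agent does not keep imitating them.

    Keeps all successful steps, plus at most `keep_recent_failures` of the
    most recent failures as a warning signal (a hypothetical policy).
    """
    failures_kept = 0
    pruned: list[Step] = []
    for step in reversed(trace):  # walk newest-to-oldest
        if step.failed:
            if failures_kept >= keep_recent_failures:
                continue  # older failure: prune it from the context
            failures_kept += 1
        pruned.append(step)
    pruned.reverse()  # restore chronological order
    return pruned

trace = [
    Step("search('q1')", "no results", failed=True),
    Step("search('q1')", "no results", failed=True),   # repeated known-bad action
    Step("search('q2')", "3 documents", failed=False),
]
pruned = prune_context(trace)
print([s.action for s in pruned])
```

Only the most recent failure survives pruning, so the repeated dead-end query no longer dominates the trace the model conditions on.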
Go-To-Market Shifts: Direct Media, Devrel, And Adoption Costs
- Box's first investor connection originated at a TechCrunch house party; Emily Melton later brought Box into DFJ for its Series A.
- A modern TechCrunch-like entity is valuable as a launchpad that attracts people during creative building and fundraising moments, regardless of contest format.
- Companies will increasingly need to operate as media channels to communicate directly with audiences in a "go direct" model.
- Developer relations demand is surging because products must get their services and APIs adopted by AI agents, making visibility to agents a content and marketing problem.
- DevRel talent scarcity is driven in part by top practitioners being able to earn more independently via creator-economy channels such as Substack, YouTube, and Patreon.
- Even if AI increases code output and features per dollar, companies may still spend as much on getting software adopted by customers due to attention competition and fast-changing technical requirements.
Enterprise Content Becomes Agent Substrate (Retrieval Dominates)
- Enterprise file repositories can shift from passive storage to a continuously queried and transformed knowledge source when agents can search and synthesize their contents.
- User trust in enterprise knowledge agents collapses quickly when retrieval is wrong, making model judgment and ranking quality critical to prevent abandonment after a small number of failures.
- A key unsolved agentic search capability is determining when to stop searching and when to admit there is no answer, because models may return partial results without recognizing missing items.
- Large or "infinite" context windows do not eliminate enterprise context needs today because cost and effective token limits still require selecting a small relevant slice from massive corpora, keeping search and context engineering essential.
- In enterprise agent workflows, both read and write are fundamental, but read is currently the harder problem because the ratio of total content to task-relevant content is extremely high.
- High-quality authoring in complex formats like PowerPoint remains difficult for current models because small formatting inconsistencies are highly visible to end users.
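The selection problem described above (pulling a small relevant slice from a massive corpus under a token budget) can be sketched as a greedy rank-and-fill loop. This is an illustrative toy, not Box's ranking: the term-overlap score and the word-count token estimate are stand-in assumptions for a real retriever and tokenizer.

```python
def select_context(chunks: list[str], query_terms: set[str], token_budget: int) -> list[str]:
    """Greedy sketch: score chunks by query-term overlap, then fill the budget."""
    def score(chunk: str) -> int:
        return len(set(chunk.lower().split()) & query_terms)

    ranked = sorted(chunks, key=score, reverse=True)  # most relevant first
    selected, used = [], 0
    for chunk in ranked:
        cost = len(chunk.split())  # crude token estimate for illustration
        if score(chunk) == 0 or used + cost > token_budget:
            continue  # skip irrelevant chunks and anything that busts the budget
        selected.append(chunk)
        used += cost
    return selected

corpus = [
    "quarterly revenue report for emea",
    "holiday party photos",
    "revenue forecast model assumptions",
]
slice_ = select_context(corpus, {"revenue", "forecast"}, token_budget=10)
print(slice_)
```

Even with an "infinite" window, this kind of budgeted selection is what keeps cost bounded: the agent sees only the two revenue-related chunks, never the irrelevant one.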
Workflow Re-Engineering And Information Hygiene As Prerequisites For Agent ROI
- Most enterprise knowledge work is harder to agentify than coding because data access is restricted and fragmented, important inputs are non-text (e.g., calls and in-person interactions), and documentation practices are weak.
- Enterprises will need to re-engineer workflows to make agents effective because work will adapt to the agent model rather than agents fully adapting to existing workflows.
- AI labs are reportedly hiring field deployment and professional services roles and embedding with large enterprises.
- Companies that systematically document and digitize tacit knowledge are likely to gain a productivity premium by reducing ramp time, rework, and performance variance.
- A static "skills file" encoding company knowledge is insufficient because business reality changes frequently and requires ongoing updates that today mostly come from humans.
- Agents will pressure organizations to improve documentation and maintain authoritative, up-to-date information because wrong data and lack of agent enablement are both costly.
Box Platform Positioning Around Agents And Ecosystem
- Box is hiring for a DevRel role to support its push around the message that "every agent needs a box."
- Box reports that 67% of the Fortune 500 are customers.
- Box treats agents as a third type of customer distinct from human users and applications, requiring changes across stack layers such as metadata and search.
- Aaron Levie's public mini-essays are produced via a deliberate flywheel that turns internal problems into public posts and feedback, then brings external learnings back into Box.
- Box expects third-party agents to use Box as a sandboxed file-system workspace to store working artifacts (e.g., memory files, specs, generated documents) for collaboration and sharing.
Watchlist
- AI-generated entertainment risks devolving into an endless feed of low-quality content rather than preserving film as a form of art.
- Notable filmmakers are beginning to adopt AI; Aaron Levie expressed his belief that Darren Aronofsky has released or will release an AI film.
Unknowns
- What are actual enterprise adoption metrics for agents (agent identities provisioned, tasks executed per employee, and retention) versus the forecast of many agents per human?
- How frequently are real security incidents (prompt injection, lateral movement, data exfiltration) occurring in enterprise agent deployments, and what controls would have prevented them?
- What quantitative evidence supports the claim that retrieval errors drive rapid user abandonment (e.g., error thresholds, first-week retention curves, and impact of provenance UX)?
- How does Box’s private held-out evaluation correlate with production outcomes (task success, latency, user satisfaction) across different models and harness designs?
- What is the practical effectiveness and cost impact of context pruning to reduce repeated failure loops across common enterprise tasks?