Rosa Del Mar

Daily Brief

Issue 7 2026-01-07

Evaluation and Measurement as Primary Lever

9 min read
General
Sources: 1 • Confidence: Medium • Updated: 2026-02-20 09:11

Key takeaways

  • Foody says he is most interested in identifying which rubrics and data types most improve model performance and how to apply human labor more efficiently to frontier AI training problems.
  • Foody argues employers should evaluate what candidates can accomplish while using powerful AI tools (e.g., via screen-recorded timed build tasks) rather than trying to prevent AI assistance in assessments.
  • Mercor pays high expert wages (e.g., $150/hour for poets) because a one-time expert teaching effort can be amortized across very large numbers of model users.
  • Foody expects AI to rapidly automate roughly 50–75% of expert knowledge work, while the remaining share that genuinely requires human expertise stays difficult for a long time.
  • Foody expects generating reliable evaluation and training signals for very long-horizon tasks (e.g., 100 hours to 100 days) to become a primary bottleneck to further model improvement.

Sections

Evaluation and Measurement as Primary Lever

  • Foody says he is most interested in identifying which rubrics and data types most improve model performance and how to apply human labor more efficiently to frontier AI training problems.
  • Foody argues the AI field has overemphasized abstract 'intelligence' and should focus on measuring and building capabilities enterprises find useful.
  • For subjective domains like poetry, Mercor uses grading-style rubrics to evaluate model outputs against desired characteristics.
  • Apex (AI Productivity Index) was created to measure model performance on economically valuable real-world tasks rather than academic benchmarks.
  • Apex builds task suites by surveying domain experts on time allocation and then collecting prompts and rubrics for each work bucket, using time-spend as a proxy for economic value weighting.
  • A single aggregate benchmark score does not translate uniformly to economic value because some domains require near-perfect accuracy (e.g., medicine) while others tolerate iteration (e.g., legal drafts, consulting).

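The time-spend weighting described above can be sketched in a few lines. This is a hypothetical illustration of the idea, not Apex's actual specification: the bucket names, scores, and hours below are all assumed, and the per-bucket scores stand in for whatever rubric-based pass rates the suite produces.

```python
# Hypothetical sketch of time-spend-weighted scoring: each work "bucket"
# gets a weight proportional to how much time surveyed experts report
# spending on it, and a model's per-bucket rubric scores are combined
# into one weighted aggregate. All names and numbers are illustrative.

def weighted_benchmark_score(bucket_scores, time_spend_hours):
    """Aggregate per-bucket scores, weighting each bucket by expert time-spend."""
    total_hours = sum(time_spend_hours.values())
    return sum(
        score * (time_spend_hours[bucket] / total_hours)
        for bucket, score in bucket_scores.items()
    )

# Illustrative example: three work buckets for a single occupation.
scores = {"drafting": 0.80, "review": 0.60, "client_comms": 0.50}
hours = {"drafting": 20, "review": 10, "client_comms": 10}

print(round(weighted_benchmark_score(scores, hours), 3))  # 0.675
```

Note how the weighting encodes the economic-value proxy: a bucket that consumes half the experts' time contributes half the aggregate, which is exactly why a single headline number can mislead when accuracy requirements differ sharply across buckets.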
Labor Markets and Hiring: Shift Toward Work-Sample, Tool-Inclusive Assessment

  • Foody argues employers should evaluate what candidates can accomplish while using powerful AI tools (e.g., via screen-recorded timed build tasks) rather than trying to prevent AI assistance in assessments.
  • Foody claims interviews often fail because they overweight vibes and similarity rather than measuring job-relevant skills, and a cleaner solution is to assign and grade a realistic project.
  • Foody says Mercor hired Sandeep Jain, former Chief Product Officer and Chief Technology Officer at Uber, as a deep domain expert in labor markets.
  • Foody claims labor markets are inefficient because applications and hiring are disaggregated, and while a centralized aggregator could improve matching, the main blocker is accurately predicting job performance from sparse profiles like LinkedIn.
  • Foody describes Mercor as operating at the intersection of labor markets and AI research, with deep labor-market expertise built into the company’s DNA.
  • Foody claims that for non-technical roles where realistic projects are hard to curate, prior similar-role experience and deep reference checks are better proxies, and that predicting growth rate is harder than measuring initial capability.

Data Pipelines and Quality Control for Post-Training

  • Foody says he is most interested in identifying which rubrics and data types most improve model performance and how to apply human labor more efficiently to frontier AI training problems.
  • Mercor pays high expert wages (e.g., $150/hour for poets) because a one-time expert teaching effort can be amortized across very large numbers of model users.
  • Mercor recruits and manages domain experts who create examples and evaluations used to train leading AI models to perform better on specific tasks.
  • Mercor sometimes uses AI models to flag low-effort or fatigued human contributors by comparing their work against expert-created rubrics and evaluations.
  • Models struggle most in domains where correct approaches are not written down and instead are tacit in experts’ heads, requiring pretraining exposure or post-training expert-created datasets to transfer that knowledge.
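The amortization argument behind those expert wages is simple fixed-cost arithmetic. Only the $150/hour rate comes from the text above; the hours of work and the number of downstream users are assumptions chosen for illustration.

```python
# Illustrative arithmetic for the amortization claim: a one-time expert
# teaching effort is a fixed cost spread across every model user who
# benefits from it. The $150/hour rate is from the interview; the 100
# hours and 10 million users are assumed for the example.

def cost_per_user(hourly_rate, hours_of_expert_work, num_users):
    """One-time expert labor cost divided across all downstream model users."""
    return (hourly_rate * hours_of_expert_work) / num_users

total = 150 * 100  # $15,000 one-time cost for 100 hours of a poet's time
per_user = cost_per_user(150, 100, 10_000_000)
print(f"${per_user:.4f} per user")  # $0.0015 per user
```

Even at a premium hourly rate, the per-user cost collapses to fractions of a cent once the taught capability ships in a widely used model, which is the economic logic the bullet above describes.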

Macro and Organizational Expectations: Elasticity and Work Recomposition

  • Foody expects AI to rapidly automate roughly 50–75% of expert knowledge work, while the remaining share that genuinely requires human expertise stays difficult for a long time.
  • Foody claims demand for software is extremely price elastic, so making engineers far more productive could increase the number of software engineers and total software produced rather than reduce engineering employment.
  • Foody reports that obtaining small amounts of seed working capital can be difficult until a venture reaches scale, after which it can become heavily capitalized.
  • Foody expects much of the economy to become a reinforcement-learning-like environment where workers shift from repeating tasks to making fixed-cost investments in teaching agents once and reusing them many times.
  • Foody claims building businesses (product and distribution) is highly price elastic, implying more company formation and go-to-market experimentation as costs fall.

Agentic Systems Bounded by Time Horizon and Eval Scarcity

  • Foody expects generating reliable evaluation and training signals for very long-horizon tasks (e.g., 100 hours to 100 days) to become a primary bottleneck to further model improvement.
  • Foody claims model capability is strongly constrained by task time horizon, with short chat-window tasks becoming superhuman earlier than longer-horizon actions (e.g., drafting emails, scheduling meetings, or running a 90-day startup effort).
  • Foody says Mercor’s next major goal is scaling 'super realistic evaluations' that measure how AI models use tools over long trajectories that would take humans days or weeks.
  • Foody expects long-horizon task performance and extensive tool use to improve sharply once measured, and would be surprised if models are not highly capable on these within 6–12 months.

Watchlist

  • Foody expects generating reliable evaluation and training signals for very long-horizon tasks (e.g., 100 hours to 100 days) to become a primary bottleneck to further model improvement.
  • Foody says he is most interested in identifying which rubrics and data types most improve model performance and how to apply human labor more efficiently to frontier AI training problems.

Unknowns

  • What is the full public specification of Apex (task taxonomy, sampling, scoring, rater protocols, weighting scheme, and versioning), and is it reproducible by third parties?
  • Are the claimed 25–30% per-year improvement rate on economically valuable tasks and the cited ~64% level for GPT-5 corroborated by published results on a stable benchmark distribution?
  • Do long-horizon, tool-using evaluations exist at scale today, and if they do, what are their failure modes (reward hacking, rater drift, environment instability) and reliability properties?
  • How large is the gap between model performance on short-horizon tasks versus long-horizon tasks when measured with consistent protocols, and where does reliability collapse as horizon increases?
  • Which rubric designs and data types produce the largest marginal model improvements per unit of expert labor, and how sensitive are results to rater disagreement and aggregation choices?

Investor overlay

Read-throughs

  • Evaluation artifacts and long-horizon task measurement could become a key bottleneck, implying increased demand for tools, services, and datasets that produce reliable rubrics, task suites, and scoring for agentic, tool-using workflows over days and weeks.
  • High-wage expert labor for post-training data and evaluations may be economically viable when teaching effort is reusable at scale, supporting businesses that can recruit, quality-control, and operationalize specialized raters and domain experts efficiently.
  • Hiring may shift toward work-sample, tool-inclusive assessments that measure AI-augmented productivity, implying growth for platforms that deliver validated, high-signal performance prediction and reduce reliance on resumes, interviews, and vibe-based screening.

What would confirm

  • Publicly reproducible specifications for task taxonomy, sampling, scoring, rater protocols, weighting, and versioning that third parties can run and obtain stable results across time and rater pools.
  • Demonstrated reliable long-horizon evaluations at scale with quantified inter-rater reliability, low rater drift, and documented defenses against reward hacking and environment instability.
  • Evidence that rubric and data type choices produce consistent marginal performance gains per unit of expert labor, with transparent aggregation methods and sensitivity analysis to rater disagreement.

What would kill

  • Independent attempts fail to reproduce evaluation results due to shifting distributions, unclear protocols, or unstable scoring, making the measurement product non-portable and hard to trust for optimization decisions.
  • Long-horizon benchmarks show severe reliability collapse as horizon increases, with high variance, frequent reward hacking, or rater drift that cannot be corrected without prohibitive human supervision.
  • Reported improvement rates on economically valuable tasks do not hold on a stable benchmark distribution, suggesting that gains are driven by benchmark churn, narrow task selection, or aggregation artifacts.

Sources