Rosa Del Mar

Daily Brief

Issue 80 2026-03-21

LLM-Driven Profiling Pipeline and Evaluation Caveats

6 min read
General
Sources: 1 • Confidence: High • Updated: 2026-04-12 10:18

Key takeaways

  • The described profiling workflow is to fetch roughly the last 1,000 of a user's HN comments, copy them out via a tool, and paste them into an LLM with the prompt "profile this user".
  • The Algolia Hacker News API can list a user's most recent comments sorted by date using tags formatted as "comment,author_<username>" with hitsPerPage up to 1000.
  • The author finds it creepy that substantial information about someone can be derived easily from publicly shared content available via an API.
  • The author expects the model inferred his real name because his HN comments frequently link to his own website, providing URLs that connect the account to a public persona.
  • The author mainly uses generated profiles to avoid getting drawn into extended arguments with people who have a history of bad-faith debate.

Sections

LLM-Driven Profiling Pipeline and Evaluation Caveats

  • The described profiling workflow is to fetch roughly the last 1,000 of a user's HN comments, copy them out via a tool, and paste them into an LLM with the prompt "profile this user".
  • The author reports that LLM profiling based on a user's recent HN comments can be startlingly effective.
  • The author runs the profiling prompt in incognito mode as an attempt to reduce the chance the model recognizes him and responds with overly flattering output.
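The final step of the workflow above can be sketched as simple prompt assembly. This is an illustrative sketch, not the author's tooling: the function name and the character budget are assumptions; only the prompt text "profile this user" comes from the source.

```python
# Illustrative prompt assembly for the "profile this user" step.
# Assumes the comment bodies have already been retrieved; the
# max_chars cap is a hypothetical context-window safeguard.
def build_profile_prompt(comments: list[str], max_chars: int = 200_000) -> str:
    """Join recent comments under the short prompt described in the brief."""
    body = "\n\n---\n\n".join(comments)
    return ("profile this user\n\n" + body)[:max_chars]
```

The resulting string would then be pasted into an LLM session, which in the described workflow is done manually (and in incognito mode, to reduce the chance the model recognizes the subject).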

Low-Friction Access to Per-User Public Comment History

  • The Algolia Hacker News API can list a user's most recent comments sorted by date using tags formatted as "comment,author_<username>" with hitsPerPage up to 1000.
  • The Algolia Hacker News API is served with open CORS headers, allowing it to be called from JavaScript on any web page.
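The retrieval step can be sketched against the public Algolia Hacker News Search API. The `search_by_date` endpoint, the `tags` format, and the `hitsPerPage` cap are as described above; the helper names are our own, and error handling is omitted for brevity.

```python
# Sketch: fetch a user's most recent HN comments via the Algolia HN API.
import json
import urllib.parse
import urllib.request

API = "https://hn.algolia.com/api/v1/search_by_date"

def comments_url(username: str, hits: int = 1000) -> str:
    """Build a query for a user's most recent comments, newest first."""
    params = {
        "tags": f"comment,author_{username}",  # AND of both tags
        "hitsPerPage": hits,                   # the API caps this at 1000
    }
    return f"{API}?{urllib.parse.urlencode(params)}"

def fetch_comments(username: str) -> list[str]:
    """Return the comment bodies (HTML-encoded) for one user."""
    with urllib.request.urlopen(comments_url(username)) as resp:
        hits = json.load(resp)["hits"]
    return [h.get("comment_text", "") for h in hits]
```

Because the API is served with open CORS headers, the same query could equally be issued from browser JavaScript on any page, which is what makes the access low-friction.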

Privacy, Publication Norms, and Governance Pressure from Inference (Not Collection)

  • The author finds it creepy that substantial information about someone can be derived easily from publicly shared content available via an API.
  • The author considers it invasive to quote profiles generated about other users and therefore only shares an example profile produced about himself.

Potential Deanonymization via Self-Linkage and Observed Limits

  • The author expects the model inferred his real name because his HN comments frequently link to his own website, providing URLs that connect the account to a public persona.
  • The author reports not having seen profiling outputs guess real names for other accounts he has profiled.

Defensive Reputation Assessment as an Individual Moderation Behavior

  • The author mainly uses generated profiles to avoid getting drawn into extended arguments with people who have a history of bad-faith debate.

Watchlist

  • The author finds it creepy that substantial information about someone can be derived easily from publicly shared content available via an API.

Unknowns

  • What are the practical API constraints (rate limits, pagination behavior beyond 1000, retention window, availability guarantees) for at-scale per-user comment retrieval?
  • How accurate and consistent are LLM-generated profiles when evaluated against consenting ground truth across many users and multiple models/prompts?
  • How large is the effect of subject recognition and account/context leakage (including sycophancy) on profiling outputs?
  • What is the true frequency of real-name or identity guesses, and how strongly is it correlated with self-linking behavior (links to personal websites, handles, etc.)?
  • What norms, policies, or technical restrictions (platform-level or tool-level) will emerge regarding republishing or operationalizing profiles derived from public forum text?

Investor overlay

Read-throughs

  • Lower-friction access to public comment histories enables new client-side profiling and reputation tools, implying potential growth in LLM-powered moderation and user-risk-scoring products if platforms allow it.
  • Privacy discomfort driven by inference rather than new collection suggests rising governance pressure on platforms and tool builders, potentially leading to restrictions on bulk access or on republishing derived profiles.
  • Observed identity linkage via self-links indicates demand for deanonymization-risk controls and redaction tooling for individuals and communities, especially where pseudonymity is valued.

What would confirm

  • Platforms introduce or tighten rules about automated profiling, republishing profiles, or using public forum text for reputation assessment, alongside clearer enforcement guidance.
  • API providers add bulk-access features for per-user histories, or tooling ecosystems emerge around client-side retrieval and summarization, indicating sustained demand rather than one-off experiments.
  • Community moderators or users adopt LLM summaries to decide engagement, with documented workflows and norms that treat such profiling as standard practice.

What would kill

  • APIs reduce hitsPerPage, enforce stricter rate limits, remove open CORS access, or require authentication, materially raising friction for at-scale per-user retrieval.
  • Evaluations show LLM generated profiles are inconsistent or frequently incorrect across users and prompts, limiting usefulness for moderation or reputation assessment.
  • Strong norms or policies emerge that prohibit operationalizing inferred profiles, with credible compliance risk that discourages productization.

Sources