Rosa Del Mar

Daily Brief

Issue 80 • 2026-03-21

Operational Pipeline From Public Text To LLM-Produced Profiles

General
Sources: 1 • Confidence: High • Updated: 2026-03-25 17:54

Key takeaways

  • The described profiling workflow: fetch a user's last roughly 1,000 HN comments with a purpose-built tool, copy them, and paste them into an LLM with an instruction to profile the user.
  • The Algolia Hacker News API can list a user's most recent comments sorted by date by querying tags of the form "comment,author_<username>" with hitsPerPage up to 1000.
  • The author states he finds it creepy that substantial information about someone can be derived easily from public content accessible via an API.
  • The author hypothesizes that a model inferred his real name because his HN comments often link to his own website, providing URLs that could connect the username to a public persona.
  • The author reports using generated profiles mainly to avoid being drawn into extended arguments with people he believes have a history of bad-faith debate.

Sections

Operational Pipeline From Public Text To LLM-Produced Profiles

  • The described profiling workflow: fetch a user's last roughly 1,000 HN comments with a purpose-built tool, copy them, and paste them into an LLM with an instruction to profile the user.
  • The author reports that LLM-generated profiles from recent HN comments can be startlingly effective.
  • The author reports using incognito mode when running profiling prompts in an attempt to reduce the chance the model recognizes him and produces overly flattering responses.
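The copy-and-paste step above can be sketched as simple prompt assembly. The exact instruction wording the author uses is not given in the source, so the phrasing below is illustrative:

```python
def build_profiling_prompt(comments: list[str], max_chars: int = 200_000) -> str:
    """Concatenate a user's recent comments into a single profiling prompt.

    The instruction text is an assumption; the source only says the
    comments are pasted into an LLM with an instruction to profile
    the user. max_chars is a crude guard against context overflow.
    """
    body = "\n\n---\n\n".join(comments)[:max_chars]
    return (
        "Below are a Hacker News user's most recent comments, newest first. "
        "Write a profile of this user.\n\n" + body
    )
```

In practice the separator and truncation strategy matter less than comment recency; the source emphasizes that recent comments alone yield startlingly effective profiles.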

Low-Friction Public Data Access For Per-User Comment History

  • The Algolia Hacker News API can list a user's most recent comments sorted by date by querying tags of the form "comment,author_<username>" with hitsPerPage up to 1000.
  • The Algolia HN API is served with open CORS headers, allowing cross-origin calls from JavaScript on arbitrary web pages.
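The query shape described above can be sketched in a few lines. The endpoint, the "comment,author_<username>" tag pair, and the hitsPerPage cap come from the source; the helper names and the assumption that comment bodies live in a "comment_text" field of each hit are illustrative:

```python
import urllib.parse

ALGOLIA_SEARCH_BY_DATE = "https://hn.algolia.com/api/v1/search_by_date"

def comment_history_url(username: str, hits_per_page: int = 1000) -> str:
    """Build the Algolia HN API URL for a user's most recent comments.

    search_by_date returns hits sorted newest-first; the
    "comment,author_<username>" tag pair restricts results to a single
    user's comments. hitsPerPage is capped at 1000 per the source.
    """
    params = urllib.parse.urlencode({
        "tags": f"comment,author_{username}",
        "hitsPerPage": min(hits_per_page, 1000),
    })
    return f"{ALGOLIA_SEARCH_BY_DATE}?{params}"

def comment_texts(response_json: dict) -> list[str]:
    """Extract comment bodies from an Algolia response payload (assumed
    to carry them under hits[].comment_text)."""
    return [hit.get("comment_text", "") for hit in response_json.get("hits", [])]
```

Because the API is served with open CORS headers, this same request can be issued directly from browser JavaScript on any page, which is what makes the workflow so low-friction.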

Privacy, Ethics, And Publication Constraints For LLM-Based Profiling From Public Forums

  • The author states he finds it creepy that substantial information about someone can be derived easily from public content accessible via an API.
  • The author states he considers it invasive to publish LLM-generated profiles about other users and therefore only shares an example profile about himself.

Identity Linkage And De-Anonymization Uncertainty

  • The author hypothesizes that a model inferred his real name because his HN comments often link to his own website, providing URLs that could connect the username to a public persona.
  • The author reports not having seen LLM profiling outputs guess real names for other accounts he profiled.

Defensive Use-Case: Engagement Avoidance And Reputational Triage

  • The author reports using generated profiles mainly to avoid being drawn into extended arguments with people he believes have a history of bad-faith debate.

Watchlist

  • The author states he finds it creepy that substantial information about someone can be derived easily from public content accessible via an API.

Unknowns

  • How accurate and reproducible are LLM-generated user profiles when evaluated against consenting users' ground truth, under blinded assessment and across multiple models/prompts?
  • What are the practical scaling limits and failure modes of the Algolia HN API approach (rate limits, pagination behavior beyond 1000, completeness for high-volume users)?
  • How stable are the permissive CORS headers over time, and are there policy or technical changes that could remove browser-based access?
  • Under what conditions do models attempt explicit identity resolution (e.g., real-name guessing), and how strongly does self-linking behavior increase that risk?
  • Does using incognito or otherwise removing account history materially change profiling outputs in content (not just tone), and which evaluation controls are needed to isolate the effect?

Investor overlay

Read-throughs

  • Low-friction public API access combined with LLM summarization enables cheap, at-scale profiling workflows, potentially increasing demand for tooling that automates ingestion, summarization, and user-level analytics from public text.
  • Growing attention to inferential privacy risks from public data could drive interest in privacy-preserving platform policies and third-party compliance tooling that limits bulk access or governs downstream profiling use.
  • Profiling used for engagement avoidance suggests LLM-generated reputation triage could influence community dynamics, creating demand for moderation and trust tooling that flags likely bad-faith behavior from public history.

What would confirm

  • Evidence of sustained usage growth for browser-based tools that pull large per-user comment histories via permissive CORS APIs and feed them into LLM prompts for profiles.
  • Platform policy or technical changes specifically addressing bulk per-user data retrieval, CORS access, or automated profiling, indicating the issue is reaching operational or reputational thresholds.
  • Controlled evaluations published by platforms or researchers measuring accuracy and reproducibility of LLM-generated profiles against consenting users' ground truth, including identity-resolution frequency and error rates.

What would kill

  • API access becomes materially constrained through stricter rate limits, removal of permissive CORS headers, reduced hitsPerPage, or other changes that make lightweight bulk collection impractical.
  • Blinded evaluations show LLM-generated profiles are not reliable or reproducible across prompts and models, limiting usefulness for triage and reducing incentives to build products around the workflow.
  • Models or policies consistently prevent identity resolution and discourage profiling outputs, reducing perceived creepiness and lowering the likelihood of platform-level response or product demand.

Sources