LLM-Mediated Reverse Engineering Workflow and Reproducibility Constraints

Issue 79 • Edition 2026-03-20 • 6 min read

General

Sources: 1 • Confidence: High • Updated: 2026-03-25 17:54

Key takeaways

The author used Codex CLI with GPT-5.4 xhigh to review the zip for obvious hallucinations and, seeing none, published the result.
A reviewer familiar with assembly language argued that the output was not a full disassembly but a set of short snippets, and questioned whether those snippets were correct.
Searching the binary for the byte sequence B0 E8 (the encoding of 'mov al,0xe8') was enough to show that a presented snippet does not occur anywhere in the binary.
Borland's 1985 Turbo Pascal 3.02 executable was 39,731 bytes and included a full text editor IDE and a Pascal compiler.
A reviewer noted additional suspicious and impossible code in the artifact, including a 'ret 1' in a system call dispatcher that would misalign the stack.

The author used Codex CLI with GPT-5.4 xhigh to review the zip for obvious hallucinations and, seeing none, published the result.
The author obtained the Turbo Pascal executable and used Claude to interpret the binary and produce an interactive annotated artifact via a sequence of prompts.
The shared Claude link did not include the actually executed code from the session, so the author provided a zip of intermediate files instead.

A reviewer familiar with assembly language argued that the output was not a full disassembly but a set of short snippets, and questioned whether those snippets were correct.
A later update states that the published decompiled/annotated result was hallucinated and inaccurate.
After receiving the critique, Claude agreed that the artifact mixed real hex dumps and some correct disassembly with fabricated assembly and labels covering roughly half the binary, and that the fabricated portion fails byte-level comparison.

Searching the binary for the byte sequence B0 E8 (the encoding of 'mov al,0xe8') was enough to show that a presented snippet does not occur anywhere in the binary.
A reviewer noted additional suspicious and impossible code in the artifact, including a 'ret 1' in a system call dispatcher that would misalign the stack.
A reviewer pointed to an 'EmitByte' routine in the artifact that pointlessly pushed and popped AX, and concluded that those instructions do not appear in the actual binary.
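
The byte-pattern check described above can be sketched in a few lines of Python. This is a minimal illustration, not the reviewer's actual tooling: the helper names `find_all` and `search_file` are hypothetical, and the demo runs on a synthetic byte string rather than the real Turbo Pascal executable.

```python
from pathlib import Path


def find_all(data: bytes, pattern: bytes) -> list[int]:
    """Return every offset at which pattern occurs in data (overlaps included)."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    offsets = []
    start = data.find(pattern)
    while start != -1:
        offsets.append(start)
        start = data.find(pattern, start + 1)
    return offsets


def search_file(path: str, hex_pattern: str) -> list[int]:
    """Search a file on disk for a hex-encoded byte pattern such as 'B0 E8'."""
    return find_all(Path(path).read_bytes(), bytes.fromhex(hex_pattern))


if __name__ == "__main__":
    # Synthetic stand-in for a binary: nop, mov al,0xe8, ret, mov al,0xe8.
    sample = bytes.fromhex("90 B0 E8 C3 B0 E8")
    print(find_all(sample, bytes.fromhex("B0 E8")))  # [1, 4]
    # An empty result means the claimed instruction bytes occur nowhere.
    print(find_all(sample, bytes.fromhex("50 58")))  # []
```

A raw byte search can over-report, because the same bytes may appear inside data or in the middle of another instruction. An empty result is the strong signal: if the encoding of a claimed instruction occurs nowhere in the file, the snippet cannot have come from it.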

Borland's 1985 Turbo Pascal 3.02 executable was 39,731 bytes and included a full text editor IDE and a Pascal compiler.

What would a fully reproducible, byte-addressed disassembly of the Turbo Pascal 3.02 executable look like, one that maps every claimed function or snippet to exact offsets and bytes?
What fraction of the published artifact is verifiably correct versus fabricated when checked against the binary using deterministic disassembly tooling?
Which specific verification gates were applied (or omitted) in the initial publication process, beyond a qualitative second-model review?
Do the simple opcode-pattern searches and 'impossible code' heuristics generalize to reliably screening other AI-generated reverse-engineering writeups?
Does the corpus state any direct implication for operator, product, or investor decisions beyond the general need for byte-level verification in AI-assisted reverse engineering?

Near-term demand for verifiable reverse-engineering workflows may rise as LLM-generated disassembly is shown to be partially fabricated without byte-level grounding.
Tooling that automates deterministic, byte-addressed mapping of snippets to offsets could see increased adoption as a verification gate for AI-assisted reverse engineering.
Security and compliance teams may expand simple falsification checks, such as opcode-pattern searches and impossible-code heuristics, as low-cost screening for AI-generated technical artifacts.
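
One impossible-code heuristic of the kind mentioned above is to flag any `ret` with an odd immediate in a 16-bit listing, since `ret N` adds N to the stack pointer after popping the return address and an odd N leaves the stack unaligned. The sketch below is an illustration under stated assumptions: the listing is plain text with `;` comments, immediates are written in decimal, `0x..` or `..h` form, and the function name `flag_odd_ret` is hypothetical.

```python
import re

# Matches ret / retf / retn followed by an immediate operand.
RET_IMM = re.compile(
    r"\bret[fn]?\s+(0x[0-9a-f]+|[0-9][0-9a-f]*h|[0-9]+)\b",
    re.IGNORECASE,
)


def parse_imm(token: str) -> int:
    """Parse a decimal, 0x-prefixed, or h-suffixed immediate."""
    t = token.lower()
    if t.startswith("0x"):
        return int(t, 16)
    if t.endswith("h"):
        return int(t[:-1], 16)
    return int(t)


def flag_odd_ret(listing: str) -> list[tuple[int, str]]:
    """Return (line number, code) for each ret whose immediate is odd."""
    findings = []
    for lineno, line in enumerate(listing.splitlines(), start=1):
        code = line.split(";", 1)[0]  # ignore trailing comments
        match = RET_IMM.search(code)
        if match and parse_imm(match.group(1)) % 2 == 1:
            findings.append((lineno, code.strip()))
    return findings


if __name__ == "__main__":
    # Invented listing for demonstration only.
    listing = "dispatch:\n    push ax\n    ret 1\n    ret 2\n    retf 0x4 ; ret 3\n"
    print(flag_odd_ret(listing))  # [(3, 'ret 1')]
```

A hit is a reason to inspect the snippet against the binary, not proof of fabrication on its own.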

Publication of a fully reproducible, byte-addressed disassembly that maps every claimed snippet to exact offsets and bytes and is independently repeatable.
Quantified audit results showing what fraction of the prior artifact matches the binary using deterministic disassembly tooling.
Documented workflow gates beyond a second-model review, such as mandatory opcode-level evidence for each claim and automated mismatch detection.
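
A gate of this kind could be as simple as requiring each claim to carry an offset and the bytes it asserts, then checking those bytes mechanically. The following is a sketch under assumptions, not a description of any existing tool: the `Claim` structure, the three verdict names, and the sample data are all hypothetical, and the "binary" is a short synthetic byte string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    label: str      # name the writeup gives the snippet
    offset: int     # file offset the writeup claims
    hex_bytes: str  # bytes the writeup claims, e.g. "B4 02 CD 21"


def check_claim(binary: bytes, claim: Claim) -> str:
    """Classify a claim as 'match', 'wrong_offset', or 'absent'."""
    expected = bytes.fromhex(claim.hex_bytes)
    if claim.offset < 0 or not expected:
        raise ValueError(f"malformed claim: {claim.label}")
    actual = binary[claim.offset : claim.offset + len(expected)]
    if actual == expected:
        return "match"
    if binary.find(expected) != -1:
        return "wrong_offset"  # bytes exist, but not where claimed
    return "absent"            # bytes occur nowhere in the file


def audit(binary: bytes, claims: list[Claim]) -> dict[str, int]:
    """Count verdicts across all claims to quantify the match rate."""
    counts = {"match": 0, "wrong_offset": 0, "absent": 0}
    for claim in claims:
        counts[check_claim(binary, claim)] += 1
    return counts


if __name__ == "__main__":
    # Synthetic data: push ax; mov ah,2; int 21h; pop ax; ret
    binary = bytes.fromhex("50 B4 02 CD 21 58 C3")
    claims = [
        Claim("print_char", 1, "B4 02 CD 21"),
        Claim("epilogue", 0, "58 C3"),
        Claim("emit_byte", 3, "B0 E8"),
    ]
    print(audit(binary, claims))
    # {'match': 1, 'wrong_offset': 1, 'absent': 1}
```

The counts give a quantified audit result directly: the share of "match" verdicts is the fraction of claims that survive byte-level comparison.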

No follow-up release of reproducible, byte-addressed artifacts and no quantified correction of what was fabricated versus correct.
Independent checks continue to find missing opcode sequences and impossible stack-discipline constructs in published outputs.
Evidence that low-cost falsification methods do not generalize: they produce many false positives or fail to detect fabrications across similar writeups.