Local Audio Transcription on macOS via MLX
Sources: 1 • Confidence: High • Updated: 2026-04-13 03:34
Key takeaways
- A locally runnable uv-based recipe on macOS can transcribe an audio file using the 10.28 GB model google/gemma-4-e2b-it with MLX and mlx-vlm.
- In the produced transcript, at least two word-level errors were observed: "This right here" was transcribed as "This front here" and "how well that works" was transcribed as "how that works."
- The workflow uses uv with Python 3.13 to install mlx_vlm, torchvision, and gradio, then invokes mlx_vlm.generate with the google/gemma-4-e2b-it model, a .wav audio file, and a transcription prompt.
- In a test on a 14-second .wav file, the command produced a coherent English transcription of the voice memo.
- The example invocation sets generation controls to a maximum of 500 tokens and temperature 1.0.
Sections
Local Audio Transcription on macOS via MLX
- A locally runnable uv-based recipe on macOS can transcribe an audio file using the 10.28 GB model google/gemma-4-e2b-it with MLX and mlx-vlm.
- The workflow uses uv with Python 3.13 to install mlx_vlm, torchvision, and gradio, then invokes mlx_vlm.generate with the google/gemma-4-e2b-it model, a .wav audio file, and a transcription prompt.
- The example invocation sets generation controls to a maximum of 500 tokens and temperature 1.0.
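The bullets above can be sketched as a single uv invocation. This is a hedged reconstruction, not the verified command: the flag names (--model, --max-tokens, --temperature, --prompt, --audio) and the memo.wav path are assumptions, and mlx-vlm's CLI options may differ between versions.

```shell
# Hypothetical sketch of the described workflow; flags and the audio
# path are assumptions based on the summary, not a verified command.
uv run --python 3.13 \
  --with mlx-vlm --with torchvision --with gradio \
  python -m mlx_vlm.generate \
    --model google/gemma-4-e2b-it \
    --max-tokens 500 \
    --temperature 1.0 \
    --prompt "Transcribe this audio" \
    --audio memo.wav
```

The first run would also trigger the 10.28 GB model download, so expect a long initial startup before any transcription appears.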
Observed Quality Limits and Error Modes
- In the produced transcript, at least two word-level errors were observed: "This right here" was transcribed as "This front here" and "how well that works" was transcribed as "how that works."
- In a test on a 14-second .wav file, the command produced a coherent English transcription of the voice memo.
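Mishearings like these can be quantified with word error rate (WER): the word-level edit distance between a reference and a hypothesis, divided by the reference length. A minimal self-contained sketch, using the two observed errors as examples (this is an illustration, not part of the original workflow):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# The two word-level errors observed in the transcript:
print(wer("This right here", "This front here"))    # 1 substitution over 3 words
print(wer("how well that works", "how that works")) # 1 deletion over 4 words
```

A phrase-level metric like this makes it easier to compare prompt or decoding-parameter changes than eyeballing transcripts.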
Unknowns
- What is the transcription accuracy across a representative set of audio conditions (multiple speakers, accents, background noise, longer recordings) for this exact workflow and model?
- What macOS hardware and memory constraints are required for practical use (e.g., whether the model runs acceptably on common configurations) beyond the stated model size?
- How sensitive are results (accuracy, determinism, verbosity) to decoding parameters such as temperature and token limit in this setup?
- What prompting format and content yields the best transcription fidelity for this model (including whether specialized prompts reduce the specific mishearing patterns observed)?
- What are the failure modes and operational pitfalls (dependency conflicts, model download issues, audio format limitations) when running this recipe outside the demonstrated case?