Transcribing audio with Gemma 4 on macOS using MLX

Simon Willison · April 13, 07:57 UTC · By Simon Willison

Key points

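The command uses `uv run` with Python 3.13, mlx-vlm, torchvision, and gradio to run inference on a WAV file. The output contains a couple of minor transcription errors, so accuracy still has room to improve.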

Summary

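Simon Willison shows how to transcribe an audio file on macOS with the Gemma 4 E2B model running on MLX. He provides a ready-to-run command built on the mlx-vlm library, so developers can try it on their own Macs.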

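The example uses a 14-second WAV file. The resulting transcript is mostly accurate but has small errors, such as rendering "This right here" as "This front here". It illustrates the potential of Apple Silicon for local LLM audio processing without external hardware or cloud services.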

Full text

Thanks to a tip from Rahim Nathwani, here's a uv run recipe for transcribing an audio file on macOS with MLX and mlx-vlm, using the 10.28 GB Gemma 4 E2B model (from Hugging Face):

uv run --python 3.13 --with mlx_vlm --with torchvision --with gradio \
  mlx_vlm.generate \
  --model google/gemma-4-e2b-it \
  --audio file.wav \
  --prompt "Transcribe this audio" \
  --max-tokens 500 \
  --temperature 1.0
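To run the same recipe against several files, it can help to build the argument list in Python rather than retyping the shell command. The following is a minimal sketch, not part of the original post: the helper names `build_command` and `transcribe` are hypothetical, and the flags are copied unchanged from the command above. It assumes `uv` is on the `PATH`, and it returns the raw standard output of `mlx_vlm.generate`, which may include more than the transcript itself.

```python
import shlex
import subprocess


def build_command(audio_path: str, max_tokens: int = 500) -> list[str]:
    """Return the argv list for the uv run recipe (hypothetical helper)."""
    return [
        "uv", "run", "--python", "3.13",
        "--with", "mlx_vlm", "--with", "torchvision", "--with", "gradio",
        "mlx_vlm.generate",
        "--model", "google/gemma-4-e2b-it",
        "--audio", audio_path,
        "--prompt", "Transcribe this audio",
        "--max-tokens", str(max_tokens),
        "--temperature", "1.0",
    ]


def transcribe(audio_path: str) -> str:
    """Run the recipe and return its raw stdout (hypothetical helper).

    Assumption: uv is installed and the model can be downloaded or is cached.
    """
    result = subprocess.run(
        build_command(audio_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the command fails
    )
    return result.stdout


# Print the command as a shell-quoted string without executing it.
print(shlex.join(build_command("file.wav")))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces. Calling `transcribe("file.wav")` would actually launch the model.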

I tried it on <a href="https://static.simonwillison.net/static/2026/demo-audio-for-gemma.wav">this 14-second .wav file</a> and it output the following:

This front here is a quick voice memo. I want to try it out with MLX VLM. Just going to see if it can be transcribed by Gemma and how that works.

(That should have been "This right here..." and "...how well that works", but I can understand why it misheard them as "front" and "how that works".)
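The two errors can be located mechanically with a word-level diff. The sketch below is an illustration rather than anything from the original post. It uses `difflib.SequenceMatcher` from the Python standard library, and the `REFERENCE` text is an assumption: it is reconstructed by applying the two corrections noted above to the model output, on the premise that the rest of the transcript was correct. The function name `word_diff` is hypothetical.

```python
from difflib import SequenceMatcher

# Model output as reported above.
TRANSCRIPT = (
    "This front here is a quick voice memo. I want to try it out with "
    "MLX VLM. Just going to see if it can be transcribed by Gemma and "
    "how that works."
)

# Assumed reference: the output with the two noted corrections applied.
REFERENCE = (
    "This right here is a quick voice memo. I want to try it out with "
    "MLX VLM. Just going to see if it can be transcribed by Gemma and "
    "how well that works."
)


def word_diff(reference: str, hypothesis: str) -> list[tuple[str, str, str]]:
    """Return (operation, reference words, hypothesis words) for each mismatch."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    matcher = SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only replace / delete / insert spans
            changes.append(
                (tag, " ".join(ref_words[i1:i2]), " ".join(hyp_words[j1:j2]))
            )
    return changes


for change in word_diff(REFERENCE, TRANSCRIPT):
    print(change)
# ('replace', 'right', 'front')
# ('delete', 'well', '')
```

Under that assumption the diff reports one substituted word and one dropped word out of 33 in the reference.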

Sources and references