DeepSeek V4 Flash local inference engine for Metal
ds4.c is a small native inference engine for DeepSeek V4 Flash.
It is intentionally narrow: not a generic GGUF runner, not a wrapper around another runtime, and not a framework.
The main path is a DeepSeek V4 Flash-specific Metal graph executor with DS4-specific loading, prompt rendering, KV state, and server API glue.
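To make the shape of that main path concrete, here is a minimal sketch in C. The type and function names are hypothetical placeholders (the real ds4.c API differs), and the stub bodies only stand in for the DS4 loader, the prompt renderer, and the Metal graph executor, so the example compiles and runs on its own.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct { int n_ctx;  } ds4_model;   /* placeholder for weights + Metal state */
typedef struct { int n_used; } ds4_kv;      /* placeholder for per-layer KV tensors  */

/* Stubs standing in for the real engine entry points (hypothetical names). */
static ds4_model *ds4_load(const char *path)   { (void)path; return calloc(1, sizeof(ds4_model)); }
static ds4_kv    *ds4_kv_new(int max_ctx)      { (void)max_ctx; return calloc(1, sizeof(ds4_kv)); }
static int  ds4_render(const char *msg, int32_t *out, int cap) { (void)msg; (void)cap; out[0] = 1; return 1; }
static void ds4_prefill(ds4_model *m, ds4_kv *kv, const int32_t *toks, int n) { (void)m; (void)toks; kv->n_used = n; }
static int32_t ds4_decode_one(ds4_model *m, ds4_kv *kv, int32_t last) { (void)m; kv->n_used++; return last + 1; }

int main(void) {
    ds4_model *m  = ds4_load("ds4-flash.gguf");   /* DS4-specific loading                */
    ds4_kv    *kv = ds4_kv_new(1 << 20);          /* context window of up to 1M tokens   */

    int32_t prompt[16];
    int n = ds4_render("Hello", prompt, 16);      /* DS4 prompt rendering                */
    ds4_prefill(m, kv, prompt, n);                /* batched pass over the prompt        */

    int32_t tok = prompt[n - 1];
    for (int i = 0; i < 8; i++) {                 /* autoregressive decode loop          */
        tok = ds4_decode_one(m, kv, tok);         /* one graph evaluation per token      */
        printf("%d ", (int)tok);
    }
    printf("\n");
    free(m); free(kv);
    return 0;
}
```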
This project would not exist without llama.cpp and GGML: make sure to read the acknowledgements section. A big thank you to Georgi Gerganov and all the other contributors.
Now, back to this project.
Why do we believe DeepSeek V4 Flash is a pretty special model that deserves a standalone engine?
Because after comparing it with powerful smaller dense models, we can report that:
- DeepSeek V4 Flash is faster because it has fewer active parameters.
- In thinking mode, if you avoid max thinking, it produces a thinking section that is a lot shorter than other models', often 1/5 of the length, and crucially, the length of the thinking section is proportional to the problem complexity.
This makes DeepSeek V4 Flash usable with thinking enabled under conditions where other models are practically impossible to use.
- The model features a context window of 1 million tokens.
- Being so large, it knows more things if you sample at the edge of its knowledge.
For instance, asking about Italian shows or political questions quickly reveals that 284B parameters are a lot more than 27B or 35B parameters.
- It writes much better English and Italian.
It feels like a quasi-frontier model.
- The KV cache is incredibly compressed, allowing long-context inference on local computers and on-disk KV cache persistence (see the sketch after this list).
- It works well with 2-bit quantization, if quantized in a special way (read later).
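On-disk KV cache persistence means a long prompt never has to be prefilled twice. Below is a minimal sketch of the idea, under the assumption that the compressed KV state can be exposed as a contiguous buffer plus a token count; the actual ds4.c layout and function names may differ.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical view of the KV state: a token count plus one opaque buffer. */
typedef struct {
    uint32_t n_tokens;   /* tokens currently cached                */
    size_t   bytes;      /* size of the compressed KV buffer       */
    uint8_t *data;       /* compressed per-layer KV tensors        */
} ds4_kv_blob;

/* Dump the KV state so the prompt can be resumed in a later run. */
static int kv_save(const ds4_kv_blob *kv, const char *path) {
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    int ok = fwrite(&kv->n_tokens, sizeof kv->n_tokens, 1, f) == 1 &&
             fwrite(&kv->bytes,    sizeof kv->bytes,    1, f) == 1 &&
             fwrite(kv->data, 1, kv->bytes, f) == kv->bytes;
    fclose(f);
    return ok ? 0 : -1;
}

/* Load a previously saved KV state and resume decoding from it. */
static int kv_load(ds4_kv_blob *kv, const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    if (fread(&kv->n_tokens, sizeof kv->n_tokens, 1, f) != 1 ||
        fread(&kv->bytes,    sizeof kv->bytes,    1, f) != 1) { fclose(f); return -1; }
    kv->data = malloc(kv->bytes);
    int ok = kv->data && fread(kv->data, 1, kv->bytes, f) == kv->bytes;
    fclose(f);
    return ok ? 0 : -1;
}

int main(void) {
    uint8_t buf[4] = {1, 2, 3, 4};
    ds4_kv_blob kv = { .n_tokens = 2, .bytes = sizeof buf, .data = buf };
    if (kv_save(&kv, "kv.bin") != 0) return 1;

    ds4_kv_blob back = {0};
    if (kv_load(&back, "kv.bin") != 0) return 1;
    printf("restored %u cached tokens (%zu bytes)\n", back.n_tokens, back.bytes);
    free(back.data);
    return 0;
}
```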