When we sat down to design llmdot, the first decision was not which kernels to write or which backend to target. It was which file format the runtime would consume. That choice cascades into everything: distribution, packaging, toolchain dependencies, what “model support” even means as a phrase. We picked GGUF, and we picked it as the only ingestion format. Here is the reasoning.

What you actually find when you go looking for a model

If you are a .NET developer who wants to run a small language model locally tomorrow morning, you start by browsing the model catalog that the community has actually produced. You will find an enormous, fast-growing set of quantized GGUF files: LLaMA-3.2 in Q4_K_M, Phi-3 mini in Q4_K_M, Qwen-2 1.5B in Q8_0, Mistral-7B in Q5_K_M, Gemma-2 in Q4_K, LFM2 350M through 2.6B in several quantizations. These are the assets people share, the assets people benchmark, and the assets people fine-tune from. They have settled into the format the way audio settled into MP3 a generation ago.

A .NET inference runtime that asks you to first convert that file into a different format — ONNX, a vendor archive, a proprietary binary container — is asking you to leave the catalog you actually want to use and re-enter it through a side door. Every conversion step is a place where layout assumptions, quantization fidelity, and tokenizer metadata can drift. Every drift is a bug you eventually have to debug from the .NET side. The simplest way to avoid all of that is to consume GGUF directly.

What GGUF gives you that other formats do not

GGUF is structured around two ideas that matter for an inference runtime. First, the file is self-describing: a small fixed header gives the number of metadata entries and tensors, and the metadata key-value section, the tensor table, and the aligned tensor data follow it in the same file. Second, the metadata schema carries everything you need to identify the model family, its quantization layout, its tokenizer assets, and the hyperparameters that govern attention, normalization, and the feedforward stack.
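To make the "self-describing" part concrete, here is a minimal sketch of reading the fixed-size GGUF header in C#. It assumes a little-endian file at GGUF version 2 or later (where the counts are 64-bit), and the type names GgufHeader and GgufHeaderReader are hypothetical, not llmdot's actual API.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper type for illustration; not a published llmdot type.
public readonly record struct GgufHeader(uint Version, ulong TensorCount, ulong MetadataKvCount);

public static class GgufHeaderReader
{
    // The ASCII bytes "GGUF" read as a little-endian uint32.
    private const uint Magic = 0x46554747;

    public static GgufHeader Read(Stream stream)
    {
        // BinaryReader always reads little-endian, matching the assumed byte order.
        using var reader = new BinaryReader(stream, Encoding.UTF8, leaveOpen: true);

        if (reader.ReadUInt32() != Magic)
            throw new InvalidDataException("Not a GGUF file: bad magic.");

        uint version = reader.ReadUInt32();
        ulong tensorCount = reader.ReadUInt64();   // entries in the tensor table
        ulong kvCount = reader.ReadUInt64();       // entries in the metadata section
        return new GgufHeader(version, tensorCount, kvCount);
    }

    // GGUF strings are a uint64 byte length followed by UTF-8 bytes, no terminator.
    // Metadata keys such as general.architecture are stored this way.
    public static string ReadString(BinaryReader reader)
    {
        ulong length = reader.ReadUInt64();
        byte[] bytes = reader.ReadBytes(checked((int)length));
        return Encoding.UTF8.GetString(bytes);
    }
}
```

Everything after these 24 bytes (metadata pairs, tensor descriptors, tensor data) can be walked using only the two counts returned here, which is what lets a reader stay small.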

That second property is what makes a small runtime feasible. We can read general.architecture from the header and resolve a TransformerConfig from the rest of the metadata — rotary base frequency, head counts (including grouped-query and multi-query), norm type, attention type, FFN type, QKV layout, parallel residual flag, embedding scaling, MoE presence, sliding window size, softcapping, and the per-layer type vector for hybrid conv-attention families. The execution engine then reads from that resolved config, not from raw GGUF keys. The model graph never asks “what architecture is this?” — it asks “what does the config say to do?”
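A rough sketch of that resolver step might look like the following. The metadata keys follow the usual GGUF naming (general.architecture, then keys prefixed with the architecture name), but the TransformerConfig shape, the ConfigResolver class, and the dictionary-based metadata input are assumptions made for illustration. The real config carries many more fields than the five shown.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, heavily reduced config; field names are illustrative only.
public sealed record TransformerConfig(
    string Architecture,
    int LayerCount,
    int HeadCount,
    int KvHeadCount,
    float RopeFreqBase)
{
    // Grouped-query (or multi-query) attention: several query heads share a KV head.
    public bool UsesGroupedQueryAttention => KvHeadCount < HeadCount;
}

public static class ConfigResolver
{
    // Assumes metadata has already been parsed into boxed primitive values.
    public static TransformerConfig Resolve(IReadOnlyDictionary<string, object> metadata)
    {
        string arch = (string)metadata["general.architecture"];

        int heads = Convert.ToInt32(metadata[$"{arch}.attention.head_count"]);

        // head_count_kv is optional; when absent, treat it as plain multi-head attention.
        int kvHeads = metadata.TryGetValue($"{arch}.attention.head_count_kv", out var kv)
            ? Convert.ToInt32(kv)
            : heads;

        // 10000 is the conventional RoPE base when the file does not override it.
        float ropeBase = metadata.TryGetValue($"{arch}.rope.freq_base", out var rb)
            ? Convert.ToSingle(rb)
            : 10000f;

        int layers = Convert.ToInt32(metadata[$"{arch}.block_count"]);
        return new TransformerConfig(arch, layers, heads, kvHeads, ropeBase);
    }
}
```

The engine side would then branch on properties such as UsesGroupedQueryAttention rather than on the architecture string, which is the separation the paragraph above describes.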

That distinction is the central abstraction in llmdot. It is what lets the same execution code run LLaMA-3.2, Qwen-2, Phi-3, Mistral, and Gemma-2 without per-model branches. It is what lets us add a new family by extending the resolver, not the engine.

Why “managed-by-default” follows from this

Once GGUF is the ingestion format, the next question is what runs underneath. We chose pure managed .NET for the default path. The reason is not that managed code is faster than tuned native libraries; in absolute terms it is not. The reason is that dropping a native dependency into the .NET deployment story is costly enough that, for a large fraction of real applications, packaging rather than raw throughput is the bottleneck.

A pure managed core is trimming-friendly, NativeAOT-friendly, and single-file publish-friendly. It runs the same on Windows, Linux, and macOS without per-platform native binaries to package. It cooperates with dotnet publish instead of fighting it. It does not require the deployment pipeline to know about runtimes/linux-x64/native/, the loader rules of the OS, or which glibc the destination box happens to have. For desktop apps, ASP.NET workloads, and worker services, that property compounds into a real shipping advantage.

The Tensor Runtime is written against Span<T> and the System.Numerics vector types. Hot paths (RmsNorm, LayerNorm, the max-reduce and normalize steps in Softmax, Add, Scale, Mul) are SIMD-accelerated through Vector<T>. Dequantization avoids transcendental functions in the hot path: FP16 conversion is done through direct IEEE 754 bit manipulation. Sampling avoids per-token heap allocations by using stackalloc buffers for typical vocab sizes. The point is not that managed code is unconditionally fast. The point is that, for the model sizes and quantizations llmdot is targeting (1–8B parameters in Q4_K_M, Q5_K_M, and Q8_0), the runtime is competitive enough that the packaging benefit dominates.
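For readers who have not written this style of code, the sketch below shows the two techniques in their simplest form: an RMSNorm over Span<float> using Vector<float>, and an FP16-to-FP32 conversion done purely with integer bit operations. This is an illustrative sketch, not llmdot's actual kernels; the TensorOps class and both method names are hypothetical, and the epsilon default is an assumption.

```csharp
using System;
using System.Numerics;

public static class TensorOps
{
    // RMSNorm: output[i] = x[i] * weight[i] / sqrt(mean(x^2) + eps).
    public static void RmsNorm(ReadOnlySpan<float> x, ReadOnlySpan<float> weight,
                               Span<float> output, float eps = 1e-5f)
    {
        int width = Vector<float>.Count;   // SIMD lanes available on this machine
        int i = 0;

        // Pass 1: sum of squares, one vector of lanes at a time.
        var acc = Vector<float>.Zero;
        for (; i <= x.Length - width; i += width)
        {
            var v = new Vector<float>(x.Slice(i, width));
            acc += v * v;
        }
        float sumSq = Vector.Dot(acc, Vector<float>.One);  // horizontal add of the lanes
        for (; i < x.Length; i++) sumSq += x[i] * x[i];    // scalar tail

        float scale = 1f / MathF.Sqrt(sumSq / x.Length + eps);

        // Pass 2: normalize and apply the learned per-channel weight.
        for (i = 0; i <= x.Length - width; i += width)
        {
            var v = new Vector<float>(x.Slice(i, width));
            var w = new Vector<float>(weight.Slice(i, width));
            (v * w * scale).CopyTo(output.Slice(i, width));
        }
        for (; i < x.Length; i++) output[i] = x[i] * weight[i] * scale;
    }

    // FP16 bits -> float using only shifts and masks (no Math calls).
    public static float HalfBitsToSingle(ushort h)
    {
        uint sign = (uint)(h & 0x8000) << 16;
        uint exp = (uint)(h >> 10) & 0x1F;
        uint mant = (uint)(h & 0x03FF);
        uint bits;

        if (exp == 0x1F)                       // infinity or NaN
            bits = sign | 0x7F800000u | (mant << 13);
        else if (exp != 0)                     // normal: rebias exponent 15 -> 127
            bits = sign | ((exp + 112) << 23) | (mant << 13);
        else if (mant == 0)                    // signed zero
            bits = sign;
        else                                   // subnormal: shift until the implicit 1 appears
        {
            int e = -1;
            do { mant <<= 1; e++; } while ((mant & 0x0400) == 0);
            bits = sign | ((uint)(112 - e) << 23) | ((mant & 0x03FF) << 13);
        }
        return BitConverter.Int32BitsToSingle(unchecked((int)bits));
    }
}
```

Because Vector<float>.Count is resolved by the JIT for the host CPU, the same loop uses whatever vector width the machine offers, and the scalar tail covers lengths that are not a multiple of it.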

Optional native acceleration, not foundational

When raw throughput matters more than packaging simplicity, llmdot exposes an IComputeBackend contract. Individual tensor operations route through that interface, which means a backend can offload one op, say a fused MatMul on quantized weights, without taking ownership of the whole graph. Metal and Vulkan backends are in development. They implement the same contract and fall back to CPU below a size threshold, and CPU remains the default.
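One way to picture per-op routing with a size-based fallback is the sketch below. The interface members, the ManagedBackend and ThresholdBackend classes, and the element-count threshold are all assumptions made for illustration; the post does not show the real IComputeBackend surface, so treat this only as one possible shape.

```csharp
using System;

// Hypothetical contract shape; the actual IComputeBackend is not shown in this post.
public interface IComputeBackend
{
    string Name { get; }

    // C[m x n] = A[m x k] * B[k x n], all row-major.
    void MatMul(ReadOnlySpan<float> a, ReadOnlySpan<float> b, Span<float> c, int m, int k, int n);
}

// Plain managed reference implementation. In a demo, a second instance with a
// different name can stand in for an accelerator.
public sealed class ManagedBackend : IComputeBackend
{
    public ManagedBackend(string name = "cpu") => Name = name;
    public string Name { get; }

    public void MatMul(ReadOnlySpan<float> a, ReadOnlySpan<float> b, Span<float> c, int m, int k, int n)
    {
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++)
            {
                float sum = 0f;
                for (int p = 0; p < k; p++) sum += a[i * k + p] * b[p * n + j];
                c[i * n + j] = sum;
            }
    }
}

// Routes each op individually: small problems stay on the CPU, large ones go
// to the accelerator. The threshold unit (multiply-adds) is an assumption.
public sealed class ThresholdBackend : IComputeBackend
{
    private readonly IComputeBackend _accelerator;
    private readonly IComputeBackend _cpu;
    private readonly long _minWork;

    public ThresholdBackend(IComputeBackend accelerator, IComputeBackend cpu, long minWork)
    {
        _accelerator = accelerator;
        _cpu = cpu;
        _minWork = minWork;
    }

    public string Name => $"{_accelerator.Name}+{_cpu.Name}";
    public string LastUsed { get; private set; } = "";

    public void MatMul(ReadOnlySpan<float> a, ReadOnlySpan<float> b, Span<float> c, int m, int k, int n)
    {
        long work = (long)m * k * n;
        IComputeBackend target = work < _minWork ? _cpu : _accelerator;
        LastUsed = target.Name;
        target.MatMul(a, b, c, m, k, n);
    }
}
```

Because the decision is made per call, the graph code never needs to know which backend ran a given op, and removing the accelerator leaves the CPU path unchanged.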

That design is a deliberate inversion of the more common “GPU-first, CPU-as-an-afterthought” pattern. We build for the boring case first: a .NET application on a developer laptop or a modest server, running a small quantized model from a NuGet package. When we add acceleration, it is additive. There is no version of llmdot in which “you must install a vendor SDK before you can run a model.”

What this rules out, on purpose

GGUF-native, managed-by-default, decoder-only 1–8B with optional GPU compute — that set of choices rules out a few things, and the ruling-out is the point.

We do not target NPUs. NPUs are graph execution engines, not general-purpose programmable compute. In practice they do not dispatch individual MatMul or RoPE operations, they require compiled subgraphs with static shapes, they share system RAM bandwidth with the CPU, and the usual path to using one goes through ONNX conversion, which we deliberately do not have. Major local LLM frameworks such as llama.cpp and MLX primarily target GPU compute, not NPUs. We do the same.

We do not target frontier-scale (70B+) models as an early milestone. The execution templates can in principle scale; the runtime can in principle run a larger model. But the value proposition collapses past a certain point — a 70B model is not a “ship it inside your .NET app” workload.

We do not require ONNX conversion, proprietary packaging, or sidecar services. The whole point is that there is one core package, one model format, and one programming model. If we ever found ourselves writing a “model converter” tool, we would have lost the plot.

What it looks like in code

This is what we want the developer experience to be:

using Llmdot;

await using var model = await LlmModel.LoadAsync("phi-3-mini-q4_k_m.gguf");
await using var session = model.CreateChatSession();

await foreach (var token in session.StreamAsync("Explain GGUF in one paragraph."))
    Console.Write(token);

That sample reflects the target API shape; the implementation is still in active development, so see the roadmap for what is wired up today. Still, the small surface in that code block is the entire pitch. The whole rest of the project, including the architecture-agnostic execution templates, the four-template taxonomy, and the optional GPU backends, exists to make sure that those five lines stay five lines as the model catalog grows.

That is why we picked GGUF.

Why we picked GGUF as the ingestion format for .NET