Most local-inference stories in .NET start with a GPU. A vendor SDK gets installed, a native library lands in runtimes/<rid>/native/, a backend is initialized, and CPU shows up only in the README under “fallback for systems without acceleration.” That order — GPU is the headline, CPU is the leftover — is a reasonable place to start if your goal is to win on raw throughput across a vendor’s hardware portfolio. It is the wrong place to start if your goal is to be the easiest way to ship a small local model from a .NET application.

llmdot starts the other way around. The default install is pure managed .NET, the default execution target is CPU, and optional GPU backends are dispatched per-operation through a stable contract. This post explains why that ordering is the actual product, not a compromise.

The deployment story is the product

Talk to anyone who has shipped a desktop or server .NET application with a native inference dependency, and the same set of pain points comes up. Native binaries are per-platform and per-architecture — one for win-x64, one for linux-x64, one for linux-arm64, one for osx-arm64, possibly one for osx-x64 still. They get packaged into runtimes/<rid>/native/ and shipped as part of the application. They get loaded at start-up through P/Invoke and the OS loader, which means a missing dependency on the target box is a start-up failure rather than a build failure. They fight dotnet publish --self-contained, they fight single-file publishing, and they especially fight NativeAOT. They make Docker images bigger. They mean your install script knows about glibc versions.

For a runtime whose job is to make local inference feel native to .NET, that whole list is the bottleneck. Optimizing the inner loop of a matmul is not what is keeping a developer from shipping — the packaging story is.

The managed-by-default position is a direct response. Llmdot.Core has no native dependencies. The same DLL runs on Windows, Linux, and macOS without per-RID native assets. It trims. It works with NativeAOT. It works with single-file publish. dotnet publish does what you expect, the way it does for the rest of your application code.

That is the headline.

What “CPU-first” actually means in code

CPU-first is not the same as “we wrote scalar code and called it done.” It means three things in the codebase.

First, the tensor primitives are written against Span<T> to avoid copies and against System.Numerics.Vector to access SIMD. The hot paths — RmsNorm, LayerNorm, max-reduce and normalize in Softmax, Add, Scale, Mul — are vectorized through Vector<T>. Silu and Gelu remain scalar where transcendental functions dominate. MathF.Pow-based half-to-float conversion is replaced with direct IEEE 754 bit manipulation. The point is that the CPU path is a real implementation, not a placeholder.
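To make the first point concrete, here is a minimal sketch of what a Span<T>-based, Vector<T>-accelerated primitive and a bit-level half-to-float conversion can look like. It is not llmdot's source: the class name SimdSketch and both method signatures are assumptions made for illustration.

```csharp
using System;
using System.Numerics;

// Illustrative only: names and signatures are hypothetical, not llmdot's API.
public static class SimdSketch
{
    // dst[i] = a[i] + b[i], processed Vector<float>.Count lanes at a time.
    public static void Add(ReadOnlySpan<float> a, ReadOnlySpan<float> b, Span<float> dst)
    {
        if (a.Length != b.Length || dst.Length < a.Length)
            throw new ArgumentException("Span lengths do not match.");

        int width = Vector<float>.Count;
        int i = 0;
        for (; i <= a.Length - width; i += width)
        {
            var sum = new Vector<float>(a.Slice(i, width)) + new Vector<float>(b.Slice(i, width));
            sum.CopyTo(dst.Slice(i, width));
        }
        // Scalar tail for lengths that are not a multiple of the vector width.
        for (; i < a.Length; i++)
            dst[i] = a[i] + b[i];
    }

    // IEEE 754 binary16 -> binary32 using only integer bit operations.
    public static float HalfBitsToSingle(ushort h)
    {
        uint sign = (uint)(h & 0x8000) << 16;
        uint exp = (uint)(h >> 10) & 0x1F;
        uint man = (uint)h & 0x3FF;
        uint bits;

        if (exp == 0)
        {
            if (man == 0)
            {
                bits = sign; // signed zero
            }
            else
            {
                // Subnormal half: shift the mantissa up until the implicit bit appears.
                int shift = -1;
                do { man <<= 1; shift++; } while ((man & 0x400) == 0);
                man &= 0x3FF;
                bits = sign | (uint)(112 - shift) << 23 | man << 13;
            }
        }
        else if (exp == 31)
        {
            bits = sign | 0x7F800000u | man << 13; // infinity or NaN
        }
        else
        {
            bits = sign | (exp + 112) << 23 | man << 13; // rebias exponent from 15 to 127
        }
        return BitConverter.Int32BitsToSingle((int)bits);
    }
}
```

The scalar tail loop is what keeps the vectorized path correct for arbitrary lengths, and the conversion never calls into MathF.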

Second, allocation discipline is treated as a first-class concern. The weight dequantization cache means norm and output projection weights are dequantized once and reused, rather than re-dequantized per token (which was previously responsible for dozens of allocations per token in a 32-layer model). KV cache and inference buffers are reused across Generate calls, not allocated per call. Sampling avoids logits.ToArray() by using stackalloc for typical vocab sizes and by scaling logits in place. RoPE frequency tables are precomputed at construction time, so MathF.Pow is not called again per head per layer per token.
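The RoPE point is the easiest of these to show in a few lines. The sketch below assumes the common formulation where pair i rotates by position × base^(−2i/d) and where adjacent elements form a pair; the class name RopeTableSketch, the default base of 10000, and the pairing convention are all assumptions, and real models differ on the last one.

```csharp
using System;

// Illustrative only: not llmdot's implementation.
public sealed class RopeTableSketch
{
    private readonly float[] _invFreq; // one inverse frequency per rotated pair

    public RopeTableSketch(int headDim, float thetaBase = 10000f)
    {
        if (headDim <= 0 || headDim % 2 != 0)
            throw new ArgumentException("headDim must be a positive even number.", nameof(headDim));

        // The only MathF.Pow calls happen here, once per table.
        _invFreq = new float[headDim / 2];
        for (int i = 0; i < _invFreq.Length; i++)
            _invFreq[i] = MathF.Pow(thetaBase, -2f * i / headDim);
    }

    // Rotates one attention head in place for the given token position.
    public void Apply(Span<float> head, int position)
    {
        if (head.Length != _invFreq.Length * 2)
            throw new ArgumentException("Head length does not match the table.", nameof(head));

        for (int i = 0; i < _invFreq.Length; i++)
        {
            float angle = position * _invFreq[i];
            float cos = MathF.Cos(angle);
            float sin = MathF.Sin(angle);
            float x = head[2 * i];
            float y = head[2 * i + 1];
            head[2 * i] = x * cos - y * sin;
            head[2 * i + 1] = x * sin + y * cos;
        }
    }
}
```

Apply allocates nothing, so one table instance can be shared across every head, layer, and token.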

Third, the quantization coverage is wide enough to be useful. Block quantization (Q4_0, Q4_1, Q5_0, Q5_1, Q8_0), K-quants (Q2_K through Q6_K), and F16, F32, and BF16 are all supported in dequantization. The community publishes small models in K-quants. You can load those models without re-quantizing them through some other toolchain.
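For readers who have not looked inside a block-quantized tensor, the sketch below dequantizes the two simplest formats. It follows the GGML block layout as commonly documented (32 weights per block, a little-endian half-precision scale, then either 32 signed bytes for Q8_0 or 16 bytes of packed nibbles for Q4_0); treat that layout, the class name BlockDequantSketch, and the method signatures as assumptions rather than a description of llmdot's code. It needs .NET 6 or later for the Half helpers.

```csharp
using System;
using System.Buffers.Binary;

// Illustrative only: layout assumed from the GGML block formats.
public static class BlockDequantSketch
{
    public const int BlockSize = 32;   // weights per block
    public const int Q8_0Bytes = 34;   // 2-byte scale + 32 int8 values
    public const int Q4_0Bytes = 18;   // 2-byte scale + 16 bytes of nibbles

    public static void DequantizeQ8_0(ReadOnlySpan<byte> src, Span<float> dst)
    {
        int blocks = src.Length / Q8_0Bytes;
        if (dst.Length < blocks * BlockSize)
            throw new ArgumentException("Destination is too small.", nameof(dst));

        for (int b = 0; b < blocks; b++)
        {
            ReadOnlySpan<byte> block = src.Slice(b * Q8_0Bytes, Q8_0Bytes);
            float scale = (float)BinaryPrimitives.ReadHalfLittleEndian(block);
            Span<float> output = dst.Slice(b * BlockSize, BlockSize);
            for (int j = 0; j < BlockSize; j++)
                output[j] = (sbyte)block[2 + j] * scale;
        }
    }

    public static void DequantizeQ4_0(ReadOnlySpan<byte> src, Span<float> dst)
    {
        int blocks = src.Length / Q4_0Bytes;
        if (dst.Length < blocks * BlockSize)
            throw new ArgumentException("Destination is too small.", nameof(dst));

        for (int b = 0; b < blocks; b++)
        {
            ReadOnlySpan<byte> block = src.Slice(b * Q4_0Bytes, Q4_0Bytes);
            float scale = (float)BinaryPrimitives.ReadHalfLittleEndian(block);
            Span<float> output = dst.Slice(b * BlockSize, BlockSize);
            for (int j = 0; j < BlockSize / 2; j++)
            {
                byte packed = block[2 + j];
                // Low nibble holds element j, high nibble holds element j + 16;
                // both are stored with an offset of 8.
                output[j] = ((packed & 0x0F) - 8) * scale;
                output[j + 16] = ((packed >> 4) - 8) * scale;
            }
        }
    }
}
```

The K-quants use larger super-blocks with nested scales, so they need more bookkeeping than this, but the shape of the loop is the same.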

All of that lives in Llmdot.Core. None of it requires installing anything outside the NuGet package.

What we are honest about

CPU on modern consumer hardware is not the fastest way to run a transformer. For 1–8B quantized models the gap to a well-tuned GPU path is real but not infinite, and for many local workloads (interactive chat with a single user, occasional invocation from a worker service, on-device tool calls) it is comfortably good enough. For batched serving of large models across many users, it is not the right tool, and llmdot is not the right tool either. That is a non-goal and we say so out loud.

The phrase that captures the position is “optimize for the common case.” The common case we have in mind is a developer reaching for a small quantized model inside an application that does other things — a desktop app, an ASP.NET Core endpoint, a worker service, a CLI. That developer wants the install to be dotnet add package. They want the first token to arrive in a reasonable time. They do not want to debug GPU driver versions in a containerized deployment, and they do not want to ship 200 MB of vendor SDK with their application.

Incremental, additive acceleration

The position is not “CPU forever.” It is “CPU is the default, and acceleration is additive, not foundational.” Concretely, every tensor operation in InferenceEngine routes through IComputeBackend. The interface covers MatMul, MatMulF32, RmsNorm, LayerNorm, ApplyRoPE (both overloads), Softmax, Silu, SiluInPlace, Gelu, GeluScalar, Add, Scale, Mul, Softcap, Conv1D, DequantizeToFloat, and ArgMax. A backend that wants to offload one operation can; a backend that wants to offload many can; the engine does not care.
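One way to picture "offload one operation, delegate the rest" is a backend that wraps another backend. The sketch below uses a deliberately tiny stand-in interface, ISketchBackend, with three hypothetical method signatures; it is not the real IComputeBackend, and the "offloaded" Scale is ordinary CPU code with a call counter so the example runs anywhere.

```csharp
using System;

// Hypothetical stand-in for the real contract; signatures are assumptions.
public interface ISketchBackend
{
    void Add(ReadOnlySpan<float> a, ReadOnlySpan<float> b, Span<float> dst);
    void Scale(ReadOnlySpan<float> src, float factor, Span<float> dst);
    int ArgMax(ReadOnlySpan<float> values);
}

// Reference CPU implementation of every operation.
public sealed class CpuSketchBackend : ISketchBackend
{
    public void Add(ReadOnlySpan<float> a, ReadOnlySpan<float> b, Span<float> dst)
    {
        for (int i = 0; i < a.Length; i++) dst[i] = a[i] + b[i];
    }

    public void Scale(ReadOnlySpan<float> src, float factor, Span<float> dst)
    {
        for (int i = 0; i < src.Length; i++) dst[i] = src[i] * factor;
    }

    public int ArgMax(ReadOnlySpan<float> values)
    {
        if (values.IsEmpty) throw new ArgumentException("Empty input.", nameof(values));
        int best = 0;
        for (int i = 1; i < values.Length; i++)
            if (values[i] > values[best]) best = i;
        return best;
    }
}

// Takes over exactly one operation and forwards everything else.
public sealed class PartialOffloadBackend : ISketchBackend
{
    private readonly ISketchBackend _fallback;

    public PartialOffloadBackend(ISketchBackend fallback) => _fallback = fallback;

    public int OffloadedCalls { get; private set; }

    public void Scale(ReadOnlySpan<float> src, float factor, Span<float> dst)
    {
        // A real backend would dispatch a GPU kernel here; this is a placeholder.
        OffloadedCalls++;
        for (int i = 0; i < src.Length; i++) dst[i] = src[i] * factor;
    }

    public void Add(ReadOnlySpan<float> a, ReadOnlySpan<float> b, Span<float> dst)
        => _fallback.Add(a, b, dst);

    public int ArgMax(ReadOnlySpan<float> values) => _fallback.ArgMax(values);
}
```

Because the caller only sees the interface, moving an operation from the fallback to the accelerated path changes the backend class and nothing else.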

The Metal backend, targeting Apple Silicon, calls Metal APIs from C# through Objective-C runtime P/Invoke (objc_msgSend) and compiles compute shaders from Metal Shading Language source at runtime. It dispatches RmsNorm, LayerNorm, Softmax, Add (in-place and out-of-place), Scale (in-place and out-of-place), Mul, Silu, SiluInPlace, Gelu, and MatMulF32 on the GPU. Operations below a size threshold fall back to CPU automatically. Quantized MatMul, ApplyRoPE, DequantizeToFloat, Conv1D, and ArgMax remain on CPU at present; quantized Metal kernels are remaining work.

The Vulkan backend, targeting Linux and Windows, is structurally complete: the full P/Invoke surface is in place, and SPIR-V compute shader source exists for RmsNorm, Softmax, and element-wise operations. It currently delegates to CPU, pending SPIR-V binary compilation and testing on Vulkan-capable hardware.

BackendFactory.CreateBestAvailable() tries Metal on macOS and Vulkan on Linux and Windows, then falls back to CPU if the platform's backend is not available. A user who never adds a backend package gets CPU and never has to think about it.
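The selection order described above can be sketched as a small pure function. Everything here is hypothetical (the BackendKind enum, the BackendSelectionSketch class, and the probe delegates); the real BackendFactory may be structured quite differently. The only framework calls are the OperatingSystem.Is* checks, which exist in .NET 5 and later.

```csharp
using System;

// Illustrative only: not the real BackendFactory.
public enum BackendKind { Cpu, Metal, Vulkan }

public static class BackendSelectionSketch
{
    // Pure decision logic, kept separate from OS detection so it is easy to test.
    public static BackendKind ChooseBest(
        bool isMacOS,
        bool isLinuxOrWindows,
        Func<bool> metalAvailable,
        Func<bool> vulkanAvailable)
    {
        if (isMacOS && Probe(metalAvailable)) return BackendKind.Metal;
        if (isLinuxOrWindows && Probe(vulkanAvailable)) return BackendKind.Vulkan;
        return BackendKind.Cpu; // always available, so selection never fails
    }

    public static BackendKind ChooseBestForCurrentOS(
        Func<bool> metalAvailable,
        Func<bool> vulkanAvailable)
        => ChooseBest(
            OperatingSystem.IsMacOS(),
            OperatingSystem.IsLinux() || OperatingSystem.IsWindows(),
            metalAvailable,
            vulkanAvailable);

    // A probe that throws (for example because a native library is missing)
    // is treated the same as "not available".
    private static bool Probe(Func<bool> probe)
    {
        try { return probe(); }
        catch (Exception) { return false; }
    }
}
```

Treating a throwing probe as "not available" is what lets a missing driver or library degrade to CPU instead of becoming a start-up failure.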

What this rules out, on purpose

The managed-by-default position rules out a category of architectures we would otherwise have to commit to. We do not target NPUs. NPUs are graph execution engines that require compiled subgraphs with static shapes, which is fundamentally incompatible with LLM inference patterns (dynamic KV cache, custom attention) and with our thin per-operation backend adapter design. NPUs also share system RAM bandwidth with the CPU, so they provide no throughput advantage for memory-bound decode. The architecture doc covers this in detail. We are not running a quiet “NPU is coming” subplot.

It also rules out making CPU optional. There is no version of llmdot where you must install a GPU backend to load a model. The CPU path is the contract.

Why this is the actual product advantage

When we describe llmdot to a .NET developer, the line that lands is not “fast CPU kernels.” It is “the install is dotnet add package Llmdot.Core and the deploy is dotnet publish.” Every other property of the runtime — GGUF-native ingestion, four execution templates, the optional backend story, multimodal as a pluggable encoder — is in service of that line staying true as the project grows.

CPU as the headline path is what makes that line possible. The fallback framing has it exactly backwards.


Managed-by-default: why CPU is the headline path, not the fallback