Run local GGUF language models from .NET.
One core package. One model format. One programming model. A managed-by-default runtime that loads GGUF files directly and streams tokens through idiomatic .NET APIs — without a native toolchain, without ONNX conversion, without a sidecar service.
Load a GGUF file. Stream tokens.
The target API is small on purpose. No conversion step, no native wrapper to install,
no sidecar to run — just LlmModel.LoadAsync against a GGUF file on disk.
using Llmdot;
await using var model = await LlmModel.LoadAsync("phi-3-mini-q4_k_m.gguf");
await using var session = model.CreateChatSession();
await foreach (var token in session.StreamAsync("Explain GGUF in one paragraph."))
    Console.Write(token);

Sample reflects the target API shape. See the roadmap for current implementation status.
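To make "loads GGUF files directly" concrete, here is a minimal sketch of reading the fixed-size header at the start of a GGUF file using only the base class library. It is not llmdot's loader: `GgufHeader` and `GgufHeaderReader` are hypothetical names chosen for illustration, and the sketch assumes a little-endian file in GGUF format version 2 or later, where both counts are 64-bit.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical types for illustration only; these are not llmdot APIs.
public readonly record struct GgufHeader(uint Version, ulong TensorCount, ulong MetadataKvCount);

public static class GgufHeaderReader
{
    public static GgufHeader Read(Stream stream)
    {
        // BinaryReader decodes integers as little-endian, matching the assumed file layout.
        using var reader = new BinaryReader(stream, Encoding.UTF8, leaveOpen: true);

        // A GGUF file begins with the four ASCII bytes "GGUF".
        byte[] magic = reader.ReadBytes(4);
        if (magic.Length != 4 || Encoding.ASCII.GetString(magic) != "GGUF")
            throw new InvalidDataException("Not a GGUF file: bad magic bytes.");

        uint version = reader.ReadUInt32();
        if (version < 2)
            throw new InvalidDataException("This sketch only handles GGUF version 2 or later.");

        // Number of tensors, then number of metadata key/value pairs.
        ulong tensorCount = reader.ReadUInt64();
        ulong metadataKvCount = reader.ReadUInt64();
        return new GgufHeader(version, tensorCount, metadataKvCount);
    }
}
```

The metadata key/value pairs and tensor descriptors follow this header. A full loader would continue reading from the same stream position.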
A third option for .NET.
Today the .NET local-inference path forces a choice between native
llama.cpp bindings (broad models, native packaging debt) and ONNX stacks
(strong acceleration, conversion friction). llmdot takes a third position.
GGUF-native
Load community models directly. No conversion pipeline, no proprietary packaging step.
Pure managed core
The default install is managed .NET. Trimming-friendly. NativeAOT-friendly. Single-file publish-friendly.
Idiomatic APIs
IAsyncEnumerable<T> streaming, DI, and Microsoft.Extensions.Hosting — the .NET you already write.
Config-driven architectures
New model families plug into four execution templates resolved from GGUF metadata — zero engine code per family.
Common case first
1–8B quantized models on consumer hardware. Small enough to fit, big enough to matter.
Incremental acceleration
Optional Vulkan and Metal backends offload individual operations — no all-or-nothing graph rewrites.
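The streaming model named in the list above can be sketched without any llmdot types. The example below uses a stand-in token source (`FakeTokensAsync`, a hypothetical helper, not part of llmdot) to show how an `IAsyncEnumerable<string>` token stream would typically be consumed with cancellation and an early stop. It illustrates the general .NET pattern only and makes no claim about how llmdot sessions are implemented.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

public static class StreamingSketch
{
    // Hypothetical stand-in for a model's token stream; a real session would decode tokens here.
    public static async IAsyncEnumerable<string> FakeTokensAsync(
        string[] tokens,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        foreach (var token in tokens)
        {
            cancellationToken.ThrowIfCancellationRequested();
            await Task.Yield(); // simulate asynchronous work between tokens
            yield return token;
        }
    }

    // Consumes any token stream and stops once maxTokens tokens have been read.
    public static async Task<string> CollectAsync(
        IAsyncEnumerable<string> stream,
        int maxTokens,
        CancellationToken cancellationToken = default)
    {
        var text = new StringBuilder();
        var count = 0;
        await foreach (var token in stream.WithCancellation(cancellationToken))
        {
            text.Append(token);
            if (++count >= maxTokens)
                break; // leaving the loop disposes the underlying enumerator
        }
        return text.ToString();
    }
}
```

Because the consumer depends only on `IAsyncEnumerable<string>`, the same loop would work against any token source that exposes that interface.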
Four templates. All 1–8B families.
Every supported architecture collapses into one of four execution templates.
Variation is expressed through a TransformerConfig resolved from GGUF metadata —
not through conditional branches on architecture strings.
| Template | Architectures | Example models |
|---|---|---|
| LLaMA-like sequential pre-norm | llama, phi3, qwen2, stablelm, mistral | LLaMA-3.2, Qwen-2, Phi-3, Mistral-7B |
| GPT-NeoX-like parallel residual | gptneox, phi2 | Pythia, Phi-2 |
| Gemma-like embedding scaling + post-norm | gemma, gemma2 | Gemma 2B, Gemma-2 2B/9B |
| LFM2-like hybrid conv-attention | lfm2, lfm2_moe | LFM2 350M–2.6B, LFM2-VL |
Multimodal variants (vision via SigLIP2, audio via FastConformer/Mimi) plug in as modality encoders on top of the base LLM backbone — the core runtime is unchanged.
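One way to picture "config, not conditional branches" is a lookup table plus a metadata-driven record. The sketch below is an assumption-laden illustration, not llmdot's implementation: the fields of `TransformerConfig`, the `ExecutionTemplate` enum, and `TransformerConfigResolver` are hypothetical. The metadata keys (`general.architecture`, `{arch}.block_count`, `{arch}.embedding_length`, `{arch}.attention.head_count`, `{arch}.attention.head_count_kv`) follow GGUF's standard naming, and the architecture-to-template mapping is copied from the table above.

```csharp
using System;
using System.Collections.Generic;

public enum ExecutionTemplate { LlamaLike, GptNeoXLike, GemmaLike, Lfm2Like }

// Hypothetical shape; the real TransformerConfig may carry different fields.
public sealed record TransformerConfig(
    ExecutionTemplate Template, int BlockCount, int EmbeddingLength, int HeadCount, int KvHeadCount);

public static class TransformerConfigResolver
{
    // Data rather than control flow: supporting a new family means adding a row here.
    private static readonly Dictionary<string, ExecutionTemplate> Templates = new()
    {
        ["llama"] = ExecutionTemplate.LlamaLike,
        ["phi3"] = ExecutionTemplate.LlamaLike,
        ["qwen2"] = ExecutionTemplate.LlamaLike,
        ["stablelm"] = ExecutionTemplate.LlamaLike,
        ["mistral"] = ExecutionTemplate.LlamaLike,
        ["gptneox"] = ExecutionTemplate.GptNeoXLike,
        ["phi2"] = ExecutionTemplate.GptNeoXLike,
        ["gemma"] = ExecutionTemplate.GemmaLike,
        ["gemma2"] = ExecutionTemplate.GemmaLike,
        ["lfm2"] = ExecutionTemplate.Lfm2Like,
        ["lfm2_moe"] = ExecutionTemplate.Lfm2Like,
    };

    public static TransformerConfig Resolve(IReadOnlyDictionary<string, object> metadata)
    {
        var arch = (string)metadata["general.architecture"];
        if (!Templates.TryGetValue(arch, out var template))
            throw new NotSupportedException($"Unsupported architecture '{arch}'.");

        // Hyperparameters live under keys prefixed with the architecture name.
        int Get(string key) => Convert.ToInt32(metadata[$"{arch}.{key}"]);

        int headCount = Get("attention.head_count");
        // When head_count_kv is absent, assume one KV head per attention head.
        int kvHeadCount = metadata.TryGetValue($"{arch}.attention.head_count_kv", out var kv)
            ? Convert.ToInt32(kv)
            : headCount;

        return new TransformerConfig(
            template, Get("block_count"), Get("embedding_length"), headCount, kvHeadCount);
    }
}
```

The engine would then dispatch on `Template` once, and every remaining difference between families would be a value in the config rather than a code path.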
One required package. Everything else is opt-in.
| Package | Purpose | Dependencies |
|---|---|---|
| Llmdot.Core | GGUF loader, model graph, CPU backend, sampling, tokenizer | Pure managed .NET |
| Llmdot.Extensions.AI | IChatClient + Microsoft.Extensions.AI integration | Llmdot.Core |
| Llmdot.Backends.Vulkan (planned) | Vulkan compute acceleration | Native Vulkan loader |
| Llmdot.Backends.Metal (planned) | Metal compute (Apple Silicon) | Native Metal |
| Llmdot.Multimodal.Vision (planned) | SigLIP2 vision encoder + connector | Llmdot.Core |
Recent writing
- Managed-by-default: why CPU is the headline path, not the fallback
  Most .NET inference stories start with a GPU and treat CPU as the leftover. llmdot starts the other way around, and the resulting deployment story is the actual product advantage.
- Four execution templates for every 1–8B model we care about
  How a small config-driven design collapses the modern decoder zoo into four execution templates, and why that matters for a .NET runtime that wants to stay small.
- Why we picked GGUF as the ingestion format for .NET
  GGUF is what the open model community actually publishes. For a .NET inference runtime, picking it as the primary format eliminates a class of problems before code is written.