.NET 8 / 9 / 10 · Pre-alpha

Run local GGUF language models from .NET.

One core package. One model format. One programming model. A managed-by-default runtime that loads GGUF files directly and streams tokens through idiomatic .NET APIs — without a native toolchain, without ONNX conversion, without a sidecar service.

Hello, world

Load a GGUF file. Stream tokens.

The target API is small on purpose. No conversion step, no native wrapper to install, no sidecar to run — just LlmModel.LoadAsync against a GGUF file on disk.

using Llmdot;

await using var model = await LlmModel.LoadAsync("phi-3-mini-q4_k_m.gguf");
await using var session = model.CreateChatSession();

await foreach (var token in session.StreamAsync("Explain GGUF in one paragraph."))
    Console.Write(token);

Sample reflects the target API shape. See the roadmap for current implementation status.

Why llmdot

A third option for .NET.

Today the .NET local-inference path forces a choice between native llama.cpp bindings (broad models, native packaging debt) and ONNX stacks (strong acceleration, conversion friction). llmdot takes a third position.

GGUF-native

Load community models directly. No conversion pipeline, no proprietary packaging step.

Pure managed core

The default install is managed .NET. Trimming-friendly. NativeAOT-friendly. Single-file publish-friendly.

Idiomatic APIs

IAsyncEnumerable<T> streaming, DI, and Microsoft.Extensions.Hosting — the .NET you already write.

Config-driven architectures

New model families plug into four execution templates resolved from GGUF metadata — zero engine code per family.

Common case first

1–8B quantized models on consumer hardware. Small enough to fit, big enough to matter.

Incremental acceleration

Optional Vulkan and Metal backends offload individual operations — no all-or-nothing graph rewrites.

Architectures

Four templates. All 1–8B families.

Every supported architecture collapses into one of four execution templates. Variation is expressed through a TransformerConfig resolved from GGUF metadata — not through conditional branches on architecture strings.

TemplateArchitecturesExample models
LLaMA-like
sequential pre-norm
llama, phi3, qwen2, stablelm, mistralLLaMA-3.2, Qwen-2, Phi-3, Mistral-7B
GPT-NeoX-like
parallel residual
gptneox, phi2Pythia, Phi-2
Gemma-like
embedding scaling + post-norm
gemma, gemma2Gemma 2B, Gemma-2 2B/9B
LFM2-like
hybrid conv-attention
lfm2, lfm2_moeLFM2 350M–2.6B, LFM2-VL

Multimodal variants (vision via SigLIP2, audio via FastConformer/Mimi) plug in as modality encoders on top of the base LLM backbone — the core runtime is unchanged.

Packaging

One required package. Everything else is opt-in.

PackagePurposeDependencies
Llmdot.CoreGGUF loader, model graph, CPU backend, sampling, tokenizerPure managed .NET
Llmdot.Extensions.AIIChatClient + Microsoft.Extensions.AI integrationLlmdot.Core
Llmdot.Backends.Vulkan plannedVulkan compute accelerationNative Vulkan loader
Llmdot.Backends.Metal plannedMetal compute (Apple Silicon)Native Metal
Llmdot.Multimodal.Vision plannedSigLIP2 vision encoder + connectorLlmdot.Core
From the blog

Recent writing

All posts →