.NET 8 / 9 / 10 · Pre-alpha · MIT

Run local GGUF language models from .NET.

Name: llmdot
Author: Cognisoc

One core package. One model format. One programming model. A managed-by-default runtime that loads GGUF files directly and streams tokens through idiomatic .NET APIs — no native toolchain, no ONNX conversion, no sidecar service.


> dotnet add package Llmdot.Core

Program.cs

using Llmdot;

// Load a GGUF file. Stream tokens. That's the whole story.
await using var model = await LlmModel.LoadAsync("phi-3-mini-q4_k_m.gguf");
await using var session = model.CreateChatSession();

await foreach (var token in session.StreamAsync(
    "Explain GGUF in one paragraph."))
{
    Console.Write(token);
}

What is llmdot?

llmdot is a .NET-native runtime for local GGUF language model inference. It loads community GGUF models directly and executes decoder-only and hybrid architectures in the 1–8B range through four architecture-agnostic execution templates resolved from GGUF metadata at load time. The default install is pure managed .NET with zero native runtime dependencies; optional Vulkan and Metal backends add GPU acceleration when you want it.
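
To make "loads GGUF directly" concrete, here is a minimal sketch of reading the fixed-size GGUF header in managed code alone. It is not llmdot's loader: the `GgufHeader` and `GgufHeaderReader` names are hypothetical, and the layout assumed is the little-endian header of GGUF versions 2 and 3 (magic bytes, version, tensor count, metadata key/value count).

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical type for illustration; not part of llmdot.
// Holds the fixed-size prefix of a GGUF file (versions 2 and 3).
public readonly record struct GgufHeader(uint Version, ulong TensorCount, ulong MetadataKvCount);

public static class GgufHeaderReader
{
    public static GgufHeader Read(Stream stream)
    {
        // BinaryReader always decodes integers as little-endian,
        // which matches the default GGUF byte order.
        using var reader = new BinaryReader(stream, Encoding.UTF8, leaveOpen: true);

        // The first four bytes must be the ASCII characters "GGUF".
        byte[] magic = reader.ReadBytes(4);
        if (magic.Length != 4 || Encoding.ASCII.GetString(magic) != "GGUF")
            throw new InvalidDataException("Not a GGUF file: bad magic bytes.");

        uint version = reader.ReadUInt32();
        if (version < 2)
            throw new NotSupportedException($"GGUF version {version} uses a different header layout.");

        ulong tensorCount = reader.ReadUInt64();      // number of tensors in the file
        ulong metadataKvCount = reader.ReadUInt64();  // number of metadata key/value pairs
        return new GgufHeader(version, tensorCount, metadataKvCount);
    }
}
```

Calling `GgufHeaderReader.Read(File.OpenRead(path))` returns the counts a loader would need before walking the metadata section. Everything after the header (metadata values, tensor descriptors, aligned tensor data) is out of scope for this sketch.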

Why llmdot

A third option for .NET local inference.

Today the .NET local-inference path forces a choice between native llama.cpp bindings and ONNX stacks. Each carries a tax. llmdot takes a third position: the deployment story is the product.

Option A

Native llama.cpp bindings

Broad model coverage, strong performance
Native packaging debt: per-RID binaries, P/Invoke, marshalling
Trimming and single-file publish get complicated

Option B

ONNX Runtime stacks

Excellent hardware acceleration
Conversion friction: export + optimize each model to ONNX
You leave the GGUF ecosystem behind

llmdot

Managed-by-default

Load GGUF directly — no conversion, no native toolchain
Pure managed core: trimming-, NativeAOT-, single-file-friendly
Optional GPU backends offload ops incrementally

Built for .NET developers

One package. The .NET you already write.

GGUF-native

Load community models straight off disk. No conversion pipeline, no proprietary packaging step, no ONNX export.

Pure managed core

The default install is managed .NET with zero native runtime dependencies. Trimming-, NativeAOT-, and single-file-publish friendly.

Idiomatic APIs

IAsyncEnumerable<T> streaming, DI, and Microsoft.Extensions.Hosting — the .NET you already write.
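
As a rough illustration of what `IAsyncEnumerable<T>` streaming buys you, the sketch below caps and cancels a token stream using only standard .NET constructs. `FakeTokensAsync` is a hypothetical stand-in for a model's token stream, not an llmdot API; `CollectAsync` works with any `IAsyncEnumerable<string>`.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

public static class StreamingDemo
{
    // Hypothetical stand-in for a model's token stream; not an llmdot API.
    public static async IAsyncEnumerable<string> FakeTokensAsync(
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        foreach (var token in new[] { "GGUF ", "is ", "a ", "file ", "format." })
        {
            cancellationToken.ThrowIfCancellationRequested();
            await Task.Yield(); // simulate asynchronous decoding work
            yield return token;
        }
    }

    // Collects at most maxTokens tokens from any async token stream.
    public static async Task<string> CollectAsync(
        IAsyncEnumerable<string> tokens,
        int maxTokens,
        CancellationToken cancellationToken = default)
    {
        if (maxTokens <= 0) return string.Empty;

        var text = new StringBuilder();
        int count = 0;
        await foreach (var token in tokens.WithCancellation(cancellationToken))
        {
            text.Append(token);
            // Breaking out disposes the enumerator, which stops the producer.
            if (++count >= maxTokens) break;
        }
        return text.ToString();
    }
}
```

Because the stream is pull-based, leaving the loop early or cancelling the token ends generation without any extra protocol between caller and producer.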

Config-driven architectures

New families plug into four execution templates resolved from GGUF metadata — zero engine code per architecture.

Common case first

1–8B quantized models on consumer hardware. Small enough to fit, big enough to matter.
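
A back-of-the-envelope sketch of why this range fits consumer hardware. The figure of 4.5 bits per weight is an assumption standing in for a typical 4-bit quantization plus its scale overhead, not a number from llmdot, and the estimate covers weights only (no KV cache or activations).

```csharp
using System;

public static class ModelSizeEstimate
{
    private const double BytesPerGiB = 1024.0 * 1024.0 * 1024.0;

    // Approximate weight storage in GiB: parameters * bits / 8 bytes.
    public static double WeightGiB(double parameters, double bitsPerWeight)
        => parameters * bitsPerWeight / 8.0 / BytesPerGiB;

    public static void Main()
    {
        // Assumed average bits per weight for a 4-bit quantized model.
        const double quantizedBits = 4.5;

        foreach (double parameters in new[] { 1e9, 8e9 })
        {
            Console.WriteLine(
                $"{parameters / 1e9:0}B params: " +
                $"{WeightGiB(parameters, quantizedBits):0.0} GiB quantized, " +
                $"{WeightGiB(parameters, 16):0.0} GiB at 16-bit");
        }
    }
}
```

Under these assumptions an 8B model needs roughly 4.2 GiB for quantized weights versus about 14.9 GiB at 16-bit precision, which is the difference between fitting in a typical laptop's memory and not.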

Incremental acceleration

Optional Vulkan and Metal backends offload individual operations — no all-or-nothing graph rewrites.

See all features →

Architectures

Four templates. All 1–8B families.

Every supported architecture collapses into one of four execution templates. Variation is expressed through a TransformerConfig resolved from GGUF metadata — not conditional branches on architecture strings.

Template	Architectures	Example models
LLaMA-like sequential pre-norm	`llama, phi3, qwen2, stablelm, mistral`	Llama 3.2, Qwen2, Phi-3, Mistral-7B
GPT-NeoX-like parallel residual	`gptneox, phi2`	Pythia, Phi-2
Gemma-like embedding scaling + post-norm	`gemma, gemma2`	Gemma 2B, Gemma-2 2B/9B
LFM2-like hybrid conv-attention	`lfm2, lfm2_moe`	LFM2 350M–2.6B, LFM2-VL

Multimodal variants (vision via SigLIP2) plug in as modality encoders on top of the base LLM backbone; the core runtime is unchanged.
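
The table above can be read as a lookup from the GGUF `general.architecture` metadata value to a template. The sketch below shows one way that resolution might look; `ExecutionTemplate` and `TemplateResolver` are hypothetical names, not llmdot types, and metadata is simplified to a string dictionary.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical names for illustration; not llmdot's actual types.
public enum ExecutionTemplate
{
    LlamaLike,    // sequential pre-norm
    GptNeoXLike,  // parallel residual
    GemmaLike,    // embedding scaling + post-norm
    Lfm2Like      // hybrid conv-attention
}

public static class TemplateResolver
{
    // One table entry per architecture string listed in the table above.
    private static readonly Dictionary<string, ExecutionTemplate> ByArchitecture =
        new(StringComparer.Ordinal)
        {
            ["llama"] = ExecutionTemplate.LlamaLike,
            ["phi3"] = ExecutionTemplate.LlamaLike,
            ["qwen2"] = ExecutionTemplate.LlamaLike,
            ["stablelm"] = ExecutionTemplate.LlamaLike,
            ["mistral"] = ExecutionTemplate.LlamaLike,
            ["gptneox"] = ExecutionTemplate.GptNeoXLike,
            ["phi2"] = ExecutionTemplate.GptNeoXLike,
            ["gemma"] = ExecutionTemplate.GemmaLike,
            ["gemma2"] = ExecutionTemplate.GemmaLike,
            ["lfm2"] = ExecutionTemplate.Lfm2Like,
            ["lfm2_moe"] = ExecutionTemplate.Lfm2Like,
        };

    // Resolves a template from already-parsed GGUF metadata.
    public static ExecutionTemplate Resolve(IReadOnlyDictionary<string, string> metadata)
    {
        // "general.architecture" is the standard GGUF key naming the model family.
        if (!metadata.TryGetValue("general.architecture", out var architecture))
            throw new KeyNotFoundException("GGUF metadata has no general.architecture key.");

        if (!ByArchitecture.TryGetValue(architecture, out var template))
            throw new NotSupportedException($"No execution template for '{architecture}'.");

        return template;
    }
}
```

In a design like this, the architecture string is consulted once at load time; per-family differences such as head counts or normalization epsilon would live in the resolved configuration rather than in engine branches.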

4 execution templates · 1–8B parameter range · 0 native deps (default) · 3 target frameworks · MIT open source

Packaging

One required package. Everything else is opt-in.

Package	Purpose	Dependencies
`Llmdot.Core`	GGUF loader, model graph, CPU backend, sampling, tokenizer	Pure managed .NET
`Llmdot.Extensions.AI`	`IChatClient` + `Microsoft.Extensions.AI` integration	`Llmdot.Core`
`Llmdot.Backends.Vulkan` (planned)	Vulkan compute acceleration	Native Vulkan loader
`Llmdot.Backends.Metal` (planned)	Metal compute (Apple Silicon)	Native Metal
`Llmdot.Multimodal.Vision` (planned)	SigLIP2 vision encoder + connector	`Llmdot.Core`

From the blog

Recent writing

May 6, 2026

Managed-by-default: why CPU is the headline path, not the fallback

Most .NET inference stories start with a GPU and treat CPU as the leftover. llmdot starts the other way around, and the deployment story is the advantage.

Apr 22, 2026

Four execution templates for every 1–8B model we care about

How a config-driven design collapses the modern decoder zoo into four execution templates, and why that matters for a .NET runtime that stays small.

Apr 8, 2026

Why we picked GGUF as the ingestion format for .NET

GGUF is what the open model community actually publishes. For a .NET runtime, picking it as the primary format eliminates a class of problems up front.

All posts →

Ship local, private inference in your .NET app.

llmdot is open source (MIT) and pre-alpha. Follow the quickstart, load a GGUF file, and stream tokens through the .NET APIs you already know.


Part of the Cognisoc local-inference stack.