What it is
llmdot is a native .NET runtime for local language model inference, built around the GGUF model format. It executes major decoder-only transformer and hybrid architectures in the 1–8B parameter range — including multimodal variants — through architecture-agnostic execution templates resolved from GGUF metadata at load time.
The default path is pure managed code with zero native runtime dependencies, focused on CPU-first execution. Optional packages provide GPU acceleration through thin backend adapters.
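To make "resolved from GGUF metadata at load time" concrete, here is a minimal sketch of the first step any GGUF loader performs: reading the fixed-size file header in pure managed code. It assumes GGUF version 2 or later, where the tensor and metadata counts are 64-bit. `GgufHeader` and `GgufHeaderReader` are hypothetical names used for illustration; they are not llmdot's API.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical types for illustration only; not part of llmdot.
public readonly record struct GgufHeader(uint Version, ulong TensorCount, ulong MetadataKvCount);

public static class GgufHeaderReader
{
    // The ASCII bytes 'G','G','U','F' interpreted as a little-endian uint32.
    private const uint Magic = 0x46554747;

    public static GgufHeader Read(Stream stream)
    {
        // BinaryReader always decodes integers as little-endian,
        // which matches the default GGUF byte order.
        using var reader = new BinaryReader(stream, Encoding.UTF8, leaveOpen: true);

        uint magic = reader.ReadUInt32();
        if (magic != Magic)
            throw new InvalidDataException("Not a GGUF file: unexpected magic number.");

        uint version = reader.ReadUInt32();
        if (version < 2)
            throw new NotSupportedException("This sketch assumes GGUF v2+ (64-bit counts).");

        ulong tensorCount = reader.ReadUInt64();     // number of tensor descriptors
        ulong metadataKvCount = reader.ReadUInt64(); // number of metadata key/value pairs

        return new GgufHeader(version, tensorCount, metadataKvCount);
    }
}
```

The metadata key/value pairs that follow this header are where a loader would find the architecture name and hyperparameters used to select an execution template.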
What it is for
llmdot is for .NET developers shipping local, private, or offline AI features without fighting the inference stack — desktop, edge, server, and worker workloads where packaging simplicity, deployment predictability, and platform portability matter as much as raw throughput. If you have ever thought “I just want to load a GGUF file in my ASP.NET Core app and stream tokens”, llmdot is built for you.
What it is not
- It is not the fastest inference engine on every hardware target.
- It is not a replacement for vendor-optimized GPU runtimes for large-scale serving.
- It does not require ONNX conversion or proprietary model packaging.
- It does not target frontier-scale (70B+) models as an early milestone.
- It does not target NPUs. NPUs are driven through vendor graph compilers rather than exposed as programmable compute. See the architecture doc for the reasoning.
Status
Pre-alpha. The specification, architecture, and execution template design are stable. Implementation is in active development. Do not use in production yet.
Track progress in the roadmap.
Supported runtimes
Target frameworks are net8.0, net9.0, and net10.0; net8.0 (LTS) is the compatibility floor. The codebase builds with Nullable enabled, warnings treated as errors, and LangVersion=13.0.
Optional GPU backends target Metal on Apple Silicon and Vulkan on Linux and Windows. Both are dispatched per-operation through an IComputeBackend contract; CPU remains the default fallback.
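One way per-operation dispatch with a CPU fallback could look is sketched below. Only the name `IComputeBackend` comes from this README. Its members (`Name`, `Supports`), the `Op` enum, and the `BackendDispatcher` class are assumptions made for illustration and may not match the real contract.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical operation kinds; the real set is defined by the architecture doc.
public enum Op { MatMul, Softmax, RmsNorm }

// The interface name is from the README; these members are assumed.
public interface IComputeBackend
{
    string Name { get; }
    bool Supports(Op op);
}

public sealed class CpuBackend : IComputeBackend
{
    public string Name => "cpu";
    public bool Supports(Op op) => true; // CPU handles every operation
}

// Stand-in for a GPU adapter that accelerates only some operations.
public sealed class PartialGpuBackend : IComputeBackend
{
    private readonly HashSet<Op> _supported;

    public PartialGpuBackend(string name, IEnumerable<Op> supported)
    {
        Name = name;
        _supported = new HashSet<Op>(supported);
    }

    public string Name { get; }
    public bool Supports(Op op) => _supported.Contains(op);
}

public sealed class BackendDispatcher
{
    private readonly List<IComputeBackend> _preferred;
    private readonly IComputeBackend _cpu;

    public BackendDispatcher(IEnumerable<IComputeBackend> preferred, IComputeBackend cpu)
    {
        _preferred = new List<IComputeBackend>(preferred);
        _cpu = cpu;
    }

    // Pick the first preferred backend that supports the operation;
    // otherwise fall back to the CPU backend.
    public IComputeBackend Resolve(Op op)
    {
        foreach (var backend in _preferred)
        {
            if (backend.Supports(op)) return backend;
        }
        return _cpu;
    }
}
```

Resolving per operation rather than per model means a GPU adapter can cover only the kernels it implements, while everything else stays on the managed CPU path.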
License & contributing
MIT-licensed. Design feedback is welcome — please read the vision and architecture documents before opening an issue. The areas most useful to contribute to right now are GGUF quantization format coverage, managed kernel optimization, tokenizer correctness across BPE variants, and test fixtures for additional model families.