When people first encounter the list of model families llmdot supports — LLaMA-3.2, Qwen-2, Phi-3, Phi-2, Pythia, StableLM, Mistral, Gemma, Gemma-2, LFM2, LFM2-MoE, plus multimodal variants — the first reaction is usually some version of “that is a lot of architecture-specific code to maintain.” If we were taking that list literally and writing one execution path per family, that reaction would be correct. We are not. The runtime collapses all of those families into four execution templates, and within each template, all variation is expressed through a TransformerConfig resolved from GGUF metadata at load time. There are no conditional branches on architecture strings in the engine.
This post is the longer explanation of why that works, what the four templates are, and what the abstraction buys us in practice.
The taxonomy
Here are the four templates, with the architectures and example models that live inside each.
LLaMA-like — sequential pre-norm. This covers llama, phi3, qwen2, stablelm, and mistral. Example models are LLaMA-3.2, Qwen-2, Phi-3, Mistral-7B, and StableLM-2. The hallmark is the standard modern decoder block: pre-norm, attention with rotary position embeddings, residual, pre-norm again, SwiGLU feedforward, residual.
GPT-NeoX-like — parallel residual. This covers gptneox and phi2. Example models are Pythia and Phi-2. The hallmark is parallel residual structure (attention and FFN computed from the same normalized input, then summed) and a fused QKV projection with partial rotary application.
Gemma-like — embedding scaling plus post-norm. This covers gemma and gemma2. Example models are Gemma 2B and Gemma-2 in 2B/9B sizes. The hallmark is multiplying token embeddings by sqrt(d_model) on the way in, an additional post-norm on the way out of each sub-block, and optional softcapping on logits and attention scores for Gemma-2.
LFM2-like — hybrid convolution-attention. This covers lfm2 and lfm2_moe. Example models are LFM2 in 350M through 2.6B sizes, LFM2-VL, and LFM2-8B-A1B. The hallmark is alternating convolution and attention blocks, where the per-layer type is itself a config value, plus double-gated 1D causal convolution blocks and grouped-query attention.
Four templates cover the full 1–8B catalog we are designing for. Multimodal variants — vision-language via SigLIP2, speech via FastConformer or Mimi — plug in as modality encoders on top of the base LLM backbone. The core runtime is unchanged.
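As a rough illustration of the taxonomy, the mapping from architecture string to template can be a plain lookup table that lives in the resolver, which is one way to keep architecture strings out of the engine. The sketch below is written in Python for brevity and is not llmdot's actual code; Template, ARCH_TO_TEMPLATE, and template_for are hypothetical names, while the architecture strings are the ones listed above.

```python
from enum import Enum
from typing import Optional


class Template(Enum):
    LLAMA_LIKE = "llama-like"        # sequential pre-norm
    GPT_NEOX_LIKE = "gpt-neox-like"  # parallel residual
    GEMMA_LIKE = "gemma-like"        # embedding scaling plus post-norm
    LFM2_LIKE = "lfm2-like"          # hybrid convolution-attention


# Architecture strings grouped by execution template, as listed in the post.
ARCH_TO_TEMPLATE = {
    "llama": Template.LLAMA_LIKE,
    "phi3": Template.LLAMA_LIKE,
    "qwen2": Template.LLAMA_LIKE,
    "stablelm": Template.LLAMA_LIKE,
    "mistral": Template.LLAMA_LIKE,
    "gptneox": Template.GPT_NEOX_LIKE,
    "phi2": Template.GPT_NEOX_LIKE,
    "gemma": Template.GEMMA_LIKE,
    "gemma2": Template.GEMMA_LIKE,
    "lfm2": Template.LFM2_LIKE,
    "lfm2_moe": Template.LFM2_LIKE,
}


def template_for(architecture: str) -> Optional[Template]:
    """Return the execution template for an architecture string, or None if unknown."""
    return ARCH_TO_TEMPLATE.get(architecture)
```

An unknown architecture returns None rather than raising, so the caller can decide how to report it.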
Why “templates” and not “implementations”
The reason this approach works is that the differences across modern decoder architectures are increasingly parametric, not structural. Two LLaMA-like models differ in head counts, hidden size, intermediate size, layer count, rotary base frequency, vocab size, the presence or absence of sliding window attention, the presence or absence of grouped-query attention, the activation function used in the FFN, and a handful of similar dials. They do not differ in what blocks run in what order. The execution graph is the same.
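Grouped-query attention is a convenient example of a difference that is parametric and not structural. A minimal sketch, assuming the common convention that consecutive query heads share one key/value head: the same function covers plain multi-head attention (where the two head counts are equal) and grouped-query attention (where they are not). The function name kv_head_for is hypothetical and the code is illustrative Python, not the runtime's own.

```python
def kv_head_for(query_head: int, n_heads: int, n_kv_heads: int) -> int:
    """Map a query head index to the key/value head it reads from.

    With n_kv_heads == n_heads this is the identity (multi-head attention).
    With n_kv_heads < n_heads, consecutive query heads share one KV head
    (grouped-query attention).
    """
    if n_kv_heads <= 0 or n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be a positive multiple of n_kv_heads")
    if not 0 <= query_head < n_heads:
        raise ValueError("query_head out of range")
    group_size = n_heads // n_kv_heads
    return query_head // group_size
```

With 32 query heads and 8 KV heads, query heads 4 through 7 all read KV head 1. With 32 and 32, every query head reads its own KV head. Nothing in the attention code needs to know which case it is in.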
The leverage comes from pushing every per-model difference into a config object and writing one execution path per template that reads from it. We call that config TransformerConfig. It is resolved from GGUF metadata at load time, and it carries every dial the engine needs: norm type, attention type, FFN type, QKV layout, parallel residual flag, embedding scaling factor, MoE presence (parsed even if execution is deferred), sliding window size, softcapping presence, and, for hybrid models, the per-layer type vector that tells the engine “layer 0 is conv, layer 1 is conv, layer 2 is attention, …”.
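To make the idea concrete, here is a minimal sketch of a config object resolved from a metadata dictionary. It is illustrative Python, not the real TransformerConfig: the field names, the resolve_config function, and the small subset of dials shown are assumptions. The key names follow the GGUF convention of prefixing per-model keys with the architecture string as we understand it, and the fallback values are placeholders chosen for the sketch.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class TransformerConfig:
    # A small, hypothetical subset of the dials described above.
    architecture: str
    n_layers: int
    d_model: int
    n_heads: int
    n_kv_heads: int
    rope_freq_base: float
    sliding_window: Optional[int]  # None means full attention
    has_moe: bool                  # parsed even though routing is deferred


def resolve_config(meta: Mapping[str, Any]) -> TransformerConfig:
    """Build a config from GGUF-style metadata (modeled here as a plain mapping)."""
    arch = meta["general.architecture"]

    def required(key: str) -> Any:
        full = f"{arch}.{key}"
        if full not in meta:
            raise KeyError(f"missing required metadata key: {full}")
        return meta[full]

    def optional(key: str, default: Any) -> Any:
        return meta.get(f"{arch}.{key}", default)

    n_heads = int(required("attention.head_count"))
    return TransformerConfig(
        architecture=arch,
        n_layers=int(required("block_count")),
        d_model=int(required("embedding_length")),
        n_heads=n_heads,
        # Assumption: a missing KV head count means plain multi-head attention.
        n_kv_heads=int(optional("attention.head_count_kv", n_heads)),
        # Assumption: 10000.0 is used as the fallback rotary base.
        rope_freq_base=float(optional("rope.freq_base", 10000.0)),
        sliding_window=optional("attention.sliding_window", None),
        has_moe=int(optional("expert_count", 0)) > 0,
    )
```

The config is frozen on purpose in this sketch: once resolved at load time, nothing in the execution path should be able to mutate it.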
When a new family appears in the wild, the work is almost always in at most two places: the config resolver, which learns to read the new metadata keys, and, when the new family fuses or splits tensors differently, the tensor name resolver. The execution graph itself does not change. That is the design property we are optimizing for.
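The tensor-name side of that work can be sketched as a check on which tensors a file actually contains. This is an illustrative Python sketch under stated assumptions, not llmdot's resolver: qkv_layout is a hypothetical helper, and the tensor names follow the blk.N.attn_* naming commonly seen in GGUF files, which should be checked against the files you actually load.

```python
from typing import Collection


def qkv_layout(tensor_names: Collection[str], layer: int) -> str:
    """Report whether a layer stores Q, K, V as one fused tensor or three split ones."""
    prefix = f"blk.{layer}."
    if prefix + "attn_qkv.weight" in tensor_names:
        return "fused"
    if all(prefix + f"attn_{p}.weight" in tensor_names for p in ("q", "k", "v")):
        return "split"
    raise KeyError(f"layer {layer}: no recognizable Q/K/V projection tensors")
```

The result would feed the QKV layout dial in the config, so the engine reads a value and never inspects tensor names itself.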
What this looks like in practice
For an existing family, adding support is configuration work. Adding LLaMA-3.2 Instruct after LLaMA-3.2 Base, for example, is a chat template change — the model graph is identical. Adding Phi-3 after LLaMA was a question of confirming that the metadata mapped onto the LLaMA-like template.
For a new family with a genuinely new structural pattern, the work is bounded by which template it most resembles. Gemma-2 brought softcapping into the picture — that landed as a config-driven gate inside the Gemma-like template, not a new family branch. LFM2 brought hybrid conv-attention — that landed as the fourth template, with the per-layer type vector dispatched at run time.
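Both of those additions can be sketched in a few lines. The first function is the usual tanh form of softcapping, gated by an optional config value. The second dispatches on a per-layer type vector through a lookup table. This is illustrative Python with hypothetical names (softcap, run_layers, and the kernel table), not the engine's code, and the kernels here are stand-ins for real convolution and attention blocks.

```python
import math
from typing import Callable, List, Mapping, Optional, Sequence


def softcap(x: float, cap: Optional[float]) -> float:
    """Bound x to at most cap in magnitude; pass it through when cap is None."""
    if cap is None:
        return x  # softcapping disabled for this model
    return cap * math.tanh(x / cap)


def run_layers(
    x: List[str],
    layer_types: Sequence[str],
    kernels: Mapping[str, Callable[[List[str]], List[str]]],
) -> List[str]:
    """Run one kernel per layer, chosen by that layer's entry in the type vector."""
    for index, kind in enumerate(layer_types):
        kernel = kernels.get(kind)
        if kernel is None:
            raise ValueError(f"layer {index}: unknown layer type {kind!r}")
        x = kernel(x)
    return x
```

A model without softcapping simply resolves the cap to None, and a pure-attention model is a type vector in which every entry is "attention". Neither case needs its own code path.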
A useful test for this design is “what happens when an unsupported model is loaded.” The answer should be a useful diagnostic, not a crash. The model capability inspection surface — ModelCapabilities exposing architecture, template, attention type, MoE status, sliding window, softcapping — gives the engine a place to report what it read and what it cannot run, with enough specificity that the failure is actionable.
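A minimal sketch of what such a diagnostic surface might look like follows. The fields mirror the ones named above, but the shape of the type, the unsupported_reasons helper, and the message wording are assumptions made for illustration in Python, not the real ModelCapabilities API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ModelCapabilities:
    architecture: str
    template: Optional[str]        # None when no execution template matches
    attention_type: str
    has_moe: bool
    sliding_window: Optional[int]
    softcapping: bool


def unsupported_reasons(caps: ModelCapabilities) -> List[str]:
    """List every reason the model cannot run; an empty list means it can."""
    reasons: List[str] = []
    if caps.template is None:
        reasons.append(
            f"architecture '{caps.architecture}' does not map to any execution template"
        )
    if caps.has_moe:
        reasons.append(
            "model declares MoE experts; expert routing is not implemented yet"
        )
    return reasons
```

Collecting every reason before reporting, instead of stopping at the first, is what lets the message say both what was read and what cannot be run.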
What the abstraction buys at the .NET level
There is a second benefit that matters specifically in a managed-.NET context. Because the execution path is parametric, it is small. The hot loop is the same code regardless of which decoder family is running — the same matmul, the same RMS norm, the same RoPE application, the same softmax, the same SwiGLU. That is the code we are interested in optimizing. SIMD-accelerating RmsNorm once benefits every model. Tightening the dequantize path once benefits every model. Adding a Metal compute path for the matmul once benefits every model.
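For reference, this is the kind of small kernel in question. The sketch below is a scalar RMS norm in Python using the standard definition; it is not the runtime's implementation, and the function name and the default epsilon are assumptions. A scalar version like this is also the natural reference against which a SIMD version would be checked.

```python
import math
from typing import List, Sequence


def rms_norm(x: Sequence[float], weight: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Scale x by the reciprocal of its root mean square, then by a per-element weight."""
    if len(x) == 0 or len(x) != len(weight):
        raise ValueError("x and weight must be non-empty and the same length")
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [v * scale * w for v, w in zip(x, weight)]
```

The function takes only a vector, a weight, and an epsilon. It has no notion of which model family is calling it, which is why one optimization pays off everywhere it is used.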
The alternative — a per-family execution path — would mean that every kernel optimization had to be ported across N implementations, with N drift opportunities and N test surfaces. For a small team shipping a .NET-native runtime, that is the wrong trade. The four-template design is what makes the project tractable.
Where this leaves multimodal
Multimodal variants follow the same logic at one level up. A vision encoder (SigLIP2 for current vision-language models) and an audio encoder (FastConformer or Mimi for current speech models) produce embedding tensors that join the language model’s input embedding stream at sentinel-token positions defined in the tokenizer. The base LLM backbone — whichever of the four templates it lives in — runs unchanged. The modality encoder is a pluggable component, not a fork of the engine.
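The joining step can be sketched as a simple splice. This is illustrative Python under stated assumptions, not llmdot's code: splice_embeddings and its parameters are hypothetical, and the sketch assumes one encoder output vector per sentinel token, consumed in order.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def splice_embeddings(
    token_ids: Sequence[int],
    embed_token: Callable[[int], Vector],
    modality_embeddings: Sequence[Vector],
    sentinel_id: int,
) -> List[Vector]:
    """Build the backbone's input stream, replacing sentinel tokens with encoder outputs."""
    remaining = iter(modality_embeddings)
    stream: List[Vector] = []
    for token in token_ids:
        if token == sentinel_id:
            try:
                stream.append(next(remaining))
            except StopIteration:
                raise ValueError("more sentinel tokens than encoder embeddings") from None
        else:
            stream.append(embed_token(token))
    if next(remaining, None) is not None:
        raise ValueError("more encoder embeddings than sentinel tokens")
    return stream
```

The backbone receives a list of vectors and cannot tell which ones came from the token embedding table and which came from an encoder.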
That is what lets us say “small multimodal models work” without committing to a new execution template per modality. The execution template is for the language backbone. The modality is upstream of it.
What this rules out
The four-template design rules out architectures that are not decoder-only transformers or hybrid conv-attention variants of them. Encoder-decoder models (T5-style) and encoder-only models (BERT-style) are deliberately deferred. The runtime is not designed to grow into them — we would rather stay small and tightly focused on the family that matters for local inference today than spread thin across every transformer ever proposed.
It also rules out, for now, full MoE expert routing at execution time. The metadata is parsed in Phase 1 — the engine knows when a model has MoE structure — but the expert routing path itself is deferred work. We would rather ship dense execution correctly across all four templates first.
The summary
Four execution templates, one config object resolved from GGUF metadata, no architecture branches in the engine, modality encoders as pluggable components on top of the language backbone. That is the entire model-support story for llmdot. When a new 1–8B family appears, the question we ask is “which template?” not “what new code do we write?” That is the design choice that makes the runtime small enough to live cleanly inside a normal .NET application — and it is the choice that decides what the next several phases of the project look like.