Mansplain: Running local LLMs

Just give me the link

Someone recently asked me if they could run Qwen 3.6 on their laptop. I got a bit carried away with my answer, but I think it might have value to someone, so here it is. It’s a sort of distilled crash course in helping you navigate the world of local model hosting.

TL;DR

You can trade model precision for model size to fit the best possible model into your existing hardware budget. The question becomes: is the biggest version of a model that I can practically run good enough to complete tasks reliably? The answer, roughly: with ~24GB of unified memory or a 24GB GPU you'll get good results on simple agentic tasks, but not on long-running 'spec driven development' workloads.

I’ve added some practical instructions after the mansplain if you just want to rip’n’run.

BEGIN MANSPLAIN

Level 1: What you already know

Model providers release a range of model sizes for each generation

Model trainers will typically release several sizes of each model. The size is the parameter count - how many weights the model has - which correlates with the number of layers, and in turn with both capability and memory footprint. Practically, the sweet spot for local use is the 25-35b parameter range. Smaller models are still useful, but for much more constrained tasks like autocomplete.

You can tell the model generation from the version number after the provider's name, and the parameter count from the number after that - eg Qwen3.6-27b.

Level 2: What you probably don’t know

The community provides variants of each model generation-size at various levels of precision to run models in constrained environments

When a model trainer releases a model, they tend to release it at full float precision (typically signified by the suffix '_FP'). This means the outputs of each node in the neural network are emitted as full-precision floats, which are then fed into downstream nodes. As you probably know, these occupy more bytes than other numerical types. More bytes = higher information fidelity = better quality outputs. BUT token generation speed is memory-bandwidth constrained, so more bytes also means slower inference.

There are many hobbyists who post-process these full-float models, truncating the node output data types to smaller numerical types. This is known as quantisation. The result is a model that occupies less disk space and less memory when running, but that also pushes less precise values through the neural net, lowering model performance. The quantisation bit size is signified by the '_QX' suffix - so Gemma_Q4 is quantised to 4-bit values. I've attached a screenshot of how quantisation can degrade a model's performance. Q8 is near-perfect; Q4 is probably as low as you want to go for non-toy workloads.

Practically, what it boils down to is that a 27b dense model like Qwen at full precision needs about 60GB of memory (RAM, VRAM, or unified memory if you're on Apple). That's enterprise GPU territory. But a quantised Q4 version of the same model runs in ~17GB of memory - that's a six-year-old RTX 3090 - and you'll get >90% of the model's performance. And because each inference pushes only ~17GB of data through the memory channels, as opposed to ~60GB, you also get a really nice speed bump. There are different file formats for models, but the de facto standard is GGUF, which is nice and portable.
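The memory arithmetic above is worth being able to do yourself. Here's a back-of-envelope sketch - the bits-per-weight figures are rough community averages (a '4-bit' quant lands around ~4.5 bits per weight once you account for scaling metadata), and the flat overhead for runtime buffers is a guess, not a spec:

```python
# Rough memory estimate for a dense model at a given quantisation level.
# Assumptions: bits_per_weight is an average across the whole network,
# and overhead_gb is an illustrative allowance for runtime buffers
# (KV cache excluded - that scales with context length).

def model_memory_gb(params_billions: float, bits_per_weight: float,
                    overhead_gb: float = 2.0) -> float:
    """Approximate resident memory: weights plus flat runtime overhead."""
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes / 1e9 + overhead_gb

# A 27b dense model:
print(round(model_memory_gb(27, 16), 1))   # 16-bit "full precision"
print(round(model_memory_gb(27, 4.5), 1))  # ~Q4
```

Run it and the two numbers land near the ~60GB and ~17GB figures quoted above, which is a decent sanity check that the trade-off really is just bytes-per-weight.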

You should now understand what the model name Qwen3.6-27B-Q4.gguf means. It is a file you can download.

Level 3: What you kinda know but not really

There are lots of ways to run a model, but under the hood, there are actually not that many

To actually run a model, you need a model runtime. Common names in this space are ollama, LM Studio, llama.cpp, GPT4All and LLamafile.

These are typically open source projects that load the model, hook it up to the GPU with the right OS kernel modules activated, and give you an interface to interact with. These runtimes often need to bake in specific optimisations for each newly released model. So you'll often see a new model drop, then within, say, 24 hours a whole slew of community-provided quantised versions, and within 48 hours runtime support from all the major projects. Some of it will work, but because it's so rushed, things like tool calls might not work out of the box. Re-download the models two weeks later, update your runtime, and performance dramatically improves.

Actually, under the hood, most of these are UI-and-sane-defaults wrappers around either llama.cpp or vLLM. Both have their pros and cons; I only know llama.cpp. Many people have success with the wrappers - they abstract away a lot of the scary command-line stuff involved in setting up llama.cpp or vLLM yourself. But we are programmers, and cloning a repo and running make are not hard.

There are a lot of knobs to tweak in these runtimes. They store the attention keys and values for each token in a cache (the KV cache), and you can quantise that cache to squeeze as much context window onto your hardware as you'd like - at a precision cost, of course. There are things like flash attention, memory mapping, repeat penalty, and nucleus sampling. This can be a rabbithole, but we are not LLM researchers - google what works for your model, grab the first answer off reddit, and just run those parameters.
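To see why the KV cache is worth quantising, you can estimate its size: one key and one value vector per token, per layer, per KV head. A sketch - the architecture numbers below are purely illustrative, so look up your actual model's config:

```python
# Approximate KV cache size. Assumes one K and one V vector per token,
# per layer, per KV head; bytes_per_value=2 is FP16, and quantising the
# cache to 8-bit halves the figure.

def kv_cache_gb(ctx_tokens: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    return (2 * ctx_tokens * n_layers * n_kv_heads * head_dim
            * bytes_per_value / 1e9)

# Illustrative numbers only - not any real model's config:
print(round(kv_cache_gb(30000, 48, 8, 128), 2))
```

Even at these made-up dimensions a 30k context eats several GB on top of the weights, which is why cache quantisation buys you real context window.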

END MANSPLAIN

If you want to just ‘get up and running’ then the Unsloth quantised models are a really popular community choice, and they recently created their own runtime, Unsloth Studio, that makes running a local model trivial. The Qwen 3.6 guide can be found here: https://unsloth.ai/docs/models/qwen3.6. Remember: runtime support and community model quality typically improve dramatically over the 2 weeks from a model release date, so if it disappoints today, then redownload in 14 days and your mileage may improve.

Here is the full set of commands I use to compile llama.cpp. I do this frequently to benefit from enhanced support for the models I run. Note the build flags - this is hardware dependent.

git -C llama.cpp pull
cmake llama.cpp -B llama.cpp/build     -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON  -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build llama.cpp/build --config Release -j --clean-first

And here is the command I use to run Gemma-4 E2B Q4 via the llama-server runtime in llama.cpp. Llama-server exposes an OpenAI completions API at the specified port, and you can point your agent at that. Don’t worry about the mmproj parameter - that’s if you want audio transcription support.

/llms/llama.cpp/build/bin/llama-server \
--ctx-size 30000 \
-m /llms/unsloth/gemma-4-E2B-it-GGUF/gemma-4-E2B-it-Q4_1.gguf \
--jinja   \
-t 48  \
-a unsloth/gemma-4-E2B-it-GGUF \
--mmproj /llms/unsloth/gemma-4-E2B-it-GGUF/mmproj-F16.gguf \
--no-mmproj-offload \
--host 0.0.0.0 \
--port 8082 \
-fa on \
--api-key redacted \
--temp 1.0 \
--top-p 0.95 \
--top-k 64 \
--repeat-penalty 1.1 --repeat-last-n 256
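Since llama-server speaks the OpenAI chat-completions wire format, pointing anything at it is just a matter of building a standard request. A minimal stdlib-only sketch - the host, port, and API key mirror the command above, so adjust them to your deployment:

```python
import json
import urllib.request

# Build an OpenAI-style chat-completions request against the llama-server
# instance started above. Host/port/key match that command; change to taste.
def build_chat_request(prompt: str, host: str = "localhost", port: int = 8082,
                       api_key: str = "redacted"):
    url = f"http://{host}:{port}/v1/chat/completions"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
    }
    return url, headers, payload

def send(prompt: str) -> str:
    url, headers, payload = build_chat_request(prompt)
    req = urllib.request.Request(url, data=json.dumps(payload).encode(),
                                 headers=headers)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# print(send("Explain quantisation in one sentence."))  # needs the server running
```

Any agent framework that accepts an OpenAI-compatible base URL can be pointed at the same endpoint instead.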