Home-made LLM Recipe
While many of you might expect a blog post about Windows vulnerability research and exploit development, we have to manage expectations: this one is about LLMs. Not because everyone is jumping on the AI bandwagon and we felt the urge to follow, but because this space is also about things we like and experiment with; it isn’t groundbreaking, we just believe it might be interesting to others.
It’s no secret that, lately, I (VoidSec) have spent a good amount of time figuring out how to use local LLMs for the team here at CF, to help with report writing, code completion, and querying the knowledge base of all the documentation the team has written over the past years.
Today I would like to give you a peek into my high-level process of setting up a homemade local LLM platform, which we are currently running as an internal pilot project: how we did it, which setup and technologies we used, and so on.
Hardware
While deciding which hardware to use for this pet project, we reviewed dedicated LLM hardware, GPU racks, and anything capable enough to run such models. Given that our research workstations have limited dedicated GPUs, we were unable to run large models, and performance was nowhere near acceptable in terms of speed and token generation rate.
While GPUs seem to be the standard, dedicated racks are somewhat expensive, and for a pilot project we couldn’t justify new purchases, so we opted to test a 2022 Mac Studio that was sitting idle in the office.
Dell XPS:
- CPU: 13th Gen Intel Core i9-13900H
- RAM: 64 GB
- GPU: NVIDIA GeForce RTX 4070
- OS: Windows 11 Pro x64
Mac Studio 2022:
- CPU: M1 Ultra
- RAM: 128 GB
- GPU: 64-core
- OS: macOS 26.1
When testing the hardware, we used the following two prompts across the different models:
- “Write me a poem about the moon”
- “Write a Python binary tree lookup function with a sample”
Both are open-ended enough to expose the models’ “thinking” process, where present.
For the local setup of the models, we opted for Ollama, which can be installed directly from its main website without further complications. We then spent some time selecting the models we wanted to test: some small enough to run on our workstation for comparison, and some bigger ones that only the Mac Studio’s RAM allowed us to load.
We selected the following:
- llama3.1:8b
- qwen3:8b
- qwen2.5-coder:3b
- deepseek-r1:8b
- deepseek-r1:70b
- gpt-oss:120b
- llama3.3:70b
- qwen3-coder:30b
Models can be pulled via the command line: `ollama pull llama3.1:8b`
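For convenience, here is a minimal Python sketch (the script name is ours) that simply shells out to the same `ollama pull` command for every tag we ended up testing:

```python
# pull_models.py (our naming) - pulls every model we benchmarked via the same
# "ollama pull" CLI command shown above.
import subprocess

MODELS = [
    "llama3.1:8b",
    "qwen3:8b",
    "qwen2.5-coder:3b",
    "deepseek-r1:8b",
    "deepseek-r1:70b",
    "gpt-oss:120b",
    "llama3.3:70b",
    "qwen3-coder:30b",
]

for model in MODELS:
    print(f"[*] pulling {model} ...")
    # check=True aborts the loop on the first failed pull
    subprocess.run(["ollama", "pull", model], check=True)
```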
RAM Requirements per Billion Parameters
A rough rule of thumb is 1GB of RAM per billion parameters, give or take.
GB of RAM for Q8-quantised models: Q8 uses 8 bits (1 byte) per parameter, so roughly 1 GB per billion parameters, not counting the extra memory needed to run inference (context/KV cache and so on). Q4 is half that usage, FP16 is double, etc. A small sanity-check script follows the table below.
| LLM Size | Q8 (GB) |
| --- | --- |
| 3B | 3.3 |
| 8B | 7.7 |
| 33B | 36.3 |
| 70B | 77.0 |
| 123B | 135.3 |
| 205B | 225.5 |
| 405B | 445.5 |
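As a quick sanity check of the rule of thumb, a minimal sketch of our own approximation: raw weight size plus a ~10% overhead factor, which reproduces most of the Q8 column above (the 8B row in the table is slightly lower in practice):

```python
# ram_estimate.py (our naming) - back-of-the-envelope RAM estimate for
# quantised models: raw weights plus an assumed ~10% overhead for inference.
def estimate_ram_gb(params_billion: float, bits_per_param: int = 8,
                    overhead: float = 1.10) -> float:
    """Raw weights = params * bits/8 bytes; overhead covers KV cache etc."""
    weights_gb = params_billion * bits_per_param / 8
    return round(weights_gb * overhead, 1)

for size in (3, 8, 33, 70, 123, 205, 405):
    print(f"{size:>3}B  Q8 ~ {estimate_ram_gb(size):6.1f} GB   "
          f"Q4 ~ {estimate_ram_gb(size, 4):6.1f} GB   "
          f"FP16 ~ {estimate_ram_gb(size, 16):6.1f} GB")
```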
Benchmark
Using `ollama run <model> --verbose`, we can retrieve some information and statistics about the provided prompt and its evaluation (a script to collect the same numbers programmatically is sketched after the list), specifically:
- total duration (s): total time from when the request is issued to when the model finishes generating the output (smaller is better)
- load duration (ms): how long it took to load the model into memory (smaller is better)
- prompt eval count (tokens): how many tokens were processed from the provided prompt
- prompt eval duration (ms): how long the model took to read and process the prompt (smaller is better)
- prompt eval rate (tokens/s): how fast the model processed the tokens of the prompt (higher is better)
- eval count (tokens): how many tokens the model generated in its response
- eval duration (s): the total time it took to generate the output tokens
- eval rate (tokens/s): speed of generation (higher is better)
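The same statistics are also exposed by Ollama’s REST API: `/api/generate` returns `total_duration`, `load_duration`, `prompt_eval_count`, `prompt_eval_duration`, `eval_count` and `eval_duration` (durations in nanoseconds). A minimal sketch, assuming the default local endpoint and one of the models above (the script name is ours):

```python
# bench.py (our naming) - collects the same statistics through Ollama's REST
# API instead of reading the --verbose output by hand.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint

def bench(model: str, prompt: str) -> dict:
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    ns = 1e9  # durations are returned in nanoseconds
    return {
        "total duration (s)": data["total_duration"] / ns,
        "load duration (ms)": data["load_duration"] / 1e6,
        "prompt eval count (tokens)": data["prompt_eval_count"],
        "prompt eval duration (ms)": data["prompt_eval_duration"] / 1e6,
        "prompt eval rate (tokens/s)": data["prompt_eval_count"] / (data["prompt_eval_duration"] / ns),
        "eval count (tokens)": data["eval_count"],
        "eval duration (s)": data["eval_duration"] / ns,
        "eval rate (tokens/s)": data["eval_count"] / (data["eval_duration"] / ns),
    }

if __name__ == "__main__":
    for name, value in bench("llama3.1:8b", "Write me a poem about the moon").items():
        print(f"{name:>30}: {value:.2f}" if isinstance(value, float) else f"{name:>30}: {value}")
```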
Comparison
Delta = Mac Studio 2022 − Dell XPS (for durations, negative means the Mac Studio is faster).

llama3.1:8b

| Metric | Dell XPS | Mac Studio 2022 | Delta |
| --- | --- | --- | --- |
| total duration (s) | 15.3619781 | 3.29049775 | -12.0715 |
| load duration (ms) | 106.8099 | 101.352916 | -5.45698 |
| prompt eval count (tokens) | 17 | 17 | |
| prompt eval duration (ms) | 876.9471 | 622.781666 | -254.165 |
| prompt eval rate (tokens/s) | 19.39 | 27.3 | 7.91 |
| eval count (tokens) | 156 | 184 | 28 |
| eval duration (s) | 14.2318621 | 2.495499444 | -11.7364 |
| eval rate (tokens/s) | 10.96 | 73.73 | 62.77 |

qwen3:8b

| Metric | Dell XPS | Mac Studio 2022 | Delta |
| --- | --- | --- | --- |
| total duration (s) | 61.4240039 | 8.48243875 | -52.9416 |
| load duration (ms) | 177.8828 | 91.713958 | -86.1688 |
| prompt eval count (tokens) | 17 | 17 | |
| prompt eval duration (ms) | 677.5161 | 262.631042 | -414.885 |
| prompt eval rate (tokens/s) | 25.09 | 64.73 | 39.64 |
| eval count (tokens) | 701 | 573 | -128 |
| eval duration (s) | 60.1850603 | 8.019101906 | -52.166 |
| eval rate (tokens/s) | 11.65 | 71.45 | 59.8 |

qwen2.5-coder:3b

| Metric | Dell XPS | Mac Studio 2022 | Delta |
| --- | --- | --- | --- |
| total duration (s) | 7.867204 | 9.042249625 | 1.175046 |
| load duration (ms) | 125.2532 | 106.480875 | -18.7723 |
| prompt eval count (tokens) | 38 | 38 | |
| prompt eval duration (ms) | 300.7288 | 145.467917 | -155.261 |
| prompt eval rate (tokens/s) | 126.36 | 261.23 | 134.87 |
| eval count (tokens) | 541 | 513 | -28 |
| eval duration (s) | 6.4093473 | 7.911069649 | 1.501722 |
| eval rate (tokens/s) | 84.41 | 64.85 | -19.56 |

deepseek-r1:8b

| Metric | Dell XPS | Mac Studio 2022 | Delta |
| --- | --- | --- | --- |
| total duration (s) | 72.0693476 | 95.18973533 | 23.12039 |
| load duration (ms) | 92.4653 | 101.297667 | 8.832367 |
| prompt eval count (tokens) | 11 | 11 | |
| prompt eval duration (ms) | 2.0887423 | 475.781125 | 473.6924 |
| prompt eval rate (tokens/s) | 5.27 | 23.12 | 17.85 |
| eval count (tokens) | 619 | 6166 | 5547 |
| eval duration (s) | 69.6013878 | 93.4009973 | 23.79961 |
| eval rate (tokens/s) | 8.89 | 66.02 | 57.13 |
Larger models evaluated on Mac Studio only:
| Metric | deepseek-r1:70b | gpt-oss:120b | llama3.3:70b | qwen3-coder:30b |
| --- | --- | --- | --- | --- |
| total duration (s) | 228.4271719 | 39.81334175 | 39.09012222 | 16.55513133 |
| load duration (ms) | 99.152208 | 157.683958 | 101.430583 | 88.789084 |
| prompt eval count (tokens) | 12 | 74 | 17 | 17 |
| prompt eval duration (ms) | 977.806583 | 4.502564917 | 2.720329166 | 393.031417 |
| prompt eval rate (tokens/s) | 12.27 | 16.44 | 6.25 | 43.25 |
| eval count (tokens) | 2086 | 329 | 185 | 720 |
| eval duration (s) | 226.6527123 | 35.01930734 | 36.19925901 | 15.86309835 |
| eval rate (tokens/s) | 9.2 | 9.39 | 5.11 | 45.39 |
Based on the collected performance metrics, the Mac Studio 2022 clearly outperformed our workstation on almost every measurement: while model load times are similar, the total duration drops significantly, and both prompt evaluation (roughly 1.4x to over 4x faster) and generation throughput (roughly 6-7x faster on the 8B models) favour the Mac Studio for real-time workloads and interactive development. The only exception is the small qwen2.5-coder:3b, whose generation throughput was higher on the Dell’s RTX 4070. The Mac Studio is also the only machine of the two able to load the larger models, thanks to its 128 GB of unified memory, and Apple’s high memory bandwidth keeps even those models usable.
Concurrency Limitation
By default, our setup with out-of-the-box tools doesn’t handle concurrency. Specifically, the unified memory doesn’t leave room for multiple models to run simultaneously, and each submitted request must be fully served before the next one is processed (reasoning models play a huge role here, as a model stuck “thinking” blocks the queue for everyone else). For us, that’s not a big problem, as our team size allows it and we don’t use the LLMs constantly, but it might become a pain the more we rely on them, especially for code completion.
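You can observe this queueing behaviour with a quick test: a minimal sketch (ours, reusing the default endpoint from the benchmark script above) that fires the two test prompts at the same time and times them.

```python
# queue_check.py (our naming) - submits the two test prompts concurrently; with
# our out-of-the-box setup the second request only completes after the first
# has finished generating, i.e. requests are effectively served one by one.
import json
import threading
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint
MODEL = "llama3.1:8b"

def ask(tag: str, prompt: str) -> None:
    start = time.time()
    payload = json.dumps({"model": MODEL, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()  # we only care about timing, not the generated text
    print(f"[{tag}] finished after {time.time() - start:.1f}s")

threads = [
    threading.Thread(target=ask, args=("first", "Write me a poem about the moon")),
    threading.Thread(target=ask, args=("second", "Write a Python binary tree lookup function with a sample")),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

For completeness, Ollama does expose the OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS environment variables to tune this, but on a single box every extra parallel request or loaded model still competes for the same unified memory.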
Cost
While we were lucky to already have the hardware, we cannot ignore the costs. A new Mac Studio costs 5,500 USD for the 96 GB RAM configuration and up to 10,000 USD for the 512 GB one (which, in theory, should let you load pretty much any model, and should remain competitive for a couple of years, especially considering the cost).
A single Nvidia GPU alone can run from 2,500 to 5,000 USD, without considering the rest of the hardware, and enterprise GPU pricing remains staggering, with units close to 30,000-40,000 USD.
A very viable alternative is local marketplaces (e.g., Facebook Marketplace, eBay) and refurbished Apple hardware (especially if you live in the US, where that market seems much larger than in Europe). Also keep the door open to connecting multiple units together with a setup similar to this one: Mac Studio Cluster via MLX.
Platform
I admit that I haven’t spent much time researching which platforms can perform all the tasks we intended, but for ease of use, out-of-the-box setup, multi-user support, knowledge-base features, and overall functionality, we relied on Open WebUI.
The setup is pretty straightforward via brew and uvx, and we followed this nicely put-together guide. The only modification was to the automatic startup: we daemonised the services so they run at boot.
Code Completion
While we initially used Cursor for quick scripting and prototyping, we’re trying to replace it with Continue, configured to use our local LLMs. Though maybe it’s just ease of use and the fact that I’ve gotten used to it, Cursor still feels considerably better in terms of usability and results.
Knowledge Base
I’ve tested the “Knowledge Base” feature of Open WebUI and the related “Document Embedding” of AnythingLLM, both of which promise an easy way to build an internal company knowledge base and index our documents.
However, after some testing, I’m not particularly impressed by either. Both struggle to pull data from our company knowledge and blend it with the model’s internal knowledge, most of the time giving mixed results, partial answers, or failing to address the question at all, especially when dates past the model’s knowledge cut-off are involved.
I’m not sure whether that’s because both are essentially RAG wrappers around an LLM, where the underlying model isn’t sandboxed from its own knowledge but merely steered by system prompts, or because both rely on chunking, which loses hierarchical structure, references, and temporal context; the biggest issue for me, though, is inconsistent retrieval.
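To make the chunking point concrete, here is a toy sketch (entirely hypothetical report text, and not the actual pipeline of either tool): a naive fixed-size splitter produces chunks in which a finding is no longer attached to the report title or date it belongs to, exactly the kind of lost context that inconsistent retrieval then stumbles over.

```python
# chunking_demo.py (illustrative only) - why naive fixed-size chunking hurts
# retrieval: the chunk containing a finding loses the report title and date.
DOCUMENT = (
    "Report: Internal Pentest 2025-03\n"           # hypothetical document
    "Scope: internal Active Directory environment.\n"
    "Finding 1: weak service account password allowed Kerberoasting.\n"
    "Finding 2: SMB signing not enforced on several hosts.\n"
)

def naive_chunks(text: str, size: int = 80) -> list[str]:
    """Split into fixed-size character chunks with no awareness of structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]

for i, chunk in enumerate(naive_chunks(DOCUMENT)):
    print(f"--- chunk {i} ---\n{chunk!r}\n")
# A query like "what did the 2025-03 pentest find?" may retrieve only a later
# chunk, which no longer carries the report name or date (the temporal-context
# problem mentioned above).
```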
IMHO, these features are not yet robust enough for information retrieval, but I hope they will be updated, as I think they might be a game-changer in the future.
Technology Stack
- Hardware: Mac Studio 2022
- Software: Open WebUI, Ollama