Home-made LLM Recipe
While many of you might expect a blog post about Windows vulnerability research and exploit development, we have to manage expectations: this one is about LLMs. Not because everyone is jumping on the AI bandwagon and we felt the urge to follow, but because this space is also about things we like and experiment with; it isn’t groundbreaking, we just believe it might be interesting to others.
It’s no secret that, lately, I (VoidSec) have spent a good amount of time figuring out how to use local LLMs for the team here at CF, to help with report writing, code completion, and querying the knowledge base of all the documentation the team has written over the past years.
Today I would like to give you a peek into my high-level process of setting up a homemade local LLM platform, which we are currently running as an internal pilot project: how we did it, which setup and technologies we used, and so on.
Hardware
While deciding which hardware to use for this pet project, we reviewed dedicated LLM hardware, GPU racks, and anything capable enough to run such models. Given that our research workstations have limited dedicated GPUs, we were unable to run large models, and performance was nowhere near acceptable in terms of speed and token generation rate.
While GPUs seem to be the standard, dedicated racks are somewhat expensive, and for a pilot project we couldn’t justify new purchases, so we opted to test a 2022 Mac Studio that was sitting idle in the office.
Dell XPS:
- CPU: 13th Gen Intel Core i9-13900H
- RAM: 64 GB
- GPU: NVIDIA GeForce RTX 4070
- OS: Windows 11 Pro x64
Mac Studio 2022:
- CPU: M1 Ultra
- RAM: 128 GB
- GPU: 64-core
- OS: macOS 26.1
When testing the hardware, we used the following two prompts across the different models:
- “Write me a poem about the moon”
- “Write a Python binary tree lookup function with a sample”
Both are open-ended enough to expose the models’ “thinking” process, where present.
For the local setup of the models, we opted for Ollama, which can be installed directly from its main website without further complications. We then spent some time selecting the models we wanted to test: some small enough to run on our workstation for comparison, and some bigger ones that only the Mac Studio’s RAM allowed us to load.
We selected the following:
- llama3.1:8b
- qwen3:8b
- qwen2.5-coder:3b
- deepseek-r1:8b
- deepseek-r1:70b
- gpt-oss:120b
- llama3.3:70b
- qwen3-coder:30b
Models can be pulled via the command line: `ollama pull llama3.1:8b`
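For convenience, here is a minimal Python sketch (the script name is ours) that simply shells out to the same `ollama pull` command for every tag we ended up testing:

```python
# pull_models.py (our naming) - pulls every model we benchmarked via the same
# "ollama pull" CLI command shown above.
import subprocess

MODELS = [
    "llama3.1:8b",
    "qwen3:8b",
    "qwen2.5-coder:3b",
    "deepseek-r1:8b",
    "deepseek-r1:70b",
    "gpt-oss:120b",
    "llama3.3:70b",
    "qwen3-coder:30b",
]

for model in MODELS:
    print(f"[*] pulling {model} ...")
    # check=True aborts the loop on the first failed pull
    subprocess.run(["ollama", "pull", model], check=True)
```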
RAM Requirements per Billion Parameters
A rough rule of thumb is 1GB of RAM per billion parameters, give or take.
GB of RAM for Q8-quantised models: Q8 uses 8 bits (1 byte) per parameter, so roughly 1 GB per billion parameters, not counting the extra memory needed to run inference (context/KV cache and so on). Q4 is half that usage, FP16 is double, etc. A small sanity-check script follows the table below.
| LLM Size | Q8 (GB) |
| --- | --- |
| 3B | 3.3 |
| 8B | 7.7 |
| 33B | 36.3 |
| 70B | 77.0 |
| 123B | 135.3 |
| 205B | 225.5 |
| 405B | 445.5 |
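As a quick sanity check of the rule of thumb, a minimal sketch of our own approximation: raw weight size plus a ~10% overhead factor, which reproduces most of the Q8 column above (the 8B row in the table is slightly lower in practice):

```python
# ram_estimate.py (our naming) - back-of-the-envelope RAM estimate for
# quantised models: raw weights plus an assumed ~10% overhead for inference.
def estimate_ram_gb(params_billion: float, bits_per_param: int = 8,
                    overhead: float = 1.10) -> float:
    """Raw weights = params * bits/8 bytes; overhead covers KV cache etc."""
    weights_gb = params_billion * bits_per_param / 8
    return round(weights_gb * overhead, 1)

for size in (3, 8, 33, 70, 123, 205, 405):
    print(f"{size:>3}B  Q8 ~ {estimate_ram_gb(size):6.1f} GB   "
          f"Q4 ~ {estimate_ram_gb(size, 4):6.1f} GB   "
          f"FP16 ~ {estimate_ram_gb(size, 16):6.1f} GB")
```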
Benchmark
Using `ollama run <model> --verbose`, we can retrieve some information and statistics about the provided prompt and its evaluation (a script to collect the same numbers programmatically is sketched after the list), specifically:
- total duration (s): total time from when the request is issued to when the model finishes generating the output (smaller is better)
- load duration (ms): how long it took to load the model into memory (smaller is better)
- prompt eval count (tokens): how many tokens were processed from the provided prompt
- prompt eval duration (ms): how long the model took to read and process the prompt (smaller is better)
- prompt eval rate (tokens/s): how fast the model processed the tokens of the prompt (higher is better)
- eval count (tokens): how many tokens the model generated in its response
- eval duration (s): the total time it took to generate the output tokens
- eval rate (tokens/s): speed of generation (higher is better)
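The same statistics are also exposed by Ollama’s REST API: `/api/generate` returns `total_duration`, `load_duration`, `prompt_eval_count`, `prompt_eval_duration`, `eval_count` and `eval_duration` (durations in nanoseconds). A minimal sketch, assuming the default local endpoint and one of the models above (the script name is ours):

```python
# bench.py (our naming) - collects the same statistics through Ollama's REST
# API instead of reading the --verbose output by hand.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint

def bench(model: str, prompt: str) -> dict:
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    ns = 1e9  # durations are returned in nanoseconds
    return {
        "total duration (s)": data["total_duration"] / ns,
        "load duration (ms)": data["load_duration"] / 1e6,
        "prompt eval count (tokens)": data["prompt_eval_count"],
        "prompt eval duration (ms)": data["prompt_eval_duration"] / 1e6,
        "prompt eval rate (tokens/s)": data["prompt_eval_count"] / (data["prompt_eval_duration"] / ns),
        "eval count (tokens)": data["eval_count"],
        "eval duration (s)": data["eval_duration"] / ns,
        "eval rate (tokens/s)": data["eval_count"] / (data["eval_duration"] / ns),
    }

if __name__ == "__main__":
    for name, value in bench("llama3.1:8b", "Write me a poem about the moon").items():
        print(f"{name:>30}: {value:.2f}" if isinstance(value, float) else f"{name:>30}: {value}")
```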
Comparison
Delta = Mac Studio 2022 − Dell XPS (for durations, negative means the Mac Studio is faster).

llama3.1:8b

| Metric | Dell XPS | Mac Studio 2022 | Delta |
| --- | --- | --- | --- |
| total duration (s) | 15.3619781 | 3.29049775 | -12.0715 |
| load duration (ms) | 106.8099 | 101.352916 | -5.45698 |
| prompt eval count (tokens) | 17 | 17 | |
| prompt eval duration (ms) | 876.9471 | 622.781666 | -254.165 |
| prompt eval rate (tokens/s) | 19.39 | 27.3 | 7.91 |
| eval count (tokens) | 156 | 184 | 28 |
| eval duration (s) | 14.2318621 | 2.495499444 | -11.7364 |
| eval rate (tokens/s) | 10.96 | 73.73 | 62.77 |

qwen3:8b

| Metric | Dell XPS | Mac Studio 2022 | Delta |
| --- | --- | --- | --- |
| total duration (s) | 61.4240039 | 8.48243875 | -52.9416 |
| load duration (ms) | 177.8828 | 91.713958 | -86.1688 |
| prompt eval count (tokens) | 17 | 17 | |
| prompt eval duration (ms) | 677.5161 | 262.631042 | -414.885 |
| prompt eval rate (tokens/s) | 25.09 | 64.73 | 39.64 |
| eval count (tokens) | 701 | 573 | -128 |
| eval duration (s) | 60.1850603 | 8.019101906 | -52.166 |
| eval rate (tokens/s) | 11.65 | 71.45 | 59.8 |

qwen2.5-coder:3b

| Metric | Dell XPS | Mac Studio 2022 | Delta |
| --- | --- | --- | --- |
| total duration (s) | 7.867204 | 9.042249625 | 1.175046 |
| load duration (ms) | 125.2532 | 106.480875 | -18.7723 |
| prompt eval count (tokens) | 38 | 38 | |
| prompt eval duration (ms) | 300.7288 | 145.467917 | -155.261 |
| prompt eval rate (tokens/s) | 126.36 | 261.23 | 134.87 |
| eval count (tokens) | 541 | 513 | -28 |
| eval duration (s) | 6.4093473 | 7.911069649 | 1.501722 |
| eval rate (tokens/s) | 84.41 | 64.85 | -19.56 |

deepseek-r1:8b

| Metric | Dell XPS | Mac Studio 2022 | Delta |
| --- | --- | --- | --- |
| total duration (s) | 72.0693476 | 95.18973533 | 23.12039 |
| load duration (ms) | 92.4653 | 101.297667 | 8.832367 |
| prompt eval count (tokens) | 11 | 11 | |
| prompt eval duration (ms) | 2.0887423 | 475.781125 | 473.6924 |
| prompt eval rate (tokens/s) | 5.27 | 23.12 | 17.85 |
| eval count (tokens) | 619 | 6166 | 5547 |
| eval duration (s) | 69.6013878 | 93.4009973 | 23.79961 |
| eval rate (tokens/s) | 8.89 | 66.02 | 57.13 |
Larger models evaluated on Mac Studio only:
| Metric | deepseek-r1:70b | gpt-oss:120b | llama3.3:70b | qwen3-coder:30b |
| --- | --- | --- | --- | --- |
| total duration (s) | 228.4271719 | 39.81334175 | 39.09012222 | 16.55513133 |
| load duration (ms) | 99.152208 | 157.683958 | 101.430583 | 88.789084 |
| prompt eval count (tokens) | 12 | 74 | 17 | 17 |
| prompt eval duration (ms) | 977.806583 | 4.502564917 | 2.720329166 | 393.031417 |
| prompt eval rate (tokens/s) | 12.27 | 16.44 | 6.25 | 43.25 |
| eval count (tokens) | 2086 | 329 | 185 | 720 |
| eval duration (s) | 226.6527123 | 35.01930734 | 36.19925901 | 15.86309835 |
| eval rate (tokens/s) | 9.2 | 9.39 | 5.11 | 45.39 |
Based on the collected performance metrics, the Mac Studio 2022 clearly outperformed our workstation on almost every measurement: while model load times are similar, the total duration drops significantly, and both prompt evaluation (roughly 1.4x to over 4x faster) and generation throughput (roughly 6-7x faster on the 8B models) favour the Mac Studio for real-time workloads and interactive development. The only exception is the small qwen2.5-coder:3b, whose generation throughput was higher on the Dell’s RTX 4070. The Mac Studio is also the only machine of the two able to load the larger models, thanks to its 128 GB of unified memory, and Apple’s high memory bandwidth keeps even those models usable.
Concurrency Limitation
By default, our setup with out-of-the-box tools doesn’t handle concurrency. Specifically, the unified memory doesn’t leave room for multiple models to run simultaneously, and each submitted request must be fully served before the next one is processed (reasoning models play a huge role here, as a model stuck “thinking” blocks the queue for everyone else). For us, that’s not a big problem, as our team size allows it and we don’t use the LLMs constantly, but it might become a pain the more we rely on them, especially for code completion.
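You can observe this queueing behaviour with a quick test: a minimal sketch (ours, reusing the default endpoint from the benchmark script above) that fires the two test prompts at the same time and times them.

```python
# queue_check.py (our naming) - submits the two test prompts concurrently; with
# our out-of-the-box setup the second request only completes after the first
# has finished generating, i.e. requests are effectively served one by one.
import json
import threading
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint
MODEL = "llama3.1:8b"

def ask(tag: str, prompt: str) -> None:
    start = time.time()
    payload = json.dumps({"model": MODEL, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()  # we only care about timing, not the generated text
    print(f"[{tag}] finished after {time.time() - start:.1f}s")

threads = [
    threading.Thread(target=ask, args=("first", "Write me a poem about the moon")),
    threading.Thread(target=ask, args=("second", "Write a Python binary tree lookup function with a sample")),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

For completeness, Ollama does expose the OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS environment variables to tune this, but on a single box every extra parallel request or loaded model still competes for the same unified memory.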
Cost
While we were lucky to already have the hardware, we cannot ignore the costs. A new Mac Studio costs 5,500 USD for the 96 GB RAM configuration and up to 10,000 USD for the 512 GB one (which, in theory, should let you load pretty much any model, and should remain competitive for a couple of years, especially considering the cost).
A single Nvidia GPU alone can run from 2,500 to 5,000 USD, without considering the rest of the hardware, and enterprise GPU pricing remains staggering, with units close to 30,000-40,000 USD.
A very viable alternative is local marketplaces (e.g., Facebook Marketplace, eBay) and refurbished Apple hardware (especially if you live in the US, where that market seems much larger than in Europe). Also keep the door open to connecting multiple units together with a setup similar to this one: Mac Studio Cluster via MLX.
Platform
I admit that I haven’t spent much time researching which platforms can perform all the tasks we intended, but for ease of use, out-of-the-box setup, multi-user support, knowledge-base features, and overall functionality, we relied on Open WebUI.
The setup is pretty straightforward via brew and uvx, and we followed this nicely put-together guide. The only modification was to the automatic startup: we daemonised the services so they run at boot.
Code Completion
While we initially used Cursor for quick scripting and prototyping, we’re trying to replace it with Continue, configured to use our local LLMs. Though maybe it’s just ease of use and the fact that I’ve gotten used to it, Cursor still feels considerably better in terms of usability and results.
Knowledge Base
I’ve tested the “Knowledge Base” feature of Open WebUI and the related “Document Embedding” of AnythingLLM, both of which promise an easy way to build an internal company knowledge base and index our documents.
However, after some testing, I’m not particularly impressed by either. Both struggle to pull data from our company knowledge and blend it with the model’s internal knowledge, most of the time giving mixed results, partial answers, or failing to address the question at all, especially when dates past the model’s knowledge cut-off are involved.
I’m not sure whether that’s because both are essentially RAG wrappers around an LLM, where the underlying model isn’t sandboxed from its own knowledge but merely steered by system prompts, or because both rely on chunking, which loses hierarchical structure, references, and temporal context; the biggest issue for me, though, is inconsistent retrieval.
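To make the chunking point concrete, here is a toy sketch (entirely hypothetical report text, and not the actual pipeline of either tool): a naive fixed-size splitter produces chunks in which a finding is no longer attached to the report title or date it belongs to, exactly the kind of lost context that inconsistent retrieval then stumbles over.

```python
# chunking_demo.py (illustrative only) - why naive fixed-size chunking hurts
# retrieval: the chunk containing a finding loses the report title and date.
DOCUMENT = (
    "Report: Internal Pentest 2025-03\n"           # hypothetical document
    "Scope: internal Active Directory environment.\n"
    "Finding 1: weak service account password allowed Kerberoasting.\n"
    "Finding 2: SMB signing not enforced on several hosts.\n"
)

def naive_chunks(text: str, size: int = 80) -> list[str]:
    """Split into fixed-size character chunks with no awareness of structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]

for i, chunk in enumerate(naive_chunks(DOCUMENT)):
    print(f"--- chunk {i} ---\n{chunk!r}\n")
# A query like "what did the 2025-03 pentest find?" may retrieve only a later
# chunk, which no longer carries the report name or date (the temporal-context
# problem mentioned above).
```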
IMHO, these features are not yet robust enough for information retrieval, but I hope they will be updated, as I think they might be a game-changer in the future.
Technology Stack
- Hardware: Mac Studio 2022
- Software: Open WebUI, Ollama