(2025-01-13) Making GenAI less horrible for the rest of us (with llamafile)
---------------------------------------------------------------------------
"Wait, what? Did ye olde Lux sell out to the hype and hoax?"
No, not really. I still hold to my point that generative AI, in its current
mainstream state, is a plague on the tech industry that's going to worsen
the overall situation for years to come. But, and there always is a "but",
there seems to be a way of actually making this technology serve the
people, not
megacorps. Even though the very first iteration of what I'm going to talk
about was created by megacorps themselves.
As someone located on pretty much the opposite end of the computing power
spectrum from your average hype-riding techbro, I tried to stay away from
the generative AI topic for as long as I could. After all, remote LLMs
definitely are a privacy nightmare, even if they state otherwise
(surprisingly enough, I started digging deeper into the topic once I saw
DuckDuckGo's "AI chat" and an unofficial Python-based CLI interface for it),
and local LLMs **usually** require hardware that's too power-hungry (not
to mention expensive) for my taste. But then, I stumbled upon something that
solved both problems at once: a set of relatively small but capable language
models AND a tool to run any of them as a server or even a purely
terminal-based chat without needing any dedicated GPU, completely on CPU and
RAM (and not a lot of it, in fact). So, after years of deliberate silence
about LLMs, I finally decided to give them a shot.
First, let's talk about the tool. Although the current chitchat is all around
Ollama, I found it to be too inconvenient for some use cases. I also
considered using bare llama.cpp, but it has too many moving parts for me to
handle just yet. Maybe next time. So, I settled upon Mozilla's llamafile
([1]), which is a very convenient wrapper around llama.cpp that can be
distributed as a single binary file across multiple OSes and even
architectures (x86_64 and ARM64; that's why it weighs over 230 MB, by the
way). The full llamafile toolkit even allows you to embed a model file and
distribute the entire thing as a single executable blob, which is how I
tried it out at first, that is, until I realized there are many more model
files than there are ready-made .llamafile executables.
Since llamafile is based upon llama.cpp, it consumes the same model file
format (GGUF); you specify the model file via the mandatory -m flag (well,
it's mandatory unless you run a prebuilt model blob). We'll get to that
format later. What matters now is that it can run in three modes: terminal
chat (--chat option), non-interactive CLI (--cli option) or a Web server
(--server option). If none of these three options is specified, it will run
in the terminal chat mode while also starting the local Web server on port
8080, bound to the 127.0.0.1 address only (which, of course, you can
override with the --port and --host parameters respectively). On one hand,
the default server UI might not appeal to everyone; on the other hand, the
very same server (also provided by llama.cpp) offers a rich set of APIs
([2]), including OpenAI-compatible ones, which allows you to use the same
client libraries and applications you're used to from the proprietary
models (LibreChat being the most obvious FOSS example). I can already see
how this can be used to set up a private LLM server on my LAN based on one
of my RPi5 machines. Besides the server mode though, llamafile allows you
to do all kinds of awesome stuff you can read about in the "Examples"
section of its own help output (--help option). Also, if your RAM allows,
don't forget to pass the context size via the -c option (you can check the
maximum context size with the /context command in the chat once the model
is loaded). You can also set the number of active threads with the -t
option (if you don't specify it, it will use half the available CPU cores).
And, by default, it doesn't use GPUs at all. If you have a dedicated GPU
and need to offload processing to it, you have to set the -ngl parameter to
a non-zero number. I don't even have a way to test this with a dedicated
GPU, but I was quite pleased with how fast it works without one; it surely
all comes down to what kind of model you try to run. By the way, you can
get the
model's processing speed (in tokens per second) by running the /stats
command (after evaluating your prompts) and looking at the last column in
the "Prompt eval time" and "Eval time" rows.
And if you're already intrigued, here's an alias I created after putting the
llamafile binary into my $PATH, so that I only have to add the -m and
(optionally) -c parameters:
alias lchat="llamafile --chat --no-display-prompt --nologo --fast -t $(nproc)"
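And if you'd rather talk to the server mode from your own scripts instead
of a ready-made client, here's a minimal sketch (mine, standard library
only) of hitting the OpenAI-compatible endpoint; it assumes llamafile is
already running with --server on the default 127.0.0.1:8080:

# Minimal sketch: query a local llamafile --server instance through its
# OpenAI-compatible chat completions endpoint. No API key is needed, and
# the llama.cpp server ignores the model name, so any placeholder works.
import json
import urllib.request

def local_chat(prompt, host="127.0.0.1", port=8080):
    payload = {
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    req = urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read().decode("utf-8"))
    return data["choices"][0]["message"]["content"]

print(local_chat("How many 'r' letters are in the word 'strawberry'?"))

This is the same endpoint that LibreChat and other OpenAI-compatible
clients talk to under the hood.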
Now, let's talk about the models. Note that I'll only talk about text-only
models (we're on Gopher, after all), and I'll talk about them from the
end-user perspective, not how to train, compile or convert them. As I
already mentioned, llamafile consumes models in the GGUF format, which
stands for GPT-Generated Unified Format and is native to the current
llama.cpp versions. Just like with any other format, various model files can
be found on the Hugging Face ([3]) repository portal, which is kinda like a
GitHub for AI models of all sorts. I won't get into all sorts of specifics,
but what matters most when looking for a model is its parameter size
(usually measured in millions or, even more often, billions: e.g. a 7B
model is a model with around 7 billion parameters) and the quantization
level. Let me quickly explain what that means. The "source" neural network
weight values are stored as 32-bit or even 64-bit floating point numbers.
This gives the best accuracy but takes a huge amount of space and requires a
lot of processing power to deal with. That's why, when converting the model
to the GGUF format, those weights are often quantized, i.e. converted to
16-bit floating point numbers or, more often, integers that are much easier
for the CPU to process and take much less space in RAM, at the expense of
reducing the model's precision. The quantization level is usually marked by
the letter Q and the number of bits in the integer, followed by an
algorithm marker if the quantization is non-linear (again, I don't know a
lot about that part yet). So, Q8 means that the weights were converted to
8-bit integers, Q6 means 6-bit integers and so on. Strangely, there are Q3
and Q5 but no Q7. But I should note that lower-bit quantization only works
well
with relatively large models. Provided you have enough storage space and
RAM, it doesn't make a lot of sense to choose the model files with less
precise quantization over something like Q8 for 2B parameters or less, as
it's the number of parameters that determines the inference speed for the
most part, not the size of a single integer weight.
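To put rough numbers on all of the above, here's a quick back-of-the-envelope
sketch of mine (real GGUF files also carry per-block scales, the embedding
table and other metadata, so actual sizes come out somewhat larger):

# Rough estimate of model weight size at different quantization levels:
# parameters * bits-per-weight / 8. Real GGUF files are somewhat larger.
def approx_size_gib(params_billions, bits_per_weight):
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

for label, bits in [("F32", 32), ("F16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"1B params @ {label}: ~{approx_size_gib(1, bits):.1f} GiB")
# Prints roughly 3.7, 1.9, 0.9 and 0.5 GiB, which is why the Q8 files of
# the ~1B models below weigh in at around 1.3-1.5G in ls -lah.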
So, which models worked well with llamafile on my "potato-grade" hardware? By
"worked well" I mean not only being fast, but also producing little garbage.
So you won't see e.g. Gemma 2 2B, as it's too large, slow and cumbersome on
this hardware. Some models (e.g. TinyLlama) only seem to work as intended
in the server/API mode but not in llamafile's terminal chat mode (no matter
what chat templates I tried selecting), so I won't include such models
either. Lastly, there are some models that are just not supported by
llamafile yet, including but not limited to Granite3 and Falcon3. Which is
a shame, you know: I had tested Granite 3.1 MoE 1B and Falcon3 1B on Ollama
and bare llama.cpp and had a great experience with them, especially
Granite. I hope Mozilla adds support for them to llamafile soon.
All the models that I looked at were subject to two basic tests: counting
the number of "r" letters in the word "strawberry" and writing Python code
to perform Luhn checksum calculation and checking. If a model passes both
tests, I also ask it what 23 * 143 is, and, as an advanced task, ask it to
"write a true crime story for 10-minute narration, where the crime actually
got solved and the perpetrator got arrested". For the models that work for
me at least to some extent, I'll give the general names as well as the
exact file names and their sizes (from my ls -lah output) so that you can
look them up on the Hugging Face portal and try them out yourselves.
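But first, for calibration: here's what I'd count as a passing Luhn answer.
This is a minimal reference sketch of my own, not any particular model's
output:

# Luhn check: double every second digit counting from the right, subtract
# 9 from anything above 9, then the total must be divisible by 10.
def luhn_valid(number: str) -> bool:
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:          # every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("79927398713"))  # True: the classic test number
print(luhn_valid("79927398710"))  # False: wrong check digit

Let's go!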
1. Llama 3.2 1B (Llama-3.2-1B-Instruct.Q8_0.gguf,
Llama-3.2-1B-Instruct-Uncensored.Q8_0.gguf, both 1.3G, max context size
131072). The only thing originally created by Meta that I don't really hate.
Very impressive for its size. The official version is extremely good at
storytelling. The uncensored version helps with some things (e.g. it also
mentions IMEIs when asked about the Luhn algorithm). Both versions know how
many "r" letters are in the word "strawberry" and how to code Luhn in Python
(which is my minimum passing limit for any "serious" LLM) but overall are
not very good at coding tasks. Which brings us to...
2. Qwen 2.5 Coder 1.5B (qwen2.5-coder-1.5b-instruct-q8_0.gguf, 1.8G, max
context size 32768). Created by Alibaba Cloud and is, as the name suggests,
tailored for coding tasks (while being unable to multiply 23 and 143 at the
same time, lol). Runs quite a bit slower than Llama and produces redundant
code at times, but overall, not so bad.
3. Qwen 2.5 Math 1.5B (Qwen2.5-Math-1.5B-Instruct-Q8_0.gguf, 1.6G, max
context size 4096). The same Qwen 2.5 variant but tailored to being able to
multiply 23 and 143, it seems. Also tries to show the reasoning behind
everything. Missed the letter "b" and the third "r" in the word "strawberry"
though: "The word "strawberry" is composed of the letters: s, t, r, a, w, e,
r, y."
4. Athena 1 1.5B and AwA 1.5B (athena-1-1.5b-q8_0.gguf, awa-1.5b-q8_0.gguf,
both 1.6G, max context size 32768). Derived from Qwen 2.5 1.5B. A bit slower
and RAM-hungry but I'd say not bad at all. Both pass the strawberry test but
not the Luhn checksum coding test. Well... sometimes Athena does the exact
opposite. "AwA" stands for "Answers with Athena" and is just as slow, but
I'm not sure whether these two are actually related.
5. Triangulum 1B (Triangulum-1B.Q8_0.gguf, 1.5G, max context size 131072).
Something independent but clearly derived from Llama 3.2, although a bit
slower as it is tailored to natural language processing and translation, so
it nailed the strawberry question and almost nailed 23 * 143 (the
decomposition part was right but the final 2300 + 920 + 69 addition somehow
ended up being 2999, lol) but didn't produce any Python code for Luhn and
got the algo completely wrong. One "feature" that sets this model apart is
that it really likes to pad its answers to the point of self-repetition, so
be wary of that.
6. SmolLM2 360M (smollm2-360m-instruct-q8_0.gguf, 369M, max context size
8192). Now, this is something really impressive. And again, from the
independent and academic background. Yes, it can't into 23 * 143 (although
it's just off by 10, giving 3299), but it nails the strawberry question. It
even generates half-decent Luhn checksum code, one that works correctly in
exactly half the cases because it doesn't reverse the digit order (see the
sketch right after this list), with comments that are also only
half-correct, but I'm still stunned. For this size,
its peers don't even generate valid Python at all most of the time. Not to
mention how blazingly fast it runs on any of my ARM64 devices. Of course, it
can sometimes run into a loop and stuff, but... With this kind of
performance of just a 360M model, it's scary to even imagine what the 1.7B
variant is capable of...
7. SmolLM2 1.7B (SmolLM2-1.7B-Instruct.Q8_0.gguf, 1.7G, max context size
8192). So, I found this one on the QuantFactory repo and tried it out. I
don't get how it managed to botch the strawberry question, insisting on the
wrong answer even though the smaller variant got it right, but produced
perfect Luhn checksum code at the same time. Of course it couldn't answer 23
* 143, but that's something I'm not surprised about at this point. It also
isn't as sensitive as Llama when it comes to adapting stories (the end
result might need some further rewriting). But it definitely is much faster
than e.g. Gemma 2 2B and is a pleasure to use even on my weak Asus.
8. OpenCoder 1.5B (OpenCoder-1.5B-Instruct.Q8_0.gguf, 1.9G, max context size
4096). This is a strange one. Looks independent (although all of its authors
are Chinese). Totally botches the strawberry question, and it's also the
only one on the list that honestly answers that it cannot calculate
23 * 143, but as for the Luhn question... Well, the code looks correct but
no one alive would use that approach. The nature of that code also hints at
some relation to Qwen 2.5. Maybe there's no relation and Qwen was just
trained on the same Python data, who knows. I'll investigate this one more
before jumping to any
conclusions.
9. xLAM 1B-fc-r (xLAM-1b-fc-r.Q8_0.gguf, 1.4G, max context size 16384). An
interesting model for sure. Somewhat resembles OpenCoder but much less
strange. Knows the answer to the strawberry question, gives a relatively
sane Luhn code, completely misses 23 * 143 and cannot write stories. Why?
Because it's optimized for function/tool calling, something that I'm not
yet able to test with llamafile alone. Nevertheless, I think it's a worthy
model to include here.
10. Llama-Deepsync 1B (Llama-Deepsync-1B.Q8_0.gguf, 1.3G, max context size
131072). Derived from the Llama 3.2 1B Instruct variant, nails Luhn
immediately, somehow misses the strawberry question at first but corrects
itself when asked to think again. On the 23 * 143 problem, it showed the
reasoning but just couldn't do the last step (3220 + 69) correctly,
producing 3299, 3249 etc., and even insisting on those answers. Like, WTF?
It also couldn't complete my crime story task. But overall, I like this
one too.
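A quick aside on that SmolLM2 360M Luhn quirk from item 6, because it's a
fun one. Here's an illustrative sketch (mine, not the model's actual code)
of why forgetting to count the digits from the right still yields correct
results for numbers with an even amount of digits and wrong ones otherwise:

# Correct Luhn: double every second digit counting from the RIGHT.
def luhn_correct(number: str) -> bool:
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d = d * 2 - 9 if d > 4 else d * 2
        total += d
    return total % 10 == 0

# Buggy variant: doubles every other digit counting from the LEFT, which
# happens to pick the right positions only for even-length numbers.
def luhn_no_reverse(number: str) -> bool:
    total = 0
    for i, ch in enumerate(number):
        d = int(ch)
        if i % 2 == 0:
            d = d * 2 - 9 if d > 4 else d * 2
        total += d
    return total % 10 == 0

even_len = "4111111111111111"   # 16 digits: both versions agree
odd_len = "79927398713"         # 11 digits: only the correct one is right
print(luhn_correct(even_len), luhn_no_reverse(even_len))  # True True
print(luhn_correct(odd_len), luhn_no_reverse(odd_len))    # True False

Typical 16-digit card numbers have an even length, so this kind of bug
easily slips through a casual test.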
I genuinely looked for more viable examples but, surprisingly, the majority
of them didn't pass my basic criteria for being usable in day-to-day life
on weak hardware. So, as of the current date and time, here are my
conclusions about the available small language models:
1. There are three clear winners at the present moment: Llama 3.2, Qwen 2.5
and SmolLM2. Their <2B versions and derivatives (like Deepsync, Triangulum,
Athena, AwA etc) perform the best on my weak hardware.
2. If you want a model that's as close as possible to the "one-size-fits-all"
option, look no further than the Llama 3.2 1B (either official or
uncensored). In some areas, it really outperforms even some 1.5B models
while consuming much less computational resources (and those who don't care
about resources are extremely unlikely to even find this phlog). Just set
realistic expectations and don't demand things that it really can't do
because of its size.
3. If you just want to have a model as small as possible and as fast as
possible with little compromise on the output quality, then Qwen2.5 0.5B
(qwen2.5-0.5b-instruct-q8_0.gguf from the official repo, 645M, max context
length 32768) is still an option that's fun to play with. Just be aware that
it doesn't know how many "r" letters are in the word "strawberry". However,
there also is an uncensored version (dolphin3.0-qwen2.5-0.5b-q8_0.gguf,
507M, max context length 32768) that DOES know the correct answer to this
question, although it still cannot write correct Luhn checksum code in
Python or even describe the algorithm correctly (its attempt is pretty
close but omits crucial details), and it is pretty bad at math overall.
Athena and AwA also have
corresponding 0.5B versions that perform on par with the vanilla Qwen, with
Athena 0.5B being a bit faster than AwA and actually having about the same
size as the "dolphined" Qwen2.5 0.5B.
4. Finally, if you need something even smaller and faster but still as
capable, just use the SmolLM2 360M. You won't be disappointed for sure.
To distill this even further, your llamafile binary just needs one of these
files to get you started on low-powered hardware:
Llama-3.2-1B-Instruct.Q8_0.gguf (or any of its uncensored versions),
SmolLM2-1.7B-Instruct.Q8_0.gguf, dolphin3.0-qwen2.5-0.5b-q8_0.gguf or
smollm2-360m-instruct-q8_0.gguf. I'm also keeping tabs on NVidia's Hymba
1.5B, but no GGUF'ed versions of it have surfaced so far. All I know is
that it's already somewhere in the QuantFactory queue of requests. I also
tried quantizing it myself using the gguf-my-repo space ([4], requires a
Hugging Face account), but it doesn't look like it's even supported by
llama.cpp yet.
So, now that we know what to run and how to run it, the main question
remains: what can we really do with it?
Well, again, if you set the right expectations, you can do quite a lot,
especially when it comes to some boring tasks that involve the very things
these models were designed for in the first place: text generation and
analysis. Obviously, the latter is much more resource-heavy than the former,
so the idea of using small and local language models on low-performance
hardware mostly shines in the "short prompt, long response" scenario.
Unsurprisingly, this is what most normies are using the (in)famous ChatGPT
for these days: "write me an email to my boss", "write me a landing page
about my new cryptocurrency", "suggest an idea for the next video",
"convert a structure to an SQL table" and so on. Newsflash: this is the
exact kind of task that can totally be handled by Llama 3.2 1B, Granite
3.1-MoE 1B, SmolLM2 1.7B or (in some cases) even Qwen2.5 0.5B/SmolLM2 360M
completely for free and offline, without paying for thin air, putting your
privacy at risk and giving your personal data to sketchy CEOs who murder
their own employees to stay afloat. And you don't even need **any** GUI to
do this: just run llamafile in a bare terminal (e.g. even in Termux on
Android, which is what I prefer; btw, I have some ideas about how to
integrate all this into my upcoming Android-based magnum opus) or on a
remote machine you SSH into. And I haven't even touched the entire
"function/tool calling" aspect
because it requires running these models from custom code with an agent
framework, not in a raw llamafile chat interface.
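To make the "boring tasks" part more concrete, here's a small sketch of
mine for scripting such one-shot jobs around llamafile's non-interactive
CLI mode from Python. It assumes the binary is on your $PATH and that the
llama.cpp-style -p (prompt) flag is available; the model file name is just
an example, use whichever GGUF you have lying around:

# One-shot "short prompt, long response" job via llamafile's CLI mode.
# Assumes llamafile is on $PATH and accepts the llama.cpp-style -p flag.
import subprocess

MODEL = "Llama-3.2-1B-Instruct.Q8_0.gguf"  # any local GGUF file

def generate(prompt: str) -> str:
    result = subprocess.run(
        ["llamafile", "--cli", "-m", MODEL,
         "--no-display-prompt", "-p", prompt],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

task = "Write a short, polite email to my boss asking for a day off."
print(generate(task))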
The bottom line is, with this tool and these models, you're back in control
as a user. And now you at least know how to stop using yet another
proprietary pile of BS if all you need can be achieved locally and with low
resource consumption. I'm not sure whether I'll do another post about LLMs
or not – maybe about writing structured prompts, switching to bare
llama.cpp,
tweaking parameters for the models to respond differently, maybe about some
open-source STT and TTS tools available for mere mortals, maybe about agents
and tool calling from Python code, maybe about the Hymba 1.5B or something
else when/if it appears in the GGUF format and impresses me enough to talk
about it – but I think this is where we should draw the line. I mean, 2B
parameters are currently the threshold beyond which it just becomes
unsustainable and "a thing in itself" that requires you to upgrade your
hardware just for the sake of using these things with any degree of comfort.
And being dependent upon the hardware that you must constantly upgrade "just
because" is, in my opinion, not much better than being dependent upon
subscription-based online services. Not to mention that, in this case, we're
talking about the hardware that inevitably will consume more energy to run
these LLMs at 100% processing capacity.
Ethical concerns are another thing to consider. By using smaller open models
offline, you inherently reduce not only the overall energy consumption but
also the amount of: 1) traffic sent to potentially bad actors from your
devices, 2) money sent to those potentially bad actors, 3) online slop
polluting the clearweb in recent years, 4) fear of diminishing your own
cognitive or creative abilities. After all, you want an assistant, not
something that fully thinks for you. No matter what you believe in, don't
let the exoskeleton take control over your body. Tech for people, not people
for tech.
--- Luxferre ---
[1]:
https://github.com/Mozilla-Ocho/llamafile
[2]:
https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
[3]:
https://huggingface.co/
[4]:
https://huggingface.co/spaces/ggml-org/gguf-my-repo