(2025-10-13) A short rant about "democratizing" genAI, hobbyist software etc
----------------------------------------------------------------------------
As you might have noticed, my break from phlog posting was extended by a
week beyond what I'd been planning. Too much has been going on throughout
this month, so I needed to keep my focus narrow enough. I promise that I
will eventually return to the previous topics like homebrew VMs or the
abacus, but right now I'd like to rant about the state of genAI for "mere
mortals" and some other adjacent topics without any particular structure.
So, here's my chain of thought (no pun intended).
Recently, something reminded me of the existence of KoboldCpp ([1]): a
single-binary LLM inference runner that uses llama.cpp under the hood but
also incorporates a simple web-based chat UI, a set of various API endpoints
(its own, OpenAI-compatible and Ollama-compatible, among many others) and a
simple GUI that makes it easy to run not only text generation but also image
generation, TTS and speech recognition models (via the included
stable-diffusion.cpp, TTS.cpp and whisper.cpp distributions respectively). A
couple of years ago, before LM Studio was even a usable thing, KoboldCpp had
already gained some traction among lusers who just wanted something fully
local for roleplaying scenarios. It was one of the first pieces of software
considered to "democratize" running LLMs for the layman. For me, however, as
fast and efficient as it is, there are two caveats: 1) it doesn't directly
use the llama.cpp binaries but the underlying GGML engine instead (exposing
a totally different set of command-line parameters), hence some strange
defaults like the default maximum generated token limit; 2) the GGML engine
version it uses doesn't get updated as often as upstream llama.cpp, so some
models can still be out of reach, like the recent Granite 4 MoE series. It's
only a matter of time and patience until the newer engine finally gets
merged into Kobold.
As of now, I haven't yet explored its image generation or speech
capabilities, but I run the text generation server as follows:
#!/bin/sh
# Sensible defaults to run KoboldCpp
CTXSIZE=32768
# hardware options
# HARDOPTS="--usecpu --gpulayers 0 --usemmap"
HARDOPTS="--usevulkan --flashattention --gpulayers 99"
# run it, capping the default generation amount at min(CTXSIZE, 8192)
koboldcpp --defaultgenamt $(($CTXSIZE<8192?$CTXSIZE:8192)) --skiplauncher \
    --contextsize $CTXSIZE $HARDOPTS "$@"
Then I just pass the GGUF file as the parameter to the script and that's it.
In the model list exposed by the API, it shows up as
"koboldcpp/filename_without_ext". For CPU-only inference, the commented-out
HARDOPTS line has to be used instead. Also, as you can see, KoboldCpp can
only generate 8192 tokens at a time at most, even when the context window is
larger; upstream llama.cpp doesn't have such a limitation. On the upside, if
you only need to deploy KoboldCpp to non-GPU systems, there is a "nocuda"
binary version that weighs much less, as well as an "oldpc" version that
disables AVX2 instructions automatically (something that requires a separate
CLI flag in the other builds) and performs some other tricks to run models
on older hardware.
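For reference, with the server already started as above on some hypothetical
model.gguf, here's a minimal sketch of talking to it over the
OpenAI-compatible endpoints with curl (assuming the server listens on the
default port of 5001):
# list the exposed model names (expect "koboldcpp/model" here)
curl -s http://localhost:5001/v1/models
# request a chat completion via the OpenAI-compatible endpoint
curl -s http://localhost:5001/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "koboldcpp/model",
         "messages": [{"role": "user", "content": "Hello there!"}],
         "max_tokens": 128}'
The first call lists the exposed model names, and the second one requests an
actual chat completion.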
So why would I even consider using this instead of the usual llama-server
(or llamafile, previously covered here, if I need a single-file deployment)?
Well... KoboldCpp really is easier on RAM consumption. I've purchased a
really cheap VPS (around $23/year!) with the only caveat being that it has a
single core and 512 MB of RAM. Are there any modern LLMs suitable for such
amounts of RAM? Sure there are, like Gemma 3 270M in the 4-bit QAT version.
But are there any engines suitable for running this LLM on such amounts of
RAM? Well, KoboldCpp is one of them: I just had to append the "--noavx2"
flag so it wouldn't throw an "illegal instruction" error, and it still got
up to 11 to 14 tokens per second (given that the context window size had to
be decreased to 4096 tokens). Ten times slower than on a "normal" system but
still perfectly adequate. The llamafile-based deployment, on the other hand,
showed about 5 to 6 t/s at most while obviously consuming much more RAM per
inference. If/when time permits, I'm also going to dig up some of my older
hardware and run some more tests.
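For the curious, that low-RAM setup boils down to something like the
following direct invocation (a sketch only: the GGUF filename is
hypothetical, and the flags are just the CPU-only ones from the script above
plus "--noavx2" and the smaller context window):
koboldcpp --usecpu --noavx2 --gpulayers 0 --usemmap --skiplauncher \
    --contextsize 4096 --defaultgenamt 4096 gemma-3-270m-qat-q4_0.gguf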
Another side of the coin is the actual usage of all this goodness. Recently,
a controversial question has emerged: "If LLMs have become so good at
coding, where's all the hobbyist-level and indie software, and why aren't we
experiencing a boom of it?" Well, from my own experience, I think I know the
answer to this question, and it's not a simple one either: it's as
multifaceted as the technology itself. First, true indie developers are most
often resource-conscious, so they will be the last ones to adopt LLMs for
any coding assistance. They view themselves, some of them rightfully so, as
artisans instead of code monkeys, and like to be in control of everything
that happens in their creations. I share this view myself, so I'm never
going to use LLMs for anything but some boring but inevitable boilerplate
parts of code. The second reason we still aren't experiencing that indie
boom to the fullest is that many "vibe coders" really thought that an LLM
could think FOR them throughout the entire development process, rather than
merely assisting with the boilerplate coding part. Offloading design
decisions and mission-critical bits of code to LLMs has already led to some
disasters, with many more to come. We just aren't at the "good enough" phase
yet, no matter how hard corporate marketoids try to convince you otherwise.
The third reason is something I have already discussed a bit in this phlog,
and it has little to do with LLMs per se: desktop software just isn't that
popular anymore. The average vibe coder would try generating a mobile app at
best, but most of the time, it's just some React-based browser-oriented
crap. I mean, instead of React, you can insert Next.js, Expo or whatever
RAM-hogging framework is popular this week; that doesn't really matter. What
matters is that, in the eyes of the masses, the definition of "software"
really has shifted towards this. And because of the amount of already
existing JSX/TSX garbage to train on, it's pretty much the only kind of
coding LLMs are *kinda* good at now. Can it be usable by the general public?
Maybe. Can it compete with the industry giants? Probably. Does that qualify
as "indie development" in my eyes? Hell no. Truly independent software never
requires a gigabyte-sized browser engine to run.
Want to create independent desktop software? Learn Tcl/Tk, or at least
Tkinter if you already know Python (just because Python already ships with
it). Wanna go mobile too? Learn Go + Fyne then. Wanna go online? Learn
Elixir. There are lots of better ways of doing stuff than just succumbing to
the mainstream crapware frameworks for languages that aren't supposed to be
used for that stuff in the first place. By choosing the right tool for the
job in the beginning, you make things much easier for your future self. And
then, and only then, like I already said some time ago, you may use LLMs to
assist you: not to think for you, but to help you write the boring parts of
your implementation while you're still in control. Sober, aware and
independent.
--- Luxferre ---
[1]: https://github.com/LostRuins/koboldcpp