Hacker News on Gopher (unofficial)
COMMENT PAGE FOR: | |
Building an AI server on a budget | |
eachro wrote 8 hours 48 min ago: | |
A lot of people are saying 12GB is too small to do anything interesting
with. What's the most useful thing people __have__ gotten to work?
ntlm1686 wrote 10 hours 25 min ago: | |
Building a PC that can play video games and run some LLMs. | |
dubrado wrote 16 hours 27 min ago: | |
If you want to save some money and test things out, check out | |
Hyperbolic (app.hyperbolic.xyz). | |
They're based in the US, don't store any data, and you can rent
(self-serve style) in less than a minute.
diggan wrote 16 hours 24 min ago: | |
If you're gonna promote your own product, at least be honest and | |
brave enough to acknowledge that you built/manage it. | |
v3ss0n wrote 17 hours 37 min ago: | |
A 12GB GPU can't do a thing that is useful. The minimum should be 32GB of VRAM,
where you can run actual models (Mistral-Small, Qwen3-32B, etc).
alganet wrote 22 hours 31 min ago: | |
Let me try to put this in the scale of coffee: | |
-- | |
Using LLM via api: Starbucks. | |
Inference at home: Nespresso capsules. | |
Fine-tune a small model at home: Owning a grinder and an Italian
espresso machine.
Pre-training a model: Owning a moderate coffee plantation. | |
teleforce wrote 23 hours 6 min ago: | |
>DECISION: Nvidia RTX 4070 | |
I'm curious why OP didn't go for the more recent Nvidia RTX 4060 Ti
with 16 GB VRAM, which costs less (~USD 500) brand new and has lower power
consumption at 165W [1]. RTX 5060 Ti 16GB sucks for gaming, but seems
like a diamond in the rough for AI:
[1]: https://news.ycombinator.com/item?id=44196991 | |
qingcharles wrote 21 hours 36 min ago: | |
And if you're gonna be fine with 12GB, why not a 2080ti instead? | |
Fluorescence wrote 16 hours 51 min ago: | |
Only 11GB... but I guess it will allow you to not do anything | |
useful just as well as 12GB will :) | |
You can however solder on double-capacity memory chips to get 22GB:
[1] I hoped the article would be more along these lines than
calling an unremarkable second-hand last-gen gaming PC an "AI
Server".
[1]: https://forums.overclockers.com.au/threads/double-your-gpu... | |
zlies wrote 23 hours 34 min ago: | |
Did you not use any thermal paste at all, or did you just forget to | |
mention it in your post? | |
PeterStuer wrote 1 day ago: | |
For image generation the article's setup might be viable, but do not
expect to run LLMs with satisfactory quality and speed on 12GB of VRAM.
lazylizard wrote 1 day ago: | |
why not one of these? | |
[1]: https://www.amazon.sg/NVIDIA-Jetson-Orin-64GB-Developer/dp/B0B... | |
numpad0 wrote 20 hours 55 min ago: | |
Jetsons aren't that fast; those are intended for mobile robots. The ones
supposed to be just around the corner are the DGX Spark (Project DIGITS)
and DGX Station.
Those DGX machines are still in a "right around the corner" state.
romanovcode wrote 1 day ago: | |
Isn't the new computer that NVIDIA is about to release much better than
this one, and at the same price? Why would anyone buy this one now?
Seems like a waste of money.
mythz wrote 1 day ago: | |
Good value but a 12GB card isn't going to let you do too much given the | |
low quality of small models. Curious what "home AI" use cases small | |
models are being used for? | |
It would be nice to see best-value home AI setups under different
budgets or VRAM tiers, e.g. the best-value configuration for 128 GB of GPU VRAM,
etc.
My 48GB GPU VRAM "Home AI Server" cost ~$3100, with all parts from eBay,
running 3x A4000s in a Supermicro 1U rack server with 128GB RAM and a
32/64-core Xeon. Nothing amazing, but I wanted the most GPU VRAM before paying the
premium Nvidia tax on their larger GPUs.
This works well for Ollama/llama-server, which can make use of all the GPU
VRAM. Unfortunately ComfyUI can't make use of all the GPU VRAM to run larger
models, so I'm on the lookout for a lot more RAM in my next GPU server.
Really hoping Intel can deliver with its upcoming Arc Pro B60 Dual GPU | |
for a great value 48GB option which can be run 4x in an affordable | |
192GB VRAM workstation [1]. If it runs Ollama and ComfyUI efficiently | |
I'm sold. | |
[1]: https://www.servethehome.com/maxsun-intel-arc-pro-b60-dual-gpu... | |
rwyinuse wrote 22 hours 33 min ago: | |
I use a Proxmox server with an RTX 3060 to generate paintings (I have a
couple of old jailbroken Amazon Kindles attached to walls for that
purpose), and to run Ollama, which is connected to Home Assistant and
their Voice Preview device, allowing me to talk with an LLM without
transmitting anything to cloud services.
Admittedly, with that amount of VRAM the models I can run are fairly
useless for stuff like controlling lights via Home Assistant; it
occasionally does what I tell it to do, but usually not. It is pretty
okay for telling me information, like the temperature or the value of some
sensors I have connected to HA. For generating AI paintings it's
enough. My server also hosts tons of virtual machines and docker
containers and is used for remote gameplay, so the AI thing is just
an extra.
garyfirestorm wrote 17 hours 37 min ago: | |
Why do you say that? You can easily fine-tune an 8B parameter model for
function calling.
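For what it's worth, a QLoRA-style fine-tune is roughly how you would fit an 8B model into 12GB. Below is a minimal sketch assuming the Hugging Face transformers/peft stack; the model id is a placeholder, and the dataset, training loop, and function-calling prompt format are omitted.
    # Minimal QLoRA-style sketch: fit an ~8B model into ~12GB of VRAM by loading
    # the base weights in 4-bit and training small LoRA adapters only.
    # The model id is a placeholder; dataset and training loop are omitted.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder 8B model

    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb, device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Train only low-rank adapters on the attention projections instead of all 8B weights.
    lora = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # typically well under 1% of the weights are trainable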
msgodel wrote 22 hours 34 min ago: | |
It's really not going to let you train much, which IMO is the only
reason I'd personally bother with a big GPU. Gradients get huge, and
everything computes them in single/half-precision floating point.
itake wrote 1 day ago: | |
My home AI machine does image classification. | |
naavis wrote 1 day ago: | |
What kind of image classification do you do at home? | |
itake wrote 22 hours 36 min ago: | |
My side project accepts and publishes user generated content. To | |
stay compliant with regulations, I use ML to remove adult | |
content: | |
[1]: https://github.com/KevinColemanInc/NSFW-FLASK | |
mythz wrote 1 day ago: | |
Using just an Ollama VL Model (gemma3/mistral-small3.1/qwen2.5vl) | |
or a specific library? | |
itake wrote 22 hours 35 min ago: | |
My home server detects NSFW images in user generated content on | |
my side project. | |
source code: | |
[1]: https://github.com/KevinColemanInc/NSFW-FLASK | |
mythz wrote 21 hours 57 min ago: | |
Cool, I've tried a few but settled on using EraX NSFW to do the | |
same. | |
jononor wrote 1 day ago: | |
Agreed, 12 GB does not seem useful. For a coding LLM, it seems 128 GB
is needed to get even close to the frontier models.
For generative image processing (not video), it looks like one can | |
get started with 16GB. | |
noufalibrahim wrote 1 day ago: | |
This is interesting. We recently built a similar machine to implement a | |
product that we're building on a customer site. | |
I didn't buy second-hand parts since I wasn't sure of the quality, so it
was a little pricey, but we have the entire thing working now, and over
the last week we added the LLM server to the mix. We haven't released it
yet though.
I wrote about some "fun" we had getting it together here but it's not | |
as technically detailed as the original article. | |
[1]: https://blog.hpcinfra.com/when-linkedin-met-reality-our-bangal... | |
danielhep wrote 1 day ago: | |
What are the practical uses of a self-hosted LLM? Is it actually
possible to approach the likes of Claude or one of the other big ones
on your own hardware for a reasonable budget? I don't know if this is
something that's actually worth it, or if people are just building
these rigs for fun or niche use cases that don't require the
intelligence of a hosted LLM.
tmountain wrote 16 hours 1 min ago: | |
Personal opinion: it's for fun, with some internal narrative of
justification. It doesn't seem like it would be cost effective or
provide better results, as all the major LLM vendors benefit
tremendously from economies of scale, and the monthly fees for these
services are extremely reasonable for what you are getting. Going
further, the cloud-based LLMs receive upgrades constantly, while static
hardware will likely lock you out of future models at some time
horizon.
numpad0 wrote 1 day ago: | |
A couple of best-VRAM-for-the-buck (and borderline space heater) GPUs off the
top of my head: Tesla K80 (12GBx2), M40 (24GB), Radeon Instinct
MI(25|50|60|100) (8-32GB?), Radeon Pro V340 (16GBx2), a bunch of other
Radeon Vega 8GB cards e.g. Vega 56, NVIDIA P102/P104 (~16GB), Intel
A770 (16GB). Note: some of these are truly just space heaters.
I'm not sure right now is the best time to build an LLM rig,
as the Intel Arc B60 (24GBx2) is about to go on sale. Or maybe it is, to
secure multiples of 16GB cards hastily offloaded before its launch?
AJRF wrote 1 day ago: | |
Why a 4070 over a 3090? A 4070 has half the VRAM. In the UK you can get | |
a 3090 for like 600GBP. | |
Havoc wrote 1 day ago: | |
> You pay a lot upfront for the hardware, but if your usage of the GPU | |
is heavy, then you save a lot of money in the long run. | |
Last I saw data on this, it wasn't true. In a like-for-like comparison (same
model and quant), the API is cheaper than the electricity alone, so you never
make back the hardware cost. That was a year ago, and API costs have plummeted,
so I'd imagine it's even worse now.
Datacenters have cheaper electricity, can do batch inference at scale, and have
more efficient cards. And that's before we consider the huge free
allowances by Google etc.
Own AI gear is cool... but not due to economics.
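A rough back-of-the-envelope sketch of that electricity-vs-API comparison is below; every input (wattage, token rate, electricity price, API price) is an illustrative assumption rather than a figure from this thread, so plug in your own numbers.
    # Back-of-the-envelope electricity cost per million generated tokens.
    # All four inputs are assumptions for illustration, not measured values.
    gpu_watts = 300             # assumed wall draw while generating
    tokens_per_sec = 30         # assumed decode speed for a small local model
    electricity_usd_kwh = 0.30  # assumed residential electricity price
    api_usd_per_mtok = 0.20     # assumed API output price for a comparable small model

    seconds_per_mtok = 1_000_000 / tokens_per_sec
    kwh_per_mtok = (gpu_watts / 1000) * (seconds_per_mtok / 3600)
    local_usd_per_mtok = kwh_per_mtok * electricity_usd_kwh

    print(f"local electricity: ${local_usd_per_mtok:.2f} per 1M tokens")  # ~$0.83
    print(f"API price:         ${api_usd_per_mtok:.2f} per 1M tokens")
    # This ignores the hardware purchase entirely; amortizing it only widens
    # the gap in the API's favour for like-for-like models.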
edg5000 wrote 1 day ago: | |
Is this also the case for token-heavy uses such as Claude Code? Not
sure if I will end up using CC for development in the future, but if
I end up leaning on that, I wonder if there would be a desire to
essentially have it run 24/7. When run 24/7, CC would possibly incur
more API fees than residential electricity would cost when running on
your own gear? I have no idea about the numbers. Just wondering.
Havoc wrote 16 hours 54 min ago: | |
I doubt you're going to beat a datacenter under any conditions with
any model that is vaguely like for like.
The comparison I saw was a small Llama 8B model, i.e. something you
can actually get usable numbers for both at home and via API. So something
pretty commoditized.
> When run 24/7, CC would possibly incur more API fees than
residential electricity would cost when running on your own gear?
Claude is pretty damn expensive, so it's plausible that you can undercut
it with another model. That implies you throw the like-for-like
assumption out the door though. A valid play practically, but it kinda
undermines the "buy your own rig to save" argument.
whalesalad wrote 1 day ago: | |
I would rather spend $1,300 on openai/anthropic credits. The | |
performance from that 4070 cannot be worth the squeeze. | |
T-A wrote 1 day ago: | |
I would consider adding $400 for something like this instead: | |
[1]: https://www.bosgamepc.com/products/bosgame-m5-ai-mini-desktop-... | |
atentaten wrote 1 day ago: | |
Do you use this? If so, what's your use case and performance? | |
T-A wrote 19 hours 41 min ago: | |
No, they start shipping in July. The main advertised use case is | |
self-hosting LLMs. | |
usercvapp wrote 1 day ago: | |
I have a server at home that has been sitting idle for the last 2 years with 2 TB of
RAM and 4 CPUs.
I am gonna push it this week and launch some LLM models to see how they
perform!
How efficient are they on the electric bill when running locally?
AzN1337c0d3r wrote 19 min ago: | |
Depends on the server. Probably not going to be cost effective. I get | |
barely ~0.5 tokens/sec. | |
I have Dual E5-2699A v4 w/1.5 TB DDR4-2933 spread across 2 sockets. | |
The full Deepseek-R1 671B (~1.4 TB) with llama.cpp seems to have an issue in
that local engines that run the LLMs don't do NUMA-aware allocation,
so cores will often have to pull the weights in from another socket's
memory controllers through the inter-socket links
(QPI/UPI/HyperTransport) and bottleneck there.
For my platform that's 2x QPI links @ ~39.2GB/s/link that get | |
saturated. | |
I give it a prompt, go to work and check back on it at lunch and | |
sometimes it's still going. | |
If you want to achieve interactivity, I'd aim for 7-10
tokens/s, so realistically it means you'll run one of the 8B models
on a GPU (~30 tokens/s) or maybe a 70B model on an M4 Max (~8
tokens/s).
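A rough sanity check of that ~0.5 tokens/s figure, treating decode as memory-bound: tokens/s is roughly effective bandwidth divided by the bytes of weights read per token. The sketch below uses the two ~39.2 GB/s QPI links from the comment and an approximate ~37B activated parameters per token for the MoE model at FP16 (the ~1.4 TB size implies 2 bytes per parameter); both model figures are approximations, not measurements.
    # Rough sanity check: memory-bound decode speed = bandwidth / bytes read per token.
    active_params = 37e9               # approx. activated parameters per token (MoE)
    bytes_per_param = 2                # FP16, implied by the ~1.4 TB total size
    bytes_per_token = active_params * bytes_per_param  # ~74 GB of weights per token

    link_bw = 39.2e9                   # bytes/s per QPI link (from the comment)
    effective_bw = 2 * link_bw         # ~78.4 GB/s if both links saturate

    print(f"upper bound: ~{effective_bw / bytes_per_token:.1f} tokens/s")  # ~1.1
    # NUMA overhead and non-ideal placement easily halve this, which lines up
    # with the observed ~0.5 tokens/s.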
pshirshov wrote 1 day ago: | |
A 3090 for ~$1000 is a much more solid choice. Also, these old mining mobos
work very well for multi-GPU Ollama.
msp26 wrote 1 day ago: | |
> 12GB vram | |
waste of effort, why would you go through the trouble of building + | |
blogging for this? | |
timnetworks wrote 5 hours 58 min ago: | |
Can easily be replaced with a 24GB one, drop-in upgrayyed like RAM.
Brought to you by Carl's Jr.
jacekm wrote 1 day ago: | |
For $100 more you could get a used 3090 with twice as much VRAM. You
could also get a 4060 Ti, which is cheaper than the 4070 and has 16 GB VRAM
(although it's less powerful too, so I guess it depends on the use case).
iJohnDoe wrote 1 day ago: | |
Details about the ML software or | |
AI software? | |
JKCalhoun wrote 1 day ago: | |
Someone posted that they had used a "mining rig" [1] from AliExpress
for less than $100. It even has RAM and a CPU. He picked up a 2000W (!)
DELL server PS for cheap off eBay. The GPUs were NVIDIA TESLAs (M40 for
example) since they often have a lot of RAM and are less expensive.
I followed in those footsteps to create my own [2] (photo [3]).
I picked up a 24GB M40 for around $300 off eBay. I 3D printed a "cowl"
for the GPU that I found online and picked up two small fans from
Amazon that go in the cowl. Attached, the cowl + fans keep the GPU
cool. (These TESLA server GPUs have no fan since they're expected to
live in one of those wind-tunnels called a server rack.)
I bought the same cheap DELL server PS that the original person had | |
used and I also had to get a break-out board (and power-supply cables | |
and adapters) for the GPU. | |
Thanks to LLMs, I was able to successfully install Rocky Linux as well | |
as CUDA and NVIDIA drivers. I SSH into it and run ollama commands. | |
My own hurdle at this point is: I have a 2nd 24 GB M40 TESLA but when | |
installed on the motherboard, Linux will not boot. LLMs are helping me | |
try to set up BIOS correctly or otherwise determine what the issue is. | |
(We'll see.) I would love to get to 48 GB. | |
[1]: https://www.aliexpress.us/item/3256806580127486.html | |
[2]: https://bsky.app/profile/engineersneedart.com/post/3lmg4kiz4fk... | |
[3]: https://cdn.bsky.app/img/feed_fullsize/plain/did:plc:oxjqlammq... | |
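For anyone following the same path: once Ollama is running on a box like this, you don't have to SSH in for every prompt; it exposes an HTTP API on port 11434 that any machine on the LAN can call. A minimal sketch (the hostname and model name are placeholders):
    # Minimal sketch: query a remote Ollama instance over its HTTP API (port 11434).
    # "ai-server.local" and the model name are placeholders for your own setup.
    import json
    import urllib.request

    payload = {
        "model": "llama3.1:8b",   # any model already pulled with `ollama pull`
        "prompt": "Why do Tesla M40 cards need aftermarket cooling?",
        "stream": False,          # return a single JSON object instead of a stream
    }
    req = urllib.request.Request(
        "http://ai-server.local:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])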
reginald78 wrote 14 hours 29 min ago: | |
My first guess would be to change the Above 4G Decoding setting, but
depending on how old the motherboard is, it may not have that
setting.
jedbrooke wrote 1 day ago: | |
I had an old Tesla M40 12 GB lying around and figured I'd try it
out with some 8-13B LLMs, but was disappointed to find that it's
around the same speed as my Mac mini M2. I suppose the Mac mini is a
chip 10 years newer, but it's crazy that mobile today matches
datacenter from 10 years ago.
rjsw wrote 1 day ago: | |
There was an article on Tom's Hardware recently where someone was | |
using a CPU cooler with a GPU [1] | |
[1]: https://www.tomshardware.com/pc-components/gpus/crazed-modde... | |
ww520 wrote 1 day ago: | |
I use a 10-year-old laptop to run a local LLM. The time between prompts
is 10-30 seconds. Not for speedy interactive usage.
atentaten wrote 1 day ago: | |
Enjoyed the article as I am interested in the same. I would like to | |
have seen more about the specific use cases and how they performed on | |
the rig. | |
djhworld wrote 1 day ago: | |
With system builds like this I always feel the VRAM is the limiting
factor when it comes to what models you can run, and consumer-grade
stuff tends to max out at 16GB or (sometimes) 24GB for more expensive
models.
It does make me wonder whether we'll start to see more and more
computers with a unified memory architecture (like the Mac). I know
Nvidia has the Digits thing, which has been renamed to something else.
m0th87 wrote 16 hours 22 min ago: | |
That's what I hope for, but everything that isn't bananas
expensive with unified memory has very low memory bandwidth. DGX
(Digits), Framework Desktop, and non-Ultra Macs are all around 128
GB/s, and will produce single-digit tokens per second for larger
models: [1] So there's a fundamental tradeoff between cost,
inference speed, and hostable model size for the foreseeable future.
[1]: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inferen... | |
JKCalhoun wrote 1 day ago: | |
Go with a server GPU (TESLA) and 24 GB is not unusual. (And also about $300
used on eBay.)
v3ss0n wrote 17 hours 5 min ago: | |
But compute speed is very low. | |
incomingpain wrote 3 days ago: | |
I've been dreaming on pcpartpicker. | |
I think the Radeon RX 7900 XT - 20 GB has been the best bang for your buck.
It enables a full-GPU 32B?
Looking at what other people have been doing lately, they aren't doing
this. They are getting 64+ core CPUs and 512GB of RAM, keeping it on the CPU
and enabling massive models. This setup lets you do Deepseek 671B.
It makes me wonder, how much better is 671B vs 32B?
zargon wrote 21 hours 26 min ago: | |
> It makes me wonder, how much better is 671B vs 32B? | |
32B has improved leaps and bounds in the past year. But Deepseek 671B | |
is still a night and day comparison. 671B just knows so much more | |
stuff. | |
The main issue with RAM-only builds is that prompt ingestion is | |
incredibly slow. If you're going to be feeding in any context at all, | |
it's horrendous. Most people quote their tokens/s with basically | |
non-existent context (a few hundred tokens). Figure out if you're | |
going to be using context, and how much patience you have. Research | |
the speed you'll be getting for prompt processing / token generation | |
at your desired context length in each instance, and make your | |
decision based on that. | |
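To turn that advice into a number: the wait per request is roughly prompt_tokens divided by the prompt-processing rate plus output_tokens divided by the generation rate. A small sketch with placeholder rates (substitute benchmarked figures for your own hardware and context length):
    # Rough wait time for a single request; both rates are placeholders to be
    # replaced with benchmarked numbers for your own hardware and context size.
    prompt_tokens = 16_000   # e.g. a large chunk of code or documents as context
    output_tokens = 800

    pp_rate = 25.0   # assumed prompt-processing tokens/s on a RAM-only build
    tg_rate = 5.0    # assumed token-generation tokens/s

    wait_s = prompt_tokens / pp_rate + output_tokens / tg_rate
    print(f"~{wait_s / 60:.0f} minutes before the full answer arrives")  # ~13 min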
Aeolun wrote 1 day ago: | |
I bought an RX 7900 XTX with 24GB, and it's everything I expected
of it. It's absolutely massive though. I thought I could add one
extra for more memory, but that's a pipe dream in my little desktop
box.
Cheap too, compared to a lot of what I'm seeing.
DogRunner wrote 3 days ago: | |
I used a similar budget and built something like this:
7x RTX 3060 - 12 GB which results in 84GB Vram | |
AMD Ryzen 5 - 5500GT with 32GB Ram | |
All in a 19-inch rack with a nice cooling solution and a beefy power | |
supply. | |
My costs? 1300 Euro, but yeah, I sourced my parts on eBay / second
hand.
(Added some 3D-printed parts into the mix: [1] [2] [3] if you're thinking
about building something similar.)
My power consumption is below 500 watts at the wall when using
LLMs, since I did some optimizations:
* Worked on power optimizations; after many weeks of benchmarking,
the sweet spot on the RTX 3060 12GB cards is a 105-watt limit (a rough
sketch of applying such a limit follows after this list)
* Created patches for Ollama ( [4] ) to group models onto exactly the GPUs
their memory allocation needs instead of spreading them over all available
GPUs (this also reduces the VRAM overhead)
* Ensured that ASPM is used on all relevant PCI components (Powertop is
your friend)
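The sketch below caps every card's power limit with nvidia-smi from a script; the 105 W value is the sweet spot reported above for RTX 3060 12GB cards, so treat it as a starting point and benchmark your own cards.
    # Sketch: cap every NVIDIA GPU's power limit with nvidia-smi (requires root).
    # 105 W is the RTX 3060 12GB sweet spot reported above; the limit resets on
    # reboot, so run this from a startup script or systemd unit.
    import subprocess

    POWER_LIMIT_W = 105

    def gpu_indices():
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"], text=True
        )
        return [line.strip() for line in out.splitlines() if line.strip()]

    subprocess.run(["nvidia-smi", "-pm", "1"], check=True)  # enable persistence mode
    for idx in gpu_indices():
        subprocess.run(["nvidia-smi", "-i", idx, "-pl", str(POWER_LIMIT_W)], check=True)
        print(f"GPU {idx}: power limit set to {POWER_LIMIT_W} W")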
It's not all shiny: | |
* I still use PCIe3 x1 for most of the cards, which limits their
capability, but all I have found so far (PCIe Gen4 x4 extenders and
bifurcation/special PCIe routers) are just too expensive to be used on
such low-powered cards
* Due to the slow PCIe bandwidth, the performance drops significantly
* Max VRAM per GPU is king. If you split up a model over several cards,
the RAM allocation overhead is huge! (See the examples in my Ollama patch
above.) I would rather use 3x 48GB instead of 7x 12GB.
* Some RTX 3060 12GB cards idle at 11-15 watts, which is
unacceptable. Good BIOSes, like the one from Gigabyte (Windforce xxx),
idle at 3 watts, which is a huge difference when you use 7 or more
cards. These BIOSes can be patched, but this can be risky.
All in all, this server idles at 90-100 watts currently, which is perfect
as a central service for my tinkering and my family's usage.
[1]: https://www.printables.com/model/1142963-inter-tech-and-generi... | |
[2]: https://www.printables.com/model/1142973-120mm-5mm-rised-noctu... | |
[3]: https://www.printables.com/model/1142962-cable-management-fur-... | |
[4]: https://github.com/ollama/ollama/pull/10678 | |
reginald78 wrote 14 hours 17 min ago: | |
Great info in this post, with some uncommon questions answered. I have
a 3060 with unimpressive idle power consumption; interesting that it
varies so much.
I know it would increase the idle power consumption, but have you
considered a server platform instead of Ryzen to get more lanes?
Even so, you could probably get at least 4x for 4 cards without
getting too crazy: 2 M.2 -> PCIe adapters, the main GPU slot, and the
fairly common 4x-wired secondary slot.
Splitting the main 16x GPU slot is possible but whenever I looked | |
into this I kind of found the same thing you did. In addition to | |
being a cabling/mounting nightmare the necessary hardware started to | |
eat up enough total system cost that just ponying up for a 3090 | |
started to make more sense. | |
jononor wrote 1 day ago: | |
Impressive! What kind of motherboard do you use to host 7 GPUs? | |
burnt-resistor wrote 3 days ago: | |
Reminds me of [1]. I'll be that guy™ who says that if you're going to do
any computing half-way reliably, only use ECC RAM. Silent bit flips
suck.
[1]: https://cr.yp.to/hardware/build-20090123.html | |
politelemon wrote 3 days ago: | |
If the author is reading this, I'll point out that the CUDA toolkit you
find in the repositories is generally older. You can find the latest
versions straight from Nvidia: [1] The caveat is that sometimes a
library might be expecting an older version of CUDA.
The VRAM on the GPU does make a difference, so it would at some point
be worth looking at another GPU or increasing your system RAM if you
start running into limits.
However I wouldn't worry too much right away, it's more important to | |
get started and get an understanding of how these local LLMs operate | |
and take advantage of the optimisations that the community is making to | |
make it more accessible. Not everyone has a 5090, and if LLMs remain in | |
the realms of high end hardware, it's not worth the time. | |
[1]: https://developer.nvidia.com/cuda-downloads?target_os=Linux&ta... | |
throwaway314155 wrote 1 day ago: | |
The other main caveat is that installing from custom sources using | |
apt is a massive pain in the ass. | |
koakuma-chan wrote 1 day ago: | |
I tried running an LLM locally today, installed the CUDA toolkit, and
it was missing cudnn.h.
I gave up.
v5v3 wrote 3 days ago: | |
I thought the prevailing wisdom was that a used 3090, with its larger VRAM,
was the best budget GPU choice?
And in general, if on a budget, then why not buy used instead of new? More
so as the author himself talks about the resale value for when he
sells it on.
retinaros wrote 1 day ago: | |
yes it is | |
olowe wrote 3 days ago: | |
> I thought prevailing wisdom was that a used 3090 with it's larger | |
vram was the best budget gpu choice? | |
The trick is that memory bandwidth - not just the amount of VRAM - is
important for LLM inference. For example, the B50 specs list a memory
bandwidth of 224 GB/s [1], whereas the Nvidia RTX 3090 has over
900GB/s [2]. The 4070's bandwidth is "just" 500GB/s [3].
More VRAM helps run larger models, but with lower bandwidth, tokens
could be generated so slowly that it's not really practical for
day-to-day use or experimenting.
[1]: https://www.intel.com/content/www/us/en/products/sku/242615/... | |
[2]: https://www.techpowerup.com/gpu-specs/geforce-rtx-3090.c3622 | |
[3]: https://www.thefpsreview.com/gpu-family/nvidia-geforce-rtx-4... | |
lelanthran wrote 1 day ago: | |
> The trick is memory bandwidth - not just the amount of VRAM - is | |
important for LLM inference. | |
I'm not really knowledgeable about this space, so maybe I'm missing | |
something: | |
Why does the bus performance affect token generation? I would | |
expect it to cause a slow startup when loading the model, but once | |
the model is loaded, just how much bandwidth can the token | |
generation possibly use? | |
Token generation is completely on the card using the memory on the | |
card, without any bus IO at all, no? | |
IOW, I'm trying to think of what IO the card is going to need for
token generation, and I can't think of any other than returning the
tokens (which, even on a slow 100MB/s transfer, is still going to be
about 100x the rate at which tokens are being generated).
stevenhuang wrote 1 day ago: | |
During inference, each token passes through each parameter of the
model via matrix-vector products. And then as the context grows,
each new token also passes through all current context tokens via
matrix-vector products.
This means bandwidth requirements grow as context sizes grow.
For datacenter workloads, batching can be used to make efficient use of
this memory bandwidth and make things compute-bound instead.
lelanthran wrote 1 day ago: | |
[I'm still not understanding] | |
It seems to me that even if you pass in a long context on every | |
prompt, that context is still tiny compared to the execution | |
time on the processor/GPU/tensorcore/etc. | |
Let's say I load up a model of 12GB on my 12GB VRAM GPU. I pass
in a prompt with 1MB of context, which causes a response of
500kB after 1s. That's still only 1.5MB of IO transferred in
1s, which kept the GPU busy for 1s. Increasing the prompt is
going to increase the duration to a response accordingly.
Unless the GPU is not fully utilised on each prompt-response | |
cycle, I feel that the GPU is still the bottleneck here, not | |
the bus performance. | |
zargon wrote 21 hours 0 min ago: | |
> I feel that the GPU is still the bottleneck here, not the | |
bus performance. | |
PCIe bus performance is basically irrelevant. | |
> Token generation is completely on the card using the memory | |
on the card, without any bus IO at all, no? | |
Right. But the GPU can't instantaneously access data in VRAM.
It has to be copied from VRAM to GPU registers first. For
every token, the entire contents of VRAM have to be copied into
the GPU to be computed. It's a memory-bound process.
Right now there's about an 8x difference in memory bandwidth | |
between low-end and high-end consumer cards (e.g., 4060 Ti vs | |
5090). Moving up to a B200 more than doubles that performance | |
again. | |
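To put rough numbers on that: if every generated token requires streaming the full set of weights from VRAM, an upper bound on decode speed is simply memory bandwidth divided by model size. A sketch using the bandwidth figures quoted upthread and an assumed ~8 GB quantized model:
    # Upper bound on decode speed: memory bandwidth / bytes of weights per token.
    # Bandwidths are the figures quoted upthread; the 8 GB model size is an
    # assumption (roughly a 14B model at 4-bit, or an 8B model at 8-bit).
    model_bytes = 8e9

    cards = {
        "Intel Arc Pro B50 (~224 GB/s)": 224e9,
        "RTX 4070 (~500 GB/s)": 500e9,
        "RTX 3090 (~900 GB/s)": 900e9,
    }
    for name, bandwidth in cards.items():
        print(f"{name}: <= {bandwidth / model_bytes:.0f} tokens/s")
    # Prints roughly 28, 62, and 112 tokens/s; real numbers land below these bounds.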
imtringued wrote 22 hours 42 min ago: | |
1MB of context can maybe hold 10 tokens depending on your
model.
For reference, Llama 3.1 8B takes about 4 KiB per token per
layer. At 32 layers that is 128 KiB, or 8 tokens per MiB of KV
cache (context). If your context holds 8000 tokens including
responses, then you need around 1GB.
>Unless the GPU is not fully utilised on each prompt-response | |
cycle, I feel that the GPU is still the bottleneck here, not | |
the bus performance. | |
Matrix vector multiplication implies a single floating point | |
multiplication and addition (2 flops) per parameter. Your GPU | |
can do way more flops than that without using tensor cores at | |
all. In fact, this workload bores your GPU to death. | |
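The same arithmetic written out for the KV cache sizing above (the 4 KiB/token/layer and 32-layer figures are taken from the comment; everything else follows from them):
    # KV-cache sizing from the figures above: ~4 KiB per token per layer, 32 layers.
    kib_per_token_per_layer = 4
    layers = 32

    bytes_per_token = kib_per_token_per_layer * 1024 * layers  # 128 KiB per token
    tokens_per_mib = (1024 * 1024) // bytes_per_token           # 8 tokens per MiB

    context_tokens = 8000
    kv_cache_gib = context_tokens * bytes_per_token / 2**30
    print(f"{tokens_per_mib} tokens per MiB of KV cache")
    print(f"{context_tokens} tokens of context ~ {kv_cache_gib:.2f} GiB of KV cache")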
jononor wrote 1 day ago: | |
GPU memory bandwidth is the limiting factor, not PCIe | |
bandwidth. | |
The memory bandwidth is critical because the models rely on | |
getting all the parameters from memory to do computation, and | |
there is a low amount of computation per parameter, so memory | |
tends to be the bottleneck. | |
rcarmo wrote 4 days ago: | |
The trouble with these things is that "on a budget" doesn't
deliver much when most interesting and truly useful models are creeping
beyond the 16GB VRAM limit and/or require a lot of wattage. Even a Mac
mini with enough RAM is starting to look like an expensive proposition,
and the AMD Strix Halo APUs (the SKUs that matter, like the Framework
Desktop at 128GB) are around $2K.
As someone who built a period-equivalent rig (with a 12GB 3060 and
128GB RAM) a few years ago, I am not overly optimistic that local
models will keep being a cheap alternative (never mind the
geopolitics). And yeah, there are very cheap ways to run inference, but
they become pointless - I can run Qwen and Phi4 locally on an ARM chip
like the RK3588, but it is still dog slow.
Jedd wrote 4 days ago: | |
In January 2024 there was a similar post ( [1] ) wherein the author | |
selected dual NVidia 4060 Ti's for an at-home-LLM-with-voice-control -- | |
because they were the cheapest cost per GB of well-supported VRAM at | |
the time. | |
(They probably still are, or at least pretty close to it.) | |
That informed my decision shortly after, when I built something similar | |
- that video card model was widely panned by gamers (or more | |
accurately, gamer 'influencers'), but it was an excellent choice if you | |
wanted 16GB of VRAM with relatively low power draw (150W peak). | |
TFA doesn't say where they are, or what currency they're using (which
implies the hubris of a North American) - at which point that pricing
for a second-hand, smaller-capacity, higher-power-drawing 4070 just
seems weird.
Appreciate the 'on a budget' aspect; it just seems like an objectively
worse path, as upgrades are going to require replacement rather than
augmentation.
As per other comments here, 32GB RAM / 12GB VRAM is going to be really limiting.
Yes - lower-parameter / smaller-quant models are becoming more capable,
but at the same time we're seeing increasing interest in larger context
for these at-home use cases, and that chews up memory real fast.
[1]: https://news.ycombinator.com/item?id=38985152 | |
1shooner wrote 1 day ago: | |
>TFA doesn't say where they are, or what currency they're using | |
They say California, and I'm seeing the dollar amount in the title | |
and metadata as $1,3k, was that an edit? | |
T-A wrote 1 day ago: | |
> TFA doesn't say where they are | |
"the 1,440W limit on wall outlets in California" is a pretty good | |
hint. | |
zxexz wrote 1 day ago: | |
Bringing back memories of testing the breakers in my college
apartments to verify exactly which outlets were on which circuit,
so I could pool as much as possible as needed. I distinctly
remember pulling 20kW once, celebrating with a beer; the memory of
all those cables snaking through the old apartment makes me almost
uneasy now. I do remember we didn't have to pay for heat that
winter, which felt like a major win in Massachusetts. Come to think
of it, I'm pretty sure there are still some servers tucked away
in a crawlspace in that basement.
dcassett wrote 1 day ago: | |
San Francisco specifically: | |
"I prompted ChatGPT to give me recommendations. Prompt: ... The | |
final build will be located at my residence in San Francisco, CA, | |
..." | |
throwaway314155 wrote 1 day ago: | |
> which implies the hubris of a North American | |
No need for that. | |
Jedd wrote 23 hours 28 min ago: | |
Probably true. | |
But for those of us outside the USA bubble, it's incredibly tiring
to have to intuit geo information (when geo information would add
to the understanding).
As others noted in sibling comments, TFA had in fact mentioned in | |
passing their location (in their quoted prompt to chatgpt, and at | |
the very end of the third supporting point for the decision to go | |
for an Nvidia 4070) 'California, CA'. I confess that I skimmed over | |
both those paragraphs. | |
Now, sure, CA is a country code, but I stand corrected that the
author completely hid their location. Had I spotted those clues I'd
not have had to make any assumptions about wall power
capabilities & costs, new & second-hand market availability /
costs, etc.
I think I mostly catered for those considerations in the rest of my | |
original comment though - asserted power sensitivity makes it | |
surprising that a higher-power-requiring, smaller-RAM-capacity, | |
more-expensive-than-a-sibling-generation-16GB card was selected. | |
topato wrote 1 day ago: | |
He did soften the blow by saying North American, rather than the
more correct and apropos "American".
dfc wrote 1 day ago: | |
The author also refers to Californian power limits. So it seems | |
the criticism is misplaced. | |
topato wrote 1 day ago: | |
True, though | |
Uehreka wrote 4 days ago: | |
Love the attention to detail, I can tell this was a lot of work to put | |
together and I hope it helps people new to PC building. | |
I will note though, 12GB of VRAM and 32GB of system RAM is a ceiling
you're going to hit pretty quickly if you're into messing with
LLMs. There's basically no way to do a better job at the budget
you're working with though.
One thing I hear about a lot is people using things like RunPod to
briefly get access to powerful GPUs/servers when they need one. If you
spend $2/hr you can get access to an H100. If you have a budget of
$1300, that could get you about 600 hours of compute time, which (unless
you're doing training runs) should last you several months.
In several months' time the specs required to run good models will be
different again in ways that are hard to predict, so this approach can
help save on the heartbreak of buying an RTX 5090 only to find that
even that doesn't help much with LLM inference and we're all gonna
need the cheaper-but-more-VRAM Intel Arc B60s.
numpad0 wrote 1 day ago: | |
I don't understand why some people build a "rig", put a lot of
thought into ever-so-slightly differently binned CPUs, and then
don't max out the RAM (putting aside DDR5 quirk considerations). It's like
buying a sports car only to cheap out on tires. It makes no sense.
Uehreka wrote 8 hours 4 min ago: | |
I built my current computer last fall. The Ryzen 7950X was on an
awesome sale for Black Friday, and after looking at the math, buying
a 9950X just didn't make sense. So I got the 7950X and 96GB of
DDR5 RAM (2 sticks, so I can double later if I need to). Loving it;
it was the perfect choice.
All this to say, some people do in fact do this ;)
semi-extrinsic wrote 1 day ago: | |
> save on the heartbreak of buying an RTX 5090 only to find that even
that doesn't help much with LLM inference and we're all gonna
need the cheaper-but-more-VRAM Intel Arc B60s
When going for more VRAM, with an RTX 5090 currently sitting at $3000 | |
for 32GB, I'm curious why people aren't trying to get the Dell | |
C4140s. Those seem to go for $3000-$4000 for the whole server with 4x | |
V100 16GB, so 64GB total VRAM. | |
Maybe it's just because they produce heat and noise like a small | |
turbojet. | |
nickpsecurity wrote 1 day ago: | |
Don't the parallelizing techniques of a 4x build make using them
more difficult than a 1x build with no extra parallelism? Couldn't
the 32GB 5090 handle more models in their original configurations?
ijk wrote 11 hours 28 min ago: | |
For LLM inference, parallel GPUs are mostly fine (you take some
performance hit, but llama.cpp doesn't care what cards you use, and
other stuff handles 4 symmetric GPUs just fine). You get more
problems when you're doing anything training-related, though.
zargon wrote 22 hours 4 min ago: | |
> Don't the parallelizing techniques of a 4x build make using | |
them more difficult than a 1x build with no extra parallelism? | |
For inference, no. For training, only slightly. | |
7speter wrote 4 days ago: | |
I dunno, everyone, but I think Intel has something big on their hands
with their announced workstation GPUs. The B50 is a low-profile card
that doesn't have a power-supply hookup because it only uses something
like 60 watts, and comes with 16GB VRAM at an MSRP of 300 dollars.
I imagine companies will have first dibs via agreements
with suppliers like CDW, etc., but if Intel has enough of these
Battlemage dies accumulated, it could also drastically change the local
AI enthusiast/hobbyist landscape; for starters, this could drive down
the price of workstation cards that are ideal for inference, at the
very least. I'm cautiously excited.
On the AMD front (really, a sort of open compute front), Vulkan Kompute
is picking up steam, and it would be really cool to have a standard that
mostly(?) ships with Linux, with older ports available for FreeBSD, so
that we can actually run free-as-in-freedom inference locally.
golly_ned wrote 4 days ago: | |
Whenever I get to a section that was clearly autogenerated by an LLM, I
lose interest in the entire article. Suddenly the entire thing is
suspect and I feel like I'm wasting my time, since I'm no longer
encountering the mind of another person, just interacting with a
system.
throwaway314155 wrote 1 day ago: | |
Eh, yeah - the article starts off pretty specific but then gets into | |
the weeds of stuff like how to put your PC together, which is far | |
from novel information and certainly not on-topic in my opinion. | |
memcg wrote 15 hours 29 min ago: | |
I sent the article link to my son because he does not have | |
experience building or assembling hardware or installing or using | |
Linux. Also took the author's ChatGPT prompt and changed it to ask | |
about reusing two HPE ML150 Gen9 servers I picked up free. I think | |
my son will benefit from the details in the article that many find | |
off-topic. | |
bravesoul2 wrote 4 days ago: | |
I didn't see anything like that here. Yeah they used bullets. | |
golly_ned wrote 3 days ago: | |
There's a section that says what the parts of a PC are, and what
each part is.
Nevermark wrote 1 day ago: | |
> I used the AI-generated recommendations as a starting point, | |
and refined the options with my own research. | |
Referring to this section? | |
I don't see a problem with that. This isn't an article about a | |
design intended for 10,000 systems. Just one person's follow | |
through on an interesting project. With disclosure of | |
methodology. | |
uniposterz wrote 4 days ago: | |
I had a similar setup for a local LLM; 32GB was not enough. I recommend
going for 64GB.
vunderba wrote 4 days ago: | |
The RTX market is particularly irritating right now; even second-hand
4090s are still going for MSRP, if you can find them at all.
Most of the recommendations for this budget AI system are on point - | |
the only thing I'd recommend is more RAM. 32GB is not a lot - | |
particularly if you start to load larger models through formats such as | |
GGUF and want to take advantage of system RAM to split the layers at
the cost of inference speed. I'd recommend at least 2 x 32GB or even 4 | |
x 32GB if you can swing it budget-wise. | |
Author mentioned using Claude for recommendations, but another great | |
resource for building machines is PC Part Picker. They'll even show | |
warnings if you try pairing incompatible parts or try to use a PSU that | |
won't supply the minimum recommended power. | |
[1]: https://pcpartpicker.com | |
Aeolun wrote 1 day ago: | |
I thought those 4090s were weird. You pay more for them than for the
brand new 5090. And then there's AMD, which everyone loves to hate,
but has similar GPUs that cost 1/4 of what a similar Nvidia GPU
costs.