| _______ __ _______ | |
| | | |.---.-..----.| |--..-----..----. | | |.-----..--.--.--..-----. | |
| | || _ || __|| < | -__|| _| | || -__|| | | ||__ --| | |
| |___|___||___._||____||__|__||_____||__| |__|____||_____||________||_____| | |
| on Gopher (unofficial) | |
| Visit Hacker News on the Web | |
| COMMENT PAGE FOR: | |
| Building an AI server on a budget | |
| eachro wrote 8 hours 48 min ago: | |
| A lot of people are saying 12 GB is too small to do anything interesting | |
| with. What's the most useful thing people __have__ gotten to work? | |
| ntlm1686 wrote 10 hours 25 min ago: | |
| Building a PC that can play video games and run some LLMs. | |
| dubrado wrote 16 hours 27 min ago: | |
| If you want to save some money and test things out, check out | |
| Hyperbolic (app.hyperbolic.xyz). | |
| They're based in the US, don't store any data, and you can rent, | |
| self-serve style, in less than a minute. | |
| diggan wrote 16 hours 24 min ago: | |
| If you're gonna promote your own product, at least be honest and | |
| brave enough to acknowledge that you built/manage it. | |
| v3ss0n wrote 17 hours 37 min ago: | |
| A 12GB GPU can't do anything useful. The minimum should be 32GB of VRAM, | |
| where you can run actual models (Mistral-Small, Qwen3-32B, etc.). | |
| alganet wrote 22 hours 31 min ago: | |
| Let me try to put this in the scale of coffee: | |
| -- | |
| Using LLM via api: Starbucks. | |
| Inference at home: Nespresso capsules. | |
| Fine-tune a small model at home: Owning a grinder and an Italian | |
| espresso machine. | |
| Pre-training a model: Owning a moderate coffee plantation. | |
| teleforce wrote 23 hours 6 min ago: | |
| >DECISION: Nvidia RTX 4070 | |
| I'm curious why OP didn't go for the more recent Nvidia RTX 4060 Ti | |
| with 16 GB VRAM, which costs less (~USD 500) brand new and has lower | |
| power consumption at 165 W. Related discussion [1]: "RTX 5060 Ti 16GB | |
| sucks for gaming, but seems like a diamond in the rough for AI". | |
| [1]: https://news.ycombinator.com/item?id=44196991 | |
| qingcharles wrote 21 hours 36 min ago: | |
| And if you're gonna be fine with 12GB, why not a 2080ti instead? | |
| Fluorescence wrote 16 hours 51 min ago: | |
| Only 11GB... but I guess it will allow you to not do anything | |
| useful just as well as 12GB will :) | |
| You can however solder on double-capacity memory chips to get 22GB: | |
| [1] I had hoped the article would be more along these lines than | |
| calling an unremarkable second-hand last-gen gaming PC an "AI | |
| Server". | |
| [1]: https://forums.overclockers.com.au/threads/double-your-gpu... | |
| zlies wrote 23 hours 34 min ago: | |
| Did you not use any thermal paste at all, or did you just forget to | |
| mention it in your post? | |
| PeterStuer wrote 1 day ago: | |
| For image generation the article's setup might be viable, but do not | |
| expect to run LLMs with satisfactory quality and speed on 12 GB of VRAM. | |
| lazylizard wrote 1 day ago: | |
| why not one of these? | |
| [1]: https://www.amazon.sg/NVIDIA-Jetson-Orin-64GB-Developer/dp/B0B... | |
| numpad0 wrote 20 hours 55 min ago: | |
| Jetsons aren't so fast, those are intended for mobile robots. The one | |
| supposed to be just around the corner is DGX Spark(Project DIGITS) | |
| and DGX Station. | |
| Those DGX machines are still at right around the corner state. | |
| romanovcode wrote 1 day ago: | |
| Isn't the new computer NVIDIA is about to release much better than | |
| this one, at the same price? Why would anyone buy this one now? It | |
| seems like a waste of money. | |
| mythz wrote 1 day ago: | |
| Good value but a 12GB card isn't going to let you do too much given the | |
| low quality of small models. Curious what "home AI" use cases small | |
| models are being used for? | |
| It would be nice to see best-value home AI setups under different | |
| budgets or RAM tiers, e.g. the best-value configuration for 128 GB of | |
| GPU VRAM, etc. | |
| My 48GB GPU VRAM "Home AI Server" cost ~$3100 from all parts on eBay: | |
| 3x A4000s in a Supermicro 1U rack server with 128GB RAM and a 32/64 | |
| core Xeon. Nothing amazing, but I wanted the most GPU VRAM before | |
| paying the premium Nvidia tax on their larger GPUs. | |
| This works well for Ollama/llama-server, which can make use of all the | |
| GPU VRAM; unfortunately ComfyUI can't use all GPU VRAM to run larger | |
| models, so I'm on the lookout for a lot more RAM in my next GPU server. | |
| Really hoping Intel can deliver with its upcoming Arc Pro B60 Dual GPU | |
| for a great value 48GB option which can be run 4x in an affordable | |
| 192GB VRAM workstation [1]. If it runs Ollama and ComfyUI efficiently | |
| I'm sold. | |
| [1]: https://www.servethehome.com/maxsun-intel-arc-pro-b60-dual-gpu... | |
| rwyinuse wrote 22 hours 33 min ago: | |
| I use a Proxmox server with an RTX 3060 to generate paintings (I have | |
| a couple of old jailbroken Amazon Kindles attached to walls for that | |
| purpose), and to run ollama, which is connected to Home Assistant and | |
| their voice preview device, allowing me to talk with an LLM without | |
| transmitting anything to cloud services. | |
| Admittedly, with that amount of VRAM the models I can run are fairly | |
| useless for stuff like controlling lights via Home Assistant; it | |
| occasionally does what I tell it to do, but usually not. It is pretty | |
| okay for telling me information, like temperature or value of some | |
| sensors I have connected to HA. For generating AI paintings it's | |
| enough. My server also hosts tons of virtual machines, docker | |
| containers and is used for remote gameplay, so the AI thing is just | |
| an extra. | |
| garyfirestorm wrote 17 hours 37 min ago: | |
| Why do you say that? You can easily fine-tune an 8B parameter model | |
| for function calling. | |
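| For example, a minimal QLoRA-style sketch with the Hugging Face | |
| transformers/peft stack (the model id and hyperparameters below are | |
| just illustrative assumptions): | |
| import torch | |
| from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig | |
| from peft import LoraConfig, get_peft_model | |
| model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # any ~8B base works | |
| tokenizer = AutoTokenizer.from_pretrained(model_id) | |
| model = AutoModelForCausalLM.from_pretrained( | |
|     model_id, | |
|     quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # 4-bit base | |
|     torch_dtype=torch.bfloat16, | |
|     device_map="auto", | |
| ) | |
| # Low-rank adapters keep the trainable parameters (and optimizer state) | |
| # tiny, which is what makes an 8B fine-tune fit on a 12-24 GB card. | |
| lora = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM", | |
|                   target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]) | |
| model = get_peft_model(model, lora) | |
| model.print_trainable_parameters()  # typically well under 1% of the model | |
| # ...then train on a function-calling dataset with transformers.Trainer | |
| # or trl's SFTTrainer. | |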
| msgodel wrote 22 hours 34 min ago: | |
| It's really not going to let you train much which IMO is the only | |
| reason I'd personally bother with a big GPU. Gradients get huge and | |
| everything does them with single/half precision floating point. | |
| itake wrote 1 day ago: | |
| My home AI machine does image classification. | |
| naavis wrote 1 day ago: | |
| What kind of image classification do you do at home? | |
| itake wrote 22 hours 36 min ago: | |
| My side project accepts and publishes user generated content. To | |
| stay compliant with regulations, I use ML to remove adult | |
| content: | |
| [1]: https://github.com/KevinColemanInc/NSFW-FLASK | |
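| A rough sketch of the same idea with an off-the-shelf classifier (the | |
| checkpoint name below is just an example from the Hugging Face hub and | |
| not necessarily what the linked repo uses): | |
| from transformers import pipeline | |
| clf = pipeline("image-classification", model="Falconsai/nsfw_image_detection") | |
| def is_adult(path: str, threshold: float = 0.8) -> bool: | |
|     # the pipeline returns a list of {"label": ..., "score": ...} dicts | |
|     scores = {r["label"].lower(): r["score"] for r in clf(path)} | |
|     return scores.get("nsfw", 0.0) >= threshold | |
| print(is_adult("upload_1234.jpg"))  # hypothetical user upload | |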
| mythz wrote 1 day ago: | |
| Using just an Ollama VL Model (gemma3/mistral-small3.1/qwen2.5vl) | |
| or a specific library? | |
| itake wrote 22 hours 35 min ago: | |
| My home server detects NSFW images in user generated content on | |
| my side project. | |
| source code: | |
| [1]: https://github.com/KevinColemanInc/NSFW-FLASK | |
| mythz wrote 21 hours 57 min ago: | |
| Cool, I've tried a few but settled on using EraX NSFW to do the | |
| same. | |
| jononor wrote 1 day ago: | |
| Agreed, 12 GB does not seem useful. For a coding LLM, it seems 128 GB | |
| is needed to get even close to the frontier models. | |
| For generative image processing (not video), it looks like one can | |
| get started with 16GB. | |
| noufalibrahim wrote 1 day ago: | |
| This is interesting. We recently built a similar machine to implement a | |
| product that we're building on a customer site. | |
| I didn't buy second-hand parts since I wasn't sure of the quality, so | |
| it was a little pricey, but we have the entire thing working now, and | |
| over the last week we added the LLM server to the mix. Haven't | |
| released it yet though. | |
| I wrote about some "fun" we had getting it together here but it's not | |
| as technically detailed as the original article. | |
| [1]: https://blog.hpcinfra.com/when-linkedin-met-reality-our-bangal... | |
| danielhep wrote 1 day ago: | |
| What are the practical uses of a self hosted LLM? Is it actually | |
| possible to approach the likes of Claude or one of the other big ones | |
| on your own hardware for a reasonable budget? I don't know if this is | |
| something that's actually worth it or if people are just building | |
| these rigs for fun or niche use cases that don't require the | |
| intelligence of a hosted LLM. | |
| tmountain wrote 16 hours 1 min ago: | |
| Personal opinion, it's for fun with some internal narrative of | |
| justification. It doesn't seem like it would be cost effective or | |
| provide better results, as all the major LLM vendors benefit | |
| tremendously from economies of scale, and the monthly fees for these | |
| services are extremely reasonable for what you are getting. Going | |
| further, cloud-based LLMs receive upgrades constantly, while static | |
| hardware will likely lock you out of future models at some time | |
| horizon. | |
| numpad0 wrote 1 day ago: | |
| A couple of best-VRAM-for-the-buck (and borderline space heater) GPUs | |
| off the top of my head: Tesla K80(12GBx2), M40(24GB), Radeon Instinct | |
| MI(25|50|60|100)(8-32GB?), Radeon Pro V340(16GBx2), bunch of other | |
| Radeon Vega 8GB cards e.g. Vega 56, NVIDIA P102/P104(~16GB), Intel | |
| A770(16GB). Note: some of these are truly just space heaters. | |
| I'm not sure if right now is the best time to build an LLM rig, as the | |
| Intel Arc B60 (24GBx2) is about to go on sale. Or maybe it is, to | |
| secure multiples of 16GB cards hastily offloaded before its launch? | |
| AJRF wrote 1 day ago: | |
| Why a 4070 over a 3090? A 4070 has half the VRAM. In the UK you can get | |
| a 3090 for like 600GBP. | |
| Havoc wrote 1 day ago: | |
| > You pay a lot upfront for the hardware, but if your usage of the GPU | |
| is heavy, then you save a lot of money in the long run. | |
| Last I saw data on this, it wasn't true. In a like-for-like comparison | |
| (same model and quant), the API is cheaper than the electricity alone, | |
| so you never make back the hardware cost. That was a year ago and API | |
| costs have plummeted, so I'd imagine it's even worse now. | |
| Datacenters have cheaper electricity, can do batch inference at scale, | |
| and have more efficient cards. And that's before we consider the huge | |
| free allowances from Google etc. | |
| Owning AI gear is cool... but not because of the economics. | |
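| A back-of-the-envelope check (every number below is an assumption for | |
| illustration: ~40 tok/s for an 8B model on a mid-range card drawing | |
| ~250 W at the wall, $0.30/kWh residential power, and an API price of | |
| roughly $0.20 per million tokens for a comparable small model): | |
| tokens_per_sec = 40 | |
| watts = 250 | |
| usd_per_kwh = 0.30 | |
| api_usd_per_mtok = 0.20 | |
| tokens_per_kwh = tokens_per_sec * 3600 / (watts / 1000)   # ~576k tokens/kWh | |
| home_usd_per_mtok = 1e6 / tokens_per_kwh * usd_per_kwh    # ~$0.52 per Mtok | |
| print(f"home: ${home_usd_per_mtok:.2f}/Mtok vs API: ${api_usd_per_mtok:.2f}/Mtok") | |
| Under these assumptions the electricity alone already costs more than | |
| typical small-model API pricing, before counting the hardware itself. | |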
| edg5000 wrote 1 day ago: | |
| Is this also the case for token-heavy uses such as Claude Code? Not | |
| sure if I will end up using CC for development in the future, but if | |
| I end up leaning on that, I wonder if there would be a desire to | |
| essentially have it run 24/7. When run 24/7, would CC incur more in | |
| API fees than the residential electricity would cost when running on | |
| your own gear? I have no idea about the numbers. Just wondering. | |
| Havoc wrote 16 hours 54 min ago: | |
| I doubt you're going to beat the datacenter under any conditions with | |
| any model that is vaguely like for like. | |
| The comparison I saw was a small Llama 8B model, i.e. something you | |
| can actually get usable numbers for both at home and via API. So | |
| something pretty commoditized. | |
| > When run 24/7, would CC incur more in API fees than the residential | |
| electricity would cost when running on your own gear? | |
| Claude is pretty damn expensive, so it's plausible that you can | |
| undercut it with another model. That throws the like-for-like | |
| assumption out the door though. A valid play practically, but it kinda | |
| undermines the buy-your-own-rig-to-save argument. | |
| whalesalad wrote 1 day ago: | |
| I would rather spend $1,300 on openai/anthropic credits. The | |
| performance from that 4070 cannot be worth the squeeze. | |
| T-A wrote 1 day ago: | |
| I would consider adding $400 for something like this instead: | |
| [1]: https://www.bosgamepc.com/products/bosgame-m5-ai-mini-desktop-... | |
| atentaten wrote 1 day ago: | |
| Do you use this? If so, what's your use case and performance? | |
| T-A wrote 19 hours 41 min ago: | |
| No, they start shipping in July. The main advertised use case is | |
| self-hosting LLMs. | |
| usercvapp wrote 1 day ago: | |
| I have a server at home sitting IDLE for the last 2 years with 2 TB of | |
| RAM and 4 CPUs. | |
| I am gonna push it this week and launch some LLM models to see how they | |
| perform! | |
| How efficient are they to run locally, in terms of the electric bill? | |
| AzN1337c0d3r wrote 19 min ago: | |
| Depends on the server. Probably not going to be cost effective. I get | |
| barely ~0.5 tokens/sec. | |
| I have Dual E5-2699A v4 w/1.5 TB DDR4-2933 spread across 2 sockets. | |
| The full Deepseek-R1 671B (~1.4 TB) with llama.cpp seems to have an | |
| issue in that local engines that run the LLMs don't do NUMA-aware | |
| allocation, so cores will often have to pull the weights in from | |
| another socket's memory controllers through the inter-socket links | |
| (QPI/UPI/HyperTransport) and bottleneck there. | |
| For my platform that's 2x QPI links @ ~39.2GB/s/link that get | |
| saturated. | |
| I give it a prompt, go to work and check back on it at lunch and | |
| sometimes it's still going. | |
| If you want interactive use, I'd aim for 7-10 tokens/s, so | |
| realistically that means you'll run one of the 8B models on a GPU | |
| (~30 tokens/s) or maybe a 70B model on an M4 Max (~8 tokens/s). | |
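| Rough arithmetic on why this is memory-bound (figures come from the | |
| comment above plus the assumption of ~fp16 weights and ~37B active MoE | |
| parameters per token): | |
| active_params = 37e9      # DeepSeek-R1 671B is MoE; ~37B active per token | |
| bytes_per_param = 2       # the ~1.4 TB size suggests ~fp16 weights | |
| bytes_per_token = active_params * bytes_per_param   # ~74 GB read per token | |
| qpi_bw = 2 * 39.2e9       # the two saturated QPI links, ~78 GB/s | |
| print(qpi_bw / bytes_per_token, "tokens/s upper bound through QPI") | |
| That works out to roughly 1 token/s in the ideal case, so the observed | |
| ~0.5 tokens/s with NUMA-blind allocation is in the right ballpark. | |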
| pshirshov wrote 1 day ago: | |
| A 3090 for ~1000 is a much more solid choice. Also, these old mining | |
| mobos play very well for multi-GPU ollama. | |
| msp26 wrote 1 day ago: | |
| > 12GB vram | |
| A waste of effort; why would you go to the trouble of building + | |
| blogging about this? | |
| timnetworks wrote 5 hours 58 min ago: | |
| Can easily be replaced with a 24GB one, a drop-in upgrayedd like RAM. | |
| Brought to you by Carl's Jr. | |
| jacekm wrote 1 day ago: | |
| For $100 more you could get a used 3090 with twice as much VRAM. You | |
| could also get 4060 Ti which is cheaper than 4070 and it has 16 GB VRAM | |
| (although it's less powerful too, so I guess it depends on the use case) | |
| iJohnDoe wrote 1 day ago: | |
| Any details about the ML software or | |
| AI software? | |
| JKCalhoun wrote 1 day ago: | |
| Someone posted that they had used a "mining rig" [1] from AliExpress | |
| for less than $100. It even has RAM and a CPU. He picked up a 2000W (!) | |
| DELL server PS for cheap off eBay. The GPUs were NVIDIA TESLAs (M40 for | |
| example) since they often have a lot of RAM and are less expensive. | |
| I followed in those footsteps to create my own [2] (photo [3]). | |
| I picked up a 24GB M40 for around $300 off eBay. I 3D printed a "cowl" | |
| for the GPU that I found online and picked up two small fans from | |
| Amazon that go in the cowl. Attached, the cowl + fans keep the GPU | |
| cool. (These TESLA server GPUs have no fan since they're expected to | |
| live in one of those wind tunnels called a server rack.) | |
| I bought the same cheap DELL server PS that the original person had | |
| used and I also had to get a break-out board (and power-supply cables | |
| and adapters) for the GPU. | |
| Thanks to LLMs, I was able to successfully install Rocky Linux as well | |
| as CUDA and NVIDIA drivers. I SSH into it and run ollama commands. | |
| My own hurdle at this point is: I have a 2nd 24 GB M40 TESLA but when | |
| installed on the motherboard, Linux will not boot. LLMs are helping me | |
| try to set up BIOS correctly or otherwise determine what the issue is. | |
| (We'll see.) I would love to get to 48 GB. | |
| [1]: https://www.aliexpress.us/item/3256806580127486.html | |
| [2]: https://bsky.app/profile/engineersneedart.com/post/3lmg4kiz4fk... | |
| [3]: https://cdn.bsky.app/img/feed_fullsize/plain/did:plc:oxjqlammq... | |
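| Once ollama is serving on the rig, you don't even need an interactive | |
| SSH session for every prompt; a sketch of hitting its REST API from | |
| another machine on the LAN (hostname and model name are placeholders, | |
| and ollama must be started with OLLAMA_HOST=0.0.0.0 to listen beyond | |
| localhost): | |
| import requests | |
| resp = requests.post( | |
|     "http://ai-server.local:11434/api/generate", | |
|     json={"model": "llama3.1:8b", "prompt": "Hello from the LAN", | |
|           "stream": False}, | |
|     timeout=300, | |
| ) | |
| print(resp.json()["response"]) | |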
| reginald78 wrote 14 hours 29 min ago: | |
| My first guess would be to change the Above 4G decoding setting but | |
| depending upon how old the motherboard is it may not have that | |
| setting. | |
| jedbrooke wrote 1 day ago: | |
| I had an old Tesla M40 12 GB lying around and figured I'd try it out | |
| with some 8-13B LLMs, but was disappointed to find that it's around | |
| the same speed as my Mac Mini M2. I suppose the Mac Mini is a | |
| 10-years-newer chip, but it's crazy that mobile today matches data | |
| center from 10 years ago. | |
| rjsw wrote 1 day ago: | |
| There was an article on Tom's Hardware recently where someone was | |
| using a CPU cooler with a GPU [1] | |
| [1]: https://www.tomshardware.com/pc-components/gpus/crazed-modde... | |
| ww520 wrote 1 day ago: | |
| I use a 10-year-old laptop to run a local LLM. The time between | |
| prompts is 10-30 seconds. Not for speedy interactive usage. | |
| atentaten wrote 1 day ago: | |
| Enjoyed the article as I am interested in the same. I would like to | |
| have seen more about the specific use cases and how they performed on | |
| the rig. | |
| djhworld wrote 1 day ago: | |
| With system builds like this I always feel the VRAM is the limiting | |
| factor when it comes to what models you can run, and consumer grade | |
| stuff tends to max out at 16GB or (sometimes) 24GB for more expensive | |
| models. | |
| It does make me wonder whether we'll start to see more and more | |
| computers with unified memory architecture (like the Mac) - I know | |
| nvidia have the Digits thing which has been renamed to something else | |
| m0th87 wrote 16 hours 22 min ago: | |
| That's what I hope for, but everything that isn't bananas expensive | |
| with unified memory has very low memory bandwidth. DGX (Digits), | |
| Framework Desktop, and non-Ultra Macs are all around 128 GB/s, and | |
| will produce single-digit tokens per second for larger models: [1] So | |
| there's a fundamental tradeoff between cost, inference speed, and | |
| hostable model size for the foreseeable future. | |
| [1]: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inferen... | |
| JKCalhoun wrote 1 day ago: | |
| Go with a server GPU (TESLA); 24 GB is not unusual. (And also about | |
| $300 used on eBay.) | |
| v3ss0n wrote 17 hours 5 min ago: | |
| But compute speed is very low. | |
| incomingpain wrote 3 days ago: | |
| I've been dreaming on pcpartpicker. | |
| I think the Radeon RX 7900 XT - 20 GB has been the best bang for your | |
| buck. Enables a full-GPU 32B? | |
| Looking at what other people have been doing lately, they aren't doing | |
| this. | |
| They are getting 64+ core CPUs and 512GB of RAM, keeping it on the CPU | |
| and enabling massive models. This setup lets you do Deepseek 671B. | |
| It makes me wonder, how much better is 671B vs 32B? | |
| zargon wrote 21 hours 26 min ago: | |
| > It makes me wonder, how much better is 671B vs 32B? | |
| 32B has improved leaps and bounds in the past year. But Deepseek 671B | |
| is still a night and day comparison. 671B just knows so much more | |
| stuff. | |
| The main issue with RAM-only builds is that prompt ingestion is | |
| incredibly slow. If you're going to be feeding in any context at all, | |
| it's horrendous. Most people quote their tokens/s with basically | |
| non-existent context (a few hundred tokens). Figure out if you're | |
| going to be using context, and how much patience you have. Research | |
| the speed you'll be getting for prompt processing / token generation | |
| at your desired context length in each instance, and make your | |
| decision based on that. | |
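| To put rough numbers on that (the prompt-processing rates here are | |
| made-up but representative orders of magnitude): | |
| context_tokens = 16_000   # e.g. pasting in a few source files | |
| for label, pp_tok_s in [("RAM-only build", 30), ("consumer GPU", 800)]: | |
|     print(f"{label}: ~{context_tokens / pp_tok_s / 60:.1f} min to first token") | |
| # RAM-only: ~8.9 min of prompt processing; GPU: ~0.3 min. | |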
| Aeolun wrote 1 day ago: | |
| I bought an RX 7900 XTX with 24GB, and it's everything I expected of | |
| it. It's absolutely massive though. I thought I could add one extra | |
| for more memory, but that's a pipe dream in my little desktop box. | |
| Cheap too, compared to a lot of what I'm seeing. | |
| DogRunner wrote 3 days ago: | |
| I used a similar budget and built something like this: | |
| 7x RTX 3060 - 12 GB which results in 84GB Vram | |
| AMD Ryzen 5 - 5500GT with 32GB Ram | |
| All in a 19-inch rack with a nice cooling solution and a beefy power | |
| supply. | |
| My costs? 1300 Euro, but yeah, I sourced my parts on ebay / second | |
| hand. | |
| (Added some 3d printed parts into the mix: [1] [2] [3] if you think | |
| about building something similar) | |
| My power consumption is below 500 W at the wall when using LLMs, since | |
| I did some optimizations: | |
| * Worked on power optimizations; after many weeks of benchmarking, the | |
| sweet spot on the RTX 3060 12GB cards is a 105 W limit (a sketch of | |
| applying this follows after this comment) | |
| * Created patches for Ollama ( [4] ) to group models into exact memory | |
| allocations instead of spreading them over all available GPUs (this | |
| also reduces the VRAM overhead) | |
| * ensured that ASPM is used on all relevant PCI components (Powertop is | |
| your friend) | |
| It's not all shiny: | |
| * I still use PCIe3 X1 for most of the cards, which limits their | |
| capability, but all I found so far (PCIe Gen4 x4 extender and | |
| bifurcation/special PCIE routers) are just too expensive to be used on | |
| such low powered cards | |
| * Due to the slow PCIe bandwidth, the performance drops significantly | |
| * Max VRAM per GPU is king. If you split up a model over several cards, | |
| the RAM allocation overhead is huge! (See the examples in my Ollama | |
| patch above.) I would rather use 3x 48GB instead of 7x 12GB. | |
| * Some RTX 3060 12GB Cards do idle at 11-15 Watt, which is | |
| unacceptable. Good BIOSes like the one from Gigabyte (Windforce xxx) do | |
| idle at 3 Watt, which is a huge difference when you use 7 or more | |
| cards. These BIOSes can be patched, but this can be risky | |
| All in all, this server idles at 90-100 W currently, which is perfect | |
| as a central service for my tinkerings and my family usage. | |
| [1]: https://www.printables.com/model/1142963-inter-tech-and-generi... | |
| [2]: https://www.printables.com/model/1142973-120mm-5mm-rised-noctu... | |
| [3]: https://www.printables.com/model/1142962-cable-management-fur-... | |
| [4]: https://github.com/ollama/ollama/pull/10678 | |
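| A minimal sketch of the 105 W cap mentioned above, applied to every | |
| card (assumes the stock nvidia-smi CLI, run as root; some cards | |
| enforce a minimum limit higher than what you ask for): | |
| import subprocess | |
| GPU_COUNT = 7 | |
| subprocess.run(["nvidia-smi", "-pm", "1"], check=True)   # persistence mode | |
| for i in range(GPU_COUNT): | |
|     subprocess.run(["nvidia-smi", "-i", str(i), "-pl", "105"], check=True) | |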
| reginald78 wrote 14 hours 17 min ago: | |
| Great info in this post with some uncommon questions answered. I have | |
| a 3060 with unimpressive idle power consumption, interesting that it | |
| varies so much. | |
| I know it would increase the idle power consumption, but have you | |
| considered a server platform instead of Ryzen to get more lanes? | |
| Even so, you could probably get at least x4 for 4 cards without | |
| getting too crazy: 2 M.2 -> PCIe adapters, the main GPU slot, and the | |
| fairly common x4-wired secondary slot. | |
| Splitting the main 16x GPU slot is possible but whenever I looked | |
| into this I kind of found the same thing you did. In addition to | |
| being a cabling/mounting nightmare the necessary hardware started to | |
| eat up enough total system cost that just ponying up for a 3090 | |
| started to make more sense. | |
| jononor wrote 1 day ago: | |
| Impressive! What kind of motherboard do you use to host 7 GPUs? | |
| burnt-resistor wrote 3 days ago: | |
| Reminds me of [1]. I'll be that guy(tm) who says that if you're going | |
| to do any computing half-way reliably, only use ECC RAM. Silent bit | |
| flips suck. | |
| [1]: https://cr.yp.to/hardware/build-20090123.html | |
| politelemon wrote 3 days ago: | |
| If the author is reading this I'll point out that the cuda toolkit you | |
| find in the repositories is generally older. You can find the latest | |
| versions straight from Nvidia: [1] The caveat is that sometimes a | |
| library might be expecting an older version of cuda. | |
| The VRAM on the GPU does make a difference, so it would at some point | |
| be worth looking at another GPU or increasing your system RAM if you | |
| start running into limits. | |
| However I wouldn't worry too much right away, it's more important to | |
| get started and get an understanding of how these local LLMs operate | |
| and take advantage of the optimisations that the community is making to | |
| make it more accessible. Not everyone has a 5090, and if LLMs remain in | |
| the realms of high end hardware, it's not worth the time. | |
| [1]: https://developer.nvidia.com/cuda-downloads?target_os=Linux&ta... | |
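| A quick way to confirm the driver/toolkit combo actually works end to | |
| end (assuming PyTorch was installed with CUDA support): | |
| import torch | |
| print(torch.cuda.is_available())   # True once driver and runtime line up | |
| print(torch.version.cuda)          # CUDA version PyTorch was built against | |
| if torch.cuda.is_available(): | |
|     print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce RTX 4070" | |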
| throwaway314155 wrote 1 day ago: | |
| The other main caveat is that installing from custom sources using | |
| apt is a massive pain in the ass. | |
| koakuma-chan wrote 1 day ago: | |
| I tried running an LLM locally today, installed the CUDA toolkit, and | |
| it was missing cudnn.h. | |
| I gave up. | |
| v5v3 wrote 3 days ago: | |
| I thought the prevailing wisdom was that a used 3090, with its larger | |
| VRAM, was the best budget GPU choice? | |
| And in general, if on a budget then why not buy used and not new? And | |
| more so as the author himself talks about the resale value for when he | |
| sells it on. | |
| retinaros wrote 1 day ago: | |
| yes it is | |
| olowe wrote 3 days ago: | |
| > I thought the prevailing wisdom was that a used 3090, with its | |
| larger VRAM, was the best budget GPU choice? | |
| The trick is that memory bandwidth - not just the amount of VRAM - is | |
| important for LLM inference. For example, the B50 specs list a memory | |
| bandwidth of 224 GB/s [1], whereas the Nvidia RTX 3090 has over | |
| 900GB/s [2]. The 4070's bandwidth is "just" 500GB/s [3]. | |
| More VRAM helps run larger models, but with lower bandwidth tokens | |
| could be generated so slowly that it's not really practical for | |
| day-to-day use or experimenting. | |
| [1]: https://www.intel.com/content/www/us/en/products/sku/242615/... | |
| [2]: https://www.techpowerup.com/gpu-specs/geforce-rtx-3090.c3622 | |
| [3]: https://www.thefpsreview.com/gpu-family/nvidia-geforce-rtx-4... | |
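| As a rule of thumb, single-stream generation speed is roughly memory | |
| bandwidth divided by the bytes of weights read per token; a sketch | |
| using the bandwidth figures above and an assumed ~9 GB of weights | |
| (about a 14B model at 4-bit): | |
| model_bytes = 9e9 | |
| for card, bw_gbs in [("Arc Pro B50", 224), ("RTX 4070", 504), | |
|                      ("RTX 3090", 936)]: | |
|     print(f"{card}: ~{bw_gbs * 1e9 / model_bytes:.0f} tokens/s upper bound") | |
| Real-world numbers land below these bounds, but the ordering holds. | |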
| lelanthran wrote 1 day ago: | |
| > The trick is memory bandwidth - not just the amount of VRAM - is | |
| important for LLM inference. | |
| I'm not really knowledgeable about this space, so maybe I'm missing | |
| something: | |
| Why does the bus performance affect token generation? I would | |
| expect it to cause a slow startup when loading the model, but once | |
| the model is loaded, just how much bandwidth can the token | |
| generation possibly use? | |
| Token generation is completely on the card using the memory on the | |
| card, without any bus IO at all, no? | |
| IOW, I'm trying to think of what IO the card is going to need for | |
| token generation, and I can't think of any other than returning the | |
| tokens (which, even on a slow 100MB/s transfer, is still going to be | |
| about 100x the rate at which tokens are being generated). | |
| stevenhuang wrote 1 day ago: | |
| During inference, each token passes through each parameter of the | |
| model as matrix-vector products. And then, as the context grows, each | |
| new token passes through all current context tokens as matrix-vector | |
| products. | |
| This means bandwidth requirements grow as context sizes grow. | |
| For datacenter workloads, batching can be used to use this memory | |
| bandwidth efficiently and make things compute-bound instead. | |
| lelanthran wrote 1 day ago: | |
| [I'm still not understanding] | |
| It seems to me that even if you pass in a long context on every | |
| prompt, that context is still tiny compared to the execution | |
| time on the processor/GPU/tensorcore/etc. | |
| Let's say I load up a model of 12GB on my 12GB VRAM GPU. I pass | |
| in a prompt with 1MB of context which causes a response of | |
| 500kb after 1s. That's still only 1.5MB of IO transferred in | |
| 1s, which kept the GPU busy for 1s. Increasing the prompt is | |
| going to increase the duration to a response accordingly. | |
| Unless the GPU is not fully utilised on each prompt-response | |
| cycle, I feel that the GPU is still the bottleneck here, not | |
| the bus performance. | |
| zargon wrote 21 hours 0 min ago: | |
| > I feel that the GPU is still the bottleneck here, not the | |
| bus performance. | |
| PCIe bus performance is basically irrelevant. | |
| > Token generation is completely on the card using the memory | |
| on the card, without any bus IO at all, no? | |
| Right. But the GPU can't instantaneously access data in VRAM. | |
| It has to be copied from VRAM to GPU registers first. For | |
| every token, the entire contents of VRAM has to be copied to | |
| the GPU to be computed. It's a memory-bound process. | |
| Right now there's about an 8x difference in memory bandwidth | |
| between low-end and high-end consumer cards (e.g., 4060 Ti vs | |
| 5090). Moving up to a B200 more than doubles that performance | |
| again. | |
| imtringued wrote 22 hours 42 min ago: | |
| 1MB of context can maybe hold 10 tokens depending on your | |
| model. | |
| For reference. llama 3.2 8B used to take 4 KiB per token per | |
| layer. At 32 layers that is 128KiB or 8 tokens per MiB of KV | |
| cache (context). If your context holds 8000 tokens including | |
| responses then you need around 1GB. | |
| >Unless the GPU is not fully utilised on each prompt-response | |
| cycle, I feel that the GPU is still the bottleneck here, not | |
| the bus performance. | |
| Matrix vector multiplication implies a single floating point | |
| multiplication and addition (2 flops) per parameter. Your GPU | |
| can do way more flops than that without using tensor cores at | |
| all. In fact, this workload bores your GPU to death. | |
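| The same KV-cache arithmetic spelled out (using the figures quoted | |
| above for a Llama-3-8B-class model: ~4 KiB per token per layer, 32 | |
| layers): | |
| KIB = 1024 | |
| kv_per_token = 4 * KIB * 32       # 128 KiB of KV cache per context token | |
| context_tokens = 8000 | |
| total_bytes = kv_per_token * context_tokens | |
| print(total_bytes / KIB**3, "GiB of KV cache for an 8k context")  # ~0.98 GiB | |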
| jononor wrote 1 day ago: | |
| GPU memory bandwidth is the limiting factor, not PCIe | |
| bandwidth. | |
| The memory bandwidth is critical because the models rely on | |
| getting all the parameters from memory to do computation, and | |
| there is a low amount of computation per parameter, so memory | |
| tends to be the bottleneck. | |
| rcarmo wrote 4 days ago: | |
| The trouble with these things is that "on a budget" doesn't deliver | |
| much when the most interesting and truly useful models are creeping | |
| beyond the 16GB VRAM limit and/or require a lot of wattage. Even a Mac | |
| mini with enough RAM is starting to look like an expensive proposition, | |
| and the AMD Strix Halo APUs (the SKUs that matter, like the Framework | |
| Desktop at 128GB) are around $2K. | |
| As someone who built a period-equivalent rig (with a 12GB 3060 and | |
| 128GB RAM) a few years ago, I am not overly optimistic that local | |
| models will keep being a cheap alternative (never mind the | |
| geopolitics). And yeah, there are very cheap ways to run inference, but | |
| they become pointless - I can run Qwen and Phi4 locally on an ARM chip | |
| like the RK3588, but it is still dog slow. | |
| Jedd wrote 4 days ago: | |
| In January 2024 there was a similar post ( [1] ) wherein the author | |
| selected dual NVidia 4060 Ti's for an at-home-LLM-with-voice-control -- | |
| because they were the cheapest cost per GB of well-supported VRAM at | |
| the time. | |
| (They probably still are, or at least pretty close to it.) | |
| That informed my decision shortly after, when I built something similar | |
| - that video card model was widely panned by gamers (or more | |
| accurately, gamer 'influencers'), but it was an excellent choice if you | |
| wanted 16GB of VRAM with relatively low power draw (150W peak). | |
| TFA doesn't say where they are, or what currency they're using (which | |
| implies the hubris of a North American) - at which point that pricing | |
| for a second hand, smaller-capacity, higher-power-drawing 4070 just | |
| seems weird. | |
| Appreciate the 'on a budget' aspect, it just seems like an objectively | |
| worse path, as upgrades are going to require replacement, rather than | |
| augment. | |
| As per other comments here, 32 / 12 is going to be really limiting. | |
| Yes - lower parameter / smaller-quant models are becoming more capable, | |
| but at the same time we're seeing increasing interest in larger context | |
| for these at home use cases, and that chews up memory real fast. | |
| [1]: https://news.ycombinator.com/item?id=38985152 | |
| 1shooner wrote 1 day ago: | |
| >TFA doesn't say where they are, or what currency they're using | |
| They say California, and I'm seeing the dollar amount in the title | |
| and metadata as $1.3k; was that an edit? | |
| T-A wrote 1 day ago: | |
| > TFA doesn't say where they are | |
| "the 1,440W limit on wall outlets in California" is a pretty good | |
| hint. | |
| zxexz wrote 1 day ago: | |
| Bringing back memories of testing the breakers in my college | |
| apartments to verify exactly which outlets were on which circuit, | |
| so I could pool as much as possible as needed. I distinctly | |
| remember pulling 20kw once, celebrating with a beer; the memory of | |
| all those cables snaking through the old apartment makes me almost | |
| uneasy now. I do remember we didn't have to pay for heat that winter, | |
| which felt like a major win in Massachusetts. Come to think of it, | |
| I'm pretty sure there are still some servers tucked away in a | |
| crawlspace in that basement. | |
| dcassett wrote 1 day ago: | |
| San Francisco specifically: | |
| "I prompted ChatGPT to give me recommendations. Prompt: ... The | |
| final build will be located at my residence in San Francisco, CA, | |
| ..." | |
| throwaway314155 wrote 1 day ago: | |
| > which implies the hubris of a North American | |
| No need for that. | |
| Jedd wrote 23 hours 28 min ago: | |
| Probably true. | |
| But for those of us outside the USA bubble, it's incredibly tiring to | |
| have to intuit geo information (when geo information would add to the | |
| understanding). | |
| As others noted in sibling comments, TFA had in fact mentioned in | |
| passing their location (in their quoted prompt to chatgpt, and at | |
| the very end of the third supporting point for the decision to go | |
| for an Nvidia 4070) 'California, CA'. I confess that I skimmed over | |
| both those paragraphs. | |
| Now, sure, CA is a country code, but I stand corrected that the | |
| author completely hid their location. Had I spotted those clues I'd | |
| not have to have made any assumptions around wall power | |
| capabilities & costs, new & second hand market availability / | |
| costs, etc. | |
| I think I mostly catered for those considerations in the rest of my | |
| original comment though - asserted power sensitivity makes it | |
| surprising that a higher-power-requiring, smaller-RAM-capacity, | |
| more-expensive-than-a-sibling-generation-16GB card was selected. | |
| topato wrote 1 day ago: | |
| He did soften the blow by saying North American, rather than the more | |
| correct and apropos "American". | |
| dfc wrote 1 day ago: | |
| The author also refers to Californian power limits. So it seems | |
| the criticism is misplaced. | |
| topato wrote 1 day ago: | |
| True, though | |
| Uehreka wrote 4 days ago: | |
| Love the attention to detail, I can tell this was a lot of work to put | |
| together and I hope it helps people new to PC building. | |
| I will note though, 12GB of VRAM and 32GB of system RAM is a ceiling | |
| you're going to hit pretty quickly if you're into messing with LLMs. | |
| There's basically no way to do a better job at the budget you're | |
| working with though. | |
| One thing I hear about a lot is people using things like RunPod to | |
| briefly get access to powerful GPUs/servers when they need one. If you | |
| spend $2/hr you can get access to an H100. If you have a budget of | |
| $1300 that could get you about 650 hours of compute time, which | |
| (unless you're doing training runs) should last you several months. | |
| In several months' time the specs required to run good models will be | |
| different again in ways that are hard to predict, so this approach can | |
| help save on the heartbreak of buying an RTX 5090 only to find that | |
| even that doesn't help much with LLM inference and we're all gonna | |
| need the cheaper-but-more-VRAM Intel Arc B60s. | |
| numpad0 wrote 1 day ago: | |
| I don't understand why some people build a "rig", put a lot of thought | |
| into ever so slightly differently binned CPUs, and then don't max out | |
| the RAM (putting aside DDR5 quirk considerations). It's like buying a | |
| sports car only to cheap out on the tires. It makes no sense. | |
| Uehreka wrote 8 hours 4 min ago: | |
| I built my current computer last fall. The Ryzen 7950X was on an | |
| awesome sale for Black Friday, and after looking at the math, buying | |
| a 9950X just didn't make sense. So I got the 7950X and 96GB of DDR5 | |
| RAM (2 sticks, so I can double later if I need to). Loving it, it was | |
| the perfect choice. | |
| All this to say some people do in fact do this ;) | |
| semi-extrinsic wrote 1 day ago: | |
| > save on the heartbreak of buying an RTX 5090 only to find that even | |
| that doesn't help much with LLM inference and we're all gonna need | |
| the cheaper-but-more-VRAM Intel Arc B60s | |
| When going for more VRAM, with an RTX 5090 currently sitting at $3000 | |
| for 32GB, I'm curious why people aren't trying to get the Dell | |
| C4140s. Those seem to go for $3000-$4000 for the whole server with 4x | |
| V100 16GB, so 64GB total VRAM. | |
| Maybe it's just because they produce heat and noise like a small | |
| turbojet. | |
| nickpsecurity wrote 1 day ago: | |
| Don't the parallelizing techniques of a 4x build make using them | |
| more difficult than a 1x build with no extra parallelism? Couldn't | |
| the 32GB 5090 handle more models in their original configurations? | |
| ijk wrote 11 hours 28 min ago: | |
| For LLM inference parallel GPUs is mostly fine (you take some | |
| performance hit but llama.cpp doesn't care what cards you use and | |
| other stuff handles 4 symmetric GPUs just fine). You get more | |
| problems when you're doing anything training related, though. | |
| zargon wrote 22 hours 4 min ago: | |
| > Don't the parallelizing techniques of a 4x build make using | |
| them more difficult than a 1x build with no extra parallelism? | |
| For inference, no. For training, only slightly. | |
| 7speter wrote 4 days ago: | |
| I dunno everyone, but I think Intel has something big on their hands | |
| with their announced workstation GPUs. The B50 is a low-profile card | |
| that doesn't have a power supply hookup because it only uses something | |
| like 60 watts, and comes with 16GB of VRAM at an MSRP of 300 dollars. | |
| I imagine companies will have first dibs via agreements with suppliers | |
| like CDW, etc., but if Intel has enough of these Battlemage dies | |
| accumulated, it could also drastically change the local AI | |
| enthusiast/hobbyist landscape; for starters, this could drive down the | |
| price of workstation cards that are ideal for inference, at the very | |
| least. I'm cautiously excited. | |
| On the AMD front (really, a sort of open compute front), Vulkan Kompute | |
| is picking up steam, and it would be really cool to have a standard | |
| that mostly(?) ships with Linux, with older ports available for | |
| FreeBSD, so that we can actually run free-as-in-freedom inference | |
| locally. | |
| golly_ned wrote 4 days ago: | |
| Whenever I get to a section that was clearly autogenerated by an LLM, I | |
| lose interest in the entire article. Suddenly the entire thing is | |
| suspect and I feel like I'm wasting my time, since I'm no longer | |
| encountering the mind of another person, just interacting with a | |
| system. | |
| throwaway314155 wrote 1 day ago: | |
| Eh, yeah - the article starts off pretty specific but then gets into | |
| the weeds of stuff like how to put your PC together, which is far | |
| from novel information and certainly not on-topic in my opinion. | |
| memcg wrote 15 hours 29 min ago: | |
| I sent the article link to my son because he does not have | |
| experience building or assembling hardware or installing or using | |
| Linux. Also took the author's ChatGPT prompt and changed it to ask | |
| about reusing two HPE ML150 Gen9 servers I picked up free. I think | |
| my son will benefit from the details in the article that many find | |
| off-topic. | |
| bravesoul2 wrote 4 days ago: | |
| I didn't see anything like that here. Yeah they used bullets. | |
| golly_ned wrote 3 days ago: | |
| There's a section that says what the parts of a PC are, and what each | |
| part is. | |
| Nevermark wrote 1 day ago: | |
| > I used the AI-generated recommendations as a starting point, | |
| and refined the options with my own research. | |
| Referring to this section? | |
| I don't see a problem with that. This isn't an article about a | |
| design intended for 10,000 systems. Just one person's follow | |
| through on an interesting project. With disclosure of | |
| methodology. | |
| uniposterz wrote 4 days ago: | |
| I had a similar setup for a local LLM; 32GB was not enough. I | |
| recommend going for 64GB. | |
| vunderba wrote 4 days ago: | |
| The RTX market is particularly irritating right now; even second-hand | |
| 4090s are still going for MSRP, if you can find them at all. | |
| Most of the recommendations for this budget AI system are on point - | |
| the only thing I'd recommend is more RAM. 32GB is not a lot - | |
| particularly if you start to load larger models through formats such as | |
| GGUF and want to take advantage of system ram to split the layers at | |
| the cost of inference speed. I'd recommend at least 2 x 32GB or even 4 | |
| x 32GB if you can swing it budget-wise. | |
| Author mentioned using Claude for recommendations, but another great | |
| resource for building machines is PC Part Picker. They'll even show | |
| warnings if you try pairing incompatible parts or try to use a PSU that | |
| won't supply the minimum recommended power. | |
| [1]: https://pcpartpicker.com | |
| Aeolun wrote 1 day ago: | |
| I thought those 4090s were weird. You pay more for them than the | |
| brand-new 5090. And then there's AMD, which everyone loves to hate, | |
| but has similar GPUs that cost 1/4 of what a similar Nvidia GPU | |
| costs. | |
| <- back to front page |