COMMENT PAGE FOR:
Building an AI server on a budget
eachro wrote 8 hours 48 min ago:
A lot of people are saying 12gb is too small to do anything interesting
with. What's the most useful thing people __have__ gotten to work?
ntlm1686 wrote 10 hours 25 min ago:
Building a PC that can play video games and run some LLMs.
dubrado wrote 16 hours 27 min ago:
If you want to save some money and test things out, check out
Hyperbolic (app.hyperbolic.xyz).
They're based in the US, don't store any data, and you can rent GPUs (self-serve) in less than a minute.
diggan wrote 16 hours 24 min ago:
If you're gonna promote your own product, at least be honest and
brave enough to acknowledge that you built/manage it.
v3ss0n wrote 17 hours 37 min ago:
A 12GB GPU can't do anything useful. The minimum should be 32GB of VRAM, where you can run actual models (Mistral-Small, Qwen3-32B, etc.).
alganet wrote 22 hours 31 min ago:
Let me try to put this in the scale of coffee:
--
Using LLM via api: Starbucks.
Inference at home: Nespresso capsules.
Fine-tune a small model at home: Owning a grinder and an Italian espresso machine.
Pre-training a model: Owning a moderate coffee plantation.
teleforce wrote 23 hours 6 min ago:
>DECISION: Nvidia RTX 4070
I'm curious why OP didn't go for the more recent Nvidia RTX 4060 Ti with 16 GB VRAM, which costs less (~USD 500) brand new and has lower power consumption at 165W. The RTX 5060 Ti 16GB sucks for gaming, but seems like a diamond in the rough for AI: [1]
[1]: https://news.ycombinator.com/item?id=44196991
qingcharles wrote 21 hours 36 min ago:
And if you're gonna be fine with 12GB, why not a 2080ti instead?
Fluorescence wrote 16 hours 51 min ago:
Only 11GB... but I guess it will allow you to not do anything
useful just as well as 12GB will :)
You can however solder on double-capacity memory chips to get 22GB: [1]
I hoped the article would be more along these lines than calling an unremarkable second-hand last-gen gaming PC an "AI Server".
[1]: https://forums.overclockers.com.au/threads/double-your-gpu...
zlies wrote 23 hours 34 min ago:
Did you not use any thermal paste at all, or did you just forget to
mention it in your post?
PeterStuer wrote 1 day ago:
For image generation the article's setup might be viable, but do not expect to run LLMs with satisfactory quality and speed on 12GB of VRAM.
lazylizard wrote 1 day ago:
why not one of these?
[1]: https://www.amazon.sg/NVIDIA-Jetson-Orin-64GB-Developer/dp/B0B...
numpad0 wrote 20 hours 55 min ago:
Jetsons aren't that fast; they're intended for mobile robots. The ones supposed to be just around the corner are the DGX Spark (Project DIGITS) and the DGX Station.
Those DGX machines are still in a "right around the corner" state.
romanovcode wrote 1 day ago:
Isn't the new computer that NVIDIA is about to release much better than this one, and at the same price? Why would anyone buy this one now? Seems like a waste of money.
mythz wrote 1 day ago:
Good value, but a 12GB card isn't going to let you do too much given the low quality of small models. Curious what "home AI" use cases small models are being used for?
It would be nice to see the best value home AI setups under different budgets or VRAM tiers, e.g. the best value configuration for 128 GB of GPU VRAM, etc.
My 48GB GPU VRAM "Home AI Server" cost ~$3100 from all parts on eBay
running 3x A4000's in a Supermicro 128GB RAM, 32/64 core Xeon 1U rack
server. Nothing amazing but wanted the most GPU VRAM before paying the
premium Nvidia tax on their larger GPUs.
This works well for Ollama/llama-server, which can make use of all the GPU VRAM. Unfortunately ComfyUI can't make use of all GPU VRAM to run larger models, so I'm on the lookout for a lot more RAM in my next GPU server.
Really hoping Intel can deliver with its upcoming Arc Pro B60 Dual GPU
for a great value 48GB option which can be run 4x in an affordable
192GB VRAM workstation [1]. If it runs Ollama and ComfyUI efficiently
I'm sold.
[1]: https://www.servethehome.com/maxsun-intel-arc-pro-b60-dual-gpu...
rwyinuse wrote 22 hours 33 min ago:
I use a Proxmox server with an RTX 3060 to generate paintings (I have a couple of old jailbroken Amazon Kindles attached to walls for that purpose), and to run ollama, which is connected to Home Assistant & their voice preview device, allowing me to talk with an LLM without transmitting anything to cloud services.
Admittedly, with that amount of VRAM the models I can run are fairly useless for stuff like controlling lights via Home Assistant; it occasionally does what I tell it to do, but usually not. It is pretty okay for telling me information, like the temperature or the value of some sensors I have connected to HA. For generating AI paintings it's enough. My server also hosts tons of virtual machines and Docker containers and is used for remote gameplay, so the AI thing is just an extra.
garyfirestorm wrote 17 hours 37 min ago:
Why do you say that? You can easily finetune an 8B parameter model for function calling.
msgodel wrote 22 hours 34 min ago:
It's really not going to let you train much, which IMO is the only reason I'd personally bother with a big GPU. Gradients get huge, and everything does them with single/half-precision floating point.
itake wrote 1 day ago:
My home AI machine does image classification.
naavis wrote 1 day ago:
What kind of image classification do you do at home?
itake wrote 22 hours 36 min ago:
My side project accepts and publishes user generated content. To
stay compliant with regulations, I use ML to remove adult
content:
[1]: https://github.com/KevinColemanInc/NSFW-FLASK
mythz wrote 1 day ago:
Using just an Ollama VL Model (gemma3/mistral-small3.1/qwen2.5vl)
or a specific library?
itake wrote 22 hours 35 min ago:
My home server detects NSFW images in user generated content on
my side project.
source code:
[1]: https://github.com/KevinColemanInc/NSFW-FLASK
mythz wrote 21 hours 57 min ago:
Cool, I've tried a few but settled on using EraX NSFW to do the
same.
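As an aside, the Ollama VL-model route mentioned above can be sketched in a few lines; the model tag, prompt, and pass/fail logic below are illustrative assumptions, not how NSFW-FLASK or EraX NSFW actually work:

    # nsfw_check_ollama.py - classify an image with a local vision model via Ollama's HTTP API.
    # Model tag and prompt are illustrative; adapt to whatever VL model you have pulled.
    import base64
    import json
    import urllib.request

    IMAGE_PATH = "upload.jpg"  # hypothetical user-uploaded file

    with open(IMAGE_PATH, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    payload = {
        "model": "qwen2.5vl",  # any vision-capable model available in your local Ollama
        "prompt": "Answer with exactly one word, SAFE or NSFW: does this image contain adult content?",
        "images": [image_b64],
        "stream": False,
    }

    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        answer = json.loads(resp.read())["response"].strip().upper()

    print("blocked" if "NSFW" in answer else "allowed")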
jononor wrote 1 day ago:
Agreed, 12 GB does not seem useful. For coding LLMs, it seems 128 GB is needed to get even close to the frontier models.
For generative image processing (not video), it looks like one can get started with 16 GB.
noufalibrahim wrote 1 day ago:
This is interesting. We recently built a similar machine to implement a
product that we're building on a customer site.
I didn't buy second-hand parts since I wasn't sure of the quality, so it was a little pricey, but we have the entire thing working now, and over the last week we added the LLM server to the mix. Haven't released it yet though.
I wrote about some "fun" we had getting it together here, but it's not as technically detailed as the original article.
[1]: https://blog.hpcinfra.com/when-linkedin-met-reality-our-bangal...
danielhep wrote 1 day ago:
What are the practical uses of a self hosted LLM? Is it actually
possible to approach the likes of Claude or one of the other big ones
on your own hardware for a reasonable budget? I don’t know if this is
something that’s actually worth it or if people are just building
these rigs for fun or niche use cases that don’t require the
intelligence of a hosted LLM.
tmountain wrote 16 hours 1 min ago:
Personal opinion: it's for fun, with some internal narrative of justification. It doesn't seem like it would be cost effective or provide better results, as all the major LLM vendors benefit tremendously from economies of scale, and the monthly fees for these services are extremely reasonable for what you are getting. Going further, the cloud-based LLMs receive upgrades constantly, while static hardware will likely lock you out of future models at some time horizon.
numpad0 wrote 1 day ago:
Couple best vram for buck && borderline space heater GPUs off top of my
head: Tesla K80(12GBx2), M40(24GB), Radeon Instinct
MI(25|50|60|100)(8-32GB?), Radeon Pro V340(16GBx2), bunch of other
Radeon Vega 8GB cards e.g. Vega 56, NVIDIA P102/P104(~16GB), Intel
A770(16GB). Note: some of these are truly just space heaters.
I'm not sure if right now is the best time to build an LLM rig, as the Intel Arc B60 (24GBx2) is about to go on sale. Or maybe it is, to hastily secure multiples of 16GB cards being offloaded before its launch?
AJRF wrote 1 day ago:
Why a 4070 over a 3090? A 4070 has half the VRAM. In the UK you can get
a 3090 for like 600GBP.
Havoc wrote 1 day ago:
> You pay a lot upfront for the hardware, but if your usage of the GPU
is heavy, then you save a lot of money in the long run.
Last I saw data on this, it wasn't true. In a like-for-like comparison (same model and quant), the API is cheaper than the electricity, so you never make back the hardware cost. That was a year ago, and API costs have plummeted, so I'd imagine it's even worse now.
Datacenters have cheaper electricity, can do batch inference at scale, and have more efficient cards. And that's before we consider the huge free allowances by Google etc.
Own AI gear is cool… but not due to economics.
edg5000 wrote 1 day ago:
Is this also the case for token-heavy uses such as Claude Code? I'm not sure if I will end up using CC for development in the future, but if I end up leaning on it, I wonder if there would be a desire to essentially have it run 24/7. When run 24/7, would CC incur more API fees than residential electricity would cost when running on your own gear? I have no idea about the numbers. Just wondering.
Havoc wrote 16 hours 54 min ago:
I doubt you're going to beat a datacenter under any conditions with any model that is vaguely like for like.
The comparison I saw was a small Llama 8B model, i.e. something you can actually get usable numbers on both at home and via API. So something pretty commoditized.
> When run 24/7, would CC incur more API fees than residential electricity would cost when running on your own gear?
Claude is pretty damn expensive, so it's plausible that you can undercut it with another model. That throws the like-for-like assumption out the door though. A valid play practically, but it kinda undermines the "buy your own rig to save money" argument.
whalesalad wrote 1 day ago:
I would rather spend $1,300 on openai/anthropic credits. The
performance from that 4070 cannot be worth the squeeze.
T-A wrote 1 day ago:
I would consider adding $400 for something like this instead:
[1]: https://www.bosgamepc.com/products/bosgame-m5-ai-mini-desktop-...
atentaten wrote 1 day ago:
Do you use this? If so, what's your use case and performance?
T-A wrote 19 hours 41 min ago:
No, they start shipping in July. The main advertised use case is
self-hosting LLMs.
usercvapp wrote 1 day ago:
I have a server at home that has been sitting idle for the last 2 years with 2 TB of RAM and 4 CPUs.
I am gonna push it this week and launch some LLM models to see how they perform!
How efficient are they to run locally in terms of the electric bill?
AzN1337c0d3r wrote 19 min ago:
Depends on the server. Probably not going to be cost effective. I get barely ~0.5 tokens/sec.
I have dual E5-2699A v4 CPUs w/ 1.5 TB DDR4-2933 spread across 2 sockets.
The full Deepseek-R1 671B (~1.4 TB) with llama.cpp seems to have an issue in that local engines that run the LLMs don't do NUMA-aware allocation, so cores will often have to pull the weights in from another socket's memory controllers through the inter-socket links (QPI/UPI/HyperTransport) and bottleneck there.
For my platform that's 2x QPI links @ ~39.2 GB/s/link that get saturated.
I give it a prompt, go to work, and check back on it at lunch, and sometimes it's still going.
If you want to achieve interactivity I'd aim for 7-10 tokens/s, so realistically that means you'll run one of the 8B models on a GPU (~30 tokens/s) or maybe a 70B model on an M4 Max (~8 tokens/s).
pshirshov wrote 1 day ago:
A 3090 for ~1000 is a much more solid choice. Also these old mining mobos play very well for multi-GPU ollama.
msp26 wrote 1 day ago:
> 12GB vram
waste of effort, why would you go through the trouble of building +
blogging for this?
timnetworks wrote 5 hours 58 min ago:
Can easily be replaced with a 24GB one, drop-in upgrayyed like ram.
brought to you by carl's jr.
jacekm wrote 1 day ago:
For $100 more you could get a used 3090 with twice as much VRAM. You could also get a 4060 Ti, which is cheaper than the 4070 and has 16 GB VRAM (although it's less powerful too, so I guess it depends on the use case).
iJohnDoe wrote 1 day ago:
Details about the ML software or AI software?
JKCalhoun wrote 1 day ago:
Someone posted that they had used a "mining rig" [1] from AliExpress for less than $100. It even has RAM and a CPU. He picked up a 2000W (!) DELL server PSU for cheap off eBay. The GPUs were NVIDIA TESLAs (an M40, for example) since they often have a lot of RAM and are less expensive.
I followed in those footsteps to create my own [2] (photo [3]).
I picked up a 24GB M40 for around $300 off eBay. I 3D printed a "cowl" for the GPU that I found online and picked up two small fans from Amazon that fit in the cowl. Attached, the cowl + fans keep the GPU cool. (These TESLA server GPUs have no fan since they're expected to live in one of those wind tunnels called a server rack.)
I bought the same cheap DELL server PS that the original person had
used and I also had to get a break-out board (and power-supply cables
and adapters) for the GPU.
Thanks to LLMs, I was able to successfully install Rocky Linux as well
as CUDA and NVIDIA drivers. I SSH into it and run ollama commands.
My own hurdle at this point is: I have a 2nd 24 GB M40 TESLA but when
installed on the motherboard, Linux will not boot. LLMs are helping me
try to set up BIOS correctly or otherwise determine what the issue is.
(We'll see.) I would love to get to 48 GB.
[1]: https://www.aliexpress.us/item/3256806580127486.html
[2]: https://bsky.app/profile/engineersneedart.com/post/3lmg4kiz4fk...
[3]: https://cdn.bsky.app/img/feed_fullsize/plain/did:plc:oxjqlammq...
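As an aside for setups like this one: ollama also exposes an HTTP API on port 11434, so instead of SSHing in you can query the box from another machine on the LAN. A minimal sketch, assuming the server was started with OLLAMA_HOST=0.0.0.0 so it listens beyond localhost, and that the hostname and model tag below are placeholders:

    # remote_ollama.py - query an ollama instance on another machine over the LAN.
    # "ai-server.local" and the model name are placeholders for your own setup.
    import json
    import urllib.request

    payload = {
        "model": "llama3.1:8b",
        "prompt": "Summarize why Tesla M40 cards need an external cooling cowl.",
        "stream": False,
    }

    req = urllib.request.Request(
        "http://ai-server.local:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])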
reginald78 wrote 14 hours 29 min ago:
My first guess would be to change the Above 4G Decoding setting, but depending on how old the motherboard is, it may not have that setting.
jedbrooke wrote 1 day ago:
I had an old Tesla M40 12 GB lying around and figured I'd try it out with some 8-13B LLMs, but was disappointed to find that it's around the same speed as my Mac mini M2. I suppose the Mac mini is a 10-years-newer chip, but it's crazy that mobile today matches data center from 10 years ago.
rjsw wrote 1 day ago:
There was an article on Tom's Hardware recently where someone was
using a CPU cooler with a GPU [1]
[1]: https://www.tomshardware.com/pc-components/gpus/crazed-modde...
ww520 wrote 1 day ago:
I use a 10-year-old laptop to run a local LLM. The time between prompts is 10-30 seconds. Not for speedy interactive usage.
atentaten wrote 1 day ago:
Enjoyed the article as I am interested in the same. I would like to
have seen more about the specific use cases and how they performed on
the rig.
djhworld wrote 1 day ago:
With system builds like this I always feel the VRAM is the limiting factor when it comes to what models you can run, and consumer-grade stuff tends to max out at 16GB or (sometimes) 24GB for more expensive models.
It does make me wonder whether we'll start to see more and more computers with a unified memory architecture (like the Mac) - I know Nvidia have the Digits thing, which has been renamed to something else.
m0th87 wrote 16 hours 22 min ago:
That's what I hope for, but everything that isn't bananas expensive with unified memory has very low memory bandwidth. DGX (Digits), Framework Desktop, and non-Ultra Macs are all around 128 GB/s, and will produce single-digit tokens per second for larger models: [1] So there's a fundamental tradeoff between cost, inference speed, and hostable model size for the foreseeable future.
[1]: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inferen...
JKCalhoun wrote 1 day ago:
Go server GPU (TESLA) and 24 GB is not unusual. (And also about $300
used on eBay.)
v3ss0n wrote 17 hours 5 min ago:
But compute speed is very low.
incomingpain wrote 3 days ago:
I've been dreaming on pcpartpicker.
I think the Radeon RX 7900 XT with 20 GB has been the best bang for your buck. Enables full-GPU 32B?
Looking at what other people have been doing lately, they aren't doing this.
They are getting 64+ core CPUs and 512GB of RAM, keeping it on CPU and enabling massive models. This setup lets you run Deepseek 671B.
It makes me wonder, how much better is 671B vs 32B?
zargon wrote 21 hours 26 min ago:
> It makes me wonder, how much better is 671B vs 32B?
32B has improved leaps and bounds in the past year. But Deepseek 671B
is still a night and day comparison. 671B just knows so much more
stuff.
The main issue with RAM-only builds is that prompt ingestion is
incredibly slow. If you're going to be feeding in any context at all,
it's horrendous. Most people quote their tokens/s with basically
non-existent context (a few hundred tokens). Figure out if you're
going to be using context, and how much patience you have. Research
the speed you'll be getting for prompt processing / token generation
at your desired context length in each instance, and make your
decision based on that.
Aeolun wrote 1 day ago:
I bought an RX 7900 XTX with 24GB, and it’s everything I expected
of it. It’s absolutely massive though. I thought I could add one
extra for more memory, but that’s a pipe dream in my little desktop
box.
Cheap too, compared to a lot of what I’m seeing.
DogRunner wrote 3 days ago:
I used a similar budget and built something like this:
7x RTX 3060 12 GB, which results in 84GB of VRAM
AMD Ryzen 5 5500GT with 32GB RAM
All in a 19-inch rack with a nice cooling solution and a beefy power
supply.
My costs? 1300 Euro, but yeah, I sourced my parts on ebay / second
hand.
(Added some 3d printed parts into the mix: [1] [2] [3] if you think
about building something similar)
My power consumption is below 500 watts at the wall when using LLMs, since I did some optimizations:
* Worked on power optimizations; after many weeks of benchmarking, the sweet spot on the RTX 3060 12GB cards is a 105 watt limit
* Created patches for Ollama ( [4] ) to group models onto exactly the memory allocation they need instead of spreading them over all available GPUs (this also reduces the VRAM overhead)
* ensured that ASPM is used on all relevant PCI components (Powertop is
your friend)
It's not all shiny:
* I still use PCIe 3 x1 for most of the cards, which limits their capability, but everything I've found so far (PCIe Gen4 x4 extenders and bifurcation/special PCIe routers) is just too expensive to be used on such low-powered cards
* Due to the slow PCIe bandwidth, the performance drops significantly
* Max VRAM per GPU is king. If you split up a model over several cards, the RAM allocation overhead is huge! (See the examples in my ollama patch above.) I would rather use 3x 48GB instead of 7x 12GB.
* Some RTX 3060 12GB cards idle at 11-15 watts, which is unacceptable. Good BIOSes like the one from Gigabyte (Windforce xxx) idle at 3 watts, which is a huge difference when you use 7 or more cards. These BIOSes can be patched, but that can be risky.
All in all, this server idles at 90-100 watts currently, which is perfect as a central service for my tinkering and my family's usage.
[1]: https://www.printables.com/model/1142963-inter-tech-and-generi...
[2]: https://www.printables.com/model/1142973-120mm-5mm-rised-noctu...
[3]: https://www.printables.com/model/1142962-cable-management-fur-...
[4]: https://github.com/ollama/ollama/pull/10678
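A minimal sketch of the 105 W power cap described above, assuming nvidia-smi is on the PATH, the script runs with root privileges, and there are seven cards indexed 0-6; the exact wattage is whatever your own benchmarking suggests:

    # power_limit.py - cap each GPU's power draw with nvidia-smi (run as root).
    # The 105 W figure and 7-GPU count come from the comment above; adjust for your cards.
    import subprocess

    POWER_LIMIT_WATTS = 105
    NUM_GPUS = 7

    # Persistence mode keeps the driver loaded so the limit is not dropped between jobs.
    subprocess.run(["nvidia-smi", "-pm", "1"], check=True)

    for gpu_index in range(NUM_GPUS):
        subprocess.run(
            ["nvidia-smi", "-i", str(gpu_index), "-pl", str(POWER_LIMIT_WATTS)],
            check=True,
        )
        print(f"GPU {gpu_index}: power limit set to {POWER_LIMIT_WATTS} W")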
reginald78 wrote 14 hours 17 min ago:
Great info in this post, with some uncommon questions answered. I have a 3060 with unimpressive idle power consumption; interesting that it varies so much.
I know it would increase the idle power consumption, but have you considered a server platform instead of Ryzen to get more lanes? Even so, you could probably get at least x4 for 4 cards without getting too crazy: 2 M.2-to-PCIe adapters, the main GPU slot, and the fairly common x4-wired secondary slot.
Splitting the main x16 GPU slot is possible, but whenever I looked into this I kind of found the same thing you did. In addition to being a cabling/mounting nightmare, the necessary hardware started to eat up enough total system cost that just ponying up for a 3090 started to make more sense.
jononor wrote 1 day ago:
Impressive! What kind of motherboard do you use to host 7 GPUs?
burnt-resistor wrote 3 days ago:
Reminds me of [1]. I'll be that guy™ who says that if you're going to do any computing half-way reliably, only use ECC RAM. Silent bit flips suck.
[1]: https://cr.yp.to/hardware/build-20090123.html
politelemon wrote 3 days ago:
If the author is reading this, I'll point out that the CUDA toolkit you find in the repositories is generally older. You can find the latest versions straight from Nvidia: [1] The caveat is that sometimes a library might be expecting an older version of CUDA.
The VRAM on the GPU does make a difference, so at some point it would be worth looking at another GPU or increasing your system RAM if you start running into limits.
However, I wouldn't worry too much right away; it's more important to get started, get an understanding of how these local LLMs operate, and take advantage of the optimisations that the community is making to make it all more accessible. Not everyone has a 5090, and if LLMs remain in the realm of high-end hardware, it's not worth the time.
[1]: https://developer.nvidia.com/cuda-downloads?target_os=Linux&ta...
throwaway314155 wrote 1 day ago:
The other main caveat is that installing from custom sources using
apt is a massive pain in the ass.
koakuma-chan wrote 1 day ago:
I tried running an LLM locally today, installed the CUDA toolkit, and it was missing cudnn.h.
I gave up.
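Before chasing missing headers, it can help to confirm that the driver and a CUDA-enabled framework agree. A minimal sketch assuming PyTorch with CUDA support is installed; note that the PyTorch wheels bundle their own CUDA and cuDNN, so the system toolkit headers only matter when compiling extensions:

    # cuda_check.py - confirm the GPU, CUDA runtime, and cuDNN are visible to PyTorch.
    import torch

    print("CUDA available:", torch.cuda.is_available())
    print("CUDA runtime bundled with PyTorch:", torch.version.cuda)
    print("cuDNN version:", torch.backends.cudnn.version())

    if torch.cuda.is_available():
        device = torch.device("cuda:0")
        print("Device:", torch.cuda.get_device_name(device))
        # Tiny matmul to confirm kernels actually launch on the card.
        x = torch.randn(1024, 1024, device=device)
        y = x @ x
        torch.cuda.synchronize()
        print("Matmul OK, result norm:", y.norm().item())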
v5v3 wrote 3 days ago:
I thought prevailing wisdom was that a used 3090 with its larger VRAM was the best budget GPU choice?
And in general, if on a budget, then why not buy used instead of new? All the more so as the author himself talks about the resale value for when he sells it on.
retinaros wrote 1 day ago:
yes it is
olowe wrote 3 days ago:
> I thought prevailing wisdom was that a used 3090 with its larger VRAM was the best budget GPU choice?
The trick is that memory bandwidth - not just the amount of VRAM - is important for LLM inference. For example, the B50 specs list a memory bandwidth of 224 GB/s [1], whereas the Nvidia RTX 3090 has over 900 GB/s [2]. The 4070's bandwidth is "just" 500 GB/s [3].
More VRAM helps run larger models, but with lower bandwidth tokens could be generated so slowly that it's not really practical for day-to-day use or experimenting.
[1]: https://www.intel.com/content/www/us/en/products/sku/242615/...
[2]: https://www.techpowerup.com/gpu-specs/geforce-rtx-3090.c3622
[3]: https://www.thefpsreview.com/gpu-family/nvidia-geforce-rtx-4...
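To make the bandwidth point concrete, a rough back-of-envelope sketch: a dense model streams essentially all of its weights from VRAM for every generated token, so an upper bound on tokens/s is roughly bandwidth divided by the size of the weights. The model size and quantization below are illustrative assumptions, not benchmarks:

    # tokens_per_second.py - crude ceiling: memory bandwidth / bytes of weights read per token.
    # Real throughput is lower (KV cache reads, kernel overhead), but the ordering holds.

    def max_tokens_per_s(bandwidth_gb_s: float, params_billions: float, bytes_per_param: float) -> float:
        weight_bytes = params_billions * 1e9 * bytes_per_param
        return bandwidth_gb_s * 1e9 / weight_bytes

    # Example: a 7B model quantized to roughly 4.5 bits/weight (~0.56 bytes per parameter).
    MODEL_B = 7
    BYTES_PER_PARAM = 0.56

    for name, bw in [("Arc Pro B50 (224 GB/s)", 224),
                     ("RTX 4070 (~500 GB/s)", 500),
                     ("RTX 3090 (~936 GB/s)", 936)]:
        ceiling = max_tokens_per_s(bw, MODEL_B, BYTES_PER_PARAM)
        print(f"{name}: <= {ceiling:.0f} tokens/s theoretical ceiling")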
lelanthran wrote 1 day ago:
> The trick is memory bandwidth - not just the amount of VRAM - is
important for LLM inference.
I'm not really knowledgeable about this space, so maybe I'm missing
something:
Why does the bus performance affect token generation? I would
expect it to cause a slow startup when loading the model, but once
the model is loaded, just how much bandwidth can the token
generation possibly use?
Token generation is completely on the card using the memory on the
card, without any bus IO at all, no?
IOW, I'm trying to think of what IO the card is going to need for token generation, and I can't think of any other than returning the tokens (which, even on a slow 100MB/s transfer, is still going to be about 100x the rate at which tokens are being generated).
stevenhuang wrote 1 day ago:
During inference, each token passes through each parameter of the model via matrix-vector products. And then, as the context grows, each new token also passes through all current context tokens via matrix-vector products.
This means bandwidth requirements grow as context sizes grow.
For datacenter workloads, batching can be used to efficiently use this memory bandwidth and make things compute-bound instead.
lelanthran wrote 1 day ago:
[I'm still not understanding]
It seems to me that even if you pass in a long context on every
prompt, that context is still tiny compared to the execution
time on the processor/GPU/tensorcore/etc.
Let's say I load up a 12GB model on my 12GB VRAM GPU. I pass in a prompt with 1MB of context, which causes a response of 500kb after 1s. That's still only 1.5MB of IO transferred in 1s, which kept the GPU busy for 1s. Increasing the prompt is going to increase the duration to a response accordingly.
Unless the GPU is not fully utilised on each prompt-response cycle, I feel that the GPU is still the bottleneck here, not the bus performance.
zargon wrote 21 hours 0 min ago:
> I feel that the GPU is still the bottleneck here, not the
bus performance.
PCIe bus performance is basically irrelevant.
> Token generation is completely on the card using the memory
on the card, without any bus IO at all, no?
Right. But the GPU can't instantaneously access data in VRAM.
It has to be copied from VRAM to GPU registers first. For
every token, the entire contents of VRAM has to be copied to
the GPU to be computed. It's a memory-bound process.
Right now there's about an 8x difference in memory bandwidth
between low-end and high-end consumer cards (e.g., 4060 Ti vs
5090). Moving up to a B200 more than doubles that performance
again.
imtringued wrote 22 hours 42 min ago:
1MB of context can maybe hold 10 tokens depending on your model.
For reference, Llama 3.2 8B used to take 4 KiB per token per layer. At 32 layers that is 128 KiB, or 8 tokens per MiB of KV cache (context). If your context holds 8000 tokens including responses, then you need around 1GB.
>Unless the GPU is not fully utilised on each prompt-response
cycle, I feel that the GPU is still the bottleneck here, not
the bus performance.
Matrix vector multiplication implies a single floating point
multiplication and addition (2 flops) per parameter. Your GPU
can do way more flops than that without using tensor cores at
all. In fact, this workload bores your GPU to death.
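A minimal sketch of the KV-cache arithmetic above, taking the 4 KiB/token/layer and 32-layer figures from the parent comment as given; treat them as illustrative, since the exact numbers depend on the model and on how the cache is quantized:

    # kv_cache_size.py - estimate KV cache memory from a per-token, per-layer figure.
    KIB = 1024

    def kv_cache_bytes(n_tokens: int, n_layers: int = 32, bytes_per_token_per_layer: int = 4 * KIB) -> int:
        return n_tokens * n_layers * bytes_per_token_per_layer

    for context in (2048, 8192, 32768):
        gib = kv_cache_bytes(context) / (1024 ** 3)
        print(f"{context:>6} tokens of context -> ~{gib:.2f} GiB of KV cache on top of the weights")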
jononor wrote 1 day ago:
GPU memory bandwidth is the limiting factor, not PCIe
bandwidth.
The memory bandwidth is critical because the models rely on
getting all the parameters from memory to do computation, and
there is a low amount of computation per parameter, so memory
tends to be the bottleneck.
rcarmo wrote 4 days ago:
The trouble with these things is that "on a budget" doesn't deliver much when most interesting and truly useful models are creeping beyond the 16GB VRAM limit and/or require a lot of wattage. Even a Mac mini with enough RAM is starting to look like an expensive proposition, and the AMD Strix Halo APUs (the SKUs that matter, like the Framework Desktop at 128GB) are around $2K.
As someone who built a period-equivalent rig (with a 12GB 3060 and 128GB RAM) a few years ago, I am not overly optimistic that local models will keep being a cheap alternative (never mind the geopolitics). And yeah, there are very cheap ways to run inference, but they become pointless - I can run Qwen and Phi-4 locally on an ARM chip like the RK3588, but it is still dog slow.
Jedd wrote 4 days ago:
In January 2024 there was a similar post ( [1] ) wherein the author
selected dual NVidia 4060 Ti's for an at-home-LLM-with-voice-control --
because they were the cheapest cost per GB of well-supported VRAM at
the time.
(They probably still are, or at least pretty close to it.)
That informed my decision shortly after, when I built something similar
- that video card model was widely panned by gamers (or more
accurately, gamer 'influencers'), but it was an excellent choice if you
wanted 16GB of VRAM with relatively low power draw (150W peak).
TFA doesn't say where they are, or what currency they're using (which
implies the hubris of a North American) - at which point that pricing
for a second hand, smaller-capacity, higher-power-drawing 4070 just
seems weird.
Appreciate the 'on a budget' aspect, it just seems like an objectively
worse path, as upgrades are going to require replacement, rather than
augment.
As per other comments here, 32 / 12 is going to be really limiting.
Yes - lower parameter / smaller-quant models are becoming more capable,
but at the same time we're seeing increasing interest in larger context
for these at home use cases, and that chews up memory real fast.
[1]: https://news.ycombinator.com/item?id=38985152
1shooner wrote 1 day ago:
>TFA doesn't say where they are, or what currency they're using
They say California, and I'm seeing the dollar amount in the title and metadata as $1.3k; was that an edit?
T-A wrote 1 day ago:
> TFA doesn't say where they are
"the 1,440W limit on wall outlets in California" is a pretty good
hint.
zxexz wrote 1 day ago:
Bringing back memories of testing the breakers in my college
apartments to verify exactly which outlets were on which circuit,
so I could pool as much as possible as needed. I distinctly
remember pulling 20kw once, celebrating with a beer; the memory of
all those cables snaking through the old apartment makes me almost
uneasy now. I do remember we didn’t have to pay for heat that
winter; which felt like a major win in Massachusetts. Come to think
of it, I’m pretty sure there are still some servers tucked away
in a crawlspace in that basement.
dcassett wrote 1 day ago:
San Francisco specifically:
"I prompted ChatGPT to give me recommendations. Prompt: ... The
final build will be located at my residence in San Francisco, CA,
..."
throwaway314155 wrote 1 day ago:
> which implies the hubris of a North American
No need for that.
Jedd wrote 23 hours 28 min ago:
Probably true.
But for those of us outside the USA bubble, it's incredibly tiring to have to intuit geo information (when geo information would add to the understanding).
As others noted in sibling comments, TFA had in fact mentioned in
passing their location (in their quoted prompt to chatgpt, and at
the very end of the third supporting point for the decision to go
for an Nvidia 4070) 'California, CA'. I confess that I skimmed over
both those paragraphs.
Now, sure, CA is a country code, but I stand corrected that the
author completely hid their location. Had I spotted those clues I'd
not have to have made any assumptions around wall power
capabilities & costs, new & second hand market availability /
costs, etc.
I think I mostly catered for those considerations in the rest of my
original comment though - asserted power sensitivity makes it
surprising that a higher-power-requiring, smaller-RAM-capacity,
more-expensive-than-a-sibling-generation-16GB card was selected.
topato wrote 1 day ago:
He did soften the blow by saying North American, rather than the more apropos "American".
dfc wrote 1 day ago:
The author also refers to Californian power limits. So it seems
the criticism is misplaced.
topato wrote 1 day ago:
True, though
Uehreka wrote 4 days ago:
Love the attention to detail, I can tell this was a lot of work to put
together and I hope it helps people new to PC building.
I will note though, 12GB of VRAM and 32GB of system RAM is a ceiling
you’re going to hit pretty quickly if you’re into messing with
LLMs. There’s basically no way to do a better job at the budget
you’re working with though.
One thing I hear about a lot is people using things like RunPod to
briefly get access to powerful GPUs/servers when they need one. If you
spend $2/hr you can get access to an H100. If you have a budget of
$1300 that could get you about 600 hours of compute time, which (unless
you’re doing training runs) should last you several months.
In several months time the specs required to run good models will be
different again in ways that are hard to predict, so this approach can
help save on the heartbreak of buying an RTX 5090 only to find that
even that doesn’t help much with LLM inference and we’re all gonna
need the cheaper-but-more-VRAM Intel Arc B60s.
numpad0 wrote 1 day ago:
I don't understand why some people build a "rig", put a lot of
thoughts into ever so slightly differently binned CPUs, and then
don't max out RAM(put aside DDR5 quirk considerations). It's like
buying a sports car only to cheap out on tires. It makes no sense.
Uehreka wrote 8 hours 4 min ago:
I built my current computer last fall. The Ryzen 7950X was on an
awesome sale for black Friday and after looking at the math buying
a 9950X just didn’t make sense. So I got the 7950X and 96GB of
DDR5 RAM (2 sticks, so I can double later if I need to). Loving it,
it was the perfect choice.
All this to say some people do in fact do this ;)
semi-extrinsic wrote 1 day ago:
> save on the heartbreak of buying an RTX 5090 only to find that even
that doesn’t help much with LLM inference and we’re all gonna
need the cheaper-but-more-VRAM Intel Arc B60s
When going for more VRAM, with an RTX 5090 currently sitting at $3000
for 32GB, I'm curious why people aren't trying to get the Dell
C4140s. Those seem to go for $3000-$4000 for the whole server with 4x
V100 16GB, so 64GB total VRAM.
Maybe it's just because they produce heat and noise like a small
turbojet.
nickpsecurity wrote 1 day ago:
Don't the parallelizing techniques of a 4x build make using them
more difficult than a 1x build with no extra parallelism? Couldn't
the 32GB 4090 handle more models in their original configurations?
ijk wrote 11 hours 28 min ago:
For LLM inference, parallel GPUs are mostly fine (you take some performance hit, but llama.cpp doesn't care what cards you use, and other stuff handles 4 symmetric GPUs just fine). You get more problems when you're doing anything training related, though.
zargon wrote 22 hours 4 min ago:
> Don't the parallelizing techniques of a 4x build make using
them more difficult than a 1x build with no extra parallelism?
For inference, no. For training, only slightly.
7speter wrote 4 days ago:
I dunno, everyone, but I think Intel has something big on their hands with their announced workstation GPUs. The B50 is a low-profile card that doesn't have a power supply hookup because it only uses something like 60 watts, and it comes with 16GB of VRAM at an MSRP of 300 dollars.
I imagine companies will have first dibs via agreements with suppliers like CDW, etc., but if Intel has enough of these Battlemage dies accumulated, it could also drastically change the local AI enthusiast/hobbyist landscape; for starters, this could drive down the price of workstation cards that are ideal for inference, at the very least. I'm cautiously excited.
On the AMD front (really, a sort of open compute front), Vulkan Kompute is picking up steam, and it would be really cool to have a standard that mostly(?) ships with Linux, with older ports available for FreeBSD, so that we can actually run free-as-in-freedom inference locally.
golly_ned wrote 4 days ago:
Whenever I get to a section that was clearly autogenerated by an LLM I lose interest in the entire article. Suddenly the entire thing is suspect and I feel like I'm wasting my time, since I'm no longer encountering the mind of another person, just interacting with a system.
throwaway314155 wrote 1 day ago:
Eh, yeah - the article starts off pretty specific but then gets into
the weeds of stuff like how to put your PC together, which is far
from novel information and certainly not on-topic in my opinion.
memcg wrote 15 hours 29 min ago:
I sent the article link to my son because he does not have
experience building or assembling hardware or installing or using
Linux. Also took the author's ChatGPT prompt and changed it to ask
about reusing two HPE ML150 Gen9 servers I picked up free. I think
my son will benefit from the details in the article that many find
off-topic.
bravesoul2 wrote 4 days ago:
I didn't see anything like that here. Yeah they used bullets.
golly_ned wrote 3 days ago:
There's a section that says what the parts of a PC are, and what each part is.
Nevermark wrote 1 day ago:
> I used the AI-generated recommendations as a starting point,
and refined the options with my own research.
Referring to this section?
I don't see a problem with that. This isn't an article about a
design intended for 10,000 systems. Just one person's follow
through on an interesting project. With disclosure of
methodology.
uniposterz wrote 4 days ago:
I had a similar setup for a local LLM, 32GB was not enough. I recommend
going for 64GB.
vunderba wrote 4 days ago:
The RTX market is particularly irritating right now; even second-hand 4090s are still going for MSRP, if you can find them at all.
Most of the recommendations for this budget AI system are on point - the only thing I'd recommend is more RAM. 32GB is not a lot, particularly if you start to load larger models through formats such as GGUF and want to take advantage of system RAM to split the layers at the cost of inference speed. I'd recommend at least 2 x 32GB or even 4 x 32GB if you can swing it budget-wise.
Author mentioned using Claude for recommendations, but another great
resource for building machines is PC Part Picker. They'll even show
warnings if you try pairing incompatible parts or try to use a PSU that
won't supply the minimum recommended power.
[1]: https://pcpartpicker.com
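To illustrate the GGUF layer-splitting mentioned above, a minimal sketch using the llama-cpp-python bindings; the model file and layer count are placeholders to adjust until the GPU portion fits in VRAM:

    # gguf_offload.py - split a GGUF model between VRAM and system RAM with llama-cpp-python.
    # Model path and n_gpu_layers are placeholders; raise n_gpu_layers until VRAM runs out.
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/qwen2.5-14b-instruct-q4_k_m.gguf",  # hypothetical local file
        n_gpu_layers=24,  # layers kept in VRAM; the remainder run from system RAM (slower)
        n_ctx=8192,       # context window; the KV cache grows with this
    )

    out = llm("Q: Why does offloading fewer layers reduce VRAM use?\nA:", max_tokens=128)
    print(out["choices"][0]["text"])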
Aeolun wrote 1 day ago:
I thought those 4090's were weird. You pay more for them than for a brand new 5090. And then there's AMD, which everyone loves to hate, but which has similar GPUs that cost 1/4 of what a similar Nvidia GPU costs.