COMMENT PAGE FOR:
Nvidia Nemotron 3 Family of Models
thoughtpeddler wrote 10 hours 12 min ago:
Is it fair to view this release as Nvidia strategically flexing that
they can compete with their own customers in the model layer -- that
they can be as vertically integrated as, say, GDM?
omneity wrote 13 hours 24 min ago:
Nemotron now works on LM Studio if you update the runtime (from the
settings > Runtime screen).
The default chat template is incorrect and will fail, though. I
published a corrected one that you can replace it with:
[1]: https://gist.github.com/omarkamali/a594b6cb07347f501babed48989...
Tepix wrote 20 hours 4 min ago:
Is it just me, or is Nvidia trolling hard by calling a model with 30B
parameters "nano"? With a bit of context, it doesn't even fit on an RTX
5090.
Other LLMs with the "nano" moniker are around 1B parameters or less.
patpatpat wrote 5 hours 17 min ago:
FWIW, it runs just fine on my AMD 9060 XT (16 GB) without any tweaks.
It's very usable.
I asked it to write a prime sieve in C#; it started responding in 0.38
seconds and wrote an implementation at 20 tokens/sec.
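For reference, a minimal sketch of the kind of sieve being asked for
(in Python rather than C#, and not the model's actual output):

    def sieve(limit: int) -> list[int]:
        """Sieve of Eratosthenes: return all primes <= limit."""
        is_prime = [True] * (limit + 1)
        is_prime[0:2] = [False, False]
        for p in range(2, int(limit ** 0.5) + 1):
            if is_prime[p]:
                # Multiples below p*p were already marked by smaller primes.
                is_prime[p * p :: p] = [False] * len(is_prime[p * p :: p])
        return [n for n, prime in enumerate(is_prime) if prime]

    print(sieve(50))  # [2, 3, 5, ..., 43, 47]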
genpfault wrote 1 hour 3 min ago:
Getting ~150 tok/s on an empty context with a 24 GB 7900 XTX via
llama.cpp's Vulkan backend.
jonrosner wrote 20 hours 53 min ago:
After testing it for a little while, I am pretty disappointed. While I
do get 90 tokens per second out of it on my M4 Pro, which is more than
enough for a real-world use case, the quality is just not there. I gave
it a codebase to analyze and some questions to answer, and it started
hallucinating right away. No replacement for a "real" coding agent;
maybe for other agentic work like sorting emails, though.
dJLcnYfsE3 wrote 20 hours 53 min ago:
I would say it is weird that Nvidia competes with its own customers,
but looking back at the "Founders Edition" cards, maybe it isn't that
weird at all. The better question is probably: with every big
corporation having its own LLM, what exactly is OpenAI's moat that
would justify their valuation?
lukeinator42 wrote 10 hours 52 min ago:
I wonder if they also want to create more of a market for their
products such as the DGX Spark.
notyourwork wrote 18 hours 29 min ago:
They and Tesla know something no one else does.
beng-nl wrote 12 hours 40 min ago:
Can you tell us more? I’m curious to hear what is behind this
implication.
leobg wrote 11 hours 23 min ago:
A guess:
They both believe the product people focus on will commoditize.
Tesla realized early that EVs without autonomy are a dead end for
long-term dominance, just as NVIDIA believes models without
infrastructure are a dead end for durable AI profits.
(Am I close?)
radarsat1 wrote 21 hours 7 min ago:
I find it really interesting that it uses a Mamba hybrid with
Transformers. Is it the only significant model right now using (at
least partially) SSM layers? This must contribute to lower VRAM
requirements right? Does it impact how KV caching works?
ofermend wrote 1 day ago:
We just evaluated Nemotron-3 for Vectara's hallucination leaderboard.
It scores a 9.6% hallucination rate, similar to
qwen3-next-80b-a3b-thinking (9.3%) but of course it is much smaller.
[1]: https://github.com/vectara/hallucination-leaderboard
DoctorOetker wrote 1 day ago:
Can it understand input in, and generate output for, different language
tokens? Does it know narrow IPA transcription of sentences in arbitrary
languages?
sosodev wrote 1 day ago:
The claim that a small, fast, and decently accurate model makes a good
foundation for agentic workloads seems reasonable.
However, is cost the biggest limiting factor for agent adoption at this
point? I would suspect that the much harder part is just creating an
agent that yields meaningful results.
ineedasername wrote 1 day ago:
No, I really don't think cost is the limiting factor; it's tooling
and a competent workforce to implement it. Every company of any
substantial size, or near enough, is trying to implement and hire for
those roles. The bottlenecks are the small number of people familiar
with the specific tooling and the immaturity of the tooling itself,
which steepens the learning curve.
all2 wrote 1 day ago:
This has been my major concern, so much so that I'm going to be
launching a tool to handle this specific task: agent conception and
testing. There is so little visibility in the tools I've used that
debugging is just a game of whack-a-mole.
sosodev wrote 13 hours 58 min ago:
Did you see this HN submission? [1] It seems similar to what you're
describing.
[1]: https://news.ycombinator.com/item?id=46242838
all2 wrote 9 hours 56 min ago:
I did not. Thanks for the heads up!
kristopolous wrote 1 day ago:
I was just using the embeddings model last night. Boy is it slow. Nice
results but this 5090 isn't cutting it.
I'm guessing there's some sophistication in the instrumentation I'm
just not up to date with.
sosodev wrote 1 day ago:
I love how detailed and transparent the data set statistics are on the
huggingface pages. [1] I've noticed that open models have made huge
efficiency gains in the past several months. Some amount of that is
explainable as architectural improvements but it seems quite obvious
that a huge portion of the gains come from the heavy use of synthetic
training data.
In this case roughly 33% of the training tokens are synthetically
generated by a mix of other open weight models. I wonder if this trend
is sustainable or if it might lead to model collapse as some have
predicted. I suspect that the proliferation of synthetic data
throughout open weight models has led to a lot of the ChatGPT writing
style replication (many bullet points, em dashes, it's not X but
actually Y, etc).
[1]: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-F...
jtbayly wrote 1 day ago:
Any chance of running this nano model on my Mac?
keyle wrote 1 day ago:
LM Studio and 32+ GB of RAM. [1] Simplest to just install it from the
app.
[1]: https://lmstudio.ai/models/nemotron-3
jonrosner wrote 1 day ago:
running it on my M4 @ 90tps, takes 18GB of RAM.
Tepix wrote 20 hours 0 min ago:
If it uses 18 GB of RAM, you're not using the official model
(released in BF16 and FP8), but a quantization of unknown quality.
When you write "M4", do you mean the base M4, not an M4 Pro or M4 Max?
pylotlight wrote 22 hours 8 min ago:
M2 Max @ 17tps btw
mark_l_watson wrote 1 day ago:
I used Nemotron 3 Nano on LM Studio yesterday on my 32 GB M2 Pro Mac
mini. It is fast and passed all of my personal tool use tests, and
did a good job analyzing code. Love it.
Today I ran a few simple cases on Ollama, but not much real testing.
axoltl wrote 1 day ago:
There are MLX versions of the model, so yes. LM Studio hasn't updated
their mlx-lm runtime yet, though; you'll get an exception.
But if you're OK running it without a UI wrapper, mlx_lm==0.30.0 will
serve you fine.
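A minimal sketch with the mlx_lm Python API (the Hugging Face repo name
below is hypothetical; point it at whichever MLX conversion you
actually use):

    # pip install "mlx-lm>=0.30.0"
    from mlx_lm import load, generate

    # Repo name is illustrative, not an official conversion.
    model, tokenizer = load("mlx-community/NVIDIA-Nemotron-3-Nano-30B-A3B-8bit")
    text = generate(model, tokenizer, prompt="Write a haiku about GPUs.",
                    max_tokens=128)
    print(text)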
anon373839 wrote 1 day ago:
Looks like LM Studio just updated the MLX runtime, so there's
compatibility now.
axoltl wrote 12 hours 14 min ago:
Yep! 60t/s on the 8 bit MLX on an M4 Pro with 64GB of RAM.
netghost wrote 1 day ago:
Kind of depends on your Mac, but if it's a relatively recent Apple
Silicon model… maybe, probably?
> Nemotron 3 Nano is a 3.2B active (3.6B with embeddings) 31.6B total
parameter model.
So I don't know the exact math once you have an MoE, but 3.2B active
will run on most anything; with 31.6B total you're looking at needing a
pretty large amount of RAM. Rough numbers below.
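A rough back-of-the-envelope, assuming you load all 31.6B weights and
ignoring KV cache and runtime overhead:

    # Rough memory estimate for a 31.6B-parameter MoE at common precisions.
    TOTAL_PARAMS = 31.6e9
    for name, bytes_per_param in [("BF16", 2.0), ("FP8/8-bit", 1.0), ("4-bit", 0.5)]:
        gib = TOTAL_PARAMS * bytes_per_param / 2**30
        print(f"{name}: ~{gib:.0f} GiB")
    # BF16: ~59 GiB, FP8/8-bit: ~29 GiB, 4-bit: ~15 GiB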
vessenes wrote 1 day ago:
Given Mac bandwidth, you'll generally want to load the whole thing
in RAM. You get speed benefits based on smaller-size active
experts, since the Mac compute is slow compared to Nvidia hardware.
This should be relatively snappy on a Mac, if you can load the
entire thing.
kristianp wrote 1 day ago:
The article seems to focus on the nano model. Where are the details of
the larger ones?
shikon7 wrote 1 day ago:
> We are releasing the Nemotron 3 Nano model and technical report.
Super and Ultra releases will follow in the coming months.
max002 wrote 1 day ago:
I'm upvoting; I'm happy to finally see an open-source model from Nvidia
that allows commercial use, as most of the models I've been checking
from you guys couldn't be used in commercial settings. Bravo, Nvidia!
teleforce wrote 19 hours 53 min ago:
Just wondering: can anything with a commercial restriction be
considered open source at all? Even the most stringent GPL allows you
to commercialize [1].
We are talking about an LLM model here rather than software, but the
same principle should apply.
[1]: https://en.wikipedia.org/wiki/Open-source_license
wcallahan wrote 2 days ago:
I don’t do ‘evals’, but I do process billions of tokens every
month, and I’ve found these small Nvidia models to be the best by far
for their size currently.
As someone else mentioned, the GPT-OSS models are also quite good
(though I haven’t found how to make them great yet; I think they might
age well like the Llama 3 models did and get better with time!).
But for a defined task, I’ve found task compliance, understanding,
and tool call success rates to be some of the highest on these Nvidia
models.
For example, I have a continuous job that evaluates whether the data
for a startup company on aVenture.vc could have conflated two similar
but unrelated companies across news articles, research details,
investment rounds, etc., which is a token-hungry ETL task! I recently
retested this workflow on the top 15 or so models with <125B
parameters, and the Nvidia models were among the best performing for
this type of work, particularly around non-hallucination when given
adequate grounding.
Also, re: cost: I run local inference on several machines that run
continuously, in addition to routing through OpenRouter and the
frontier providers, and was pleasantly surprised to find that, as an
otherwise paying OpenRouter customer, the free Nvidia variant there
has quite generous limits, too.
selfhoster11 wrote 21 hours 19 min ago:
You may want to use the new "derestricted" variants of gpt-oss. While
the ostensible goal of these variants is to de-censor them, the change
also removes the models' obsession with policy, which wastes thinking
tokens that could go toward actually reasoning through a problem.
dandelionv1bes wrote 22 hours 0 min ago:
Completely agree. I was working on something with TensorRT LLM and
threw Nemotron in there more on a whim. It completely mopped the
floor with other models for my task (text style transfer), following
joint moderation with another LLM & humans. Really impressed.
kgeist wrote 1 day ago:
> the GPT-OSS models are also quite good
I recently pitted gpt-oss 120b against Qwen3-Next 80b on a lot of
internal benchmarks (for production use), and for me, gpt-oss was
slightly slower (vLLM, both fit in VRAM), much worse at multilingual
tasks (33 languages evaluated), and had worse instruction following
(e.g., Qwen3-Next was able to reuse the same prompts I used for
Gemma3 perfectly, while gpt-oss struggled and RAG benchmarks suddenly
went from 90% to 60% without additional prompt engineering).
And that's with Qwen3-Next being a random unofficial 4-bit quant
(compared to gpt-oss having native support) + I had to disable
multi-token prediction in Qwen3-Next because vLLM crashed with it.
Has anyone here tried both gpt-oss 120b and Qwen3-Next 80b? Maybe I
was doing something wrong, because I've seen a lot of people praise
gpt-oss.
scrlk wrote 1 day ago:
gpt-oss is STEM-maxxed, so I imagine most of the praise comes from
people using it for agentic coding.
> We trained the models on a mostly English, text-only dataset,
with a focus on STEM, coding, and general knowledge.
[1]: https://openai.com/index/introducing-gpt-oss/
andy99 wrote 1 day ago:
What do you mean about not doing evals? Just literally that you
don’t run any benchmarks or do you have something against them?
danielmarkbruce wrote 1 day ago:
He's just saying anecdotally these models are good. A reasonable
response might be "have you systematically evaluated them?". He has
pre-answered - no.
woodson wrote 1 day ago:
Not OP, but perhaps they mean not putting too much faith in common
benchmarks (thanks to benchmaxxing).
btown wrote 1 day ago:
Would you mind sharing what hardware/card(s) you're using? And is [1]
one of the ones you've tested?
[1]: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B...
heavyset_go wrote 23 hours 2 min ago:
Support for this landed in llama.cpp recently if anyone is
interested in running it locally.
red2awn wrote 2 days ago:
Very interesting release:
* Hybrid MoE: 2-3x faster than pure MoE transformers
* 1M context length
* Trained on NVFP4
* Open Source! Pretraining, mid-training, SFT and RL dataset released
(SFT HF link is 404...)
* Open model training recipe (coming soon)
Really appreciate Nvidia being the most open lab but they really should
make sure all the links/data are available on day 0.
Also interesting that the model is trained in NVFP4 but the inference
weights are FP8.
bcatanzaro wrote 1 day ago:
The Nano model isn’t pretrained in FP4, only Super and Ultra are.
And posttraining is not in FP4, so the posttrained weights of these
models are not native FP4.
pants2 wrote 2 days ago:
If it's intelligence + speed you want, nothing comes close to
GPT-OSS-120B on Cerebras or Groq.
However, this looks like it has great potential for cost-effectiveness.
As of today it's free to use over API on OpenRouter, so a bit unclear
what it'll cost when it's not free, but free is free!
[1]: https://openrouter.ai/nvidia/nemotron-3-nano-30b-a3b:free
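A minimal sketch of calling the free variant through OpenRouter's
OpenAI-compatible API (assumes the openai Python client and an
OPENROUTER_API_KEY environment variable; the model slug comes from the
link above):

    import os
    from openai import OpenAI

    client = OpenAI(base_url="https://openrouter.ai/api/v1",
                    api_key=os.environ["OPENROUTER_API_KEY"])
    resp = client.chat.completions.create(
        model="nvidia/nemotron-3-nano-30b-a3b:free",  # free tier; rate limits may apply
        messages=[{"role": "user",
                   "content": "Summarize Nemotron 3 Nano in one sentence."}],
    )
    print(resp.choices[0].message.content)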
viraptor wrote 2 days ago:
> nothing comes close to GPT-OSS-120B on Cerebras
That's temporary. Cerebras speeds up everything, so if Nemotron is
good quality, it's just a matter of time until they add it.
credit_guy wrote 2 days ago:
That's unlikely. Cerebras doesn't speed up everything. Can it speed
up everything? I don't know, I'm not an insider. But does it speed
up everything? That is evidently not the case. Their page [1] lists
only 4 production models and 2 preview models.
[1]: https://inference-docs.cerebras.ai/models/overview
agentastic wrote 1 day ago:
They need to compile the model for their chips. Standard
transformers are easier, so for GPT-OSS, Qwen, GLM, etc., if there is
demand, they will deploy them.
Nemotron on the other hand is a hybrid (Transformer + Mamba-2) so
it will be more challenging to compile it on Cerebras/Groq chips.
(Me thinks Nvidia is purposefully picking architecture+FP4 that
is easy to ship on Nvidia chips, but harder for TPU or
Cerebras/Groq to deploy)
Y_Y wrote 2 days ago:
Wow, Nvidia keeps on pushing the frontier of misleading benchmarks