COMMENT PAGE FOR:
Baba Is Eval
zahlman wrote 17 hours 1 min ago:
> This is why the video of Claude solving level 1 at the top was
actually (dramatic musical cue) staged, and only possible via a
move-for-move tutorial that Claude nicely rationalized post hoc.
One of the things this arc of history has taught me is that post-hoc
rationalization is depressingly easy. Especially when it doesn't have
to make sense; even producing one that passes basic logical checks
isn't too difficult. Ripping the rationalization apart often requires
identifying novel, non-obvious logical checks.
I thought I had learned that time and time again from human politics,
but AI somehow made it even clearer than I thought possible. Perhaps
simply because of knowing that a machine is doing it.
Edit: after watching the video more carefully:
> "This forms WALL IS WIN horizontally. But I need "FLAG IS WIN"
instead. Let me check if walls now have the WIN property. If they do, I
just need to touch a wall to win. Let me try moving to a wall:
There's something extremely uncanny-valley about this. A human player
absolutely would accidentally win like this, and have similar reasoning
(not expressed so formally) about how the win was achieved after the
fact. (Winning depends on the walls having WIN and also not having
STOP; many players get stuck on later levels, even after having
supposedly learned the lesson of this one, by trying to make something
WIN and walk onto it while it is still STOP.)
But the WIN block was not originally in line with the WALL IS text, so
a human player would never accidentally form the rule, but would only
do it with the expectation of being able to win that way. Especially
since there was already an obvious, clear path to FLAG — a level like
this has no Sokoban puzzle element to it; it's purely about learning
that the walls only block the player because they are STOP.
Nor would (from my experience watching streamers at least) a human
spontaneously notice that the rule "WALL IS WIN" had been formed and
treat that as a cue to reconsider the entire strategy. The natural
human response to unintentionally forming a useful rule is to keep
pushing in the same direction.
On the other hand, an actually dedicated AI system (in the way that
AlphaGo was dedicated to Go) could, I'm sure, figure out a game like
Baba Is You pretty easily. It would lack the human instinct to treat
the walls as if they were implicitly always STOP; so it would never
struggle with overriding it.
deadbabe wrote 16 hours 25 min ago:
A simple feed-forward neural network with sufficient training can
solve levels way better than Claude. Why is Claude being used at all?
wredcoll wrote 15 hours 42 min ago:
The question isn't "can we write a computer program that can beat X
game?", it is "do things like Claude represent a truly general
purpose intelligence, as demonstrated by the ability to both write a
limerick and play Baba Is You?"
WhitneyLand wrote 20 hours 24 min ago:
“Reasoning models like o3 might be better equipped to come up with a
plan, so a natural step would be to try switching to those, away from
Claude Desktop…”
But…Claude Desktop does have a reasoning mode for both Sonnet and
Opus.
popcar2 wrote 21 hours 37 min ago:
I would be way more interested in it playing niche community levels,
because I suspect a huge reason it's able to solve these levels is
that it was trained on a million Baba Is You walkthroughs. Same with
people using Pokemon as a way to test LLMs; it really just depends on
how well the model knows the game.
fi-le wrote 21 hours 18 min ago:
Two corrections, as written in the post: At least Claude is not able
to solve the standard levels at all, and community levels are
definitely in scope.
andy99 wrote 22 hours 4 min ago:
I suspect real AGI evals aren't going to be "IQ test"-like, which is
how I'd categorize these benchmarks.
LLMs will probably continue to scale on such benchmarks, as they have
been, without needing real ingenuity or intelligence.
Obviously I don't know the answer but I think it's the same root
problem as why neural networks will never lead to intelligence. We're
building and testing idiot savants.
niemandhier wrote 23 hours 6 min ago:
I think it’s a great idea for a benchmark.
One key difference from ARC in its current iteration is that here
there is a defined and learnable game physics.
ARC requires generalization from a few examples for problems that are
not well defined per se.
Hence ARC currently requires the models that work on it to possess
biases comparable to the ones that humans possess.
ThouTo2C wrote 23 hours 14 min ago:
There are numerous guides for all levels of Baba Is You available. I
think it's likely that any modern LLM has them as part of its training
dataset. That severely degrades this as a test for complex solution
capabilities.
Still, it's interesting to see the challenges with dynamic rules (like
"Key is Stop") that change where you are able to move, etc.
ethan_smith wrote 19 hours 51 min ago:
The dynamic rule changes are precisely what make this a valuable
benchmark despite available guides. Each rule modification creates a
novel state-space that requires reasoning about the consequences of
those changes, not just memorizing solution paths.
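To make that concrete, here is a minimal Python sketch (the grid
representation and word lists are hypothetical) of the rule re-scan a
solver would need after every move: text tiles form an active rule
wherever NOUN IS PROPERTY reads rightward or downward.
    # Sketch: scan a grid of text tiles for active "X IS Y" rules.
    # Assumed representation: grid[r][c] is a text-tile string such as
    # "WALL" or "IS", or None for squares without rule text.
    NOUNS = {"BABA", "WALL", "FLAG", "KEY"}
    PROPS = {"YOU", "WIN", "STOP", "PUSH"} | NOUNS  # "X IS Y" can also remap nouns

    def active_rules(grid):
        rules = set()
        rows, cols = len(grid), len(grid[0])
        for r in range(rows):
            for c in range(cols):
                for dr, dc in ((0, 1), (1, 0)):     # read right and down
                    r2, c2 = r + 2 * dr, c + 2 * dc
                    if r2 < rows and c2 < cols:
                        x, mid, y = grid[r][c], grid[r + dr][c + dc], grid[r2][c2]
                        if x in NOUNS and mid == "IS" and y in PROPS:
                            rules.add((x, y))
        return rules
Diffing active_rules() before and after each push is exactly how a
solver would notice that, say, WALL IS WIN just became active.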
klohto wrote 23 hours 9 min ago:
Read the article first maybe
tibastral2 wrote 1 day ago:
It reminds me of [1]. Hope we are not ourselves in some sort of
simulation ;)
[1]: https://en.m.wikipedia.org/wiki/The_Ricks_Must_Be_Crazy
wohoef wrote 1 day ago:
In my experience LLMs have a hard time working with text grids like
this. They seem to find columns harder to “detect” than rows,
probably because their input shows the grid as one giant row, if that
makes sense.
They have the same problem with playing chess.
But I’m not sure if there is a datatype they could work with for this
kind of game. Currently it seems more like LLMs can’t really work on
spatial problems. But this should actually be something that can be
fixed (pretty sure I saw an article about it on HN recently)
fi-le wrote 21 hours 16 min ago:
Good point. The architectural solution that would come to mind is 2D
text embeddings, i.e. we add 2 sines and cosines to each token
embedding instead of 1. Apparently people have done it before:
[1]: https://arxiv.org/abs/2409.19700v2
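For reference, a minimal numpy sketch of that idea (an assumed
scheme, not necessarily the linked paper's exact formulation): half
of the embedding channels encode the row index and half the column
index, each with the standard sine/cosine table.
    import numpy as np

    def pos_embedding_2d(rows, cols, d_model):
        # assumes d_model is divisible by 4
        def sincos(pos, d):
            i = np.arange(d // 2)
            freq = 1.0 / (10000 ** (2 * i / d))
            ang = pos[:, None] * freq[None, :]
            return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)  # (len(pos), d)
        row = sincos(np.arange(rows), d_model // 2)
        col = sincos(np.arange(cols), d_model // 2)
        # every grid cell gets [row code | column code]
        return np.concatenate(
            [np.repeat(row[:, None, :], cols, axis=1),
             np.repeat(col[None, :, :], rows, axis=0)],
            axis=-1)  # shape (rows, cols, d_model)
Tokens in the same column then share half their position code, which
is what makes vertical structure as easy to attend to as horizontal.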
ninjha wrote 20 hours 45 min ago:
I think I remember one of the original ViT papers saying something
about 2D embeddings on image patches not actually increasing
performance on image recognition or segmentation, so it’s kind of
interesting that it helps with text!
Edit: I found the paper: [1] > We use standard learnable 1D position
embeddings, since we have not observed significant performance
gains from using more advanced 2D-aware position embeddings
(Appendix D.4).
Although it looks like that was just ImageNet so maybe this isn't
that surprising.
[1]: https://arxiv.org/pdf/2010.11929
yorwba wrote 19 hours 14 min ago:
They seem to have used a fixed input resolution for each model,
so the learnable 1D position embeddings are equivalent to
learnable 2D position embeddings where every grid position gets
its own embedding. It's when different images may have a
different number of tokens per row that the correspondence
between 1D index and 2D position gets broken and a 2D-aware
position embedding can be expected to produce different results.
stavros wrote 23 hours 36 min ago:
If this were a limitation in the architecture, they wouldn't be able
to work with images, no?
hnlmorg wrote 21 hours 36 min ago:
LLMs don’t work with images.
stavros wrote 21 hours 14 min ago:
They do, though.
hnlmorg wrote 20 hours 45 min ago:
Do they? I thought it was completely different models that did
image generation.
LLMs might be used to translate requests into keywords, but I
didn’t think LLMs themselves did any of the image generation.
Am I wrong here?
stavros wrote 20 hours 43 min ago:
Yes, that's why ChatGPT can look at an image and change the
style, or edit things in the image. The image itself is
converted to tokens and passed to the LLM.
hnlmorg wrote 20 hours 35 min ago:
LLMs can be used as an agent to do all sorts of clever
things, but it doesn’t mean the LLM is actually handling
the original data format.
I’ve created MCP servers that can scrape websites but
that doesn’t mean the LLM itself can make HTTP calls.
The reason I make this distinction is because someone
claimed that LLMs can read images. But they don’t. They
act as an agent for another model that reads images and
creates metadata from it. LLMs then turn that meta data
into natural language.
The LLM itself doesn’t see any pixels. It sees textual
information that another model has provided.
Edit: reading more about this online, it seems LLMs can
work with pixel level data. I had no idea that was
possible.
My apologies.
stavros wrote 20 hours 28 min ago:
No problem. Again, if it happened the way you described
(which it did, until GPT-4o recently), the LLM wouldn't
have been able to edit images. You can't get a textual
description of an image and reconstruct it perfectly just
from that, with one part edited.
froobius wrote 1 day ago:
Transformers can easily be trained / designed to handle grids; it's
just that off-the-shelf standard LLMs haven't been trained for them in
particular (although they would have seen some)
nine_k wrote 17 hours 26 min ago:
Are there some well-known examples of success in it?
thethimble wrote 12 hours 34 min ago:
Vision transformers effectively encode a grid of pixel patches.
It’s ultimately a matter of ensuring the position encoding
incorporates both X and Y position.
For LLMs we only have one axis of position and, more importantly,
the vast majority of training data is oriented only in this way.
pclmulqdq wrote 1 day ago:
I have noticed a trend of the word "Desiderata" appearing in a lot more
writing. Is this an LLM word or is it just in fashion? Most people
would use the words "Desires" or "Goals," so I assume this might be the
new "delve."
fi-le wrote 21 hours 23 min ago:
At least in this instance, it came from my fleshy human brain.
Although I perhaps used it to come off as smarter than I really am -
just like an LLM might.
Tomte wrote 1 day ago:
It‘s academic jargon. Desiderata are often at the end of a paper,
in the section „someone should investigate X, but I‘m moving on
to the next funded project“.
ginko wrote 22 hours 49 min ago:
So „Future Work“?
dgfl wrote 21 hours 30 min ago:
Literally it means “things that we wish for”, from the Latin
verb “desiderare” (to wish).
RainyDayTmrw wrote 1 day ago:
This is interesting. If you approach this game as individual moves, the
search tree is really deep. However, most levels can be expressed as a
few intermediate goals.
In some ways, this reminds me of the history of AI Go (board game). But
the resolution there was MCTS, which wasn't at all what we wanted
(insofar as MCTS is not generalizable to most things).
kadoban wrote 1 day ago:
> But the resolution there was MCTS
MCTS wasn't _really_ the solution to go. MCTS-based AIs existed for
years and they weren't _that_ good. They weren't superhuman for sure,
and the moves/games they played were kind of boring.
The key to doing go well was doing something that vaguely looks like
MCTS but the real guts are a network that can answer: "who's
winning?" and "what are good moves to try here?" and using that to
guide search. Additionally essential was realizing that computation
(run search for a while) with a bad model could be
effectively+efficiently used to generate better training data to
train a better model.
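A minimal sketch of that loop's search half, assuming a hypothetical
net(state) -> (value, {move: prior}) that answers "who's winning?"
and "what are good moves to try here?", plus an apply_move(state,
move); sign handling for alternating players is omitted for brevity.
    import math

    class Node:
        def __init__(self, prior):
            self.prior, self.visits, self.value_sum = prior, 0, 0.0
            self.children = {}                    # move -> Node

    def select(node, c_puct=1.5):
        # PUCT: exploit high average value (Q), explore high-prior,
        # rarely visited moves (U)
        total = math.sqrt(node.visits + 1)
        def score(child):
            q = child.value_sum / child.visits if child.visits else 0.0
            return q + c_puct * child.prior * total / (1 + child.visits)
        return max(node.children.items(), key=lambda kv: score(kv[1]))

    def simulate(state, root, net, apply_move):
        node, path = root, [root]
        while node.children:                      # walk down, guided by the net
            move, node = select(node)
            state = apply_move(state, move)
            path.append(node)
        value, priors = net(state)                # leaf: ask the network
        node.children = {m: Node(p) for m, p in priors.items()}
        for n in path:                            # back up the value estimate
            n.visits += 1
            n.value_sum += value
Games played with this search then become training data for the next,
stronger network, which is the second point above.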
eru wrote 23 hours 46 min ago:
> Additionally essential was realizing that computation (run search
for a while) with a bad model could be effectively+efficiently used
to generate better training data to train a better model.
That has been known since at least the 1990s, with TD-Gammon beating
the world champions in Backgammon. See e.g. [1] or [2]. In a sense,
classic chess engines do that, too: alpha-beta search uses a very
weak model (e.g. just checking for checkmate, otherwise counting
material, or what have you) and search to generate a much stronger
player. You can use that to generate data for training a better
model.
[1]: http://incompleteideas.net/book/ebook/node108.html
[2]: https://en.wikipedia.org/wiki/TD-Gammon
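A minimal negamax-style sketch of that chess example, assuming a
hypothetical game interface (legal_moves, play, is_terminal) and a
material() count scored from the side to move:
    # A very weak model (material count) plus deep search yields a much
    # stronger player than the model alone; the resulting search values
    # can also serve as training targets for a better model.
    def alphabeta(state, depth, alpha=-10**9, beta=10**9):
        if depth == 0 or is_terminal(state):
            return material(state)                # the "very weak model"
        for move in legal_moves(state):
            score = -alphabeta(play(state, move), depth - 1, -beta, -alpha)
            alpha = max(alpha, score)
            if alpha >= beta:
                break                             # opponent won't allow this line
        return alpha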
kadoban wrote 11 hours 14 min ago:
> That has been known since at least the 1990s with TD-Gammon
beating the world champions in Backgammon.
Yeah, I didn't mean to imply that reinforcement learning (or
applying it in this way) is novel. It was just important to work
out how to apply that to go specifically.
> In a sense, classic chess engines do that, too:
alpha-beta-search uses a very weak model (eg just checking for
checkmate, otherwise counting material, or what have you) and
search to generate a much stronger player. You can use that to
generate data for training a better model.
I would say that classic chess AIs specifically don't do the
important part. They aren't able to use a worse model to, with
computation, train a better model. They can generate training
data, but then they have no way to incorporate it back into the
AI.
rtpg wrote 1 day ago:
> However, most levels can be expressed as a few intermediate goals
I think generally the whole thing with puzzle games is that you have
to determine the “right” intermediate goals. In fact, the naive
intermediate goals are often entirely wrong!
A canonical sokoban-like inversion might be where you have to push
two blocks into goal areas. You might think “ok, push one block
into its goal area and then push the other into its goal.”
But many of these games will have mechanisms meaning you would first
want to push one block into its goal, then undo that for some reason
(it might activate some extra functionality), push the other block,
and then finally go back and redo the first one.
There are always weird tricks that mean that you’re going to walk
backwards before walking forwards. I don’t think it’s impossible
for these things to stumble into it, though. Just might spin a lot of
cycles to get there (humans do too I guess)
matsemann wrote 1 day ago:
Yeah, often working backwards and forwards at the same time is how
to solve some advanced puzzle games. That way you keep the options
from exploding. When thinking backwards from the goal, you
figure out constraints or "invariants" the forward path must
uphold, and thus you can discard lots of dead ends earlier in your
forward path.
To me, those discoveries are the fun part of most puzzle games.
When you unlock the "trick" for each level and the dopamine flies,
heh.
TeMPOraL wrote 22 hours 53 min ago:
I usually get good mileage out of jumping straight into the
middle :). Like, "hmm, let's look at this block; oh cool, there's
enough space around it that I could push it away from the goal, for
whatever reason". Turns out, if it's possible there usually is a
good reason. So whenever I get stuck, I skim every object in the
puzzle and consider it in isolation (what can I do with it?), and this
usually gives me anchor points to drive my forward or backward
thinking through.
captn3m0 wrote 1 day ago:
I once made an “RC plays Baba Is You” setup that controlled the game
through a single shared browser, streaming video out and sending
controls back to the game. Was quite fun!
But I am fairly sure all of Baba Is You solutions are present in the
training data for modern LLMs so it won’t make for a good eval.
chmod775 wrote 1 day ago:
> But I am fairly sure all of Baba Is You solutions are present in
the training data for modern LLMs so it won’t make for a good eval.
Claude 4 cannot solve any Baba Is You level (except level 0 that is
solved by 8 right inputs), so for now it's at least a nice low bar to
shoot for...
ekianjo wrote 1 day ago:
this is definitely a case for fine-tuning an LLM on this game's data.
There is currently no LLM out there that is able to play many
games of different kinds very well.
k2xl wrote 1 day ago:
Baba Is You is a great game, part of a collection of 2D grid puzzle
games.
(Shameless plug: I am one of the developers of Thinky.gg ( [1] ), which
is a thinky puzzle game site for a 'shortest path style' [Pathology]
and a Sokoban variant [Sokoath] )
These games are typically NP-hard, so the typical techniques that
solvers have employed for Sokoban (or Pathology) have been brute
force with varying heuristics (like BFS, deadlock detection, and
Zobrist hashing). However, once levels get beyond a certain size with
enough movable blocks you end up exhausting memory pretty quickly.
These types of games are still "AI proof" so far, in that LLMs are
absolutely awful at solving them while humans are very good (so they
seem reasonable to consider for ARC-AGI benchmarks). Whenever a new
reasoning model gets released I typically try it on some basic
Pathology levels (like 'One at a Time' [2]) and they fail miserably.
Simple level code for the above level (1 is a wall, 2 is a movable
block, 4 is starting block, 3 is the exit):
000
020
023
041
Similar to OP, I've found Claude can't manage rule dynamics,
blocked paths, or game objectives well, and it spits out random
results.
[1]: https://thinky.gg
[2]: https://pathology.thinky.gg/level/ybbun/one-at-a-time
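For levels this small, the brute force is a few lines. Here is a
minimal BFS sketch over (player, blocks) states for the level code
above, assuming 0 is an empty square and Sokoban-like pushing (a
block moves one square if the square behind it is free); Pathology's
actual rules may differ.
    from collections import deque

    def solve(level):
        grid = [list(row) for row in level]
        R, C = len(grid), len(grid[0])
        cells = [(r, c) for r in range(R) for c in range(C)]
        blocks = frozenset(p for p in cells if grid[p[0]][p[1]] == "2")
        start = next(p for p in cells if grid[p[0]][p[1]] == "4")
        exit_ = next(p for p in cells if grid[p[0]][p[1]] == "3")

        def free(p, bs):  # in bounds, not a wall, not occupied by a block
            return (0 <= p[0] < R and 0 <= p[1] < C
                    and grid[p[0]][p[1]] != "1" and p not in bs)

        seen, todo = {(start, blocks)}, deque([(start, blocks, "")])
        while todo:
            pos, bs, path = todo.popleft()
            if pos == exit_:
                return path                   # shortest input sequence
            for d, (dr, dc) in (("U", (-1, 0)), ("D", (1, 0)),
                                ("L", (0, -1)), ("R", (0, 1))):
                nxt, nbs = (pos[0] + dr, pos[1] + dc), bs
                if nxt in bs:                 # pushing: square behind must be free
                    behind = (nxt[0] + dr, nxt[1] + dc)
                    if not free(behind, bs):
                        continue
                    nbs = bs - {nxt} | {behind}
                elif not free(nxt, bs):
                    continue
                if (nxt, nbs) not in seen:
                    seen.add((nxt, nbs))
                    todo.append((nxt, nbs, path + d))

    print(solve(["000", "020", "023", "041"]))  # e.g. "LUURDR" under these rules
The frozenset of block positions is the part that Zobrist hashing
makes cheap in serious solvers, and the memory blow-up described
above is exactly the seen set growing with the number of movable
blocks.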
eru wrote 23 hours 41 min ago:
NP-hardness isn't much of a problem, because the levels are fairly
small, and instances are not chosen to be worst-case hard but to be
entertaining for humans to solve.
SMT/SAT solvers or integer linear programming can get you pretty far.
Many classic puzzle games like Minesweeper are NP-hard, and you can
solve any instance that a human would be able to solve in their
lifetime fairly quickly on a computer.
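A minimal sketch of that approach on a toy Minesweeper fragment,
using Z3's Python bindings (the cells and clues here are made up):
    from z3 import Bool, If, Not, Solver, Sum, sat

    a, b, c = Bool("a"), Bool("b"), Bool("c")   # unknown cells: True = mine
    s = Solver()
    s.add(Sum([If(a, 1, 0), If(b, 1, 0)]) == 1)        # a revealed "1" touches a and b
    s.add(Sum([If(v, 1, 0) for v in (a, b, c)]) == 2)  # two mines remain in total

    # c is provably a mine iff "c is safe" is unsatisfiable
    s.push(); s.add(Not(c))
    print("c is forced to be a mine:", s.check() != sat)  # -> True
    s.pop()
Deductions a human makes square by square fall out of a single
satisfiability check per cell.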
kinduff wrote 1 day ago:
In the Factorio paper [1], page 3, the agent receives a semantic
representation with coordinates. Have you tried this data format?
[1]: https://arxiv.org/pdf/2503.09617
kinduff wrote 1 day ago:
Do you think the performance can be improved if the representation of
the level is different?
I've seen AI struggle with ASCII, but it performs better when the
level is presented as other data structures.
edit:
e.g. JSON with structured coordinates, graph based JSON, or a semantic
representation with the coordinates
QuadmasterXLII wrote 20 hours 50 min ago:
These models can “code,” but they can't code yet. We'll know
that they can actually code once their performance on these tasks
becomes invariant to input representation, because then they can just
whip up a script to convert representations.
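That conversion script is indeed trivial to write by hand; a minimal
sketch (the legend and output schema are made up):
    import json

    LEGEND = {"#": "wall", "B": "block", "X": "exit", "P": "player"}

    def grid_to_json(ascii_grid):
        rows = ascii_grid.splitlines()
        objects = [{"type": LEGEND[ch], "x": x, "y": y}
                   for y, row in enumerate(rows)
                   for x, ch in enumerate(row) if ch in LEGEND]
        return json.dumps({"width": max(map(len, rows)),
                           "height": len(rows),
                           "objects": objects}, indent=2)

    print(grid_to_json("###\n#BX\n#P#"))
The interesting question is whether a model that fails on the ASCII
grid succeeds on this output, which is what representation invariance
would rule out.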
RainyDayTmrw wrote 1 day ago:
In the limit case, to an actual general intelligence, representation
is superfluous, because it can figure out how to convert freely.
To the extent that the current generation of AI isn't general, yeah,
papering over some of its weaknesses may allow you to expose other
parts of it, both strengths and other weaknesses.
kadoban wrote 1 day ago:
A human can easily struggle at solving a poorly communicated
puzzle, especially if paper/pencil or something isn't available to
convert to a better format. LLMs can look back at what they wrote,
but it seems kind of like a poor format for working out a better
representation to me.
kinduff wrote 7 hours 21 min ago:
I found some papers about this [1][2]. And I think the answer is
yes: the format matters, and hence the representation.
I wonder if the author would be willing to try another
representation.
[1]: "Does Prompt Formatting Have Any Impact on LLM Performance?" https://arxiv.org/html/2411.10541v1
[2]: "Large Language Models (LLMs) on Tabular Data: Prediction, Generation, and Understanding - A Survey" https://arxiv.org/html/2402.17944v2
hajile wrote 1 day ago:
If it struggles with the representation, that makes it an even better
test of the AI's thinking potential.
eru wrote 23 hours 44 min ago:
I'm not sure. Adding superficial difficulties to an IQ test for
humans doesn't (necessarily) improve it as an IQ test.