COMMENT PAGE FOR:
Baba Is Eval
zahlman wrote 17 hours 1 min ago: | |
> This is why the video of Claude solving level 1 at the top was | |
actually (dramatic musical cue) staged, and only possible via a | |
move-for-move tutorial that Claude nicely rationalized post hoc. | |
One of the things this arc of history has taught me is that post-hoc | |
rationalization is depressingly easy. Especially if it doesn't have to | |
make sense, but even passing basic logical checks isn't too difficult. | |
Ripping the rationalization apart often requires identifying novel, | |
non-obvious logical checks. | |
I thought I had learned that time and time again from human politics, | |
but AI somehow made it even clearer than I thought possible. Perhaps | |
simply because of knowing that a machine is doing it. | |
Edit: after watching the video more carefully: | |
> "This forms WALL IS WIN horizontally. But I need "FLAG IS WIN" | |
instead. Let me check if walls now have the WIN property. If they do, I | |
just need to touch a wall to win. Let me try moving to a wall: | |
There's something extremely uncanny-valley about this. A human player | |
absolutely would accidentally win like this, and have similar reasoning | |
(not expressed so formally) about how the win was achieved after the | |
fact. (Winning depends on the walls having WIN and also not having | |
STOP; many players get stuck on later levels, even after having | |
supposedly learned the lesson of this one, by trying to make something | |
WIN and walk onto it while it is still STOP.) | |
But the WIN block was not originally in line with the WALL IS text, so | |
a human player would never accidentally form the rule, but would only | |
do it with the expectation of being able to win that way. Especially | |
since there was already an obvious, clear path to FLAG; a level like
this has no Sokoban puzzle element to it; it's purely about learning | |
that the walls only block the player because they are STOP. | |
Nor would (from my experience watching streamers at least) a human | |
spontaneously notice that the rule "WALL IS WIN" had been formed and | |
treat that as a cue to reconsider the entire strategy. The natural | |
human response to unintentionally forming a useful rule is to keep | |
pushing in the same direction. | |
On the other hand, an actually dedicated AI system (in the way that | |
AlphaGo was dedicated to Go) could, I'm sure, figure out a game like | |
Baba Is You pretty easily. It would lack the human instinct to treat | |
the walls as if they were implicitly always STOP; so it would never | |
struggle with overriding it. | |
deadbabe wrote 16 hours 25 min ago: | |
A simple feed-forward neural network with sufficient training can | |
solve levels way better than Claude. Why is Claude being used at all?
wredcoll wrote 15 hours 42 min ago: | |
The question isn't "can we write a computer program that can beat X | |
game", it is "do things like claude represent a truly general | |
purpose intelligence as demonstrated by its ability to both write a | |
limerick and play baba is you" | |
WhitneyLand wrote 20 hours 24 min ago: | |
"Reasoning models like o3 might be better equipped to come up with a
plan, so a natural step would be to try switching to those, away from
Claude Desktop..."
But... Claude Desktop does have a reasoning mode for both Sonnet and
Opus.
popcar2 wrote 21 hours 37 min ago: | |
I would be way more interested in it playing niche community levels, | |
because I suspect a huge reason it's able to solve these levels is | |
because it was trained on a million Baba is You walkthroughs. Same with | |
people using Pokemon as a way to test LLMs, it really just depends on | |
how well it knows the game. | |
fi-le wrote 21 hours 18 min ago: | |
Two corrections, as written in the post: At least Claude is not able to
solve the standard levels at all, and community levels are definitely
in scope. | |
andy99 wrote 22 hours 4 min ago: | |
I suspect real AGI evals aren't going to be "IQ test"-like which is how | |
I'd categorize these benchmarks. | |
LLMs will probably continue to scale on such benchmarks, as they have | |
been, without needing real ingenuity or intelligence. | |
Obviously I don't know the answer but I think it's the same root | |
problem as why neural networks will never lead to intelligence. We're | |
building and testing idiot savants. | |
niemandhier wrote 23 hours 6 min ago: | |
I think it's a great idea for a benchmark.
One key difference to ARC in its current iteration is that there is a | |
defined and learnable game physics. | |
Arc requires generalization based on few examples for problems that are | |
not well defined per se. | |
Hence ARC currently requires the models that work on it to possess | |
biases that are comparable to the ones that humans possess. | |
ThouTo2C wrote 23 hours 14 min ago: | |
There are numerous guides for all levels of Baba Is You available. I | |
think it's likely that any modern LLM has them as part of its training | |
dataset. That severely degrades this as a test for complex solution | |
capabilities. | |
Still, it's interesting to see the challenges with dynamic rules (like
"Key is Stop") that change where you are able to move, etc.
ethan_smith wrote 19 hours 51 min ago: | |
The dynamic rule changes are precisely what make this a valuable | |
benchmark despite available guides. Each rule modification creates a | |
novel state-space that requires reasoning about the consequences of | |
those changes, not just memorizing solution paths. | |
klohto wrote 23 hours 9 min ago: | |
Read the article first maybe | |
tibastral2 wrote 1 day ago: | |
It reminds me of [1]. Hope we are not ourselves in some sort of
simulation ;) | |
[1]: https://en.m.wikipedia.org/wiki/The_Ricks_Must_Be_Crazy | |
wohoef wrote 1 day ago: | |
In my experience LLMs have a hard time working with text grids like
this. It seems to find columns harder to "detect" than rows.
Probably because its input shows it as a giant row, if that makes
sense.
It has the same problem with playing chess. | |
But I'm not sure if there is a datatype it could work with for this
kinda game. Currently it seems more like LLMs can't really work on
spatial problems. But this should actually be something that can be
fixed (pretty sure I saw an article about it on HN recently).
fi-le wrote 21 hours 16 min ago: | |
Good point. The architectural solution that would come to mind is 2D | |
text embeddings, i.e. we add 2 sines and cosines to each token | |
embedding instead of 1. Apparently people have done it before: | |
[1]: https://arxiv.org/abs/2409.19700v2 | |
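For what it's worth, a minimal sketch of that idea (my own illustration,
not the linked paper's exact formulation): split the position embedding
so the row and column each get their own sinusoidal code, and a token at
grid cell (row, col) carries information about both axes.

  import numpy as np

  def sincos_1d(pos, dim):
      # Standard 1D sinusoidal encoding for a single scalar position.
      i = np.arange(dim // 2)
      freqs = 1.0 / (10000 ** (2 * i / dim))
      angles = pos * freqs
      return np.concatenate([np.sin(angles), np.cos(angles)])

  def pos_embedding_2d(row, col, dim):
      # Half the dimensions encode the row, half the column (dim divisible by 4).
      return np.concatenate([sincos_1d(row, dim // 2), sincos_1d(col, dim // 2)])

  # e.g. the token for the tile at row 3, column 7 of a level grid:
  emb = pos_embedding_2d(3, 7, dim=64)  # added to (or concatenated with) the token embedding
  print(emb.shape)  # (64,)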
ninjha wrote 20 hours 45 min ago: | |
I think I remember one of the original ViT papers saying something | |
about 2D embeddings on image patches not actually increasing | |
performance on image recognition or segmentation, so it's kind of
interesting that it helps with text! | |
E: I found the paper: [1] > We use standard learnable 1D position | |
embeddings, since we have not observed significant performance | |
gains from using more advanced 2D-aware position embeddings | |
(Appendix D.4). | |
Although it looks like that was just ImageNet so maybe this isn't | |
that surprising. | |
[1]: https://arxiv.org/pdf/2010.11929 | |
yorwba wrote 19 hours 14 min ago: | |
They seem to have used a fixed input resolution for each model, | |
so the learnable 1D position embeddings are equivalent to | |
learnable 2D position embeddings where every grid position gets | |
its own embedding. It's when different images may have a | |
different number of tokens per row that the correspondence | |
between 1D index and 2D position gets broken and a 2D-aware | |
position embedding can be expected to produce different results. | |
stavros wrote 23 hours 36 min ago: | |
If this were a limitation in the architecture, they wouldn't be able | |
to work with images, no? | |
hnlmorg wrote 21 hours 36 min ago: | |
LLMs don't work with images.
stavros wrote 21 hours 14 min ago: | |
They do, though. | |
hnlmorg wrote 20 hours 45 min ago: | |
Do they? I thought it was completely different models that did | |
image generation. | |
LLMs might be used to translate requests into keywords, but I | |
didn't think LLMs themselves did any of the image generation.
Am I wrong here? | |
stavros wrote 20 hours 43 min ago: | |
Yes, that's why ChatGPT can look at an image and change the | |
style, or edit things in the image. The image itself is | |
converted to tokens and passed to the LLM. | |
hnlmorg wrote 20 hours 35 min ago: | |
LLMs can be used as an agent to do all sorts of clever | |
things, but it doesn't mean the LLM is actually handling
the original data format.
I've created MCP servers that can scrape websites but
that doesn't mean the LLM itself can make HTTP calls.
The reason I make this distinction is because someone | |
claimed that LLMs can read images. But they don't. They
act as an agent for another model that reads images and
creates metadata from it. LLMs then turn that metadata
into natural language.
The LLM itself doesn't see any pixels. It sees textual
information that another model has provided. | |
Edit: reading more about this online, it seems LLMs can | |
work with pixel level data. I had no idea that was | |
possible. | |
My apologies. | |
stavros wrote 20 hours 28 min ago: | |
No problem. Again, if it happened the way you described | |
(which it did, until GPT-4o recently), the LLM wouldn't | |
have been able to edit images. You can't get a textual | |
description of an image and reconstruct it perfectly just | |
from that, with one part edited. | |
froobius wrote 1 day ago: | |
Transformers can easily be trained / designed to handle grids; it's
just that off-the-shelf standard LLMs haven't particularly been
trained that way (although they would have seen some).
nine_k wrote 17 hours 26 min ago: | |
Are there some well-known examples of success in it? | |
thethimble wrote 12 hours 34 min ago: | |
Vision transformers effectively encode a grid of pixel patches.
It's ultimately a matter of ensuring the position encoding
incorporates both X and Y position.
For LLMs we only have one axis of position and - more importantly
- the vast majority of training data is only oriented in this
way.
pclmulqdq wrote 1 day ago: | |
I have noticed a trend of the word "Desiderata" appearing in a lot more | |
writing. Is this an LLM word or is it just in fashion? Most people | |
would use the words "Desires" or "Goals," so I assume this might be the
new "delve." | |
fi-le wrote 21 hours 23 min ago: | |
At least in this instance, it came from my fleshy human brain. | |
Although I perhaps used it to come off as smarter than I really am - | |
just like an LLM might. | |
Tomte wrote 1 day ago: | |
It's academic jargon. Desiderata are often at the end of a paper,
in the section "someone should investigate X, but I'm moving on
to the next funded project".
ginko wrote 22 hours 49 min ago: | |
So "Future Work"?
dgfl wrote 21 hours 30 min ago: | |
Literally it means "things that we wish for", from the Latin
verb "desiderare" (to wish).
RainyDayTmrw wrote 1 day ago: | |
This is interesting. If you approach this game as individual moves, the | |
search tree is really deep. However, most levels can be expressed as a | |
few intermediate goals. | |
In some ways, this reminds me of the history of AI Go (board game). But | |
the resolution there was MCTS, which wasn't at all what we wanted | |
(insofar as MCTS is not generalizable to most things). | |
kadoban wrote 1 day ago: | |
> But the resolution there was MCTS | |
MCTS wasn't _really_ the solution to go. MCTS-based AIs existed for | |
years and they weren't _that_ good. They weren't superhuman for sure, | |
and the moves/games they played were kind of boring. | |
The key to doing go well was doing something that vaguely looks like | |
MCTS but the real guts are a network that can answer: "who's | |
winning?" and "what are good moves to try here?" and using that to | |
guide search. Additionally essential was realizing that computation | |
(run search for a while) with a bad model could be | |
effectively+efficiently used to generate better training data to | |
train a better model. | |
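To make that concrete, here is a rough sketch of the kind of selection
rule that ties those two questions together (a PUCT-style formula in the
AlphaGo Zero spirit; the field names and constants are illustrative, not
anyone's production code):

  import math

  def puct_select(children, c_puct=1.5):
      # Pick the child maximizing Q + U: Q is the value estimate ("who's winning?"),
      # U is an exploration bonus weighted by the policy prior ("what's worth trying?").
      total_visits = sum(ch["N"] for ch in children) + 1
      def score(ch):
          q = ch["W"] / ch["N"] if ch["N"] else 0.0
          u = c_puct * ch["P"] * math.sqrt(total_visits) / (1 + ch["N"])
          return q + u
      return max(children, key=score)

  # A prior-favoured but unexplored move can beat a well-explored mediocre one:
  children = [
      {"move": "D4",  "N": 10, "W": 4.0, "P": 0.2},
      {"move": "Q16", "N": 0,  "W": 0.0, "P": 0.6},
  ]
  print(puct_select(children)["move"])  # -> Q16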
eru wrote 23 hours 46 min ago: | |
> Additionally essential was realizing that computation (run search | |
for a while) with a bad model could be effectively+efficiently used | |
to generate better training data to train a better model. | |
That has been known since at least the 1990s with TD-Gammon beating | |
the world champions in Backgammon. See e.g. [1] or [2]. In a sense,
classic chess engines do that, too: alpha-beta-search uses a very | |
weak model (eg just checking for checkmate, otherwise counting | |
material, or what have you) and search to generate a much stronger | |
player. You can use that to generate data for training a better | |
model. | |
[1]: http://incompleteideas.net/book/ebook/node108.html | |
[2]: https://en.wikipedia.org/wiki/TD-Gammon | |
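To illustrate the "weak model plus search" half of that point (not the
training-data loop), here is a tiny negamax with alpha-beta pruning over
a bare material count, using the python-chess library; the piece values
and depth are arbitrary choices for the sketch:

  import chess  # pip install python-chess

  VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
            chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

  def material(board):
      # The "very weak model": material balance from the side to move's perspective.
      score = sum(VALUES[p.piece_type] * (1 if p.color == chess.WHITE else -1)
                  for p in board.piece_map().values())
      return score if board.turn == chess.WHITE else -score

  def negamax(board, depth, alpha=-10**6, beta=10**6):
      if board.is_checkmate():
          return -10**5                      # side to move is mated
      if depth == 0 or board.is_game_over():
          return material(board)
      best = -10**6
      for move in board.legal_moves:
          board.push(move)
          best = max(best, -negamax(board, depth - 1, -beta, -alpha))
          board.pop()
          alpha = max(alpha, best)
          if alpha >= beta:                  # alpha-beta cutoff
              break
      return best

  def best_move(board, depth=2):
      scored = []
      for move in board.legal_moves:
          board.push(move)
          scored.append((-negamax(board, depth - 1), move))
          board.pop()
      return max(scored, key=lambda t: t[0])[1]

  print(best_move(chess.Board()))

Even this crude evaluation plays far better with search wrapped around it
than it does on its own, which is the sense in which search amplifies a
weak model.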
kadoban wrote 11 hours 14 min ago: | |
> That has been known since at least the 1990s with TD-Gammon | |
beating the world champions in Backgammon. | |
Yeah, I didn't mean to imply that reinforcement learning (or | |
applying it in this way) is novel. It was just important to work | |
out how to apply that to go specifically. | |
> In a sense, classic chess engines do that, too: | |
alpha-beta-search uses a very weak model (eg just checking for | |
checkmate, otherwise counting material, or what have you) and | |
search to generate a much stronger player. You can use that to | |
generate data for training a better model. | |
I would say that classic chess AIs specifically don't do the | |
important part. They aren't able to use a worse model to, with
computation, train a better model. They can generate training | |
data, but then they have no way to incorporate it back into the | |
AI. | |
rtpg wrote 1 day ago: | |
> However, most levels can be expressed as a few intermediate goals | |
I think generally the whole thing with puzzle games is that you have | |
to determine the "right" intermediate goals. In fact, the naive
intermediate goals are often entirely wrong! | |
A canonical sokoban-like inversion might be where you have to push | |
two blocks into goal areas. You might think "ok, push one block
into its goal area and then push another into it."
But many of these games will have mechanisms meaning you would first | |
want to push one block into its goal, then undo that for some reason | |
(it might activate some extra functionality) push the other block, | |
and then finally go back and do the thing. | |
There's always weird tricks that mean that you're going to walk
backwards before walking forwards. I don't think it's impossible
for these things to stumble into it, though. Just might spin a lot of | |
cycles to get there (humans do too I guess) | |
matsemann wrote 1 day ago: | |
Yeah, often working backwards and forwards at the same time is how | |
to solve some advanced puzzle games. Then you keep it from | |
exploding in options. When thinking backwards from the goal, you | |
figure out constraints or "invariants" the forward path must | |
uphold, thus can discard lots of dead ends earlier in your forward | |
path. | |
To me, those discoveries are the fun part of most puzzle games. | |
When you unlock the "trick" for each level and the dopamine flies, | |
heh. | |
TeMPOraL wrote 22 hours 53 min ago: | |
I usually get a good mileage out of jumping straight in the | |
middle :). Like, "hmm let's look at this block; oh cool, there's | |
enough space around it that I could push it away from goal, for | |
whatever reason". Turns out, if it's possible there usually is a | |
good reason. So whenever I get stuck, I skim every object in the | |
puzzle and consider in isolation, what can I do with it, and this | |
usually gives me anchor points to drive my forward or backward | |
thinking through. | |
captn3m0 wrote 1 day ago: | |
I once made an "RC plays Baba Is You" that controlled the game over
a single shared browser that was streaming video and controls back to | |
the game. Was quite fun! | |
But I am fairly sure all of the Baba Is You solutions are present in the
training data for modern LLMs, so it won't make for a good eval.
chmod775 wrote 1 day ago: | |
> But I am fairly sure all of the Baba Is You solutions are present in
the training data for modern LLMs, so it won't make for a good eval.
Claude 4 cannot solve any Baba Is You level (except level 0 that is | |
solved by 8 right inputs), so for now it's at least a nice low bar to | |
shoot for... | |
ekianjo wrote 1 day ago: | |
this is definitely a case for fine-tuning an LLM on this game's data.
There is currently no LLM out there that is able to play many games of
different kinds very well.
k2xl wrote 1 day ago: | |
Baba Is You is a great game, part of a collection of 2D grid puzzle
games.
(Shameless plug: I am one of the developers of Thinky.gg ( [1] ), which
is a thinky puzzle game site with a 'shortest path' style game
[Pathology] and a Sokoban variant [Sokoath].)
These games are typically NP-hard, so the typical techniques that
solvers have employed for Sokoban (or Pathology) are brute-force search
with varying heuristics (like BFS, deadlock detection, and Zobrist
hashing). However, once levels get beyond a certain size with enough
movable blocks you end up exhausting memory pretty quickly.
These types of games are still "AI Proof" so far in that LLMs are | |
absolutely awful at solving these while humans are very good (so seems | |
reasonable to consider for ARC-AGI benchmarks). Whenever a new
reasoning model gets released I typically try it on some basic | |
Pathology levels (like 'One at a Time' [2] ) and they fail miserably. | |
Simple level code for the above level (1 is a wall, 2 is a movable | |
block, 4 is starting block, 3 is the exit): | |
000 | |
020 | |
023 | |
041 | |
Similar to OP, I've found Claude couldn't manage rule dynamics,
blocked paths, or game objectives well and spat out random results.
[1]: https://thinky.gg | |
[2]: https://pathology.thinky.gg/level/ybbun/one-at-a-time | |
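For concreteness, a minimal sketch of the brute-force baseline described
above: plain BFS over (player, blocks) states with a visited set, run on
that same level code. The push rules here are Sokoban-style assumptions
for the sake of the sketch, not Pathology's or Sokoath's exact mechanics,
and there is no deadlock detection or Zobrist hashing.

  from collections import deque

  LEVEL = ["000",
           "020",
           "023",
           "041"]  # 1 wall, 2 movable block, 4 start, 3 exit (per the legend above)

  def parse(level):
      walls, blocks, start, goal = set(), set(), None, None
      for r, row in enumerate(level):
          for c, ch in enumerate(row):
              if ch == "1": walls.add((r, c))
              elif ch == "2": blocks.add((r, c))
              elif ch == "4": start = (r, c)
              elif ch == "3": goal = (r, c)
      return walls, frozenset(blocks), start, goal, len(level), len(level[0])

  def solve(level):
      walls, blocks, player, goal, h, w = parse(level)
      seen = {(player, blocks)}
      queue = deque([(player, blocks, "")])
      moves = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}
      while queue:
          player, blocks, path = queue.popleft()
          if player == goal:
              return path
          for name, (dr, dc) in moves.items():
              nr, nc = player[0] + dr, player[1] + dc
              if not (0 <= nr < h and 0 <= nc < w) or (nr, nc) in walls:
                  continue
              nblocks = blocks
              if (nr, nc) in blocks:          # assumed Sokoban-style push
                  br, bc = nr + dr, nc + dc
                  if not (0 <= br < h and 0 <= bc < w) or (br, bc) in walls or (br, bc) in blocks:
                      continue
                  nblocks = (blocks - {(nr, nc)}) | {(br, bc)}
              state = ((nr, nc), nblocks)
              if state not in seen:
                  seen.add(state)
                  queue.append(((nr, nc), nblocks, path + name))
      return None  # exhausted the (small) state space

  print(solve(LEVEL))  # prints a short move string, or None if unsolvable under these rules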
eru wrote 23 hours 41 min ago: | |
NP-hardness isn't much of a problem, because the levels are fairly small,
and instances are not chosen to be worst case hard but to be | |
entertaining for humans to solve. | |
SMT/SAT solvers or integer linear programming can get you pretty far. | |
Many classic puzzle games like Minesweeper are NP hard, and you can | |
solve any instance that a human would be able to solve in their | |
lifetime fairly quickly on a computer. | |
kinduff wrote 1 day ago: | |
In the Factorio paper [1], page 3, the agent receives a semantic
representation with coordinates. Have you tried this data format? | |
[1]: https://arxiv.org/pdf/2503.09617 | |
kinduff wrote 1 day ago: | |
Do you think the performance can be improved if the representation of | |
the level is different? | |
I've seen AI struggle with ASCII, but when presented as other data | |
structures, it performs better. | |
edit: | |
e.g. JSON with structured coordinates, graph based JSON, or a semantic | |
representation with the coordinates | |
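For instance, one plausible shape for the "JSON with structured
coordinates" option (the tile names and schema here are invented for
illustration, not any particular benchmark's format), using the small
level code from the Pathology example above:

  import json

  def grid_to_objects(grid, legend):
      # Turn a row-of-strings ASCII grid into a flat list of objects with explicit coordinates.
      return [{"type": legend[ch], "x": x, "y": y}
              for y, row in enumerate(grid)
              for x, ch in enumerate(row)
              if ch in legend]

  grid = ["000",
          "020",
          "023",
          "041"]
  legend = {"1": "wall", "2": "block", "3": "exit", "4": "player"}
  print(json.dumps(grid_to_objects(grid, legend), indent=2))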
QuadmasterXLII wrote 20 hours 50 min ago: | |
These models can "code," but they can't code yet. We'll know
that they can actually code once their performance on these tasks | |
becomes invariant to input representation, because they can just whip | |
up a script to convert representations. | |
RainyDayTmrw wrote 1 day ago: | |
In the limit case, to an actual general intelligence, representation | |
is superfluous, because it can figure out how to convert freely. | |
To the extent that the current generation of AI isn't general, yeah, | |
papering over some of its weaknesses may allow you to expose other | |
parts of it, both strengths and other weaknesses. | |
kadoban wrote 1 day ago: | |
A human can easily struggle at solving a poorly communicated | |
puzzle, especially if paper/pencil or something isn't available to | |
convert to a better format. LLMs can look back at what they wrote, | |
but it seems kind of like a poor format for working out a better | |
representation to me. | |
kinduff wrote 7 hours 21 min ago: | |
I found some papers about this [1][2], and I think the answer is
yes: the format matters, and hence the representation.
I wonder if the author would be willing to try with another
representation.
[1]: Does Prompt Formatting Have Any Impact on LLM Performance?
https://arxiv.org/html/2411.10541v1
[2]: Large Language Models (LLMs) on Tabular Data: Prediction,
Generation, and Understanding - A Survey
https://arxiv.org/html/2402.17944v2
hajile wrote 1 day ago: | |
If it struggles with the representation, that makes it an even better | |
test of the AI's thinking potential. | |
eru wrote 23 hours 44 min ago: | |
I'm not sure. Adding superficial difficulties to an IQ test for | |
humans doesn't (necessarily) improve it as an IQ test. | |