COMMENT PAGE FOR:
The last six months in LLMs, illustrated by pelicans on bicycles
beefnugs wrote 1 hour 43 min ago: | |
I think it's hilarious how humans can make mistakes interpreting the
crazy drawings: He says "I like how it solved the problem of pelicans
not fitting on bicycles by adding a second smaller bicycle to the
stack."
no... that is an attempt at actually drawing the pedals, and putting
the pelican's feet right on the pedals!
buserror wrote 6 hours 22 min ago: | |
The hilarious bit is that this page will soon be scraped by AI bots as
learning material, and they'll all learn to draw pelicans on bicycles
using this as their primary example material, as these will be the only
examples.
GIGO in motion :-) | |
darkoob12 wrote 7 hours 40 min ago: | |
Should we be this excited about AI, calling a fraud and plagiarism
machine a "ChatGPT Mischief Buddy" without any moral deliberation?
simonw wrote 7 hours 13 min ago: | |
The "mischief buddy" joke is a poke at exactly that. | |
0points wrote 13 hours 19 min ago: | |
So the only bird with anything slightly resembling a pelican's beak was
drawn by Gemini 2.5 Pro. In general, none of the outputs resemble a
pelican enough that you could separate them from "a bird".
OP seems to ignore that the pelican has a distinct look when evaluating
these doodles.
simonw wrote 13 hours 17 min ago: | |
The pelican's distinct look - and the fact that none of the models | |
can capture it - is the whole point. | |
irthomasthomas wrote 20 hours 2 min ago: | |
The best pelicans come from running a consortium of models. I use | |
pelicans as evals now. [1] Test it using VibeLab (wip) | |
[1]: https://x.com/xundecidability/status/1921009133077053462 | |
[2]: https://x.com/xundecidability/status/1926779393633857715 | |
m3047 wrote 20 hours 27 min ago: | |
TIL: Snitchbench! | |
NohatCoder wrote 22 hours 28 min ago: | |
If you calculate Elo ratings in a round-robin tournament with all
participants starting on the same score, the resulting ratings should
simply correspond to the win count. I guess the algorithm in use takes
the order of the matches into account, but order only matters when
competitors are expected to develop significantly over the tournament;
otherwise it is just added noise, so we never want it in competitions
between bots.
I also can't help but notice that the competition is exactly one match
short: for some reason, exactly one of the 561 possible pairings has
not been included.
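A minimal sketch of that point, assuming the standard sequential Elo
update with a fixed K-factor (the talk doesn't say which variant was
used): in a single round-robin pass, the same set of results can yield
different final ratings depending purely on match order.
```python
import itertools, random

K = 32  # assumed K-factor; the talk doesn't specify one

def expected(ra, rb):
    # Probability that A beats B under the Elo model
    return 1 / (1 + 10 ** ((rb - ra) / 400))

def run(players, results, order):
    # results[(a, b)] is 1 if a beat b, 0 otherwise; each pair plays once
    ratings = {p: 1500.0 for p in players}
    for a, b in order:
        ea = expected(ratings[a], ratings[b])
        ratings[a] += K * (results[(a, b)] - ea)
        ratings[b] += K * ((1 - results[(a, b)]) - (1 - ea))
    return ratings

players = list("ABCDE")
pairs = list(itertools.combinations(players, 2))
results = {pair: random.choice([0, 1]) for pair in pairs}

# Same wins, different match order, different final ratings
print(run(players, results, pairs))
print(run(players, results, list(reversed(pairs))))
```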
simonw wrote 21 hours 56 min ago: | |
Yeah, that's a good call out: Elo isn't actually necessary if you can | |
have every competitor battle every other competitor exactly once. | |
The missing match is because one single round was declared a draw by | |
the model, and I didn't have time to run it again (the Elo stuff was | |
very much rushed at the last minute.) | |
NicoSchwandner wrote 23 hours 42 min ago: | |
Nice post, thanks! | |
zurichisstained wrote 1 day ago: | |
Wow, I love this benchmark - I've been doing something similar (as a
joke, and much less frequently), where I ask multiple models to
attempt to create a data structure like:
``` | |
const melody = [ | |
{ freq: 261.63, duration: 'quarter' }, // C4 | |
{ freq: 0, duration: 'triplet' }, // triplet rest | |
{ freq: 293.66, duration: 'triplet' }, // D4 | |
{ freq: 0, duration: 'triplet' }, // triplet rest | |
{ freq: 329.63, duration: 'half' }, // E4 | |
] | |
``` | |
But with the intro to Smoke on the Water by Deep Purple. Then I run it | |
through the Web Audio API and see how it sounds. | |
It's never quite gotten it right, but it's gotten better, to the point | |
where I can ask it to make a website that can play it. | |
I think yours is a lot more thoughtful about testing novelty, but it's
interesting to see them attempt to do things that they aren't really
built for (in theory!). [1] - ChatGPT 4 Turbo [2] - Claude Sonnet 3.7
[3] - Gemini 2.5 Pro | |
Gemini is by far the best sounding one, but it's still off. I'd be | |
curious how the latest and greatest (paid) versions fare. | |
(And just for comparison, here's the first time I did it... you can | |
tell I did the front-end because there isn't much to it!) | |
[1]: https://codepen.io/mvattuone/pen/qEdPaoW | |
[2]: https://codepen.io/mvattuone/pen/ogXGzdg | |
[3]: https://codepen.io/mvattuone/pen/ZYGXpom | |
[4]: https://nitter.space/mvattuone/status/1646610228748730368#m | |
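For anyone wanting to hear an attempt like this without wiring up the
Web Audio API, here's a rough stdlib-only sketch that renders the same
data structure to a WAV file. The tempo and the meaning of "triplet"
(one note of a quarter-note triplet) are assumptions, since the comment
doesn't pin them down.
```python
import math, struct, wave

SAMPLE_RATE = 44100
BPM = 120                 # assumed tempo
BEAT = 60 / BPM
DURATIONS = {"quarter": BEAT, "half": 2 * BEAT, "triplet": 2 * BEAT / 3}

# Same shape as the melody structure above; freq 0 means a rest
melody = [
    {"freq": 261.63, "duration": "quarter"},  # C4
    {"freq": 0,      "duration": "triplet"},  # rest
    {"freq": 293.66, "duration": "triplet"},  # D4
    {"freq": 0,      "duration": "triplet"},  # rest
    {"freq": 329.63, "duration": "half"},     # E4
]

frames = bytearray()
for note in melody:
    n = int(SAMPLE_RATE * DURATIONS[note["duration"]])
    for i in range(n):
        sample = 0.0
        if note["freq"]:
            sample = 0.5 * math.sin(2 * math.pi * note["freq"] * i / SAMPLE_RATE)
        frames += struct.pack("<h", int(sample * 32767))  # 16-bit PCM

with wave.open("melody.wav", "wb") as w:
    w.setnchannels(1)            # mono
    w.setsampwidth(2)            # 16-bit samples
    w.setframerate(SAMPLE_RATE)
    w.writeframes(bytes(frames))
```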
ojosilva wrote 20 hours 24 min ago: | |
Drawbacks of using a pelican-on-a-bicycle SVG: it's a very open-ended
prompt with no specific criteria to judge, and lately the SVGs all
start to look similar, or at least like they accomplish the same
non-goals (there's a pelican, there's a bicycle, and I'm not sure
whether its feet should be on the saddle or on the pedals), so it's
hard to agree on which is better. And, certainly, with an LLM as a
judge, the entire game becomes double-hinged and who knows what to
think. Also, if it becomes popular, training sets may pick it up and
improve models unfairly and unrealistically. But that's true of any
known benchmark.
Side note: I'd really like to see the Language Benchmark Game become
a prompt-based languages x models benchmark game. So we could say
model X excels at Python Fasta, etc., although then the risk is that,
again, it becomes training data and the whole thing self-rigs itself.
dr_kretyn wrote 22 hours 33 min ago: | |
I'm slightly confused by your example. What's the actual prompt? Is | |
your expectation that a text model is going to know how to perform | |
the exact song in audio? | |
zurichisstained wrote 20 hours 15 min ago: | |
Ohhh absolutely not, that would be pretty wild - I just wanted to | |
see if it could understand musical notation enough to come up with | |
the correct melody. | |
I know there are far better ways to do gen AI with music, this was | |
just a joke prompt that worked far better than I expected. | |
My naive guess is all of the guitar tabs and signal processing info | |
it's trained on gives it the ability to do stuff like this (albeit | |
not very well). | |
isx726552 wrote 1 day ago: | |
> I've been feeling pretty good about my benchmark! It should stay
useful for a long time... provided none of the big AI labs catch on.
> And then I saw this in the Google I/O keynote a few weeks ago, in a
blink-and-you'll-miss-it moment! There's a pelican riding a bicycle!
They're on to me. I'm going to have to switch to something else.
Yeah, this touches on an issue that makes it very difficult to have a
discussion in public about AI capabilities. Any specific test you talk
about, no matter how small... if the big companies get wind of it, it
will be RLHF'd away, sometimes to the point of absurdity. Just refer
to the old "count the 'r's in strawberry" canard for one example.
lofaszvanitt wrote 6 hours 28 min ago: | |
You push sha512 hashes of things to a GitHub repo plus a short
sentence:
x8 version: still shit
.
.
x15 version: we are getting close, but overall a shit experience :D
This way they won't know what to improve upon. Of course they can buy
access. ;P
When they finally solve your problem, you can reveal what the
benchmark was.
Choco31415 wrote 22 hours 36 min ago: | |
Just tried that canard on GPT-4o and it failed:
"The word "strawberry" contains 2 letter r's."
belter wrote 2 hours 49 min ago: | |
I tried:
strawberry -> DeepSeek, GeminiPro and ChatGPT4o all correctly said
three
strawberrry -> DeepSeek, GeminiPro and ChatGPT4o all correctly said
four
stawberrry -> DeepSeek and GeminiPro both correctly said three
ChatGPT4o, even in a new chat, incorrectly said the word "stawberrry"
contains 4 letter "r" characters, and even provided this useful
breakdown to let me know :-)
Breakdown:
stawberrry -> s, t, a, w, b, e, r, r, r, y -> 4 r's
And then it asked if I meant "strawberry" instead, and said that one
has 2 r's....
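The ground truth here is a one-liner, which is what makes the canard
such a cheap check:
```python
# Letter counts the models keep getting wrong
for word in ("strawberry", "strawberrry", "stawberrry"):
    print(word, word.count("r"))   # 3, 4, 3
```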
simonw wrote 22 hours 52 min ago: | |
Honestly, if my stupid pelican riding a bicycle benchmark becomes | |
influential enough that AI labs waste their time optimizing for it | |
and produce really beautiful pelican illustrations I will consider | |
that a huge personal win. | |
MattRix wrote 23 hours 4 min ago: | |
This is why things like the ARC Prize are better ways of approaching | |
this: | |
[1]: https://arcprize.org | |
whiplash451 wrote 12 hours 39 min ago: | |
Well, ARC-1 did not end well for the competitors of tech giants, and
it's very unclear that ARC-2 won't follow the same trajectory.
joshuajooste05 wrote 1 day ago: | |
Does anyone have any thoughts on privacy/safety regarding what he said
about GPT memory?
I had heard of prompt injection already, but this seems different:
completely out of humans' control. Even when you consider web search
functionality, he is actually right that, more and more, users are
losing control over context.
Is this dangerous atm? Do you think it will become more dangerous in
the future when we chuck even more data into context?
threeseed wrote 21 hours 53 min ago: | |
I've had Cursor/Claude try to call rm -rf on my entire User directory | |
before. | |
The issue is that LLMs have no ability to organise their memory by | |
importance. Especially as the context size gets larger. | |
So when they are using tools they will become more dangerous over | |
time. | |
ActorNightly wrote 1 day ago: | |
Sort of. The thing is, with agentic models you are basically entering
probability space, where the model can take real actions in the form
of HTTP requests if the statistical output leads it there.
Joker_vD wrote 1 day ago: | |
> most people find it difficult to remember the exact orientation of | |
the frame. | |
Isn't it a Λ and a Δ welded together? The bottom left and right
vertices are where the wheels are attached, the middle bottom point is
where the big gear with the pedals is. The lambda is for the front
wheel, because you wouldn't be able to turn it if it was attached to a
delta. Right?
I guess having my first bicycle be a cheap Soviet-era one paid off: I
spent loads of time fiddling with the chain tension and pulling the
chain back onto the gears, so I had to stare at the frame way too much
to have forgotten, even today, the way it looks.
pbronez wrote 1 day ago: | |
There are a lot of structural details that people tend to gloss over.
This was illustrated by an Italian art project: [1]
> back in 2009 I began pestering friends and random strangers. I would
walk up to them with a pen and a sheet of paper asking that they
immediately draw me a men's bicycle, by heart. Soon I found out that
when confronted with this odd request most people have a very hard
time remembering exactly how a bike is made.
[1]: https://www.gianlucagimini.it/portfolio-item/velocipedia/ | |
zahlman wrote 1 day ago: | |
> If you lost interest in local models - like I did eight months ago -
it's worth paying attention to them again. They've got good now!
> As a power user of these tools, I want to stay in complete control of | |
what the inputs are. Features like ChatGPT memory are taking that | |
control away from me. | |
You reap what you sow.... | |
> I already have a tool I built called shot-scraper, a CLI app that | |
lets me take screenshots of web pages and save them as images. I had | |
Claude build me a web page that accepts ?left= and ?right= parameters | |
pointing to image URLs and then embeds them side-by-side on a page. | |
Then I could take screenshots of those two images side-by-side. I | |
generated one of those for every possible match-up of my 34 pelican
pictures - 560 matches in total.
Surely it would have been easier to use a local tool like ImageMagick? | |
You could even have the AI write a Bash script for you. | |
> ... but prompt injection is still a thing. | |
...Why wouldn't it always be? There's no quoting or escaping mechanism | |
that's actually out-of-band. | |
> There's this thing I'm calling the lethal trifecta, which is when
you have an AI system that has access to private data, and potential
exposure to malicious instructions - so other people can trick it into
doing things... and there's a mechanism to exfiltrate stuff.
People in 2025 actually need to be told this. Franklin missed the mark | |
- people today will trip over themselves to give up both their security | |
and their liberty for mere convenience. | |
simonw wrote 1 day ago: | |
I had the LLM write a bash script for me that used my [1] tool - on | |
the basis that it was a neat opportunity to demonstrate another of my | |
own projects. | |
And honestly, even with LLM assistance, getting ImageMagick to output
a 1200x600 image with two SVGs next to each other, correctly resized
to fill their half of the image, sounds pretty tricky. Probably easier
(for Claude) to achieve with HTML and CSS.
[1]: https://shot-scraper.datasette.io/ | |
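The pairing step itself is a few lines. A rough sketch of how the
match-up screenshots could be generated: the comparison-page and image
URLs are hypothetical, the shot-scraper flags are the documented ones,
and this is not necessarily the exact harness Simon ran.
```python
import itertools
import subprocess
from urllib.parse import quote

# Hypothetical locations for the 34 pelican SVGs and the comparison page
images = [f"https://example.com/pelicans/{i}.svg" for i in range(34)]
pairs = list(itertools.combinations(images, 2))
print(len(pairs))  # 561 possible pairings of 34 pictures

for n, (left, right) in enumerate(pairs):
    url = (f"https://example.com/compare/"
           f"?left={quote(left, safe='')}&right={quote(right, safe='')}")
    subprocess.run(
        ["shot-scraper", url, "--width", "1200", "--height", "600",
         "--output", f"match-{n:03d}.png"],
        check=True,
    )
```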
voiper1 wrote 1 day ago: | |
Isn't "left or right" _followed_ by rationale asking it to | |
rationalize it's 1 word answer - I thought we need to get AI to do | |
the chain of though _before_ giving it's answer for it to be more | |
accurate? | |
simonw wrote 1 day ago: | |
Yes it is - I would likely have gotten better results if I'd | |
asked for the rationale first. | |
zahlman wrote 1 day ago: | |
> And honestly, even with LLM assistance getting Image Magick to | |
output a 1200x600 image with two SVGs next to each other that are | |
correctly resized to fill their half of the image sounds pretty | |
tricky. | |
FWIW, the next project I want to look at after my current two, is a | |
command-line tool to make this sort of thing easier. Likely | |
featuring some sort of Lisp-like DSL to describe what to do with | |
the input images. | |
username223 wrote 1 day ago: | |
Interesting timeline, though the most relevant part was at the end, | |
where Simon mentions that Google is now aware of the "pelican on | |
bicycle" question, so it is no longer useful as a benchmark. FWIW, many | |
things outside of the training data will pants these models. I just | |
tried this query, which probably has no examples online, and Gemini | |
gave me the standard puzzle answer, which is wrong: | |
"Say I have a wolf, a goat, and some cabbage, and I want to get them | |
across a river. The wolf will eat the goat if they're left alone, which | |
is bad. The goat will eat some cabbage, and will starve otherwise. How | |
do I get them all across the river in the fewest trips?" | |
A child would pick up that you have plenty of cabbage, but can't leave | |
the goat without it, lest it starve. Also, there's no mention of boat | |
capacity, so you could just bring them all over at once. Useful? | |
Sometimes. Intelligent? No. | |
djherbis wrote 1 day ago: | |
Kaggle recently ran a competition to do just this (draw SVGs from | |
prompts, using fairly small models under the hood). | |
The top results (click on the top Solutions) were pretty impressive: | |
[1]: https://www.kaggle.com/competitions/drawing-with-llms/leaderbo... | |
nine_k wrote 1 day ago: | |
Am I the only one who can't help but see these attempts as much like
the attempts of a kid learning to draw?
Ygg2 wrote 1 day ago: | |
Yes. Kids don't draw that good of a line at the start.
Here is a better example of a start:
[1]: https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTfTfAA... | |
nine_k wrote 1 day ago: | |
Have you tried giving a kid a vector-drawing tool? | |
I did that with my daughter when she was not even 6 years old. The
results were somewhat similar: [1] (Now she's much better, but prefers
raster tools, e.g. [2])
[1]: https://photos.app.goo.gl/XSLnTEUkmtW2n7cX8 | |
[2]: https://www.deviantart.com/sofiac9/art/Ivy-with-riding-gea... | |
pier25 wrote 1 day ago: | |
Definitely getting better but even the best result is not very | |
impressive. | |
jfengel wrote 1 day ago: | |
It's not so great at bicycles, either. None of those are close to | |
rideable. | |
But bicycles are famously hard for artists as well. Cyclists can | |
identify all of the parts, but if you don't ride a lot it can be | |
surprisingly difficult to get all of the major bits of geometry right. | |
mattlondon wrote 1 day ago: | |
Most recent Gemini 2.5 one looks pretty good. Certainly rideable. | |
bredren wrote 1 day ago: | |
Great writeup. | |
This measure of LLM capability could be extended by taking it into the | |
3D domain. | |
That is, having the model write Python code for Blender, then running | |
blender in headless mode behind an API. | |
The talk hints at this, but one-shot prompting likely won't be a broad
enough measurement of capability by this time next year (or perhaps
even now).
So the test could also include an agentic portion that includes
consultation of the latest Blender documentation, or even use of a
search engine for blog entries detailing syntax and technique.
For multimodal input processing, it could take into account a | |
particular photo of a pelican as the test subject. | |
For usability, the objects can be converted to iOS's native 3D format,
which can be viewed in mobile Safari.
I built this workflow, including a service for Blender, as an initial
test of what was possible in October of 2022. It took post-processing
for common syntax errors back then, but I'd imagine the newer LLMs
make those mistakes less often now.
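The headless-Blender half of that pipeline is small. A rough sketch
with placeholder geometry standing in for whatever the model writes;
the bpy calls are real, but this is not the commenter's actual
service.
```python
# Run with: blender --background --python make_scene.py
import bpy

# Start from an empty scene
bpy.ops.wm.read_factory_settings(use_empty=True)

# Placeholder geometry standing in for model-generated code:
# a sphere "body" over a flattened cylinder "wheel"
bpy.ops.mesh.primitive_uv_sphere_add(radius=1.0, location=(0, 0, 2))
bpy.ops.mesh.primitive_cylinder_add(
    radius=0.7, depth=0.1,
    rotation=(1.5708, 0, 0),  # stand the cylinder up like a wheel
    location=(0, 0, 0.7),
)

# Export to glTF; converting to USDZ for iOS Quick Look is a later step
bpy.ops.export_scene.gltf(filepath="scene.glb")
```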
mromanuk wrote 1 day ago: | |
The last animation is hilarious; it represents the AI hype cycle vs.
reality very well.
nowayno583 wrote 1 day ago: | |
That was a very fun recap, thanks for sharing. It's easy to forget how | |
much better these things have gotten. And this was in just six months! | |
Crazy! | |
adrian17 wrote 1 day ago: | |
> This was one of the most successful product launches of all time. | |
They signed up 100 million new user accounts in a week! They had a | |
single hour where they signed up a million new accounts, as this thing | |
kept on going viral again and again and again. | |
Awkwardly, I never heard of it until now. I was aware that at some
point they added the ability to generate images to the app, but I
never realized it was a major thing (plus I already had an offline
Stable Diffusion app on my phone, so it felt like less of an upgrade
to me personally). With so much AI news each week, it feels like
unless you're really invested in the space, it's almost impossible not
to accidentally miss or dismiss some big release.
MattRix wrote 23 hours 8 min ago: | |
To be clear: they already had image generation in ChatGPT, but this | |
was a MUCH better one than what they had previously. Even for you | |
with your stable diffusion app, it would be a significant upgrade. | |
Not just because of image quality, but because it can actually | |
generate coherent images and follow instructions. | |
MIC132 wrote 12 hours 31 min ago: | |
As impressive as it is, for some uses it is still worse than a local
SD model.
It will refuse to generate named anime characters (because of
copyright, or because it just doesn't know them, even ones that aren't
particularly obscure), for example. Or obviously anything even
remotely spicy.
As someone who mostly uses image generation to amuse myself (and | |
not to post it, where copyright might matter) it's honestly | |
somewhat disappointing. But I don't expect any of the major AI | |
companies to release anything without excessive guardrails. | |
bufferoverflow wrote 1 day ago: | |
Have you missed how everyone was Ghiblifying everything? | |
andrepd wrote 21 hours 26 min ago: | |
Oh you mean the trend of the day on the social media monoculture? I | |
don't take that as an indicator of any significance. | |
Philpax wrote 19 hours 1 min ago: | |
One should not be proud of their ignorance. | |
DaSHacka wrote 16 hours 13 min ago: | |
Except when it comes to using social media, where "ignorance" | |
unironically is strength | |
adrian17 wrote 23 hours 21 min ago: | |
I saw that, I just didn't connect it with the newly added multimodal
image generation. I knew variations of style transfer (or LoRA for SD)
had been possible for years, so I assumed it exploded in popularity
purely as a meme, not due to OpenAI making it much more accessible.
Again, I was aware that they added image generation, just not how big
a deal it turned out to be. Think of it like me occasionally noticing
merchandise and TV trailers for a new movie without realizing it
became the new worldwide box office #1.
haiku2077 wrote 1 day ago: | |
Congratulations, you are almost fully unplugged from social media. | |
This product launch was a huge mainstream event; for a few days GPT | |
generated images completely dominated mainstream social media. | |
Semaphor wrote 9 hours 28 min ago: | |
Facebook, Discord, Reddit, HN. Hadn't heard of it either. But for FB,
Reddit, and Discord I strictly curate what I see.
sigmoid10 wrote 9 hours 39 min ago: | |
If you primarily consume text-based social media (HN, Reddit with the
legacy UI) then it's kind of easy to not notice all the new kinds of
image infographics and comics that now completely flood places like
Instagram or LinkedIn.
derwiki wrote 1 day ago: | |
Not sure if this is sarcasm or sincere, but I will take it as | |
sincere haha. I came back to work from parental leave and everyone | |
had that same Studio Ghiblized image as their Slack photo, and I | |
had no idea why. It turns out you really can unplug from social | |
media and not miss anything of value: if it's a big enough deal
you will find out from another channel. | |
stavros wrote 7 hours 43 min ago: | |
Why does everyone keep calling news "social media"? Have I missed | |
a trend? Knowing what my friend Steve is up to is social media, | |
knowing what AI is up to is news. | |
loudmax wrote 1 hour 49 min ago: | |
I'm afraid a lot of Americans consume the news like they | |
consume sports media. They root for their team and select a | |
news stream that presents them with the most favorable | |
coverage. | |
stavros wrote 1 hour 47 min ago: | |
As a non-American, I can assure you that's pretty much | |
everywhere. | |
haiku2077 wrote 3 hours 27 min ago: | |
You did miss a trend: | |
[1]: https://www.pewresearch.org/short-reads/2024/09/17/mor... | |
dgfitz wrote 20 hours 45 min ago: | |
I missed it until this thread. I think I'm proud of myself.
tough wrote 8 hours 20 min ago: | |
You're one of today's lucky 10,000
[1]: https://xkcd.com/1053/ | |
azinman2 wrote 1 day ago: | |
Except this went very mainstream. Lots of "turn myself into a muppet",
"what is the human equivalent of my dog", etc. TikTok is all over
this.
It really is incredible. | |
thierrydamiba wrote 1 day ago: | |
The big trend was around the ghiblification of images. Those images | |
were everywhere for a period of time. | |
herval wrote 1 day ago: | |
They still are. Instagram is full of accounts posting GPT-generated
cartoons (and now Veo 3 videos). I've been tracking the image
generation space from day one, and it never stuck like this before.
simonw wrote 1 day ago: | |
Anecdotally, I've had several conversations with people way | |
outside the hyper-online demographic who have been really | |
enjoying the new ChatGPT image generation - using it for | |
cartoon photos of their kids, to create custom birthday cards | |
etc. | |
I think it's broken out into mainstream adoption and is going | |
to stay there. | |
It reminds me a little of Napster. The Napster UI was terrible, | |
but it let people do something they had never been able to do | |
before: listen to any piece of music ever released, on-demand. | |
As a result people with almost no interest in technology at all | |
were learning how to use it. | |
Most people have never had the ability to turn a photo of their | |
kids into a cute cartoon before, and it turns out that's | |
something they really want to be able to do. | |
herval wrote 1 day ago: | |
Definitely. It's not just online either - half the billboards I see
now are AI. The posters at school. The "we're hiring!" ad at the local
McDonald's. It's... cheaper and faster than any alternative (stock
images, hiring an editor or illustrator, etc.), and most non-technical
people can get exactly what they want in a single shot these days.
Jedd wrote 1 day ago: | |
Yeah, but so were the bored ape NFTs - none of these ephemeral | |
fads are any indication of quality, longevity, legitimacy, or | |
interest. | |
sandspar wrote 13 hours 8 min ago: | |
I just don't understand how people can see "100 million signups | |
in a week" and immediately dismiss it. We're not talking about | |
fidget spinners. I don't get why this sentiment is so common | |
here on HackerNews. It's become a running joke in other online | |
spaces, "HackerNews commenters keep saying that AI is a | |
nothingburger." It's just a groupthink thing I guess, a | |
kneejerk response. | |
otabdeveloper4 wrote 9 hours 36 min ago: | |
> We're not talking about fidget spinners. | |
We're talking about Hitler memes instead? I don't understand | |
your feigned outrage. | |
The actual valid commercial use case for generative images | |
hasn't been found yet. (No, making blog spam prettier is not | |
a good use case.) | |
simonw wrote 7 hours 15 min ago: | |
Everything Everywhere All At Once won a bunch of Oscars. | |
They used generative AI tools for some of their | |
post-production work (achieved by a tiny team), for example | |
to help clean up the backgrounds in the scene with the | |
silent dialog between the two rocks. | |
stavros wrote 7 hours 40 min ago: | |
You're right, nothing has value unless someone figures out | |
how to make money with it. Except OpenAI, apparently, | |
because the fact that people buy ChatGPT to make images | |
doesn't seem to count as a commercial use case. | |
otabdeveloper4 wrote 6 hours 48 min ago: | |
OpenAI is not profitable and we don't know if it ever | |
will be. | |
stavros wrote 6 hours 44 min ago: | |
Have we shifted the goalposts from "something people | |
will pay for" to "needs to be profitable even with | |
massive R&D" then? | |
otabdeveloper4 wrote 5 hours 25 min ago: | |
OpenAI is not "something people will pay for" at the | |
moment though. | |
stavros wrote 5 hours 14 min ago: | |
Except lots of people are paying for it. I'll refer | |
you to the other post on the front page for the | |
calculation that OpenAI would have to get just an | |
extra $10/yr from their users to break even. | |
otabdeveloper4 wrote 3 hours 10 min ago: | |
Your response reminds me of that joke about | |
selling a dollar bill for ninety cents. | |
stavros wrote 3 hours 7 min ago: | |
Your response makes me think we have different | |
definitions for profitability. | |
pintxo wrote 12 hours 35 min ago: | |
I assume, when people dismiss it, they are not looking at it | |
through the business lens and the 100m user signups KPI, but | |
they are dismissing it on technical grounds, as an LLM is | |
just a very big statistical database which seems incapable of | |
solving problems beyond (impressive looking) text/image/video | |
generation. | |
sandspar wrote 12 hours 14 min ago: | |
Makes sense. Although I think that's an error. TikTok is | |
"just" a video sharing site. Joe Rogan is "just" a | |
podcaster. Dumb things that affect lots of people are | |
important. | |
micromacrofoot wrote 23 hours 36 min ago: | |
They're not, but I'm already seeing AI-generated images on billboards
for local businesses; they're in production workflows now and they
aren't going anywhere.
baq wrote 1 day ago: | |
It's hard to think of a worse analogy, TBH. My wife is using ChatGPT
to change photos (still is to this day); she didn't use it or any
other LLM until that feature hit. It is a fad, but it's also a very
useful tool.
Ape NFTs are... ape NFTs. Useless. Pointless. Negative value for most
people.
Jedd wrote 11 hours 14 min ago: | |
I would note that I was replying to a comment about the 'big | |
trend of ghiblification' of images. | |
Reproducing a certain style of image has been a regular fad | |
since profile pictures became a thing sometime last century. | |
I was not meaning to suggest that large language & diffusion | |
models are fads. | |
(I do think their capabilities are poorly understood and/or | |
over-estimated by non-technical and some technical people | |
alike, but that invites a more nuanced discussion.) | |
While I'm sure your wife is getting good value out of the | |
system, whether it's a better fit for purpose, produces a | |
better quality, or provides a more satisfying workflow -- | |
than say a decent free photo editor -- or whether other tools | |
were tried but determined to be too limited or difficult, etc | |
-- only you or her could say. It does feel like a small | |
sample set, though. | |
senthil_rajasek wrote 1 day ago: | |
"My wife is using ChatGPT to change photos (still is to this | |
day), she didnât use it or any other LLM until that feature | |
hit." | |
This is deja vu, except instead of ChatGPT to edit photos it | |
was instagram a decade ago. | |
baq wrote 1 day ago: | |
You either haven't tried it or are just trolling.
senthil_rajasek wrote 23 hours 20 min ago: | |
I am contrasting how Instagram filters gave users some control and
increased the user base with how editing photos with LLMs today is
doing the same and pulling in a wider user base.
djhn wrote 23 hours 33 min ago: | |
I tried it and I don't get it. What and where are the legal use cases?
What can you do with these low-resolution images?
jauntywundrkind wrote 1 day ago: | |
Applying some filters and adding some overlay text is | |
something some folks did, but there's such a massive | |
creative world that's opened up, where all we have to do is | |
ask. | |
mrkurt wrote 1 day ago: | |
If we try really hard, I think we can make an exhaustive list | |
of what viral fads on the internet are not. You made a small | |
start. | |
none of these ephemeral fads are any indication of quality, | |
longevity, legitimacy, interest, substance, endurance, | |
prestige, relevance, credibility, allure, staying-power, | |
refinement, or depth. | |
Aurornis wrote 22 hours 43 min ago: | |
100 million people didn't sign up to make that one image meme and then
never use it again.
That many signups is impressive no matter what. The attempts | |
to downplay every aspect of LLM popularity are getting really | |
tiresome. | |
otabdeveloper4 wrote 9 hours 40 min ago: | |
> 100 million people didn't sign up to make that one image meme and
then never use it again.
Source? They did exactly that. | |
simonw wrote 7 hours 12 min ago: | |
What's your source for saying they did exactly that? | |
jodrellblank wrote 22 hours 31 min ago: | |
I think it sounds far more likely that 100M people signed up to poke
at the latest viral novelty and create one meme than that 100M people
suddenly discovered they had a pressing long-term need for AI images,
all on the same day.
Doesn't it?
ben_w wrote 22 hours 1 min ago: | |
While 100M signing up just for one pic is certainly | |
possible, I note that several hundred million people | |
regularly share photographs of their lunch, so it is very | |
plausible that in signing up for the latest meme | |
generator they found they liked the ability to generate | |
custom images of whatever they consider to be pretty | |
pictures every day. | |
gretch wrote 22 hours 6 min ago: | |
It's neither of these options in this false dichotomy. | |
100M people signed up and did at least 1 task. Then, most | |
likely some % of them discovered it was a useful thing | |
(if for nothing else than just to make more memes), and | |
converted into a MAU. | |
If I had to use my intuition, I would say it's 5% - 10%, | |
which represents a larger product launch than most | |
developers will ever participate in, in the context of a | |
single day. | |
Of course the ongoing stickiness of the MAU also depends | |
on the ability of this particular tool to stay on top | |
amongst increasing competition. | |
oblio wrote 15 hours 39 min ago: | |
Apparently OpenAI is losing money like crazy on this | |
and their conversion rates to paid are abysmal, even | |
for the cheaper licenses. And not even their top | |
subscription covers its cost. | |
Uber at a 10x scale. | |
I should add that, compared to the hype, Uber is a failure at a
global level. Yes, it's still a big company; yes, it's profitable
now; but it was launched 10+ years ago, is only barely becoming net
profitable over its existence, and shows no signs of taking over
the world. Sure, it's big in the US and a few specific markets.
But elsewhere it's either banned for undermining labor practices,
or faces stiff local competition, or it's just not cost-competitive
and won't enter the market, because without the whole "gig economy"
scam it's just a regular taxi company with a better app.
simonw wrote 14 hours 48 min ago: | |
Is that information about their low conversion rates | |
from credible sources? | |
oblio wrote 13 hours 20 min ago: | |
It's quite hard to say for sure, and I will prefix my comment by
saying his blog posts are very long and quite doomerist about
LLMs, but he makes a decent case about OpenAI's financials:
[1] [2] A very solid argument is like the one against propaganda:
it's not so much about what is being said as about what isn't.
OpenAI is basically shouting about every minor achievement from
the rooftops, so the fact that they are remarkably silent about
financial fundamentals says something. At best something
mediocre, more likely something bad.
[1]: https://www.wheresyoured.at/wheres-the-mon... | |
[2]: https://www.wheresyoured.at/openai-is-a-sy... | |
landgenoot wrote 1 day ago: | |
If you would give a human the SVG documentation and ask to write an | |
SVG, I think the results would be quite similar. | |
ramesh31 wrote 1 day ago: | |
>If you would give a human the SVG documentation and ask to write an | |
SVG, I think the results would be quite similar. | |
It certainly would, and it would cost at minimum an hour of the human | |
programmer's time at $50+/hr. Claude does it in seconds for pennies. | |
diggan wrote 1 day ago: | |
Let's give it a try, if you're willing to be the experiment subject :)
The prompt is "Generate an SVG of a pelican riding a bicycle" and | |
you're supposed to write it by hand, so no graphical editor. The | |
specification is here: [1] I'm fairly certain I'd lose interest in | |
getting it right before I got something better than most of those. | |
[1]: https://www.w3.org/TR/SVG2/ | |
zahlman wrote 1 day ago: | |
> The colors use traditional bicycle brown (#8B4513) and a classic | |
blue for the pelican (#4169E1) with gold accents for the beak | |
(#FFD700). | |
The output pelican is indeed blue. I can't fathom where the idea | |
that this is "classic", or suitable for a pelican, could have come | |
from. | |
diggan wrote 1 day ago: | |
My guess would be that it doesn't see the web colors (CSS color
hexes) as proper hex triplets; because of tokenization it could be
something dumb like '#8B','451','3' instead. I think the same issue
happens around runs of multiple special characters too.
cap11235 wrote 13 hours 59 min ago: | |
Qwen3, at least, tokenizes each character of "#8B4513" | |
separately. | |
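That guess is easy to check for any given tokenizer. A quick sketch
with tiktoken - this shows an OpenAI encoding, and as noted above,
Qwen's tokenizer splits differently:
```python
import tiktoken

# cl100k_base is the GPT-4-era encoding; other model families differ
enc = tiktoken.get_encoding("cl100k_base")
for color in ("#8B4513", "#4169E1"):
    tokens = enc.encode(color)
    print(color, [enc.decode([t]) for t in tokens])
```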
zahlman wrote 19 hours 15 min ago: | |
No, it's understanding the colors properly. The SVG that the | |
LLM created does use #4169E1 for the pelican color, and the LLM | |
correctly describes this color as blue. The problem is that | |
pelicans should not be blue. | |
mormegil wrote 1 day ago: | |
Did the testing prompt for LLMs include a clause forbidding the use | |
of any tools? If not, why are you adding it here? | |
simonw wrote 1 day ago: | |
The way I run the pelican on a bicycle benchmark is to use this | |
exact prompt: | |
Generate an SVG of a pelican riding a bicycle | |
And execute it via the model's API with all default settings, not | |
via their user-facing interface. | |
Currently none of the model APIs enable tools unless you ask them | |
to, so this method excludes the use of additional tools. | |
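For reference, here's one way to reproduce that setup with Simon's llm
Python library; the model alias is just an example, and whether this is
the exact harness behind the talk isn't stated here:
```python
import llm  # https://llm.datasette.io/

model = llm.get_model("gpt-4o")  # any installed model alias works
response = model.prompt("Generate an SVG of a pelican riding a bicycle")

with open("pelican.svg", "w") as f:
    # In practice the SVG may need extracting from a Markdown code fence
    f.write(response.text())
```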
diggan wrote 1 day ago: | |
The models that are being put under the "Pelican" testing don't | |
use a GUI to create SVGs (either via "tools" or anything else), | |
they're all Text Generation models so they exclusively use text | |
for creating the graphics. | |
There are 31 posts listed under "pelican-riding-a-bicycle" in | |
case you wanna inspect the methodology even closer: | |
[1]: https://simonwillison.net/tags/pelican-riding-a-bicycle/ | |
wohoef wrote 1 day ago: | |
Quite a detailed image using Claude Sonnet 4:
[1]: https://ibb.co/39RbRm5W | |
spaceman_2020 wrote 1 day ago: | |
I don't know what secret sauce Anthropic has, but in real-world use,
Sonnet is somehow still the best model around. Better than Opus and
Gemini Pro.
diggan wrote 1 day ago: | |
Statements like these are useless without sharing exactly which models
you've tried. Sonnet beats O1 Pro Mode, for example? Not in my
experience, but I haven't tried the latest Sonnet versions, only the
one before, so I wouldn't claim O1 Pro Mode beats everything out
there. Besides, it's so heavily context-dependent that you really need
your own private benchmarks to make heads or tails of this whole
thing.
big_hacker wrote 1 day ago: | |
Honestly the metric which increased the most is the marketing and | |
astroturfing budget of the major players (OpenAI, Anthropic, Google and | |
Deepseek). | |
Say what you want about Facebook but at least they released their | |
flagship model fully open. | |
mdaniel wrote 1 day ago: | |
> model fully open. | |
uh-huh | |
[1]: https://www.llama.com/llama4/license/ | |
franze wrote 1 day ago: | |
Here's Claude Opus Extended Thinking:
[1]: https://claude.ai/public/artifacts/707c2459-05a1-4a32-b393-c61... | |
ramesh31 wrote 1 day ago: | |
Single shot? | |
franze wrote 1 day ago: | |
Two-shot: the first one just generated the SVG, not the shareable HTML
page around it. In the second go it also worked on the SVG, as I did
not forbid it.
deadbabe wrote 1 day ago: | |
As a control, he should go on Fiverr and have a human generate a
pelican riding a bicycle, just to see what the eventual goal is.
gus_massa wrote 1 day ago: | |
Someone did this. Look at this sibling comment by ben_w [1] about an | |
old similar project. | |
[1]: https://news.ycombinator.com/item?id=44216284 | |
zahlman wrote 1 day ago: | |
> back in 2009 I began pestering friends and random strangers. I
would walk up to them with a pen and a sheet of paper asking that
they immediately draw me a men's bicycle, by heart.
Someone commissioned to draw a bicycle on Fiverr would not have to | |
rely on memory of what it should look like. It would take barely | |
any time to just look up a reference. | |
atxtechbro wrote 1 day ago: | |
Thank you, Simon! I really enjoyed your PyBay 2023 talk on embeddings | |
and this is great too! I like the personalized benchmark. Hopefully the | |
big LLM providers don't start gaming the pelican index! | |
dirtyhippiefree wrote 1 day ago: | |
Here's the spot where we see who's TL;DR...
> Claude 4 will rat you out to the feds!
> If you expose it to evidence of malfeasance in your company, and you
tell it it should act ethically, and you give it the ability to send
email, it'll rat you out.
gscott wrote 21 hours 31 min ago: | |
I am interested in this ratting-you-out thing. At some point you'll
have a video feed into an AI from a Jarvis-like headset device; you're
walking down the street and cross in the middle, not at a crosswalk...
does it rat you out? Does it make a list of every crime, no matter how
small? Or just the big ones?
yubblegum wrote 1 day ago: | |
I was looking at that and wondering about swatting via LLMs by | |
malicious users. | |
ben_w wrote 1 day ago: | |
I'd say that's too short. | |
> But it's not just Claude. Theo Browne put together a new benchmark
called SnitchBench, inspired by the Claude 4 System Card.
> It turns out nearly all of the models do the same thing. | |
dirtyhippiefree wrote 1 day ago: | |
I totally agree, but I needed you to post the other half because of
TL;DR...
bravesoul2 wrote 1 day ago: | |
Is there a good model (any architecture) for vector graphics out of | |
interest? | |
simonw wrote 1 day ago: | |
I was impressed by Recraft v3, which gave me an editable vector | |
illustration with different layers - [1] - but as I understand it | |
that one is actually still a raster image generator with a separate | |
step to convert to vector at the end. | |
[1]: https://simonwillison.net/2024/Nov/15/recraft-v3/ | |
bravesoul2 wrote 1 day ago: | |
Now that is a pelican on a bicycle! Thanks | |
JimDabell wrote 1 day ago: | |
See also: The recent history of AI in 32 otters | |
[1]: https://www.oneusefulthing.org/p/the-recent-history-of-ai-in-3... | |
pbhjpbhj wrote 1 day ago: | |
That is otterly fantastic. The post there shows the breadth too - | |
both otters generated via text representations (in TikZ) and by image | |
generators. The video at the end, wow (and funny too). | |
Thanks for sharing. | |
qwertytyyuu wrote 1 day ago: | |
[1] Here are a few I tried with the models; it looks like the newer
version of Gemini is another improvement?
[1]: https://imgur.com/a/mzZ77xI | |
puttycat wrote 1 day ago: | |
The bicycles are still very far from actual ones.
pjs_ wrote 1 day ago: | |
[1]: https://www.gianlucagimini.it/portfolio-item/velocipedia/ | |
simonw wrote 1 day ago: | |
I think the most recent Gemini Pro bicycle may be the best yet - | |
the red frame is genuinely the right shape. | |
layer8 wrote 1 day ago: | |
The pelican, on the other hand... | |
anon373839 wrote 1 day ago: | |
Enjoyable write-up, but why is Qwen 3 conspicuously absent? It was a
really strong release, especially the fine-grained MoE, which is
unlike anything that's come before (in terms of capability and speed
on consumer hardware).
simonw wrote 1 day ago: | |
Omitting Qwen 3 is my great regret about this talk. Honestly I only | |
realized I had missed it after I had delivered the talk! | |
It's one of my favorite local models right now, I'm not sure how I | |
missed it when I was reviewing my highlights of the last six months. | |
Maxious wrote 1 day ago: | |
Cut for time - Qwen 3 was pelican-tested too:
[1]: https://simonwillison.net/2025/Apr/29/qwen-3/ | |
nathan_phoenix wrote 1 day ago: | |
My biggest gripe is that he's comparing probabilistic models (LLMs) by | |
a single sample. | |
You wouldn't compare different random number generators by taking one | |
sample from each and then concluding that generator 5 generates the | |
highest numbers... | |
Would be nicer to run the comparison with 10 images (or more) for each | |
LLM and then average. | |
timewizard wrote 1 day ago: | |
My biggest gripe is he didn't include a picture of an actual pelican. | |
[1] The "closest pelican" is not even close. | |
[1]: https://www.google.com/search?q=pelican&udm=2 | |
mooreds wrote 1 day ago: | |
My biggest gripe is that he outsourced evaluation of the pelicans to | |
another LLM. | |
I get it was way easier to do and that doing it took pennies and no | |
time. But I would have loved it if he'd tried alternate methods of | |
judging and seen what the results were. | |
Other ways: | |
* wisdom of the crowds (have people vote on it) | |
* wisdom of the experts (send the pelican images to a few dozen | |
artists or ornithologists) | |
* wisdom of the LLMs (use more than one LLM) | |
Would have been neat to see what the human consensus was and if it | |
differed from the LLM consensus | |
Anyway, great talk! | |
zahlman wrote 1 day ago: | |
It would have been interesting to see if the LLM that Claude judged | |
worst would have attempted to justify itself.... | |
qeternity wrote 1 day ago: | |
I think you mean non-deterministic, instead of probabilistic. | |
And there is no reason that these models need to be | |
non-deterministic. | |
rvz wrote 1 day ago: | |
> I think you mean non-deterministic, instead of probabilistic. | |
My thoughts too. It's more accurate to label LLMs as non-deterministic
instead of "probabilistic".
skybrian wrote 1 day ago: | |
A deterministic algorithm can still be unpredictable in a sense. In | |
the extreme case, a procedural generator (like in Minecraft) is | |
deterministic given a seed, but you will still have trouble | |
predicting what you get if you change the seed, because internally | |
it uses a (pseudo-)random number generator. | |
So there's still the question of how controllable the LLM really is.
If you change a prompt slightly, how unpredictable is the change? That
can't be tested with one prompt.
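The Minecraft analogy in code form - deterministic given the seed, but
with no smooth relationship between seed and output (a generic sketch,
not anything from the talk):
```python
import random

def generate(seed):
    # Fully deterministic for a given seed...
    rng = random.Random(seed)
    return [rng.randint(0, 9) for _ in range(5)]

print(generate(42) == generate(42))  # True: same seed, same output
print(generate(42), generate(43))    # ...but nearby seeds look unrelated
```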
simonw wrote 1 day ago: | |
It might not be 100% clear from the writing but this benchmark is | |
mainly intended as a joke - I built a talk around it because it's a | |
great way to make the last six months of model releases a lot more | |
entertaining. | |
I've been considering an expanded version of this where each model | |
outputs ten images, then a vision model helps pick the "best" of | |
those to represent that model in a further competition with other | |
models. | |
(Then I would also expand the judging panel to three vision LLMs from | |
different model families which vote on each round... partly because | |
it will be interesting to track cases where the judges disagree.) | |
I'm not sure if it's worth me doing that though since the whole | |
"benchmark" is pretty silly. I'm on the fence. | |
dilap wrote 1 day ago: | |
Joke or not, it still correlates much better with my own subjective | |
experiences of the models than LM Arena! | |
fzzzy wrote 1 day ago: | |
Even if it is a joke, having a consistent methodology is useful. I | |
did it for about a year with my own private benchmark of reasoning | |
type questions that I always applied to each new open model that | |
came out. Run it once and you get a random sample of performance. | |
Got unlucky, or got lucky? So what. That's the experimental | |
protocol. Running things a bunch of times and cherry picking the | |
best ones adds human bias, and complicates the steps. | |
simonw wrote 1 day ago: | |
It wasn't until I put these slides together that I realized quite | |
how well my joke benchmark correlates with actual model | |
performance - the "better" models genuinely do appear to draw | |
better pelicans and I don't really understand why! | |
og_kalu wrote 23 hours 57 min ago: | |
LLMs also have a 'g factor' | |
[1]: https://www.sciencedirect.com/science/article/pii/S016... | |
johnrob wrote 1 day ago: | |
Well, the most likely single random sample would be a
"representative" one :)
tuananh wrote 1 day ago: | |
until they start targeting this benchmark | |
simonw wrote 1 day ago: | |
Right, that was the closing joke for the talk. | |
jonstewart wrote 1 day ago: | |
It is funny to think that a hundred years in the future there may be
some vestigial area of the models' networks that's still tuned to
drawing pelicans on bicycles.
more-nitor wrote 1 day ago: | |
I just don't get the fuss from the pro-LLM people who don't want
anyone to shame their LLMs...
People expect LLMs to say "correct" stuff on the first attempt, not
the 10,000th attempt.
Yet these people are perfectly OK with cherry-picked success stories
on YouTube + advertisements, while being extremely vehement about this
simple experiment...
...well, maybe these people rode the LLM hype train too early, and are
desperate to defend LLMs lest their investment go poof?
Obligatory hype-graph classic:
[1]: https://upload.wikimedia.org/wikipedia/commons/thumb/9... | |
MichaelZuo wrote 1 day ago: | |
I imagine the straightforward reason is that the "better" models are
in fact significantly smarter in some tangible way, somehow.
pama wrote 1 day ago: | |
How did the pelicans of point releases of V3 and of R1 | |
(R1-0528) do compared to the original versions of the models? | |
demosthanos wrote 1 day ago: | |
I'd say definitely do not do that. That would make the benchmark | |
look more serious while still being problematic for knowledge | |
cutoff reasons. Your prompt has become popular even outside your | |
blog, so the odds of some SVG pelicans on bicycles making it into | |
the training data have been going up and up. | |
Karpathy used it as an example in a recent interview: | |
[1]: https://www.msn.com/en-in/health/other/ai-expert-asks-grok... | |
telotortium wrote 19 hours 28 min ago: | |
Yeah, Simon needs to release a new benchmark under a pen name, | |
like Stephen King did with Richard Bachman. | |
throwaway31131 wrote 1 day ago: | |
I'd say it doesn't really matter. There is no universally good
benchmark, and really they should only be used to answer very specific
questions which may or may not be relevant to you.
Also, as the old saying goes, the only thing worse than using
benchmarks is not using benchmarks.
6LLvveMx2koXfwn wrote 1 day ago: | |
I would definitely say he had no intention of doing that and was | |
doubling down on the original joke. | |
colecut wrote 1 day ago: | |
The road to hell is paved with the best intentions | |
clarification: I enjoyed the pelican on a bike and don't think | |
it's that bad =p | |
diggan wrote 1 day ago: | |
Yeah, this is the problem with benchmarks where the | |
questions/problems are public. They're valuable for some months, | |
until they bleed into the training set. I'm certain a lot of the
"improvements" we're seeing are just benchmarks leaking into the | |
training set. | |
travisgriggs wrote 1 day ago: | |
That's OK; once bicycle-"riding" pelicans become normative, we can
ask it for images of pelicans humping bicycles.
The number of subject-verb-objects is near infinite. All are
imaginable, but most are not plausible. A plausibility machine (LLM)
will struggle with the implausible, until it can abstract well.
zahlman wrote 1 day ago: | |
I can't fathom this working, simply because building a model | |
that relates the word "ride" to "hump" seems like something | |
that would be orders of magnitude easier for an LLM than | |
visualizing the result of SVG rendering. | |
diggan wrote 1 day ago: | |
> The number of subject-verb-objects is near infinite. All are
imaginable, but most are not plausible
Until there are enough unique/new subject-verb-object
examples/benchmarks that the trained model actually generalizes, just
like you did. (Public) benchmarks need to constantly evolve, otherwise
they stop being useful.
demosthanos wrote 1 day ago: | |
To be fair, once it does generalize the pattern, then the benchmark is
actually measuring something useful for deciding whether the model
will be able to produce a subject-verb-object SVG.
ontouchstart wrote 1 day ago: | |
Very nice talk, acceptable to the general public and to AI agents as
well.
Any concerns that open-source "AI celebrity talks" like yours could be
used in contexts that would allow LLM models to optimize their market
share in ways that we can't imagine yet?
Your talk might influence the funding of AI startups.
#butterflyEffect
threecheese wrote 1 day ago: | |
I welcome a VC-funded pelican... anything! Clippy 2.0 maybe?
Simon, hope you are comfortable in your new role of AI Celebrity.
planb wrote 1 day ago: | |
And by a sample that has become increasingly known as a benchmark.
Newer training data will contain more articles like this one, which
naturally improves the capabilities of an LLM to estimate what's
considered a good "pelican on a bike".
viraptor wrote 1 day ago: | |
Would it though? There really aren't that many valid answers to | |
that question online. When this is talked about, we get more broken | |
samples than reasonable ones. I feel like any talk about this | |
actually sabotages future training a bit. | |
I actually don't think I've seen a single correct svg drawing for | |
that prompt. | |
criddell wrote 1 day ago: | |
And that's why he says he's going to have to find a new benchmark.
cyanydeez wrote 1 day ago: | |
So what you really need to do is clone this blog post, find and | |
replace pelican with any other noun, run all the tests, and publish | |
that. | |
Call it wikipediaslop.org | |
YuccaGloriosa wrote 1 day ago: | |
If the any other noun becomes fish... I think I disagree. | |
puttycat wrote 1 day ago: | |
You are right, but the companies making these models invest a lot of | |
effort in marketing them as anything but probabilistic, i.e. making | |
people think that these models work discretely like humans. | |
In that case we'd expect a human with perfect drawing skills and | |
perfect knowledge about bikes and birds to output such a simple | |
drawing correctly 100% of the time. | |
In any case, even if a model is probabilistic, if it had correctly | |
learned the relevant knowledge you'd expect the output to be perfect | |
because it would serve to lower the model's loss. These outputs | |
clearly indicate flawed knowledge. | |
bufferoverflow wrote 1 day ago: | |
> work discretely like humans | |
What kind of humans are you surrounded by? | |
Ask any human to write 3 sentences about a specific topic. Then ask | |
them the same exact question next day. They will not write the same | |
3 sentences. | |
cyanydeez wrote 1 day ago: | |
Humans absolutely do not work discretely. | |
loloquwowndueo wrote 1 day ago: | |
They probably meant deterministically as opposed to probabilistically.
Humans don't work like that either :)
aspenmayer wrote 1 day ago: | |
I thought they meant discreetly. | |
ben_w wrote 1 day ago: | |
> In that case we'd expect a human with perfect drawing skills and | |
perfect knowledge about bikes and birds to output such a simple | |
drawing correctly 100% of the time. | |
Look upon these works, ye mighty, and despair: | |
[1]: https://www.gianlucagimini.it/portfolio-item/velocipedia/ | |
rightbyte wrote 1 day ago: | |
That blog post is a 10/10. Oh dear I miss the old internet. | |
jodrellblank wrote 1 day ago: | |
You claim those are drawn by people with "perfect knowledge about | |
bikes" and "perfect drawing skills"? | |
ben_w wrote 1 day ago: | |
More that "these models work ⦠like humans" (discretely or | |
otherwise) does not imply the quotation. | |
Most humans do not have perfect drawing skills and perfect | |
knowledge about bikes and birds, they do not output such a | |
simple drawing correctly 100% of the time. | |
"Average human" is a much lower bar than most people want to | |
believe, mainly because most of us are average on most skills, | |
and also overestimate our own competence â the modal human | |
has just a handful of things they're good at, and one of those | |
is the language they use, another is their day job. | |
Most of us can't draw, and demonstrably can't remember (or | |
figure out from first principles) how a bike works. But this | |
also applies to "smart" subsets of the population: physicists | |
have [1], and there's this famous rocket scientist who weighed in on
rescuing kids from a flooded cave; he came up with some nonsense about
a submarine.
[1]: https://xkcd.com/793/ | |
Retric wrote 1 day ago: | |
It's not that humans have perfect drawing skills, it's that humans
can judge their performance and get better over time.
Ask 100 random people to draw a bike in 10 minutes and they'll on
average suck while still beating the LLMs here. Give them an
incentive and 10 months and the average person is going to be able
to make at least one quite decent drawing of a bike.
The cost and speed advantage of LLMs is real as long as you're fine
with extremely low quality. Ask a model for 10,000 drawings so you
can pick the best and you get marginal improvements based on random
chance, at a steep price.
ben_w wrote 1 day ago: | |
> Ask 100 random people to draw a bike in 10 minutes and
they'll on average suck while still beating the LLMs here.
Y'see, this is a prime example of what I meant with | |
""Average human" is a much lower bar than most people want | |
to believe, mainly because most of us are average on most | |
skills, and also overestimate our own competence". | |
An expert artist can spend 10 minutes and end up with a | |
brief sketch of a bike. You can witness this exact duration | |
yourself (with non-bike examples) because of a challenge a | |
few years back to draw the same picture in 10 minutes, 1 | |
minute, and 10 seconds. | |
A normal person spending as much time as they like gets you | |
the pictures that I linked to in the previous post, because | |
they don't really know what a bike is. 45 examples of what | |
normal people think a bike looks like: [1] > Give 'em an
incentive and 10 months and the average person is going
to be able to make at least one quite decent drawing of a
bike.
Given mandatory art lessons in school are longer than 10 | |
months, and yet those bike examples exist, I have no reason | |
to believe this. | |
> Ask a model for 10,000 drawings so you can pick the best
and you get a marginal improvement based on random chance
at a steep price.
If you do so as a human, rating and comparing images? Then | |
the cost is your own time. | |
If you automate it in literally the manner in this write-up | |
(pairwise comparison via API calls to another model to get | |
ELO ratings), ten thousand images is like $60-$90, which is | |
on the low end for a human commission. | |
[1]: https://www.gianlucagimini.it/portfolio-item/veloc... | |
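A minimal sketch of that automated pairwise-Elo idea, assuming a
round-robin over the images. judge() is a placeholder for the API
call to a vision model, and the filenames are made up; this is not
Simon's actual harness:
```
import itertools
import random

K = 32  # standard Elo K-factor

def expected(r_a, r_b):
    # Expected score of A against B under the Elo model.
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def judge(image_a, image_b):
    # Placeholder for a vision-model API call that picks the
    # better pelican; returns 1 if A wins, 0 if B wins.
    return random.randint(0, 1)

def rate(images):
    ratings = {img: 1500.0 for img in images}
    for a, b in itertools.combinations(images, 2):
        score_a = judge(a, b)
        exp_a = expected(ratings[a], ratings[b])
        ratings[a] += K * (score_a - exp_a)
        ratings[b] += K * ((1 - score_a) - (1 - exp_a))
    return ratings

print(rate(["o3.svg", "claude-4-opus.svg", "gemini-2.5-pro.svg"]))
```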
Retric wrote 1 day ago: | |
As an objective criterion: what percentage include pedals
and a chain connecting one of the wheels? I quickly found
a dozen and stopped counting. Now do the same for those
LLM images and it's clear humans win.
> ""Average human" is a much lower bar than most people | |
want to believe | |
I have some basis for comparison. I've seen 6 year
olds draw better bikes than those LLMs.
Look through that list again: the worst example doesn't
even have wheels, and multiple of them have wheels that
aren't connected to anything.
Now if you're arguing the average human is worse than
the average 6 year old, I'm going to disagree here.
> Given mandatory art lessons in school are longer than | |
10 months, and yet those bike examples exist, I have no | |
reason to believe this. | |
Art lessons don't cumulatively spend 10 months teaching
people how to draw a bike. I don't think I
cumulatively spent 6 months drawing anything. Painting,
collage, sculpture, coloring, etc. - art covers a lot and
wasn't an every day or even every year thing. My
mandatory college class was art history; we didn't
create any art.
You may have spent more time in class studying drawing,
but that's not some universal average.
> If you automate it in literally the manner in this | |
write-up (pairwise comparison via API calls to another | |
model to get ELO ratings), ten thousand images is like | |
$60-$90, which is on the low end for a human commission. | |
Not every one of those images had a price tag, but one was
88 cents; × 10,000 = $8,800 just to make the images for a
test. Even at 4¢/image you're looking at $400. Cheaper
models existed but fairly consistently had worse
performance.
simonw wrote 1 day ago: | |
The 88 cent one was the most expensive by almost an
order of magnitude. Most of these cost less than a cent
to generate - that's why I highlighted the price on the
o1 pro output.
Retric wrote 1 day ago: | |
Yes, but if you're averaging cheap and expensive
options, the expensive ones make a significant
difference. Cheap is bounded by 0, so it can't
differ as much from the average.
Also, when you're talking about how cheap something
is, including the price makes sense. I had no idea
on many of those models.
simonw wrote 1 day ago: | |
If you're interested, you can get cost estimates | |
from my pricing calculator site here: [1] That link | |
seeds it with 11 input tokens and 1200 output | |
tokens - 11 input tokens is what most models use | |
for "Generate an SVG of a pelican riding a bicycle" | |
and 1200 is the number of output tokens used for | |
some of the larger outputs. | |
Click on different models to see estimated prices. | |
They range from 0.0168 cents for Amazon Nova Micro | |
(that's less than 2/100ths of a cent) up to 72 | |
cents for o1-pro. | |
The most expensive model most people would consider | |
is Claude 4 Opus, at 9 cents. | |
GPT-4o is the upper end of the most common prices, | |
at 1.2 cents. | |
[1]: https://www.llm-prices.com/#it=11&ot=1200 | |
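The arithmetic behind that calculator is just tokens times
per-token price. A sketch, using per-million-token prices that
approximate the published rates at the time (treat them as
illustrative; they drift):
```
# Cost = tokens * per-token price, summed over input and output.
# Prices are illustrative (USD per million tokens), not live quotes.
PRICES = {
    "amazon-nova-micro": (0.035, 0.14),
    "gpt-4o": (2.50, 10.00),
    "o1-pro": (150.00, 600.00),
}

def cost_usd(model, input_tokens, output_tokens):
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# ~11 input tokens for the prompt, ~1200 output tokens for a big SVG.
for model in PRICES:
    print(f"{model}: ${cost_usd(model, 11, 1200):.6f}")
```
Run on the 11-input/1200-output scenario this lands roughly on the
figures above: a fraction of a cent for Nova Micro, about 1.2 cents
for GPT-4o, about 72 cents for o1-pro.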
Retric wrote 1 day ago: | |
Thanks | |
zahlman wrote 1 day ago: | |
> A normal person spending as much time as they like gets | |
you the pictures that I linked to in the previous post, | |
because they don't really know what a bike is. 45 | |
examples of what normal people think a bike looks like: | |
[1] A normal person given the ability to consult a | |
picture of a bike while drawing will do much better. An | |
LLM agent can effectively refresh its memory (or attempt | |
to look up information on the Internet) any time it | |
wants. | |
[1]: https://www.gianlucagimini.it/portfolio-item/vel... | |
ben_w wrote 10 hours 44 min ago: | |
> A normal person given the ability to consult a | |
picture of a bike while drawing will do much better. An | |
LLM agent can effectively refresh its memory (or | |
attempt to look up information on the Internet) any | |
time it wants. | |
Some models can when allowed to, but I don't believe
Simon Willison was testing that?
joshstrange wrote 1 day ago: | |
I really enjoy Simon's work in this space. I've read almost every
blog post they've posted on this and I love seeing them poke and prod
the models to see what pops out. The CLI tools are all very easy to use
and complement each other nicely, all without trying to do too much by
themselves.
And at the end of the day, it's just so much fun to see someone else
having so much fun. He's like a kid in a candy store and that
excitement is contagious. After reading every one of his blog posts,
I'm inspired to go play with LLMs in some new and interesting way.
Thank you Simon! | |
blackhaj7 wrote 1 day ago: | |
Same sentiment! | |
dotemacs wrote 1 day ago: | |
The same here. | |
Because of him, I installed an RSS reader so that I don't miss any
of his posts. And I know that he shares the same ones across
Twitter, Mastodon & Bsky...
neepi wrote 1 day ago: | |
My only take home is they are all terrible and I should hire a | |
professional. | |
vunderba wrote 23 hours 59 min ago: | |
This test isn't really about the quality of the image itself | |
(multimodals like gpt-image-1 or even standard diffusion models would | |
be far superior) - it's about following a spec that describes how to | |
draw. | |
A similar test would be if you asked for the pelican on a bicycle | |
through a series of LOGO instructions. | |
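For the curious, instruction-driven drawing in that style looks
something like this in Python's turtle module (a descendant of LOGO).
The shapes are a hand-written guess; the point is that coordinates
have to be planned blind, with no visual feedback:
```
import turtle

t = turtle.Turtle()

def wheel(x, y, radius=40):
    # Move to the bottom of the circle, then draw it.
    t.penup()
    t.goto(x, y - radius)
    t.pendown()
    t.circle(radius)

wheel(-80, 0)   # rear wheel
wheel(80, 0)    # front wheel
t.penup()
t.goto(-80, 0)
t.pendown()
t.goto(0, 60)   # crude frame
t.goto(80, 0)

turtle.done()
```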
spaceman_2020 wrote 1 day ago: | |
My only take home is that a spanner can work as a hammer, but you | |
probably should just get a hammer | |
jug wrote 1 day ago: | |
Before that, you might ask ChatGPT to create a vector image of a
pelican riding a bicycle and then run the output through a PNG-to-SVG
converter...
Result: [1] These are tough benchmarks: they trial reasoning by having
the model _write_ an SVG file by hand, understanding how the markup
has to be structured to achieve this. Even a professional would
struggle with that! It's _not_ a benchmark that gives an AI the best
tools to actually do this.
[1]: https://www.dropbox.com/scl/fi/8b03yu5v58w0o5he1zayh/pelican... | |
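That raster-to-vector route can be scripted. A rough sketch, assuming
Pillow and the potrace CLI are installed, with pelican.png standing in
for the image model's output:
```
import subprocess
from PIL import Image

# Threshold the PNG down to a 1-bit bitmap, the input potrace expects.
Image.open("pelican.png").convert("1").save("pelican.pbm")

# potrace's -s flag selects its SVG backend.
subprocess.run(["potrace", "-s", "pelican.pbm", "-o", "pelican.svg"],
               check=True)
```
Note the trace only captures a monochrome silhouette, so color and
interior detail are lost along the way.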
YuccaGloriosa wrote 1 day ago: | |
I think you made an error there; PNG is a bitmap format.
sethaurus wrote 1 day ago: | |
You've misunderstood. The parent was making a specific point:
if you want an SVG of a pelican, the easiest way to AI-generate
it is to get an image generator to create a (vector-styled)
bitmap, then auto-vectorize it to SVG. But the point of this
benchmark is that it's asking models to create an SVG the hard
way, by writing the code directly.
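"Writing the code directly" means emitting shapes as text with no
visual feedback. A hand-rolled illustration of what that entails
(coordinates guessed blind, roughly the level the weaker models
manage):
```
# Write a crude pelican-on-a-bicycle SVG by hand: two wheels, a
# body, a beak, all placed by guessing coordinates.
svg = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="120">
  <circle cx="50" cy="90" r="25" fill="none" stroke="black"/>
  <circle cx="150" cy="90" r="25" fill="none" stroke="black"/>
  <ellipse cx="100" cy="45" rx="30" ry="18" fill="white" stroke="black"/>
  <path d="M130 45 L160 38" stroke="orange" stroke-width="4"/>
</svg>"""

with open("pelican.svg", "w") as f:
    f.write(svg)
```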
GaggiX wrote 1 day ago: | |
An expert at writing SVGs? | |
keiferski wrote 1 day ago: | |
As the other guy said, these are text models. If you want to make | |
images use something like Midjourney. | |
Prompting a pelican riding a bicycle makes a decent image there.
matkoniecz wrote 1 day ago: | |
it depends on the quality you need and your budget
neepi wrote 1 day ago: | |
Ah yes the race to the bottom argument. | |
ben_w wrote 1 day ago: | |
When I was at university, they got some people from industry to | |
talk to us all about our CVs and how to do interviews. | |
My CV had a stupid cliché, "committed to quality", which they
correctly picked up on: "What do you mean?" one of them asked
me, directly.
I thought this meant I was focussed on being the best. He didn't | |
like this answer. | |
His example, blurred by 20 years of my imperfect human memory,
was to ask me which is better: a Porsche, or a go-kart. Now,
obviously (or I wouldn't be saying this), Porsche was a trick
answer. Less obvious is that both were trick answers, because
their point was that the question was under-specified: quality
is the match between the product and what the user actually
wants. So if the user is a 10 year old who physically isn't big
enough to sit in a real car's driver's seat and just wants to
rush down a hill or along a track, none of the "quality" stuff
that makes a Porsche a Porsche is of any relevance at all, but
what does matter is the stuff that makes a go-kart into a
go-kart... one of which is affordability.
LLMs are go-karts of the mind. Sometimes that's all you need. | |
neepi wrote 1 day ago: | |
I disagree. Quality depends on your market position and what | |
you are bringing to the market. Thus I would start with market | |
conditions and work back to quality. If you can't reach your | |
standards in the market then you shouldn't enter it. And if | |
your standards are poor, you should be ashamed. | |
Go-kart or Porsche is irrelevant.
ben_w wrote 1 day ago: | |
> Quality depends on your market position and what you are | |
bringing to the market. | |
That's the point. | |
The market for go-karts does not support Porsche.
If you bring a Porsche sales team to a go-kart race, nobody
will be interested.
Porsche doesn't care about this market. It goes both ways:
this market doesn't care about Porsche, either.
dist-epoch wrote 1 day ago: | |
Most of them are text-only models. Like asking a person born blind to | |
draw a pelican, based on what they heard it looks like. | |
neepi wrote 1 day ago: | |
That seems to be a completely inappropriate use case? | |
I would not hire a blind artist or a deaf musician. | |
wongogue wrote 1 day ago: | |
Even Beethoven? | |
simonw wrote 1 day ago: | |
Yeah, that's part of the point of this. Getting a state of the | |
art text generating LLM to generate SVG illustrations is an | |
inappropriate application of them. | |
It's a fun way to deflate the hype. Sure, your new LLM may have | |
cost XX million to train and beat all the others on the | |
benchmarks, but when you ask it to draw a pelican on a bicycle it | |
still outputs total junk. | |
dist-epoch wrote 1 day ago: | |
tried starting from an image: [1] lol: | |
[1]: https://chatgpt.com/share/684582a0-03cc-8006-b5b5-de51... | |
[2]: https://gemini.google.com/share/4d1746a234a8 | |
dmd wrote 1 day ago: | |
Sorry, Beethoven, you just don't seem to be a match for our
org. Best of luck on your search!
You too, Monet. Scram.
__alexs wrote 1 day ago: | |
I guess the idea is that by asking the model to do something that | |
is inherently hard for it we might learn something about the | |
baseline smartness of each model which could be considered a | |
predictor for performance at other tasks too. | |
namibj wrote 1 day ago: | |
It's a proxy for abstract designing, like writing software or
designing in parametric CAD.
Most of the non-math design work of applied engineering, AFAIK,
falls under the umbrella that's tested with the pelican riding
the bicycle.
You have to make a mental model and then turn it into applicable | |
instructions. | |
Program code/SVG markup/parametric CAD instructions don't really | |
differ in that aspect. | |
neepi wrote 1 day ago: | |
I would not assume that this methodology applies to applied | |
engineering, as a former actual real tangible meat space | |
engineer. Things are a little nuanced and the nuances come from | |
a combination of communication and experience, neither of which | |
any LLM has any insight into at all. It's not out there on the | |
internet to train it with and it's not even easy to put it into | |
abstract terms which can be used as training data. And | |
engineering itself in isolation doesn't exist - there is a | |
whole world around it. | |
Ergo, no, you can't just throw "a bicycle" into an LLM and have
a parametric model drop out into SolidWorks, then a machine
makes it and everyone buys it. That is the hope, really, isn't
it? You end up with a useless shitty bike with a shit pelican
on it.
The biggest problem we have in the LLM space is that no one
really understands any of the proposed use cases well enough -
and neither does anyone being told that it works for those use
cases.
rjsw wrote 1 day ago: | |
I don't think any of that matters, CEOs will decide to use it | |
anyway. | |
neepi wrote 1 day ago: | |
This is sad but true. | |
dist-epoch wrote 1 day ago: | |
[1]: https://www.solidworks.com/lp/evolve-your-design-wor... | |
neepi wrote 1 day ago: | |
Yeah good luck with that. Seriously. | |
dist-epoch wrote 1 day ago: | |
The point is about exploring the capabilities of the model. | |
Like asking you to draw a 2D projection of a 4D sphere intersected
with a 4D torus or something.
kevindamm wrote 1 day ago: | |
Yeah, I suppose it is similar... I don't know their diameters,
rotations, nor the distance between their centers, nor which | |
two dimensions, so I would have to guess a lot about what you | |
meant. | |