Introduction
COMMENT PAGE FOR:
The last six months in LLMs, illustrated by pelicans on bicycles
beefnugs wrote 1 hour 43 min ago:
I think it's hilarious how humans can make mistakes interpreting the
crazy drawings. He says "I like how it solved the problem of pelicans
not fitting on bicycles by adding a second smaller bicycle to the
stack."
No... that is an attempt at actually drawing the pedals, and putting
the pelican's feet right on the pedals!
buserror wrote 6 hours 22 min ago:
The hilarious bit is that this page will soon be scraped by ai-bots as
learning material, and they'll all learn to draw pelicans on bicycles
using this as their primary example material, as they'll be the only
examples.
GIGO in motion :-)
darkoob12 wrote 7 hours 40 min ago:
Should we be that excited about AI and calling a fraud and plagiarism
machine "ChatGPT Mischief Buddy" without any moral deliberation?
simonw wrote 7 hours 13 min ago:
The "mischief buddy" joke is a poke at exactly that.
0points wrote 13 hours 19 min ago:
So the only bird with anything slightly resembling a pelican's beak was
drawn by Gemini 2.5 Pro. In general, none of the output resembles a
pelican enough that you could separate it from "a bird".
OP seems to ignore that a pelican has a distinct look when evaluating
these doodles.
simonw wrote 13 hours 17 min ago:
The pelican's distinct look - and the fact that none of the models
can capture it - is the whole point.
irthomasthomas wrote 20 hours 2 min ago:
The best pelicans come from running a consortium of models. I use
pelicans as evals now. [1] Test it using VibeLab (wip) [2]
[1]: https://x.com/xundecidability/status/1921009133077053462
[2]: https://x.com/xundecidability/status/1926779393633857715
m3047 wrote 20 hours 27 min ago:
TIL: Snitchbench!
NohatCoder wrote 22 hours 28 min ago:
If you calculate Elo based on a round-robin tournament with all
participants starting out on the same score, then the resulting ratings
should simply correspond to the win count. I guess the algorithm in use
takes into account the order of the matches, but taking order into
account is only meaningful when competitors are expected to develop
significantly; otherwise it is just added noise, so we never want to do
that in competitions between bots.
I also can't help but notice that the competition is exactly one match
short: for some reason exactly one of the 561 possible pairings has not
been included.
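For reference, the textbook Elo update looks roughly like this (a
minimal sketch with an assumed K-factor of 32, not necessarily the
exact parameters used for the pelican ratings):
```
// Standard Elo update for one match; scoreA is 1 (A wins), 0 (A loses) or 0.5 (draw).
// With 34 pelicans there are 34 * 33 / 2 = 561 possible pairings.
function eloUpdate(ratingA, ratingB, scoreA, k = 32) {
  const expectedA = 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
  const delta = k * (scoreA - expectedA);
  return [ratingA + delta, ratingB - delta];
}

console.log(eloUpdate(1000, 1000, 1)); // [ 1016, 984 ]
```
With everyone starting equal and each pair meeting exactly once, the
order of the updates is the only thing separating this from a plain win
count.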
simonw wrote 21 hours 56 min ago:
Yeah, that's a good call out: Elo isn't actually necessary if you can
have every competitor battle every other competitor exactly once.
The missing match is because one single round was declared a draw by
the model, and I didn't have time to run it again (the Elo stuff was
very much rushed at the last minute).
NicoSchwandner wrote 23 hours 42 min ago:
Nice post, thanks!
zurichisstained wrote 1 day ago:
Wow, I love this benchmark - I've been doing something similar (as a
joke, and much less frequently), where I ask multiple models to
attempt to create a data structure like:
```
const melody = [
{ freq: 261.63, duration: 'quarter' }, // C4
{ freq: 0, duration: 'triplet' }, // triplet rest
{ freq: 293.66, duration: 'triplet' }, // D4
{ freq: 0, duration: 'triplet' }, // triplet rest
{ freq: 329.63, duration: 'half' }, // E4
]
```
But with the intro to Smoke on the Water by Deep Purple. Then I run it
through the Web Audio API and see how it sounds.
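Roughly, the playback side looks like this (a minimal sketch; the
duration-to-seconds mapping here is just an assumed one):
```
// Play a melody array like the one above with the Web Audio API.
// The mapping of duration names to seconds is assumed, not standard.
const durations = { quarter: 0.5, triplet: 0.17, half: 1.0 };

function playMelody(melody) {
  const ctx = new AudioContext();
  let t = ctx.currentTime;
  for (const note of melody) {
    const length = durations[note.duration] ?? 0.5;
    if (note.freq > 0) {            // freq of 0 is treated as a rest
      const osc = ctx.createOscillator();
      const gain = ctx.createGain();
      osc.frequency.value = note.freq;
      osc.connect(gain).connect(ctx.destination);
      gain.gain.setValueAtTime(0.2, t);
      osc.start(t);
      osc.stop(t + length * 0.9);   // leave a small gap between notes
    }
    t += length;
  }
}
```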
It's never quite gotten it right, but it's gotten better, to the point
where I can ask it to make a website that can play it.
I think yours is a lot more thoughtful about testing novelty, but it's
interesting to see them attempt to do things that they aren't really
built for (in theory!). [1] - ChatGPT 4 Turbo [2] - Claude Sonnet 3.7
[3] - Gemini 2.5 Pro
Gemini is by far the best sounding one, but it's still off. I'd be
curious how the latest and greatest (paid) versions fare.
(And just for comparison, here's the first time I did it... you can
tell I did the front-end because there isn't much to it!)
[1]: https://codepen.io/mvattuone/pen/qEdPaoW
[2]: https://codepen.io/mvattuone/pen/ogXGzdg
[3]: https://codepen.io/mvattuone/pen/ZYGXpom
[4]: https://nitter.space/mvattuone/status/1646610228748730368#m
ojosilva wrote 20 hours 24 min ago:
Drawbacks of using a pelican-on-a-bicycle SVG: it's a very
open-ended prompt with no specific criteria to judge, and lately the
SVGs all start to look similar, or at least like they accomplished the
same non-goals (there's a pelican, there's a bicycle, and I'm not sure
whether its feet should be on the saddle or on the pedals), so it's
hard to agree on which is better. And, certainly, with an LLM as the
judge, the entire game becomes double-hinged and who knows what to
think.
Also, if it becomes popular, training sets may pick it up and improve
models unfairly and unrealistically. But that's true of any known
benchmark.
Side note: I'd really like to see the Language Benchmark Game become
a prompt-based languages × models benchmark game. Then we could say
model X excels at Python fasta, etc., although the risk is that,
again, it becomes training data and the whole thing rigs itself.
dr_kretyn wrote 22 hours 33 min ago:
I'm slightly confused by your example. What's the actual prompt? Is
your expectation that a text model is going to know how to perform
the exact song in audio?
zurichisstained wrote 20 hours 15 min ago:
Ohhh absolutely not, that would be pretty wild - I just wanted to
see if it could understand musical notation enough to come up with
the correct melody.
I know there are far better ways to do gen AI with music, this was
just a joke prompt that worked far better than I expected.
My naive guess is all of the guitar tabs and signal processing info
it's trained on gives it the ability to do stuff like this (albeit
not very well).
isx726552 wrote 1 day ago:
> I’ve been feeling pretty good about my benchmark! It should stay
useful for a long time... provided none of the big AI labs catch on.
> And then I saw this in the Google I/O keynote a few weeks ago, in a
blink and you’ll miss it moment! There’s a pelican riding a
bicycle! They’re on to me. I’m going to have to switch to something
else.
Yeah this touches on an issue that makes it very difficult to have a
discussion in public about AI capabilities. Any specific test you talk
about, no matter how small … if the big companies get wind of it, it
will be RLHF’d away, sometimes to the point of absurdity. Just refer
to the old “count the ‘r’s in strawberry” canard for one
example.
lofaszvanitt wrote 6 hours 28 min ago:
You push SHA-512 hashes of things to a GitHub repo along with a short
sentence:
x8 version: still shit
...
x15 version: we are closing in, but overall a shit experience :D
This way they won't know what to improve upon. Of course they can buy
access. ;P
When they finally solve your problem you can reveal what the benchmark
was.
Choco31415 wrote 22 hours 36 min ago:
Just tried that canard on GPT-4o and it failed:
"The word "strawberry" contains 2 letter r’s."
belter wrote 2 hours 49 min ago:
I tried
strawberry -> DeepSeek, GeminiPro and ChatGPT4o all correctly said
three
strawberrry -> DeepSeek, GeminiPro and ChatGPT4o all correctly said
four
stawberrry -> DeepSeek, GeminiPro all correctly said three
ChatGPT4o even in a new Chat, incorrectly said the word
"stawberrry" contains 4 letter "r" characters. Even provided this
useful breakdown to let me know :-)
Breakdown:
stawberrry → s, t, a, w, b, e, r, r, r, y → 4 r's
And then asked if I meant "strawberry" instead and said because
that one has 2 r's....
simonw wrote 22 hours 52 min ago:
Honestly, if my stupid pelican riding a bicycle benchmark becomes
influential enough that AI labs waste their time optimizing for it
and produce really beautiful pelican illustrations I will consider
that a huge personal win.
MattRix wrote 23 hours 4 min ago:
This is why things like the ARC Prize are better ways of approaching
this:
[1]: https://arcprize.org
whiplash451 wrote 12 hours 39 min ago:
Well, ARC-1 did not end well for the competitors of tech giants and
it’s very unclear that ARC-2 won’t follow the same trajectory.
joshuajooste05 wrote 1 day ago:
Does anyone have any thoughts on privacy/safety regarding what he said
about GPT memory?
I had heard of prompt injection already. But this seems different,
completely out of humans' control. Even when you consider web
search functionality, he is right: more and more, users are
losing control over context.
Is this dangerous atm? Do you think it will become more dangerous in
the future when we chuck even more data into context?
threeseed wrote 21 hours 53 min ago:
I've had Cursor/Claude try to call rm -rf on my entire User directory
before.
The issue is that LLMs have no ability to organise their memory by
importance. Especially as the context size gets larger.
So when they are using tools they will become more dangerous over
time.
ActorNightly wrote 1 day ago:
Sort of. The thing is with agentic models, you are basically entering
probability space where it can do real actions in the form of http
requests if the statistical output leads it to it.
Joker_vD wrote 1 day ago:
> most people find it difficult to remember the exact orientation of
the frame.
Isn't it Δ∇Λ welded together? The bottom left and right vertices
are where the wheels attach, and the middle bottom point is where the
big gear with the pedals is. The lambda is for the front wheel,
because you wouldn't be able to turn it if it was attached to a delta.
Right?
I guess having my first bicycle be a cheap Soviet-era one paid
off: I spent loads of time fiddling with the chain tension and
pulling the chain back onto the gears, so I had to stare at the
frame far too much to have forgotten, even today, the way it looks.
pbronez wrote 1 day ago:
There are a lot of structural details that people tend to gloss over.
This was illustrated by an Italian art project: [1] > back in 2009 I
began pestering friends and random strangers. I would walk up to them
with a pen and a sheet of paper asking that they immediately draw me
a men’s bicycle, by heart. Soon I found out that when confronted
with this odd request most people have a very hard time remembering
exactly how a bike is made.
[1]: https://www.gianlucagimini.it/portfolio-item/velocipedia/
zahlman wrote 1 day ago:
> If you lost interest in local models—like I did eight months
ago—it’s worth paying attention to them again. They’ve got good
now!
> As a power user of these tools, I want to stay in complete control of
what the inputs are. Features like ChatGPT memory are taking that
control away from me.
You reap what you sow....
> I already have a tool I built called shot-scraper, a CLI app that
lets me take screenshots of web pages and save them as images. I had
Claude build me a web page that accepts ?left= and ?right= parameters
pointing to image URLs and then embeds them side-by-side on a page.
Then I could take screenshots of those two images side-by-side. I
generated one of those for every possible match-up of my 34 pelican
pictures—560 matches in total.
Surely it would have been easier to use a local tool like ImageMagick?
You could even have the AI write a Bash script for you.
> ... but prompt injection is still a thing.
...Why wouldn't it always be? There's no quoting or escaping mechanism
that's actually out-of-band.
> There’s this thing I’m calling the lethal trifecta, which is when
you have an AI system that has access to private data, and potential
exposure to malicious instructions—so other people can trick it into
doing things... and there’s a mechanism to exfiltrate stuff.
People in 2025 actually need to be told this. Franklin missed the mark
- people today will trip over themselves to give up both their security
and their liberty for mere convenience.
simonw wrote 1 day ago:
I had the LLM write a bash script for me that used my shot-scraper
tool [1] - on the basis that it was a neat opportunity to demonstrate
another of my own projects.
And honestly, even with LLM assistance, getting ImageMagick to output
a 1200x600 image with two SVGs next to each other that are correctly
resized to fill their half of the image sounds pretty tricky.
Probably easier (for Claude) to achieve with HTML and CSS.
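The core of such a page is tiny - something like this minimal sketch
(the page Claude actually generated may well have looked different):
```
// Read ?left= and ?right= image URLs from the query string and show them side by side.
const params = new URLSearchParams(location.search);
for (const side of ["left", "right"]) {
  const img = document.createElement("img");
  img.src = params.get(side);
  img.style.width = "50%";   // each image fills half of a 1200x600 viewport
  document.body.appendChild(img);
}
```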
[1]: https://shot-scraper.datasette.io/
voiper1 wrote 1 day ago:
Isn't "left or right" _followed_ by rationale asking it to
rationalize it's 1 word answer - I thought we need to get AI to do
the chain of though _before_ giving it's answer for it to be more
accurate?
simonw wrote 1 day ago:
Yes it is - I would likely have gotten better results if I'd
asked for the rationale first.
zahlman wrote 1 day ago:
> And honestly, even with LLM assistance getting Image Magick to
output a 1200x600 image with two SVGs next to each other that are
correctly resized to fill their half of the image sounds pretty
tricky.
FWIW, the next project I want to look at after my current two, is a
command-line tool to make this sort of thing easier. Likely
featuring some sort of Lisp-like DSL to describe what to do with
the input images.
username223 wrote 1 day ago:
Interesting timeline, though the most relevant part was at the end,
where Simon mentions that Google is now aware of the "pelican on
bicycle" question, so it is no longer useful as a benchmark. FWIW, many
things outside of the training data will pants these models. I just
tried this query, which probably has no examples online, and Gemini
gave me the standard puzzle answer, which is wrong:
"Say I have a wolf, a goat, and some cabbage, and I want to get them
across a river. The wolf will eat the goat if they're left alone, which
is bad. The goat will eat some cabbage, and will starve otherwise. How
do I get them all across the river in the fewest trips?"
A child would pick up that you have plenty of cabbage, but can't leave
the goat without it, lest it starve. Also, there's no mention of boat
capacity, so you could just bring them all over at once. Useful?
Sometimes. Intelligent? No.
djherbis wrote 1 day ago:
Kaggle recently ran a competition to do just this (draw SVGs from
prompts, using fairly small models under the hood).
The top results (click on the top Solutions) were pretty impressive:
[1]: https://www.kaggle.com/competitions/drawing-with-llms/leaderbo...
nine_k wrote 1 day ago:
Am I the only one who can't help but see these attempts as much like
those of a kid learning to draw?
Ygg2 wrote 1 day ago:
Yes. Kids don't draw that good of a line at the start.
Here is a better example of a start:
[1]: https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTfTfAA...
nine_k wrote 1 day ago:
Have you tried giving a kid a vector-drawing tool?
I did that to my daughter when she was not even 6 years old. The
results were somehow similar: [1] (Now she's much better, but
prefers raster tools, e.g. [2] )
[1]: https://photos.app.goo.gl/XSLnTEUkmtW2n7cX8
[2]: https://www.deviantart.com/sofiac9/art/Ivy-with-riding-gea...
pier25 wrote 1 day ago:
Definitely getting better but even the best result is not very
impressive.
jfengel wrote 1 day ago:
It's not so great at bicycles, either. None of those are close to
rideable.
But bicycles are famously hard for artists as well. Cyclists can
identify all of the parts, but if you don't ride a lot it can be
surprisingly difficult to get all of the major bits of geometry right.
mattlondon wrote 1 day ago:
Most recent Gemini 2.5 one looks pretty good. Certainly rideable.
bredren wrote 1 day ago:
Great writeup.
This measure of LLM capability could be extended by taking it into the
3D domain.
That is, having the model write Python code for Blender, then running
blender in headless mode behind an API.
The talk hints at this, but one-shot prompting likely won't be a broad
enough measurement of capability by this time next year. (Or perhaps
even now.)
So the test could also include an agentic portion that involves
consulting the latest Blender documentation, or even using a search
engine to find blog entries detailing syntax and technique.
For multimodal input processing, it could take into account a
particular photo of a pelican as the test subject.
For usability, the objects can be converted to iOS's native 3D format
so they can be viewed in mobile Safari.
I built this workflow, including a service for Blender, as an initial
test of what was possible in October of 2022. It took post-processing
for common syntax errors back then, but I'd imagine the newer LLMs
would make those mistakes less often now.
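The headless step itself is just invoking Blender on the generated
script, roughly like this (a minimal sketch; the script filename is
hypothetical):
```
// Run Blender without a UI on an LLM-generated scene script.
import { execFileSync } from "node:child_process";

execFileSync("blender", [
  "--background",                   // headless mode, no UI
  "--python", "pelican_scene.py",   // the model-written Blender script (hypothetical name)
]);
```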
mromanuk wrote 1 day ago:
The last animation is hilarious, represents very well the AI Hype cycle
vs reality.
nowayno583 wrote 1 day ago:
That was a very fun recap, thanks for sharing. It's easy to forget how
much better these things have gotten. And this was in just six months!
Crazy!
adrian17 wrote 1 day ago:
> This was one of the most successful product launches of all time.
They signed up 100 million new user accounts in a week! They had a
single hour where they signed up a million new accounts, as this thing
kept on going viral again and again and again.
Awkwardly, I never heard of it until now. I was aware that at some
point they added the ability to generate images to the app, but I never
realized it was a major thing (plus I already had an offline Stable
Diffusion app on my phone, so it felt like less of an upgrade to me
personally). With so much AI news each week, it feels like unless
you're really invested in the space, it's almost impossible not to
accidentally miss or dismiss some big release.
MattRix wrote 23 hours 8 min ago:
To be clear: they already had image generation in ChatGPT, but this
was a MUCH better one than what they had previously. Even for you
with your stable diffusion app, it would be a significant upgrade.
Not just because of image quality, but because it can actually
generate coherent images and follow instructions.
MIC132 wrote 12 hours 31 min ago:
As impressive as it is, for some uses it still is worse than a
local SD model.
It will refuse to generate named anime characters (because of
copyright, or because it just doesn't know them, even not
particularly obscure ones) for example.
Or obviously anything even remotely spicy.
As someone who mostly uses image generation to amuse myself (and
not to post it, where copyright might matter) it's honestly
somewhat disappointing. But I don't expect any of the major AI
companies to release anything without excessive guardrails.
bufferoverflow wrote 1 day ago:
Have you missed how everyone was Ghiblifying everything?
andrepd wrote 21 hours 26 min ago:
Oh you mean the trend of the day on the social media monoculture? I
don't take that as an indicator of any significance.
Philpax wrote 19 hours 1 min ago:
One should not be proud of their ignorance.
DaSHacka wrote 16 hours 13 min ago:
Except when it comes to using social media, where "ignorance"
unironically is strength
adrian17 wrote 23 hours 21 min ago:
I saw that, I just didn't connect it with newly added multimodal
image generation. I knew variations of style transfer (or LoRA for
SD) were possible for years, so I assumed it exploded in popularity
purely as a meme, not due to OpenAI making it much more accessible.
Again, I was aware that they added image generation, just not how
big a deal it turned out to be. Think of it like me
occasionally noticing merchandise and TV trailers for a new movie
without realizing it became the new worldwide box office #1.
haiku2077 wrote 1 day ago:
Congratulations, you are almost fully unplugged from social media.
This product launch was a huge mainstream event; for a few days GPT
generated images completely dominated mainstream social media.
Semaphor wrote 9 hours 28 min ago:
Facebook, discord, reddit, HN. Hadn’t heard of it either. But for
FB, Reddit, and Discord I strictly curate what I see.
sigmoid10 wrote 9 hours 39 min ago:
If you primarily consume text-based social media (HN, reddit with
legacy UI) then it's kind of easy to not notice all the new kinds
of image infographics and comics that now completely flood places
like instagram or linkedin.
derwiki wrote 1 day ago:
Not sure if this is sarcasm or sincere, but I will take it as
sincere haha. I came back to work from parental leave and everyone
had that same Studio Ghiblized image as their Slack photo, and I
had no idea why. It turns out you really can unplug from social
media and not miss anything of value: if it’s a big enough deal
you will find out from another channel.
stavros wrote 7 hours 43 min ago:
Why does everyone keep calling news "social media"? Have I missed
a trend? Knowing what my friend Steve is up to is social media,
knowing what AI is up to is news.
loudmax wrote 1 hour 49 min ago:
I'm afraid a lot of Americans consume the news like they
consume sports media. They root for their team and select a
news stream that presents them with the most favorable
coverage.
stavros wrote 1 hour 47 min ago:
As a non-American, I can assure you that's pretty much
everywhere.
haiku2077 wrote 3 hours 27 min ago:
You did miss a trend:
[1]: https://www.pewresearch.org/short-reads/2024/09/17/mor...
dgfitz wrote 20 hours 45 min ago:
I missed it until this thread. I think I’m proud of myself.
tough wrote 8 hours 20 min ago:
You're one of today's lucky 10.000
[1]: https://xkcd.com/1053/
azinman2 wrote 1 day ago:
Except this went very mainstream. Lots of turn myself into a muppet,
what is the human equivalent for my dog, etc. TikTok is all over
this.
It really is incredible.
thierrydamiba wrote 1 day ago:
The big trend was around the ghiblification of images. Those images
were everywhere for a period of time.
herval wrote 1 day ago:
They still are. Instagram is full of accounts posting
gpt-generated cartoons (and now veo3 videos). I’ve been
tracking the image generation space from day one, and it never
stuck like this before
simonw wrote 1 day ago:
Anecdotally, I've had several conversations with people way
outside the hyper-online demographic who have been really
enjoying the new ChatGPT image generation - using it for
cartoon photos of their kids, to create custom birthday cards
etc.
I think it's broken out into mainstream adoption and is going
to stay there.
It reminds me a little of Napster. The Napster UI was terrible,
but it let people do something they had never been able to do
before: listen to any piece of music ever released, on-demand.
As a result people with almost no interest in technology at all
were learning how to use it.
Most people have never had the ability to turn a photo of their
kids into a cute cartoon before, and it turns out that's
something they really want to be able to do.
herval wrote 1 day ago:
Definitely. It’s not just online either - half the
billboards I see now are AI. The posters at school. The
“we’re hiring!” ad at the local McDonalds. It’s …
cheaper and faster than any alternative (stock images, hiring
an editor or illustrator, etc), and most non technical people
can get exactly what they want in a single shot, these days.
Jedd wrote 1 day ago:
Yeah, but so were the bored ape NFTs - none of these ephemeral
fads are any indication of quality, longevity, legitimacy, or
interest.
sandspar wrote 13 hours 8 min ago:
I just don't understand how people can see "100 million signups
in a week" and immediately dismiss it. We're not talking about
fidget spinners. I don't get why this sentiment is so common
here on HackerNews. It's become a running joke in other online
spaces, "HackerNews commenters keep saying that AI is a
nothingburger." It's just a groupthink thing I guess, a
kneejerk response.
otabdeveloper4 wrote 9 hours 36 min ago:
> We're not talking about fidget spinners.
We're talking about Hitler memes instead? I don't understand
your feigned outrage.
The actual valid commercial use case for generative images
hasn't been found yet. (No, making blog spam prettier is not
a good use case.)
simonw wrote 7 hours 15 min ago:
Everything Everywhere All At Once won a bunch of Oscars.
They used generative AI tools for some of their
post-production work (achieved by a tiny team), for example
to help clean up the backgrounds in the scene with the
silent dialog between the two rocks.
stavros wrote 7 hours 40 min ago:
You're right, nothing has value unless someone figures out
how to make money with it. Except OpenAI, apparently,
because the fact that people buy ChatGPT to make images
doesn't seem to count as a commercial use case.
otabdeveloper4 wrote 6 hours 48 min ago:
OpenAI is not profitable and we don't know if it ever
will be.
stavros wrote 6 hours 44 min ago:
Have we shifted the goalposts from "something people
will pay for" to "needs to be profitable even with
massive R&D" then?
otabdeveloper4 wrote 5 hours 25 min ago:
OpenAI is not "something people will pay for" at the
moment though.
stavros wrote 5 hours 14 min ago:
Except lots of people are paying for it. I'll refer
you to the other post on the front page for the
calculation that OpenAI would have to get just an
extra $10/yr from their users to break even.
otabdeveloper4 wrote 3 hours 10 min ago:
Your response reminds me of that joke about
selling a dollar bill for ninety cents.
stavros wrote 3 hours 7 min ago:
Your response makes me think we have different
definitions for profitability.
pintxo wrote 12 hours 35 min ago:
I assume, when people dismiss it, they are not looking at it
through the business lens and the 100m user signups KPI, but
they are dismissing it on technical grounds, as an LLM is
just a very big statistical database which seems incapable of
solving problems beyond (impressive looking) text/image/video
generation.
sandspar wrote 12 hours 14 min ago:
Makes sense. Although I think that's an error. TikTok is
"just" a video sharing site. Joe Rogan is "just" a
podcaster. Dumb things that affect lots of people are
important.
micromacrofoot wrote 23 hours 36 min ago:
They're not, but I'm already seeing AI-generated images on
billboards for local businesses. They're in production
workflows now and they aren't going anywhere.
baq wrote 1 day ago:
It’s hard to think of a worse analogy TBH. My wife is using
ChatGPT to change photos (still is to this day), she didn’t
use it or any other LLM until that feature hit. It is a fad,
but it’s also a very useful tool.
Ape NFTs are… ape NFTs. Useless. Pointless. Negative value
for most people.
Jedd wrote 11 hours 14 min ago:
I would note that I was replying to a comment about the 'big
trend of ghiblification' of images.
Reproducing a certain style of image has been a regular fad
since profile pictures became a thing sometime last century.
I was not meaning to suggest that large language & diffusion
models are fads.
(I do think their capabilities are poorly understood and/or
over-estimated by non-technical and some technical people
alike, but that invites a more nuanced discussion.)
While I'm sure your wife is getting good value out of the
system, whether it's a better fit for purpose, produces
better quality, or provides a more satisfying workflow --
than, say, a decent free photo editor -- or whether other tools
were tried but determined to be too limited or difficult, etc.
-- only you or she could say. It does feel like a small
sample set, though.
senthil_rajasek wrote 1 day ago:
"My wife is using ChatGPT to change photos (still is to this
day), she didn’t use it or any other LLM until that feature
hit."
This is deja vu, except instead of ChatGPT to edit photos it
was instagram a decade ago.
baq wrote 1 day ago:
You either haven’t tried it or are just trolling.
senthil_rajasek wrote 23 hours 20 min ago:
I am contrasting how instagram filters gave users some
control and increased user base and how today editing
photos with LLMs is doing the same and pulling in a wider
user base.
djhn wrote 23 hours 33 min ago:
I tried it and I don’t get it. What and where are the
legal usecases? What can you do with these low-resolution
images?
jauntywundrkind wrote 1 day ago:
Applying some filters and adding some overlay text is
something some folks did, but there's such a massive
creative world that's opened up, where all we have to do is
ask.
mrkurt wrote 1 day ago:
If we try really hard, I think we can make an exhaustive list
of what viral fads on the internet are not. You made a small
start.
none of these ephemeral fads are any indication of quality,
longevity, legitimacy, interest, substance, endurance,
prestige, relevance, credibility, allure, staying-power,
refinement, or depth.
Aurornis wrote 22 hours 43 min ago:
100 million people didn’t sign up to make that one image
meme and then never use it again.
That many signups is impressive no matter what. The attempts
to downplay every aspect of LLM popularity are getting really
tiresome.
otabdeveloper4 wrote 9 hours 40 min ago:
> 100 million people didn’t sign up to make that one
image meme and then never use it again.
Source? They did exactly that.
simonw wrote 7 hours 12 min ago:
What's your source for saying they did exactly that?
jodrellblank wrote 22 hours 31 min ago:
I think it sounds far more likely that 100M people signed
up to poke at the latest viral novelty and create one meme,
than that 100M people suddenly discovered they had a
pressing long-term need for AI images all on the same day.
Doesn’t it?
ben_w wrote 22 hours 1 min ago:
While 100M signing up just for one pic is certainly
possible, I note that several hundred million people
regularly share photographs of their lunch, so it is very
plausible that in signing up for the latest meme
generator they found they liked the ability to generate
custom images of whatever they consider to be pretty
pictures every day.
gretch wrote 22 hours 6 min ago:
It's neither of these options in this false dichotomy.
100M people signed up and did at least 1 task. Then, most
likely some % of them discovered it was a useful thing
(if for nothing else than just to make more memes), and
converted into a MAU.
If I had to use my intuition, I would say it's 5% - 10%,
which represents a larger product launch than most
developers will ever participate in, in the context of a
single day.
Of course the ongoing stickiness of the MAU also depends
on the ability of this particular tool to stay on top
amongst increasing competition.
oblio wrote 15 hours 39 min ago:
Apparently OpenAI is losing money like crazy on this
and their conversion rates to paid are abysmal, even
for the cheaper licenses. And not even their top
subscription covers its cost.
Uber at a 10x scale.
I should add that compared to the hype, at a global
level Uber is a failure. Yes, it's still a big company,
yes, it's profitable now, but I think it was launched
10+ years ago and it's only now barely becoming net profitable
over its existence, and shows no signs of taking
over the world. Sure, it's big in the US and a few
specific markets. But elsewhere it's either banned for
undermining labor practices or has stiff local
competition or it's just not cost competitive and it
won't enter the market because without the whole "gig
economy" scam it's just a regular taxi company with a
better app.
simonw wrote 14 hours 48 min ago:
Is that information about their low conversion rates
from credible sources?
oblio wrote 13 hours 20 min ago:
It's quite hard to say for sure, and I will prefix
my comment by saying his blog posts are very long
and quite doomerist about LLMs, but he makes a
decent case about OpenAI financials: [1] [2] A very
solid argument is like the one against propaganda:
it's not so much about what is being said as about what
isn't. OpenAI is basically shouting about
every minor achievement from the rooftops so the
fact that they are remarkably silent about
financial fundamentals says something. At best
something mediocre or more likely bad.
[1]: https://www.wheresyoured.at/wheres-the-mon...
[2]: https://www.wheresyoured.at/openai-is-a-sy...
landgenoot wrote 1 day ago:
If you would give a human the SVG documentation and ask to write an
SVG, I think the results would be quite similar.
ramesh31 wrote 1 day ago:
>If you would give a human the SVG documentation and ask to write an
SVG, I think the results would be quite similar.
It certainly would, and it would cost at minimum an hour of the human
programmer's time at $50+/hr. Claude does it in seconds for pennies.
diggan wrote 1 day ago:
Let's give it a try, if you're willing to be the experiment subject :)
The prompt is "Generate an SVG of a pelican riding a bicycle" and
you're supposed to write it by hand, so no graphical editor. The
specification is here: [1] I'm fairly certain I'd lose interest in
getting it right before I got something better than most of those.
[1]: https://www.w3.org/TR/SVG2/
zahlman wrote 1 day ago:
> The colors use traditional bicycle brown (#8B4513) and a classic
blue for the pelican (#4169E1) with gold accents for the beak
(#FFD700).
The output pelican is indeed blue. I can't fathom where the idea
that this is "classic", or suitable for a pelican, could have come
from.
diggan wrote 1 day ago:
My guess would be that it doesn't see the web colors (CSS color
hexes) as proper hex triplets, but because of tokenization it
could be something dumb like '#8B','451','3' instead. I think the
same issue happens around multiple special characters after each
other too.
cap11235 wrote 13 hours 59 min ago:
Qwen3, at least, tokenizes each character of "#8B4513"
separately.
zahlman wrote 19 hours 15 min ago:
No, it's understanding the colors properly. The SVG that the
LLM created does use #4169E1 for the pelican color, and the LLM
correctly describes this color as blue. The problem is that
pelicans should not be blue.
mormegil wrote 1 day ago:
Did the testing prompt for LLMs include a clause forbidding the use
of any tools? If not, why are you adding it here?
simonw wrote 1 day ago:
The way I run the pelican on a bicycle benchmark is to use this
exact prompt:
Generate an SVG of a pelican riding a bicycle
And execute it via the model's API with all default settings, not
via their user-facing interface.
Currently none of the model APIs enable tools unless you ask them
to, so this method excludes the use of additional tools.
diggan wrote 1 day ago:
The models that are being put under the "Pelican" testing don't
use a GUI to create SVGs (either via "tools" or anything else),
they're all Text Generation models so they exclusively use text
for creating the graphics.
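Concretely, a bare text-completion call with no tools looks something
like this (a sketch against OpenAI's chat completions endpoint; other
providers' APIs are similar):
```
// Send the benchmark prompt to a model API with default settings and print the raw reply.
const response = await fetch("https://api.openai.com/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
  },
  body: JSON.stringify({
    model: "gpt-4o",
    messages: [
      { role: "user", content: "Generate an SVG of a pelican riding a bicycle" },
    ],
  }),
});
const data = await response.json();
console.log(data.choices[0].message.content); // hopefully an <svg>...</svg>
```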
There are 31 posts listed under "pelican-riding-a-bicycle" in
case you wanna inspect the methodology even closer:
[1]: https://simonwillison.net/tags/pelican-riding-a-bicycle/
wohoef wrote 1 day ago:
Quite a detailed image using claude sonnet 4:
[1]: https://ibb.co/39RbRm5W
spaceman_2020 wrote 1 day ago:
I don’t know what secret sauce Anthropic has, but in real world use,
Sonnet is somehow still the best model around. Better than Opus and
Gemini Pro
diggan wrote 1 day ago:
Statements like these are useless without sharing exactly all the
models you've tried. Sonnet beats O1 Pro Mode for example? Not in my
experience, but I haven't tried the latest Sonnet versions, only the
one before, so wouldn't claim O1 Pro Mode beats everything out there.
Besides, it's so heavily context-dependent that you really need your
own private benchmarks to make head or tails out of this whole thing.
big_hacker wrote 1 day ago:
Honestly the metric which increased the most is the marketing and
astroturfing budget of the major players (OpenAI, Anthropic, Google and
Deepseek).
Say what you want about Facebook but at least they released their
flagship model fully open.
mdaniel wrote 1 day ago:
> model fully open.
uh-huh
[1]: https://www.llama.com/llama4/license/
franze wrote 1 day ago:
Here Claude Opus Extended Thinking
[1]: https://claude.ai/public/artifacts/707c2459-05a1-4a32-b393-c61...
ramesh31 wrote 1 day ago:
Single shot?
franze wrote 1 day ago:
Two-shot: the first one just generated the SVG, not the shareable HTML
page around it. In the second go it also reworked the SVG, as I did
not forbid it.
deadbabe wrote 1 day ago:
As a control, he should go on Fiverr and have a human generate a pelican
riding a bicycle, just to see what the eventual goal is.
gus_massa wrote 1 day ago:
Someone did this. Look at this sibling comment by ben_w [1] about an
old similar project.
[1]: https://news.ycombinator.com/item?id=44216284
zahlman wrote 1 day ago:
> back in 2009 I began pestering friends and random strangers. I
would walk up to them with a pen and a sheet of paper asking that
they immediately draw me a men’s bicycle, by heart.
Someone commissioned to draw a bicycle on Fiverr would not have to
rely on memory of what it should look like. It would take barely
any time to just look up a reference.
atxtechbro wrote 1 day ago:
Thank you, Simon! I really enjoyed your PyBay 2023 talk on embeddings
and this is great too! I like the personalized benchmark. Hopefully the
big LLM providers don't start gaming the pelican index!
dirtyhippiefree wrote 1 day ago:
Here’s the spot where we see who’s TL;DR…
> Claude 4 will rat you out to the feds!
>If you expose it to evidence of malfeasance in your company, and you
tell it it should act ethically, and you give it the ability to send
email, it’ll rat you out.
gscott wrote 21 hours 31 min ago:
I am interested in this ratting-you-out thing. At some point you
have a video feed into AI from a Jarvis-like headset device; you're
walking down the street and cross in the middle, not at a
crosswalk... does it rat you out? Does it make a list of every crime,
no matter how small? Or just the big ones?
yubblegum wrote 1 day ago:
I was looking at that and wondering about swatting via LLMs by
malicious users.
ben_w wrote 1 day ago:
I'd say that's too short.
> But it’s not just Claude. Theo Browne put together a new
benchmark called SnitchBench, inspired by the Claude 4 System Card.
> It turns out nearly all of the models do the same thing.
dirtyhippiefree wrote 1 day ago:
I totally agree, but I needed you to post the other half because of
TL;DR…
bravesoul2 wrote 1 day ago:
Is there a good model (any architecture) for vector graphics out of
interest?
simonw wrote 1 day ago:
I was impressed by Recraft v3, which gave me an editable vector
illustration with different layers - [1] - but as I understand it
that one is actually still a raster image generator with a separate
step to convert to vector at the end.
[1]: https://simonwillison.net/2024/Nov/15/recraft-v3/
bravesoul2 wrote 1 day ago:
Now that is a pelican on a bicycle! Thanks
JimDabell wrote 1 day ago:
See also: The recent history of AI in 32 otters
[1]: https://www.oneusefulthing.org/p/the-recent-history-of-ai-in-3...
pbhjpbhj wrote 1 day ago:
That is otterly fantastic. The post there shows the breadth too -
both otters generated via text representations (in TikZ) and by image
generators. The video at the end, wow (and funny too).
Thanks for sharing.
qwertytyyuu wrote 1 day ago:
[1] Here are a few I tried with the models; looks like the newer
version of Gemini is another improvement?
[1]: https://imgur.com/a/mzZ77xI
puttycat wrote 1 day ago:
The bicycles are still very far from actual ones.
pjs_ wrote 1 day ago:
[1]: https://www.gianlucagimini.it/portfolio-item/velocipedia/
simonw wrote 1 day ago:
I think the most recent Gemini Pro bicycle may be the best yet -
the red frame is genuinely the right shape.
layer8 wrote 1 day ago:
The pelican, on the other hand...
anon373839 wrote 1 day ago:
Enjoyable write-up, but why is Qwen 3 conspicuously absent? It was a
really strong release, especially the fine-grained MoE which is unlike
anything that’s come before (in terms of capability and speed on
consumer hardware).
simonw wrote 1 day ago:
Omitting Qwen 3 is my great regret about this talk. Honestly I only
realized I had missed it after I had delivered the talk!
It's one of my favorite local models right now, I'm not sure how I
missed it when I was reviewing my highlights of the last six months.
Maxious wrote 1 day ago:
Cut for time - qwen3 was pelican tested too
[1]: https://simonwillison.net/2025/Apr/29/qwen-3/
nathan_phoenix wrote 1 day ago:
My biggest gripe is that he's comparing probabilistic models (LLMs) by
a single sample.
You wouldn't compare different random number generators by taking one
sample from each and then concluding that generator 5 generates the
highest numbers...
Would be nicer to run the comparison with 10 images (or more) for each
LLM and then average.
timewizard wrote 1 day ago:
My biggest gripe is he didn't include a picture of an actual pelican.
[1] The "closest pelican" is not even close.
[1]: https://www.google.com/search?q=pelican&udm=2
mooreds wrote 1 day ago:
My biggest gripe is that he outsourced evaluation of the pelicans to
another LLM.
I get it was way easier to do and that doing it took pennies and no
time. But I would have loved it if he'd tried alternate methods of
judging and seen what the results were.
Other ways:
* wisdom of the crowds (have people vote on it)
* wisdom of the experts (send the pelican images to a few dozen
artists or ornithologists)
* wisdom of the LLMs (use more than one LLM)
Would have been neat to see what the human consensus was and if it
differed from the LLM consensus
Anyway, great talk!
zahlman wrote 1 day ago:
It would have been interesting to see if the LLM that Claude judged
worst would have attempted to justify itself....
qeternity wrote 1 day ago:
I think you mean non-deterministic, instead of probabilistic.
And there is no reason that these models need to be
non-deterministic.
rvz wrote 1 day ago:
> I think you mean non-deterministic, instead of probabilistic.
My thoughts too. It's more accurate to label LLMs as
non-deterministic instead of "probabilistic".
skybrian wrote 1 day ago:
A deterministic algorithm can still be unpredictable in a sense. In
the extreme case, a procedural generator (like in Minecraft) is
deterministic given a seed, but you will still have trouble
predicting what you get if you change the seed, because internally
it uses a (pseudo-)random number generator.
So there’s still the question of how controllable the LLM really
is. If you change a prompt slightly, how unpredictable is the
change? That can’t be tested with one prompt.
simonw wrote 1 day ago:
It might not be 100% clear from the writing but this benchmark is
mainly intended as a joke - I built a talk around it because it's a
great way to make the last six months of model releases a lot more
entertaining.
I've been considering an expanded version of this where each model
outputs ten images, then a vision model helps pick the "best" of
those to represent that model in a further competition with other
models.
(Then I would also expand the judging panel to three vision LLMs from
different model families which vote on each round... partly because
it will be interesting to track cases where the judges disagree.)
I'm not sure if it's worth me doing that though since the whole
"benchmark" is pretty silly. I'm on the fence.
dilap wrote 1 day ago:
Joke or not, it still correlates much better with my own subjective
experiences of the models than LM Arena!
fzzzy wrote 1 day ago:
Even if it is a joke, having a consistent methodology is useful. I
did it for about a year with my own private benchmark of reasoning
type questions that I always applied to each new open model that
came out. Run it once and you get a random sample of performance.
Got unlucky, or got lucky? So what. That's the experimental
protocol. Running things a bunch of times and cherry picking the
best ones adds human bias, and complicates the steps.
simonw wrote 1 day ago:
It wasn't until I put these slides together that I realized quite
how well my joke benchmark correlates with actual model
performance - the "better" models genuinely do appear to draw
better pelicans and I don't really understand why!
og_kalu wrote 23 hours 57 min ago:
LLMs also have a 'g factor'
[1]: https://www.sciencedirect.com/science/article/pii/S016...
johnrob wrote 1 day ago:
Well, the most likely single random sample would be a
“representative” one :)
tuananh wrote 1 day ago:
until they start targeting this benchmark
simonw wrote 1 day ago:
Right, that was the closing joke for the talk.
jonstewart wrote 1 day ago:
It is funny to think that a hundred years in the future
there may be some vestigial area of the models’ networks
that’s still tuned to drawing pelicans on bicycles.
more-nitor wrote 1 day ago:
I just don't get the fuss from the pro-LLM people who don't
want anyone to shame their LLMs...
people expect LLMs to say "correct" stuff on the first attempt,
not 10000 attempts.
Yet, these people are perfectly OK with cherry-picked success
stories on youtube + advertisements, while being extremely
vehement about this simple experiment...
...well maybe these people rode the LLM hype-train too early,
and are desperate to defend LLMs lest their investment go poof?
obligatory hype-graph classic:
[1]: https://upload.wikimedia.org/wikipedia/commons/thumb/9...
MichaelZuo wrote 1 day ago:
I imagine the straightforward reason is that the “better”
models are in fact significantly smarter in some tangible way,
somehow.
pama wrote 1 day ago:
How did the pelicans of point releases of V3 and of R1
(R1-0528) do compared to the original versions of the models?
demosthanos wrote 1 day ago:
I'd say definitely do not do that. That would make the benchmark
look more serious while still being problematic for knowledge
cutoff reasons. Your prompt has become popular even outside your
blog, so the odds of some SVG pelicans on bicycles making it into
the training data have been going up and up.
Karpathy used it as an example in a recent interview:
[1]: https://www.msn.com/en-in/health/other/ai-expert-asks-grok...
telotortium wrote 19 hours 28 min ago:
Yeah, Simon needs to release a new benchmark under a pen name,
like Stephen King did with Richard Bachman.
throwaway31131 wrote 1 day ago:
I’d say it doesn’t really matter. There is no universally
good benchmark and really they should only be used to answer very
specific questions which may or may not be relevant to you.
Also, as the old saying goes, the only thing worse than using
benchmarks is not using benchmarks.
6LLvveMx2koXfwn wrote 1 day ago:
I would definitely say he had no intention of doing that and was
doubling down on the original joke.
colecut wrote 1 day ago:
The road to hell is paved with the best intentions
clarification: I enjoyed the pelican on a bike and don't think
it's that bad =p
diggan wrote 1 day ago:
Yeah, this is the problem with benchmarks where the
questions/problems are public. They're valuable for some months,
until it bleeds into the training set. I'm certain a lot of the
"improvements" we're seeing are just benchmarks leaking into the
training set.
travisgriggs wrote 1 day ago:
That’s ok, once bicycle “riding” pelicans become
normative, we can ask it for images of pelicans humping
bicycles.
The number of subject-verb-objects are near infinite. All are
imaginable, but most are not plausible. A plausibility machine
(LLM) will struggle with the implausible, until it can abstract
well.
zahlman wrote 1 day ago:
I can't fathom this working, simply because building a model
that relates the word "ride" to "hump" seems like something
that would be orders of magnitude easier for an LLM than
visualizing the result of SVG rendering.
diggan wrote 1 day ago:
> The number of subject-verb-objects are near infinite. All
are imaginable, but most are not plausible
Until there are enough unique/new subject-verb-object
examples/benchmarks that the trained model actually generalizes,
just like you did. (Public) benchmarks need to constantly
evolve, otherwise they stop being useful.
demosthanos wrote 1 day ago:
To be fair, once it does generalize the pattern then the
benchmark is actually measuring something useful for
deciding if the model will be able to produce a
subject-verb-object SVG.
ontouchstart wrote 1 day ago:
Very nice talk, accessible to the general public and to AI agents as
well.
Any concerns that open-source "AI celebrity talks" like yours could be
used in contexts that would allow LLM models to optimize
their market share in ways that we can't imagine yet?
Your talk might influence the funding of AI startups.
#butterflyEffect
threecheese wrote 1 day ago:
I welcome a VC funded pelican … anything! Clippy 2.0 maybe?
Simon, hope you are comfortable in your new role of AI Celebrity.
planb wrote 1 day ago:
And by a sample that has become increasingly known as a benchmark.
Newer training data will contain more articles like this one, which
naturally improves the capabilities of an LLM to estimate what’s
considered a good „pelican on a bike“.
viraptor wrote 1 day ago:
Would it though? There really aren't that many valid answers to
that question online. When this is talked about, we get more broken
samples than reasonable ones. I feel like any talk about this
actually sabotages future training a bit.
I actually don't think I've seen a single correct svg drawing for
that prompt.
criddell wrote 1 day ago:
And that’s why he says he’s going to have to find a new
benchmark.
cyanydeez wrote 1 day ago:
So what you really need to do is clone this blog post, find and
replace pelican with any other noun, run all the tests, and publish
that.
Call it wikipediaslop.org
YuccaGloriosa wrote 1 day ago:
If the any other noun becomes fish... I think I disagree.
puttycat wrote 1 day ago:
You are right, but the companies making these models invest a lot of
effort in marketing them as anything but probabilistic, i.e. making
people think that these models work discretely like humans.
In that case we'd expect a human with perfect drawing skills and
perfect knowledge about bikes and birds to output such a simple
drawing correctly 100% of the time.
In any case, even if a model is probabilistic, if it had correctly
learned the relevant knowledge you'd expect the output to be perfect
because it would serve to lower the model's loss. These outputs
clearly indicate flawed knowledge.
bufferoverflow wrote 1 day ago:
> work discretely like humans
What kind of humans are you surrounded by?
Ask any human to write 3 sentences about a specific topic. Then ask
them the same exact question next day. They will not write the same
3 sentences.
cyanydeez wrote 1 day ago:
Humans absolutely do not work discretely.
loloquwowndueo wrote 1 day ago:
They probably meant deterministically as opposed to
probabilistically. Not that humans work like that either :)
aspenmayer wrote 1 day ago:
I thought they meant discreetly.
ben_w wrote 1 day ago:
> In that case we'd expect a human with perfect drawing skills and
perfect knowledge about bikes and birds to output such a simple
drawing correctly 100% of the time.
Look upon these works, ye mighty, and despair:
[1]: https://www.gianlucagimini.it/portfolio-item/velocipedia/
rightbyte wrote 1 day ago:
That blog post is a 10/10. Oh dear I miss the old internet.
jodrellblank wrote 1 day ago:
You claim those are drawn by people with "perfect knowledge about
bikes" and "perfect drawing skills"?
ben_w wrote 1 day ago:
More that "these models work … like humans" (discretely or
otherwise) does not imply the quotation.
Most humans do not have perfect drawing skills and perfect
knowledge about bikes and birds, they do not output such a
simple drawing correctly 100% of the time.
"Average human" is a much lower bar than most people want to
believe, mainly because most of us are average on most skills,
and also overestimate our own competence — the modal human
has just a handful of things they're good at, and one of those
is the language they use, another is their day job.
Most of us can't draw, and demonstrably can't remember (or
figure out from first principles) how a bike works. But this
also applies to "smart" subsets of the population: physicists
have [1], and there's that famous rocket scientist who weighed
in on rescuing kids from a flooded cave and came up with some
nonsense about a submarine.
[1]: https://xkcd.com/793/
Retric wrote 1 day ago:
It’s not that humans have perfect drawing skills, it’s
that humans can judge their performance and get better over
time.
Ask 100 random people to draw a bike and in 10 minutes and
they’ll on average suck while still beating the LLM’s
here. Give em an incentive and 10 months and the average
person is going to be able to make at least one quite decent
drawing of a bike.
The cost and speed advantage of LLM’s is real as long as
you’re fine with extremely low quality. Ask a model for
10,000 drawings so you can pick the best and you get a
marginal improvements based on random chance at a steep
price.
ben_w wrote 1 day ago:
> Ask 100 random people to draw a bike and in 10 minutes
and they’ll on average suck while still beating the
LLM’s here.
Y'see, this is a prime example of what I meant with
""Average human" is a much lower bar than most people want
to believe, mainly because most of us are average on most
skills, and also overestimate our own competence".
An expert artist can spend 10 minutes and end up with a
brief sketch of a bike. You can witness this exact duration
yourself (with non-bike examples) because of a challenge a
few years back to draw the same picture in 10 minutes, 1
minute, and 10 seconds.
A normal person spending as much time as they like gets you
the pictures that I linked to in the previous post, because
they don't really know what a bike is. 45 examples of what
normal people think a bike looks like: [1] > Give em an
incentive and 10 months and the average person is going to
be able to make at least one quite decent drawing of a
bike.
Given mandatory art lessons in school are longer than 10
months, and yet those bike examples exist, I have no reason
to believe this.
> Ask a model for 10,000 drawings so you can pick the best
and you get a marginal improvements based on random chance
at a steep price.
If you do so as a human, rating and comparing images? Then
the cost is your own time.
If you automate it in literally the manner in this write-up
(pairwise comparison via API calls to another model to get
ELO ratings), ten thousand images is like $60-$90, which is
on the low end for a human commission.
[1]: https://www.gianlucagimini.it/portfolio-item/veloc...
Retric wrote 1 day ago:
As an objective criterion, what percentage include pedals
and a chain connecting one of the wheels? I quickly found
a dozen and stopped counting. Now do the same for those
LLM images and it's clear humans win.
> ""Average human" is a much lower bar than most people
want to believe
I have some basis for comparison. I've seen 6-year-olds
draw better bikes than those LLMs.
Look through that list again: the worst example doesn't even
have wheels, and multiple of them have wheels that aren't
connected to anything.
Now if you’re arguing the average human is worse than
the average 6 year old I’m going to disagree here.
> Given mandatory art lessons in school are longer than
10 months, and yet those bike examples exist, I have no
reason to believe this.
Art lessons don’t cumulatively spend 10 months teaching
people how to draw a bike. I don’t think I
cumulatively spent 6 months drawing anything. Painting,
collage, sculpture, coloring, etc art covers a lot and
wasn’t an every day or even every year thing. My
mandatory collage class was art history, we didn’t
create any art.
You may have spent more time in class studying drawing,
but that’s not some universal average.
> If you automate it in literally the manner in this
write-up (pairwise comparison via API calls to another
model to get ELO ratings), ten thousand images is like
$60-$90, which is on the low end for a human commission.
Not every one of those images had a price tag, but one was
88 cents; * 10,000 = $8,800 just to make the images for a
test. Even at 4c/image you're looking at $400. Cheaper
models existed but fairly consistently had worse
performance.
simonw wrote 1 day ago:
The 88 cent one was the most expensive by almost an
order of magnitude. Most of these cost less than a cent
to generate - that's why I highlighted the price on the
o1-pro output.
Retric wrote 1 day ago:
Yes, but if you’re averaging cheap and expensive
options the expensive ones make a significant
difference. Cheaper is bound by 0 so it can’t
differ as much from the average.
Also, when you’re talking about how cheap something
is, including the price makes sense. I had no idea
on many of those models.
simonw wrote 1 day ago:
If you're interested, you can get cost estimates
from my pricing calculator site here: [1] That link
seeds it with 11 input tokens and 1200 output
tokens - 11 input tokens is what most models use
for "Generate an SVG of a pelican riding a bicycle"
and 1200 is the number of output tokens used for
some of the larger outputs.
Click on different models to see estimated prices.
They range from 0.0168 cents for Amazon Nova Micro
(that's less than 2/100ths of a cent) up to 72
cents for o1-pro.
The most expensive model most people would consider
is Claude 4 Opus, at 9 cents.
GPT-4o is the upper end of the most common prices,
at 1.2 cents.
[1]: https://www.llm-prices.com/#it=11&ot=1200
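The arithmetic behind those estimates is just token counts
multiplied by per-million-token rates. A minimal sketch, assuming
the list prices below (they are assumptions based on public pricing
at the time, not values pulled from the calculator itself):
```
# Rough reproduction of llm-prices.com-style estimates for the
# pelican prompt. Prices are (input $, output $) per million tokens
# and are assumptions; check current rates before relying on them.
PRICES_PER_MILLION = {
    "amazon-nova-micro": (0.035, 0.14),
    "gpt-4o":            (2.50, 10.00),
    "claude-4-opus":     (15.00, 75.00),
    "o1-pro":            (150.00, 600.00),
}

def estimate_cents(model, input_tokens=11, output_tokens=1200):
    in_rate, out_rate = PRICES_PER_MILLION[model]
    dollars = (input_tokens * in_rate + output_tokens * out_rate) / 1e6
    return dollars * 100  # convert to cents

for name in PRICES_PER_MILLION:
    print(f"{name}: {estimate_cents(name):.4f} cents")
# amazon-nova-micro ~0.0168, gpt-4o ~1.2, claude-4-opus ~9.0, o1-pro ~72
```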
Retric wrote 1 day ago:
Thanks
zahlman wrote 1 day ago:
> A normal person spending as much time as they like gets
you the pictures that I linked to in the previous post,
because they don't really know what a bike is. 45
examples of what normal people think a bike looks like:
[1]
A normal person given the ability to consult a
picture of a bike while drawing will do much better. An
LLM agent can effectively refresh its memory (or attempt
to look up information on the Internet) any time it
wants.
[1]: https://www.gianlucagimini.it/portfolio-item/vel...
ben_w wrote 10 hours 44 min ago:
> A normal person given the ability to consult a
picture of a bike while drawing will do much better. An
LLM agent can effectively refresh its memory (or
attempt to look up information on the Internet) any
time it wants.
Some models can when allowed to, but I don't believe
Simon Willison was testing that?
joshstrange wrote 1 day ago:
I really enjoy Simon’s work in this space. I’ve read almost every
blog post they’ve posted on this and I love seeing them poke and prod
the models to see what pops out. The CLI tools are all very easy to use
and complement each other nicely all without trying to do too much by
themselves.
And at the end of the day, it’s just so much fun to see someone else
having so much fun. He’s like a kid in a candy store and that
excitement is contagious. After reading every one of his blog posts,
I’m inspired to go play with LLMs in some new and interesting way.
Thank you Simon!
blackhaj7 wrote 1 day ago:
Same sentiment!
dotemacs wrote 1 day ago:
The same here.
Because of him, I installed an RSS reader so that I don't miss any
of his posts. And I know that he shares the same ones across
Twitter, Mastodon & Bsky...
neepi wrote 1 day ago:
My only take home is they are all terrible and I should hire a
professional.
vunderba wrote 23 hours 59 min ago:
This test isn't really about the quality of the image itself
(multimodals like gpt-image-1 or even standard diffusion models would
be far superior) - it's about following a spec that describes how to
draw.
A similar test would be if you asked for the pelican on a bicycle
through a series of LOGO instructions.
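As a rough illustration of that "draw via instructions" idea, here
is a tiny sketch using Python's turtle module as a stand-in for
LOGO; the shapes and coordinates are made up for the example, and
the point is only that the model never touches pixels, it just
emits drawing commands.
```
import turtle

t = turtle.Turtle()

def wheel(x, y, radius=40):
    # circle() draws with the centre to the turtle's left, so start
    # at the bottom of the wheel heading east.
    t.penup(); t.goto(x, y - radius); t.setheading(0); t.pendown()
    t.circle(radius)

wheel(-80, 0)   # rear wheel
wheel(80, 0)    # front wheel
t.penup(); t.goto(-80, 0); t.pendown()
t.goto(0, 60); t.goto(80, 0)   # a very crude frame
turtle.done()
```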
spaceman_2020 wrote 1 day ago:
My only take home is that a spanner can work as a hammer, but you
probably should just get a hammer
jug wrote 1 day ago:
Before that, you might ask ChatGPT to create a vector image of a
pelican riding a bicycle and then running the output through a PNG to
SVG converter...
Result: [1] These are tough benchmarks to trial reasoning by having
it _write_ an SVG file by hand and understanding how it's to be
written to achieve this. Even a professional would struggle with
that! It's _not_ a benchmark to give an AI the best tools to actually
do this.
[1]: https://www.dropbox.com/scl/fi/8b03yu5v58w0o5he1zayh/pelican...
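For what it's worth, that bitmap-then-vectorize route can be
scripted; a minimal sketch, assuming ImageMagick and potrace are
installed and that the PNG itself came from an image model such as
gpt-image-1 (the filenames are placeholders):
```
import subprocess

# potrace only reads bitmap formats like PGM/PBM/BMP, so convert
# the PNG first with ImageMagick.
subprocess.run(["convert", "pelican.png", "pelican.pgm"], check=True)

# -s selects potrace's SVG backend; -o names the output file.
subprocess.run(["potrace", "-s", "pelican.pgm", "-o", "pelican.svg"],
               check=True)
```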
YuccaGloriosa wrote 1 day ago:
I think you made an error there; PNG is a bitmap format.
sethaurus wrote 1 day ago:
You've misunderstood. The parent was making a specific point —
if you want an SVG of a pelican, the easiest way to AI-generate
it is to get an image generator to create a (vector-styled)
bitmap, then auto-vectorize it to SVG. But the point of this
benchmark is that it's asking models to create an SVG the hard
way, by writing its code directly.
GaggiX wrote 1 day ago:
An expert at writing SVGs?
keiferski wrote 1 day ago:
As the other guy said, these are text models. If you want to make
images use something like Midjourney.
Promoting a pelican riding a bicycle makes a decent image there.
keiferski wrote 1 day ago:
* Prompting
matkoniecz wrote 1 day ago:
It depends on the quality you need and your budget.
neepi wrote 1 day ago:
Ah yes the race to the bottom argument.
ben_w wrote 1 day ago:
When I was at university, they got some people from industry to
talk to us all about our CVs and how to do interviews.
My CV had a stupid cliché, "committed to quality", which they
correctly picked up on — "What do you mean?" one of them asked
me, directly.
I thought this meant I was focussed on being the best. He didn't
like this answer.
His example, blurred by 20 years of my imperfect human memory,
was to ask me which is better: a Porsche, or a go-kart. Now,
obviously (or I wouldn't be saying this), Porsche was a trick
answer. Less obvious is that both were trick answers, because
their point was that the question was under-specified — quality
is the match between the product and what the user actually
wants, so if the user is a 10 year old who physically isn't big
enough to sit in a real car's driver's seat and just wants to
rush down a hill or along a track, none of the "quality" stuff that
makes a Porsche a Porsche is of any relevance at all, but what
does matter is the stuff that makes a go-kart into a go-kart…
one of which is the affordability.
LLMs are go-karts of the mind. Sometimes that's all you need.
neepi wrote 1 day ago:
I disagree. Quality depends on your market position and what
you are bringing to the market. Thus I would start with market
conditions and work back to quality. If you can't reach your
standards in the market then you shouldn't enter it. And if
your standards are poor, you should be ashamed.
Go-kart or Porsche is irrelevant.
ben_w wrote 1 day ago:
> Quality depends on your market position and what you are
bringing to the market.
That's the point.
The market for go-karts does not support Porsche.
If you bring a Porsche sales team to a go-kart race, nobody
will be interested.
Porsche doesn't care about this market. It goes both ways:
this market doesn't care about Porsche, either.
dist-epoch wrote 1 day ago:
Most of them are text-only models. Like asking a person born blind to
draw a pelican, based on what they heard it looks like.
neepi wrote 1 day ago:
That seems to be a completely inappropriate use case?
I would not hire a blind artist or a deaf musician.
wongogue wrote 1 day ago:
Even Beethoven?
simonw wrote 1 day ago:
Yeah, that's part of the point of this. Getting a state of the
art text generating LLM to generate SVG illustrations is an
inappropriate application of them.
It's a fun way to deflate the hype. Sure, your new LLM may have
cost XX million to train and beat all the others on the
benchmarks, but when you ask it to draw a pelican on a bicycle it
still outputs total junk.
dist-epoch wrote 1 day ago:
tried starting from an image: [1] lol: [2]
[1]: https://chatgpt.com/share/684582a0-03cc-8006-b5b5-de51...
[2]: https://gemini.google.com/share/4d1746a234a8
dmd wrote 1 day ago:
Sorry, Beethoven, you just don’t seem to be a match for our
org. Best of luck on your search!
You too, Monet. Scram.
__alexs wrote 1 day ago:
I guess the idea is that by asking the model to do something that
is inherently hard for it we might learn something about the
baseline smartness of each model which could be considered a
predictor for performance at other tasks too.
namibj wrote 1 day ago:
It's a proxy for abstract designing, like writing software or
designing in a parametric CAD tool.
Most of the non-math design work of applied engineering AFAIK falls
under the umbrella that's tested with the pelican riding the
bicycle.
You have to make a mental model and then turn it into applicable
instructions.
Program code/SVG markup/parametric CAD instructions don't really
differ in that aspect.
neepi wrote 1 day ago:
I would not assume that this methodology applies to applied
engineering, as a former actual real tangible meat space
engineer. Things are a little nuanced and the nuances come from
a combination of communication and experience, neither of which
any LLM has any insight into at all. It's not out there on the
internet to train it with and it's not even easy to put it into
abstract terms which can be used as training data. And
engineering itself in isolation doesn't exist - there is a
whole world around it.
Ergo, no, you can't just throw a bicycle into an LLM and have a
parametric model drop out into SolidWorks, then a machine
makes it and everyone buys it. That is the hope, really, isn't
it? You end up with a useless shitty bike with a shit pelican
on it.
The biggest problem we have in the LLM space is that no one
really understands any of the proposed use cases well enough,
and neither does anyone being told that it works for those use
cases.
rjsw wrote 1 day ago:
I don't think any of that matters, CEOs will decide to use it
anyway.
neepi wrote 1 day ago:
This is sad but true.
dist-epoch wrote 1 day ago:
[1]: https://www.solidworks.com/lp/evolve-your-design-wor...
neepi wrote 1 day ago:
Yeah good luck with that. Seriously.
dist-epoch wrote 1 day ago:
The point is about exploring the capabilities of the model.
Like asking you to draw a 2D projection of a 4D sphere intersected
with a 4D torus or something.
kevindamm wrote 1 day ago:
Yeah, I suppose it is similar... I don't know their diameters,
rotations, nor the distance between their centers, nor which
two dimensions, so I would have to guess a lot about what you
meant.