| _______ __ _______ | |
| | | |.---.-..----.| |--..-----..----. | | |.-----..--.--.--..-----. | |
| | || _ || __|| < | -__|| _| | || -__|| | | ||__ --| | |
| |___|___||___._||____||__|__||_____||__| |__|____||_____||________||_____| | |
on Gopher (unofficial) | |
| Visit Hacker News on the Web | |
| COMMENT PAGE FOR: | |
| The last six months in LLMs, illustrated by pelicans on bicycles | |
| beefnugs wrote 1 hour 43 min ago: | |
I think it's hilarious how humans can make mistakes interpreting the | |
crazy drawings: He says "I like how it solved the problem of pelicans | |
not fitting on bicycles by adding a second smaller bicycle to the | |
stack." | |
No... that is an attempt at actually drawing the pedals, and putting | |
the pelican's feet right on the pedals! | |
| buserror wrote 6 hours 22 min ago: | |
| The hilarious bit is that this page will soon be scraped by ai-bots as | |
| learning material, and they'll all learn to draw pelicans on bicycles | |
| using this as their primary example material, as they'll be the only | |
| examples. | |
| GIGO in motion :-) | |
| darkoob12 wrote 7 hours 40 min ago: | |
Should we be this excited about AI, calling a fraud and plagiarism | |
machine "ChatGPT Mischief Buddy" without any moral deliberation? | |
| simonw wrote 7 hours 13 min ago: | |
| The "mischief buddy" joke is a poke at exactly that. | |
| 0points wrote 13 hours 19 min ago: | |
So the only bird with anything slightly resembling a pelican beak was | |
drawn by Gemini 2.5 Pro. In general, none of the output resembles a | |
pelican enough that you could separate it from "a bird". | |
OP seems to ignore that a pelican has a distinct look when evaluating | |
these doodles. | |
| simonw wrote 13 hours 17 min ago: | |
| The pelican's distinct look - and the fact that none of the models | |
| can capture it - is the whole point. | |
| irthomasthomas wrote 20 hours 2 min ago: | |
| The best pelicans come from running a consortium of models. I use | |
| pelicans as evals now. [1] Test it using VibeLab (wip) | |
| [1]: https://x.com/xundecidability/status/1921009133077053462 | |
| [2]: https://x.com/xundecidability/status/1926779393633857715 | |
| m3047 wrote 20 hours 27 min ago: | |
| TIL: Snitchbench! | |
| NohatCoder wrote 22 hours 28 min ago: | |
If you calculate Elo based on a round-robin tournament with all | |
participants starting out on the same score, then the resulting ratings | |
should simply correspond to the win count. I guess the algorithm in use | |
takes into account the order of the matches, but taking order into | |
account is only meaningful when competitors are expected to develop | |
significantly; otherwise it is just added noise, so we never want to do | |
so in competitions between bots. | |
| I also can't help but notice that the competition is exactly one match | |
| short, for some reason exactly one of the 561 possible pairings has not | |
| been included. | |
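For reference, a minimal sketch of the standard Elo update rule being discussed. The K-factor and starting rating here are conventional assumptions, not values from the post; the point is that each update depends on the ratings at match time, which is where the order sensitivity comes from.

```javascript
// Minimal Elo rating update (standard formula).
// K = 32 and equal starting ratings are conventional assumptions.
const K = 32;

// Expected score of a player rated `ra` against one rated `rb`.
function expected(ra, rb) {
  return 1 / (1 + Math.pow(10, (rb - ra) / 400));
}

// Returns [newWinnerRating, newLoserRating] after one decisive match.
function update(winner, loser) {
  const ew = expected(winner, loser); // winner's expected score
  return [winner + K * (1 - ew), loser - K * (1 - ew)];
}
```

With equal ratings the first match transfers K/2 = 16 points; every later transfer depends on the ratings at that moment, so the same set of results in a different order can yield different final ratings.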
| simonw wrote 21 hours 56 min ago: | |
| Yeah, that's a good call out: Elo isn't actually necessary if you can | |
| have every competitor battle every other competitor exactly once. | |
| The missing match is because one single round was declared a draw by | |
| the model, and I didn't have time to run it again (the Elo stuff was | |
| very much rushed at the last minute.) | |
| NicoSchwandner wrote 23 hours 42 min ago: | |
| Nice post, thanks! | |
| zurichisstained wrote 1 day ago: | |
Wow, I love this benchmark - I've been doing something similar (as a | |
joke, and much less frequently), where I ask multiple models to | |
| attempt to create a data structure like: | |
| ``` | |
| const melody = [ | |
| { freq: 261.63, duration: 'quarter' }, // C4 | |
| { freq: 0, duration: 'triplet' }, // triplet rest | |
| { freq: 293.66, duration: 'triplet' }, // D4 | |
| { freq: 0, duration: 'triplet' }, // triplet rest | |
| { freq: 329.63, duration: 'half' }, // E4 | |
| ] | |
| ``` | |
| But with the intro to Smoke on the Water by Deep Purple. Then I run it | |
| through the Web Audio API and see how it sounds. | |
| It's never quite gotten it right, but it's gotten better, to the point | |
| where I can ask it to make a website that can play it. | |
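A sketch of how such a structure can be expanded into a playback schedule. The seconds-per-duration mapping is my own assumption for illustration, and actual playback via the Web Audio API is browser-only, so it is only hinted at in a comment.

```javascript
// Expand a melody array into {start, freq, duration} events.
// The duration-to-seconds mapping is an assumed example, not from the post.
const DURATIONS = { quarter: 0.5, half: 1.0, triplet: 0.5 / 3 };

function schedule(melody) {
  let t = 0;
  const events = [];
  for (const note of melody) {
    const d = DURATIONS[note.duration];
    if (note.freq > 0) events.push({ start: t, freq: note.freq, duration: d });
    t += d; // rests (freq 0) just advance the clock
  }
  return events;
}

// In a browser, each event would drive a Web Audio OscillatorNode, roughly:
//   osc.frequency.value = e.freq;
//   osc.start(ctx.currentTime + e.start);
//   osc.stop(ctx.currentTime + e.start + e.duration);
```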
I think yours is a lot more thoughtful about testing novelty, but it's | |
interesting to see them attempt to do things that they aren't really | |
built for (in theory!). [1] - ChatGPT 4 Turbo [2] - Claude Sonnet 3.7 | |
| [3] - Gemini 2.5 Pro | |
| Gemini is by far the best sounding one, but it's still off. I'd be | |
| curious how the latest and greatest (paid) versions fare. | |
| (And just for comparison, here's the first time I did it... you can | |
| tell I did the front-end because there isn't much to it!) | |
| [1]: https://codepen.io/mvattuone/pen/qEdPaoW | |
| [2]: https://codepen.io/mvattuone/pen/ogXGzdg | |
| [3]: https://codepen.io/mvattuone/pen/ZYGXpom | |
| [4]: https://nitter.space/mvattuone/status/1646610228748730368#m | |
| ojosilva wrote 20 hours 24 min ago: | |
Drawbacks of using a pelican-on-a-bicycle SVG: it's a very | |
open-ended prompt with no specific criteria to judge, and lately the | |
SVGs all start to look similar, or at least like they accomplished the | |
same non-goals (there's a pelican, there's a bicycle, and I'm not sure | |
whether its feet should be on the saddle or on the pedals), so it's | |
hard to agree on which is better. And, certainly, with an LLM as a | |
judge, the entire game becomes double-hinged and who knows what to think. | |
| Also, if it becomes popular, training sets may pick it up and improve | |
| models unfairly and unrealistically. But that's true of any known | |
| benchmark. | |
Side note: I'd really like to see the Language Benchmark Game become | |
a prompt-based languages x models benchmark game. So we could say | |
model X excels at Python Fasta, etc., although then the risk is that, | |
again, it becomes a training set and the whole thing self-rigs itself. | |
| dr_kretyn wrote 22 hours 33 min ago: | |
| I'm slightly confused by your example. What's the actual prompt? Is | |
| your expectation that a text model is going to know how to perform | |
| the exact song in audio? | |
| zurichisstained wrote 20 hours 15 min ago: | |
| Ohhh absolutely not, that would be pretty wild - I just wanted to | |
| see if it could understand musical notation enough to come up with | |
| the correct melody. | |
| I know there are far better ways to do gen AI with music, this was | |
| just a joke prompt that worked far better than I expected. | |
| My naive guess is all of the guitar tabs and signal processing info | |
| it's trained on gives it the ability to do stuff like this (albeit | |
| not very well). | |
| isx726552 wrote 1 day ago: | |
> I've been feeling pretty good about my benchmark! It should stay | |
useful for a long time... provided none of the big AI labs catch on. | |
> And then I saw this in the Google I/O keynote a few weeks ago, in a | |
blink-and-you'll-miss-it moment! There's a pelican riding a | |
bicycle! They're on to me. I'm going to have to switch to something | |
else. | |
Yeah this touches on an issue that makes it very difficult to have a | |
discussion in public about AI capabilities. Any specific test you talk | |
about, no matter how small... if the big companies get wind of it, it | |
will be RLHF'd away, sometimes to the point of absurdity. Just refer | |
to the old "count the 'r's in strawberry" canard for one | |
example. | |
| lofaszvanitt wrote 6 hours 28 min ago: | |
You push SHA-512 hashes of things to a GitHub repo along with a short | |
sentence: | |
x8 version: still shit | |
. | |
. | |
x15 version: we are closing in, but overall a shit experience :D | |
This way they won't know what to improve upon. Of course they can buy | |
access. ;P | |
When they finally solve your problem you can reveal what the | |
benchmark was. | |
| Choco31415 wrote 22 hours 36 min ago: | |
| Just tried that canard on GPT-4o and it failed: | |
"The word "strawberry" contains 2 letter r's." | |
| belter wrote 2 hours 49 min ago: | |
| I tried | |
| strawberry -> DeepSeek, GeminiPro and ChatGPT4o all correctly said | |
| three | |
| strawberrry -> DeepSeek, GeminiPro and ChatGPT4o all correctly said | |
| four | |
| stawberrry -> DeepSeek, GeminiPro all correctly said three | |
ChatGPT4o, even in a new chat, incorrectly said the word | |
"stawberrry" contains 4 letter "r" characters. It even provided this | |
useful breakdown to let me know :-) | |
Breakdown: | |
stawberrry -> s, t, a, w, b, e, r, r, r, y -> 4 r's | |
And then it asked if I meant "strawberry" instead, and said that | |
one has 2 r's.... | |
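The counting task itself is of course trivial in code, which is what makes it such a pointed probe of tokenized models; a one-liner:

```javascript
// Count occurrences of a letter in a word - the task the models keep fumbling.
const countLetter = (word, letter) =>
  [...word].filter((c) => c === letter).length;
```

For example, `countLetter("strawberry", "r")` is 3, `countLetter("strawberrry", "r")` is 4, and `countLetter("stawberrry", "r")` is 3, matching the variants tried above.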
| simonw wrote 22 hours 52 min ago: | |
| Honestly, if my stupid pelican riding a bicycle benchmark becomes | |
| influential enough that AI labs waste their time optimizing for it | |
| and produce really beautiful pelican illustrations I will consider | |
| that a huge personal win. | |
| MattRix wrote 23 hours 4 min ago: | |
| This is why things like the ARC Prize are better ways of approaching | |
| this: | |
| [1]: https://arcprize.org | |
| whiplash451 wrote 12 hours 39 min ago: | |
Well, ARC-1 did not end well for the competitors of tech giants and | |
it's very unclear that ARC-2 won't follow the same trajectory. | |
| joshuajooste05 wrote 1 day ago: | |
Does anyone have any thoughts on privacy/safety regarding what he said | |
about GPT memory? | |
| I had heard of prompt injection already. But, this seems different, | |
| completely out of humans control. Like even when you consider web | |
| search functionality, he is actually right, more and more, users are | |
| losing control over context. | |
| Is this dangerous atm? Do you think it will become more dangerous in | |
| the future when we chuck even more data into context? | |
| threeseed wrote 21 hours 53 min ago: | |
| I've had Cursor/Claude try to call rm -rf on my entire User directory | |
| before. | |
| The issue is that LLMs have no ability to organise their memory by | |
| importance. Especially as the context size gets larger. | |
| So when they are using tools they will become more dangerous over | |
| time. | |
| ActorNightly wrote 1 day ago: | |
Sort of. The thing is, with agentic models you are basically entering a | |
probability space where the model can take real actions, in the form of | |
HTTP requests, if the statistical output leads it there. | |
| Joker_vD wrote 1 day ago: | |
| > most people find it difficult to remember the exact orientation of | |
| the frame. | |
Isn't it a Λ and a Δ welded together? The bottom left and right vertices | |
are where the wheels are attached, the middle bottom point is where | |
the big gear with the pedals is. The lambda is for the front wheel, | |
because you wouldn't be able to turn it if it was attached to a delta. | |
| Right? | |
I guess having my first bicycle be a cheap Soviet-era one paid | |
off: I spent loads of time fiddling with the chain tension and | |
pulling the chain back onto the gears, so I had to stare at the | |
frame so much that even today I haven't forgotten the way it looks. | |
| pbronez wrote 1 day ago: | |
| There are a lot of structural details that people tend to gloss over. | |
| This was illustrated by an Italian art project: [1] > back in 2009 I | |
| began pestering friends and random strangers. I would walk up to them | |
with a pen and a sheet of paper asking that they immediately draw me | |
a men's bicycle, by heart. Soon I found out that when confronted | |
| with this odd request most people have a very hard time remembering | |
| exactly how a bike is made. | |
| [1]: https://www.gianlucagimini.it/portfolio-item/velocipedia/ | |
| zahlman wrote 1 day ago: | |
> If you lost interest in local models - like I did eight months | |
ago - it's worth paying attention to them again. They've got good | |
now! | |
> As a power user of these tools, I want to stay in complete control of | |
what the inputs are. Features like ChatGPT memory are taking that | |
control away from me. | |
| You reap what you sow.... | |
| > I already have a tool I built called shot-scraper, a CLI app that | |
| lets me take screenshots of web pages and save them as images. I had | |
| Claude build me a web page that accepts ?left= and ?right= parameters | |
| pointing to image URLs and then embeds them side-by-side on a page. | |
| Then I could take screenshots of those two images side-by-side. I | |
generated one of those for every possible match-up of my 34 pelican | |
pictures - 560 matches in total. | |
| Surely it would have been easier to use a local tool like ImageMagick? | |
| You could even have the AI write a Bash script for you. | |
| > ... but prompt injection is still a thing. | |
| ...Why wouldn't it always be? There's no quoting or escaping mechanism | |
| that's actually out-of-band. | |
> There's this thing I'm calling the lethal trifecta, which is when | |
you have an AI system that has access to private data, and potential | |
exposure to malicious instructions - so other people can trick it into | |
doing things... and there's a mechanism to exfiltrate stuff. | |
| People in 2025 actually need to be told this. Franklin missed the mark | |
| - people today will trip over themselves to give up both their security | |
| and their liberty for mere convenience. | |
| simonw wrote 1 day ago: | |
| I had the LLM write a bash script for me that used my [1] tool - on | |
| the basis that it was a neat opportunity to demonstrate another of my | |
| own projects. | |
And honestly, even with LLM assistance, getting ImageMagick to output | |
a 1200x600 image with two SVGs next to each other that are correctly | |
resized to fill their half of the image sounds pretty tricky. | |
Probably easier (for Claude) to achieve with HTML and CSS. | |
| [1]: https://shot-scraper.datasette.io/ | |
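One SVG-native alternative to the HTML+screenshot route: nest the two images as child `<svg>` viewports inside a 1200x600 outer SVG. A hedged sketch; it assumes each input is a standalone `<svg>` document carrying a `viewBox` (and no `width`/`height` attributes of its own), which is what lets each half scale its content.

```javascript
// Compose two standalone SVG documents side-by-side into one SVG.
// Assumption: each child has a viewBox and no width/height attributes,
// so adding x/width/height to its root tag positions and scales it.
function sideBySide(leftSvg, rightSvg, w = 1200, h = 600) {
  const place = (svg, x) =>
    svg.replace(/<svg/, `<svg x="${x}" width="${w / 2}" height="${h}"`);
  return (
    `<svg xmlns="http://www.w3.org/2000/svg" width="${w}" height="${h}">` +
    place(leftSvg, 0) +
    place(rightSvg, w / 2) +
    `</svg>`
  );
}
```

The resulting file could then be rasterized in one step, though as noted, whether this ends up simpler than the HTML page depends on how well-behaved the model-generated SVGs are.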
| voiper1 wrote 1 day ago: | |
Isn't "left or right" _followed_ by rationale asking it to | |
rationalize its one-word answer? I thought we need to get AI to do | |
the chain of thought _before_ giving its answer for it to be more | |
accurate? | |
| simonw wrote 1 day ago: | |
| Yes it is - I would likely have gotten better results if I'd | |
| asked for the rationale first. | |
| zahlman wrote 1 day ago: | |
| > And honestly, even with LLM assistance getting Image Magick to | |
| output a 1200x600 image with two SVGs next to each other that are | |
| correctly resized to fill their half of the image sounds pretty | |
| tricky. | |
| FWIW, the next project I want to look at after my current two, is a | |
| command-line tool to make this sort of thing easier. Likely | |
| featuring some sort of Lisp-like DSL to describe what to do with | |
| the input images. | |
| username223 wrote 1 day ago: | |
| Interesting timeline, though the most relevant part was at the end, | |
| where Simon mentions that Google is now aware of the "pelican on | |
| bicycle" question, so it is no longer useful as a benchmark. FWIW, many | |
| things outside of the training data will pants these models. I just | |
| tried this query, which probably has no examples online, and Gemini | |
| gave me the standard puzzle answer, which is wrong: | |
| "Say I have a wolf, a goat, and some cabbage, and I want to get them | |
| across a river. The wolf will eat the goat if they're left alone, which | |
| is bad. The goat will eat some cabbage, and will starve otherwise. How | |
| do I get them all across the river in the fewest trips?" | |
| A child would pick up that you have plenty of cabbage, but can't leave | |
| the goat without it, lest it starve. Also, there's no mention of boat | |
| capacity, so you could just bring them all over at once. Useful? | |
| Sometimes. Intelligent? No. | |
| djherbis wrote 1 day ago: | |
| Kaggle recently ran a competition to do just this (draw SVGs from | |
| prompts, using fairly small models under the hood). | |
| The top results (click on the top Solutions) were pretty impressive: | |
| [1]: https://www.kaggle.com/competitions/drawing-with-llms/leaderbo... | |
| nine_k wrote 1 day ago: | |
Am I the only one who can't help but see these attempts as much like | |
a kid learning to draw? | |
| Ygg2 wrote 1 day ago: | |
| Yes. Kids don't draw that good of a line at the start. | |
Here is a better example of a start | |
| [1]: https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTfTfAA... | |
| nine_k wrote 1 day ago: | |
| Have you tried giving a kid a vector-drawing tool? | |
| I did that to my daughter when she was not even 6 years old. The | |
| results were somehow similar: [1] (Now she's much better, but | |
| prefers raster tools, e.g. [2] ) | |
| [1]: https://photos.app.goo.gl/XSLnTEUkmtW2n7cX8 | |
| [2]: https://www.deviantart.com/sofiac9/art/Ivy-with-riding-gea... | |
| pier25 wrote 1 day ago: | |
| Definitely getting better but even the best result is not very | |
| impressive. | |
| jfengel wrote 1 day ago: | |
| It's not so great at bicycles, either. None of those are close to | |
| rideable. | |
| But bicycles are famously hard for artists as well. Cyclists can | |
| identify all of the parts, but if you don't ride a lot it can be | |
| surprisingly difficult to get all of the major bits of geometry right. | |
| mattlondon wrote 1 day ago: | |
| Most recent Gemini 2.5 one looks pretty good. Certainly rideable. | |
| bredren wrote 1 day ago: | |
| Great writeup. | |
| This measure of LLM capability could be extended by taking it into the | |
| 3D domain. | |
| That is, having the model write Python code for Blender, then running | |
| blender in headless mode behind an API. | |
The talk hints at this, but one-shot prompting likely won't be a broad | |
enough measurement of capability by this time next year. (Or perhaps | |
even now.) | |
| So the test could also include an agentic portion that includes | |
| consultation of the latest blender documentation or even use of a | |
| search engine for blog entries detailing syntax and technique. | |
| For multimodal input processing, it could take into account a | |
| particular photo of a pelican as the test subject. | |
For usability, the objects can be converted to iOS's native 3D format | |
that can be viewed in mobile Safari. | |
I built this workflow, including a service for Blender, as an initial | |
test of what was possible in October of 2022. It took post-processing | |
for common syntax errors back then, but I'd imagine the newer LLMs | |
make those mistakes less often now. | |
| mromanuk wrote 1 day ago: | |
The last animation is hilarious; it represents the AI hype cycle | |
vs. reality very well. | |
| nowayno583 wrote 1 day ago: | |
| That was a very fun recap, thanks for sharing. It's easy to forget how | |
| much better these things have gotten. And this was in just six months! | |
| Crazy! | |
| adrian17 wrote 1 day ago: | |
| > This was one of the most successful product launches of all time. | |
| They signed up 100 million new user accounts in a week! They had a | |
| single hour where they signed up a million new accounts, as this thing | |
| kept on going viral again and again and again. | |
| Awkwardly, I never heard of it until now. I was aware that at some | |
| point they added ability to generate images to the app, but I never | |
| realized it was a major thing (plus I already had an offline stable | |
| diffusion app on my phone, so it felt less of an upgrade to me | |
personally). With so much AI news each week, it feels like unless | |
you're really invested in the space, it's almost impossible not to | |
accidentally miss or dismiss some big release. | |
| MattRix wrote 23 hours 8 min ago: | |
| To be clear: they already had image generation in ChatGPT, but this | |
| was a MUCH better one than what they had previously. Even for you | |
| with your stable diffusion app, it would be a significant upgrade. | |
| Not just because of image quality, but because it can actually | |
| generate coherent images and follow instructions. | |
| MIC132 wrote 12 hours 31 min ago: | |
| As impressive as it is, for some uses it still is worse than a | |
| local SD model. | |
| It will refuse to generate named anime characters (because of | |
| copyright, or because it just doesn't know them, even not | |
| particularly obscure ones) for example. | |
| Or obviously anything even remotely spicy. | |
| As someone who mostly uses image generation to amuse myself (and | |
| not to post it, where copyright might matter) it's honestly | |
| somewhat disappointing. But I don't expect any of the major AI | |
| companies to release anything without excessive guardrails. | |
| bufferoverflow wrote 1 day ago: | |
| Have you missed how everyone was Ghiblifying everything? | |
| andrepd wrote 21 hours 26 min ago: | |
| Oh you mean the trend of the day on the social media monoculture? I | |
| don't take that as an indicator of any significance. | |
| Philpax wrote 19 hours 1 min ago: | |
| One should not be proud of their ignorance. | |
| DaSHacka wrote 16 hours 13 min ago: | |
| Except when it comes to using social media, where "ignorance" | |
| unironically is strength | |
| adrian17 wrote 23 hours 21 min ago: | |
| I saw that, I just didn't connect it with newly added multimodal | |
| image generation. I knew variations of style transfer (or LoRA for | |
| SD) were possible for years, so I assumed it exploded in popularity | |
| purely as a meme, not due to OpenAI making it much more accessible. | |
| Again, I was aware that they added image generation, just not how | |
| much of a deal it turned out to be. Think of it like me | |
| occasionally noticing merchandise and TV trailers for a new movie | |
| without realizing it became the new worldwide box office #1. | |
| haiku2077 wrote 1 day ago: | |
| Congratulations, you are almost fully unplugged from social media. | |
| This product launch was a huge mainstream event; for a few days GPT | |
| generated images completely dominated mainstream social media. | |
| Semaphor wrote 9 hours 28 min ago: | |
Facebook, Discord, Reddit, HN. Hadn't heard of it either. But for | |
FB, Reddit, and Discord I strictly curate what I see. | |
| sigmoid10 wrote 9 hours 39 min ago: | |
| If you primarily consume text-based social media (HN, reddit with | |
| legacy UI) then it's kind of easy to not notice all the new kinds | |
| of image infographics and comics that now completely flood places | |
| like instagram or linkedin. | |
| derwiki wrote 1 day ago: | |
| Not sure if this is sarcasm or sincere, but I will take it as | |
| sincere haha. I came back to work from parental leave and everyone | |
| had that same Studio Ghiblized image as their Slack photo, and I | |
| had no idea why. It turns out you really can unplug from social | |
media and not miss anything of value: if it's a big enough deal | |
| you will find out from another channel. | |
| stavros wrote 7 hours 43 min ago: | |
| Why does everyone keep calling news "social media"? Have I missed | |
| a trend? Knowing what my friend Steve is up to is social media, | |
| knowing what AI is up to is news. | |
| loudmax wrote 1 hour 49 min ago: | |
| I'm afraid a lot of Americans consume the news like they | |
| consume sports media. They root for their team and select a | |
| news stream that presents them with the most favorable | |
| coverage. | |
| stavros wrote 1 hour 47 min ago: | |
| As a non-American, I can assure you that's pretty much | |
| everywhere. | |
| haiku2077 wrote 3 hours 27 min ago: | |
| You did miss a trend: | |
| [1]: https://www.pewresearch.org/short-reads/2024/09/17/mor... | |
| dgfitz wrote 20 hours 45 min ago: | |
I missed it until this thread. I think I'm proud of myself. | |
| tough wrote 8 hours 20 min ago: | |
You're one of today's lucky 10,000 | |
| [1]: https://xkcd.com/1053/ | |
| azinman2 wrote 1 day ago: | |
| Except this went very mainstream. Lots of turn myself into a muppet, | |
| what is the human equivalent for my dog, etc. TikTok is all over | |
| this. | |
| It really is incredible. | |
| thierrydamiba wrote 1 day ago: | |
| The big trend was around the ghiblification of images. Those images | |
| were everywhere for a period of time. | |
| herval wrote 1 day ago: | |
| They still are. Instagram is full of accounts posting | |
gpt-generated cartoons (and now veo3 videos). I've been | |
tracking the image generation space from day one, and it never | |
stuck like this before | |
| simonw wrote 1 day ago: | |
| Anecdotally, I've had several conversations with people way | |
| outside the hyper-online demographic who have been really | |
| enjoying the new ChatGPT image generation - using it for | |
| cartoon photos of their kids, to create custom birthday cards | |
| etc. | |
| I think it's broken out into mainstream adoption and is going | |
| to stay there. | |
| It reminds me a little of Napster. The Napster UI was terrible, | |
| but it let people do something they had never been able to do | |
| before: listen to any piece of music ever released, on-demand. | |
| As a result people with almost no interest in technology at all | |
| were learning how to use it. | |
| Most people have never had the ability to turn a photo of their | |
| kids into a cute cartoon before, and it turns out that's | |
| something they really want to be able to do. | |
| herval wrote 1 day ago: | |
Definitely. It's not just online either - half the | |
billboards I see now are AI. The posters at school. The | |
"we're hiring!" ad at the local McDonald's. It's... | |
cheaper and faster than any alternative (stock images, hiring | |
an editor or illustrator, etc.), and most non-technical people | |
can get exactly what they want in a single shot, these days. | |
| Jedd wrote 1 day ago: | |
| Yeah, but so were the bored ape NFTs - none of these ephemeral | |
| fads are any indication of quality, longevity, legitimacy, or | |
| interest. | |
| sandspar wrote 13 hours 8 min ago: | |
| I just don't understand how people can see "100 million signups | |
| in a week" and immediately dismiss it. We're not talking about | |
| fidget spinners. I don't get why this sentiment is so common | |
| here on HackerNews. It's become a running joke in other online | |
| spaces, "HackerNews commenters keep saying that AI is a | |
| nothingburger." It's just a groupthink thing I guess, a | |
| kneejerk response. | |
| otabdeveloper4 wrote 9 hours 36 min ago: | |
| > We're not talking about fidget spinners. | |
| We're talking about Hitler memes instead? I don't understand | |
| your feigned outrage. | |
| The actual valid commercial use case for generative images | |
| hasn't been found yet. (No, making blog spam prettier is not | |
| a good use case.) | |
| simonw wrote 7 hours 15 min ago: | |
| Everything Everywhere All At Once won a bunch of Oscars. | |
| They used generative AI tools for some of their | |
| post-production work (achieved by a tiny team), for example | |
| to help clean up the backgrounds in the scene with the | |
| silent dialog between the two rocks. | |
| stavros wrote 7 hours 40 min ago: | |
| You're right, nothing has value unless someone figures out | |
| how to make money with it. Except OpenAI, apparently, | |
| because the fact that people buy ChatGPT to make images | |
| doesn't seem to count as a commercial use case. | |
| otabdeveloper4 wrote 6 hours 48 min ago: | |
| OpenAI is not profitable and we don't know if it ever | |
| will be. | |
| stavros wrote 6 hours 44 min ago: | |
| Have we shifted the goalposts from "something people | |
| will pay for" to "needs to be profitable even with | |
| massive R&D" then? | |
| otabdeveloper4 wrote 5 hours 25 min ago: | |
| OpenAI is not "something people will pay for" at the | |
| moment though. | |
| stavros wrote 5 hours 14 min ago: | |
| Except lots of people are paying for it. I'll refer | |
| you to the other post on the front page for the | |
| calculation that OpenAI would have to get just an | |
| extra $10/yr from their users to break even. | |
| otabdeveloper4 wrote 3 hours 10 min ago: | |
| Your response reminds me of that joke about | |
| selling a dollar bill for ninety cents. | |
| stavros wrote 3 hours 7 min ago: | |
| Your response makes me think we have different | |
| definitions for profitability. | |
| pintxo wrote 12 hours 35 min ago: | |
| I assume, when people dismiss it, they are not looking at it | |
| through the business lens and the 100m user signups KPI, but | |
| they are dismissing it on technical grounds, as an LLM is | |
| just a very big statistical database which seems incapable of | |
| solving problems beyond (impressive looking) text/image/video | |
| generation. | |
| sandspar wrote 12 hours 14 min ago: | |
| Makes sense. Although I think that's an error. TikTok is | |
| "just" a video sharing site. Joe Rogan is "just" a | |
| podcaster. Dumb things that affect lots of people are | |
| important. | |
| micromacrofoot wrote 23 hours 36 min ago: | |
| they're not but I'm already seeing ai generated images on | |
| billboards for local businesses, they're in production | |
| workflows now and they aren't going anywhere | |
| baq wrote 1 day ago: | |
It's hard to think of a worse analogy TBH. My wife is using | |
ChatGPT to change photos (still is to this day); she didn't | |
use it or any other LLM until that feature hit. It is a fad, | |
but it's also a very useful tool. | |
Ape NFTs are... ape NFTs. Useless. Pointless. Negative value | |
for most people. | |
| Jedd wrote 11 hours 14 min ago: | |
| I would note that I was replying to a comment about the 'big | |
| trend of ghiblification' of images. | |
| Reproducing a certain style of image has been a regular fad | |
| since profile pictures became a thing sometime last century. | |
| I was not meaning to suggest that large language & diffusion | |
| models are fads. | |
| (I do think their capabilities are poorly understood and/or | |
| over-estimated by non-technical and some technical people | |
| alike, but that invites a more nuanced discussion.) | |
| While I'm sure your wife is getting good value out of the | |
| system, whether it's a better fit for purpose, produces a | |
| better quality, or provides a more satisfying workflow -- | |
| than say a decent free photo editor -- or whether other tools | |
| were tried but determined to be too limited or difficult, etc | |
| -- only you or her could say. It does feel like a small | |
| sample set, though. | |
| senthil_rajasek wrote 1 day ago: | |
"My wife is using ChatGPT to change photos (still is to this | |
day), she didn't use it or any other LLM until that feature | |
hit." | |
| This is deja vu, except instead of ChatGPT to edit photos it | |
| was instagram a decade ago. | |
| baq wrote 1 day ago: | |
| You either havenât tried it or are just trolling. | |
| senthil_rajasek wrote 23 hours 20 min ago: | |
| I am contrasting how instagram filters gave users some | |
| control and increased user base and how today editing | |
| photos with LLMs is doing the same and pulling in a wider | |
| user base. | |
| djhn wrote 23 hours 33 min ago: | |
I tried it and I don't get it. What and where are the | |
legal use cases? What can you do with these low-resolution | |
images? | |
| jauntywundrkind wrote 1 day ago: | |
| Applying some filters and adding some overlay text is | |
| something some folks did, but there's such a massive | |
| creative world that's opened up, where all we have to do is | |
| ask. | |
| mrkurt wrote 1 day ago: | |
| If we try really hard, I think we can make an exhaustive list | |
| of what viral fads on the internet are not. You made a small | |
| start. | |
| none of these ephemeral fads are any indication of quality, | |
| longevity, legitimacy, interest, substance, endurance, | |
| prestige, relevance, credibility, allure, staying-power, | |
| refinement, or depth. | |
| Aurornis wrote 22 hours 43 min ago: | |
| 100 million people didn't sign up to make that one image | |
| meme and then never use it again. | |
| That many signups is impressive no matter what. The attempts | |
| to downplay every aspect of LLM popularity are getting really | |
| tiresome. | |
| otabdeveloper4 wrote 9 hours 40 min ago: | |
| > 100 million people didn't sign up to make that one | |
| image meme and then never use it again. | |
| Source? They did exactly that. | |
| simonw wrote 7 hours 12 min ago: | |
| What's your source for saying they did exactly that? | |
| jodrellblank wrote 22 hours 31 min ago: | |
| I think it sounds far more likely that 100M people signed | |
| up to poke at the latest viral novelty and create one meme, | |
| than that 100M people suddenly discovered they had a | |
| pressing long-term need for AI images all on the same day. | |
| Doesn't it? | |
| ben_w wrote 22 hours 1 min ago: | |
| While 100M signing up just for one pic is certainly | |
| possible, I note that several hundred million people | |
| regularly share photographs of their lunch, so it is very | |
| plausible that in signing up for the latest meme | |
| generator they found they liked the ability to generate | |
| custom images of whatever they consider to be pretty | |
| pictures every day. | |
| gretch wrote 22 hours 6 min ago: | |
| It's neither of these options in this false dichotomy. | |
| 100M people signed up and did at least 1 task. Then, most | |
| likely some % of them discovered it was a useful thing | |
| (if for nothing else than just to make more memes), and | |
| converted into a MAU. | |
| If I had to use my intuition, I would say it's 5% - 10%, | |
| which represents a larger product launch than most | |
| developers will ever participate in, in the context of a | |
| single day. | |
| Of course the ongoing stickiness of the MAU also depends | |
| on the ability of this particular tool to stay on top | |
| amongst increasing competition. | |
| oblio wrote 15 hours 39 min ago: | |
| Apparently OpenAI is losing money like crazy on this | |
| and their conversion rates to paid are abysmal, even | |
| for the cheaper licenses. And not even their top | |
| subscription covers its cost. | |
| Uber at a 10x scale. | |
| I should add that compared to the hype, at a global | |
| level Uber is a failure. Yes, it's still a big company, | |
| yes, it's profitable now, but I think it was launched | |
| 10+ years ago and has only barely become net profitable | |
| over its existence, and shows no signs of taking | |
| over the world. Sure, it's big in the US and a few | |
| specific markets. But elsewhere it's either banned for | |
| undermining labor practices or has stiff local | |
| competition or it's just not cost competitive and it | |
| won't enter the market because without the whole "gig | |
| economy" scam it's just a regular taxi company with a | |
| better app. | |
| simonw wrote 14 hours 48 min ago: | |
| Is that information about their low conversion rates | |
| from credible sources? | |
| oblio wrote 13 hours 20 min ago: | |
| It's quite hard to say for sure, and I will prefix | |
| my comment by saying his blog posts are very long | |
| and quite doomerist about LLMs, but he makes a | |
| decent case about OpenAI financials: [1] [2] A very | |
| solid argument is the same one made against propaganda: | |
| it's not so much what is being said as what isn't. | |
| OpenAI is basically shouting about | |
| every minor achievement from the rooftops so the | |
| fact that they are remarkably silent about | |
| financial fundamentals says something. At best | |
| something mediocre or more likely bad. | |
| [1]: https://www.wheresyoured.at/wheres-the-mon... | |
| [2]: https://www.wheresyoured.at/openai-is-a-sy... | |
| landgenoot wrote 1 day ago: | |
| If you would give a human the SVG documentation and ask to write an | |
| SVG, I think the results would be quite similar. | |
| ramesh31 wrote 1 day ago: | |
| >If you would give a human the SVG documentation and ask to write an | |
| SVG, I think the results would be quite similar. | |
| It certainly would, and it would cost at minimum an hour of the human | |
| programmer's time at $50+/hr. Claude does it in seconds for pennies. | |
| diggan wrote 1 day ago: | |
| Lets give it a try, if you're willing to be the experiment subject :) | |
| The prompt is "Generate an SVG of a pelican riding a bicycle" and | |
| you're supposed to write it by hand, so no graphical editor. The | |
| specification is here: [1] I'm fairly certain I'd lose interest in | |
| getting it right before I got something better than most of those. | |
| [1]: https://www.w3.org/TR/SVG2/ | |
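For anyone tempted to take up the hand-coding experiment above, here is a minimal sketch of what writing the SVG by hand looks like. The shapes and coordinates are made up for illustration; this is a toy composition, not anything from the benchmark.

```python
# Hand-write a crude "pelican on a bicycle" SVG as a plain string,
# the way a human (or LLM) answering the prompt would, then check
# that it is at least well-formed XML.
import xml.etree.ElementTree as ET

def pelican_on_bicycle_svg() -> str:
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" width="200" height="120">'
        '<circle cx="50" cy="90" r="25" fill="none" stroke="black"/>'    # rear wheel
        '<circle cx="150" cy="90" r="25" fill="none" stroke="black"/>'   # front wheel
        '<path d="M50 90 L100 60 L150 90 M100 60 L100 40"'
        ' stroke="black" fill="none"/>'                                  # frame + seat post
        '<ellipse cx="100" cy="30" rx="20" ry="12"'
        ' fill="white" stroke="black"/>'                                 # pelican body
        '<polygon points="120,28 145,32 120,36" fill="orange"/>'         # the beak
        '</svg>'
    )

svg = pelican_on_bicycle_svg()
root = ET.fromstring(svg)  # raises if the SVG is not well-formed XML
print(root.tag)
```

Well-formedness is the easy part; as the thread notes, the hard part is holding the whole coordinate system in your head so the pelican actually ends up on the bicycle.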
| zahlman wrote 1 day ago: | |
| > The colors use traditional bicycle brown (#8B4513) and a classic | |
| blue for the pelican (#4169E1) with gold accents for the beak | |
| (#FFD700). | |
| The output pelican is indeed blue. I can't fathom where the idea | |
| that this is "classic", or suitable for a pelican, could have come | |
| from. | |
| diggan wrote 1 day ago: | |
| My guess would be that it doesn't see the web colors (CSS color | |
| hexes) as proper hex triplets, but because of tokenization it | |
| could be something dumb like '#8B','451','3' instead. I think the | |
| same issue happens around multiple special characters after each | |
| other too. | |
| cap11235 wrote 13 hours 59 min ago: | |
| Qwen3, at least, tokenizes each character of "#8B4513" | |
| separately. | |
| zahlman wrote 19 hours 15 min ago: | |
| No, it's understanding the colors properly. The SVG that the | |
| LLM created does use #4169E1 for the pelican color, and the LLM | |
| correctly describes this color as blue. The problem is that | |
| pelicans should not be blue. | |
| mormegil wrote 1 day ago: | |
| Did the testing prompt for LLMs include a clause forbidding the use | |
| of any tools? If not, why are you adding it here? | |
| simonw wrote 1 day ago: | |
| The way I run the pelican on a bicycle benchmark is to use this | |
| exact prompt: | |
| Generate an SVG of a pelican riding a bicycle | |
| And execute it via the model's API with all default settings, not | |
| via their user-facing interface. | |
| Currently none of the model APIs enable tools unless you ask them | |
| to, so this method excludes the use of additional tools. | |
| diggan wrote 1 day ago: | |
| The models that are being put under the "Pelican" testing don't | |
| use a GUI to create SVGs (either via "tools" or anything else), | |
| they're all Text Generation models so they exclusively use text | |
| for creating the graphics. | |
| There are 31 posts listed under "pelican-riding-a-bicycle" in | |
| case you wanna inspect the methodology even closer: | |
| [1]: https://simonwillison.net/tags/pelican-riding-a-bicycle/ | |
| wohoef wrote 1 day ago: | |
| Quite a detailed image using claude sonnet 4: | |
| [1]: https://ibb.co/39RbRm5W | |
| spaceman_2020 wrote 1 day ago: | |
| I don't know what secret sauce Anthropic has, but in real world use, | |
| Sonnet is somehow still the best model around. Better than Opus and | |
| Gemini Pro | |
| diggan wrote 1 day ago: | |
| Statements like these are useless without sharing exactly all the | |
| models you've tried. Sonnet beats O1 Pro Mode for example? Not in my | |
| experience, but I haven't tried the latest Sonnet versions, only the | |
| one before, so wouldn't claim O1 Pro Mode beats everything out there. | |
| Besides, it's so heavily context-dependent that you really need your | |
| own private benchmarks to make heads or tails out of this whole thing. | |
| big_hacker wrote 1 day ago: | |
| Honestly the metric which increased the most is the marketing and | |
| astroturfing budget of the major players (OpenAI, Anthropic, Google and | |
| Deepseek). | |
| Say what you want about Facebook but at least they released their | |
| flagship model fully open. | |
| mdaniel wrote 1 day ago: | |
| > model fully open. | |
| uh-huh | |
| [1]: https://www.llama.com/llama4/license/ | |
| franze wrote 1 day ago: | |
| Here Claude Opus Extended Thinking | |
| [1]: https://claude.ai/public/artifacts/707c2459-05a1-4a32-b393-c61... | |
| ramesh31 wrote 1 day ago: | |
| Single shot? | |
| franze wrote 1 day ago: | |
| 2 shot, first one did just generate the svg not the shareable html | |
| page around it. in the second go it also worked on the svg as i did | |
| not forbid it. | |
| deadbabe wrote 1 day ago: | |
| As a control, he should go on Fiverr and have a human generate a pelican | |
| riding a bicycle, just to see what the eventual goal is. | |
| gus_massa wrote 1 day ago: | |
| Someone did this. Look at this sibling comment by ben_w [1] about an | |
| old similar project. | |
| [1]: https://news.ycombinator.com/item?id=44216284 | |
| zahlman wrote 1 day ago: | |
| > back in 2009 I began pestering friends and random strangers. I | |
| would walk up to them with a pen and a sheet of paper asking that | |
| they immediately draw me a men's bicycle, by heart. | |
| Someone commissioned to draw a bicycle on Fiverr would not have to | |
| rely on memory of what it should look like. It would take barely | |
| any time to just look up a reference. | |
| atxtechbro wrote 1 day ago: | |
| Thank you, Simon! I really enjoyed your PyBay 2023 talk on embeddings | |
| and this is great too! I like the personalized benchmark. Hopefully the | |
| big LLM providers don't start gaming the pelican index! | |
| dirtyhippiefree wrote 1 day ago: | |
| Here's the spot where we see who's TL;DR… | |
| > Claude 4 will rat you out to the feds! | |
| >If you expose it to evidence of malfeasance in your company, and you | |
| tell it it should act ethically, and you give it the ability to send | |
| email, it'll rat you out. | |
| gscott wrote 21 hours 31 min ago: | |
| I am interested in this ratting you out thing. At some point you | |
| have a video feed into AI from a Jarvis-like headset device; you're | |
| walking down the street and cross in the middle, not at a | |
| crosswalk... does it rat you out? Does it make a list of every crime | |
| no matter how small? Or just the big ones? | |
| yubblegum wrote 1 day ago: | |
| I was looking at that and wondering about swatting via LLMs by | |
| malicious users. | |
| ben_w wrote 1 day ago: | |
| I'd say that's too short. | |
| > But itâs not just Claude. Theo Browne put together a new | |
| benchmark called SnitchBench, inspired by the Claude 4 System Card. | |
| > It turns out nearly all of the models do the same thing. | |
| dirtyhippiefree wrote 1 day ago: | |
| I totally agree, but I needed you to post the other half because of | |
| TL;DR⦠| |
| bravesoul2 wrote 1 day ago: | |
| Is there a good model (any architecture) for vector graphics out of | |
| interest? | |
| simonw wrote 1 day ago: | |
| I was impressed by Recraft v3, which gave me an editable vector | |
| illustration with different layers - [1] - but as I understand it | |
| that one is actually still a raster image generator with a separate | |
| step to convert to vector at the end. | |
| [1]: https://simonwillison.net/2024/Nov/15/recraft-v3/ | |
| bravesoul2 wrote 1 day ago: | |
| Now that is a pelican on a bicycle! Thanks | |
| JimDabell wrote 1 day ago: | |
| See also: The recent history of AI in 32 otters | |
| [1]: https://www.oneusefulthing.org/p/the-recent-history-of-ai-in-3... | |
| pbhjpbhj wrote 1 day ago: | |
| That is otterly fantastic. The post there shows the breadth too - | |
| both otters generated via text representations (in TikZ) and by image | |
| generators. The video at the end, wow (and funny too). | |
| Thanks for sharing. | |
| qwertytyyuu wrote 1 day ago: | |
| [1] here are a few I tried with the models; looks like the newer | |
| version of Gemini is another improvement? | |
| [1]: https://imgur.com/a/mzZ77xI | |
| puttycat wrote 1 day ago: | |
| The bicycles are still very far from actual ones. | |
| pjs_ wrote 1 day ago: | |
| [1]: https://www.gianlucagimini.it/portfolio-item/velocipedia/ | |
| simonw wrote 1 day ago: | |
| I think the most recent Gemini Pro bicycle may be the best yet - | |
| the red frame is genuinely the right shape. | |
| layer8 wrote 1 day ago: | |
| The pelican, on the other hand... | |
| anon373839 wrote 1 day ago: | |
| Enjoyable write-up, but why is Qwen 3 conspicuously absent? It was a | |
| really strong release, especially the fine-grained MoE which is unlike | |
| anything that's come before (in terms of capability and speed on | |
| consumer hardware). | |
| simonw wrote 1 day ago: | |
| Omitting Qwen 3 is my great regret about this talk. Honestly I only | |
| realized I had missed it after I had delivered the talk! | |
| It's one of my favorite local models right now, I'm not sure how I | |
| missed it when I was reviewing my highlights of the last six months. | |
| Maxious wrote 1 day ago: | |
| Cut for time - qwen3 was pelican tested too | |
| [1]: https://simonwillison.net/2025/Apr/29/qwen-3/ | |
| nathan_phoenix wrote 1 day ago: | |
| My biggest gripe is that he's comparing probabilistic models (LLMs) by | |
| a single sample. | |
| You wouldn't compare different random number generators by taking one | |
| sample from each and then concluding that generator 5 generates the | |
| highest numbers... | |
| Would be nicer to run the comparison with 10 images (or more) for each | |
| LLM and then average. | |
| timewizard wrote 1 day ago: | |
| My biggest gripe is he didn't include a picture of an actual pelican. | |
| [1] The "closest pelican" is not even close. | |
| [1]: https://www.google.com/search?q=pelican&udm=2 | |
| mooreds wrote 1 day ago: | |
| My biggest gripe is that he outsourced evaluation of the pelicans to | |
| another LLM. | |
| I get it was way easier to do and that doing it took pennies and no | |
| time. But I would have loved it if he'd tried alternate methods of | |
| judging and seen what the results were. | |
| Other ways: | |
| * wisdom of the crowds (have people vote on it) | |
| * wisdom of the experts (send the pelican images to a few dozen | |
| artists or ornithologists) | |
| * wisdom of the LLMs (use more than one LLM) | |
| Would have been neat to see what the human consensus was and if it | |
| differed from the LLM consensus | |
| Anyway, great talk! | |
| zahlman wrote 1 day ago: | |
| It would have been interesting to see if the LLM that Claude judged | |
| worst would have attempted to justify itself.... | |
| qeternity wrote 1 day ago: | |
| I think you mean non-deterministic, instead of probabilistic. | |
| And there is no reason that these models need to be | |
| non-deterministic. | |
| rvz wrote 1 day ago: | |
| > I think you mean non-deterministic, instead of probabilistic. | |
| My thoughts too. It's more accurate to label LLMs as | |
| non-deterministic instead of "probabilistic". | |
| skybrian wrote 1 day ago: | |
| A deterministic algorithm can still be unpredictable in a sense. In | |
| the extreme case, a procedural generator (like in Minecraft) is | |
| deterministic given a seed, but you will still have trouble | |
| predicting what you get if you change the seed, because internally | |
| it uses a (pseudo-)random number generator. | |
| So there's still the question of how controllable the LLM really | |
| is. If you change a prompt slightly, how unpredictable is the | |
| change? That can't be tested with one prompt. | |
| simonw wrote 1 day ago: | |
| It might not be 100% clear from the writing but this benchmark is | |
| mainly intended as a joke - I built a talk around it because it's a | |
| great way to make the last six months of model releases a lot more | |
| entertaining. | |
| I've been considering an expanded version of this where each model | |
| outputs ten images, then a vision model helps pick the "best" of | |
| those to represent that model in a further competition with other | |
| models. | |
| (Then I would also expand the judging panel to three vision LLMs from | |
| different model families which vote on each round... partly because | |
| it will be interesting to track cases where the judges disagree.) | |
| I'm not sure if it's worth me doing that though since the whole | |
| "benchmark" is pretty silly. I'm on the fence. | |
| dilap wrote 1 day ago: | |
| Joke or not, it still correlates much better with my own subjective | |
| experiences of the models than LM Arena! | |
| fzzzy wrote 1 day ago: | |
| Even if it is a joke, having a consistent methodology is useful. I | |
| did it for about a year with my own private benchmark of reasoning | |
| type questions that I always applied to each new open model that | |
| came out. Run it once and you get a random sample of performance. | |
| Got unlucky, or got lucky? So what. That's the experimental | |
| protocol. Running things a bunch of times and cherry picking the | |
| best ones adds human bias, and complicates the steps. | |
| simonw wrote 1 day ago: | |
| It wasn't until I put these slides together that I realized quite | |
| how well my joke benchmark correlates with actual model | |
| performance - the "better" models genuinely do appear to draw | |
| better pelicans and I don't really understand why! | |
| og_kalu wrote 23 hours 57 min ago: | |
| LLMs also have a 'g factor' | |
| [1]: https://www.sciencedirect.com/science/article/pii/S016... | |
| johnrob wrote 1 day ago: | |
| Well, the most likely single random sample would be a | |
| "representative" one :) | |
| tuananh wrote 1 day ago: | |
| until they start targeting this benchmark | |
| simonw wrote 1 day ago: | |
| Right, that was the closing joke for the talk. | |
| jonstewart wrote 1 day ago: | |
| It is funny to think that a hundred years in the future | |
| there may be some vestigial area of the models' networks | |
| that's still tuned to drawing pelicans on bicycles. | |
| more-nitor wrote 1 day ago: | |
| I just don't get the fuss from the pro-LLM people who don't | |
| want anyone to shame their LLMs... | |
| people expect LLMs to say "correct" stuff on the first attempt, | |
| not 10000 attempts. | |
| Yet, these people are perfectly OK with cherry-picked success | |
| stories on youtube + advertisements, while being extremely | |
| vehement about this simple experiment... | |
| ...well maybe these people rode the LLM hype-train too early, | |
| and are desperate to defend LLMs lest their investment go poof? | |
| obligatory hype-graph classic: | |
| [1]: https://upload.wikimedia.org/wikipedia/commons/thumb/9... | |
| MichaelZuo wrote 1 day ago: | |
| I imagine the straightforward reason is that the "better" | |
| models are in fact significantly smarter in some tangible way, | |
| somehow. | |
| pama wrote 1 day ago: | |
| How did the pelicans of point releases of V3 and of R1 | |
| (R1-0528) do compared to the original versions of the models? | |
| demosthanos wrote 1 day ago: | |
| I'd say definitely do not do that. That would make the benchmark | |
| look more serious while still being problematic for knowledge | |
| cutoff reasons. Your prompt has become popular even outside your | |
| blog, so the odds of some SVG pelicans on bicycles making it into | |
| the training data have been going up and up. | |
| Karpathy used it as an example in a recent interview: | |
| [1]: https://www.msn.com/en-in/health/other/ai-expert-asks-grok... | |
| telotortium wrote 19 hours 28 min ago: | |
| Yeah, Simon needs to release a new benchmark under a pen name, | |
| like Stephen King did with Richard Bachman. | |
| throwaway31131 wrote 1 day ago: | |
| I'd say it doesn't really matter. There is no universally | |
| good benchmark and really they should only be used to answer very | |
| specific questions which may or may not be relevant to you. | |
| Also, as the old saying goes, the only thing worse than using | |
| benchmarks is not using benchmarks. | |
| 6LLvveMx2koXfwn wrote 1 day ago: | |
| I would definitely say he had no intention of doing that and was | |
| doubling down on the original joke. | |
| colecut wrote 1 day ago: | |
| The road to hell is paved with the best intentions | |
| clarification: I enjoyed the pelican on a bike and don't think | |
| it's that bad =p | |
| diggan wrote 1 day ago: | |
| Yeah, this is the problem with benchmarks where the | |
| questions/problems are public. They're valuable for some months, | |
| until it bleeds into the training set. I'm certain a lot of the | |
| "improvements" we're seeing are just benchmarks leaking into the | |
| training set. | |
| travisgriggs wrote 1 day ago: | |
| That's ok, once bicycle "riding" pelicans become | |
| normative, we can ask it for images of pelicans humping | |
| bicycles. | |
| The number of subject-verb-objects are near infinite. All are | |
| imaginable, but most are not plausible. A plausibility machine | |
| (LLM) will struggle with the implausible, until it can abstract | |
| well. | |
| zahlman wrote 1 day ago: | |
| I can't fathom this working, simply because building a model | |
| that relates the word "ride" to "hump" seems like something | |
| that would be orders of magnitude easier for an LLM than | |
| visualizing the result of SVG rendering. | |
| diggan wrote 1 day ago: | |
| > The number of subject-verb-objects are near infinite. All | |
| are imaginable, but most are not plausible | |
| Until there is enough unique/new subject-verb-objects | |
| examples/benchmarks so the trained model actually generalized | |
| it just like you did. (Public) Benchmarks needs to constantly | |
| evolve, otherwise they stop being useful. | |
| demosthanos wrote 1 day ago: | |
| To be fair, once it does generalize the pattern then the | |
| benchmark is actually measuring something useful for | |
| deciding if the model will be able to produce a | |
| subject-verb-object SVG. | |
| ontouchstart wrote 1 day ago: | |
| Very nice talk, accessible to the general public and to AI agents as | |
| well. | |
| Any concerns about open source âAI celebrity talksâ like yours | |
| can be used in contexts that would allow LLM models to optimize | |
| their market share in ways that we canât imagine yet? | |
| Your talk might influence the funding of AI startups. | |
| #butterflyEffect | |
| threecheese wrote 1 day ago: | |
| I welcome a VC funded pelican … anything! Clippy 2.0 maybe? | |
| Simon, hope you are comfortable in your new role of AI Celebrity. | |
| planb wrote 1 day ago: | |
| And by a sample that has become increasingly known as a benchmark. | |
| Newer training data will contain more articles like this one, which | |
| naturally improves the capabilities of an LLM to estimate what's | |
| considered a good "pelican on a bike". | |
| viraptor wrote 1 day ago: | |
| Would it though? There really aren't that many valid answers to | |
| that question online. When this is talked about, we get more broken | |
| samples than reasonable ones. I feel like any talk about this | |
| actually sabotages future training a bit. | |
| I actually don't think I've seen a single correct svg drawing for | |
| that prompt. | |
| criddell wrote 1 day ago: | |
| And that's why he says he's going to have to find a new | |
| benchmark. | |
| cyanydeez wrote 1 day ago: | |
| So what you really need to do is clone this blog post, find and | |
| replace pelican with any other noun, run all the tests, and publish | |
| that. | |
| Call it wikipediaslop.org | |
| YuccaGloriosa wrote 1 day ago: | |
| If the "any other noun" becomes fish... I think I disagree. | |
| puttycat wrote 1 day ago: | |
| You are right, but the companies making these models invest a lot of | |
| effort in marketing them as anything but probabilistic, i.e. making | |
| people think that these models work discretely like humans. | |
| In that case we'd expect a human with perfect drawing skills and | |
| perfect knowledge about bikes and birds to output such a simple | |
| drawing correctly 100% of the time. | |
| In any case, even if a model is probabilistic, if it had correctly | |
| learned the relevant knowledge you'd expect the output to be perfect | |
| because it would serve to lower the model's loss. These outputs | |
| clearly indicate flawed knowledge. | |
| bufferoverflow wrote 1 day ago: | |
| > work discretely like humans | |
| What kind of humans are you surrounded by? | |
| Ask any human to write 3 sentences about a specific topic. Then ask | |
| them the same exact question next day. They will not write the same | |
| 3 sentences. | |
| cyanydeez wrote 1 day ago: | |
| Humans absolutely do not work discretely. | |
| loloquwowndueo wrote 1 day ago: | |
| They probably meant deterministically as opposed to | |
| probabilistically. Which humans also don't work like that :) | |
| aspenmayer wrote 1 day ago: | |
| I thought they meant discreetly. | |
| ben_w wrote 1 day ago: | |
| > In that case we'd expect a human with perfect drawing skills and | |
| perfect knowledge about bikes and birds to output such a simple | |
| drawing correctly 100% of the time. | |
| Look upon these works, ye mighty, and despair: | |
| [1]: https://www.gianlucagimini.it/portfolio-item/velocipedia/ | |
| rightbyte wrote 1 day ago: | |
| That blog post is a 10/10. Oh dear I miss the old internet. | |
| jodrellblank wrote 1 day ago: | |
| You claim those are drawn by people with "perfect knowledge about | |
| bikes" and "perfect drawing skills"? | |
| ben_w wrote 1 day ago: | |
| More that "these models work … like humans" (discretely or | |
| otherwise) does not imply the quotation. | |
| Most humans do not have perfect drawing skills and perfect | |
| knowledge about bikes and birds, they do not output such a | |
| simple drawing correctly 100% of the time. | |
| "Average human" is a much lower bar than most people want to | |
| believe, mainly because most of us are average on most skills, | |
| and also overestimate our own competence - the modal human | |
| has just a handful of things they're good at, and one of those | |
| is the language they use, another is their day job. | |
| Most of us can't draw, and demonstrably can't remember (or | |
| figure out from first principles) how a bike works. But this | |
| also applies to "smart" subsets of the population: physicists | |
| have [1], and there's the famous rocket scientist who weighed | |
| in on rescuing kids from a flooded cave and came up with some | |
| nonsense about a submarine. | |
| [1]: https://xkcd.com/793/ | |
| Retric wrote 1 day ago: | |
| It's not that humans have perfect drawing skills, it's | |
| that humans can judge their performance and get better over | |
| time. | |
| Ask 100 random people to draw a bike in 10 minutes and | |
| they'll on average suck while still beating the LLMs | |
| here. Give em an incentive and 10 months and the average | |
| person is going to be able to make at least one quite decent | |
| drawing of a bike. | |
| The cost and speed advantage of LLMs is real as long as | |
| you're fine with extremely low quality. Ask a model for | |
| 10,000 drawings so you can pick the best and you get | |
| marginal improvements based on random chance at a steep | |
| price. | |
| ben_w wrote 1 day ago: | |
| > Ask 100 random people to draw a bike in 10 minutes and | |
| they'll on average suck while still beating the | |
| LLMs here. | |
| Y'see, this is a prime example of what I meant with | |
| ""Average human" is a much lower bar than most people want | |
| to believe, mainly because most of us are average on most | |
| skills, and also overestimate our own competence". | |
| An expert artist can spend 10 minutes and end up with a | |
| brief sketch of a bike. You can witness this exact duration | |
| yourself (with non-bike examples) because of a challenge a | |
| few years back to draw the same picture in 10 minutes, 1 | |
| minute, and 10 seconds. | |
| A normal person spending as much time as they like gets you | |
| the pictures that I linked to in the previous post, because | |
| they don't really know what a bike is. 45 examples of what | |
| normal people think a bike looks like: [1] | |
| > Give em an incentive and 10 months and the average | |
| person is going to | |
| be able to make at least one quite decent drawing of a | |
| bike. | |
| Given mandatory art lessons in school are longer than 10 | |
| months, and yet those bike examples exist, I have no reason | |
| to believe this. | |
| > Ask a model for 10,000 drawings so you can pick the best | |
| and you get marginal improvements based on random chance | |
| at a steep price. | |
| If you do so as a human, rating and comparing images? Then | |
| the cost is your own time. | |
| If you automate it in literally the manner in this write-up | |
| (pairwise comparison via API calls to another model to get | |
| ELO ratings), ten thousand images is like $60-$90, which is | |
| on the low end for a human commission. | |
| [1]: https://www.gianlucagimini.it/portfolio-item/veloc... | |
| Retric wrote 1 day ago: | |
| As an objective criterion, what percentage include pedals | |
| and a chain connecting one of the wheels? I quickly found | |
| a dozen and stopped counting. Now do the same for those | |
| LLM images and it's clear humans win. | |
| > ""Average human" is a much lower bar than most people | |
| want to believe | |
| I have some basis for comparison. I've seen 6 year | |
| olds draw better bikes than those LLMs. | |
| Look through that list again: the worst example doesn't | |
| even have wheels, and multiple of them have wheels that | |
| aren't connected to anything. | |
| Now if you're arguing the average human is worse than | |
| the average 6 year old I'm going to disagree here. | |
| > Given mandatory art lessons in school are longer than | |
| 10 months, and yet those bike examples exist, I have no | |
| reason to believe this. | |
| Art lessons don't cumulatively spend 10 months teaching | |
| people how to draw a bike. I don't think I | |
| cumulatively spent 6 months drawing anything. Painting, | |
| collage, sculpture, coloring, etc. - art covers a lot and | |
| wasn't an every day or even every year thing. My | |
| mandatory college class was art history; we didn't | |
| create any art. | |
| You may have spent more time in class studying drawing, | |
| but that's not some universal average. | |
| > If you automate it in literally the manner in this | |
| write-up (pairwise comparison via API calls to another | |
| model to get ELO ratings), ten thousand images is like | |
| $60-$90, which is on the low end for a human commission. | |
| Not every one of those images had a price tag, but one was | |
| 88 cents; times 10,000 that's $8,800 just to make the | |
| images for a test, and even at 4 cents/image you're | |
| looking at $400. Cheaper models existed but fairly | |
| consistently had worse performance. | |
| simonw wrote 1 day ago: | |
| The 88 cent one was the most expensive by almost an | |
| order of magnitude. Most of these cost less than a cent | |
| to generate - that's why I highlighted the price on the | |
| o1 pro output. | |
| Retric wrote 1 day ago: | |
| Yes, but if you're averaging cheap and expensive | |
| options the expensive ones make a significant | |
| difference. Cheaper is bounded by 0 so it can't | |
| differ as much from the average. | |
| Also, when you're talking about how cheap something | |
| is, including the price makes sense. I had no idea | |
| about many of those models. | |
| simonw wrote 1 day ago: | |
| If you're interested, you can get cost estimates | |
| from my pricing calculator site here: [1] That link | |
| seeds it with 11 input tokens and 1200 output | |
| tokens - 11 input tokens is what most models use | |
| for "Generate an SVG of a pelican riding a bicycle" | |
| and 1200 is the number of output tokens used for | |
| some of the larger outputs. | |
| Click on different models to see estimated prices. | |
| They range from 0.0168 cents for Amazon Nova Micro | |
| (that's less than 2/100ths of a cent) up to 72 | |
| cents for o1-pro. | |
| The most expensive model most people would consider | |
| is Claude 4 Opus, at 9 cents. | |
| GPT-4o is the upper end of the most common prices, | |
| at 1.2 cents. | |
| [1]: https://www.llm-prices.com/#it=11&ot=1200 | |
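The calculator's arithmetic is just token counts times per-million-token rates. A minimal sketch in Python; the two rates used here (o1-pro at $150/$600 per million input/output tokens, Amazon Nova Micro at $0.035/$0.14) are assumptions taken from published pricing at the time and will drift:

```python
# Estimate the cost of one completion from token counts and
# per-million-token rates (the same arithmetic llm-prices.com does).
def cost_usd(input_tokens, output_tokens, in_per_m, out_per_m):
    """Cost in US dollars for a single request."""
    return input_tokens * in_per_m / 1e6 + output_tokens * out_per_m / 1e6

# 11 input tokens ("Generate an SVG of a pelican riding a bicycle")
# and ~1200 output tokens, as in the benchmark runs:
print(cost_usd(11, 1200, 150.0, 600.0))  # o1-pro rates: ~$0.72 (72 cents)
print(cost_usd(11, 1200, 0.035, 0.14))   # Nova Micro rates: ~$0.000168
```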
| Retric wrote 1 day ago: | |
| Thanks | |
| zahlman wrote 1 day ago: | |
| > A normal person spending as much time as they like gets | |
| you the pictures that I linked to in the previous post, | |
| because they don't really know what a bike is. 45 | |
| examples of what normal people think a bike looks like: | |
| [1] A normal person given the ability to consult a | |
| picture of a bike while drawing will do much better. An | |
| LLM agent can effectively refresh its memory (or attempt | |
| to look up information on the Internet) any time it | |
| wants. | |
| [1]: https://www.gianlucagimini.it/portfolio-item/vel... | |
| ben_w wrote 10 hours 44 min ago: | |
| > A normal person given the ability to consult a | |
| picture of a bike while drawing will do much better. An | |
| LLM agent can effectively refresh its memory (or | |
| attempt to look up information on the Internet) any | |
| time it wants. | |
| Some models can when allowed to, but I don't believe | |
| Simon Willison was testing that? | |
| joshstrange wrote 1 day ago: | |
| I really enjoy Simon's work in this space. I've read almost every | |
| blog post they've posted on this and I love seeing them poke and prod | |
| the models to see what pops out. The CLI tools are all very easy to use | |
| and complement each other nicely, all without trying to do too much by | |
| themselves. | |
| And at the end of the day, it's just so much fun to see someone else | |
| having so much fun. He's like a kid in a candy store and that | |
| excitement is contagious. After reading every one of his blog posts, | |
| I'm inspired to go play with LLMs in some new and interesting way. | |
| Thank you Simon! | |
| blackhaj7 wrote 1 day ago: | |
| Same sentiment! | |
| dotemacs wrote 1 day ago: | |
| The same here. | |
| Because of him, I installed an RSS reader so that I don't miss any | |
| of his posts. And I know that he shares the same ones across | |
| Twitter, Mastodon & Bsky... | |
| neepi wrote 1 day ago: | |
| My only take home is they are all terrible and I should hire a | |
| professional. | |
| vunderba wrote 23 hours 59 min ago: | |
| This test isn't really about the quality of the image itself | |
| (multimodals like gpt-image-1 or even standard diffusion models would | |
| be far superior) - it's about following a spec that describes how to | |
| draw. | |
| A similar test would be if you asked for the pelican on a bicycle | |
| through a series of LOGO instructions. | |
| spaceman_2020 wrote 1 day ago: | |
| My only take home is that a spanner can work as a hammer, but you | |
| probably should just get a hammer | |
| jug wrote 1 day ago: | |
| Before that, you might ask ChatGPT to create a vector image of a | |
| pelican riding a bicycle and then run the output through a PNG to | |
| SVG converter... | |
| Result: [1] These are tough benchmarks that trial reasoning by | |
| having the model _write_ an SVG file by hand, which requires | |
| understanding how it must be written to achieve this. Even a | |
| professional would struggle with that! It's _not_ a benchmark that | |
| gives an AI the best tools to actually do this. | |
| [1]: https://www.dropbox.com/scl/fi/8b03yu5v58w0o5he1zayh/pelican... | |
| YuccaGloriosa wrote 1 day ago: | |
| I think you made an error there: PNG is a bitmap format | |
| sethaurus wrote 1 day ago: | |
| You've misunderstood. The parent was making a specific point: | |
| if you want an SVG of a pelican, the easiest way to AI-generate | |
| it is to get an image generator to create a (vector-styled) | |
| bitmap, then auto-vectorize it to SVG. But the point of this | |
| benchmark is that it's asking models to create an SVG the hard | |
| way, by writing its code directly. | |
| GaggiX wrote 1 day ago: | |
| An expert at writing SVGs? | |
| keiferski wrote 1 day ago: | |
| As the other guy said, these are text models. If you want to make | |
| images use something like Midjourney. | |
| Promoting a pelican riding a bicycle makes a decent image there. | |
| keiferski wrote 1 day ago: | |
| * Prompting | |
| matkoniecz wrote 1 day ago: | |
| it depends on quality you need and your budget | |
| neepi wrote 1 day ago: | |
| Ah yes the race to the bottom argument. | |
| ben_w wrote 1 day ago: | |
| When I was at university, they got some people from industry to | |
| talk to us all about our CVs and how to do interviews. | |
| My CV had a stupid cliché, "committed to quality", which they | |
| correctly picked up on: "What do you mean?" one of them asked | |
| me, directly. | |
| I thought this meant I was focussed on being the best. He didn't | |
| like this answer. | |
| His example, blurred by 20 years of my imperfect human memory, | |
| was to ask me which is better: a Porsche, or a go-kart. Now, | |
| obviously (or I wouldn't be saying this), Porsche was a trick | |
| answer. Less obviously is that both were trick answers, because | |
| their point was that the question was under-specified: quality | |
| is the match between the product and what the user actually | |
| wants, so if the user is a 10 year old who physically isn't big | |
| enough to sit in a real car's driver's seat and just wants to | |
| rush down a hill or along a track, none of "quality" stuff that | |
| makes a Porsche a Porsche is of any relevance at all, but what | |
| does matter is the stuff that makes a go-kart into a go-kart... | |
| one of which is the affordability. | |
| LLMs are go-karts of the mind. Sometimes that's all you need. | |
| neepi wrote 1 day ago: | |
| I disagree. Quality depends on your market position and what | |
| you are bringing to the market. Thus I would start with market | |
| conditions and work back to quality. If you can't reach your | |
| standards in the market then you shouldn't enter it. And if | |
| your standards are poor, you should be ashamed. | |
| Go-kart or Porsche is irrelevant. | |
| ben_w wrote 1 day ago: | |
| > Quality depends on your market position and what you are | |
| bringing to the market. | |
| That's the point. | |
| The market for go-karts does not support Porsche. | |
| If you bring a Porsche sales team to a go-kart race, nobody | |
| will be interested. | |
| Porsche doesn't care about this market. It goes both ways: | |
| this market doesn't care about Porsche, either. | |
| dist-epoch wrote 1 day ago: | |
| Most of them are text-only models. Like asking a person born blind to | |
| draw a pelican, based on what they heard it looks like. | |
| neepi wrote 1 day ago: | |
| That seems to be a completely inappropriate use case? | |
| I would not hire a blind artist or a deaf musician. | |
| wongogue wrote 1 day ago: | |
| Even Beethoven? | |
| simonw wrote 1 day ago: | |
| Yeah, that's part of the point of this. Getting a state of the | |
| art text generating LLM to generate SVG illustrations is an | |
| inappropriate application of them. | |
| It's a fun way to deflate the hype. Sure, your new LLM may have | |
| cost XX million to train and beat all the others on the | |
| benchmarks, but when you ask it to draw a pelican on a bicycle it | |
| still outputs total junk. | |
| dist-epoch wrote 1 day ago: | |
| tried starting from an image: [1] lol: | |
| [1]: https://chatgpt.com/share/684582a0-03cc-8006-b5b5-de51... | |
| [2]: https://gemini.google.com/share/4d1746a234a8 | |
| dmd wrote 1 day ago: | |
| Sorry, Beethoven, you just don't seem to be a match for our | |
| org. Best of luck on your search! | |
| You too, Monet. Scram. | |
| __alexs wrote 1 day ago: | |
| I guess the idea is that by asking the model to do something that | |
| is inherently hard for it we might learn something about the | |
| baseline smartness of each model which could be considered a | |
| predictor for performance at other tasks too. | |
| namibj wrote 1 day ago: | |
| It's a proxy for abstract designing, like writing software or | |
| designing in a parametric CAD. | |
| Most of the non-math design work of applied engineering AFAIK falls | |
| under the umbrella that's tested with the pelican riding the | |
| bicycle. | |
| You have to make a mental model and then turn it into applicable | |
| instructions. | |
| Program code/SVG markup/parametric CAD instructions don't really | |
| differ in that aspect. | |
| neepi wrote 1 day ago: | |
| I would not assume that this methodology applies to applied | |
| engineering, as a former actual real tangible meat space | |
| engineer. Things are a little nuanced and the nuances come from | |
| a combination of communication and experience, neither of which | |
| any LLM has any insight into at all. It's not out there on the | |
| internet to train it with and it's not even easy to put it into | |
| abstract terms which can be used as training data. And | |
| engineering itself in isolation doesn't exist - there is a | |
| whole world around it. | |
| Ergo, no, you can't just throw a bicycle into an LLM, have | |
| a parametric model drop out into SolidWorks, have a machine | |
| make it, and have everyone buy it. That is the hope, really, | |
| isn't it? You end up with a useless shitty bike with a shit | |
| pelican on it. | |
| The biggest problem we have in the LLM space is that no | |
| one really understands any of the proposed use cases well | |
| enough, and neither does anyone being told that it works | |
| for those use cases. | |
| rjsw wrote 1 day ago: | |
| I don't think any of that matters, CEOs will decide to use it | |
| anyway. | |
| neepi wrote 1 day ago: | |
| This is sad but true. | |
| dist-epoch wrote 1 day ago: | |
| [1]: https://www.solidworks.com/lp/evolve-your-design-wor... | |
| neepi wrote 1 day ago: | |
| Yeah good luck with that. Seriously. | |
| dist-epoch wrote 1 day ago: | |
| The point is about exploring the capabilities of the model. | |
| Like asking you to draw a 2D projection of a 4D sphere intersected | |
| with a 4D torus or something. | |
| kevindamm wrote 1 day ago: | |
| Yeah, I suppose it is similar... I don't know their diameters, | |
| rotations, nor the distance between their centers, nor which | |
| two dimensions, so I would have to guess a lot about what you | |
| meant. | |
| <- back to front page |