Hacker News on Gopher (unofficial)
COMMENT PAGE FOR:
StoryDiffusion: Long-range image and video generation
gtoast wrote 22 min ago:
It's really challenging to think of positive, constructive uses for
this technology without thinking of the myriad life- and
society-affecting abuses. Even interpersonally, the use of this
technology is heavily weighted toward destruction and deception. I
don't know where this ends or where the researchers who release this
technology think it will go, but I can't imagine it's going anywhere
good for any of us.
nephanth wrote 2 hours 3 min ago:
Um, the GitHub link is a 404, and the paper link points back to the
webpage itself (the paper is not on arXiv). Probably they put the
website up too soon?
spywaregorilla wrote 5 hours 38 min ago:
How is this conceptually different from tracking an embedding for a
single character or training a LoRA on it?
MisterTea wrote 6 hours 53 min ago:
One day we won't have 3D engines or GPUs but AI chips that generate
the scenes without calculating a single triangle or loading a single
texture. We'll just stream in a scene; IP asset seeds will provide the
characters, plot, and story. But even those can be generated in
real-time. Video games, movies, anything will be on demand. No one will
act. No one will draw. We will just sit and ask for more. Strange
times.
whamlastxmas wrote 6 hours 52 min ago:
I had this same realization when Sora came out.
jerpint wrote 9 hours 35 min ago:
The videos look incredible, but a lot of the captions are riddled with
grammar/syntax mistakes that seem odd for a model of that quality to
make.
speedgoose wrote 10 hours 18 min ago:
Is there a video of Will Smith eating spaghetti with this model?
smusamashah wrote 11 hours 24 min ago:
This is unbelievably good. It seems better even than Sora in terms of
natural look and motion in videos.
The video of the two girls talking seems so natural. There are some
artifacts, but the movement is so natural, and the clothes and other
things around them are not continuously changing.
I hope it becomes open source, though I suspect it won't, because
it's coming from ByteDance.
cchance wrote 11 hours 1 min ago:
I don't know if that's true; there's a massive flicker in the guy's
hair (the one with the mostly black background and black shirt).
Halfway through, it completely loses tracking on his hair and it
snap-changes.
smusamashah wrote 7 hours 51 min ago:
If you compare this with the current state of openly available video
models (assuming this will be open too), this is still a leap. If it
is going to be closed like Sora, then it's merely comparable; Sora has
a different kind of artifact.
These artifacts are an improvement over the current state.
pmontra wrote 14 hours 27 min ago:
The Moon in the sky seen from the surface of the Moon is wrong? Poetic?
Funny? Recursive? A demonstration that these models don't understand
anything? Add to the list.
gbickford wrote 17 hours 24 min ago:
It's always disappointing when people publish things to GitHub without
the intention of collaborating or sharing.
forgingahead wrote 17 hours 39 min ago:
The GitHub link is broken, and I honestly find it frustrating that the
only link to code is the theme source and credits. Is it really that
important to give the static page theme that much real estate instead
of an actual code release for the project?
29athrowaway wrote 17 hours 43 min ago:
Time for Microsoft Chat 2.0, it seems.
topspin wrote 17 hours 48 min ago:
Love how under "Multiple Characters Generation" the white guy is "A
Man," whereas someone else is "An Asian Man." Reminds me of Daryl
Gates and the "normal people" quote, and thence patrol cars being
called "black and normals."
fnordpiglet wrote 17 hours 42 min ago:
A probabilistic regression model's behavior will just reflect the
training data. Don't hate the player, hate the game.
topspin wrote 17 hours 12 min ago:
No hate for any part of this: it's just amusing.
peteradio wrote 17 hours 55 min ago:
There is a video of two girls. One girl seems to be sticking out her
tongue and then blowing a kiss, but the tongue appears again
mid-kiss. Very arousing stuff, I'll say. Keep up the good work,
microsft or goggle or whoever made it.
yard2010 wrote 10 hours 44 min ago:
Worse: ByteDance.
schoen wrote 18 hours 13 min ago:
I looked very closely at the videos for a while and managed to find
some minor continuity errors (like different numbers of buttons on
people's button-down shirts at different times, or different sizes or
styles of earrings, or arguably different interpretations of which
finger is which in an intermittently-obscured hand). I also think that
the cycling woman's shorts appear to cover more of her left leg than
her right leg, although that's not physically impossible, and the bear
seemingly has a differently-sized canine tooth at different times.
But I guess it took me multiple minutes to find these problems,
watching each video clip many times, rather than having any of them
jump out at me. So, it's not literally full, consistent object
persistence, but at a casual viewing it was very persuasive.
Maybe people who shoot or edit video frequently would notice some of
these problems more quickly, because they're more attuned to looking
for continuity problems?
chrsw wrote 5 hours 17 min ago:
There are lots of inconsistencies in these clips of a type you
would never find even in a hastily put-together amateur film. I
wonder how you would even add continuity support to a generative
video model. It's got its training data, its model, its algorithms
for generating data... but could you say "make sure this shirt always
has 6 buttons in this scene"? Does it even understand what a button
is? Or a shirt? Or a thing?
It seems to me that eventually these systems are going to have to be
grounded in some hard truths about our world. Like: there are things
called objects, objects can be distinct, objects can have
relationships with other objects, etc. Then the generative network
would have to generate data around these priors. Or maybe they
already have that; I don't know how they work.
jononor wrote 1 hour 59 min ago:
Hopefully continuity (of relevant features) will be a result of
the training process, eventually. In videos from the wild, the
number of buttons on a shirt basically never changes during a
scene. That kind of information is in the training data already. So
it is theoretically possible for a model to learn that this should
stay consistent, in contrast to other properties that affect the
shirt, like lighting or pose. But we are still in the very early days
of kinda-working video generation, and certainly of temporal
consistency.
whywhywhywhy wrote 6 hours 57 min ago:
You can see this in Sora videos too if you look closely at things
like the leaves of trees; you can tell some sort of temporal
bucketing is going on, even in SOTA models.
IanCal wrote 7 hours 10 min ago:
I think it's fascinating to watch what the issues/complaints are. I'm
in no way saying you're complaining, but I think looking at what
people point out as the issues is a great measure of progress.
Here we're looking at video, with high-quality individual frames,
where the inconsistencies are maybe clear and maybe not. Compared to
Craiyon (around the time of DALL-E) [1], it's wild how that's changed.
And that capability was itself a vast improvement over things before
(at least ones that weren't fixed goals; the GAN approach to faces in
headshots was very lifelike before this).
[1]: https://i.ytimg.com/vi/lcoitxKbw_0/maxresdefault.jpg
grobgambit wrote 8 hours 49 min ago:
I am super picky when it comes to art, and I think these look like
complete shit compared to what I have seen from Sora.
Not even in the same ballpark. Even when things are wrong in Sora,
the imagery still seems very crisp. If I watched these videos
for 5 minutes, I know I would get a headache.
taneq wrote 7 hours 39 min ago:
Weren't the Sora videos heavily edited/post-produced? At least so
I've read; happy to be corrected here.
cchance wrote 11 hours 0 min ago:
I mean, at the end of the day, neither is standard video editing
perfect. How many times have we all found inconsistencies in TV shows,
or random water bottles showing up and disappearing between scenes?
I imagine diffusion video creation will be similar: eventually just
funny anecdotes about what we saw that time in LOTR 10.
vkou wrote 17 hours 33 min ago:
I'm immediately noticing significant issues with mouths
(specifically, when they are open).
It's also telling that most of the shots do their best to hide hands;
whenever they are visible, they are obviously broken.
godelski wrote 17 hours 40 min ago:
Did you miss the fish? [1] You should see the error on first viewing.
What about the woman with glasses? Her face literally "jumps". [2]
Same with this guy's hands. [3]
Interestingly, we notice that [2] has "sora" in the name, though I
think it is a reference to the main image on Sora. [4]
Not sure if the gallery is weird to anyone else, but it doesn't
exactly show new images, and the position indicator is wonky.
The thing that makes me most suspicious is seeing the numbers on
these demos: 1, 2, 4 (terrifying to me), 5, 65, 66, 68, 72, 73, 83,
85, 86 (is this Simone Giertz? Vic Michaelis?). The part that is
tough about evaluating generative models is the cherry-picking for
demonstrations. You have to do it or people tear your work apart, but
in doing so you also give a false impression of what your work can
actually do.
IMO it has gotten out of hand and is not benefiting anyone. It makes
these papers more akin to advertising than to communication of
research. We talk about the integrity of the research community and
why we argue over borderline works, but come on: if you can get a
better review with more samples, you can get better reviews by paying
more, not by doing better work. A pay-to-play system is far worse for
the integrity of ML (or any science) than arguing over borderline
works.
Edit: I think it is also a bit problematic that this was posted BEFORE
the arXiv link or GitHub went live. I'd appeal to the HN community not
to upvote these kinds of works until at least the paper is live.
[1]: https://storydiffusion.github.io/MagicStory_files/longvideo/...
[2]: https://storydiffusion.github.io/MagicStory_files/longvideo/...
[3]: https://storydiffusion.github.io/MagicStory_files/longvideo/...
[4]: https://openai.com/sora
nyokodo wrote 17 hours 54 min ago:
> But I guess it took me multiple minutes to find these problems
I'm no video editor, but I noticed straight away that the
characters' eyes and hair tend to change, sometimes dramatically, as
they turn their heads. Also, the head movement tends to be jerky or
abrupt, especially in the middle of the turn.
justinclift wrote 15 hours 1 min ago:
Eyes and teeth seem like they still need further work. Still,
looks like things are improving. :)
brotherdusk wrote 18 hours 29 min ago:
Sorry, I can't access the repo, and the PDF link doesn't have an
href attribute; is that by design?
hbbio wrote 18 hours 30 min ago:
GitHub link is not public yet?
[1]: https://github.com/HVision-NKU/StoryDiffusion
ActionHank wrote 6 hours 12 min ago:
A lot of these AI-related announcements seem to be doing this sort of
baiting: "I made a new thing", go to the repo, COMING SOON. Or this:
here's the paper, but no, we won't show our work.
smcnally wrote 16 hours 17 min ago:
That repo's not listed:
[1]: https://github.com/orgs/HVision-NKU/repositories
stanislavb wrote 18 hours 4 min ago:
Seems so. I was about to report it, too.
keikobadthebad wrote 18 hours 34 min ago:
It'll be good if the girl and the giant squirrel are ever seen in the
same park at the same time.
freefruit wrote 18 hours 47 min ago:
So is Amazon flooded with hyper-niche e-books yet?
m463 wrote 14 hours 49 min ago:
I went to buy an air fryer. There were several recipe books available
for the specific air fryer model, but they were all garbage
auto-generated stuff.
I complained to Amazon, and they said that since I hadn't purchased
the book they couldn't do anything. So I bought the book, complained,
and returned it.
The chapters devoted to the details of the specific air fryer model
were either very general (almost quoting the product description on
Amazon) or just plain wrong.
What I thought I would get was something like the Magic Lantern books
about specific camera models. Instead it was auto-generated pages of
nonsense.
surfingdino wrote 12 hours 32 min ago:
Your real-life example is a good case against using AI-generated
legal or medical advice.
selalipop wrote 18 hours 32 min ago:
I'm working on a platform for reading hyper-niche e-books: [1] I
don't think this form of generative AI needs to become a source of
spam; carefully designed platforms can let people enjoy their niche
content without making them feel isolated.
[1]: https://tryspellbound.com
surfingdino wrote 12 hours 32 min ago:
Too late, it has become a source of spam.
selalipop wrote 10 hours 9 min ago:
It's not really useful to give up the fight in the infancy of
something with as much surface area as generative AI.
"Is being used to create spam" is not the same as "needs to be spam",
and we mostly just need platforms that leverage generative AI
natively to bridge the gap.
surfingdino wrote 1 hour 22 min ago:
There is literally zero need for tools to generate text. Humans
generate tons of spam already.
samspenc wrote 18 hours 54 min ago:
Normally I don't mind spelling errors, and there are plenty in the
examples, but my question is: did the system really produce "lunch"
when the prompt was "they have launch at restraunt" (verbatim from the
sample)? I would imagine it got "restaurant" right, but I would have
expected it to produce something like a rocket-launch image instead of
figuring out that the author meant "lunch".
ffhhj wrote 15 hours 25 min ago:
Curious what it would produce with: "they have launch a
rockestaurant".
taneq wrote 7 hours 41 min ago:
Bistromathics!
godelski wrote 17 hours 29 min ago:
"He felt very frightened and run", "There is a huge amount of
treasure in the house!"
I suspect that some grammar and spelling issues may come from the
authors themselves. For example "A Asian Man": "a" instead of "an" is
a common mistake for speakers of many Asian languages, since their
languages have no similar forms. So, considering the consistent
article errors, I expect this to be an issue with the authors. Not
sure about the "M" capitalization. Similar things with "The man have
breakfast", "They have launch at restaurant", and "They play in (the)
amusement part."
Considering the comics have similar types of errors (the squirrel one
most clearly), I'd chalk it up to a language barrier rather than the
process. Though LeCun is not wearing gloves on the moon, and well...
neckro23 wrote 18 hours 27 min ago:
And if the model is supposed to be so attentive to context, why
did it show a desert instead of a "dessert"? After all, they just ate
"launch".
yorwba wrote 13 hours 55 min ago:
The model can only attend to context that is part of the input.
Most likely they created the image grid by independently feeding
the model each prompt together with the reference image. (And the
point is to show off that the model output remains consistent
despite this independent generation process.)
dkarras wrote 18 hours 36 min ago:
Transformers / attention are very robust against typos, as they take
the entire context into account, just like we do. Launch any free LLM
and ask it questions with typos that you would notice and
auto-correct, and you'll see that the models just don't care; they
understand them. In fact, they are so resilient that they understand
very garbled text without breaking a sweat.
noneeeed wrote 10 hours 47 min ago:
I often use ChatGPT when learning Spanish; I find it's great for
explaining distinctions between words with similar meanings, where a
dictionary isn't always a lot of help.
I am constantly surprised by how well it copes with my typos,
grammatical errors, and generally poor spelling.
BoorishBears wrote 17 hours 9 min ago:
There's honestly something uncanny about how well they do.
In the "early days" of GPT-4 I tried testing it as a way to get
around poor transcription for an in-car voice assistant. It
managed: "I'm how dew yew say... Freud?" => "Turn up the
temperature"... which is nonsense most people would stare at for a
long time before making any sense of.
LeoPanthera wrote 18 hours 58 min ago:
The rate of progress of generative AI is honestly quite scary.
ed_mercer wrote 18 hours 38 min ago:
Really? Feels like nothing much is happening lately.
newswasboring wrote 10 hours 23 min ago:
What are you talking about? GPT-3 came out less than 4 years ago,
and Stable Diffusion's first version around then too. In less than
4 years we went from nothing to making janky but believable video
clips. This is not fast enough for you?
vouaobrasil wrote 14 hours 35 min ago:
Progress comes in spurts. Due to the negative reactions to AI by
some (artists), the system wants it to appear that nothing is
happening, so that the next wave of AI can be created in relative
peace, at which time it will be too late to stop it.
We have been conditioned to only react to hype and "news", rather
than to analyze reality and see the danger.
thejohnconway wrote 10 hours 10 min ago:
Which "system"?
vouaobrasil wrote 8 hours 32 min ago:
The global capitalist system, or the emergent behaviour that
comes out of a mass of humanity addicted to technological
development through wealth accumulation.
thejohnconway wrote 6 hours 51 min ago:
Such a system can't want anything.
vouaobrasil wrote 5 hours 47 min ago:
It's a term I use for emergent behaviour. And some
philosophers of technology would disagree with you, such as
the panpsychists. We are just a bag of cells, and yet we
speak of "wanting" things even though we might just be
deterministic bags of blood.