_______ __ _______ | |
| | |.---.-..----.| |--..-----..----. | | |.-----..--.--.--..-----. | |
| || _ || __|| < | -__|| _| | || -__|| | | ||__ --| | |
|___|___||___._||____||__|__||_____||__| |__|____||_____||________||_____| | |
on Gopher (unofficial)
Visit Hacker News on the Web | |
COMMENT PAGE FOR: | |
Alice's adventures in a differentiable wonderland | |
seanhunter wrote 1 day ago: | |
Wow, just skimmed a bit, but this book looks amazing so far. Really | |
understandable but with an intuitive presentation of the underlying | |
maths that invites the reader to go deeper if they want to by giving | |
them what they need to get started. | |
gfaure wrote 1 day ago: | |
In the literature, they're usually called convolutional layers (I think | |
you can pretty much search and replace all uses of "convolutive" in the | |
text). | |
p1esk wrote 1 day ago: | |
And then you learn about binary or ternary networks where gradients
don't really exist anywhere, and you start to wonder about the
importance of this differentiability.
ubj wrote 1 day ago: | |
...And then you start learning about generalizations of the notion of | |
"gradient" to scenarios where the classical gradient doesn't exist :) | |
whimsicalism wrote 1 day ago: | |
binary networks don't really work well unless you do a relaxation | |
first | |
Eduard wrote 1 day ago: | |
better remove all the Disney-based Alice in Wonderland character | |
intellectual property from the book. | |
astrodust wrote 1 day ago: | |
I was just thinking that's "cease and desist" bait right there. | |
iainmerrick wrote 1 day ago: | |
Alice in Wonderland (the book) is in the public domain. The old | |
Disney movie is still in copyright, and the cover image does look | |
very much like it's from the movie, but that character design is | |
from John Tenniel's illustrations which are also in the public | |
domain. | |
astrodust wrote 1 day ago: | |
The character design is. The image, however, is clearly Disney | |
flavoured if not traced directly. | |
His version, for example, does not have the distinctive bow. The | |
art style is also completely different. | |
iainmerrick wrote 1 day ago: | |
True - it would be a good idea to use a Tenniel piece instead. | |
Edit to add: I was mostly trying to push back on the | |
implication that Disney owns Alice in Wonderland (and Peter | |
Pan, Winnie the Pooh, etc). Now I re-read the original comment, | |
they did specify "Disney-based", so maybe I'm over-reacting! | |
xanderlewis wrote 1 day ago: | |
> Stripped of anything else, neural networks are compositions of | |
differentiable primitives | |
I'm a sucker for statements like this. It almost feels philosophical,
and makes the whole subject so much more comprehensible in only a
single sentence.
I think François Chollet says something similar in his book on deep
learning: one shouldn't fall into the trap of anthropomorphising and
mysticising models based on the "neural" name; deep learning is
simply the application of sequences of operations that are nonlinear
(and hence capable of encoding arbitrary complexity) but nonetheless
differentiable and so efficiently optimisable.
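The "composition of differentiable primitives" view can be made concrete in a few lines: each layer is just a differentiable function, and the network is literally their composition. A toy sketch with made-up shapes, not code from the book:

```python
import numpy as np

rng = np.random.default_rng(0)

# Each "primitive" is a differentiable function of its inputs.
def linear(x, W, b):
    return x @ W + b

def tanh(x):  # smooth nonlinearity, differentiable everywhere
    return np.tanh(x)

# The network is nothing more than their composition:
# f(x) = linear(tanh(linear(x, W1, b1)), W2, b2)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)

def f(x):
    return linear(tanh(linear(x, W1, b1)), W2, b2)

y = f(rng.normal(size=(5, 3)))  # batch of 5 inputs -> shape (5, 2)
```

Because every piece is differentiable, the chain rule applies to the whole composition, which is what makes gradient-based training possible.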
m463 wrote 1 day ago: | |
chess is pattern matching. | |
jonas21 wrote 1 day ago: | |
I feel like this statement is both obvious after spending a few | |
minutes working with neural networks and completely useless in | |
helping you build better neural networks. | |
It's kind of like saying, "Stripped of anything else, works of | |
literature are compositions of words" | |
falcor84 wrote 1 day ago: | |
Let me try to one-up that: "Stripped of anything else, works of
literature are compositions of the uppercase letter I" | |
Horffupolde wrote 1 day ago: | |
Well, I'd argue that could also be a bit enlightening. It's
like taking a moment to appreciate that forests are composed of | |
single trees. It takes a certain level of insight to appreciate | |
systems at various depths. | |
sideshowb wrote 1 day ago: | |
> deep learning is simply the application of sequences of operations | |
that are nonlinear but nonetheless differentiable | |
Though other things fit this description which are not deep learning. | |
Like (shameless plug) my recent paper here | |
[1]: https://ieeexplore.ieee.org/document/10497907 | |
naasking wrote 1 day ago: | |
> one shouldn't fall into the trap of anthropomorphising and
mysticising models based on the "neural" name
One also shouldn't fall into the dual trap of assuming that just | |
because one understands how a model works, it cannot have any bearing | |
on the ever-mysterious operation of the brain. | |
jimbokun wrote 1 day ago: | |
I always get the impression even the proponents of these algorithms | |
when they didn't seem so promising, are shocked at the capabilities | |
demonstrated by models built with such a relatively simple procedure. | |
phkahler wrote 1 day ago: | |
>> one shouldn't fall into the trap of anthropomorphising and
mysticising models based on the "neural" name
And yet, artificial neural networks ARE an approximation of how | |
biological neurons work. It is worth noting that they came out of | |
neurobiology and not some math department - well at least in the | |
forward direction, I'm not sure who came up with the training | |
algorithms (probably the math folks). Should they be considered | |
mystical? No. I would also posit that biological neurons are more | |
efficient and probably have better learning algorithms than | |
artificial ones today. | |
I'm confused as to why some people seem to shun the biological | |
equivalence of these things. In a recent thread here I learned that | |
physical synaptic weights (in our brains) are at least partly stored | |
in DNA or its methylation. If that isn't fascinating I'm not sure | |
what is. Or is it more along the lines of intelligence can be reduced | |
to a large number of simple things, and biology has given us an | |
interesting physical implementation? | |
chriswarbo wrote 1 day ago: | |
> And yet, artificial neural networks ARE an approximation of how | |
biological neurons work. | |
Only if you limit yourself to "sums of weighted inputs, sent | |
through a 1D activation function". | |
However, the parent said "differentiable primitives": these days | |
people have built networks that contain differentiable ray-tracers, | |
differentiable physics simulations, etc. Those seem like crazy | |
ideas if we limit ourselves to the "neural" analogy; but are quite | |
natural for a "composition of differentiable primitives" approach. | |
srean wrote 1 day ago: | |
> And yet, artificial neural networks ARE an approximation of how | |
biological neurons work | |
For a non-vapid/non-vacuous definition of 'approximation' this is | |
not true at all. It is well understood that (i) back-propagation is
biologically infeasible in the brain, and (ii) that the output
'voltage' is a transformed weighted average of the input 'voltage' is
not how neurons operate. (ii) is in the 'not even wrong' category.
Neurons operate in terms of spikes and frequency and quiescence of | |
spiking. If you are interested any undergrad text in neurobiology | |
will help correct the wrong notions. | |
xanderlewis wrote 1 day ago: | |
As the commenter below mentions, the biological version of a neuron | |
(i.e. a neuron) is much more complicated than the neural network | |
version. The neural network version is essentially just a weighted | |
sum, with an extra layer of shaping applied afterwards to make it | |
nonlinear. As far as I know, we still don't understand all of the
complexity about how biological neurons work. Even skimming the
Wikipedia page for "neuron" will give you some idea.
The original idea of approximating something like a neuron using a | |
weighted sum (which is a fairly obvious idea, given the initial | |
discovery that neurons become "activated" and they do so in
proportion to how much the neurons they are connected to are) did | |
come from thinking about biological brains, but the mathematical | |
building blocks are incredibly simple and are hundreds of years | |
old, if not thousands. | |
naasking wrote 1 day ago: | |
> the biological version of a neuron (i.e. a neuron) is much | |
more complicated than the neural network version | |
This is a difference of degree not of kind, because neural | |
networks are Turing complete. Whatever additional complexity the
neuron has can itself be modelled as a neural network. | |
Edit: meaning, that if the greater complexity of a biological | |
neuron is relevant to its information processing component, then | |
that just increases the number of artificial neural network | |
neurons needed to describe it, it does not need any computation | |
of a different kind. | |
srean wrote 1 day ago: | |
> This is a difference of degree not of kind | |
Nope. | |
Neurons in our brain operate fundamentally differently. They | |
work by transient spikes and information is carried not by the | |
intensity of the spike voltage, but by the frequency of | |
spiking. This is a fundamentally different phenomenon from ANNs,
where the output (voltage) is a squashing transform of aggregated
input values (voltages).
phkahler wrote 20 hours 22 min ago: | |
>> Neurons in our brain operate fundamentally differently. | |
They work by transient spikes and information is carried not | |
by the intensity of the spike voltage, but by the frequency | |
of spiking. | |
I thought they worked like accumulators where the spike | |
"energy" accumulates until the output "fires". If that's the | |
case then the artificial NNs are still an approximation of | |
that process. I agree that this is a significant difference, | |
but the mathematical version is still a rough approximation | |
inspired by the biological one. | |
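The accumulate-until-fire behaviour described above is usually modelled as a leaky integrate-and-fire neuron. A minimal sketch; all constants are arbitrary toy values, not from any particular neuroscience reference:

```python
def lif_spike_train(input_current, steps=200, dt=1.0,
                    tau=20.0, v_thresh=1.0, v_reset=0.0):
    """Minimal leaky integrate-and-fire neuron: the membrane potential
    accumulates input, leaks toward rest, and emits a spike (then
    resets) whenever it crosses the threshold. Information is carried
    by spike timing/frequency, not by an output voltage level."""
    v, spikes = 0.0, []
    for t in range(steps):
        v += dt * (-v / tau + input_current)  # leak + accumulate
        if v >= v_thresh:
            spikes.append(t)
            v = v_reset
    return spikes

# Stronger input -> higher spike frequency (rate coding)
low = len(lif_spike_train(0.06))
high = len(lif_spike_train(0.20))
```

This is where the analogy with the ANN weighted sum breaks down: the output here is a train of discrete events whose rate encodes the signal, not a squashed real number.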
andoando wrote 1 day ago: | |
And assembly is also turing complete, so if two models being | |
both Turing completeness means they are equivalent, there would | |
be no need for coding neural networks at all. Would you | |
consider LLMs a different kind of computation than writing | |
assembly code? | |
Perhaps fundamentally they are not, but its also true that just | |
writing more and more random assembly code isn't going to lead | |
to an LLM. | |
naasking wrote 1 day ago: | |
LLMs aren't randomly generated though, they are shaped by | |
training data. This means there would, in principle, be a | |
comparable way to synthesize an equivalent assembly program | |
from that same training data. | |
The difference here is that it's just more obvious how to do | |
this in one case than the other. | |
My point was only that 1) neural networks are sufficient, | |
even if real neurons have additional complexity, and 2) | |
whatever that additional complexity, artificial neural | |
networks can learn to reproduce it. | |
andoando wrote 1 day ago: | |
I understand that, what I am saying though is the fact that | |
they can doesn't mean that they will by simply scaling | |
their number. It still entirely depends on how they are | |
trained/arranged, meaning it may take a completely | |
different way of composing/glueing neurons together to
simulate any additional complexity. It's like saying a nand
gate is turing complete, I put 1000000000 of them in a | |
series, but its not doing anything, what gives, do I need | |
to add a billion more? | |
Just as modeling and running a single neuron takes x
amount of transistors configured in a very specific way for | |
example, it may take y amount of neurons arranged in some
very specific, unknown way to model something that has extra
properties.
And its not clear either whether neurons are fundamentally | |
the correct approach to reach this higher level | |
construction than some other kind of node. | |
xanderlewis wrote 1 day ago: | |
PowerPoint is Turing complete. Does that mean PowerPoint should | |
be regarded as being biological or at least | |
neuroscience-inspired? | |
naasking wrote 1 day ago: | |
No, but neural networks literally were inspired by biology so | |
I'm not sure what your point is. | |
xanderlewis wrote 1 day ago: | |
My point is that you seem to think neurons in the sense of | |
artificial neural networks and neurons in the human brain | |
are equivalent because: | |
(1) Neural networks are Turing complete, and hence can do | |
anything brains can. [debatable anyway; we don't know
this to be the case since brains might be doing more than
computation. Ask a philosopher or a cognitive scientist. Or
Roger Penrose.]
(2) Neural networks were very loosely inspired by the idea
that the human brain is made up of interconnected nodes
that "activate" in proportion to how other related
nodes do.
I don't think that's nearly enough to say that
they're equivalent. For (1), we don't yet know (and
we're not even close), and anyway: if you consider all
Turing complete systems to be equivalent to the point of it
being a waste of time to talk about their differences then
you can say goodbye to quite a lot of work in theoretical
computer science. For (2): so what? Lots of things are
inspired by other things. It doesn't make them in any
sense equivalent, especially if the analogy is as weak as
it is in this case. No neuroscientist thinks that a
weighted sum is an adequate (or even remotely accurate)
model of a real biological neuron. They operate on
completely different principles, as we now know much better
than when such things were first dreamed up.
naasking wrote 1 day ago: | |
The brain certainly could be doing super-Turing | |
computation, but that would overturn quite a bit of | |
physics seeing as how not even quantum computers are more | |
powerful than Turing machines (they're just faster on | |
some problems). Extraordinary claims and all that. | |
As for equivalency, that depends on how that's defined. | |
Real neurons would not feature any more computational | |
power than Turing machines or artificial neural networks, | |
but I never said it would be a waste of time to talk | |
about their differences. I merely pointed out that the | |
artificial neural network model is still sufficient, even | |
if real neurons have more complexity. | |
> No neuroscientist thinks that a weighted sum is an | |
adequate (or even remotely accurate) model of a real | |
biological neuron | |
Fortunately that's not what I said. If the neuron indeed | |
has more relevant complexity, then it wouldn't be one | |
weighted sum = one biological neuron, but one biological | |
neuron = a network of weighted sums, since such a network | |
can model any function. | |
xanderlewis wrote 1 day ago: | |
The original comment you were in defence of was
suggesting that artificial neurons were somehow very
close to biological ones, since supposedly that's
where their inspiration came from.
If you're interested in pure computational
"power", then if the brain is nothing more than a
Turing machine (which, as you agree, it might not be),
fine. You can call them "equivalent". It's just
not very meaningful.
What's interesting about neural nets has nothing to
do with what they can compute; indeed they can compute
anything any other Turing machine can, and nothing
more. What's interesting is how they do it, since
they can "learn" and hence allow us to produce
solutions to hard problems without any explicit
programming or traditional analysis of the problem.
> that would overturn quite a bit of physics
Our physics is currently woefully incomplete, so… yes.
That would be welcome.
kragen wrote 1 day ago: | |
anns originated in hypotheses about how neurobiology might work in | |
the 01940s but diverged completely from neurobiology in the 01960s; | |
they contain nothing we've learned about neurons in the last 50 | |
years, and not much from before that either (they don't, for | |
example, do hebbian learning). current anns use training methods | |
like gradient descent with momentum and activation functions like | |
relu which have no plausible biological realization | |
artificial neural networks are an approximation of biological | |
neural networks in the same way that a submarine is an | |
approximation of a fish | |
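For reference, the momentum update mentioned above (biologically implausible or not) is only a couple of lines. A toy sketch minimizing f(w) = w², with made-up hyperparameters:

```python
# Momentum SGD: the velocity accumulates an exponentially decaying
# average of past gradients, smoothing the descent direction.
def momentum_step(w, v, grad, lr=0.1, mu=0.9):
    v = mu * v - lr * grad
    return w + v, v

w, v = 1.0, 0.0
for _ in range(3):
    w, v = momentum_step(w, v, grad=2 * w)  # gradient of f(w) = w^2
# w has moved toward the minimum at 0
```

There is no known mechanism by which a biological synapse could store and decay such a velocity term, which is part of the divergence from neurobiology described above.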
zackmorris wrote 1 day ago: | |
Ya so far this is the best introduction to neural networks from first | |
principles that I've seen. | |
Quickly skimming the draft pdf at [1] I can grok it instantly, | |
because it's written in familiar academic language instead of | |
gobbledygook. Anyone with an undergrad math education in engineering, | |
computer science, etc or a self-taught equivalent understanding of | |
differential equations should be able to read it easily. It does a | |
really good job of connecting esoteric terms like tensors with | |
arrays, gradients with partial derivatives, Jacobians with gradients | |
and backpropagation with gradient descent in forward/reverse mode | |
automatic differentiation. Which helps the reader to grasp the | |
fundamentals instead of being distracted by the implementation | |
details of TensorFlow, CUDA, etc. Some notable excerpts: | |
Introduction (page 4): | |
By viewing neural networks as simply compositions of differentiable | |
primitives we can ask two basic questions (Figure F.1.3): first, what | |
data types can we handle as inputs or outputs? And second, what sort | |
of primitives can we use? Differentiability is a strong requirement | |
that does not allow us to work directly with many standard data | |
types, such as characters or integers, which are fundamentally | |
discrete and hence discontinuous. By contrast, we will see that | |
differentiable models can work easily with more complex data | |
represented as large arrays (what we will call tensors) of numbers, | |
such as images, which can be manipulated algebraically by basic | |
compositions of linear and nonlinear transformations. | |
Chapter 2.2 Gradients and Jacobians (page 23): | |
[just read this section - it connects partial derivatives,
gradients, Jacobians and Taylor's theorem - wow!]
Chapter 4.1.5 Some computational considerations (page 59): | |
In general, we will always prefer algorithms that scale linearly | |
both in the feature dimension c and in the batch size n, since | |
super-linear algorithms will become quickly impractical (e.g., a | |
batch of 32 RGB images of size 1024×1024 has c ≈ 1e7). We can
avoid a quadratic complexity in the equation of the gradient by | |
computing the multiplications in the correct order, i.e., computing | |
the matrix-vector product Xw first. Hence, pure gradient descent is | |
linear in both c and n, but only if proper care is taken in the | |
implementation: generalizing this idea is the fundamental insight for | |
the development of reverse-mode automatic differentiation, a.k.a. | |
back-propagation (Section 6.3). | |
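The ordering point in that excerpt is easy to verify numerically: for the least-squares gradient Xᵀ(Xw − y), multiplying right-to-left keeps everything a cheap matrix-vector product. A sketch assuming the standard least-squares setup (shapes are illustrative):

```python
import numpy as np

n, c = 1000, 200
rng = np.random.default_rng(0)
X, w, y = rng.normal(size=(n, c)), rng.normal(size=c), rng.normal(size=n)

# Right-to-left: two matrix-vector products, O(nc) work.
g_fast = X.T @ (X @ w - y)

# Left-to-right: forming X^T X first costs O(n c^2) time
# and O(c^2) memory, for the same answer.
g_slow = (X.T @ X) @ w - X.T @ y

assert np.allclose(g_fast, g_slow)
```

Generalizing "pick the cheap multiplication order through a chain of Jacobians" is exactly the insight behind reverse-mode automatic differentiation.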
Chapter 6 Automatic differentiation (page 87): | |
We consider the problem of efficiently computing gradients of | |
generic computational graphs, such as those induced by optimizing a | |
scalar loss function on a fully-connected neural network, a task | |
called automatic differentiation (AD) [BPRS18]. You can think of a | |
computational graph as the set of atomic operations (which we call | |
primitives) obtained by running the program itself. We will consider | |
sequential graphs for brevity, but everything can be easily extended | |
to more sophisticated, acyclic computational graphs. | |
The problem may seem trivial, since the chain rule of Jacobians | |
(Section 2.2, (E.2.22)) tells us that the gradient of function | |
composition is simply the matrix product of the corresponding | |
Jacobian matrices. However, efficiently implementing this is the key | |
challenge, and the resulting algorithm (reverse-mode AD or | |
backpropagation) is a cornerstone of neural networks and | |
differentiable programming in general [GW08, BR24]. Understanding it | |
is also key to understanding the design (and the differences) of most | |
frameworks for implementing and training such programs (such as | |
TensorFlow or PyTorch or JAX). A brief history of the algorithm can | |
be found in [Gri12]. | |
Edit: I changed Chapter 2.2.3 Jacobians (page 27) to Chapter 2.2 | |
Gradients and Jacobians (page 23) for better context. | |
[1]: https://arxiv.org/pdf/2404.17625 | |
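The reverse-mode idea in the Chapter 6 excerpt (a forward pass records a tape of primitive inputs, then a backward sweep multiplies local derivatives in reverse order) fits in a short toy sketch for sequential scalar graphs. This is an illustration of the algorithm, not code from the book:

```python
import math

# Each primitive supplies a forward function and its local derivative.
PRIMITIVES = {
    "square": (lambda x: x * x,       lambda x: 2 * x),
    "exp":    (lambda x: math.exp(x), lambda x: math.exp(x)),
    "sin":    (lambda x: math.sin(x), lambda x: math.cos(x)),
}

def grad_chain(ops, x):
    """Return (value, dvalue/dx) for the composition ops[-1](...ops[0](x))."""
    # Forward pass: record the input of every primitive (the "tape").
    tape, v = [], x
    for name in ops:
        f, _ = PRIMITIVES[name]
        tape.append(v)
        v = f(v)
    # Backward pass: accumulate the chain rule in reverse order.
    bar = 1.0
    for name, inp in zip(reversed(ops), reversed(tape)):
        _, df = PRIMITIVES[name]
        bar *= df(inp)
    return v, bar

# d/dx sin(x^2) = cos(x^2) * 2x
val, g = grad_chain(["square", "sin"], 0.5)
```

Real frameworks extend this from scalar chains to acyclic graphs of tensors, but the tape-then-reverse-sweep structure is the same.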
barrenko wrote 1 day ago: | |
Thank you for the summary, and thanks to the OP/author of the book. | |
I started self-studying programming some time ago, then pivoted to | |
AI/ML and (understandably) ended up mostly studying math, these | |
resources are a boon to my folk. | |
jxy wrote 1 day ago: | |
> > Stripped of anything else, neural networks are compositions of | |
differentiable primitives | |
> I'm a sucker for statements like this. It almost feels
philosophical, and makes the whole subject so much more | |
comprehensible in only a single sentence. | |
And I hate inaccurate statements like this. It pretends to be | |
rigorous mathematical, but really just propagates erroneous | |
information, and makes the whole article so much more amateur in only | |
a single sentence. | |
The simple relu is continuous but not differentiable at 0, and its | |
derivative is discontinuous at 0. | |
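In practice, frameworks handle the kink by just picking an element of the subdifferential at 0 (commonly 0), so the backward pass never stalls there. A minimal NumPy sketch of that convention:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    """ReLU has no derivative at 0; we follow the common convention
    of returning a subgradient of 0 there, i.e. the derivative is
    taken to be 0 for x <= 0 and 1 for x > 0."""
    return (x > 0).astype(float)

xs = np.array([-1.0, 0.0, 2.0])
relu(xs)       # [0., 0., 2.]
relu_grad(xs)  # [0., 0., 1.]
```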
kmmlng wrote 9 hours 42 min ago: | |
Eh, it really doesn't matter much in practice. Additionally, there | |
are many other activation functions without this issue. | |
whimsicalism wrote 1 day ago: | |
it's pretty close to accurate, the lack of differentiability at 0 | |
for relu doesn't really come into play in practice | |
xanderlewis wrote 1 day ago: | |
It's not "inaccurate". The mark of true mastery is an ability
to make terse statements that convey a huge amount without
involving excessive formality or discussion of by-the-by technical
details. If ever you've spoken to world-renowned experts in pure
mathematics or other highly technical and pedantic fields,
you'll find they'll say all sorts of "inaccurate" things in
conversation (or even in written documents). It doesn't make them
worthless; far from it.
If you want to have a war of petty pedantry, let's go: the
derivative of ReLU can't be discontinuous at zero, as you say,
because continuity (or indeed discontinuity) of a function at x
requires the function to have a value at x (which is the negation
of what your first statement correctly claims).
newrotik wrote 1 day ago: | |
Lack of differentiability is actually a very important feature of | |
the underlying optimization problem. | |
You might think that it doesn't matter because ReLU is, e.g., | |
non-differentiable "only at one point". | |
Gradient based methods (what you find in pytorch) generally rely | |
on the idea that gradients should taper to 0 in the proximity of | |
a local optimum. This is not the case for non-differentiable | |
functions, and in fact gradients can be made to be arbitrarily | |
large even very close to the optimum. | |
As you may imagine, it is not hard to construct examples where | |
simple gradient methods that do not properly take these facts | |
into account fail to converge. These examples are not exotic. | |
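A standard non-exotic example: fixed-step (sub)gradient descent on f(x) = |x|. The gradient has magnitude 1 arbitrarily close to the optimum (it never tapers toward 0), so the iterate oscillates around the minimum instead of converging:

```python
# Subgradient of f(x) = |x|, picking 0 at the kink.
def subgrad_abs(x):
    return 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)

x, step = 0.05, 0.1
trace = []
for _ in range(6):
    x -= step * subgrad_abs(x)
    trace.append(round(x, 10))
# trace oscillates forever: [-0.05, 0.05, -0.05, 0.05, -0.05, 0.05]
```

With a differentiable objective and the same fixed step, the shrinking gradient near the optimum would damp these oscillations; here it never does, which is why subgradient methods need decaying step sizes.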
makerdiety wrote 1 day ago: | |
Invoking excessive formality and discussions of minute technical | |
details leads to a cathedral of knowledge built on autistic | |
pedantry. The chosen rabbit hole to get lost in needs to be the | |
correct one. And human science is riddled with the paths that | |
have naive or childish fundamentals. | |
kragen wrote 1 day ago: | |
> a cathedral of knowledge built on autistic pedantry | |
this is certainly true, but more often we use its short name, | |
'math'. it turns out to be far more effective than so-called | |
common sense | |
mistermann wrote 1 day ago: | |
This comment makes me want to both upvote and downvote with | |
extreme enthusiasm/fury! | |
The sign of a truly good conversation? | |
kragen wrote 1 day ago: | |
my experience with world-renowned experts in pure mathematics is | |
that they are much more careful than the average bear to | |
explicitly qualify inaccurate things as inaccurate, because their | |
discipline requires them to be very clear about precisely what | |
they are saying | |
discontinuity of a function at x does not, according to the usual | |
definition of 'continuity', require the function to have a value | |
at x; indeed, functions that fail to have a value at x are | |
necessarily discontinuous there, precisely because (as you say) | |
they are not continuous there. [1] there are other definitions of | |
'discontinuous' in use, but i can't think of one that would give | |
the result you claim | |
[1]: https://en.wikipedia.org/wiki/Continuous_function#Defini... | |
xanderlewis wrote 1 day ago: | |
> they are much more careful than the average bear to | |
explicitly qualify inaccurate things as inaccurate | |
Sure. But what part of this entirely worded in natural
language, and very short statement made you think it was a
technical, formal statement? I think you're just taking an
opportunity to flex your knowledge of basic calculus, and
deliberately attributing intent to the author that isn't
there in order to look clever.
Regarding a function being discontinuous at a point outside its
domain: if you take a completely naive view of what
"discontinuous" means, then I suppose you can say so. But
discontinuity is just the logical negation of continuity.
Observe:
To say that f: X → Y (in this context, a real-valued
function of real numbers) is continuous means precisely
∀x∈X ∀ε>0 ∃δ>0 ∀p∈X: |x - p| < δ ⇒ |f(x) - f(p)| < ε
and so its negation looks like
∃x∈X ∃ε>0 ∀δ>0 ∃p∈X: |x - p| < δ ∧ |f(x) - f(p)| ≥ ε
that is, there is a point in X, the domain of f, where
continuity fails.
For example, you wouldn't talk about a function defined on
the integers being discontinuous at pi, would you? That would
just be weird.
To prove the point further, observe that the set of
discontinuities (according to your definition) of any given
function would actually include every number… in fact every
mathematical object in the universe, which would make it not
even a set in ZFC. So it's absurd.
Even more reasons to believe functions can only be
discontinuous at points of their domain: a function is said to
be discontinuous if it has at least one discontinuity. By your
definition, every function is discontinuous.
…anyway, I said we were going to be petty. I'm trying to
demonstrate this is a waste of time by wasting my own time.
kragen wrote 1 day ago: | |
you have an interesting point of view, and some of the things
you have said are correct, but if you try to use gradient
descent on a function from, say, ℤ → ℝ, you are going
to be a very sad xanda. i would indeed describe such a
function as being discontinuous not just at π but
everywhere, at least with the usual definition of continuity
(though there is a sense in which such a function could be,
for example, scott-continuous)
even in the case of a single discontinuity in the derivative, | |
like in relu', you lose the intermediate value theorem and | |
everything that follows from it; it's not an inconsequential | |
or marginally relevant fact | |
jj3 wrote 1 day ago: | |
Note that any function ℤ → ℝ is continuous on its
domain but nowhere differentiable.
A Scott-continuous function ℤ → ℝ must be
monotone. So not every such function is
Scott-continuous.
kragen wrote 19 hours 17 min ago: | |
aha, thanks! | |
laingc wrote 1 day ago: | |
Because memes aren't allowed on HN, you're not allowed to | |
reply with the "akssshuallllyyy" meme, so you had to go to | |
these lengths. | |
¯\_(ツ)_/¯
xanderlewis wrote 1 day ago: | |
You're actually not far off. I'm somewhat embarrassed
by the above, but I think it makes the point.
gessha wrote 1 day ago: | |
It is soothing to the mind because it conveys that it's
understandable but it doesn't take away from the complexity. You
still have to read through math and pytorch code and debug | |
nonsensical CUDA errors, comb through the data, etc etc | |
whimsicalism wrote 1 day ago: | |
the complexity is in the values learned from the optimization. even | |
the pytorch code for a simple transformer is not that complex, | |
attention is a simple mechanism, etc. | |
gessha wrote 1 day ago: | |
Complexity also comes from the number of papers that work out how | |
different elements of network work and how to intuitively change | |
them. | |
Why do we use conv operators, why do we use attention operators, | |
when do we use one over the other? What augmentations do you use, | |
how big of a dataset do you need, how do you collect the dataset, | |
etc etc etc | |
whimsicalism wrote 1 day ago: | |
idk, just using attention and massive web crawls gets you | |
pretty far. a lot of the rest is more product-style decisions | |
about what personality you want your LM to take. | |
I fundamentally don't think this technology is that complex. | |
gessha wrote 1 day ago: | |
No? In his recent tutorial, Karpathy showed just how much | |
complexity there is in the tokenizer. | |
This technology has been years in the making with many small | |
advances pushing the performance ever so slightly. There's
been theoretical and engineering advances that contributed to | |
where we are today. And we need many more to get the | |
technology to an actually usable level instead of the current | |
word spaghetti that we get. | |
Also, the post is generally about neural networks and not | |
just LMs. | |
When making design decisions about an ML system you | |
shouldn't just choose the attention hammer and hammer away.
There's a lot of design constraints you need to consider
which is why I made the original reply. | |
whimsicalism wrote 1 day ago: | |
Are there micro-optimizations that eke out small | |
advancements? Yes, absolutely - the modern tokenizer is a | |
good example of that. | |
Is the core of the technology that complex? No. You could | |
get very far with a naive tokenizer that just tokenized by | |
words and replaced unknown words with <UNK>. This is extremely
simple to implement and I've trained transformers like | |
this. It (of course) makes a perplexity difference but the | |
core of the technology is not changed and is quite simple. | |
Most of the complexity is in the hardware, not the software | |
innovations. | |
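The naive word-level tokenizer described above really is a few lines: split on whitespace and map out-of-vocabulary words to a single unknown-token id. A sketch with illustrative names, not from any particular library:

```python
# Build a word -> id vocabulary, reserving id 0 for the unknown token.
def build_vocab(corpus, unk="<UNK>"):
    vocab = {unk: 0}
    for word in corpus.split():
        vocab.setdefault(word, len(vocab))
    return vocab

# Encode text, mapping out-of-vocabulary words to the <UNK> id.
def encode(text, vocab, unk="<UNK>"):
    return [vocab.get(w, vocab[unk]) for w in text.split()]

vocab = build_vocab("the cat sat on the mat")
ids = encode("the dog sat", vocab)  # "dog" is OOV -> id 0
```

Modern subword tokenizers (BPE and friends) avoid the OOV problem entirely, at the cost of the extra machinery discussed in this thread.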
> And we need many more to get the technology to an | |
actually usable level instead of the current word spaghetti | |
that we get. | |
I think the current technology is useable. | |
> you shouldnât just choose the attention hammer and | |
hammer away | |
It's a good first choice of hammer, tbph. | |
SkyBelow wrote 1 day ago: | |
Before the recent AI boom, I was mystified by the possibility of AI | |
and emulating humans (in no small part thanks to works of fiction | |
showing AI powered androids). Then I created and trained some neural | |
networks. Smaller ones, doing much of nothing special. That was | |
enough to break the mysticism. To realize it was just multiplying | |
matrices. Training them was a bit more advanced, but still applied | |
mathematics. | |
Only recently have I begun to appreciate that the simplicity of the | |
operation, applied to large enough matrices, may still capture
enough of the nature of intelligence and sentience. In the end we | |
can be broken down into (relatively) simple chemical reactions, and | |
it is the massive scale of these reactions that create real | |
intelligence and sentience. | |
mistermann wrote 1 day ago: | |
Next step, in case you get bored: why (in fact, not a minor | |
distinction) does such a simple approach work so well? | |
naasking wrote 1 day ago: | |
Exactly, the people who are derisive of those who consider ML | |
models to exhibit glimmers of true intelligence because it's only | |
matrix multiplications always amuse me. It's like they don't even | |
realize the contradiction in holding the position that seemingly | |
complex and intelligent outward behaviour should not be used as an | |
indication of actual complexity and intelligence. | |
citizen_friend wrote 1 day ago: | |
If you study this historically you will see that every generation | |
thinks they have the mechanism to explain brains (gear systems, | |
analog control/cybernetics, perceptrons). | |
My conclusion is we tend to overestimate our understanding and | |
the power of our inventions. | |
naasking wrote 1 day ago: | |
The difference is that we now actually have a proof of | |
computational power and computational universality. | |
citizen_friend wrote 1 day ago: | |
Analog circuits have the same computational power. Piecewise | |
linear functions have the same computational universality. | |
naasking wrote 1 day ago: | |
Except we didn't know any of that, nor did we know how to | |
construct physical analogs in order to achieve universal | |
computation. At best, we had limited task-specific | |
computation, like clocks and planetary motion. | |
citizen_friend wrote 16 hours 20 min ago: | |
We knew about universal function approximators like | |
polynomials and trig functions since the 1700s. Turing | |
and Gödel were around 1910 and 1920. The cybernetics | |
movement was big in the 30s and 40s, perceptrons in the | |
50s and 60s. | |
naasking wrote 15 hours 40 min ago: | |
Taylor expansions for all functions do not exist. | |
Furthermore, our characterization of infinity was still | |
poor, so we didn't even have a solid notion of what it | |
would mean for a formalism to be able to compute all | |
computable functions. The notion of a universal | |
computer arguably didn't exist until Babbage. | |
I stand by my position that having a mathematical proof | |
of computational universality is a significant | |
difference that separates today from all prior eras | |
that sought to understand the brain through | |
contemporaneous technology. | |
citizen_friend wrote 15 hours 31 min ago: | |
> Taylor expansions | |
That's not what I'm talking about. This is a | |
basic analysis topic: [1] At least mid 1800s for a | |
proof. The 1700s also explored Fourier series. | |
> stand by my position | |
And you're still ignoring the cybernetics, and | |
perceptrons movement I keep referring to which was | |
more than 100 years ago, and informed by Turing. | |
[1]: https://en.m.wikipedia.org/wiki/Stone%E2%80%... | |
naasking wrote 15 hours 18 min ago: | |
> That's not what I'm talking about. This is a | |
basic analysis topic: | |
It's the same basic flaw: requiring continuous | |
functions. Not all functions are continuous, | |
therefore this is not sufficient. | |
> And you're still ignoring the cybernetics, and | |
perceptrons movement I keep referring to which was | |
more than 100 years ago, and informed by Turing. | |
What about them? As long as they're universal, they | |
can all simulate brains. Anything after Church and | |
Turing is just window dressing. Notice how none of | |
these new ideas claimed to change what could in | |
principle be computed, only how much easier or more | |
natural this paradigm might be for simulating or | |
creating brains. | |
citizen_friend wrote 12 hours 45 min ago: | |
This implies it works piecewise. That's also | |
true of neural nets lol. You have to keep adding | |
more neurons to get the granularity of whatever | |
your discontinuities are. | |
It's also a different reason than Taylor series | |
which uses differentiability. | |
You do not understand this subject. Please read | |
before repeating this: [1] | |
> what about them | |
Then you seem to have lost the subject of the | |
thread. | |
[1]: https://en.m.wikipedia.org/wiki/Universa... | |
captainclam wrote 1 day ago: | |
Ugh, exactly, it's so cool. I've been a deep learning practitioner | |
for ~3 years now, and I feel like this notion has really been | |
impressed upon me only recently. | |
I've spent an awful lot of mental energy trying to conceive of how | |
these things work, when really it comes down to "does increasing this | |
parameter improve the performance on this task? Yes? Move the dial up | |
a bit. No? Down a bit..." x 1e9. | |
And the cool part is that this yields such rich, interesting, | |
sometimes even useful, structures! | |
I like to think of this cognitive primitive as the analogue to the | |
idea that thermodynamics is just the sum of particles bumping into | |
each other. At the end of the day, that really is just it, but the | |
collective behavior is something else entirely. | |
kadushka wrote 1 day ago: | |
it comes down to "does increasing this parameter improve the | |
performance on this task? Yes? Move the dial up a bit. No? Down a | |
bit..." x 1e9 | |
This is not how gradient based NN optimization works. What you | |
described is called "random weight perturbation", a variant of | |
evolutionary algorithms. It does not scale to networks larger than | |
a few thousand parameters for obvious reasons. | |
NNs are optimized by directly computing a gradient which tells us | |
the direction to go to reduce the loss on the current batch of | |
training data. There's no guessing up or down and seeing if it | |
worked - we always know which direction to go. | |
SGD and RWP are two completely different approaches to learning | |
optimal NN weights. | |
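A toy one-parameter example (entirely invented) makes the contrast concrete: SGD steps along the analytic gradient, while RWP proposes random nudges and keeps only the ones that lower the loss.

```python
import random

def loss(w):
    return (w - 3.0) ** 2  # minimum at w = 3

def grad(w):
    return 2 * (w - 3.0)   # analytic gradient of the loss

# Gradient descent: the gradient itself says which way to move.
w = 0.0
for _ in range(100):
    w -= 0.1 * grad(w)

# Random weight perturbation: try a random nudge, keep it only
# if the loss actually improved.
v = 0.0
random.seed(0)
for _ in range(1000):
    cand = v + random.uniform(-0.1, 0.1)
    if loss(cand) < loss(v):
        v = cand
```

Both end up near the minimum here, but RWP's blind trial-and-error is exactly what stops scaling once there are millions of dials instead of one.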
captainclam wrote 1 day ago: | |
I guess you could say I don't know RWP from Adam! :D | |
My og comment wasn't to accurately explain gradient optimization, | |
I was just expressing a sentiment not especially aimed at experts | |
and not especially requiring details. | |
Though I'm afraid I subjected you to the same "cringe" I | |
experience when I read pop sci/tech articles describe deep | |
learning optimization as "the algorithm" being "rewarded" or | |
"punished," haha. | |
kadushka wrote 1 day ago: | |
No worries, we're all friends here! | |
it's just you happened to accidentally describe the idea behind | |
RWP, which is a gradient-free optimization method, so I thought | |
I should point it out. | |
xanderlewis wrote 1 day ago: | |
I don't think the author literally meant tweaking the | |
parameters and seeing what happens; it's probably an analogy | |
meant to give a sense of how the gradient indicates what | |
direction and to what degree the parameters should be tweaked. | |
Basically, substitute "the gradient is positive" for | |
"increasing this parameter decreases performance" and vice | |
versa and it becomes correct. | |
p1esk wrote 1 day ago: | |
That substitution is the main difference between SGD and RWP. | |
It's like describing bubble sort when you meant to describe | |
quick sort. Would not fly on an ML 101 exam, or in an ML job | |
interview. | |
a_random_canuck wrote 1 day ago: | |
I don't think anyone is trying to pass an exam here, but | |
just give an understandable overview to a general audience. | |
xanderlewis wrote 1 day ago: | |
It's not like that at all. You couldn't accidentally | |
sound like you're describing quick sort when describing | |
bubble sort, or vice versa. I can't think of any | |
substitution of a few words that would do that. | |
The meaning of the gradient is perfectly adequately described | |
by the author. They weren't describing an algorithm for | |
computing it. | |
JackFr wrote 1 day ago: | |
NAND gates by themselves are kind of dull, but it's pretty cool | |
what you can do with a billion of them. | |
xanderlewis wrote 1 day ago: | |
> At the end of the day, that really is just it, but the collective | |
behavior is something else entirely. | |
Exactly. It's not to say that neat descriptions like this are the | |
end of the story (or even the beginning of it). If they were, there | |
would be no need for this entire field of study. | |
But they are cool, and can give you a really clear | |
conceptualisation of something that can appear more like a sum of | |
disjoint observations and ad hoc tricks than a discipline based on | |
a few deep principles. | |
andoando wrote 1 day ago: | |
What does "differentiable primitives" mean here? | |
esafak wrote 1 day ago: | |
functions, which you can compose to increase their expressiveness, | |
and run gradient descent on to train. | |
The success of deep learning is basically attributable to | |
composable (expressive), differentiable (learnable) functions. The | |
"deep" moniker alludes to the compositionality. | |
dirkc wrote 1 day ago: | |
When I did "AI" it would have meant the sigmoid function, these | |
days it's something like ReLU. | |
xanderlewis wrote 1 day ago: | |
I think it's referring to "primitive functions" in the sense | |
that they're the building blocks of more complicated functions. | |
If f and g are differentiable, f+g, fg, f/g (as long as g is never | |
zero)... and so on are differentiable too. Importantly, f composed | |
with g is also differentiable, and so since the output of the whole | |
network as a function of its input is a composition of these | |
"primitives" it's differentiable too. | |
The actual primitive functions in this case would be things like | |
the weighted sums of activations in the previous layer to get the | |
activation of a given layer, and the actual "activation | |
functions" (traditionally something like a sigmoid function; | |
these days a ReLU) associated with each layer. | |
"Primitives" is also sometimes used as a synonym for | |
antiderivatives, but I don't think that's what it means here. | |
Edit: it just occurred to me from a comment below that you might | |
have meant to ask what the "differentiable" part means. See [1]. | |
[1]: https://en.wikipedia.org/wiki/Differentiable_function | |
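The point about composition preserving differentiability can be hand-rolled in a few lines: a "primitive" (here, a weighted sum composed with a sigmoid; the weight is made up) stays differentiable, and the chain rule gives the derivative of the whole composition.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

w = 2.0  # an arbitrary, made-up weight

def f(x):
    # primitive 1 (weighted sum) composed with primitive 2 (sigmoid)
    return sigmoid(w * x)

def f_prime(x):
    # chain rule: d/dx sigmoid(w*x) = sigmoid'(w*x) * w
    return sigmoid_prime(w * x) * w

# sanity check against a central finite difference
x = 0.5
numeric = (f(x + 1e-6) - f(x - 1e-6)) / 2e-6
```

Autodiff frameworks do essentially this bookkeeping, just mechanically and for millions of parameters at once.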
andoando wrote 1 day ago: | |
Is this function composition essentially lambda calculus then? | |
xanderlewis wrote 1 day ago: | |
Composition here just means what it does for any two functions: | |
the value of the "composition" of f and g at x is defined | |
to be f applied to g applied to x. In symbols, it's: (f∘g)(x) := | |
f(g(x)) for each x in the domain of g. It may seem obvious, | |
but the fact that this new thing is also a function (that is, | |
its value is well-defined for every input) is actually a very | |
useful thing indeed and leads to... well, most of mathematics. | |
You can certainly do function composition in lambda calculus: | |
in fact, the act of composition itself is a higher order | |
function (takes functions and returns a function) and you can | |
certainly express it formally with lambda terms and such. | |
It's not really got anything to do with any particular | |
language or model of computation though. | |
andoando wrote 1 day ago: | |
I didn't form my question too well. I understand all that. | |
What I am asking is, are these function compositions | |
equivalent or similar to functions in lambda | |
calculus? | |
I guess my question, is what are the primitive functions here | |
doing? | |
xanderlewis wrote 1 day ago: | |
Well, yes, to the extent that functions are functions are | |
functions (they're just associations or mappings or | |
whatever you want to call them). | |
Maybe your question boils down to asking something more | |
general like: what's the difference between functions to | |
a computer scientist (or a programmer) and functions to a | |
mathematician? That is, are "functions" in C (or lambda | |
calculus), say, the same "functions" we talk about in | |
calculus? | |
The answer to that is: in this case, because these are | |
quite simple functions (sums and products and compositions | |
thereof) they're the same. In general, they're a bit | |
different. The difference is basically the difference | |
between functional programming and "traditional" | |
programming. If you have state/"side effects" of | |
functions, then your function won't be a function in the | |
sense of mathematics; if the return value of your function | |
depends entirely on the input and doesn't return | |
different values depending on whatever else is happening in | |
the program, then it will be. | |
Since you're asking about lambda calculus in particular, | |
the answer is that they're the same because lambda | |
calculus doesn't have state. It's "purely | |
functional" in that sense. | |
> I guess my question, is what are the primitive functions | |
here doing? | |
I'm not really sure what you mean. They're doing what | |
functions always do. Every computer program is abstractly a | |
(partial) function. | |
Does that help, or have I misunderstood? | |
andoando wrote 1 day ago: | |
So when I think of functions in lambda calculus, I think | |
of the I,S,K functions which when composed can produce | |
functions like "copy", "add", "remove", "if", etc which | |
then can do different computations like "copy every other | |
symbol if the symbol is true", "multiply 5 times then add | |
2". Since lambda calculus is complete, any | |
computation/program can be composed. | |
When I think of functions in a traditional mathematical | |
sense, I think about transformations of numbers. x->2x, | |
x->2x^2, etc. I completely understand composition of | |
functions here, ex x->2(x->2x)^2, but its unclear how | |
these transformations relate to computation. For a | |
regression problem, I can totally understand how finding | |
the right compositions of functions can lead to better | |
approximations. So I am wondering, in an LLM | |
architecture, what computations do these functions | |
actually represent? I assume, it has something to do with | |
what path to take through the neural layers. | |
I probably just need to take the time to study it deeper. | |
>If you have state/"side effects" of functions, then | |
your function won't be a function in the sense of | |
mathematics; if the return value of your function depends | |
entirely on the input and doesn't return different | |
values depending on whatever else is happening in the | |
program, then it will be. | |
Totally understood from the perspective of functions in | |
say, Java. Though fundamentally I don't think there is a | |
distinction between functions in computer science and | |
mathematics. The program as a whole is effectively a | |
function. The "global" state is from another reference, | |
just local variables of the encompassing function. If a | |
function is modifying variables outside of the "function | |
block" (in say Java), the "input" to the function isn't | |
just the parameters of the function. Imo, this is more of | |
an artifact of implementation of some languages rather | |
than a fundamental difference. Python for example | |
requires declaring global args in the function block. Go | |
one step further and require putting global args into the | |
parameters list and you're pretty close to satisfying | |
this. | |
xanderlewis wrote 1 day ago: | |
I think youâre actually massively overthinking it. | |
The state of a neural network is described entirely by | |
its parameters, which usually consist of a long array | |
(well, a matrix, or a tensor, or whateverâ¦) of | |
floating point numbers. What is being optimised when a | |
network is trained is these parameters and nothing | |
else. When you evaluate a neural network on some input | |
(often called performing âinferenceâ), that is when | |
the functions weâre talking about are used. You start | |
with the input vector, and you apply all of those | |
functions in order and you get the output vector of the | |
network. The training process also uses these | |
functions, because to train a network you have to | |
perform evaluation repeatedly in between tweaking those | |
parameters to make it better approximate the desired | |
output for each input. Importantly, the functions do | |
not change. They are constant; itâs the parameters | |
that change. The functions are the architecture â not | |
the thing being learned. Essentially what the | |
parameters represent is how likely each neuron is to be | |
activated (have a high value) if others in the previous | |
layer are. So you can think of the parameters as | |
encoding strengths of connections between each pair of | |
neurons in consecutive layers. Thinking about âwhat | |
path to take through the neural layersâ is way too | |
sophisticated â itâs not doing anything like that. | |
> Though fundamentally I don't think there is | |
distinction between functions in computer science and | |
mathematics. The program as a whole is effectively a | |
function. | |
Youâre pretty much right about that, but there are | |
two important problems/nitpicks: | |
(1) We canât prove (in general) that a given program | |
will halt and evaluate to something (rather than just | |
looping forever) on a given input, so the âentire | |
programâ is instead whatâs called a partial | |
function. This means that itâs still a function on | |
its domain â but we canât know what its precise | |
domain is. Given an input, it may or may not produce an | |
output. If it does, though, itâs well defined because | |
itâs a deterministic process. | |
(2) Youâre right to qualify that itâs the whole | |
program that is (possibly) a function. If you take a | |
function from some program that depends on some state | |
in that same program, then clearly that function | |
wonât be a proper âmathematicalâ function. Sure, | |
if you incorporate that extra state as one of your | |
inputs, it might be, but thatâs a different function. | |
You have to remember that in mathematics, unlike in | |
programming, a function consists essentially of three | |
pieces of data: a domain, a codomain, and a âruleâ. | |
If you want to be set-theoretic and formal about it, | |
this rule is just a subset of the cartesian product of | |
its domain and codomain (itâs a set of pairs of the | |
form (x, f(x))). If you change either of these sets, | |
itâs technically a different function and there are | |
good reasons for distinguishing between these. So | |
itâs not right to say that mathematical functions and | |
functions in a computer program are exactly the same. | |
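The "fixed architecture, changing parameters" point can be made very small. In this toy sketch (all numbers invented) the function never changes; only the parameter values fed into it do:

```python
def forward(params, x):
    # fixed architecture: a linear layer followed by a ReLU
    w, b = params
    return max(0.0, w * x + b)

before = (0.5, 0.0)   # parameters before training
after = (2.0, -1.0)   # parameters after training (invented values)

# Same function object, different parameters, different behaviour:
y0 = forward(before, 1.0)
y1 = forward(after, 1.0)
```

Training searches over tuples like `before`/`after`; `forward` itself is the constant scaffolding.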
andoando wrote 1 day ago: | |
I appreciate your responses, sorry I hope I don't | |
seem like I'm arguing for the sake of arguing. | |
>Essentially what the parameters represent is how | |
likely each neuron is to be activated (have a high | |
value) if others in the previous layer are. So you | |
can think of the parameters as encoding strengths of | |
connections between each pair of neurons in | |
consecutive layers. Thinking about "what path to | |
take through the neural layers" is way too | |
sophisticated - it's not doing anything like | |
that. | |
I'm a little confused. The discussion thus far has | |
been about how neural networks are essentially just | |
compositions of functions, but you are now saying that | |
the function is static, and only the parameters change. | |
But that aside, if these parameters change which | |
neurons are activated, and this activation affects | |
which neurons are activated in the next layer, are | |
these parameters effectively not changing the path | |
taken through the layers? | |
>Sure, if you incorporate that extra state as one of | |
your inputs, it might be, but that's a different | |
function. | |
So say we have this program | |
" | |
let c = 2; | |
function sum3(a, b) { | |
return a + b + c; | |
} | |
let d = sum3(3, 4)" | |
I believe you are saying, if we had constructed this | |
instead as | |
"function sum3(a, b, c) { | |
return a + b + c | |
} | |
let d = sum3(3, 4, 2) | |
" | |
then, this is a different function. | |
Certainly, these are different in a sense, but at a | |
fundamental level, when you compile this all down and | |
run it, there is an equivalency in the transformation | |
that is happening. That is, the two functions | |
equivalently take some input state A (composed of | |
a,b,c) and return the same output state B, while | |
applying the same intermediary steps (add a to b, add | |
c to the result of (add a to b)). Really, in the first | |
case where c is defined outside the scope of the function | |
block, the interpreter is effectively producing the | |
function sum3(x, y, c), as it has to at some point, one | |
way or another, inject c into a+b+c. | |
Similarly, I won't argue that the current, formal | |
definitions of functions in mathematics are exactly | |
that of functions as they're generally defined in | |
programming. | |
Rather, what I am saying is that there is an equivalent | |
way to think and study functions that equally apply | |
to both fields. That is, a function is simply a | |
transformation from A to B, where A and B can be | |
anything, whether that is bits, numbers, or any other | |
construction in any system. The only primitive | |
distinction to make here is whether A and B are the | |
same thing or different. | |
OJFord wrote 1 day ago: | |
Function composition is just f(g(x)), considered as a single | |
function that's the composition of f and g; it has the domain | |
of g and the range of f. | |
In lambda calculus terminology it's an 'application' (with a | |
function argument). | |
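Composition as a higher-order function is a one-liner in any language with first-class functions; the example functions here are made up:

```python
def compose(f, g):
    # returns the function x -> f(g(x))
    return lambda x: f(g(x))

double = lambda x: 2 * x
inc = lambda x: x + 1

h = compose(double, inc)  # h(x) = 2 * (x + 1)
```

Note the order: `compose(double, inc)` applies `inc` first, so its domain is `inc`'s and its range is `double`'s.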
CobrastanJorji wrote 1 day ago: | |
Continuous mathematical functions which have derivatives. | |
glonq wrote 1 day ago: | |
I wonder if the usage of Alice & Wonderland takes inspiration from | |
Douglas Hofstadter's "Gödel, Escher, Bach: an Eternal Golden Braid" ? | |
abdullahkhalids wrote 1 day ago: | |
Lewis Carroll's Alice in Wonderland features a number of logical and | |
mathematical puzzles. [1] He also wrote What the Tortoise Said to | |
Achilles (1895), in which the paradoxes of Zeno are discussed. | |
So it's more correct to say that GEB and this article are originally | |
inspired by Lewis Carroll's work. [1] I wrote a short article for my | |
university magazine a long time ago. Some interesting references at | |
the end. | |
[1]: https://abd.tiddlyspot.com/#%5B%5BMathematical%20Adventures%... | |
devnonymous wrote 1 day ago: | |
No, I think the inspiration is more direct | |
[1]: https://duckduckgo.com/?q=lewis+carroll+alice+in+wonderland+... | |
iainmerrick wrote 1 day ago: | |
It's a pretty common trope, especially for math-related books, e.g. | |
Alex Bellos' "Alex's Adventures in Numberland". | |
<- back to front page |