_______ __ _______ | |
| | |.---.-..----.| |--..-----..----. | | |.-----..--.--.--..-----. | |
| || _ || __|| < | -__|| _| | || -__|| | | ||__ --| | |
|___|___||___._||____||__|__||_____||__| |__|____||_____||________||_____| | |
on Gopher (unofficial)
Visit Hacker News on the Web | |
COMMENT PAGE FOR: | |
Alice's adventures in a differentiable wonderland | |
seanhunter wrote 1 day ago: | |
Wow, just skimmed a bit, but this book looks amazing so far. Really | |
understandable but with an intuitive presentation of the underlying | |
maths that invites the reader to go deeper if they want to by giving | |
them what they need to get started. | |
gfaure wrote 1 day ago: | |
In the literature, they're usually called convolutional layers (I think | |
you can pretty much search and replace all uses of "convolutive" in the | |
text). | |
p1esk wrote 1 day ago: | |
And then you learn about binary or ternary networks where gradients
don't really exist anywhere, and you start to wonder about the
importance of this differentiability.
ubj wrote 1 day ago: | |
...And then you start learning about generalizations of the notion of | |
"gradient" to scenarios where the classical gradient doesn't exist :) | |
whimsicalism wrote 1 day ago: | |
binary networks don't really work well unless you do a relaxation | |
first | |
Eduard wrote 1 day ago: | |
better remove all the Disney-based Alice in Wonderland character | |
intellectual property from the book. | |
astrodust wrote 1 day ago: | |
I was just thinking that's "cease and desist" bait right there. | |
iainmerrick wrote 1 day ago: | |
Alice in Wonderland (the book) is in the public domain. The old | |
Disney movie is still in copyright, and the cover image does look | |
very much like it's from the movie, but that character design is | |
from John Tenniel's illustrations which are also in the public | |
domain. | |
astrodust wrote 1 day ago: | |
The character design is. The image, however, is clearly Disney | |
flavoured if not traced directly. | |
His version, for example, does not have the distinctive bow. The | |
art style is also completely different. | |
iainmerrick wrote 1 day ago: | |
True - it would be a good idea to use a Tenniel piece instead. | |
Edit to add: I was mostly trying to push back on the | |
implication that Disney owns Alice in Wonderland (and Peter | |
Pan, Winnie the Pooh, etc). Now I re-read the original comment, | |
they did specify "Disney-based", so maybe I'm over-reacting! | |
xanderlewis wrote 1 day ago: | |
> Stripped of anything else, neural networks are compositions of | |
differentiable primitives | |
I'm a sucker for statements like this. It almost feels philosophical,
and makes the whole subject so much more comprehensible in only a
single sentence.
I think François Chollet says something similar in his book on deep
learning: one shouldn't fall into the trap of anthropomorphising and
mysticising models based on the "neural" name; deep learning is
simply the application of sequences of operations that are nonlinear
(and hence capable of encoding arbitrary complexity) but nonetheless
differentiable and so efficiently optimisable.
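The "composition of differentiable primitives" view can be made concrete in a few lines: each layer is just a differentiable function, and the network is literally their composition. A toy sketch with made-up shapes, not code from the book:

```python
import numpy as np

rng = np.random.default_rng(0)

# Each "primitive" is a differentiable function of its inputs.
def linear(x, W, b):
    return x @ W + b

def tanh(x):  # smooth nonlinearity, differentiable everywhere
    return np.tanh(x)

# The network is nothing more than their composition:
# f(x) = linear(tanh(linear(x, W1, b1)), W2, b2)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)

def f(x):
    return linear(tanh(linear(x, W1, b1)), W2, b2)

y = f(rng.normal(size=(5, 3)))  # batch of 5 inputs -> shape (5, 2)
```

Because every piece is differentiable, the chain rule applies to the whole composition, which is what makes gradient-based training possible.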
m463 wrote 1 day ago: | |
chess is pattern matching. | |
jonas21 wrote 1 day ago: | |
I feel like this statement is both obvious after spending a few | |
minutes working with neural networks and completely useless in | |
helping you build better neural networks. | |
It's kind of like saying, "Stripped of anything else, works of | |
literature are compositions of words" | |
falcor84 wrote 1 day ago: | |
Let me try to one-up that: "Stripped of anything else, works of
literature are compositions of the uppercase letter I" | |
Horffupolde wrote 1 day ago: | |
Well, I'd argue that could also be a bit enlightening. It's
like taking a moment to appreciate that forests are composed of | |
single trees. It takes a certain level of insight to appreciate | |
systems at various depths. | |
sideshowb wrote 1 day ago: | |
> deep learning is simply the application of sequences of operations | |
that are nonlinear but nonetheless differentiable | |
Though other things fit this description which are not deep learning. | |
Like (shameless plug) my recent paper here | |
[1]: https://ieeexplore.ieee.org/document/10497907 | |
naasking wrote 1 day ago: | |
> one shouldn't fall into the trap of anthropomorphising and
mysticising models based on the "neural" name
One also shouldn't fall into the dual trap of assuming that just | |
because one understands how a model works, it cannot have any bearing | |
on the ever-mysterious operation of the brain. | |
jimbokun wrote 1 day ago: | |
I always get the impression even the proponents of these algorithms | |
when they didn't seem so promising, are shocked at the capabilities | |
demonstrated by models built with such a relatively simple procedure. | |
phkahler wrote 1 day ago: | |
>> one shouldn't fall into the trap of anthropomorphising and
mysticising models based on the "neural" name
And yet, artificial neural networks ARE an approximation of how | |
biological neurons work. It is worth noting that they came out of | |
neurobiology and not some math department - well at least in the | |
forward direction, I'm not sure who came up with the training | |
algorithms (probably the math folks). Should they be considered | |
mystical? No. I would also posit that biological neurons are more | |
efficient and probably have better learning algorithms than | |
artificial ones today. | |
I'm confused as to why some people seem to shun the biological | |
equivalence of these things. In a recent thread here I learned that | |
physical synaptic weights (in our brains) are at least partly stored | |
in DNA or its methylation. If that isn't fascinating I'm not sure | |
what is. Or is it more along the lines of intelligence can be reduced | |
to a large number of simple things, and biology has given us an | |
interesting physical implementation? | |
chriswarbo wrote 1 day ago: | |
> And yet, artificial neural networks ARE an approximation of how | |
biological neurons work. | |
Only if you limit yourself to "sums of weighted inputs, sent | |
through a 1D activation function". | |
However, the parent said "differentiable primitives": these days | |
people have built networks that contain differentiable ray-tracers, | |
differentiable physics simulations, etc. Those seem like crazy | |
ideas if we limit ourselves to the "neural" analogy; but are quite | |
natural for a "composition of differentiable primitives" approach. | |
srean wrote 1 day ago: | |
> And yet, artificial neural networks ARE an approximation of how | |
biological neurons work | |
For a non-vapid/non-vacuous definition of 'approximation' this is | |
not true at all. It is well understood that (i) back-propagation is
biologically infeasible in the brain, and (ii) that the output
'voltage' is a transformed weighted average of the input 'voltage' is
not how neurons operate. (ii) is in the 'not even wrong' category.
Neurons operate in terms of spikes and frequency and quiescence of | |
spiking. If you are interested any undergrad text in neurobiology | |
will help correct the wrong notions. | |
xanderlewis wrote 1 day ago: | |
As the commenter below mentions, the biological version of a neuron | |
(i.e. a neuron) is much more complicated than the neural network | |
version. The neural network version is essentially just a weighted | |
sum, with an extra layer of shaping applied afterwards to make it | |
nonlinear. As far as I know, we still don't understand all of the
complexity about how biological neurons work. Even skimming the
Wikipedia page for "neuron" will give you some idea.
The original idea of approximating something like a neuron using a | |
weighted sum (which is a fairly obvious idea, given the initial | |
discovery that neurons become "activated" and they do so in
proportion to how much the neurons they are connected to are) did | |
come from thinking about biological brains, but the mathematical | |
building blocks are incredibly simple and are hundreds of years | |
old, if not thousands. | |
naasking wrote 1 day ago: | |
> the biological version of a neuron (i.e. a neuron) is much | |
more complicated than the neural network version | |
This is a difference of degree not of kind, because neural | |
networks are Turing complete. Whatever additional complexity the
neuron has can itself be modelled as a neural network. | |
Edit: meaning, that if the greater complexity of a biological | |
neuron is relevant to its information processing component, then | |
that just increases the number of artificial neural network | |
neurons needed to describe it, it does not need any computation | |
of a different kind. | |
srean wrote 1 day ago: | |
> This is a difference of degree not of kind | |
Nope. | |
Neurons in our brain operate fundamentally differently. They | |
work by transient spikes and information is carried not by the | |
intensity of the spike voltage, but by the frequency of | |
spiking. This is a fundamentally different phenomenon from ANNs,
where the output (voltage) is a squashing transform of aggregated
input values (voltages).
phkahler wrote 20 hours 22 min ago: | |
>> Neurons in our brain operate fundamentally differently. | |
They work by transient spikes and information is carried not | |
by the intensity of the spike voltage, but by the frequency | |
of spiking. | |
I thought they worked like accumulators where the spike | |
"energy" accumulates until the output "fires". If that's the | |
case then the artificial NNs are still an approximation of | |
that process. I agree that this is a significant difference, | |
but the mathematical version is still a rough approximation | |
inspired by the biological one. | |
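The accumulate-until-fire behaviour described above is usually modelled as a leaky integrate-and-fire neuron. A minimal sketch; all constants are arbitrary toy values, not from any particular neuroscience reference:

```python
def lif_spike_train(input_current, steps=200, dt=1.0,
                    tau=20.0, v_thresh=1.0, v_reset=0.0):
    """Minimal leaky integrate-and-fire neuron: the membrane potential
    accumulates input, leaks toward rest, and emits a spike (then
    resets) whenever it crosses the threshold. Information is carried
    by spike timing/frequency, not by an output voltage level."""
    v, spikes = 0.0, []
    for t in range(steps):
        v += dt * (-v / tau + input_current)  # leak + accumulate
        if v >= v_thresh:
            spikes.append(t)
            v = v_reset
    return spikes

# Stronger input -> higher spike frequency (rate coding)
low = len(lif_spike_train(0.06))
high = len(lif_spike_train(0.20))
```

This is where the analogy with the ANN weighted sum breaks down: the output here is a train of discrete events whose rate encodes the signal, not a squashed real number.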
andoando wrote 1 day ago: | |
And assembly is also turing complete, so if two models being | |
both Turing completeness means they are equivalent, there would | |
be no need for coding neural networks at all. Would you | |
consider LLMs a different kind of computation than writing | |
assembly code? | |
Perhaps fundamentally they are not, but its also true that just | |
writing more and more random assembly code isn't going to lead | |
to an LLM. | |
naasking wrote 1 day ago: | |
LLMs aren't randomly generated though, they are shaped by | |
training data. This means there would, in principle, be a | |
comparable way to synthesize an equivalent assembly program | |
from that same training data. | |
The difference here is that it's just more obvious how to do | |
this in one case than the other. | |
My point was only that 1) neural networks are sufficient, | |
even if real neurons have additional complexity, and 2) | |
whatever that additional complexity, artificial neural | |
networks can learn to reproduce it. | |
andoando wrote 1 day ago: | |
I understand that, what I am saying though is the fact that | |
they can doesn't mean that they will by simply scaling | |
their number. It still entirely depends on how they are | |
trained/arranged, meaning it may take a completely | |
different way of composing/glueing neurons together to
simulate any additional complexity. It's like saying a nand
gate is turing complete, I put 1000000000 of them in a | |
series, but its not doing anything, what gives, do I need | |
to add a billion more? | |
Just as modeling and running a single neuron takes x
amount of transistors configured in a very specific way for | |
example, it may take y amount of neurons arranged in some
very specific, unknown way to model something that has extra
properties.
And its not clear either whether neurons are fundamentally | |
the correct approach to reach this higher level | |
construction than some other kind of node. | |
xanderlewis wrote 1 day ago: | |
PowerPoint is Turing complete. Does that mean PowerPoint should | |
be regarded as being biological or at least | |
neuroscience-inspired? | |
naasking wrote 1 day ago: | |
No, but neural networks literally were inspired by biology so | |
I'm not sure what your point is. | |
xanderlewis wrote 1 day ago: | |
My point is that you seem to think neurons in the sense of | |
artificial neural networks and neurons in the human brain | |
are equivalent because: | |
(1) Neural networks are Turing complete, and hence can do | |
anything brains can. [debatable anyway; we don't know
this to be the case since brains might be doing more than
computation. Ask a philosopher or a cognitive scientist. Or
Roger Penrose.]
(2) Neural networks were very loosely inspired by the idea
that the human brain is made up of interconnected nodes
that "activate" in proportion to how other related
nodes do.
I don't think that's nearly enough to say that
they're equivalent. For (1), we don't yet know (and
we're not even close), and anyway: if you consider all
Turing complete systems to be equivalent to the point of it
being a waste of time to talk about their differences then
you can say goodbye to quite a lot of work in theoretical
computer science. For (2): so what? Lots of things are
inspired by other things. It doesn't make them in any
sense equivalent, especially if the analogy is as weak as
it is in this case. No neuroscientist thinks that a
weighted sum is an adequate (or even remotely accurate)
model of a real biological neuron. They operate on
completely different principles, as we now know much better
than when such things were first dreamed up.
naasking wrote 1 day ago: | |
The brain certainly could be doing super-Turing | |
computation, but that would overturn quite a bit of | |
physics seeing as how not even quantum computers are more | |
powerful than Turing machines (they're just faster on | |
some problems). Extraordinary claims and all that. | |
As for equivalency, that depends on how that's defined. | |
Real neurons would not feature any more computational | |
power than Turing machines or artificial neural networks, | |
but I never said it would be a waste of time to talk | |
about their differences. I merely pointed out that the | |
artificial neural network model is still sufficient, even | |
if real neurons have more complexity. | |
> No neuroscientist thinks that a weighted sum is an | |
adequate (or even remotely accurate) model of a real | |
biological neuron | |
Fortunately that's not what I said. If the neuron indeed | |
has more relevant complexity, then it wouldn't be one | |
weighted sum = one biological neuron, but one biological | |
neuron = a network of weighted sums, since such a network | |
can model any function. | |
xanderlewis wrote 1 day ago: | |
The original comment you were in defence of was
suggesting that artificial neurons were somehow very
close to biological ones, since supposedly that's
where their inspiration came from.
If you're interested in pure computational
"power", then if the brain is nothing more than a
Turing machine (which, as you agree, it might not be),
fine. You can call them "equivalent". It's just
not very meaningful.
What's interesting about neural nets has nothing to
do with what they can compute; indeed they can compute
anything any other Turing machine can, and nothing
more. What's interesting is how they do it, since
they can "learn" and hence allow us to produce
solutions to hard problems without any explicit
programming or traditional analysis of the problem.
> that would overturn quite a bit of physics
Our physics is currently woefully incomplete, so… yes.
That would be welcome.
kragen wrote 1 day ago: | |
anns originated in hypotheses about how neurobiology might work in | |
the 01940s but diverged completely from neurobiology in the 01960s; | |
they contain nothing we've learned about neurons in the last 50 | |
years, and not much from before that either (they don't, for | |
example, do hebbian learning). current anns use training methods | |
like gradient descent with momentum and activation functions like | |
relu which have no plausible biological realization | |
artificial neural networks are an approximation of biological | |
neural networks in the same way that a submarine is an | |
approximation of a fish | |
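For reference, the momentum update mentioned above (biologically implausible or not) is only a couple of lines. A toy sketch minimizing f(w) = w², with made-up hyperparameters:

```python
# Momentum SGD: the velocity accumulates an exponentially decaying
# average of past gradients, smoothing the descent direction.
def momentum_step(w, v, grad, lr=0.1, mu=0.9):
    v = mu * v - lr * grad
    return w + v, v

w, v = 1.0, 0.0
for _ in range(3):
    w, v = momentum_step(w, v, grad=2 * w)  # gradient of f(w) = w^2
# w has moved toward the minimum at 0
```

There is no known mechanism by which a biological synapse could store and decay such a velocity term, which is part of the divergence from neurobiology described above.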
zackmorris wrote 1 day ago: | |
Ya so far this is the best introduction to neural networks from first | |
principles that I've seen. | |
Quickly skimming the draft pdf at [1] I can grok it instantly, | |
because it's written in familiar academic language instead of | |
gobbledygook. Anyone with an undergrad math education in engineering, | |
computer science, etc or a self-taught equivalent understanding of | |
differential equations should be able to read it easily. It does a | |
really good job of connecting esoteric terms like tensors with | |
arrays, gradients with partial derivatives, Jacobians with gradients | |
and backpropagation with gradient descent in forward/reverse mode | |
automatic differentiation. Which helps the reader to grasp the | |
fundamentals instead of being distracted by the implementation | |
details of TensorFlow, CUDA, etc. Some notable excerpts: | |
Introduction (page 4): | |
By viewing neural networks as simply compositions of differentiable | |
primitives we can ask two basic questions (Figure F.1.3): first, what | |
data types can we handle as inputs or outputs? And second, what sort | |
of primitives can we use? Differentiability is a strong requirement | |
that does not allow us to work directly with many standard data | |
types, such as characters or integers, which are fundamentally | |
discrete and hence discontinuous. By contrast, we will see that | |
differentiable models can work easily with more complex data | |
represented as large arrays (what we will call tensors) of numbers, | |
such as images, which can be manipulated algebraically by basic | |
compositions of linear and nonlinear transformations. | |
Chapter 2.2 Gradients and Jacobians (page 23): | |
[just read this section - it connects partial derivatives,
gradients, Jacobians and Taylor's theorem - wow!]
Chapter 4.1.5 Some computational considerations (page 59): | |
In general, we will always prefer algorithms that scale linearly | |
both in the feature dimension c and in the batch size n, since | |
super-linear algorithms will become quickly impractical (e.g., a | |
batch of 32 RGB images of size 1024×1024 has c ≈ 1e7). We can
avoid a quadratic complexity in the equation of the gradient by | |
computing the multiplications in the correct order, i.e., computing | |
the matrix-vector product Xw first. Hence, pure gradient descent is | |
linear in both c and n, but only if proper care is taken in the | |
implementation: generalizing this idea is the fundamental insight for | |
the development of reverse-mode automatic differentiation, a.k.a. | |
back-propagation (Section 6.3). | |
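The ordering point in that excerpt is easy to verify numerically: for the least-squares gradient Xᵀ(Xw − y), multiplying right-to-left keeps everything a cheap matrix-vector product. A sketch assuming the standard least-squares setup (shapes are illustrative):

```python
import numpy as np

n, c = 1000, 200
rng = np.random.default_rng(0)
X, w, y = rng.normal(size=(n, c)), rng.normal(size=c), rng.normal(size=n)

# Right-to-left: two matrix-vector products, O(nc) work.
g_fast = X.T @ (X @ w - y)

# Left-to-right: forming X^T X first costs O(n c^2) time
# and O(c^2) memory, for the same answer.
g_slow = (X.T @ X) @ w - X.T @ y

assert np.allclose(g_fast, g_slow)
```

Generalizing "pick the cheap multiplication order through a chain of Jacobians" is exactly the insight behind reverse-mode automatic differentiation.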
Chapter 6 Automatic differentiation (page 87): | |
We consider the problem of efficiently computing gradients of | |
generic computational graphs, such as those induced by optimizing a | |
scalar loss function on a fully-connected neural network, a task | |
called automatic differentiation (AD) [BPRS18]. You can think of a | |
computational graph as the set of atomic operations (which we call | |
primitives) obtained by running the program itself. We will consider | |
sequential graphs for brevity, but everything can be easily extended | |
to more sophisticated, acyclic computational graphs. | |
The problem may seem trivial, since the chain rule of Jacobians | |
(Section 2.2, (E.2.22)) tells us that the gradient of function | |
composition is simply the matrix product of the corresponding | |
Jacobian matrices. However, efficiently implementing this is the key | |
challenge, and the resulting algorithm (reverse-mode AD or | |
backpropagation) is a cornerstone of neural networks and | |
differentiable programming in general [GW08, BR24]. Understanding it | |
is also key to understanding the design (and the differences) of most | |
frameworks for implementing and training such programs (such as | |
TensorFlow or PyTorch or JAX). A brief history of the algorithm can | |
be found in [Gri12]. | |
Edit: I changed Chapter 2.2.3 Jacobians (page 27) to Chapter 2.2 | |
Gradients and Jacobians (page 23) for better context. | |
[1]: https://arxiv.org/pdf/2404.17625 | |
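The reverse-mode idea in the Chapter 6 excerpt (a forward pass records a tape of primitive inputs, then a backward sweep multiplies local derivatives in reverse order) fits in a short toy sketch for sequential scalar graphs. This is an illustration of the algorithm, not code from the book:

```python
import math

# Each primitive supplies a forward function and its local derivative.
PRIMITIVES = {
    "square": (lambda x: x * x,       lambda x: 2 * x),
    "exp":    (lambda x: math.exp(x), lambda x: math.exp(x)),
    "sin":    (lambda x: math.sin(x), lambda x: math.cos(x)),
}

def grad_chain(ops, x):
    """Return (value, dvalue/dx) for the composition ops[-1](...ops[0](x))."""
    # Forward pass: record the input of every primitive (the "tape").
    tape, v = [], x
    for name in ops:
        f, _ = PRIMITIVES[name]
        tape.append(v)
        v = f(v)
    # Backward pass: accumulate the chain rule in reverse order.
    bar = 1.0
    for name, inp in zip(reversed(ops), reversed(tape)):
        _, df = PRIMITIVES[name]
        bar *= df(inp)
    return v, bar

# d/dx sin(x^2) = cos(x^2) * 2x
val, g = grad_chain(["square", "sin"], 0.5)
```

Real frameworks extend this from scalar chains to acyclic graphs of tensors, but the tape-then-reverse-sweep structure is the same.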
barrenko wrote 1 day ago: | |
Thank you for the summary, and thanks to the OP/author of the book. | |
I started self-studying programming some time ago, then pivoted to | |
AI/ML and (understandably) ended up mostly studying math, these | |
resources are a boon to my folk. | |
jxy wrote 1 day ago: | |
> > Stripped of anything else, neural networks are compositions of | |
differentiable primitives | |
> I'm a sucker for statements like this. It almost feels
philosophical, and makes the whole subject so much more | |
comprehensible in only a single sentence. | |
And I hate inaccurate statements like this. It pretends to be | |
rigorous mathematical, but really just propagates erroneous | |
information, and makes the whole article so much more amateur in only | |
a single sentence. | |
The simple relu is continuous but not differentiable at 0, and its | |
derivative is discontinuous at 0. | |
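In practice, frameworks handle the kink by just picking an element of the subdifferential at 0 (commonly 0), so the backward pass never stalls there. A minimal NumPy sketch of that convention:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    """ReLU has no derivative at 0; we follow the common convention
    of returning a subgradient of 0 there, i.e. the derivative is
    taken to be 0 for x <= 0 and 1 for x > 0."""
    return (x > 0).astype(float)

xs = np.array([-1.0, 0.0, 2.0])
relu(xs)       # [0., 0., 2.]
relu_grad(xs)  # [0., 0., 1.]
```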
kmmlng wrote 9 hours 42 min ago: | |
Eh, it really doesn't matter much in practice. Additionally, there | |
are many other activation functions without this issue. | |
whimsicalism wrote 1 day ago: | |
it's pretty close to accurate, the lack of differentiability at 0 | |
for relu doesn't really come into play in practice | |
xanderlewis wrote 1 day ago: | |
It's not "inaccurate". The mark of true mastery is an ability
to make terse statements that convey a huge amount without
involving excessive formality or discussion of by-the-by technical
details. If ever you've spoken to world-renowned experts in pure
mathematics or other highly technical and pedantic fields,
you'll find they'll say all sorts of "inaccurate" things in
conversation (or even in written documents). It doesn't make them
worthless; far from it.
If you want to have a war of petty pedantry, let's go: the
derivative of ReLU can't be discontinuous at zero, as you say,
because continuity (or indeed discontinuity) of a function at x
requires the function to have a value at x (which is the negation
of what your first statement correctly claims).
newrotik wrote 1 day ago: | |
Lack of differentiability is actually a very important feature of | |
the underlying optimization problem. | |
You might think that it doesn't matter because ReLU is, e.g., | |
non-differentiable "only at one point". | |
Gradient based methods (what you find in pytorch) generally rely | |
on the idea that gradients should taper to 0 in the proximity of | |
a local optimum. This is not the case for non-differentiable | |
functions, and in fact gradients can be made to be arbitrarily | |
large even very close to the optimum. | |
As you may imagine, it is not hard to construct examples where | |
simple gradient methods that do not properly take these facts | |
into account fail to converge. These examples are not exotic. | |
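A standard non-exotic example: fixed-step (sub)gradient descent on f(x) = |x|. The gradient has magnitude 1 arbitrarily close to the optimum (it never tapers toward 0), so the iterate oscillates around the minimum instead of converging:

```python
# Subgradient of f(x) = |x|, picking 0 at the kink.
def subgrad_abs(x):
    return 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)

x, step = 0.05, 0.1
trace = []
for _ in range(6):
    x -= step * subgrad_abs(x)
    trace.append(round(x, 10))
# trace oscillates forever: [-0.05, 0.05, -0.05, 0.05, -0.05, 0.05]
```

With a differentiable objective and the same fixed step, the shrinking gradient near the optimum would damp these oscillations; here it never does, which is why subgradient methods need decaying step sizes.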
makerdiety wrote 1 day ago: | |
Invoking excessive formality and discussions of minute technical | |
details leads to a cathedral of knowledge built on autistic | |
pedantry. The chosen rabbit hole to get lost in needs to be the | |
correct one. And human science is riddled with the paths that | |
have naive or childish fundamentals. | |
kragen wrote 1 day ago: | |
> a cathedral of knowledge built on autistic pedantry | |
this is certainly true, but more often we use its short name, | |
'math'. it turns out to be far more effective than so-called | |
common sense | |
mistermann wrote 1 day ago: | |
This comment makes me want to both upvote and downvote with | |
extreme enthusiasm/fury! | |
The sign of a truly good conversation? | |
kragen wrote 1 day ago: | |
my experience with world-renowned experts in pure mathematics is | |
that they are much more careful than the average bear to | |
explicitly qualify inaccurate things as inaccurate, because their | |
discipline requires them to be very clear about precisely what | |
they are saying | |
discontinuity of a function at x does not, according to the usual | |
definition of 'continuity', require the function to have a value | |
at x; indeed, functions that fail to have a value at x are | |
necessarily discontinuous there, precisely because (as you say) | |
they are not continuous there. [1] there are other definitions of | |
'discontinuous' in use, but i can't think of one that would give | |
the result you claim | |
[1]: https://en.wikipedia.org/wiki/Continuous_function#Defini... | |
xanderlewis wrote 1 day ago: | |
> they are much more careful than the average bear to | |
explicitly qualify inaccurate things as inaccurate | |
Sure. But what part of this entirely worded in natural
language, and very short statement made you think it was a
technical, formal statement? I think you're just taking an
opportunity to flex your knowledge of basic calculus, and
deliberately attributing intent to the author that isn't
there in order to look clever.
Regarding a function being discontinuous at a point outside its
domain: if you take a completely naive view of what
"discontinuous" means, then I suppose you can say so. But
discontinuity is just the logical negation of continuity.
Observe:
To say that f: X → Y (in this context, a real-valued
function of real numbers) is continuous means precisely
∀x∈X ∀ε>0 ∃δ>0 ∀p∈X: |x - p| < δ ⇒ |f(x) - f(p)| < ε
and so its negation looks like
∃x∈X ∃ε>0 ∀δ>0 ∃p∈X: |x - p| < δ ∧ |f(x) - f(p)| ≥ ε
that is, there is a point in X, the domain of f, where
continuity fails.
For example, you wouldn't talk about a function defined on
the integers being discontinuous at pi, would you? That would
just be weird.
To prove the point further, observe that the set of
discontinuities (according to your definition) of any given
function would actually include every number… in fact every
mathematical object in the universe, which would make it not
even a set in ZFC. So it's absurd.
Even more reasons to believe functions can only be
discontinuous at points of their domain: a function is said to
be discontinuous if it has at least one discontinuity. By your
definition, every function is discontinuous.
…anyway, I said we were going to be petty. I'm trying to
demonstrate this is a waste of time by wasting my own time.
kragen wrote 1 day ago: | |
you have an interesting point of view, and some of the things
you have said are correct, but if you try to use gradient
descent on a function from, say, ℤ → ℝ, you are going
to be a very sad xanda. i would indeed describe such a
function as being discontinuous not just at π but
everywhere, at least with the usual definition of continuity
(though there is a sense in which such a function could be,
for example, scott-continuous)
even in the case of a single discontinuity in the derivative, | |
like in relu', you lose the intermediate value theorem and | |
everything that follows from it; it's not an inconsequential | |
or marginally relevant fact | |
jj3 wrote 1 day ago: | |
Note that any function ℤ → ℝ is continuous on its
domain but nowhere differentiable.
A Scott-continuous function ℤ → ℝ must be
monotone. So not every such function is
Scott-continuous.
kragen wrote 19 hours 17 min ago: | |
aha, thanks! | |
laingc wrote 1 day ago: | |
Because memes aren't allowed on HN, you're not allowed to | |
reply with the "akssshuallllyyy" meme, so you had to go to | |
these lengths. | |
¯\_(ツ)_/¯
xanderlewis wrote 1 day ago: | |
You're actually not far off. I'm somewhat embarrassed
by the above, but I think it makes the point.
gessha wrote 1 day ago: | |
It is soothing to the mind because it conveys that it's
understandable but it doesn't take away from the complexity. You
still have to read through math and pytorch code and debug | |
nonsensical CUDA errors, comb through the data, etc etc | |
whimsicalism wrote 1 day ago: | |
the complexity is in the values learned from the optimization. even | |
the pytorch code for a simple transformer is not that complex, | |
attention is a simple mechanism, etc. | |
gessha wrote 1 day ago: | |
Complexity also comes from the number of papers that work out how | |
different elements of network work and how to intuitively change | |
them. | |
Why do we use conv operators, why do we use attention operators, | |
when do we use one over the other? What augmentations do you use, | |
how big of a dataset do you need, how do you collect the dataset, | |
etc etc etc | |
whimsicalism wrote 1 day ago: | |
idk, just using attention and massive web crawls gets you | |
pretty far. a lot of the rest is more product-style decisions | |
about what personality you want your LM to take. | |
I fundamentally don't think this technology is that complex. | |
gessha wrote 1 day ago: | |
No? In his recent tutorial, Karpathy showed just how much | |
complexity there is in the tokenizer. | |
This technology has been years in the making with many small | |
advances pushing the performance ever so slightly. There's
been theoretical and engineering advances that contributed to | |
where we are today. And we need many more to get the | |
technology to an actually usable level instead of the current | |
word spaghetti that we get. | |
Also, the post is generally about neural networks and not | |
just LMs. | |
When making design decisions about an ML system you | |
shouldn't just choose the attention hammer and hammer away.
There's a lot of design constraints you need to consider
which is why I made the original reply. | |
whimsicalism wrote 1 day ago: | |
Are there micro-optimizations that eke out small | |
advancements? Yes, absolutely - the modern tokenizer is a | |
good example of that. | |
Is the core of the technology that complex? No. You could | |
get very far with a naive tokenizer that just tokenized by | |
words and replaced unknown words with <UNK>. This is extremely
simple to implement and I've trained transformers like | |
this. It (of course) makes a perplexity difference but the | |
core of the technology is not changed and is quite simple. | |
Most of the complexity is in the hardware, not the software | |
innovations. | |
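The naive word-level tokenizer described above really is a few lines: split on whitespace and map out-of-vocabulary words to a single unknown-token id. A sketch with illustrative names, not from any particular library:

```python
# Build a word -> id vocabulary, reserving id 0 for the unknown token.
def build_vocab(corpus, unk="<UNK>"):
    vocab = {unk: 0}
    for word in corpus.split():
        vocab.setdefault(word, len(vocab))
    return vocab

# Encode text, mapping out-of-vocabulary words to the <UNK> id.
def encode(text, vocab, unk="<UNK>"):
    return [vocab.get(w, vocab[unk]) for w in text.split()]

vocab = build_vocab("the cat sat on the mat")
ids = encode("the dog sat", vocab)  # "dog" is OOV -> id 0
```

Modern subword tokenizers (BPE and friends) avoid the OOV problem entirely, at the cost of the extra machinery discussed in this thread.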
> And we need many more to get the technology to an | |
actually usable level instead of the current word spaghetti | |
that we get. | |
I think the current technology is useable. | |
> you shouldnât just choose the attention hammer and | |
hammer away | |
It's a good first choice of hammer, tbph. | |
SkyBelow wrote 1 day ago: | |
Before the recent AI boom, I was mystified by the possibility of AI | |
and emulating humans (in no small part thanks to works of fiction | |
showing AI powered androids). Then I created and trained some neural | |
networks. Smaller ones, doing much of nothing special. That was | |
enough to break the mysticism. To realize it was just multiplying | |
matrices. Training them was a bit more advanced, but still applied | |
mathematics. | |
Only recently have I begun to appreciate that the simplicity of the | |
operation, applied to large enough matrices, may still capture
enough of the nature of intelligence and sentience. In the end we | |
can be broken down into (relatively) simple chemical reactions, and | |
it is the massive scale of these reactions that create real | |
intelligence and sentience. | |
mistermann wrote 1 day ago: | |
Next step, in case you get bored: why (in fact, not a minor | |
distinction) does such a simple approach work so well? | |
naasking wrote 1 day ago: | |
Exactly, the people who are derisive of those who consider ML | |
models to exhibit glimmers of true intelligence because it's only | |
matrix multiplications always amuse me. It's like they don't even | |
realize the contradiction in holding the position that seemingly | |
complex and intelligent outward behaviour should not be used as an | |
indication of actual complexity and intelligence. | |
citizen_friend wrote 1 day ago: | |
If you study this historically you will see that every generation | |
thinks they have the mechanism to explain brains (gear systems, | |
analog control/cybernetics, perceptrons). | |
My conclusion is we tend to overestimate our understanding and | |
the power of our inventions. | |
naasking wrote 1 day ago: | |
The difference is that we now actually have a proof of | |
computational power and computational universality. | |
citizen_friend wrote 1 day ago: | |
Analog circuits have the same computational power. Piecewise | |
linear functions have the same computational universality. | |
naasking wrote 1 day ago: | |
Except we didn't know any of that, nor did we know how to | |
construct physical analogs in order to achieve universal | |
computation. At best, we had limited task-specific | |
computation, like clocks and planetary motion. | |
citizen_friend wrote 16 hours 20 min ago: | |
We knew about universal function approximators like | |
polynomials and trig functions since the 1700s. Turing | |
and Gödel were around 1910 and 1920. The cybernetics | |
movement was big in the 30s and 40s, perceptrons in the | |
50s and 60s. | |
naasking wrote 15 hours 40 min ago: | |
Taylor expansions for all functions do not exist. | |
Furthermore, our characterization of infinity was still | |
poor, so we didn't even have a solid notion of what it | |
would mean for a formalism to be able to compute all | |
computable functions. The notion of a universal | |
computer arguably didn't exist until Babbage. | |
I stand by my position that having a mathematical proof | |
of computational universality is a significant | |
difference that separates today from all prior eras | |
that sought to understand the brain through | |
contemporaneous technology. | |
citizen_friend wrote 15 hours 31 min ago: | |
> Taylor expansions | |
That's not what I'm talking about. This is a | |
basic analysis topic: [1] At least mid 1800s for a | |
proof. The 1700s also explored Fourier series. | |
> stand by my position | |
And you're still ignoring the cybernetics, and | |
perceptrons movement I keep referring to which was | |
more than 100 years ago, and informed by Turing. | |
[1]: https://en.m.wikipedia.org/wiki/Stone%E2%80%... | |
naasking wrote 15 hours 18 min ago: | |
> That's not what I'm talking about. This is a | |
basic analysis topic: | |
It's the same basic flaw: requiring continuous | |
functions. Not all functions are continuous, | |
therefore this is not sufficient. | |
> And you're still ignoring the cybernetics, and | |
perceptrons movement I keep referring to which was | |
more than 100 years ago, and informed by Turing. | |
What about them? As long as they're universal, they | |
can all simulate brains. Anything after Church and | |
Turing is just window dressing. Notice how none of | |
these new ideas claimed to change what could in | |
principle be computed, only how much easier or more | |
natural this paradigm might be for simulating or | |
creating brains. | |
citizen_friend wrote 12 hours 45 min ago: | |
This implies it works piecewise. That's also | |
true of neural nets lol. You have to keep adding | |
more neurons to get the granularity of whatever | |
your discontinuities are. | |
It's also a different reason than Taylor series | |
which uses differentiability. | |
You do not understand this subject. Please read | |
before repeating this: [1] | |
> what about them | |
Then you seem to have lost the subject of the | |
thread. | |
[1]: https://en.m.wikipedia.org/wiki/Universa... | |
captainclam wrote 1 day ago: | |
Ugh, exactly, it's so cool. I've been a deep learning practitioner | |
for ~3 years now, and I feel like this notion has really been | |
impressed upon me only recently. | |
I've spent an awful lot of mental energy trying to conceive of how | |
these things work, when really it comes down to "does increasing this | |
parameter improve the performance on this task? Yes? Move the dial up | |
a bit. No? Down a bit..." x 1e9. | |
And the cool part is that this yields such rich, interesting, | |
sometimes even useful, structures! | |
I like to think of this cognitive primitive as the analogue to the | |
idea that thermodynamics is just the sum of particles bumping into | |
each other. At the end of the day, that really is just it, but the | |
collective behavior is something else entirely. | |
kadushka wrote 1 day ago: | |
it comes down to "does increasing this parameter improve the | |
performance on this task? Yes? Move the dial up a bit. No? Down a | |
bit..." x 1e9 | |
This is not how gradient based NN optimization works. What you | |
described is called "random weight perturbation", a variant of | |
evolutionary algorithms. It does not scale to networks larger than | |
a few thousand parameters for obvious reasons. | |
NNs are optimized by directly computing a gradient which tells us | |
the direction to go to reduce the loss on the current batch of | |
training data. There's no guessing up or down and seeing if it | |
worked - we always know which direction to go. | |
SGD and RWP are two completely different approaches to learning | |
optimal NN weights. | |
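A toy one-parameter example (entirely invented) makes the contrast concrete: SGD steps along the analytic gradient, while RWP proposes random nudges and keeps only the ones that lower the loss.

```python
import random

def loss(w):
    return (w - 3.0) ** 2  # minimum at w = 3

def grad(w):
    return 2 * (w - 3.0)   # analytic gradient of the loss

# Gradient descent: the gradient itself says which way to move.
w = 0.0
for _ in range(100):
    w -= 0.1 * grad(w)

# Random weight perturbation: try a random nudge, keep it only
# if the loss actually improved.
v = 0.0
random.seed(0)
for _ in range(1000):
    cand = v + random.uniform(-0.1, 0.1)
    if loss(cand) < loss(v):
        v = cand
```

Both end up near the minimum here, but RWP's blind trial-and-error is exactly what stops scaling once there are millions of dials instead of one.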
captainclam wrote 1 day ago: | |
I guess you could say I don't know RWP from Adam! :D | |
My og comment wasn't to accurately explain gradient optimization, | |
I was just expressing a sentiment not especially aimed at experts | |
and not especially requiring details. | |
Though I'm afraid I subjected you to the same "cringe" I | |
experience when I read pop sci/tech articles describe deep | |
learning optimization as "the algorithm" being "rewarded" or | |
"punished," haha. | |
kadushka wrote 1 day ago: | |
No worries, we're all friends here! | |
it's just you happened to accidentally describe the idea behind | |
RWP, which is a gradient-free optimization method, so I thought | |
I should point it out. | |
xanderlewis wrote 1 day ago: | |
I don't think the author literally meant tweaking the | |
parameters and seeing what happens; it's probably an analogy | |
meant to give a sense of how the gradient indicates what | |
direction and to what degree the parameters should be tweaked. | |
Basically, substitute "the gradient is positive" for | |
"increasing this parameter decreases performance" and vice | |
versa and it becomes correct. | |
p1esk wrote 1 day ago: | |
That substitution is the main difference between SGD and RWP. | |
It's like describing bubble sort when you meant to describe | |
quick sort. Would not fly on an ML 101 exam, or in an ML job | |
interview. | |
a_random_canuck wrote 1 day ago: | |
I don't think anyone is trying to pass an exam here, but | |
just give an understandable overview to a general audience. | |
xanderlewis wrote 1 day ago: | |
It's not like that at all. You couldn't accidentally | |
sound like you're describing quick sort when describing | |
bubble sort, or vice versa. I can't think of any | |
substitution of a few words that would do that. | |
The meaning of the gradient is perfectly adequately described | |
by the author. They weren't describing an algorithm for | |
computing it. | |
JackFr wrote 1 day ago: | |
NAND gates by themselves are kind of dull, but it's pretty cool | |
what you can do with a billion of them. | |
xanderlewis wrote 1 day ago: | |
> At the end of the day, that really is just it, but the collective | |
behavior is something else entirely. | |
Exactly. It's not to say that neat descriptions like this are the | |
end of the story (or even the beginning of it). If they were, there | |
would be no need for this entire field of study. | |
But they are cool, and can give you a really clear | |
conceptualisation of something that can appear more like a sum of | |
disjoint observations and ad hoc tricks than a discipline based on | |
a few deep principles. | |
andoando wrote 1 day ago: | |
What does "differentiable primitives" mean here? | |
esafak wrote 1 day ago: | |
functions, which you can compose to increase their expressiveness, | |
and run gradient descent on to train. | |
The success of deep learning is basically attributable to | |
composable (expressive), differentiable (learnable) functions. The | |
"deep" moniker alludes to the compositionality. | |
dirkc wrote 1 day ago: | |
When I did "AI" it would have meant the sigmoid function, these | |
days it's something like ReLU. | |
xanderlewis wrote 1 day ago: | |
I think it's referring to "primitive functions" in the sense | |
that they're the building blocks of more complicated functions. | |
If f and g are differentiable, f+g, fg, f/g (as long as g is never | |
zero)... and so on are differentiable too. Importantly, f composed | |
with g is also differentiable, and so since the output of the whole | |
network as a function of its input is a composition of these | |
"primitives" it's differentiable too. | |
The actual primitive functions in this case would be things like | |
the weighted sums of activations in the previous layer to get the | |
activation of a given layer, and the actual "activation | |
functions" (traditionally something like a sigmoid function; | |
these days a ReLU) associated with each layer. | |
"Primitives" is also sometimes used as a synonym for | |
antiderivatives, but I don't think that's what it means here. | |
Edit: it just occurred to me from a comment below that you might | |
have meant to ask what the "differentiable" part means. See [1]. | |
[1]: https://en.wikipedia.org/wiki/Differentiable_function | |
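The point about composition preserving differentiability can be hand-rolled in a few lines: a "primitive" (here, a weighted sum composed with a sigmoid; the weight is made up) stays differentiable, and the chain rule gives the derivative of the whole composition.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

w = 2.0  # an arbitrary, made-up weight

def f(x):
    # primitive 1 (weighted sum) composed with primitive 2 (sigmoid)
    return sigmoid(w * x)

def f_prime(x):
    # chain rule: d/dx sigmoid(w*x) = sigmoid'(w*x) * w
    return sigmoid_prime(w * x) * w

# sanity check against a central finite difference
x = 0.5
numeric = (f(x + 1e-6) - f(x - 1e-6)) / 2e-6
```

Autodiff frameworks do essentially this bookkeeping, just mechanically and for millions of parameters at once.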
andoando wrote 1 day ago: | |
Is this function composition essentially lambda calculus then? | |
xanderlewis wrote 1 day ago: | |
Composition here just means what it does for any two functions: | |
the value of the "composition" of f and g at x is defined | |
to be f applied to g applied to x. In symbols, it's: (f∘g)(x) := | |
f(g(x)) for each x in the domain of g. It may seem obvious, | |
but the fact that this new thing is also a function (that is, | |
its value is well-defined for every input) is actually a very | |
useful thing indeed and leads to... well, most of mathematics. | |
You can certainly do function composition in lambda calculus: | |
in fact, the act of composition itself is a higher order | |
function (takes functions and returns a function) and you can | |
certainly express it formally with lambda terms and such. | |
It's not really got anything to do with any particular | |
language or model of computation though. | |
andoando wrote 1 day ago: | |
I didn't form my question too well. I understand all that. | |
What I am asking is, are these function compositions | |
equivalent or similar to functions in lambda | |
calculus? | |
I guess my question, is what are the primitive functions here | |
doing? | |
xanderlewis wrote 1 day ago: | |
Well, yes, to the extent that functions are functions are | |
functions (they're just associations or mappings or | |
whatever you want to call them). | |
Maybe your question boils down to asking something more | |
general like: what's the difference between functions to | |
a computer scientist (or a programmer) and functions to a | |
mathematician? That is, are "functions" in C (or lambda | |
calculus), say, the same "functions" we talk about in | |
calculus? | |
The answer to that is: in this case, because these are | |
quite simple functions (sums and products and compositions | |
thereof) they're the same. In general, they're a bit | |
different. The difference is basically the difference | |
between functional programming and "traditional" | |
programming. If you have state/"side effects" of | |
functions, then your function won't be a function in the | |
sense of mathematics; if the return value of your function | |
depends entirely on the input and doesn't return | |
different values depending on whatever else is happening in | |
the program, then it will be. | |
Since you're asking about lambda calculus in particular, | |
the answer is that they're the same because lambda | |
calculus doesn't have state. It's "purely | |
functional" in that sense. | |
> I guess my question, is what are the primitive functions | |
here doing? | |
I'm not really sure what you mean. They're doing what | |
functions always do. Every computer program is abstractly a | |
(partial) function. | |
Does that help, or have I misunderstood? | |
andoando wrote 1 day ago: | |
So when I think of functions in lambda calculus, I think | |
of the I,S,K functions which when composed can produce | |
functions like "copy", "add", "remove", "if", etc which | |
then can do different computations like "copy every other | |
symbol if the symbol is true", "multiply 5 times then add | |
2". Since lambda calculus is complete, any | |
computation/program can be composed. | |
When I think of functions in a traditional mathematical | |
sense, I think about transformations of numbers. x->2x, | |
x->2x^2, etc. I completely understand composition of | |
functions here, ex x->2(x->2x)^2, but its unclear how | |
these transformations relate to computation. For a | |
regression problem, I can totally understand how finding | |
the right compositions of functions can lead to better | |
approximations. So I am wondering, in an LLM | |
architecture, what computations do these functions | |
actually represent? I assume, it has something to do with | |
what path to take through the neural layers. | |
I probably just need to take the time to study it deeper. | |
>If you have state/"side effects" of functions, then | |
your function won't be a function in the sense of | |
mathematics; if the return value of your function depends | |
entirely on the input and doesn't return different | |
values depending on whatever else is happening in the | |
program, then it will be. | |
Totally understood from the perspective of functions in | |
say, Java. Though fundamentally I don't think there is a | |
distinction between functions in computer science and | |
mathematics. The program as a whole is effectively a | |
function. The "global" state is from another reference, | |
just local variables of the encompassing function. If a | |
function is modifying variables outside of the "function | |
block" (in say Java), the "input" to the function isn't | |
just the parameters of the function. Imo, this is more of | |
an artifact of implementation of some languages rather | |
than a fundamental difference. Python for example | |
requires declaring global args in the function block. Go | |
one step further and require putting global args into the | |
parameters list and you're pretty close to satisfying | |
this. | |
xanderlewis wrote 1 day ago: | |
I think youâre actually massively overthinking it. | |
The state of a neural network is described entirely by | |
its parameters, which usually consist of a long array | |
(well, a matrix, or a tensor, or whateverâ¦) of | |
floating point numbers. What is being optimised when a | |
network is trained is these parameters and nothing | |
else. When you evaluate a neural network on some input | |
(often called performing âinferenceâ), that is when | |
the functions weâre talking about are used. You start | |
with the input vector, and you apply all of those | |
functions in order and you get the output vector of the | |
network. The training process also uses these | |
functions, because to train a network you have to | |
perform evaluation repeatedly in between tweaking those | |
parameters to make it better approximate the desired | |
output for each input. Importantly, the functions do | |
not change. They are constant; itâs the parameters | |
that change. The functions are the architecture â not | |
the thing being learned. Essentially what the | |
parameters represent is how likely each neuron is to be | |
activated (have a high value) if others in the previous | |
layer are. So you can think of the parameters as | |
encoding strengths of connections between each pair of | |
neurons in consecutive layers. Thinking about âwhat | |
path to take through the neural layersâ is way too | |
sophisticated â itâs not doing anything like that. | |
> Though fundamentally I don't think there is | |
distinction between functions in computer science and | |
mathematics. The program as a whole is effectively a | |
function. | |
Youâre pretty much right about that, but there are | |
two important problems/nitpicks: | |
(1) We canât prove (in general) that a given program | |
will halt and evaluate to something (rather than just | |
looping forever) on a given input, so the âentire | |
programâ is instead whatâs called a partial | |
function. This means that itâs still a function on | |
its domain â but we canât know what its precise | |
domain is. Given an input, it may or may not produce an | |
output. If it does, though, itâs well defined because | |
itâs a deterministic process. | |
(2) Youâre right to qualify that itâs the whole | |
program that is (possibly) a function. If you take a | |
function from some program that depends on some state | |
in that same program, then clearly that function | |
wonât be a proper âmathematicalâ function. Sure, | |
if you incorporate that extra state as one of your | |
inputs, it might be, but thatâs a different function. | |
You have to remember that in mathematics, unlike in | |
programming, a function consists essentially of three | |
pieces of data: a domain, a codomain, and a âruleâ. | |
If you want to be set-theoretic and formal about it, | |
this rule is just a subset of the cartesian product of | |
its domain and codomain (itâs a set of pairs of the | |
form (x, f(x))). If you change either of these sets, | |
itâs technically a different function and there are | |
good reasons for distinguishing between these. So | |
itâs not right to say that mathematical functions and | |
functions in a computer program are exactly the same. | |
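The "fixed architecture, changing parameters" point can be made very small. In this toy sketch (all numbers invented) the function never changes; only the parameter values fed into it do:

```python
def forward(params, x):
    # fixed architecture: a linear layer followed by a ReLU
    w, b = params
    return max(0.0, w * x + b)

before = (0.5, 0.0)   # parameters before training
after = (2.0, -1.0)   # parameters after training (invented values)

# Same function object, different parameters, different behaviour:
y0 = forward(before, 1.0)
y1 = forward(after, 1.0)
```

Training searches over tuples like `before`/`after`; `forward` itself is the constant scaffolding.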
andoando wrote 1 day ago: | |
I appreciate your responses, sorry I hope I don't | |
seem like I'm arguing for the sake of arguing. | |
>Essentially what the parameters represent is how | |
likely each neuron is to be activated (have a high | |
value) if others in the previous layer are. So you | |
can think of the parameters as encoding strengths of | |
connections between each pair of neurons in | |
consecutive layers. Thinking about "what path to | |
take through the neural layers" is way too | |
sophisticated - it's not doing anything like | |
that. | |
I'm a little confused. The discussion thus far has | |
been about how neural networks are essentially just | |
compositions of functions, but you are now saying that | |
the function is static, and only the parameters change. | |
But that aside, if these parameters change which | |
neurons are activated, and this activation affects | |
which neurons are activated in the next layer, are | |
these parameters effectively not changing the path | |
taken through the layers? | |
>Sure, if you incorporate that extra state as one of | |
your inputs, it might be, but that's a different | |
function. | |
So say we have this program | |
" | |
let c = 2; | |
function sum3(a, b) { | |
return a + b + c; | |
} | |
let d = sum3(3, 4)" | |
I believe you are saying, if we had constructed this | |
instead as | |
"function sum3(a, b, c) { | |
return a + b + c | |
} | |
let d = sum3(3, 4, 2) | |
" | |
then, this is a different function. | |
Certainly, these are different in a sense, but at a | |
fundamental level, when you compile this all down and | |
run it, there is an equivalency in the transformation | |
that is happening. That is, the two functions | |
equivalently take some input state A (composed of | |
a,b,c) and return the same output state B, while | |
applying the same intermediary steps (add a to b, add | |
c to the result of (add a to b)). Really, in the first | |
case where c is defined outside the scope of the function | |
block, the interpreter is effectively producing the | |
function sum3(x, y, c), as it has to at some point, one | |
way or another, inject c into a+b+c. | |
Similarly, I won't argue that the current, formal | |
definitions of functions in mathematics are exactly | |
that of functions as they're generally defined in | |
programming. | |
Rather, what I am saying is that there is an equivalent | |
way to think and study functions that equally apply | |
to both fields. That is, a function is simply a | |
transformation from A to B, where A and B can be | |
anything, whether that is bits, numbers, or any other | |
construction in any system. The only primitive | |
distinction to make here is whether A and B are the | |
same thing or different. | |
OJFord wrote 1 day ago: | |
Function composition is just f(g(x)), considered as a single | |
function that's the composition of f and g; it has the domain | |
of g and the range of f. | |
In lambda calculus terminology it's an 'application' (with a | |
function argument). | |
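Composition as a higher-order function is a one-liner in any language with first-class functions; the example functions here are made up:

```python
def compose(f, g):
    # returns the function x -> f(g(x))
    return lambda x: f(g(x))

double = lambda x: 2 * x
inc = lambda x: x + 1

h = compose(double, inc)  # h(x) = 2 * (x + 1)
```

Note the order: `compose(double, inc)` applies `inc` first, so its domain is `inc`'s and its range is `double`'s.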
CobrastanJorji wrote 1 day ago: | |
Continuous mathematical functions which have derivatives. | |
glonq wrote 1 day ago: | |
I wonder if the usage of Alice & Wonderland takes inspiration from | |
Douglas Hofstadter's "Gödel, Escher, Bach: an Eternal Golden Braid" ? | |
abdullahkhalids wrote 1 day ago: | |
Lewis Carroll's Alice in Wonderland features a number of logical and | |
mathematical puzzles. [1] He also wrote What the Tortoise Said to | |
Achilles (1895), in which the paradoxes of Zeno are discussed. | |
So it's more correct to say that GEB and this article are originally | |
inspired by Lewis Carroll's work. [1] I wrote a short article for my | |
university magazine a long time ago. Some interesting references at | |
the end. | |
[1]: https://abd.tiddlyspot.com/#%5B%5BMathematical%20Adventures%... | |
devnonymous wrote 1 day ago: | |
No, I think the inspiration is more direct | |
[1]: https://duckduckgo.com/?q=lewis+carroll+alice+in+wonderland+... | |
iainmerrick wrote 1 day ago: | |
It's a pretty common trope, especially for math-related books, e.g. | |
Alex Bellos' "Alex's Adventures in Numberland". | |
<- back to front page |