Hacker News on Gopher (unofficial)
COMMENT PAGE FOR:
Don't “let it crash”, let it heal
Jtsummers wrote 21 hours 57 min ago:
I think a lot of folks who have never looked at Erlang or Elixir and
BEAM before misunderstand this concept because they don't understand
how fine-grained processes are, or can be, in Erlang. A very important
note: Processes in BEAM languages are cheap, both to create and for
context switching, compared to OS threads. While design-wise they offer
similar capabilities, this cost difference results in a substantially
different approach to design in Erlang than in systems where the cost
of introducing and switching between threads is more expensive.
In a more conventional language where concurrency is relatively
expensive, and assuming you're not an idiot who writes 1-10k SLOC
functions, you end up with functions that have a "single
responsibility" (maybe not actually a single responsibility, but closer
to it than having 100 duties in one function) near the bottom of your
call tree, but they all exist in one thread of execution. In a
hypothetical system created in this model, if your lowest-level
function is something like:
retrieve_data(db_connection, query_parameters) -> data
And the database connection fails, would you attempt to restart the
database connection in this function? Maybe, but that'd be bad design.
You'd most likely raise an exception or change the signature so you
could express an error return, in Rust and similar it would become
something like:
retrieve_data(db_connection, query_parameters) -> Result
Somewhere higher in the call stack you have a handler which will catch
the exception or process the error and determine what to do. That is,
the function `retrieve_data` crashes, it fails to achieve its objective
and does not attempt any corrective action (beyond maybe a few retries
in case the error is transient).
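In Elixir the same shape is written with tagged tuples rather than a Result type. A minimal sketch (the function and error reasons here are hypothetical):

```elixir
defmodule Repo do
  # Hypothetical low-level fetch: it reports failure as a tagged tuple
  # and makes no attempt at corrective action itself. The caller,
  # higher in the call tree, decides the policy.
  def retrieve_data(db_connection, query_parameters) do
    case db_connection do
      :connected -> {:ok, {:rows_for, query_parameters}}
      _ -> {:error, :connection_failed}
    end
  end
end
```

A caller that writes `{:ok, data} = Repo.retrieve_data(conn, params)` crashes on any error, which is exactly the failure that gets handed to the supervisor.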
In Erlang, you have a supervision tree which corresponds to this call
tree concept but for processes. The process handling data retrieval,
having been given some db_conn handler and the parameters, will fail
for some reason. Instead of handling the error in this process, the
process crashes. The failure condition is passed to the supervisor
which may or may not have a handler for this situation.
You might put the simple retry policy in the supervisor (that basic
assumption of transient errors, maybe a second or third attempt will
succeed). It might have other retry policies, like trying the request
again but with a different db_connection (that other one must be bad
for some reason, perhaps the db instance it references is down). If it
continues to fail, then this supervisor will either handle the error
some other way (signaling to another process that the db is down, fix
it or tell the supervisor what to do) or perhaps crash itself. This
repeats all the way up the supervision tree, ultimately it could mean
bringing down the whole system if the error propagates to a high enough
level.
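As a rough sketch of that tree in Elixir (module names and limits invented for illustration), a supervisor declares its retry policy and crashes itself once the policy is exhausted:

```elixir
defmodule FetchWorker do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  def init(opts), do: {:ok, opts}

  # If the db call inside handle_call raises, this process crashes and
  # the supervisor below applies its restart policy.
  def handle_call({:fetch, params}, _from, state) do
    {:reply, {:ok, params}, state}
  end
end

defmodule FetchSupervisor do
  use Supervisor

  def start_link(opts), do: Supervisor.start_link(__MODULE__, opts, name: __MODULE__)

  def init(_opts) do
    children = [FetchWorker]
    # :one_for_one restarts only the crashed child; failing more than
    # max_restarts times in max_seconds crashes this supervisor too,
    # propagating the error up the tree.
    Supervisor.init(children, strategy: :one_for_one, max_restarts: 3, max_seconds: 5)
  end
end
```

Killing the worker and watching a fresh pid appear is the whole "let it crash" loop in miniature.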
This is conceptually no different than how errors and exceptions are
handled in sequential, non-concurrent systems. You have handlers that
provide mechanisms for retrying or dealing with the errors, and if you
don't the error is propagated up (hopefully you don't continue running
in a known-bad state) until it is handled or the program crashes
entirely.
In languages that offer more expensive concurrency (traditional OS
threads), the cost of concurrency (in memory and time) means you end up
with a policy that sits somewhere between Erlang's and a straight-line
sequential program. Your threads will be larger than Erlang processes
so they'll include more error handling within themselves, but
ultimately they can still fail and you'll have a supervisor of some
sort that determines what happens next (hopefully).
As more languages move to cheap concurrency (Go's goroutines, Java's
virtual threads), system designs have a chance to shift closer to
Erlang than that straight-line sequential approach if people are
willing to take advantage of it.
juped wrote 1 day ago:
There's really not more that's useful to say than the relevant section
(4.4) of Joe Armstrong's thesis says:
>How does our philosophy of handling errors fit in with coding
practices? What kind of code must the programmer write when they find
an error? The philosophy is let some other process fix the error, but
what does this mean for their code? The answer is let it crash. By this
I mean that in the event of an error, then the program should just
crash. But what is an error? For programming purpose we can say that:
>• exceptions occur when the run-time system does not know what to
do.
>• errors occur when the programmer doesn’t know what to do.
>If an exception is generated by the run-time system, but the
programmer had foreseen this and knows what to do to correct the
condition that caused the exception, then this is not an error. For
example, opening a file which does not exist might cause an exception,
but the programmer might decide that this is not an error. They
therefore write code which traps this exception and takes the necessary
corrective action.
>Errors occur when the programmer does not know what to do. Programmers
are supposed to follow specifications, but often the specification does
not say what to do and therefore the programmer does not know what to
do.
>[...]
>The defensive code detracts from the pure case and confuses the
reader—the diagnostic is often no better than the diagnostic which
the compiler supplies automatically.
Note that this "program" is a process. For a process doing work,
encountering something it can't handle is an error per the above
definitions, and the process should just die, since there's nothing
better for it to do; for a supervisor process supervising such
processes-doing-work, "my child process exited" is an exception at
worst, and usually not even an exception since the standard library
supervisor code already handles that.
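Beneath the standard supervisor behaviours, "my child process exited" is literally just a message. A minimal sketch with raw links:

```elixir
# Trapping exits turns a crashing linked child into an ordinary
# message in this process's mailbox instead of a cascading exit.
Process.flag(:trap_exit, true)
child = spawn_link(fn -> exit(:boom) end)

reason =
  receive do
    {:EXIT, ^child, why} -> why
  after
    1_000 -> :timeout
  end
```

The OTP supervisor behaviour packages exactly this mechanism together with restart policies.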
stcg wrote 1 day ago:
"Let it crash" is a sentence that gets attention. It makes a person
want to know more about it, as it sounds controversial and different.
"Let it heal" doesn't have that.
jonhohle wrote 1 day ago:
It also has a deeper philosophical meaning: unexpected software
bugs should be noisy and obvious instead of causing silent
corruption or a misleading user experience. If monitoring doesn’t
catch the failure, customers will and it can be fixed right away
(whether it’s the software, a hardware error, dependency issue,
etc.).
A web service returning a 500 error code is a lot more obvious than a
200 with an invalid payload. A crashed app with a stack trace is
easier to debug and will cause more user feedback than an app that
hangs in a retry loop.
When I had to deal with these things in the Java world, it meant not
blindly handling or swallowing exceptions that business code had no
business caring about. Does your account management code really think
it knows how to properly handle an InterruptedException? Unless your
answer is rollback and reset the interrupted flag it’s probably
wrong. Can’t write a test for a particular failure scenario? That
better blow up loudly with enough context that makes it possible to
understand the error condition (and then write a test for it).
tmcb wrote 1 day ago:
It is very common to interpret taglines at face value, and I
believe the author did just that, although the point brought up is
valid.
In order to “let it crash”, we must design the system in a way that
crashes would not be catastrophic, stability-wise. Letting it crash is
not a commandment, though: it is a reminder that, in most cases, a
smart healing strategy might be overkill.
borromakot wrote 1 day ago:
Author: I'm literally explaining not to interpret the tag line at
face value.
Muromec wrote 1 day ago:
Yeah, but it's an internet forum, and for opinion pieces people first
read the comments and then maybe read the article if it's interesting.
tmcb wrote 1 day ago:
I actually skimmed the article before posting. I have some
exposure to Erlang, but not to Elixir. As I’ve already
mentioned, I think the author’s coverage of application
behavior is OK, but there is more to the tagline than meets the
eye.
tmcb wrote 1 day ago:
Maybe I didn’t make myself clear. “Let it crash” is not
something that should be thought of at the component level, it
should be thought of at the system level. The fact that the
application crashes “gracefully” or not is not what is really
important. You should design the system in a crash-friendly way,
and not to write the application and think: “oh, I believe it is
OK to let it crash here”.
borromakot wrote 1 day ago:
Then I don't think you understand how the phrase is used in
Elixir/Erlang. The phrase is about letting processes crash.
tmcb wrote 1 day ago:
No need for the snarky comment. If I am wrong, that is fine.
Of course Joe Armstrong could explain what I meant, but in a
much better way: [1] (edit: see the "Why was error handling
designed like this?" part for reference)
My personal interpretation is that systems must be able to
handle crashing processes gracefully. There is no benefit in
letting processes crash just for the sake of it.
[1]: https://erlang.org/pipermail/erlang-questions/2003-Mar...
tmcb wrote 1 day ago:
Actually, now that I've thought about it, I know exactly what irked
me about the approach. I hope the author takes it as
constructive feedback:
Saying "let it crash is a tagline that actually means
something else because the BEAM is supposed to be used in
this particular way" sounds slightly "cargo-cultish", to the
point where we have to challenge the meaning of the actual
word to make sense of it.
Joe Armstrong's e-mail, on the other hand, says (and I
paraphrase): "the BEAM was designed from the ground up to
help developers avoid the creation of ad-hoc protocols for
process communication, and the OTP takes that into
consideration already. Make sure your system, not your
process, is resilient, and literally let processes crash."
Boom. There is no gotcha there. Also, there is the added
benefit that developers for other platforms now understand
that the rationale is justified by the way BEAM/OTP were
designed and may not be applicable to their own platforms.
borromakot wrote 23 hours 56 min ago:
If I sounded snarky that wasn't my intention. At the end of
the day though it doesn't feel like you read the article
which was clearly in a different context than the one in
which you responded. FWIW I didn't expect this small
article speaking to a small audience (Elixir devs) to make
the rounds on hacker news.
I agree on the importance of defining terms, and I think
the important thing here is that "process" in Joe's
parlance is not an OS level process, it is one of a fleet
of processes running inside the BEAM VM. And the "system"
in this case is the supervisory system around it, which
itself consists of individual processes.
I'm critiquing a common misunderstanding of the phrase "Let
it crash", whereby effectively no local error handling is
performed. This leads to worse user experiences and worse
outcomes in general. I understand that you're offering
critique, but it again sounds like you're critiquing a
reductive element (the headline itself).
tmcb wrote 23 hours 29 min ago:
I did read the article. I concede that I might not have
understood it. Again, I never said it is wrong, but
rather that it has a blind spot. I am familiar with Joe
Armstrong’s work because I worked on a proprietary (and
rather worse tbf) native distributed systems middleware
in the past.
IshKebab wrote 1 day ago:
Ah this makes sense. I always thought "let it crash" made it sound like
Elixir devs just don't bother with error checking, like writing Java
without any `catch`es, or writing Rust that only uses `.unwrap()`.
If they just mean "processes should be restartable" then that sounds
way more reasonable. Similar idea to this but less fancy: [1] It's a
pretty terrible slogan if it makes your language sound worse than it
actually is.
[1]: https://flawless.dev/
JonChesterfield wrote 14 hours 29 min ago:
Flawless is interesting.
It can't work in the general case because replaying a sequence of
syscalls is not sufficient to put the machine back in the same state
as it was last time. E.g. the second time around, open() behaves
differently, so you need to follow the error-handling path.
However sometimes that approach would work. I wonder how wide the
area of effective application is. It might be wide enough to be very
useful. The all or nothing database transaction model fits it well.
bccdee wrote 23 hours 24 min ago:
I've been seeing a lot of these durable workflow engines around
lately, for some reason. I'm not sure I understand the pitch. It just
seems like a thin wrapper around some very normal patterns for
running background jobs. Persist your jobs in a db, checkpoint as
necessary, periodically retry. I guess they're meant to be a low-code
alternative to writing the db tables yourself, but it seems like
you're not saving much code in practice.
vendiddy wrote 1 day ago:
I think the slogan was meant to be provocative but unfortunately it
has been misinterpreted more often than not.
For example, imagine you're working with a 3rd party API and,
according to the documentation, it is supposed to return responses in
a certain format. What if suddenly that API stops working? Or what if
the format changes?
You could write code to handle that "what if" scenario, but in
trying to handle every hypothetical, your code becomes bloated, more
complicated, and harder to understand.
So in these cases, you accept that the system will crash. But to
ensure reliability, you don't want to bring down the whole system. So
there are primitives that let you control the blast radius of the
crash if something unexpected happens.
Let it crash does not mean you skip validating user input. Those are
issues that you expect to happen. You handle those just as you would
in any programming language.
zmgsabst wrote 1 day ago:
I think it’s more subtle:
Imagine that you’re trying to access an API, which for some reason
fails.
“Let it crash” isn’t an argument against handling the timeout,
but rather that you should only retry a few, bounded times rather
than (eg) exponentially back off indefinitely.
When you design from that perspective, you just fail your request
processing (returning the request to the queue) and make that your
manager’s problem. Your managing process can then restart you,
reassign the work to healthy workers, etc. If your manager can’t
get things working and the queue overflows, it throws it into dead
letters and crashes. That might restart the server, it might page
oncall, etc.
The core idea is that within your business logic is the wrong place
to handle system health — and that many problems can be solved by
routing around problems (ie, give task to a healthy worker) or
restarting a process. A process should crash when it isn’t scoped
to handle the problem it’s facing (eg, server OOM, critical
dependency offline, bad permissions). Crashing escalates the problem
until somebody can resolve it.
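That bounded-retry shape can be sketched as follows (the function and limit are illustrative); once the attempts are spent it raises, and the failure becomes the supervising process's problem:

```elixir
defmodule BoundedRetry do
  # Retry a bounded number of times, then crash and let the
  # supervisor decide what happens next.
  def call(fun, attempts \\ 3)

  def call(_fun, 0), do: raise("giving up; escalating to supervisor")

  def call(fun, attempts) do
    case fun.() do
      {:ok, result} -> {:ok, result}
      {:error, _reason} -> call(fun, attempts - 1)
    end
  end
end
```

Note the deliberate absence of indefinite backoff: the whole point is to fail fast enough for the manager to route around the problem.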
johnisgood wrote 1 day ago:
As someone has linked it: [1] It is about self-healing, too.
[1]: https://erlang.org/pipermail/erlang-questions/2003-March/007...
PicassoCTs wrote 1 day ago:
[1] Railway orientated programming to the rescue?
[1]: https://fsharpforfunandprofit.com/rop/
adregan wrote 51 min ago:
There are a couple of patterns for accomplishing this in Elixir.
One is to build multiple function heads that pattern match on the
arguments. If it’s an error tuple, pass it along. Build up your
pipeline and handle any errors at the end.
Another is to use the `with else`[0] expression for building up a
railroad. This has the benefit of not having to teach your functions
how to pass along errors. Error handling in the else block can be a
little gnarly.
I find it a little more manual than languages that have a `runEffect`
or compose operator. In large part that’s due to the :ok, :error
tuples being more of a convention than a primitive like
Either/Result.
0:
[1]: https://elixirschool.com/en/lessons/basics/control_structure...
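Both patterns can be sketched like this (function names invented for illustration):

```elixir
defmodule Railway do
  # Pattern 1: each step passes an error tuple straight through.
  def parse({:error, _} = err), do: err
  def parse({:ok, raw}) when is_binary(raw), do: {:ok, String.trim(raw)}

  def validate({:error, _} = err), do: err
  def validate({:ok, ""}), do: {:error, :empty}
  def validate({:ok, value}), do: {:ok, value}

  def run_piped(input), do: input |> parse() |> validate()

  # Pattern 2: `with` short-circuits on the first non-matching step,
  # and the `else` block handles whatever fell through.
  def run_with(raw) do
    with {:ok, trimmed} <- parse({:ok, raw}),
         {:ok, value} <- validate({:ok, trimmed}) do
      {:ok, value}
    else
      {:error, reason} -> {:error, reason}
    end
  end
end
```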
hesus_ruiz wrote 1 day ago:
It is very strange that a post trying to explain the concept of "let it
crash" in Elixir (which runs on the BEAM VM) does not mention the
doctoral thesis of Joe Armstrong: "Making reliable distributed systems
in the presence of software errors".
It should be compulsory reading for anybody interested in reliable
systems, even if they do not use the BEAM VM.
[1]: https://www.diva-portal.org/smash/record.jsf?pid=diva2%3A10420...
plainOldText wrote 21 hours 11 min ago:
Some core ideas from the paper for the impatient (failures,
isolation, healing):
- Failures are inevitable, so systems must be designed to EXPECT and
recover from them, NOT AVOID them completely.
- Let it crash philosophy allows components to FAIL and RECOVER
quickly using supervision trees.
- Processes should be ISOLATED and communicate via MESSAGE PASSING,
which prevents cascading failures.
- Supervision trees monitor other processes and RESTART them when
they fail, creating a self-healing architecture.
anthk wrote 1 day ago:
Unix/BSD -> Crash, fix, restart.
GNU/MIT/Lisp -> Detect, offer a fix, continue.
atoav wrote 1 day ago:
The truth is that different errors have to lead to different results if
you want a good organisational outcome. These could be:
- Fundamental/Fatal error: something without the process cannot
function, e.g. we are missing an essential config option. Exiting with
an error is totally adequate. You can't just heal from that as it would
involve guessing information you don't have. Admins need to fix it
- Critical error: something that should not ever occur, e.g. having an
active user without a password and email. You don't exit; you skip it if
that is possible and ensure the first occurrence is logged and admins
are contacted.
- Expected/Regular error: something that is expected to happen during
the normal operations of the service, e.g. the other server you make
requests to is being restarted and thus unreachable. Here the strategy
may vary, but it could be something like retrying with random
exponential backoff. Or you could briefly accept the values provided by
that server are unknown and periodically retry to fill the unknown
values. Or you could escalate that into a critical error after a
certain amount of retries.
- Warnings: These are usually about something being not exactly ideal,
but do not impede the flow of the program at all. Usually has to
do with bad data quality
If you can proceed without degrading the integrity of the system you
should; the next thing is to decide how important it is for humans to
hear about it.
praptak wrote 1 day ago:
A condition that "should not happen" might still be a problem specific
to a particular request. If you "just crash" it turns this request from
one that only triggers a http 500 response to one that crashes the
process. This increases the risk of Query of Death scenarios where the
frontend that needs to serve this particular request starts retrying it
with different backends and triggers restarts faster than the processes
come back up.
So being too eager to "just crash" may turn a scenario where you fail
to serve 1% of requests into a scenario where you serve none because
all your processes keep restarting.
sarchertech wrote 1 day ago:
> If you "just crash" it turns this request from one that only
triggers a http 500 response to one that crashes the process.
In phoenix each request has its own process and crashing that process
will result in a 500 being sent to the client.
davidclark wrote 1 day ago:
You should try to do some load testing of a real Erlang system and
compare how it handles this scenario against other
languages/frameworks. What you are describing is one of the exact
things the Erlang system is strong against due to the scheduler.
rtpg wrote 1 day ago:
My impression is that in Erlang land each process handler is really
cheap so you can just keep on showing up with process handlers and
not reach exhaustion like you do with other systems (at least in
pre-async worlds...)
josevalim wrote 1 day ago:
Processes can be marked as temporary, which means they are not
restarted, and that’s what is used when managing http connections,
as you can’t really restart a request on the server without the
client. So the scenario above wouldn’t happen.
You still want those processes to crash though, as it allows it to
automatically clean up any concurrent work. For example, if during a
request you start three processes to do concurrent work, like
fetching APIs, then the request process crashes, the concurrent
processes are automatically cleaned up.
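In a child spec this policy is the `restart: :temporary` option; a sketch of a per-request worker (the module here is hypothetical):

```elixir
defmodule RequestWorker do
  # :temporary children are never restarted by their supervisor;
  # a crashed request simply ends, since the server cannot retry it
  # without the client anyway.
  use GenServer, restart: :temporary

  def start_link(arg), do: GenServer.start_link(__MODULE__, arg)
  def init(arg), do: {:ok, arg}
end
```

Helper processes started from the request (for instance via `Task.async/1`) are linked to it, so the automatic cleanup on crash described above is ordinary exit propagation, not extra machinery.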
cyberax wrote 1 day ago:
"Let it crash" in Erlang/Elixir means that the process that serves
the request is allowed to crash. It then will be restarted by the
supervisor.
Supervisors themselves form a tree, so for a crash to take down the
whole app, it needs to propagate all the way to the top.
Another explanation for people familiar with exceptions in other
languages: "Don't try to catch the exception inside a request
handler".
zwnow wrote 1 day ago:
This is funny given Elixir/Erlangs whole idea is "let it crash". In
Go I just have a Recovery Middleware for any type of problem. Don't
know how other langs do it tho
davidclark wrote 1 day ago:
I don’t know Go, but that sounds like someone has simply written
part of Erlang in Go.
knome wrote 1 day ago:
erlang doesn't crash the program, it crashes the thread. erlang has
a layered management system built in as part of OTP (open telecom
platform, erlang was built for running highly concurrent telephony
hardware). when a thread crashes, it dies and signals its parent.
the parent then decides what to do. usually, that's just restarting
the worker. maybe if ten workers have crashed in a minute, the
manager itself will die and restart. issues bubble up, and managers
restart subsystems automatically. for some things, like parsing
user data, you might never cause the manager to die, and just
always restart the worker.
the article, if you should choose to read it, is explaining that
people have the misconception you appear to be having due to the
'let it fail' catchphrase. it goes into detail about this system,
when failing is appropriate, and when trying to work around errors
is appropriate.
as erlang uses greenthreads, restarting a thread for a user API is
effectively instant and free.
zwnow wrote 1 day ago:
It's not a misconception given that Elixir Forum and its Discords
members will say that to you. Also I never assumed the whole
program crashed so why would you explain this to me?
Why would one Blog guy know it better than a lot of other Elixir
devs?
sarchertech wrote 1 day ago:
It’s well known among elixir devs that for reasons unknown,
Elixir Forum is populated predominantly by people who don’t
know what they’re talking about.
borromakot wrote 1 day ago:
Blog guy here: I do, in fact, know it better than a lot of
other Elixir devs.
snickerbockers wrote 1 day ago:
>When people say “let it crash”, they are referring to the fact
that practically any exited process in your application will be
subsequently restarted. Because of this, you can often be much less
defensive around unexpected errors. You will see far fewer try/rescue,
or matching on error states in Elixir code.
I just threw up in my mouth when I read this. I've never used this
language so maybe my experience doesn't apply here but I'm imagining
all the different security implications that ive seen arise from
failing to check error codes.
vendiddy wrote 1 day ago:
If you get a chance to read some Elixir/Erlang code you'll see that
pattern matching is used frequently to assert expected error codes.
It does not mean ignore errors.
This is a common misunderstanding because unfortunately the slogan is
frequently misinterpreted.
toast0 wrote 1 day ago:
Ok, so it's not really that you're not checking error codes. It's
that you can write stuff like
ok = whatever().
If whatever is successful and idomatic, it returns ok, or maybe a
tuple of {ok, SomeReturn}. In that case, execution would continue. If
it returns an error tuple like {error, Reason}... "Let it crash" says
you can just let it crash... You didn't have anything better to do,
the built in crash because {error, Reason} will do fine.
Or you could do a
case whatever() of
    ok -> ok;
    {error, nxdomain} -> ok
end.
If it was fine to get nxdomain error, but any other error isn't
acceptable... It will just crash, and that's good or at least ok.
Better than having to enumerate all the possible errors, or having a
catchall that then explicitly throws an error. It's especially hard
to enumerate all possible errors because the running system can
change and may return a new error that wasn't enumerated when the
requesting code was written.
There's lots of places where crashing isn't actually what you want,
and you have to capture all errors, explicitly log it, and then move
on... But when you can, checking for success or success and a handful
of expected and recoverable errors is very nice.
josevalim wrote 1 day ago:
That’s actually a good example. Imagine someone forgot to check the
error code from an API response. In some languages, they may attempt
to parse it as if it were a successful request, and succeed, leading to
a result with nulls, empty arrays, or missing data that then spreads
through the system. In Elixir, parsing would most likely fail thanks
to pattern matching [1] and if by any chance it fails in a core
part of the system, that failure will be isolated and that particular
component can be restarted.
Elixir is not about willingly ignoring error codes or failure
scenarios. It is about naturally limiting the blast radius of errors
without a need to program defensively (as in writing code for
scenarios you don’t know “just in case”).
1:
[1]: https://dashbit.co/blog/writing-assertive-code-with-elixir
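A sketch of that assertive style (the payload shape is invented): the clause matches only a well-formed success response, so an error payload raises at the call site instead of flowing onward as nils:

```elixir
defmodule ApiClient do
  # Matches only the documented success shape. Anything else,
  # including an error payload, raises FunctionClauseError here,
  # at the point closest to the actual problem.
  def parse_user(%{"id" => id, "name" => name}) when is_integer(id) do
    %{id: id, name: name}
  end
end
```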
monkeyelite wrote 1 day ago:
This seems specific to BEAM, as crashing a FastCGI process is fine and
the response will be handled correctly by Apache or nginx.
valenterry wrote 1 day ago:
There are a few stages, and each improves on the previous ones:
1. Detect crashes at runtime and by default stop/crash to prevent
continuing with invalid program state
2. Detect crashes at runtime and handle them according to the business
context (e.g. crash or retry or fallback-to or ...) to prevent bad UX
through crashes.
3. Detect potential crashes at compile-time to prevent the dev from
forgetting to handle them according to the business context
4. Don't just detect the possibility of crashes but also the specific
type and context to prevent the dev from making a logical mistake and
causing a potential runtime error during error handling according to
the business context
An example for stage 4 would be that the compiler checks that a
fall-back option will actually always resolve the errors and not
potentially introduce a new error / error type. Such as falling back to
another URL does not actually always resolve the problem, there still
needs to be handling for when the request to the alternative URL fails.
The philosophy described in the article is basically just stage 1 and a
(partial) default restart instead of a default crash, which is maybe a
slight improvement but not really sufficient, at least not by my
personal standards.
creatonez wrote 1 day ago:
Based on your list there is an opportunity to define stage -1 of
error handling sanity, the Eval-Rinse-Reload loop, as implemented by
FuckItJS, the original Javascript Error Steamroller: [1] > Through a
process known as Eval-Rinse-Reload-And-Repeat, FuckItJS repeatedly
compiles your code, detecting errors and slicing those lines out of
the script. To survive such a violent process, FuckItJS reloads
itself after each iteration, allowing the onerror handler to catch
every single error in your terribly written code.
> [...]
> This will keep evaluating your code until all errors have been
sliced off like mold on a piece of perfectly good bread. Whether or
not the remaining code is even worth executing, we don't know. We
also don't particularly care.
[1]: https://github.com/mattdiamond/fuckitjs
valenterry wrote 1 day ago:
Oh, thank you for the nostalgic reminder of that one. I read that a
decade ago and found it hilarious.
BobbyTables2 wrote 1 day ago:
Hackers also love auto-restarting services.
Exploitation of vulnerabilities isn’t always 100% reliable. Heap
grooming might be limited or otherwise inadequate.
A quick automatic restart keeps them in business without any other
human interaction involved.
teiferer wrote 1 day ago:
Took me a minute to realize what you meant with "hackers". Quite the
irony, given the name of the site we are having this conversation on.
HexDecOctBin wrote 1 day ago:
How does restarting the process fix the crash? If the process crashed
because a file was missing, it will still be missing when the process
is restarted. Is an infinite crash-loop considered success in Erlang?
jlouis wrote 1 day ago:
It's not going to be missing the next time around. Usually the file
is missing due to some concurrency-problem where the file only gets
to exist a little later. A process restart certainly fixes this.
If the problem persists, a larger part of the supervision tree is
restarted. This eventually leads to a crash of the full application,
if nothing can proceed without this application existing in the
Erlang release.
The key point is that there's a very large class of errors which is
due to the concurrent interaction of different parts of the system.
These problems often go away on the next try, because the risk of
them occurring is low.
conradfr wrote 1 day ago:
If the rest of the program is still running while you fix it, yes?
Also, restarting endlessly is just one strategy between multiple
others.
victorbjorklund wrote 1 day ago:
Elixir dev: It does not solve all issues. But sometimes you have some
kind of rare bug that just happens once X,Z and Y happens in a
specific order. If it is restarted it might not happen that way
again. Or it might be a temporary problem. You are reaching for an
API and it temporarily has issues. It might not have it anymore in 50
ms.
But of course if it crashes because you are reading a file that does
not exist it doesnt solve the issue (but it avoids crashing the whole
system).
victorbjorklund wrote 1 day ago:
Note that "let it crash" doesn't mean we shouldn't fix bugs. It is more
about: if there is a bug we haven't fixed, it is better for the crash
to take down a tiny part of the program than the whole program.
worthless-trash wrote 10 hours 17 min ago:
Or more importantly, you can't design robust recovery and retry
systems.
masklinn wrote 1 day ago:
> Is an infinite crash-loop considered success in Erlang?
Of course not, but usually that's not what happens, instead a process
crashes because some condition was not considered, the corresponding
request is aborted, and a supervisor restarts the process (or doesn't
because the acceptor spawns a process per request / client).
Or a long-running worker got into an incorrect state and crashed, and
a supervisor will restart it in a known good state (that's a pretty
common thing to do in hardware, BEAM makes that idiomatic in
software).
gopher_space wrote 1 day ago:
Both of your examples look like infinite crash-loops if your work
needs to be correct more than it needs to be available. E.g. there
aren't any known good states prior to an unexpected crash, you're
just throwing a hail mary because the alternatives are impractical.
dmsnell wrote 1 day ago:
When a process crashes, its supervisor restarts it according to
some policy. Policies specify whether to restart the sibling processes
in their startup order or to only restart the crashed process.
But a supervisor also sets limits, like “10 restarts in a
timespan of 1 second.” Once the limits are reached, the
supervisor crashes. Supervisors have supervisors.
In this scenario the fault cascades upward through the system,
triggering more broad restarts and state-reinitializations until
the top-level supervisor crashes and takes the entire system down
with it.
An example might be losing a connection to the database. Failing
while querying it is not an expected fault, so you let it crash.
That kills the web request, but then the web server ends
up crashing too because too many requests failed, then a task
runner fails for similar reasons. The logger is still reporting
all this because it’s a separate process tree, and the
top-level app supervisor ends up restarting the entire thing. It
shuts everything off, tries to restart the database connection,
and if that works everything will continue, but if not, the
system crashes completely.
Expected faults are not part of “let it crash.” E.g. if a
user supplies a bad file path or network resource. The
distinction is subjective and based around the expectations of
the given app. Failure to read some asset included in the
distribution is both unlikely and unrecoverable, so “let it
crash” allows the code to be simpler in the happy path without
giving up fault handling or burying errors deeper into the app or
data.
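The restart limits and escalation described above can be sketched in
Elixir/OTP. This is a minimal illustration, not anyone's production
setup; the module names `MyApp.Worker` and `MyApp.Supervisor` are
hypothetical:

```elixir
defmodule MyApp.Worker do
  use GenServer

  # Registered under the module name so the supervisor's restart
  # is observable from outside (a new pid appears under the name).
  def start_link(arg), do: GenServer.start_link(__MODULE__, arg, name: __MODULE__)

  @impl true
  def init(arg), do: {:ok, arg}
end

defmodule MyApp.Supervisor do
  use Supervisor

  def start_link(arg), do: Supervisor.start_link(__MODULE__, arg, name: __MODULE__)

  @impl true
  def init(_arg) do
    children = [MyApp.Worker]
    # If children crash more than 10 times within 1 second, this
    # supervisor itself crashes, and the fault escalates to *its*
    # supervisor — the cascade described above.
    Supervisor.init(children, strategy: :one_for_one, max_restarts: 10, max_seconds: 1)
  end
end
```

The `:one_for_one` strategy restarts only the crashed child;
`:rest_for_one` is the variant that also restarts the siblings
started after it, in their startup order.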
Muromec wrote 1 day ago:
If it has no good states you probably know it before deploying to
production.
masklinn wrote 1 day ago:
> there aren't any known good states prior to an unexpected crash
If there aren't any good states then the program straight up
doesn't work in the first place, which gets diagnosed pretty
quickly before it hits the field.
> your work needs to be correct more than it needs to be
available.
"correctness over availability" tends to not be a thing, if you
assume you can reach perfect and full correctness then either you
never release or reality quickly proves you wrong in the field.
So maximally resilient and safe systems generally plan for errors
happening and how to recover from them instead of assuming they
don't. There are very few fully proven non-trivial programs, and
there were even fewer 40 years ago.
And Erlang / BEAM was designed in a telecom context, so
availability is the prime directive. Which is also why
distribution is built-in: if you have a single machine and it
crashes you have nothing.
corysama wrote 1 day ago:
I’m only an armchair expert on Erlang. But, having looked into it
repeatedly for a couple decades, my take-away is the “Let it
crash” slogan is good. But, also presented a bit out of context.
Or, at least assuming context that most people don’t have.
Erlang is used in situations involving a zillion incoming requests.
If an individual request fails… Maybe it was important. Maybe it
wasn’t. If it was important, it’s expected they’ll try again.
What’s most important is that the rest of the requests are not
interrupted.
What makes Erlang different is that it is natural and trivial to be
able to shut down an individual request on the event of an error
without worrying about putting any other part of the system into a
bad state.
You can pull this off in other languages via careful attention to the
details of your request-handling code. But, the creators of the
Erlang language and foundational frameworks have set their users up
for success via careful attention to the design of the system as a
whole.
That’s great in the contexts in which Erlang is used. But, in the
context of a Java desktop app like Open Office, it’s more like
saying “Let it throw”, “it” being some user action, and the
slogan being to have a language and framework with such robust
exception handling built in that error handling becomes trivial and
nearly invisible.
Quekid5 wrote 1 day ago:
> You can pull this off in other languages via careful attention to
the details of your request-handling code. But, the creators of the
Erlang language and foundational frameworks have set their users up
for success via careful attention to the design of the system as a
whole.
+10. So many people miss this very important point. If you have
lots of mutable shared state, or can accidentally leak such into
your actor code then the whole actor/supervision tree thing falls
over very easily... because you can't just restart any actor
without worrying about the rest of the system.
I think this is a large (but not the only[0]) part of why
actors/supervisors haven't really caught on anywhere outside of
Erlang, even for problem spaces where they would be suitable.
[0] I personally feel the model is very hard to reason about
compared to threaded/blocking straight-line code using e.g.
structured concurrency, but that may just be a me thing.
asa400 wrote 22 hours 21 min ago:
I have worked Elixir/Erlang and Rust a lot, and I agree. Rust in
particular gives ownership semantics to threaded/blocking/locking
code, which I often times find _much_ easier to understand than a
series of messages sent between tasks/processes in Elixir/Erlang.
However, in a world where you have to do concurrent
blocking/locking code without the help of rigorous
compiler-enforced ownership semantics, Elixir/Erlang is like
water in the desert.
jfengel wrote 22 hours 52 min ago:
The alternative to straight-line code used to be called
"spaghetti code".
There was a joke article parodying "GOTO considered harmful" by
suggesting a "COME FROM" command. But in a lot of ways, that's
exactly what many modern frameworks and languages aim for.
Quekid5 wrote 22 hours 0 min ago:
Haha... be the change! Program in INTERCAL! :)
nine_k wrote 1 day ago:
Let it crash, so that if something goes wrong, it does not do so
silently.
Let it crash, because a relevant manager will detect it, report it,
clean it up, and restart it, without you having to write a line of
code for that.
Let it crash as soon as possible, so that any problem (like a crash
loop) is readily visible. It's very easy to replace arbitrary bits
of Erlang code in a running system, without affecting the rest of
it. "Fix it in prod" is better than "miss it in prod", especially
when you cannot stop the prod ever.
0x445442 wrote 1 day ago:
Are individual agents deployable on their own or does the entire
"app" of agents need to be deployed as a single group? If
individually deployable, what does this look like from a version
control and a CI/CD perspective?
nine_k wrote 23 hours 38 min ago:
To the best of my knowledge: yes, individual parts are
deployable separately, within reason. There is explicitly no
need to deploy the whole thing at once, and especially not to
shut it all down at once.
Erlang works by message passing and duck typing, so, as long as
your interfaces are compatible (backwards or forwards), you can
alter the implementation, and evolve the interfaces. Think
microservices, but when every function can be a microservice,
at an absolutely trivial cost.
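A sketch of what that interface compatibility can look like in
practice (Elixir; the `KV` module is hypothetical): because calls are
plain messages matched by pattern, a server can accept both an old
and a new call shape at once, so callers can be upgraded
independently:

```elixir
defmodule KV do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, %{}, name: __MODULE__)

  @impl true
  def init(state), do: {:ok, state}

  # Old call shape: still accepted, old callers keep working.
  @impl true
  def handle_call({:get, key}, _from, state) do
    {:reply, Map.get(state, key), state}
  end

  # New, extended call shape added later, alongside the old one.
  def handle_call({:get, key, default}, _from, state) do
    {:reply, Map.get(state, key, default), state}
  end

  @impl true
  def handle_cast({:put, key, val}, state) do
    {:noreply, Map.put(state, key, val)}
  end
end
```

Since dispatch is on the message shape rather than on a compiled-in
type, the implementation behind either shape can be swapped without
touching the callers.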
ramchip wrote 1 day ago:
I recommend [1] starting from "if my configuration file is corrupted,
restarting won't fix anything". The tl;dr is it helps with transient
bugs.
[1]: https://ferd.ca/the-zen-of-erlang.html
bccdee wrote 23 hours 29 min ago:
> if you feel that your well-understood regular failure case is
viable, then all your error handling can fall-through to that case.
This is my favourite line, because it generalizes the underlying
principle beyond the specific BEAM/OTP model in a way that carries
over well to the more common sort of database-backed services that
people tend to write.
kimi wrote 1 day ago:
...and it does no harm for unfixable bugs. It's the logical
equivalent of "turn it off and on again", which as we know fixes
most issues by itself, but it happens to only a part of your software
deployment, so most of it keeps running.
lawn wrote 1 day ago:
Typically you then let the error bubble up in the supervisor tree if
restarting multiple times doesn't fix it.
Of course there are still errors that can't be recovered from, in
which case the whole program may finally crash.
dns_snek wrote 1 day ago:
> in which case the whole program may finally crash.
This may happen if you let it, but it's basically never the desired
outcome. If you were handling a user request, it should stop by
returning a HTTP 500 to the client, or if you were processing a
background job of some sort, it should stop with a watchdog process
marking the job as a failure, not with the entire system crashing.
Muromec wrote 1 day ago:
returning HTTP 500 as early as possible is an example of the "let
it crash" approach outside of Erlang.
dns_snek wrote 17 hours 52 min ago:
That's not what "let it crash" is about. Letting something
crash in Erlang means that a process (actor) is allowed to
crash, but then it gets restarted to try again, which would
resolve the situation in case of transient errors.
The equivalent of "let it crash" outside of Erlang is a
mountain of try-catch statements and hand-rolled retry wrappers
with time delays, with none of the observability and tooling
that you get in Erlang.
adastra22 wrote 1 day ago:
“Reset on error” might be a better phrasing.
refactor_master wrote 1 day ago:
Question as a complete outsider: if I run idempotent Python
applications in Kubernetes containers and they crash, Kubernetes will
eventually restart them. Of course, knowing what to do on IO errors is
nicer than destroying and restarting everything with a much bigger
hammer (as the article also mentions, you can serve a better error
message for whoever has to “deal” with the problem), but eventually
they should end up in the same workable state.
Is this conceptually similar, but perhaps at the code level instead?
valenterry wrote 1 day ago:
In general, if you can move any kind of logic to a lower level,
that's better.
For example, testing that kubernetes restarts work correctly is
tricky and requires a complicated setup. Testing that an erlang
process/actor behaves as expected is basically a unit test.
emoII wrote 1 day ago:
I bet the kubernetes project has test for that, why should I as an
application developer care about testing something other than my
own code?
bccdee wrote 18 hours 18 min ago:
That's assuming your code is well-configured. How do you test
your k8s configs?
valenterry wrote 1 day ago:
Oh of course, I'm sure the kubernetes project tests that they
trigger restarts correctly etc.
But that doesn't cover the behavior of your app, the specific
configuration you ask kubernetes to use and how the app uses its
health endpoints etc. - this is all purely about your own
code/config, the kubernetes team can't test that.
adastra22 wrote 1 day ago:
Conceptually similar, different implementation. Perhaps the most
visible difference is that supervisors aren’t polling application
state but are instead notified about errors (crashes), and restarting
is extremely low latency. Erlang/BEAM was invented for telephony,
and it is possible for this to happen in the middle of a protocol
exchange without the user even noticing.
goosejuice wrote 1 day ago:
Somewhat, yes, but it's much less powerful. In the BEAM these are
trees of supervisors and monitors/links, which choose how to restart
and receive the stacktrace/error reason of the failure, respectively.
This gives a lot of freedom in how to handle the failure. In k8s,
it's often just a dumb monitor/controller that knows little about how
to remediate the issue on boot. Never mind the boot-time penalty. [1]
BEAM apps run great on k8s.
[1]: https://hexdocs.pm/elixir/1.18.4/Supervisor.html
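A minimal sketch of a monitor receiving the failure reason (plain
Elixir, no supervisor involved):

```elixir
# spawn_monitor/1 starts a process and monitors it atomically, so
# the :DOWN message always carries the process's real exit reason.
{pid, ref} = spawn_monitor(fn -> exit({:badmatch, :unexpected_value}) end)

reason =
  receive do
    {:DOWN, ^ref, :process, ^pid, why} -> why
  after
    1_000 -> :timeout
  end

# reason is now {:badmatch, :unexpected_value}. A supervisor receives
# the same kind of information through links and uses it to decide
# whether and how to restart the child.
IO.inspect(reason)
```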
goosejuice wrote 1 day ago:
[1] The origin, as far as I know. I think it still holds and is
insightful as a general case. "Let it heal" seems pretty close to
what Joe was getting at.
[1]: https://erlang.org/pipermail/erlang-questions/2003-March/00787...
vrnvu wrote 1 day ago:
>> This organization corresponds nicely to a idealized human
organization of bosses and workers - bosses say what is to be done,
workers do stuff. Bosses do quality control and check that things
get done, if not they fire people, re-organize and tell other people
to do the stuff. If they fail (the bosses) they get sacked etc. <<
We miss you Joe :)
tombert wrote 1 day ago:
He was one of my favorite humans; the few emails I exchanged with
him were funny and insightful.
bitwize wrote 1 day ago:
I don't code in Erlang or Elixir, aside from messing about. But I've
found that letting an entire application crash is something that I can
do under certain circumstances, especially when "you have a very big
problem and will not go to space today". For example, if there's an
error reading some piece of data that's in the application bundle and
is needed to legitimately start up in the first place (assets for my
game, for instance), it just "screams and dies" (spits out a stack
trace and terminates).
borromakot wrote 1 day ago:
Errors during initialization of a BEAM language application will
crash the entire program, and you can decide to exit/crash a program
if you get into some unrecoverable state. The important thing is the
design of individual crashable/recoverable units.
bgdkbtv wrote 1 day ago:
This is great, thanks for sharing! I've been thinking about improving
error handling in my liveview app and this might be a nice way to
start.