Hacker News on Gopher (unofficial)
COMMENT PAGE FOR:
Don't “let it crash”, let it heal
Jtsummers wrote 21 hours 57 min ago: | |
I think a lot of folks who have never looked at Erlang or Elixir and | |
BEAM before misunderstand this concept because they don't understand | |
how fine-grained processes are, or can be, in Erlang. A very important | |
note: Processes in BEAM languages are cheap, both to create and for | |
context switching, compared to OS threads. While design-wise they offer | |
similar capabilities, this cost difference results in a substantially | |
different approach to design in Erlang than in systems where the cost | |
of introducing and switching between threads is more expensive. | |
In a more conventional language where concurrency is relatively | |
expensive, and assuming you're not an idiot who writes 1-10k SLOC | |
functions, you end up with functions that have a "single | |
responsibility" (maybe not actually a single responsibility, but closer | |
to it than having 100 duties in one function) near the bottom of your | |
call tree, but they all exist in one thread of execution. In a
hypothetical system built on this model, suppose your lowest-level
function is something like:
retrieve_data(db_connection, query_parameters) -> data | |
And the database connection fails, would you attempt to restart the | |
database connection in this function? Maybe, but that'd be bad design. | |
You'd most likely raise an exception or change the signature so you
could express an error return; in Rust and similar languages it would
become something like:
retrieve_data(db_connection, query_parameters) -> Result | |
Somewhere higher in the call stack you have a handler which will catch | |
the exception or process the error and determine what to do. That is, | |
the function `retrieve_data` crashes, it fails to achieve its objective | |
and does not attempt any corrective action (beyond maybe a few retries | |
in case the error is transient). | |
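A minimal Python sketch of the pattern described above, not taken from the comment: `DbConnection`, `QueryError` and `handle_request` are hypothetical names. `retrieve_data` makes no attempt to repair the connection and only reports failure; a handler higher in the call stack decides what happens next.

```python
class QueryError(Exception):
    """Hypothetical error raised when data retrieval fails."""


class DbConnection:
    """Hypothetical stand-in for a database connection."""

    def __init__(self, healthy):
        self.healthy = healthy

    def run(self, query_parameters):
        if not self.healthy:
            raise ConnectionError("connection lost")
        return {"rows": [query_parameters]}


def retrieve_data(db_connection, query_parameters):
    # Lowest-level function: no reconnect logic here, only failure reporting.
    try:
        return db_connection.run(query_parameters)
    except ConnectionError as exc:
        raise QueryError("retrieve_data failed") from exc


def handle_request(db_connection, query_parameters):
    # Higher in the call stack: this is where the policy decision lives.
    try:
        return ("ok", retrieve_data(db_connection, query_parameters))
    except QueryError as exc:
        return ("error", str(exc))
```

Calling `handle_request(DbConnection(False), "q")` yields an error tuple instead of an unhandled exception, because the decision was made one level up.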
In Erlang, you have a supervision tree which corresponds to this call | |
tree concept but for processes. The process handling data retrieval, | |
having been given some db_conn handler and the parameters, will fail | |
for some reason. Instead of handling the error in this process, the | |
process crashes. The failure condition is passed to the supervisor | |
which may or may not have a handler for this situation. | |
You might put the simple retry policy in the supervisor (that basic | |
assumption of transient errors, maybe a second or third attempt will | |
succeed). It might have other retry policies, like trying the request | |
again but with a different db_connection (that other one must be bad | |
for some reason, perhaps the db instance it references is down). If it | |
continues to fail, then this supervisor will either handle the error | |
some other way (signaling to another process that the db is down, fix | |
it or tell the supervisor what to do) or perhaps crash itself. This
repeats all the way up the supervision tree; ultimately it could mean
bringing down the whole system if the error propagates to a high enough
level.
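The retry policies described above can be sketched in plain, sequential Python. This is only an analogy under stated assumptions, not how BEAM supervisors are implemented: `supervise`, `Escalate` and `flaky_worker` are hypothetical names, and a raised exception stands in for a crashed process.

```python
class Escalate(Exception):
    """Raised by a supervisor that has run out of options."""


def supervise(worker, connections, attempts_per_connection=2):
    # Try each connection a bounded number of times, then give up
    # and let the failure propagate to whoever supervises us.
    failures = []
    for conn in connections:
        for _ in range(attempts_per_connection):
            try:
                return worker(conn)
            except Exception as exc:  # the worker "crashed"
                failures.append((conn, repr(exc)))
    raise Escalate(failures)


def flaky_worker(conn):
    # Hypothetical worker: only the connection named "replica" works.
    if conn != "replica":
        raise ConnectionError("db down")
    return "data via " + conn
```

With `["primary", "replica"]` the supervisor falls through to the second connection; with only `["primary"]` it raises `Escalate`, which is the point where the next level up would take over.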
This is conceptually no different than how errors and exceptions are | |
handled in sequential, non-concurrent systems. You have handlers that | |
provide mechanisms for retrying or dealing with the errors, and if you | |
don't the error is propagated up (hopefully you don't continue running | |
in a known-bad state) until it is handled or the program crashes | |
entirely. | |
In languages that offer more expensive concurrency (traditional OS | |
threads), the cost of concurrency (in memory and time) means you end up | |
with a policy that sits somewhere between Erlang's and a straight-line | |
sequential program. Your threads will be larger than Erlang processes | |
so they'll include more error handling within themselves, but | |
ultimately they can still fail and you'll have a supervisor of some | |
sort that determines what happens next (hopefully). | |
As more languages move to cheap concurrency (Go's goroutines, Java's | |
virtual threads), system designs have a chance to shift closer to | |
Erlang than that straight-line sequential approach if people are | |
willing to take advantage of it. | |
juped wrote 1 day ago: | |
There's really not more that's useful to say than the relevant section | |
(4.4) of Joe Armstrong's thesis says: | |
>How does our philosophy of handling errors fit in with coding | |
practices? What kind of code must the programmer write when they find | |
an error? The philosophy is let some other process fix the error, but | |
what does this mean for their code? The answer is let it crash. By this | |
I mean that in the event of an error, then the program should just | |
crash. But what is an error? For programming purpose we can say that: | |
>• exceptions occur when the run-time system does not know what to
do.
>• errors occur when the programmer doesn’t know what to do.
>If an exception is generated by the run-time system, but the | |
programmer had foreseen this and knows what to do to correct the | |
condition that caused the exception, then this is not an error. For | |
example, opening a file which does not exist might cause an exception, | |
but the programmer might decide that this is not an error. They | |
therefore write code which traps this exception and takes the necessary | |
corrective action. | |
>Errors occur when the programmer does not know what to do. Programmers | |
are supposed to follow specifications, but often the specification does | |
not say what to do and therefore the programmer does not know what to | |
do. | |
>[...] | |
>The defensive code detracts from the pure case and confuses the
reader - the diagnostic is often no better than the diagnostic which
the compiler supplies automatically.
Note that this "program" is a process. For a process doing work, | |
encountering something it can't handle is an error per the above | |
definitions, and the process should just die, since there's nothing | |
better for it to do; for a supervisor process supervising such | |
processes-doing-work, "my child process exited" is an exception at | |
worst, and usually not even an exception since the standard library | |
supervisor code already handles that. | |
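A small Python sketch of the exception/error distinction quoted above, using the thesis's own missing-file example. The function name `load_settings` and the choice of JSON are assumptions made for illustration.

```python
import json


def load_settings(path):
    # A missing file was foreseen: fall back to defaults ("not an error").
    try:
        with open(path) as handle:
            text = handle.read()
    except FileNotFoundError:
        return {}
    # Malformed content was not foreseen: there is no handler, so the
    # json.JSONDecodeError propagates and the caller (or supervisor) decides.
    return json.loads(text)
```

The function traps only the condition its author knows how to correct; everything else is left to crash.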
stcg wrote 1 day ago: | |
"Let it crash" is a sentence that gets attention. It makes a person | |
want to know more about it, as it sounds controversial and different. | |
"Let it heal" doesn't have that. | |
jonhohle wrote 1 day ago: | |
It also has a deeper philosophical meaning: unexpected software
bugs should be noisy and obvious instead of causing silent
corruption or a misleading user experience. If monitoring doesn’t
catch the failure, customers will, and it can be fixed right away
(whether it’s the software, a hardware error, a dependency issue,
etc.).
A web service returning a 500 error code is a lot more obvious than a | |
200 with an invalid payload. A crashed app with a stack trace is
easier to debug and will cause more user feedback than an app that
hangs in a retry loop.
When I had to deal with these things in the Java world, it meant not | |
blindly handling or swallowing exceptions that business code had no | |
business caring about. Does your account management code really think | |
it knows how to properly handle an InterruptedException? Unless your
answer is "roll back and reset the interrupted flag", it’s probably
wrong. Can’t write a test for a particular failure scenario? That
better blow up loudly with enough context to make it possible to
understand the error condition (and then write a test for it).
tmcb wrote 1 day ago: | |
It is very common to interpret taglines by their face value, and I | |
believe the author did just that, although the point brought up is | |
valid. | |
In order to “let it crash”, we must design the system in a way that
crashes would not be catastrophic, stability-wise. Letting it crash is
not a commandment, though: it is a reminder that, in most cases, a | |
smart healing strategy might be overkill. | |
borromakot wrote 1 day ago: | |
Author: I'm literally explaining not to interpret the tag line at | |
face value. | |
Muromec wrote 1 day ago: | |
Yeah, but it's an internet forum, and for opinion pieces people first
read comments and then maybe read the article if it's interesting.
tmcb wrote 1 day ago: | |
I actually skimmed the article before posting. I have some | |
exposure to Erlang, but not to Elixir. As I’ve already
mentioned, I think the author’s coverage of application
behavior is OK, but there is more to the tagline than meets the
eye.
tmcb wrote 1 day ago: | |
Maybe I didn’t make myself clear. “Let it crash” is not
something that should be thought of at the component level; it
should be thought of at the system level. Whether the
application crashes “gracefully” or not is not what is really
important. You should design the system in a crash-friendly way,
rather than write the application and think: “oh, I believe it is
OK to let it crash here”.
borromakot wrote 1 day ago: | |
Then I don't think you understand how the phrase is used in | |
Elixir/Erlang. The phrase is about letting processes crash. | |
tmcb wrote 1 day ago: | |
No need for the snarky comment. If I am wrong, that is fine. | |
Of course Joe Armstrong could explain what I meant, but in a | |
much better way: [1] (edit: see the "Why was error handling | |
designed like this?" part for reference) | |
My personal interpretation is that systems must be able to | |
handle crashing processes gracefully. There is no benefit in | |
letting processes crash just for the sake of it. | |
[1]: https://erlang.org/pipermail/erlang-questions/2003-Mar... | |
tmcb wrote 1 day ago: | |
Actually, now that I've thought about it, I know exactly what irked
me about the approach. I hope the author takes it as
constructive feedback:
Saying "let it crash is a tagline that actually means | |
something else because the BEAM is supposed to be used in | |
this particular way" sounds slightly "cargo-cultish", to the | |
point where we have to challenge the meaning of the actual | |
word to make sense of it. | |
Joe Armstrong's e-mail, on the other hand, says (and I | |
paraphrase): "the BEAM was designed from the ground up to | |
help developers avoid the creation of ad-hoc protocols for | |
process communication, and the OTP takes that into | |
consideration already. Make sure your system, not your | |
process, is resilient, and literally let processes crash." | |
Boom. There is no gotcha there. Also, there is the added | |
benefit that developers for other platforms now understand | |
that the rationale is justified by the way BEAM/OTP were | |
designed and may not be applicable to their own platforms. | |
borromakot wrote 23 hours 56 min ago: | |
If I sounded snarky that wasn't my intention. At the end of | |
the day though it doesn't feel like you read the article | |
which was clearly in a different context than the one in | |
which you responded. FWIW I didn't expect this small | |
article speaking to a small audience (Elixir devs) to make | |
the rounds on hacker news. | |
I agree on the importance of defining terms, and I think | |
the important thing here is that "process" in Joe's | |
parlance is not an OS level process, it is one of a fleet | |
of processes running inside the BEAM VM. And the "system" | |
in this case is the supervisory system around it, which | |
itself consists of individual processes. | |
I'm critiquing a common misunderstanding of the phrase "Let | |
it crash", whereby effectively no local error handling is | |
performed. This leads to worse user experiences and worse | |
outcomes in general. I understand that you're offering | |
critique, but it again sounds like you're critiquing a | |
reductive element (the headline itself). | |
tmcb wrote 23 hours 29 min ago: | |
I did read the article. I concede that I might not have | |
understood it. Again, I never said it is wrong, but | |
rather that it has a blind spot. I am familiar with Joe | |
Armstrong’s work because I worked on a proprietary (and
rather worse tbf) native distributed systems middleware | |
in the past. | |
IshKebab wrote 1 day ago: | |
Ah this makes sense. I always thought "let it crash" made it sound like | |
Elixir devs just don't bother with error checking, like writing Java | |
without any `catch`es, or writing Rust that only uses `.unwrap()`. | |
If they just mean "processes should be restartable" then that sounds
way more reasonable. Similar idea to this but less fancy: [1]
It's a pretty terrible slogan if it makes your language sound worse than
it actually is.
[1]: https://flawless.dev/ | |
JonChesterfield wrote 14 hours 29 min ago: | |
Flawless is interesting. | |
It can't work in the general case because replaying a sequence of | |
syscalls is not sufficient to put the machine back in the same state | |
as it was last time. E.g. second time around open behaves differently | |
so you need to follow the error handling. | |
However sometimes that approach would work. I wonder how wide the | |
area of effective application is. It might be wide enough to be very | |
useful. The all or nothing database transaction model fits it well. | |
bccdee wrote 23 hours 24 min ago: | |
I've been seeing a lot of these durable workflow engines around | |
lately, for some reason. I'm not sure I understand the pitch. It just | |
seems like a thin wrapper around some very normal patterns for | |
running background jobs. Persist your jobs in a db, checkpoint as | |
necessary, periodically retry. I guess they're meant to be a low-code | |
alternative to writing the db tables yourself, but it seems like | |
you're not saving much code in practice. | |
vendiddy wrote 1 day ago: | |
I think the slogan was meant to be provocative but unfortunately it | |
has been misinterpreted more often than not. | |
For example, imagine you're working with a 3rd party API and, | |
according to the documentation, it is supposed to return responses in | |
a certain format. What if suddenly that API stops working? Or what if | |
the format changes? | |
You could write code to handle that "what if" scenario, but then | |
trying to handle every hypothetical your code becomes bloated, more | |
complicated, and hard to understand. | |
So in these cases, you accept that the system will crash. But to | |
ensure reliability, you don't want to bring down the whole system. So | |
there are primitives that let you control the blast radius of the | |
crash if something unexpected happens. | |
Let it crash does not mean you skip validating user input. Those are | |
issues that you expect to happen. You handle those just as you would | |
in any programming language. | |
zmgsabst wrote 1 day ago: | |
I think it’s more subtle:
Imagine that you’re trying to access an API, which for some reason
fails.
“Let it crash” isn’t an argument against handling the timeout,
but rather that you should only retry a few, bounded times rather
than (eg) exponentially back off indefinitely.
When you design from that perspective, you just fail your request
processing (returning the request to the queue) and make that your
manager’s problem. Your managing process can then restart you,
reassign the work to healthy workers, etc. If your manager can’t
get things working and the queue overflows, it throws it into dead
letters and crashes. That might restart the server, it might page
oncall, etc.
The core idea is that within your business logic is the wrong place
to handle system health - and that many problems can be solved by
routing around problems (ie, give task to a healthy worker) or
restarting a process. A process should crash when it isn’t scoped
to handle the problem it’s facing (eg, server OOM, critical
dependency offline, bad permissions). Crashing escalates the problem
until somebody can resolve it.
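A rough Python sketch of the flow described above: bounded retries inside the worker, then failure handed to a manager that reassigns the task and eventually dead-letters it. All names (`process_with_bounded_retries`, `manager`, `max_deliveries`) are hypothetical, and this simplified manager returns its dead letters instead of crashing.

```python
from collections import deque


def process_with_bounded_retries(task, handler, max_attempts=3):
    # Retry a few, bounded times; then re-raise so the manager sees it.
    for attempt in range(1, max_attempts + 1):
        try:
            return handler(task)
        except Exception:
            if attempt == max_attempts:
                raise


def manager(tasks, workers, max_deliveries=2):
    # Each redelivery of a task goes to the next worker in the list;
    # a task that has failed max_deliveries times is dead-lettered.
    queue = deque((task, 0) for task in tasks)
    done, dead_letters = [], []
    while queue:
        task, deliveries = queue.popleft()
        worker = workers[deliveries % len(workers)]
        try:
            done.append(process_with_bounded_retries(task, worker))
        except Exception:
            if deliveries + 1 >= max_deliveries:
                dead_letters.append(task)
            else:
                queue.append((task, deliveries + 1))  # back on the queue
    return done, dead_letters
```

The business logic in each worker never reasons about system health; it either finishes or fails, and routing around the unhealthy worker is the manager's job.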
johnisgood wrote 1 day ago: | |
As someone has linked it: [1] It is about self-healing, too. | |
[1]: https://erlang.org/pipermail/erlang-questions/2003-March/007... | |
PicassoCTs wrote 1 day ago: | |
[1] Railway-oriented programming to the rescue?
[1]: https://fsharpforfunandprofit.com/rop/ | |
adregan wrote 51 min ago: | |
There are a couple of patterns for accomplishing this in Elixir. | |
One is to build multiple function heads that pattern match on the | |
arguments. If it’s an error tuple, pass it along. Build up your
pipeline and handle any errors at the end. | |
Another is to use the `with else`[0] expression for building up a | |
railroad. This has the benefit of not having to teach your functions | |
how to pass along errors. Error handling in the else block can be a | |
little gnarly. | |
I find it a little more manual than languages that have a `runEffect` | |
or compose operator. In large part that’s due to the :ok, :error
tuples being more of a convention than a primitive like | |
Either/Result. | |
0: | |
[1]: https://elixirschool.com/en/lessons/basics/control_structure... | |
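For readers who don't write Elixir, the first pattern (pass error tuples along, handle them at the end) can be approximated in Python. This is a sketch by analogy only: `bind`, `parse_int`, `require_positive` and `pipeline` are hypothetical names, and plain tuples stand in for `{:ok, value}` / `{:error, reason}`.

```python
def bind(result, step):
    # Pass errors along untouched; apply the step only to ("ok", value).
    tag, value = result
    return step(value) if tag == "ok" else result


def parse_int(text):
    try:
        return ("ok", int(text))
    except ValueError:
        return ("error", "not an integer: " + repr(text))


def require_positive(number):
    return ("ok", number) if number > 0 else ("error", "must be positive")


def pipeline(text):
    # Build up the "railway": the first error short-circuits the rest.
    result = ("ok", text)
    for step in (parse_int, require_positive):
        result = bind(result, step)
    return result
```

As in the Elixir version, the ok/error pair here is a convention rather than a language primitive, so nothing stops a step from returning some other shape.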
hesus_ruiz wrote 1 day ago: | |
It is very strange that a post trying to explain the concept of "let it | |
crash" in Elixir (which runs on the BEAM VM) does not mention the | |
doctoral thesis of Joe Armstrong: "Making reliable distributed systems | |
in the presence of software errors". | |
It should be compulsory reading for anybody interested in reliable
systems, even if they do not use the BEAM VM.
[1]: https://www.diva-portal.org/smash/record.jsf?pid=diva2%3A10420... | |
plainOldText wrote 21 hours 11 min ago: | |
Some core ideas from the paper for the impatient (failures,
isolation, healing):
- Failures are inevitable, so systems must be designed to EXPECT and
recover from them, NOT AVOID them completely.
- The "let it crash" philosophy allows components to FAIL and RECOVER
quickly using supervision trees.
- Processes should be ISOLATED and communicate via MESSAGE PASSING, | |
which prevents cascading failures. | |
- Supervision trees monitor other processes and RESTART them when | |
they fail, creating a self-healing architecture. | |
anthk wrote 1 day ago: | |
Unix/BSD -> Crash, fix, restart. | |
GNU/MIT/Lisp -> Detect, offer a fix, continue. | |
atoav wrote 1 day ago: | |
The truth is that different errors have to lead to different results if | |
you want a good organisational outcome. These could be: | |
- Fundamental/Fatal error: something without which the process cannot
function, e.g. we are missing an essential config option. Exiting with
an error is totally adequate. You can't just heal from that as it would
involve guessing information you don't have. Admins need to fix it.
- Critical error: something that should not ever occur, e.g. having an
active user without password and email. You don't exit, you skip it if
that is possible and ensure the first occurrence is logged and admins
are contacted.
- Expected/Regular error: something that is expected to happen during | |
the normal operations of the service, e.g. the other server you make | |
requests to is being restarted and thus unreachable. Here the strategy | |
may vary, but it could be something like retrying with random | |
exponential backoff. Or you could briefly accept the values provided by | |
that server are unknown and periodically retry to fill the unknown | |
values. Or you could escalate that into a critical error after a | |
certain amount of retries. | |
- Warnings: These are usually about something being not exactly ideal,
but do not impede the flow of the program at all. Usually has to
do with bad data quality.
If you can proceed without degrading the integrity of the system you
should; the next thing is to decide how important it is for humans to
hear about it.
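One of the strategies mentioned for expected errors, retrying with random exponential backoff and escalating to a critical error after a fixed number of retries, might look like the following Python sketch. The names `backoff_delays`, `call_with_backoff` and `CriticalError`, and the default timings, are assumptions chosen for illustration.

```python
import random
import time


class CriticalError(Exception):
    """Escalation once an expected error has persisted too long."""


def backoff_delays(retries, base=0.1, cap=5.0, rng=random.random):
    # Each delay is a random fraction of an exponentially growing
    # ceiling, capped at `cap` seconds.
    return [rng() * min(cap, base * 2 ** n) for n in range(retries)]


def call_with_backoff(operation, retries=3, sleep=time.sleep, **kwargs):
    delays = backoff_delays(retries, **kwargs)
    for attempt in range(retries + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == retries:
                # Out of retries: stop treating this as a regular error.
                raise CriticalError("unreachable after %d retries" % retries)
            sleep(delays[attempt])  # wait before the next attempt
```

Only `ConnectionError` is treated as the expected, retryable case here; any other exception propagates immediately, in line with treating different error classes differently.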
praptak wrote 1 day ago: | |
A condition that "should not happen" might still be a problem specific | |
to a particular request. If you "just crash" it turns this request from | |
one that only triggers a http 500 response to one that crashes the | |
process. This increases the risk of Query of Death scenarios where the | |
frontend that needs to serve this particular request starts retrying it | |
with different backends and triggers restarts faster than the processes | |
come back up. | |
So being too eager to "just crash" may turn a scenario where you fail | |
to serve 1% of requests into a scenario where you serve none because | |
all your processes keep restarting. | |
sarchertech wrote 1 day ago: | |
> If you "just crash" it turns this request from one that only | |
triggers a http 500 response to one that crashes the process. | |
In Phoenix each request has its own process and crashing that process
will result in a 500 being sent to the client.
davidclark wrote 1 day ago: | |
You should try to do some load testing of a real Erlang system and | |
compare how it handles this scenario against other | |
languages/frameworks. What you are describing is one of the exact | |
things the Erlang system is strong against due to the scheduler. | |
rtpg wrote 1 day ago: | |
My impression is that in Erlang land each process handler is really | |
cheap so you can just keep on showing up with process handlers and | |
not reach exhaustion like you do with other systems (at least in | |
pre-async worlds...) | |
josevalim wrote 1 day ago: | |
Processes can be marked as temporary, which means they are not | |
restarted, and that’s what is used when managing http connections,
as you can’t really restart a request on the server without the
client. So the scenario above wouldn’t happen.
You still want those processes to crash though, as crashing
automatically cleans up any concurrent work. For example, if during a
request you start three processes to do concurrent work, like | |
fetching APIs, then the request process crashes, the concurrent | |
processes are automatically cleaned up. | |
cyberax wrote 1 day ago: | |
"Let it crash" in Erlang/Elixir means that the process that serves | |
the request is allowed to crash. It then will be restarted by the | |
supervisor. | |
Supervisors themselves form a tree, so for a crash to take down the | |
whole app, it needs to propagate all the way to the top. | |
Another explanation for people familiar with exceptions in other | |
languages: "Don't try to catch the exception inside a request | |
handler". | |
zwnow wrote 1 day ago: | |
This is funny given Elixir/Erlangs whole idea is "let it crash". In | |
Go I just have a Recovery Middleware for any type of problem. Don't | |
know how other langs do it tho | |
davidclark wrote 1 day ago: | |
I don’t know Go, but that sounds like someone has simply written
part of Erlang in Go. | |
knome wrote 1 day ago: | |
erlang doesn't crash the program, it crashes the thread. erlang has | |
a layered management system built in as part of OTP (open telecom | |
platform, erlang was built for running highly concurrent telephony | |
hardware). when a thread crashes, it dies and signals its parent. | |
the parent then decides what to do. usually, that's just restarting | |
the worker. maybe if ten workers have crashed in a minute, the | |
manager itself will die and restart. issues bubble up, and managers | |
restart subsystems automatically. for some things, like parsing | |
user data, you might never cause the manager to die, and just | |
always restart the worker. | |
the article, if you should choose to read it, is explaining that | |
people have the misconception you appear to be having due to the | |
'let it fail' catchphrase. it goes into detail about this system, | |
when failing is appropriate, and when trying to work around errors | |
is appropriate. | |
as erlang uses green threads, restarting a thread for a user API is
effectively instant and free. | |
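The "ten workers have crashed in a minute" rule above is a restart limit over a sliding time window. A minimal Python sketch of that bookkeeping follows; it is not OTP's implementation, and `Supervisor`, `child_crashed` and `RestartIntensityExceeded` are hypothetical names.

```python
import time
from collections import deque


class RestartIntensityExceeded(Exception):
    """The supervisor gives up and fails itself."""


class Supervisor:
    def __init__(self, max_restarts=10, window_seconds=60.0,
                 clock=time.monotonic):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self.clock = clock
        self.restarts = deque()  # timestamps of recent restarts

    def child_crashed(self):
        now = self.clock()
        # Forget restarts that fell out of the sliding window.
        while self.restarts and now - self.restarts[0] > self.window_seconds:
            self.restarts.popleft()
        if len(self.restarts) >= self.max_restarts:
            # Too many recent restarts: escalate to our own supervisor.
            raise RestartIntensityExceeded("too many restarts")
        self.restarts.append(now)
        return "restarted"
```

An occasional crash is absorbed with a restart, while a burst of crashes makes the supervisor itself fail so the problem bubbles up a level.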
zwnow wrote 1 day ago: | |
It's not a misconception given that Elixir Forum and its Discords | |
members will say that to you. Also I never assumed the whole | |
program crashed so why would you explain this to me? | |
Why would one Blog guy know it better than a lot of other Elixir | |
devs? | |
sarchertech wrote 1 day ago: | |
It’s well known among elixir devs that for reasons unknown,
Elixir Forum is populated predominantly by people who don’t
know what they’re talking about.
borromakot wrote 1 day ago: | |
Blog guy here: I do, in fact, know it better than a lot of | |
other Elixir devs. | |
snickerbockers wrote 1 day ago: | |
>When people say “let it crash”, they are referring to the fact
that practically any exited process in your application will be | |
subsequently restarted. Because of this, you can often be much less | |
defensive around unexpected errors. You will see far fewer try/rescue, | |
or matching on error states in Elixir code. | |
I just threw up in my mouth when I read this. I've never used this | |
language so maybe my experience doesn't apply here but I'm imagining | |
all the different security implications that I've seen arise from
failing to check error codes. | |
vendiddy wrote 1 day ago: | |
If you get a chance to read some Elixir/Erlang code you'll see that
pattern matching is used frequently to assert expected error codes. | |
It does not mean ignore errors. | |
This is a common misunderstanding because unfortunately the slogan is | |
frequently misinterpreted. | |
toast0 wrote 1 day ago: | |
Ok, so it's not really that you're not checking error codes. It's | |
that you can write stuff like | |
ok = whatever(). | |
If whatever is successful and idiomatic, it returns ok, or maybe a
tuple of {ok, SomeReturn}. In that case, execution would continue. If | |
it returns an error tuple like {error, Reason}... "Let it crash" says | |
you can just let it crash... You didn't have anything better to do, | |
the built in crash because {error, Reason} will do fine. | |
Or you could do a | |
case whatever() of
    ok -> ok;
    {error, nxdomain} -> ok
end.
If it was fine to get nxdomain error, but any other error isn't | |
acceptable... It will just crash, and that's good or at least ok. | |
Better than having to enumerate all the possible errors, or having a | |
catchall that then explicitly throws an error. It's especially hard
to enumerate all possible errors because the running system can | |
change and may return a new error that wasn't enumerated when the | |
requesting code was written. | |
There's lots of places where crashing isn't actually what you want, | |
and you have to capture all errors, explicitly log it, and then move | |
on... But when you can, checking for success or success and a handful | |
of expected and recoverable errors is very nice. | |
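For comparison, here is a rough Python analogue of the two Erlang snippets above. Python has no built-in equivalent of a failed match crashing the process, so the hypothetical `MatchError`, `expect_ok` and `ok_or_nxdomain` below only imitate the behaviour.

```python
class MatchError(Exception):
    """Raised when a result is not one of the expected shapes."""


def expect_ok(result):
    # Rough analogue of `ok = whatever().`: anything but "ok" crashes.
    if result != "ok":
        raise MatchError(result)
    return result


def ok_or_nxdomain(result):
    # Rough analogue of the `case` above: two accepted shapes and
    # deliberately no catch-all branch.
    if result == "ok" or result == ("error", "nxdomain"):
        return "ok"
    raise MatchError(result)
```

A new error value that did not exist when the code was written, for example `("error", "timeout")`, falls through to `MatchError` without anyone having to enumerate it.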
josevalim wrote 1 day ago: | |
That’s actually a good example. Imagine someone forgot to check the
error code from an API response. In some languages, they may attempt
to parse it as if it were a successful request, and succeed, leading to
a result with nulls, empty arrays, or missing data that then spreads
through the system. In Elixir, parsing would most likely fail thanks
to pattern matching [1], and if by any chance that fails in a core
part of the system, that failure will be isolated and that particular
component can be restarted.
Elixir is not about willingly ignoring error codes or failure
scenarios. It is about naturally limiting the blast radius of errors
without a need to program defensively (as in writing code for
scenarios you don’t know “just in case”).
1: | |
[1]: https://dashbit.co/blog/writing-assertive-code-with-elixir | |
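The difference between the two parsing styles described above can be shown in a few lines of Python. Both function names and the payload shape are hypothetical.

```python
def parse_user_lenient(payload):
    # Looks defensive but is dangerous: missing fields silently become
    # None and travel further into the system.
    return {"id": payload.get("id"), "email": payload.get("email")}


def parse_user_assertive(payload):
    # Assertive: an error payload has no "id" key, so this raises
    # KeyError at the boundary instead of producing a half-empty result.
    return {"id": payload["id"], "email": payload["email"]}
```

Given an unchecked error response such as `{"error": "rate limited"}`, the lenient version returns a user full of `None` values, while the assertive version fails immediately at the point where the bad data entered.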
monkeyelite wrote 1 day ago: | |
This seems specific to BEAM as crashing a fast-cgi process is fine and | |
response will be handled correctly with Apache or nginx. | |
valenterry wrote 1 day ago: | |
There are a few stages, and each improves on the previous ones: | |
1. Detect crashes at runtime and by default stop/crash to prevent | |
continuing with invalid program state | |
2. Detect crashes at runtime and handle them according to the business | |
context (e.g. crash or retry or fallback-to or ...) to prevent bad UX | |
through crashes. | |
3. Detect potential crashes at compile-time to prevent the dev from | |
forgetting to handle them according to the business context | |
4. Don't just detect the possibility of crashes but also the specific | |
type and context to prevent the dev from making a logical mistake and | |
causing a potential runtime error during error handling according to | |
the business context | |
An example for stage 4 would be that the compiler checks that a | |
fall-back option will actually always resolve the errors and not | |
potentially introduce a new error / error type. Such as falling back to | |
another URL does not actually always resolve the problem, there still | |
needs to be handling for when the request to the alternative URL fails. | |
The philosophy described in the article is basically just stage 1 and a | |
(partial) default restart instead of a default crash, which is maybe a | |
slight improvement but not really sufficient, at least not by my | |
personal standards. | |
creatonez wrote 1 day ago: | |
Based on your list there is an opportunity to define stage -1 of
error handling sanity, the Eval-Rinse-Reload loop, as implemented by
FuckItJS, the original Javascript Error Steamroller: [1]
> Through a process known as Eval-Rinse-Reload-And-Repeat, FuckItJS
repeatedly compiles your code, detecting errors and slicing those lines
out of the script. To survive such a violent process, FuckItJS reloads
itself after each iteration, allowing the onerror handler to catch
every single error in your terribly written code.
> [...] | |
> This will keep evaluating your code until all errors have been | |
sliced off like mold on a piece of perfectly good bread. Whether or | |
not the remaining code is even worth executing, we don't know. We | |
also don't particularly care. | |
[1]: https://github.com/mattdiamond/fuckitjs | |
valenterry wrote 1 day ago: | |
Oh, thank you for the nostalgic reminder of that one. I read that a | |
decade ago and found it hilarious. | |
BobbyTables2 wrote 1 day ago: | |
Hackers also love auto-restarting services. | |
Exploitation of vulnerabilities isn’t always 100% reliable. Heap
grooming might be limited or otherwise inadequate. | |
A quick automatic restart keeps them in business without any other | |
human interaction involved. | |
teiferer wrote 1 day ago: | |
Took me a minute to realize what you meant with "hackers". Quite the | |
irony, given the name of the site we are having this conversation on. | |
HexDecOctBin wrote 1 day ago: | |
How does restarting the process fix the crash? If the process crashed | |
because a file was missing, it will still be missing when the process | |
is restarted. Is an infinite crash-loop considered success in Erlang? | |
jlouis wrote 1 day ago: | |
It's not going to be missing the next time around. Usually the file | |
is missing due to some concurrency-problem where the file only gets | |
to exist a little later. A process restart certainly fixes this. | |
If the problem persists, a larger part of the supervision tree is | |
restarted. This eventually leads to a crash of the full application, | |
if nothing can proceed without this application existing in the | |
Erlang release. | |
The key point is that there's a very large class of errors which is | |
due to the concurrent interaction of different parts of the system. | |
These problems often go away on the next try, because the risk of | |
them occurring is low. | |
conradfr wrote 1 day ago: | |
If the rest of the program is still running while you fix it, yes? | |
Also, restarting endlessly is just one strategy between multiple | |
others. | |
victorbjorklund wrote 1 day ago: | |
Elixir dev: It does not solve all issues. But sometimes you have some | |
kind of rare bug that just happens once X, Z and Y happen in a | |
specific order. If it is restarted it might not happen that way | |
again. Or it might be a temporary problem. You are reaching for an | |
API and it temporarily has issues. It might not have it anymore in 50 | |
ms. | |
But of course if it crashes because you are reading a file that does | |
not exist it doesn't solve the issue (but it avoids crashing the whole | |
system). | |
victorbjorklund wrote 1 day ago: | |
Note that "let it crash" doesn't mean we shouldn't fix bugs. It is more | |
that if there is a bug we haven't fixed, it is better for the | |
crash to take down a tiny part of the program than the whole program. | |
worthless-trash wrote 10 hours 17 min ago: | |
Or more importantly, you can't design robust recovery and retry | |
systems. | |
masklinn wrote 1 day ago: | |
> Is an infinite crash-loop considered success in Erlang? | |
Of course not, but usually that's not what happens, instead a process | |
crashes because some condition was not considered, the corresponding | |
request is aborted, and a supervisor restarts the process (or doesn't | |
because the acceptor spawns a process per request / client). | |
Or a long-running worker got into an incorrect state and crashed, and | |
a supervisor will restart it in a known good state (that's a pretty | |
common thing to do in hardware, BEAM makes that idiomatic in | |
software). | |
gopher_space wrote 1 day ago: | |
Both of your examples look like infinite crash-loops if your work | |
needs to be correct more than it needs to be available. E.g. there | |
aren't any known good states prior to an unexpected crash, you're | |
just throwing a hail mary because the alternatives are impractical. | |
dmsnell wrote 1 day ago: | |
When a process crashes, its supervisor restarts it according to | |
some policy. The policy specifies whether to restart the sibling | |
processes in their startup order or only the crashed process. | |
But a supervisor also sets limits, like "10 restarts in a | |
timespan of 1 second." Once the limits are reached, the | |
supervisor crashes. Supervisors have supervisors. | |
In this scenario the fault cascades upward through the system, | |
triggering more broad restarts and state-reinitializations until | |
the top-level supervisor crashes and takes the entire system down | |
with it. | |
An example might be losing a connection to the database. It's | |
not an expected fault to fail while querying it, so you let it | |
crash. That kills the web request, but then the web server ends | |
up crashing too because too many requests failed, then a task | |
runner fails for similar reasons. The logger is still reporting | |
all this because it's a separate process tree, and the | |
top-level app supervisor ends up restarting the entire thing. It | |
shuts everything off, tries to restart the database connection, | |
and if that works everything will continue, but if not, the | |
system crashes completely. | |
Expected faults are not part of "let it crash." E.g. if a | |
user supplies a bad file path or network resource. The | |
distinction is subjective and based around the expectations of | |
the given app. Failure to read some asset included in the | |
distribution is both unlikely and unrecoverable, so "let it | |
crash" allows the code to be simpler in the happy path without | |
giving up fault handling or burying errors deeper into the app or | |
data. | |
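The restart limits described above ("10 restarts in a timespan of 1 second", after which the supervisor itself crashes) can be sketched outside the BEAM. This is only a rough single-threaded Python analogue, not OTP's implementation: `supervise`, `RestartLimitExceeded` and `make_flaky` are hypothetical names, and a real supervisor also manages sibling processes and restart strategies.

```python
import time
from collections import deque


class RestartLimitExceeded(Exception):
    """Raised when the child crashed too often inside the time window."""


def supervise(child, max_restarts=10, window_seconds=1.0, clock=time.monotonic):
    """Run `child` until it returns, restarting it whenever it raises.

    Gives up by raising once `max_restarts` restarts already fall inside
    `window_seconds`, so that a parent supervisor can take over.
    """
    restarts = deque()  # timestamps of recent restarts
    while True:
        try:
            return child()
        except Exception as exc:
            now = clock()
            # Forget restarts that have fallen out of the sliding window.
            while restarts and now - restarts[0] > window_seconds:
                restarts.popleft()
            if len(restarts) >= max_restarts:
                raise RestartLimitExceeded(
                    f"{max_restarts} restarts within {window_seconds}s"
                ) from exc
            restarts.append(now)


def make_flaky(failures):
    """Build a child that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def child():
        if state["left"] > 0:
            state["left"] -= 1
            raise ConnectionError("transient fault")
        return "ok"

    return child
```

A transient fault is absorbed by the restarts, while a persistent one exhausts the limit and escalates to whoever called `supervise`, which is the role a parent supervisor plays in the description above.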
Muromec wrote 1 day ago: | |
If it has no good states you probably know it before deploying to | |
production. | |
masklinn wrote 1 day ago: | |
> there aren't any known good states prior to an unexpected crash | |
If there aren't any good states then the program straight up | |
doesn't work in the first place, which gets diagnosed pretty | |
quickly before it hits the field. | |
> your work needs to be correct more than it needs to be | |
available. | |
"correctness over availability" tends to not be a thing, if you | |
assume you can reach perfect and full correctness then either you | |
never release or reality quickly proves you wrong in the field. | |
So maximally resilient and safe systems generally plan for errors | |
happening and how to recover from them instead of assuming they | |
don't. There are very few fully proven non-trivial programs, and | |
there were even fewer 40 years ago. | |
And Erlang / BEAM was designed in a telecom context, so | |
availability is the prime directive. Which is also why | |
distribution is built-in: if you have a single machine and it | |
crashes you have nothing. | |
corysama wrote 1 day ago: | |
I'm only an armchair expert on Erlang. But, having looked into it | |
repeatedly for a couple decades, my take-away is the "Let it | |
crash" slogan is good. But, also presented a bit out of context. | |
Or, at least assuming context that most people don't have. | |
Erlang is used in situations involving a zillion incoming requests. | |
If an individual request fails... Maybe it was important. Maybe it | |
wasn't. If it was important, it's expected they'll try again. | |
What's most important is that the rest of the requests are not | |
interrupted. | |
What makes Erlang different is that it is natural and trivial to be | |
able to shut down an individual request on the event of an error | |
without worrying about putting any other part of the system into a | |
bad state. | |
You can pull this off in other languages via careful attention to the | |
details of your request-handling code. But, the creators of the | |
Erlang language and foundational frameworks have set their users up | |
for success via careful attention to the design of the system as a | |
whole. | |
That's great in the contexts in which Erlang is used. But, in the | |
context of a Java desktop app like Open Office, it's more like | |
saying "Let it throw". "It" being some user action. And,…
slogan being to have a language and framework with such robust | |
exception handling built-in that error handling becomes trivial and | |
nearly invisible. | |
Quekid5 wrote 1 day ago: | |
> You can pull this off in other languages via careful attention to | |
the details of your request-handling code. But, the creators of the | |
Erlang language and foundational frameworks have set their users up | |
for success via careful attention to the design of the system as a | |
whole. | |
+10. So many people miss this very important point. If you have | |
lots of mutable shared state, or can accidentally leak such into | |
your actor code then the whole actor/supervision tree thing falls | |
over very easily... because you can't just restart any actor | |
without worrying about the rest of the system. | |
I think this is a large (but not the only[0]) part of why | |
actors/supervisors haven't really caught on anywhere outside of | |
Erlang, even for problem spaces where they would be suitable. | |
[0] I personally feel the model is very hard to reason about | |
compared to threaded/blocking straight-line code using e.g. | |
structured concurrency, but that may just be a me thing. | |
asa400 wrote 22 hours 21 min ago: | |
I have worked Elixir/Erlang and Rust a lot, and I agree. Rust in | |
particular gives ownership semantics to threaded/blocking/locking | |
code, which I often times find _much_ easier to understand than a | |
series of messages sent between tasks/processes in Elixir/Erlang. | |
However, in a world where you have to do concurrent | |
blocking/locking code without the help of rigorous | |
compiler-enforced ownership semantics, Elixir/Erlang is like | |
water in the desert. | |
jfengel wrote 22 hours 52 min ago: | |
The alternative to straight-line code used to be called | |
"spaghetti code". | |
There was a joke article parodying "GOTO considered harmful" by | |
suggesting a "COME FROM" command. But in a lot of ways, that's | |
exactly what many modern frameworks and languages aim for. | |
Quekid5 wrote 22 hours 0 min ago: | |
Haha... be the change! Program in INTERCAL! :) | |
nine_k wrote 1 day ago: | |
Let it crash, so that if something goes wrong, it does not do so | |
silently. | |
Let it crash, because a relevant manager will detect it, report it, | |
clean it up, and restart it, without you having to write a line of | |
code for that. | |
Let it crash as soon as possible, so that any problem (like a crash | |
loop) is readily visible. It's very easy to replace arbitrary bits | |
of Erlang code in a running system, without affecting the rest of | |
it. "Fix it in prod" is better than "miss it in prod", especially | |
when you cannot stop the prod ever. | |
0x445442 wrote 1 day ago: | |
Are individual agents deployable on their own or does the entire | |
"app" of agents need to be deployed as a single group? If | |
individually deployable, what does this look like from a version | |
control and a CI/CD perspective? | |
nine_k wrote 23 hours 38 min ago: | |
To the best of my knowledge: yes, individual parts are | |
deployable separately, within reason. No, there is explicitly no | |
need to deploy the whole thing at once, and especially to shut | |
it down all at once. | |
Erlang works by message passing and duck typing, so, as long as | |
your interfaces are compatible (backwards or forwards), you can | |
alter the implementation, and evolve the interfaces. Think | |
microservices, but when every function can be a microservice, | |
at an absolutely trivial cost. | |
ramchip wrote 1 day ago: | |
I recommend [1] starting from "if my configuration file is corrupted, | |
restarting won't fix anything". The tl;dr is it helps with transient | |
bugs. | |
[1]: https://ferd.ca/the-zen-of-erlang.html | |
bccdee wrote 23 hours 29 min ago: | |
> if you feel that your well-understood regular failure case is | |
viable, then all your error handling can fall-through to that case. | |
This is my favourite line, because it generalizes the underlying | |
principle beyond the specific BEAM/OTP model in a way that carries | |
over well to the more common sort of database-backed services that | |
people tend to write. | |
kimi wrote 1 day ago: | |
...and does no harm for unfixable bugs. It's the logical equivalent | |
of "switch off and on again" that as we know fixes most issues by | |
itself, but happening only on a part of your software deployment, | |
so most of it will keep running. | |
lawn wrote 1 day ago: | |
Typically you then let the error bubble up in the supervisor tree if | |
restarting multiple times doesn't fix it. | |
Of course there are still errors that can't be recovered from, in | |
which case the whole program may finally crash. | |
dns_snek wrote 1 day ago: | |
> in which case the whole program may finally crash. | |
This may happen if you let it, but it's basically never the desired | |
outcome. If you were handling a user request, it should stop by | |
returning a HTTP 500 to the client, or if you were processing a | |
background job of some sort, it should stop with a watchdog process | |
marking the job as a failure, not with the entire system crashing. | |
Muromec wrote 1 day ago: | |
returning HTTP 500 as early as possible is an example of "let it | |
crash" approach outside of Erlang. | |
dns_snek wrote 17 hours 52 min ago: | |
That's not what "let it crash" is about. Letting something | |
crash in Erlang means that a process (actor) is allowed to | |
crash, but then it gets restarted to try again, which would | |
resolve the situation in case of transient errors. | |
The equivalent of "let it crash" outside of Erlang is a | |
mountain of try-catch statements and hand-rolled retry wrappers | |
with time delays, with none of the observability and tooling | |
that you get in Erlang. | |
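For readers who have not written one, a hand-rolled retry wrapper with time delays of the kind mentioned above might look roughly like this in Python. It is a minimal sketch: `retry` and its parameters are hypothetical names, and the 50 ms default only echoes the figure used earlier in the thread.

```python
import time


def retry(operation, attempts=3, delay_seconds=0.05,
          retry_on=(Exception,), sleep=time.sleep):
    """Call `operation` up to `attempts` times, pausing between failures."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the failure
            sleep(delay_seconds)


# Usage: an operation that fails twice before succeeding.
calls = {"count": 0}


def flaky_api_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("API temporarily unavailable")
    return "response"


result = retry(flaky_api_call, attempts=3, sleep=lambda seconds: None)
```

Every call site that needs this has to opt in and pick its own limits, which is the contrast being drawn with supervisors that apply one restart policy from outside the failing code.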
adastra22 wrote 1 day ago: | |
"Reset on error" might be a better phrasing. | |
refactor_master wrote 1 day ago: | |
Question as a complete outsider: If I run idempotent Python | |
applications in Kubernetes containers and they crash, Kubernetes will | |
eventually restart them. Of course, knowing what to do on IO errors is | |
nicer than destroying and restarting everything with a much bigger | |
hammer (as the article also mentions, you can serve a better error | |
message for whoever has to "deal" with the problem), but eventually | |
they should end up in the same workable state. | |
Is this conceptually similar, but perhaps at code-level instead? | |
valenterry wrote 1 day ago: | |
In general, if you can move any kind of logic to a lower level, | |
that's better. | |
For example, testing that kubernetes restarts work correctly is | |
tricky and requires a complicated setup. Testing that an erlang | |
process/actor behaves as expected is basically a unit test. | |
emoII wrote 1 day ago: | |
I bet the kubernetes project has tests for that, why should I as an | |
application developer care about testing something other than my | |
own code? | |
bccdee wrote 18 hours 18 min ago: | |
That's assuming your code is well-configured. How do you test | |
your k8s configs? | |
valenterry wrote 1 day ago: | |
Oh of course, I'm sure the kubernetes project tests that they | |
trigger restarts correctly etc. | |
But that doesn't cover the behavior of your app, the specific | |
configuration you ask kubernetes to use and how the app uses its | |
health endpoints etc. - this is all purely about your own | |
code/config, the kubernetes team can't test that. | |
adastra22 wrote 1 day ago: | |
Conceptually similar, different implementation. Perhaps the most | |
visible difference is that supervisors aren't polling application | |
state but are rather notified about errors (crashes), and restarting | |
is extremely low latency. Erlang/BEAM was invented for telephony, | |
and it is possible for this to happen in the middle of a protocol and | |
the user not even notice. | |
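The difference between polling and being notified can be illustrated with a small Python sketch. It uses the standard library's `concurrent.futures` rather than anything BEAM-specific, and `run_supervised` and `crashing_worker` are hypothetical names: the point is only that the crash is pushed to a handler the moment it happens, instead of being discovered by a periodic health check.

```python
from concurrent.futures import ThreadPoolExecutor


def run_supervised(pool, worker, on_crash):
    """Start `worker` on `pool` and call `on_crash(exc)` if it raises."""
    future = pool.submit(worker)

    def notify(done):
        # Runs once when the worker finishes; nobody polls for its state.
        exc = done.exception()
        if exc is not None:
            on_crash(exc)

    future.add_done_callback(notify)
    return future


def crashing_worker():
    raise RuntimeError("protocol state went bad")


crashes = []
with ThreadPoolExecutor(max_workers=2) as pool:
    run_supervised(pool, crashing_worker, crashes.append)
    run_supervised(pool, lambda: "fine", crashes.append)
# Leaving the `with` block waits for both workers and their callbacks.
```

Here `on_crash` only records the exception; a supervisor in the sense discussed above would decide whether and how to restart the worker at that point.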
goosejuice wrote 1 day ago: | |
Somewhat, yes, but it's much less powerful. In the BEAM these are | |
trees of supervisors and monitors/links that choose how to restart | |
and receive the stacktrace/error reason of the failure respectively. | |
This gives a lot of freedom on how to handle the failure. In k8s, | |
it's often just a dumb monitor/controller that knows little about how | |
to remediate the issue on boot. Never mind the boot time penalty. [1] | |
BEAM apps run great on k8s. | |
[1]: https://hexdocs.pm/elixir/1.18.4/Supervisor.html | |
goosejuice wrote 1 day ago: | |
[1] The origin, as far as I know it. I think it still holds, is | |
insightful, as a general case. Let it heal seems pretty close to what | |
Joe was getting at. | |
[1]: https://erlang.org/pipermail/erlang-questions/2003-March/00787... | |
vrnvu wrote 1 day ago: | |
>>This organization corresponds nicely to a idealized human | |
organization of bosses and workers - bosses say what is to be done, | |
workers do stuff. Bosses do quality control and check that things get | |
done, if not they fire people re-organize and tell other people to do | |
the stuff. If they fail (the bosses) they get sacked etc. <> | |
We miss you Joe :) | |
tombert wrote 1 day ago: | |
He was one of my favorite humans; the few emails I exchanged with | |
him were funny and insightful. | |
bitwize wrote 1 day ago: | |
I don't code in Erlang or Elixir, aside from messing about. But I've | |
found that letting an entire application crash is something that I can | |
do under certain circumstances, especially when "you have a very big | |
problem and will not go to space today". For example, if there's an | |
error reading some piece of data that's in the application bundle and | |
is needed to legitimately start up in the first place (assets for my | |
game for instance). Then upon error it just "screams and dies" (spits | |
out a stack trace and terminates). | |
borromakot wrote 1 day ago: | |
Errors during initialization of a BEAM language application will | |
crash the entire program, and you can decide to exit/crash a program | |
if you get into some unrecoverable state. The important thing is the | |
design of individual crashable/recoverable units. | |
bgdkbtv wrote 1 day ago: | |
This is great, thanks for sharing! I've been thinking about improving | |
error handling in my liveview app and this might be a nice way to | |
start. | |