Hacker News on Gopher (unofficial)
COMMENT PAGE FOR:
Don't “let it crash”, let it heal
Jtsummers wrote 21 hours 57 min ago:
I think a lot of folks who have never looked at Erlang or Elixir and
BEAM before misunderstand this concept because they don't understand
how fine-grained processes are, or can be, in Erlang. A very important
note: Processes in BEAM languages are cheap, both to create and for
context switching, compared to OS threads. While design-wise they offer
similar capabilities, this cost difference results in a substantially
different approach to design in Erlang than in systems where the cost
of introducing and switching between threads is more expensive.
In a more conventional language where concurrency is relatively
expensive, and assuming you're not an idiot who writes 1-10k SLOC
functions, you end up with functions that have a "single
responsibility" (maybe not actually a single responsibility, but closer
to it than having 100 duties in one function) near the bottom of your
call tree, but they all exist in one thread of execution. In a
hypothetical system created in this model, if your lowest-level
function is something like:
retrieve_data(db_connection, query_parameters) -> data
And the database connection fails, would you attempt to restart the
database connection in this function? Maybe, but that'd be bad design.
You'd most likely raise an exception or change the signature so you
could express an error return, in Rust and similar it would become
something like:
retrieve_data(db_connection, query_parameters) -> Result
Somewhere higher in the call stack you have a handler which will catch
the exception or process the error and determine what to do. That is,
the function `retrieve_data` crashes, it fails to achieve its objective
and does not attempt any corrective action (beyond maybe a few retries
in case the error is transient).
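In Elixir the same shape is written with tagged tuples rather than a Result type. A minimal sketch (the function and error reasons here are hypothetical):

```elixir
defmodule Repo do
  # Hypothetical low-level fetch: it reports failure as a tagged tuple
  # and makes no attempt at corrective action itself. The caller,
  # higher in the call tree, decides the policy.
  def retrieve_data(db_connection, query_parameters) do
    case db_connection do
      :connected -> {:ok, {:rows_for, query_parameters}}
      _ -> {:error, :connection_failed}
    end
  end
end
```

A caller that writes `{:ok, data} = Repo.retrieve_data(conn, params)` crashes on any error, which is exactly the failure that gets handed to the supervisor.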
In Erlang, you have a supervision tree which corresponds to this call
tree concept but for processes. The process handling data retrieval,
having been given some db_conn handler and the parameters, will fail
for some reason. Instead of handling the error in this process, the
process crashes. The failure condition is passed to the supervisor
which may or may not have a handler for this situation.
You might put the simple retry policy in the supervisor (that basic
assumption of transient errors, maybe a second or third attempt will
succeed). It might have other retry policies, like trying the request
again but with a different db_connection (that other one must be bad
for some reason, perhaps the db instance it references is down). If it
continues to fail, then this supervisor will either handle the error
some other way (signaling to another process that the db is down, fix
it or tell the supervisor what to do) or perhaps crash itself. This
repeats all the way up the supervision tree, ultimately it could mean
bringing down the whole system if the error propagates to a high enough
level.
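As a rough sketch of that tree in Elixir (module names and limits invented for illustration), a supervisor declares its retry policy and crashes itself once the policy is exhausted:

```elixir
defmodule FetchWorker do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  def init(opts), do: {:ok, opts}

  # If the db call inside handle_call raises, this process crashes and
  # the supervisor below applies its restart policy.
  def handle_call({:fetch, params}, _from, state) do
    {:reply, {:ok, params}, state}
  end
end

defmodule FetchSupervisor do
  use Supervisor

  def start_link(opts), do: Supervisor.start_link(__MODULE__, opts, name: __MODULE__)

  def init(_opts) do
    children = [FetchWorker]
    # :one_for_one restarts only the crashed child; failing more than
    # max_restarts times in max_seconds crashes this supervisor too,
    # propagating the error up the tree.
    Supervisor.init(children, strategy: :one_for_one, max_restarts: 3, max_seconds: 5)
  end
end
```

Killing the worker and watching a fresh pid appear is the whole "let it crash" loop in miniature.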
This is conceptually no different than how errors and exceptions are
handled in sequential, non-concurrent systems. You have handlers that
provide mechanisms for retrying or dealing with the errors, and if you
don't the error is propagated up (hopefully you don't continue running
in a known-bad state) until it is handled or the program crashes
entirely.
In languages that offer more expensive concurrency (traditional OS
threads), the cost of concurrency (in memory and time) means you end up
with a policy that sits somewhere between Erlang's and a straight-line
sequential program. Your threads will be larger than Erlang processes
so they'll include more error handling within themselves, but
ultimately they can still fail and you'll have a supervisor of some
sort that determines what happens next (hopefully).
As more languages move to cheap concurrency (Go's goroutines, Java's
virtual threads), system designs have a chance to shift closer to
Erlang than that straight-line sequential approach if people are
willing to take advantage of it.
juped wrote 1 day ago:
There's really not more that's useful to say than the relevant section
(4.4) of Joe Armstrong's thesis says:
>How does our philosophy of handling errors fit in with coding
practices? What kind of code must the programmer write when they find
an error? The philosophy is let some other process fix the error, but
what does this mean for their code? The answer is let it crash. By this
I mean that in the event of an error, then the program should just
crash. But what is an error? For programming purpose we can say that:
>• exceptions occur when the run-time system does not know what to
do.
>• errors occur when the programmer doesn’t know what to do.
>If an exception is generated by the run-time system, but the
programmer had foreseen this and knows what to do to correct the
condition that caused the exception, then this is not an error. For
example, opening a file which does not exist might cause an exception,
but the programmer might decide that this is not an error. They
therefore write code which traps this exception and takes the necessary
corrective action.
>Errors occur when the programmer does not know what to do. Programmers
are supposed to follow specifications, but often the specification does
not say what to do and therefore the programmer does not know what to
do.
>[...]
>The defensive code detracts from the pure case and confuses the
reader—the diagnostic is often no better than the diagnostic which
the compiler supplies automatically.
Note that this "program" is a process. For a process doing work,
encountering something it can't handle is an error per the above
definitions, and the process should just die, since there's nothing
better for it to do; for a supervisor process supervising such
processes-doing-work, "my child process exited" is an exception at
worst, and usually not even an exception since the standard library
supervisor code already handles that.
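Beneath the standard supervisor behaviours, "my child process exited" is literally just a message. A minimal sketch with raw links:

```elixir
# Trapping exits turns a crashing linked child into an ordinary
# message in this process's mailbox instead of a cascading exit.
Process.flag(:trap_exit, true)
child = spawn_link(fn -> exit(:boom) end)

reason =
  receive do
    {:EXIT, ^child, why} -> why
  after
    1_000 -> :timeout
  end
```

The OTP supervisor behaviour packages exactly this mechanism together with restart policies.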
stcg wrote 1 day ago:
"Let it crash" is a sentence that gets attention. It makes a person
want to know more about it, as it sounds controversial and different.
"Let it heal" doesn't have that.
jonhohle wrote 1 day ago:
It also has a deeper philosophical meaning: unexpected software
bugs should be noisy and obvious instead of causing silent
corruption or a misleading user experience. If monitoring doesn’t
catch the failure, customers will and it can be fixed right away
(whether it’s the software, a hardware error, dependency issue,
etc.).
A web service returning a 500 error code is a lot more obvious than a
200 with an invalid payload. A crashed app with a stack trace is
easier to debug and will cause more user feedback than an app that
hangs in a retry loop.
When I had to deal with these things in the Java world, it meant not
blindly handling or swallowing exceptions that business code had no
business caring about. Does your account management code really think
it knows how to properly handle an InterruptedException? Unless your
answer is rollback and reset the interrupted flag it’s probably
wrong. Can’t write a test for a particular failure scenario? That
better blow up loudly with enough context that makes it possible to
understand the error condition (and then write a test for it).
tmcb wrote 1 day ago:
It is very common to interpret taglines at face value, and I
believe the author did just that, although the point brought up is
valid.
In order to “let it crash”, we must design the system in a way that
crashes would not be catastrophic, stability-wise. Letting it crash is
not a commandment, though: it is a reminder that, in most cases, a
smart healing strategy might be overkill.
borromakot wrote 1 day ago:
Author: I'm literally explaining not to interpret the tag line at
face value.
Muromec wrote 1 day ago:
Yeah, but it's an internet forum, and for opinion pieces people first
read the comments and then maybe read the article if it's interesting.
tmcb wrote 1 day ago:
I actually skimmed the article before posting. I have some
exposure to Erlang, but not to Elixir. As I’ve already
mentioned, I think the author’s coverage of application
behavior is OK, but there is more to the tagline than meets the
eye.
tmcb wrote 1 day ago:
Maybe I didn’t make myself clear. “Let it crash” is not
something that should be thought of at the component level, it
should be thought of at the system level. The fact that the
application crashes “gracefully” or not is not what is really
important. You should design the system in a crash-friendly way,
and not to write the application and think: “oh, I believe it is
OK to let it crash here”.
borromakot wrote 1 day ago:
Then I don't think you understand how the phrase is used in
Elixir/Erlang. The phrase is about letting processes crash.
tmcb wrote 1 day ago:
No need for the snarky comment. If I am wrong, that is fine.
Of course Joe Armstrong could explain what I meant, but in a
much better way: [1] (edit: see the "Why was error handling
designed like this?" part for reference)
My personal interpretation is that systems must be able to
handle crashing processes gracefully. There is no benefit in
letting processes crash just for the sake of it.
[1]: https://erlang.org/pipermail/erlang-questions/2003-Mar...
tmcb wrote 1 day ago:
Actually, now that I've thought about it, I know exactly what irked
me about the approach. I hope the author takes it as
constructive feedback:
Saying "let it crash is a tagline that actually means
something else because the BEAM is supposed to be used in
this particular way" sounds slightly "cargo-cultish", to the
point where we have to challenge the meaning of the actual
word to make sense of it.
Joe Armstrong's e-mail, on the other hand, says (and I
paraphrase): "the BEAM was designed from the ground up to
help developers avoid the creation of ad-hoc protocols for
process communication, and the OTP takes that into
consideration already. Make sure your system, not your
process, is resilient, and literally let processes crash."
Boom. There is no gotcha there. Also, there is the added
benefit that developers for other platforms now understand
that the rationale is justified by the way BEAM/OTP were
designed and may not be applicable to their own platforms.
borromakot wrote 23 hours 56 min ago:
If I sounded snarky that wasn't my intention. At the end of
the day though it doesn't feel like you read the article
which was clearly in a different context than the one in
which you responded. FWIW I didn't expect this small
article speaking to a small audience (Elixir devs) to make
the rounds on hacker news.
I agree on the importance of defining terms, and I think
the important thing here is that "process" in Joe's
parlance is not an OS level process, it is one of a fleet
of processes running inside the BEAM VM. And the "system"
in this case is the supervisory system around it, which
itself consists of individual processes.
I'm critiquing a common misunderstanding of the phrase "Let
it crash", whereby effectively no local error handling is
performed. This leads to worse user experiences and worse
outcomes in general. I understand that you're offering
critique, but it again sounds like you're critiquing a
reductive element (the headline itself).
tmcb wrote 23 hours 29 min ago:
I did read the article. I concede that I might not have
understood it. Again, I never said it is wrong, but
rather that it has a blind spot. I am familiar with Joe
Armstrong’s work because I worked on a proprietary (and
rather worse tbf) native distributed systems middleware
in the past.
IshKebab wrote 1 day ago:
Ah this makes sense. I always thought "let it crash" made it sound like
Elixir devs just don't bother with error checking, like writing Java
without any `catch`es, or writing Rust that only uses `.unwrap()`.
If they just mean "processes should be restartable" then that sounds
way more reasonable. Similar idea to this but less fancy: [1] It's a
pretty terrible slogan if it makes your language sound worse than it
actually is.
[1]: https://flawless.dev/
JonChesterfield wrote 14 hours 29 min ago:
Flawless is interesting.
It can't work in the general case because replaying a sequence of
syscalls is not sufficient to put the machine back in the same state
as it was last time. E.g. the second time around, open() behaves
differently, so you need to follow the error-handling path.
However sometimes that approach would work. I wonder how wide the
area of effective application is. It might be wide enough to be very
useful. The all or nothing database transaction model fits it well.
bccdee wrote 23 hours 24 min ago:
I've been seeing a lot of these durable workflow engines around
lately, for some reason. I'm not sure I understand the pitch. It just
seems like a thin wrapper around some very normal patterns for
running background jobs. Persist your jobs in a db, checkpoint as
necessary, periodically retry. I guess they're meant to be a low-code
alternative to writing the db tables yourself, but it seems like
you're not saving much code in practice.
vendiddy wrote 1 day ago:
I think the slogan was meant to be provocative but unfortunately it
has been misinterpreted more often than not.
For example, imagine you're working with a 3rd party API and,
according to the documentation, it is supposed to return responses in
a certain format. What if suddenly that API stops working? Or what if
the format changes?
You could write code to handle that "what if" scenario, but in
trying to handle every hypothetical, your code becomes bloated, more
complicated, and harder to understand.
So in these cases, you accept that the system will crash. But to
ensure reliability, you don't want to bring down the whole system. So
there are primitives that let you control the blast radius of the
crash if something unexpected happens.
Let it crash does not mean you skip validating user input. Those are
issues that you expect to happen. You handle those just as you would
in any programming language.
zmgsabst wrote 1 day ago:
I think it’s more subtle:
Imagine that you’re trying to access an API, which for some reason
fails.
“Let it crash” isn’t an argument against handling the timeout,
but rather that you should only retry a few, bounded times rather
than (eg) exponentially back off indefinitely.
When you design from that perspective, you just fail your request
processing (returning the request to the queue) and make that your
manager’s problem. Your managing process can then restart you,
reassign the work to healthy workers, etc. If your manager can’t
get things working and the queue overflows, it throws it into dead
letters and crashes. That might restart the server, it might page
oncall, etc.
The core idea is that within your business logic is the wrong place
to handle system health — and that many problems can be solved by
routing around problems (ie, give task to a healthy worker) or
restarting a process. A process should crash when it isn’t scoped
to handle the problem it’s facing (eg, server OOM, critical
dependency offline, bad permissions). Crashing escalates the problem
until somebody can resolve it.
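That bounded-retry shape can be sketched as follows (the function and limit are illustrative); once the attempts are spent it raises, and the failure becomes the supervising process's problem:

```elixir
defmodule BoundedRetry do
  # Retry a bounded number of times, then crash and let the
  # supervisor decide what happens next.
  def call(fun, attempts \\ 3)

  def call(_fun, 0), do: raise("giving up; escalating to supervisor")

  def call(fun, attempts) do
    case fun.() do
      {:ok, result} -> {:ok, result}
      {:error, _reason} -> call(fun, attempts - 1)
    end
  end
end
```

Note the deliberate absence of indefinite backoff: the whole point is to fail fast enough for the manager to route around the problem.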
johnisgood wrote 1 day ago:
As someone has linked it: [1] It is about self-healing, too.
[1]: https://erlang.org/pipermail/erlang-questions/2003-March/007...
PicassoCTs wrote 1 day ago:
[1] Railway orientated programming to the rescue?
[1]: https://fsharpforfunandprofit.com/rop/
adregan wrote 51 min ago:
There are a couple of patterns for accomplishing this in Elixir.
One is to build multiple function heads that pattern match on the
arguments. If it’s an error tuple, pass it along. Build up your
pipeline and handle any errors at the end.
Another is to use the `with else`[0] expression for building up a
railroad. This has the benefit of not having to teach your functions
how to pass along errors. Error handling in the else block can be a
little gnarly.
I find it a little more manual than languages that have a `runEffect`
or compose operator. In large part that’s due to the :ok, :error
tuples being more of a convention than a primitive like
Either/Result.
0:
[1]: https://elixirschool.com/en/lessons/basics/control_structure...
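Both patterns can be sketched like this (function names invented for illustration):

```elixir
defmodule Railway do
  # Pattern 1: each step passes an error tuple straight through.
  def parse({:error, _} = err), do: err
  def parse({:ok, raw}) when is_binary(raw), do: {:ok, String.trim(raw)}

  def validate({:error, _} = err), do: err
  def validate({:ok, ""}), do: {:error, :empty}
  def validate({:ok, value}), do: {:ok, value}

  def run_piped(input), do: input |> parse() |> validate()

  # Pattern 2: `with` short-circuits on the first non-matching step,
  # and the `else` block handles whatever fell through.
  def run_with(raw) do
    with {:ok, trimmed} <- parse({:ok, raw}),
         {:ok, value} <- validate({:ok, trimmed}) do
      {:ok, value}
    else
      {:error, reason} -> {:error, reason}
    end
  end
end
```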
hesus_ruiz wrote 1 day ago:
It is very strange that a post trying to explain the concept of "let it
crash" in Elixir (which runs on the BEAM VM) does not mention the
doctoral thesis of Joe Armstrong: "Making reliable distributed systems
in the presence of software errors".
It should be compulsory reading for anybody interested in reliable
systems, even if they do not use the BEAM VM.
[1]: https://www.diva-portal.org/smash/record.jsf?pid=diva2%3A10420...
plainOldText wrote 21 hours 11 min ago:
Some core ideas from the paper for the impatient (failures,
isolation, healing):
- Failures are inevitable, so systems must be designed to EXPECT and
recover from them, NOT AVOID them completely.
- Let it crash philosophy allows components to FAIL and RECOVER
quickly using supervision trees.
- Processes should be ISOLATED and communicate via MESSAGE PASSING,
which prevents cascading failures.
- Supervision trees monitor other processes and RESTART them when
they fail, creating a self-healing architecture.
anthk wrote 1 day ago:
Unix/BSD -> Crash, fix, restart.
GNU/MIT/Lisp -> Detect, offer a fix, continue.
atoav wrote 1 day ago:
The truth is that different errors have to lead to different results if
you want a good organisational outcome. These could be:
- Fundamental/Fatal error: something without the process cannot
function, e.g. we are missing an essential config option. Exiting with
an error is totally adequate. You can't just heal from that as it would
involve guessing information you don't have. Admins need to fix it
- Critical error: something that should not ever occur, e.g. having an
active user without a password and email. You don't exit; you skip it if
that is possible and ensure the first occurrence is logged and admins
are contacted.
- Expected/Regular error: something that is expected to happen during
the normal operations of the service, e.g. the other server you make
requests to is being restarted and thus unreachable. Here the strategy
may vary, but it could be something like retrying with random
exponential backoff. Or you could briefly accept the values provided by
that server are unknown and periodically retry to fill the unknown
values. Or you could escalate that into a critical error after a
certain amount of retries.
- Warnings: These are usually about something being not exactly ideal,
but do not impede the flow of the program at all. Usually has to
do with bad data quality
If you can proceed without degrading the integrity of the system you
should; the next thing is to decide how important it is for humans to
hear about it.
praptak wrote 1 day ago:
A condition that "should not happen" might still be a problem specific
to a particular request. If you "just crash" it turns this request from
one that only triggers a http 500 response to one that crashes the
process. This increases the risk of Query of Death scenarios where the
frontend that needs to serve this particular request starts retrying it
with different backends and triggers restarts faster than the processes
come back up.
So being too eager to "just crash" may turn a scenario where you fail
to serve 1% of requests into a scenario where you serve none because
all your processes keep restarting.
sarchertech wrote 1 day ago:
> If you "just crash" it turns this request from one that only
triggers a http 500 response to one that crashes the process.
In phoenix each request has its own process and crashing that process
will result in a 500 being sent to the client.
davidclark wrote 1 day ago:
You should try to do some load testing of a real Erlang system and
compare how it handles this scenario against other
languages/frameworks. What you are describing is one of the exact
things the Erlang system is strong against due to the scheduler.
rtpg wrote 1 day ago:
My impression is that in Erlang land each process handler is really
cheap so you can just keep on showing up with process handlers and
not reach exhaustion like you do with other systems (at least in
pre-async worlds...)
josevalim wrote 1 day ago:
Processes can be marked as temporary, which means they are not
restarted, and that’s what is used when managing http connections,
as you can’t really restart a request on the server without the
client. So the scenario above wouldn’t happen.
You still want those processes to crash though, as it allows it to
automatically clean up any concurrent work. For example, if during a
request you start three processes to do concurrent work, like
fetching APIs, then the request process crashes, the concurrent
processes are automatically cleaned up.
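In a child spec this policy is the `restart: :temporary` option; a sketch of a per-request worker (the module here is hypothetical):

```elixir
defmodule RequestWorker do
  # :temporary children are never restarted by their supervisor;
  # a crashed request simply ends, since the server cannot retry it
  # without the client anyway.
  use GenServer, restart: :temporary

  def start_link(arg), do: GenServer.start_link(__MODULE__, arg)
  def init(arg), do: {:ok, arg}
end
```

Helper processes started from the request (for instance via `Task.async/1`) are linked to it, so the automatic cleanup on crash described above is ordinary exit propagation, not extra machinery.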
cyberax wrote 1 day ago:
"Let it crash" in Erlang/Elixir means that the process that serves
the request is allowed to crash. It then will be restarted by the
supervisor.
Supervisors themselves form a tree, so for a crash to take down the
whole app, it needs to propagate all the way to the top.
Another explanation for people familiar with exceptions in other
languages: "Don't try to catch the exception inside a request
handler".
zwnow wrote 1 day ago:
This is funny given Elixir/Erlangs whole idea is "let it crash". In
Go I just have a Recovery Middleware for any type of problem. Don't
know how other langs do it tho
davidclark wrote 1 day ago:
I don’t know Go, but that sounds like someone has simply written
part of Erlang in Go.
knome wrote 1 day ago:
erlang doesn't crash the program, it crashes the thread. erlang has
a layered management system built in as part of OTP (open telecom
platform, erlang was built for running highly concurrent telephony
hardware). when a thread crashes, it dies and signals its parent.
the parent then decides what to do. usually, that's just restarting
the worker. maybe if ten workers have crashed in a minute, the
manager itself will die and restart. issues bubble up, and managers
restart subsystems automatically. for some things, like parsing
user data, you might never cause the manager to die, and just
always restart the worker.
the article, if you should choose to read it, is explaining that
people have the misconception you appear to be having due to the
'let it fail' catchphrase. it goes into detail about this system,
when failing is appropriate, and when trying to work around errors
is appropriate.
as erlang uses greenthreads, restarting a thread for a user API is
effectively instant and free.
zwnow wrote 1 day ago:
It's not a misconception given that Elixir Forum and its Discords
members will say that to you. Also I never assumed the whole
program crashed so why would you explain this to me?
Why would one Blog guy know it better than a lot of other Elixir
devs?
sarchertech wrote 1 day ago:
It’s well known among elixir devs that for reasons unknown,
Elixir Forum is populated predominantly by people who don’t
know what they’re talking about.
borromakot wrote 1 day ago:
Blog guy here: I do, in fact, know it better than a lot of
other Elixir devs.
snickerbockers wrote 1 day ago:
>When people say “let it crash”, they are referring to the fact
that practically any exited process in your application will be
subsequently restarted. Because of this, you can often be much less
defensive around unexpected errors. You will see far fewer try/rescue,
or matching on error states in Elixir code.
I just threw up in my mouth when I read this. I've never used this
language so maybe my experience doesn't apply here but I'm imagining
all the different security implications that ive seen arise from
failing to check error codes.
vendiddy wrote 1 day ago:
If you get a chance to read some Elixir/Erlang code you'll see that
pattern matching is used frequently to assert expected error codes.
It does not mean ignore errors.
This is a common misunderstanding because unfortunately the slogan is
frequently misinterpreted.
toast0 wrote 1 day ago:
Ok, so it's not really that you're not checking error codes. It's
that you can write stuff like
ok = whatever().
If whatever is successful and idomatic, it returns ok, or maybe a
tuple of {ok, SomeReturn}. In that case, execution would continue. If
it returns an error tuple like {error, Reason}... "Let it crash" says
you can just let it crash... You didn't have anything better to do,
the built in crash because {error, Reason} will do fine.
Or you could do a
case whatever() of
    ok -> ok;
    {error, nxdomain} -> ok
end.
If it was fine to get nxdomain error, but any other error isn't
acceptable... It will just crash, and that's good or at least ok.
Better than having to enumerate all the possible errors, or having a
catchall that then explicitly throws an error. It's especially hard
to enumerate all possible errors because the running system can
change and may return a new error that wasn't enumerated when the
requesting code was written.
There's lots of places where crashing isn't actually what you want,
and you have to capture all errors, explicitly log it, and then move
on... But when you can, checking for success or success and a handful
of expected and recoverable errors is very nice.
josevalim wrote 1 day ago:
That’s actually a good example. Imagine someone forgot to check the
error code from an API response. In some languages, they may attempt
to parse it as if it were a successful request, and succeed, leading to
a result with nulls, empty arrays, or missing data that then spreads
through the system. In Elixir, parsing would most likely fail thanks
to pattern matching [1] and if by any chance it fails in a core
part of the system, that failure will be isolated and that particular
component can be restarted.
Elixir is not about willingly ignoring error codes or failure
scenarios. It is about naturally limiting the blast radius of errors
without a need to program defensively (as in writing code for
scenarios you don’t know “just in case”).
1:
[1]: https://dashbit.co/blog/writing-assertive-code-with-elixir
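A sketch of that assertive style (the payload shape is invented): the clause matches only a well-formed success response, so an error payload raises at the call site instead of flowing onward as nils:

```elixir
defmodule ApiClient do
  # Matches only the documented success shape. Anything else,
  # including an error payload, raises FunctionClauseError here,
  # at the point closest to the actual problem.
  def parse_user(%{"id" => id, "name" => name}) when is_integer(id) do
    %{id: id, name: name}
  end
end
```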
monkeyelite wrote 1 day ago:
This seems specific to BEAM, as crashing a FastCGI process is fine and
the response will be handled correctly by Apache or nginx.
valenterry wrote 1 day ago:
There are a few stages, and each improves on the previous ones:
1. Detect crashes at runtime and by default stop/crash to prevent
continuing with invalid program state
2. Detect crashes at runtime and handle them according to the business
context (e.g. crash or retry or fallback-to or ...) to prevent bad UX
through crashes.
3. Detect potential crashes at compile-time to prevent the dev from
forgetting to handle them according to the business context
4. Don't just detect the possibility of crashes but also the specific
type and context to prevent the dev from making a logical mistake and
causing a potential runtime error during error handling according to
the business context
An example for stage 4 would be that the compiler checks that a
fall-back option will actually always resolve the errors and not
potentially introduce a new error / error type. Such as falling back to
another URL does not actually always resolve the problem, there still
needs to be handling for when the request to the alternative URL fails.
The philosophy described in the article is basically just stage 1 and a
(partial) default restart instead of a default crash, which is maybe a
slight improvement but not really sufficient, at least not by my
personal standards.
creatonez wrote 1 day ago:
Based on your list there is an opportunity to define stage -1 of
error handling sanity, the Eval-Rinse-Reload loop, as implemented by
FuckItJS, the original Javascript Error Steamroller: [1] > Through a
process known as Eval-Rinse-Reload-And-Repeat, FuckItJS repeatedly
compiles your code, detecting errors and slicing those lines out of
the script. To survive such a violent process, FuckItJS reloads
itself after each iteration, allowing the onerror handler to catch
every single error in your terribly written code.
> [...]
> This will keep evaluating your code until all errors have been
sliced off like mold on a piece of perfectly good bread. Whether or
not the remaining code is even worth executing, we don't know. We
also don't particularly care.
[1]: https://github.com/mattdiamond/fuckitjs
valenterry wrote 1 day ago:
Oh, thank you for the nostalgic reminder of that one. I read that a
decade ago and found it hilarious.
BobbyTables2 wrote 1 day ago:
Hackers also love auto-restarting services.
Exploitation of vulnerabilities isn’t always 100% reliable. Heap
grooming might be limited or otherwise inadequate.
A quick automatic restart keeps them in business without any other
human interaction involved.
teiferer wrote 1 day ago:
Took me a minute to realize what you meant with "hackers". Quite the
irony, given the name of the site we are having this conversation on.
HexDecOctBin wrote 1 day ago:
How does restarting the process fix the crash? If the process crashed
because a file was missing, it will still be missing when the process
is restarted. Is an infinite crash-loop considered success in Erlang?
jlouis wrote 1 day ago:
It's not going to be missing the next time around. Usually the file
is missing due to some concurrency-problem where the file only gets
to exist a little later. A process restart certainly fixes this.
If the problem persists, a larger part of the supervision tree is
restarted. This eventually leads to a crash of the full application,
if nothing can proceed without this application existing in the
Erlang release.
The key point is that there's a very large class of errors which is
due to the concurrent interaction of different parts of the system.
These problems often go away on the next try, because the risk of
them occurring is low.
conradfr wrote 1 day ago:
If the rest of the program is still running while you fix it, yes?
Also, restarting endlessly is just one strategy between multiple
others.
victorbjorklund wrote 1 day ago:
Elixir dev: It does not solve all issues. But sometimes you have some
kind of rare bug that just happens once X,Z and Y happens in a
specific order. If it is restarted it might not happen that way
again. Or it might be a temporary problem. You are reaching for an
API and it temporarily has issues. It might not have it anymore in 50
ms.
But of course if it crashes because you are reading a file that does
not exist it doesnt solve the issue (but it avoids crashing the whole
system).
victorbjorklund wrote 1 day ago:
Note that "let it crash" doesn't mean we shouldn't fix bugs. It is more
about: if there is a bug we haven't fixed, it is better for the crash
to take down a tiny part of the program than the whole program.
worthless-trash wrote 10 hours 17 min ago:
Or more importantly, you can't design robust recovery and retry
systems.
masklinn wrote 1 day ago:
> Is an infinite crash-loop considered success in Erlang?
Of course not, but usually that's not what happens, instead a process
crashes because some condition was not considered, the corresponding
request is aborted, and a supervisor restarts the process (or doesn't
because the acceptor spawns a process per request / client).
Or a long-running worker got into an incorrect state and crashed, and
a supervisor will restart it in a known good state (that's a pretty
common thing to do in hardware, BEAM makes that idiomatic in
software).
gopher_space wrote 1 day ago:
Both of your examples look like infinite crash-loops if your work
needs to be correct more than it needs to be available. E.g. there
aren't any known good states prior to an unexpected crash, you're
just throwing a hail mary because the alternatives are impractical.
dmsnell wrote 1 day ago:
When a process crashes, its supervisor restarts it according to
some policy. Policies specify whether to restart the sibling processes
in their startup order or to only restart the crashed process.
But a supervisor also sets limits, like “10 restarts in a
timespan of 1 second.” Once the limits are reached, the
supervisor crashes. Supervisors have supervisors.
In this scenario the fault cascades upward through the system,
triggering more broad restarts and state-reinitializations until
the top-level supervisor crashes and takes the entire system down
with it.
An example might be losing a connection to the database. Failing
while querying it is not an expected fault, so you let it crash.
That kills the web request, but then the web server ends
up crashing too because too many requests failed, then a task
runner fails for similar reasons. The logger is still reporting
all this because it’s a separate process tree, and the
top-level app supervisor ends up restarting the entire thing. It
shuts everything off, tries to restart the database connection,
and if that works everything will continue, but if not, the
system crashes completely.
Expected faults are not part of “let it crash.” E.g. if a
user supplies a bad file path or network resource. The
distinction is subjective and based around the expectations of
the given app. Failure to read some asset included in the
distribution is both unlikely and unrecoverable, so “let it
crash” allows the code to be simpler in the happy path without
giving up fault handling or burying errors deeper into the app or
data.
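The restart limits and escalation described above can be sketched in
Elixir/OTP. This is a minimal illustration, not anyone's production
setup; the module names `MyApp.Worker` and `MyApp.Supervisor` are
hypothetical:

```elixir
defmodule MyApp.Worker do
  use GenServer

  # Registered under the module name so the supervisor's restart
  # is observable from outside (a new pid appears under the name).
  def start_link(arg), do: GenServer.start_link(__MODULE__, arg, name: __MODULE__)

  @impl true
  def init(arg), do: {:ok, arg}
end

defmodule MyApp.Supervisor do
  use Supervisor

  def start_link(arg), do: Supervisor.start_link(__MODULE__, arg, name: __MODULE__)

  @impl true
  def init(_arg) do
    children = [MyApp.Worker]
    # If children crash more than 10 times within 1 second, this
    # supervisor itself crashes, and the fault escalates to *its*
    # supervisor — the cascade described above.
    Supervisor.init(children, strategy: :one_for_one, max_restarts: 10, max_seconds: 1)
  end
end
```

The `:one_for_one` strategy restarts only the crashed child;
`:rest_for_one` is the variant that also restarts the siblings
started after it, in their startup order.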
Muromec wrote 1 day ago:
If it has no good states you probably know it before deploying to
production.
masklinn wrote 1 day ago:
> there aren't any known good states prior to an unexpected crash
If there aren't any good states then the program straight up
doesn't work in the first place, which gets diagnosed pretty
quickly before it hits the field.
> your work needs to be correct more than it needs to be
available.
"correctness over availability" tends to not be a thing, if you
assume you can reach perfect and full correctness then either you
never release or reality quickly proves you wrong in the field.
So maximally resilient and safe systems generally plan for errors
happening and how to recover from them instead of assuming they
don't. There are very few fully proven non-trivial programs, and
there were even fewer 40 years ago.
And Erlang / BEAM was designed in a telecom context, so
availability is the prime directive. Which is also why
distribution is built-in: if you have a single machine and it
crashes you have nothing.
corysama wrote 1 day ago:
I’m only an armchair expert on Erlang. But, having looked into it
repeatedly for a couple decades, my take-away is the “Let it
crash” slogan is good. But, also presented a bit out of context.
Or, at least assuming context that most people don’t have.
Erlang is used in situations involving a zillion incoming requests.
If an individual request fails… Maybe it was important. Maybe it
wasn’t. If it was important, it’s expected they’ll try again.
What’s most important is that the rest of the requests are not
interrupted.
What makes Erlang different is that it is natural and trivial to be
able to shut down an individual request on the event of an error
without worrying about putting any other part of the system into a
bad state.
You can pull this off in other languages via careful attention to the
details of your request-handling code. But, the creators of the
Erlang language and foundational frameworks have set their users up
for success via careful attention to the design of the system as a
whole.
That’s great in the contexts in which Erlang is used. But, in the
context of a Java desktop app like Open Office, it’s more like
saying “Let it throw”, “it” being some user action, and the
slogan being to have a language and framework with such robust
exception handling built in that error handling becomes trivial and
nearly invisible.
Quekid5 wrote 1 day ago:
> You can pull this off in other languages via careful attention to
the details of your request-handling code. But, the creators of the
Erlang language and foundational frameworks have set their users up
for success via careful attention to the design of the system as a
whole.
+10. So many people miss this very important point. If you have
lots of mutable shared state, or can accidentally leak such into
your actor code then the whole actor/supervision tree thing falls
over very easily... because you can't just restart any actor
without worrying about the rest of the system.
I think this is a large (but not the only[0]) part of why
actors/supervisors haven't really caught on anywhere outside of
Erlang, even for problem spaces where they would be suitable.
[0] I personally feel the model is very hard to reason about
compared to threaded/blocking straight-line code using e.g.
structured concurrency, but that may just be a me thing.
asa400 wrote 22 hours 21 min ago:
I have worked Elixir/Erlang and Rust a lot, and I agree. Rust in
particular gives ownership semantics to threaded/blocking/locking
code, which I often times find _much_ easier to understand than a
series of messages sent between tasks/processes in Elixir/Erlang.
However, in a world where you have to do concurrent
blocking/locking code without the help of rigorous
compiler-enforced ownership semantics, Elixir/Erlang is like
water in the desert.
jfengel wrote 22 hours 52 min ago:
The alternative to straight-line code used to be called
"spaghetti code".
There was a joke article parodying "GOTO considered harmful" by
suggesting a "COME FROM" command. But in a lot of ways, that's
exactly what many modern frameworks and languages aim for.
Quekid5 wrote 22 hours 0 min ago:
Haha... be the change! Program in INTERCAL! :)
nine_k wrote 1 day ago:
Let it crash, so that if something goes wrong, it does not do so
silently.
Let it crash, because a relevant manager will detect it, report it,
clean it up, and restart it, without you having to write a line of
code for that.
Let it crash as soon as possible, so that any problem (like a crash
loop) is readily visible. It's very easy to replace arbitrary bits
of Erlang code in a running system, without affecting the rest of
it. "Fix it in prod" is better than "miss it in prod", especially
when you cannot stop the prod ever.
0x445442 wrote 1 day ago:
Are individual agents deployable on their own or does the entire
"app" of agents need to be deployed as a single group? If
individually deployable, what does this look like from a version
control and a CI/CD perspective?
nine_k wrote 23 hours 38 min ago:
To the best of my knowledge: yes, individual parts are
deployable separately, within reason. There is explicitly no
need to deploy the whole thing at once, and especially not to
shut it all down at once.
Erlang works by message passing and duck typing, so, as long as
your interfaces are compatible (backwards or forwards), you can
alter the implementation, and evolve the interfaces. Think
microservices, but when every function can be a microservice,
at an absolutely trivial cost.
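A sketch of what that interface compatibility can look like in
practice (Elixir; the `KV` module is hypothetical): because calls are
plain messages matched by pattern, a server can accept both an old
and a new call shape at once, so callers can be upgraded
independently:

```elixir
defmodule KV do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, %{}, name: __MODULE__)

  @impl true
  def init(state), do: {:ok, state}

  # Old call shape: still accepted, old callers keep working.
  @impl true
  def handle_call({:get, key}, _from, state) do
    {:reply, Map.get(state, key), state}
  end

  # New, extended call shape added later, alongside the old one.
  def handle_call({:get, key, default}, _from, state) do
    {:reply, Map.get(state, key, default), state}
  end

  @impl true
  def handle_cast({:put, key, val}, state) do
    {:noreply, Map.put(state, key, val)}
  end
end
```

Since dispatch is on the message shape rather than on a compiled-in
type, the implementation behind either shape can be swapped without
touching the callers.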
ramchip wrote 1 day ago:
I recommend [1] starting from "if my configuration file is corrupted,
restarting won't fix anything". The tl;dr is it helps with transient
bugs.
[1]: https://ferd.ca/the-zen-of-erlang.html
bccdee wrote 23 hours 29 min ago:
> if you feel that your well-understood regular failure case is
viable, then all your error handling can fall-through to that case.
This is my favourite line, because it generalizes the underlying
principle beyond the specific BEAM/OTP model in a way that carries
over well to the more common sort of database-backed services that
people tend to write.
kimi wrote 1 day ago:
...and it does no harm for unfixable bugs. It's the logical
equivalent of "turn it off and on again", which as we know fixes
most issues by itself, but it happens to only a part of your software
deployment, so most of it keeps running.
lawn wrote 1 day ago:
Typically you then let the error bubble up in the supervisor tree if
restarting multiple times doesn't fix it.
Of course there are still errors that can't be recovered from, in
which case the whole program may finally crash.
dns_snek wrote 1 day ago:
> in which case the whole program may finally crash.
This may happen if you let it, but it's basically never the desired
outcome. If you were handling a user request, it should stop by
returning a HTTP 500 to the client, or if you were processing a
background job of some sort, it should stop with a watchdog process
marking the job as a failure, not with the entire system crashing.
Muromec wrote 1 day ago:
returning HTTP 500 as early as possible is an example of the "let
it crash" approach outside of Erlang.
dns_snek wrote 17 hours 52 min ago:
That's not what "let it crash" is about. Letting something
crash in Erlang means that a process (actor) is allowed to
crash, but then it gets restarted to try again, which would
resolve the situation in case of transient errors.
The equivalent of "let it crash" outside of Erlang is a
mountain of try-catch statements and hand-rolled retry wrappers
with time delays, with none of the observability and tooling
that you get in Erlang.
adastra22 wrote 1 day ago:
“Reset on error” might be a better phrasing.
refactor_master wrote 1 day ago:
Question as a complete outsider: if I run idempotent Python
applications in Kubernetes containers and they crash, Kubernetes will
eventually restart them. Of course, knowing what to do on IO errors is
nicer than destroying and restarting everything with a much bigger
hammer (as the article also mentions, you can serve a better error
message for whoever has to “deal” with the problem), but eventually
they should end up in the same workable state.
Is this conceptually similar, but perhaps at the code level instead?
valenterry wrote 1 day ago:
In general, if you can move any kind of logic to a lower level,
that's better.
For example, testing that kubernetes restarts work correctly is
tricky and requires a complicated setup. Testing that an erlang
process/actor behaves as expected is basically a unit test.
emoII wrote 1 day ago:
I bet the kubernetes project has test for that, why should I as an
application developer care about testing something other than my
own code?
bccdee wrote 18 hours 18 min ago:
That's assuming your code is well-configured. How do you test
your k8s configs?
valenterry wrote 1 day ago:
Oh of course, I'm sure the kubernetes project tests that they
trigger restarts correctly etc.
But that doesn't cover the behavior of your app, the specific
configuration you ask kubernetes to use and how the app uses its
health endpoints etc. - this is all purely about your own
code/config, the kubernetes team can't test that.
adastra22 wrote 1 day ago:
Conceptually similar, different implementation. Perhaps the most
visible difference is that supervisors aren’t polling application
state but are instead notified about errors (crashes), and restarting
is extremely low latency. Erlang/BEAM was invented for telephony,
and it is possible for this to happen in the middle of a protocol
exchange without the user even noticing.
goosejuice wrote 1 day ago:
Somewhat, yes, but it's much less powerful. In the BEAM these are
trees of supervisors and monitors/links, which choose how to restart
and receive the stacktrace/error reason of the failure, respectively.
This gives a lot of freedom in how to handle the failure. In k8s,
it's often just a dumb monitor/controller that knows little about how
to remediate the issue on boot. Never mind the boot-time penalty. [1]
BEAM apps run great on k8s.
[1]: https://hexdocs.pm/elixir/1.18.4/Supervisor.html
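A minimal sketch of a monitor receiving the failure reason (plain
Elixir, no supervisor involved):

```elixir
# spawn_monitor/1 starts a process and monitors it atomically, so
# the :DOWN message always carries the process's real exit reason.
{pid, ref} = spawn_monitor(fn -> exit({:badmatch, :unexpected_value}) end)

reason =
  receive do
    {:DOWN, ^ref, :process, ^pid, why} -> why
  after
    1_000 -> :timeout
  end

# reason is now {:badmatch, :unexpected_value}. A supervisor receives
# the same kind of information through links and uses it to decide
# whether and how to restart the child.
IO.inspect(reason)
```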
goosejuice wrote 1 day ago:
[1] The origin, as far as I know. I think it still holds and is
insightful as a general case. "Let it heal" seems pretty close to
what Joe was getting at.
[1]: https://erlang.org/pipermail/erlang-questions/2003-March/00787...
vrnvu wrote 1 day ago:
>> This organization corresponds nicely to a idealized human
organization of bosses and workers - bosses say what is to be done,
workers do stuff. Bosses do quality control and check that things
get done, if not they fire people, re-organize and tell other people
to do the stuff. If they fail (the bosses) they get sacked etc. <<
We miss you Joe :)
tombert wrote 1 day ago:
He was one of my favorite humans; the few emails I exchanged with
him were funny and insightful.
bitwize wrote 1 day ago:
I don't code in Erlang or Elixir, aside from messing about. But I've
found that letting an entire application crash is something that I can
do under certain circumstances, especially when "you have a very big
problem and will not go to space today". For example, if there's an
error reading some piece of data that's in the application bundle and
is needed to legitimately start up in the first place (assets for my
game, for instance), it just "screams and dies" (spits out a stack
trace and terminates).
borromakot wrote 1 day ago:
Errors during initialization of a BEAM language application will
crash the entire program, and you can decide to exit/crash a program
if you get into some unrecoverable state. The important thing is the
design of individual crashable/recoverable units.
bgdkbtv wrote 1 day ago:
This is great, thanks for sharing! I've been thinking about improving
error handling in my liveview app and this might be a nice way to
start.