[HN Gopher] Slightly better named character reference tokenizati...

	[HN Gopher] Slightly better named character reference tokenizati...
	___________________________________________________________________

	Slightly better named character reference tokenization than Chrome,
	Safari, FF

	Author : todsacerdoti
	Score : 54 points
	Date : 2025-06-27 00:27 UTC (1 days ago)

	web link (www.ryanliptak.com)
	w3m dump (www.ryanliptak.com)

	\| deepdarkforest wrote:
	\| This might not get a lot of traction because it's very technical,
	\| but i wanted to say a massive well done for the effort. 20k words
	\| on anything this specific is not a joke. I wish i would put this
	\| level of commitment to anything in life, this was inspiring if
	\| nothing else.

	\| squeek502 wrote:
	\| Appreciate it (I'm the author). I'd like to think there's a
	\| good bit of interesting stuff in here outside of the specific
	\| topic of named character reference tokenization.

	\| chaps wrote:
	\| "no[t] a 'data structures' person"
	\|
	\| says the person who wrote an extremely technical 20k word
	\| blog post on data structures! <3

	\| arthurcolle wrote:
	\| Congratulations on your newfound promotion to data structures
	\| person btw

	\| Ndymium wrote:
	\| Thanks to your article I just realised my HTML entity codec
	\| library doesn't support decoding those named entities that can
	\| omit the semicolon at the end. More work for me, good thing my
	\| summer vacation just started! :)

	\| masfuerte wrote:
	\| That was a good read. I reread the relevant section of the HTML5
	\| spec and noticed an error in an example:
	\|
	\| > For example, ¬in will be parsed as "!in" whereas &notin
	\| will be parsed as "[?]".
	\|
	\| Only a small minority of the named character references are
	\| permitted without a closing semicolon, and notin is not one of
	\| them. So &notin is actually parsed as "!in". ∉ is parsed as
	\| "[?]".
	\|
	\| https://html.spec.whatwg.org/#parse-error-missing-semicolon-...

	\| squeek502 wrote:
	\| Good catch, that does indeed look like a mistake in the spec.
	\| Everything past the first sentence of that error description is
	\| suspect, honestly (seems like it was naively adapted from the
	\| example in [1] but that example isn't relevant to the missing-
	\| semicolon-after-character-reference error).
	\|
	\| Will submit an issue/PR to correct it when I get a chance.
	\|
	\| [1] https://html.spec.whatwg.org/multipage/parsing.html#named-
	\| ch...

	\| o11c wrote:
	\| Congratulations, you've reinvented regexes. This is still a win
	\| since you're using the sane kind of regex and are allowing
	\| multiple accept states rather than just one, in both cases unlike
	\| most modern implementations.
	\|
	\| (I'm mostly throwing my thoughts as they appear, some parts of
	\| this ends up duplicating what's in the article, hopefully with
	\| more standard terminology though)
	\|
	\| Note that at _runtime_ there is no difference between a standard
	\| DFA and what you can a DAFSA. The difference is entirely at
	\| construction time.
	\|
	\| In lexers, your `end_of_word` is usually called `accept`, and
	\| rather than being a `bool` it is an integer (0 for no-accept, N
	\| for the Nth valid accept value, which in your case should
	\| probably be an index within the array of all possible characters.
	\| Note that since multiple entity names map to the same character,
	\| you will have multiple nodes with the same `accept`). I think
	\| your perfect-hash approach requires duplicating them (which
	\| admittedly might be a win since you are far from the typical
	\| lexing case where there are many possible inputs for some
	\| outputs. However, this does mean you can't play games with the
	\| bits of accept` to encode the length of your lookup as well as
	\| the start - if we're saving size, I lean toward UTF-8, either
	\| nul-terminated or with an explicit length).
	\|
	\| The next thing you should do is use equivalence classes rather
	\| than dealing with every character individually. For this
	\| particular parsing problem, almost all of your equivalence
	\| classes will only have a single character, but you still win big
	\| by mapping all invalid characters to a single class. Since there
	\| are only 51 characters used in entity names, this means you only
	\| need 6 bits per character (which should be fast since you only
	\| need to special-case non-letters). And since many of those only
	\| appear for the first letter, you can probably deal with 5 or
	\| fewer with minimal logic ahead of time.
	\|
	\| That said - one important lesson from lexing is that it is almost
	\| always a mistake to lex keywords; whenever possible, just lex an
	\| identifier and then do a map lookup. The reason that _can 't_ be
	\| done is entirely because of those entities which do not require
	\| the semicolon, so I suspect that the optimal approach is going to
	\| be: after resolving `document.write`, look ahead for a semicolon,
	\| and if found use the fast path; only if that fails, enter the
	\| (much smaller) DFA for the few that do not require a semicolon.
	\| But since you don't _have_ identifiers you might not be hitting
	\| the worst case (explosive splitting) anyway.
	\|
	\| For something this small, binary search is probably a mistake
	\| (being very unpredictable for the CPU) if you're doing everything
	\| else right; you're better off doing a linear search if you can't
	\| just using SIMD magic to match them in parallel. Struct-of-arrays
	\| is probably pointless for a problem set that fits in L1, but
	\| might start winning again if you want to leave some L1 for other
	\| parts of the program. Storing siblings/cousins next to each other
	\| (as an accident of construction) means you're probably already as
	\| Eytzinger-like as you can be.
	\|
	\| (Edit: fix incomplete and missing thoughts)

	___________________________________________________________________
	(page generated 2025-06-28 11:01 UTC)