[HN Gopher] Slightly better named character reference tokenizati... | |
___________________________________________________________________ | |
Slightly better named character reference tokenization than Chrome, | |
Safari, FF | |
Author : todsacerdoti | |
Score : 54 points | |
Date : 2025-06-27 00:27 UTC (1 days ago) | |
web link (www.ryanliptak.com) | |
w3m dump (www.ryanliptak.com) | |
| deepdarkforest wrote: | |
| This might not get a lot of traction because it's very technical, | |
| but i wanted to say a massive well done for the effort. 20k words | |
| on anything this specific is not a joke. I wish i would put this | |
| level of commitment to anything in life, this was inspiring if | |
| nothing else. | |
| squeek502 wrote: | |
| Appreciate it (I'm the author). I'd like to think there's a | |
| good bit of interesting stuff in here outside of the specific | |
| topic of named character reference tokenization. | |
| chaps wrote: | |
| "no[t] a 'data structures' person" | |
| | |
| says the person who wrote an extremely technical 20k word | |
| blog post on data structures! <3 | |
| arthurcolle wrote: | |
| Congratulations on your newfound promotion to data structures | |
| person btw | |
| Ndymium wrote: | |
| Thanks to your article I just realised my HTML entity codec | |
| library doesn't support decoding those named entities that can | |
| omit the semicolon at the end. More work for me, good thing my | |
| summer vacation just started! :) | |
| masfuerte wrote: | |
| That was a good read. I reread the relevant section of the HTML5 | |
| spec and noticed an error in an example: | |
| | |
| > For example, ¬in will be parsed as "!in" whereas ¬in | |
| will be parsed as "[?]". | |
| | |
| Only a small minority of the named character references are | |
| permitted without a closing semicolon, and notin is not one of | |
| them. So ¬in is actually parsed as "!in". ∉ is parsed as | |
| "[?]". | |
| | |
| https://html.spec.whatwg.org/#parse-error-missing-semicolon-... | |
| squeek502 wrote: | |
| Good catch, that does indeed look like a mistake in the spec. | |
| Everything past the first sentence of that error description is | |
| suspect, honestly (seems like it was naively adapted from the | |
| example in [1] but that example isn't relevant to the missing- | |
| semicolon-after-character-reference error). | |
| | |
| Will submit an issue/PR to correct it when I get a chance. | |
| | |
| [1] https://html.spec.whatwg.org/multipage/parsing.html#named- | |
| ch... | |
| o11c wrote: | |
| Congratulations, you've reinvented regexes. This is still a win | |
| since you're using the sane kind of regex and are allowing | |
| multiple accept states rather than just one, in both cases unlike | |
| most modern implementations. | |
| | |
| (I'm mostly throwing my thoughts as they appear, some parts of | |
| this ends up duplicating what's in the article, hopefully with | |
| more standard terminology though) | |
| | |
| Note that at _runtime_ there is no difference between a standard | |
| DFA and what you can a DAFSA. The difference is entirely at | |
| construction time. | |
| | |
| In lexers, your `end_of_word` is usually called `accept`, and | |
| rather than being a `bool` it is an integer (0 for no-accept, N | |
| for the Nth valid accept value, which in your case should | |
| probably be an index within the array of all possible characters. | |
| Note that since multiple entity names map to the same character, | |
| you will have multiple nodes with the same `accept`). I think | |
| your perfect-hash approach requires duplicating them (which | |
| admittedly might be a win since you are far from the typical | |
| lexing case where there are many possible inputs for some | |
| outputs. However, this does mean you can't play games with the | |
| bits of accept` to encode the length of your lookup as well as | |
| the start - if we're saving size, I lean toward UTF-8, either | |
| nul-terminated or with an explicit length). | |
| | |
| The next thing you should do is use equivalence classes rather | |
| than dealing with every character individually. For this | |
| particular parsing problem, almost all of your equivalence | |
| classes will only have a single character, but you still win big | |
| by mapping all invalid characters to a single class. Since there | |
| are only 51 characters used in entity names, this means you only | |
| need 6 bits per character (which should be fast since you only | |
| need to special-case non-letters). And since many of those only | |
| appear for the first letter, you can probably deal with 5 or | |
| fewer with minimal logic ahead of time. | |
| | |
| That said - one important lesson from lexing is that it is almost | |
| always a mistake to lex keywords; whenever possible, just lex an | |
| identifier and then do a map lookup. The reason that _can 't_ be | |
| done is entirely because of those entities which do not require | |
| the semicolon, so I suspect that the optimal approach is going to | |
| be: after resolving `document.write`, look ahead for a semicolon, | |
| and if found use the fast path; only if that fails, enter the | |
| (much smaller) DFA for the few that do not require a semicolon. | |
| But since you don't _have_ identifiers you might not be hitting | |
| the worst case (explosive splitting) anyway. | |
| | |
| For something this small, binary search is probably a mistake | |
| (being very unpredictable for the CPU) if you're doing everything | |
| else right; you're better off doing a linear search if you can't | |
| just using SIMD magic to match them in parallel. Struct-of-arrays | |
| is probably pointless for a problem set that fits in L1, but | |
| might start winning again if you want to leave some L1 for other | |
| parts of the program. Storing siblings/cousins next to each other | |
| (as an accident of construction) means you're probably already as | |
| Eytzinger-like as you can be. | |
| | |
| (Edit: fix incomplete and missing thoughts) | |
___________________________________________________________________ | |
(page generated 2025-06-28 11:01 UTC) |