xml.c: do not convert UTF-16 surrogate pairs to an invalid sequence - xmlparser…
git clone git://git.codemadness.org/xmlparser
---
commit 2e33c882b88eebdaefb0477658a9cbb79d57e2b1
parent 6d001c968814d93492e5925f63ede6aa94c12552
Author: Hiltjo Posthuma <[email protected]>
Date: Fri, 22 Jan 2021 13:37:47 +0100

xml.c: do not convert UTF-16 surrogate pairs to an invalid sequence

In sfeed a simple way to reproduce:

    printf '<item><title>&#xdc00;</title></item>' | sfeed | iconv -t utf-8

Result:

    iconv: (stdin):1:8: cannot convert

Output result:

    printf '<item><title>&#xdc00;</title></item>' | sfeed

Before:

00000000  09 ed b0 80 09 09 09 09  09 09 09 0a              |............|
0000000c

After:

00000000  09 26 23 78 64 63 30 30  3b 09 09 09 09 09 09 09  |.&#xdc00;.......|
00000010  0a                                                |.|
00000011

The entity is output as a literal string. This makes it easier to see what is
wrong and to debug the feed, and it is consistent with the current behaviour for
invalid named entities (&bla;). An alternative could be the UTF-8 encoding of
the replacement character (code point 0xfffd).

Reference: https://unicode.org/faq/utf_bom.html , specifically:

"Q: How do I convert an unpaired UTF-16 surrogate to UTF-8?"

"A: A different issue arises if an unpaired surrogate is encountered when
converting ill-formed UTF-16 data. By representing such an unpaired surrogate
on its own as a 3-byte sequence, the resulting UTF-8 data stream would become
ill-formed. While it faithfully reflects the nature of the input, Unicode
conformance requires that encoding form conversion always results in a valid
data stream. Therefore a converter must treat this as an error. [AF]"
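
To illustrate where the bytes "ed b0 80" in the "Before" dump come from, here is a minimal sketch. The helper names issurrogate() and naiveutf8_3() are hypothetical and not part of xml.c. The second helper applies the 3-byte UTF-8 bit layout without any validity check, which is roughly what happens when a surrogate code point is encoded as if it were an ordinary character.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: true if cp lies in the UTF-16 surrogate range
 * U+D800..U+DFFF, which has no valid UTF-8 encoding. */
static int
issurrogate(long cp)
{
	return cp >= 0xd800 && cp <= 0xdfff;
}

/*
 * Hypothetical helper: blindly apply the 3-byte UTF-8 bit layout
 * (1110xxxx 10xxxxxx 10xxxxxx) to cp, without checking for surrogates.
 * For a surrogate this produces bytes that a strict decoder rejects.
 */
static size_t
naiveutf8_3(long cp, unsigned char *buf)
{
	assert(buf != NULL && cp >= 0x800 && cp <= 0xffff);
	buf[0] = 0xe0 | ((cp >> 12) & 0x0f);
	buf[1] = 0x80 | ((cp >> 6) & 0x3f);
	buf[2] = 0x80 | (cp & 0x3f);
	return 3;
}
```

For cp = 0xdc00 the naive encoder writes ed b0 80, the sequence iconv refuses to convert. Checking issurrogate() first avoids emitting it.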

Diffstat:
M xml.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
---
diff --git a/xml.c b/xml.c
@@ -287,7 +287,8 @@ numericentitytostr(const char *e, char *buf, size_t bufsiz)
 	else
 		l = strtol(e, &end, 10);
 	/* invalid value or not a well-formed entity or invalid code point */
-	if (errno || e == end || *end != ';' || l < 0 || l > 0x10ffff)
+	if (errno || e == end || *end != ';' || l < 0 || l > 0x10ffff ||
+	    (l >= 0xd800 && l <= 0xdfff))
 		return -1;
 	len = codepointtoutf8(l, buf);
 	buf[len] = '\0';
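
The validation in this hunk can be tried in isolation with a small sketch like the one below. parsenumericentity() is a hypothetical stand-in, not a function from xml.c. It assumes the input is the text following "&" (for example "#xdc00;") and returns the code point instead of writing UTF-8. The surrogate range used is U+D800 to U+DFFF.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for the checks in numericentitytostr():
 * parse a numeric entity without the leading '&' ("#xdc00;" or "#56320;")
 * and return its code point, or -1 if it is malformed, out of range
 * or a UTF-16 surrogate (U+D800..U+DFFF).
 */
static long
parsenumericentity(const char *e)
{
	char *end;
	long l;

	assert(e != NULL);
	/* not a numeric entity */
	if (e[0] != '#')
		return -1;
	e++;
	errno = 0;
	/* hex (16) or decimal (10) */
	if (*e == 'x')
		l = strtol(++e, &end, 16);
	else
		l = strtol(e, &end, 10);
	/* invalid value, not well-formed, out of range or a surrogate */
	if (errno || e == end || *end != ';' || l < 0 || l > 0x10ffff ||
	    (l >= 0xd800 && l <= 0xdfff))
		return -1;
	return l;
}
```

With this check "#xdc00;" and "#56320;" are rejected, while the neighbouring code points "#xd7ff;" and "#xe000;" and supplementary-plane values such as "#x1f600;" are still accepted.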