xml.c: do not convert UTF-16 surrogate pairs to an invalid sequence - xmlparser…
git clone git://git.codemadness.org/xmlparser
---
commit 2e33c882b88eebdaefb0477658a9cbb79d57e2b1
parent 6d001c968814d93492e5925f63ede6aa94c12552
Author: Hiltjo Posthuma <[email protected]>
Date: Fri, 22 Jan 2021 13:37:47 +0100
xml.c: do not convert UTF-16 surrogate pairs to an invalid sequence
In sfeed a simple way to reproduce:
printf '<item><title>&#xdc00;</title></item>' | sfeed | iconv -t utf-8
Result:
iconv: (stdin):1:8: cannot convert
Output result:
printf '<item><title>&#xdc00;</title></item>' | sfeed
Before:
00000000 09 ed b0 80 09 09 09 09 09 09 09 0a |............|
0000000c
After:
00000000 09 26 23 78 64 63 30 30 3b 09 09 09 09 09 09 09 |.&#xdc00;.......|
00000010 0a |.|
00000011
The entity is output as a literal string. This makes it easier to see what is
wrong and to debug the feed, and it is consistent with the current behaviour
for invalid named entities (&bla;). An alternative could be the Unicode
replacement character (code point 0xfffd).
Reference: https://unicode.org/faq/utf_bom.html , specifically:
"Q: How do I convert an unpaired UTF-16 surrogate to UTF-8? "
"A: A different issue arises if an unpaired surrogate is encountered when
converting ill-formed UTF-16 data. By representing such an unpaired surrogate
on its own as a 3-byte sequence, the resulting UTF-8 data stream would become
ill-formed. While it faithfully reflects the nature of the input, Unicode
conformance requires that encoding form conversion always results in a valid
data stream. Therefore a converter must treat this as an error. [AF]"
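The "Before" bytes (ed b0 80) are what a plain 3-byte UTF-8 encoding produces for 0xdc00 when nothing checks for surrogates. A minimal sketch of that arithmetic, assuming a hypothetical helper named encode3 (it is not a function from xml.c):

```c
#include <stdio.h>

/* Hypothetical helper, not from xml.c: naive 3-byte UTF-8 encoding for
 * code points in the range 0x800..0xffff, with no surrogate check. */
static void
encode3(unsigned long cp, unsigned char out[3])
{
	out[0] = (unsigned char)(0xe0 | (cp >> 12));         /* 1110xxxx */
	out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f)); /* 10xxxxxx */
	out[2] = (unsigned char)(0x80 | (cp & 0x3f));        /* 10xxxxxx */
}

/* Print the three encoded bytes of cp as lowercase hex. */
static void
dump3(unsigned long cp)
{
	unsigned char b[3];

	encode3(cp, b);
	printf("%02x %02x %02x\n", b[0], b[1], b[2]);
}
```

Calling dump3(0xdc00) prints the same three bytes as the "Before" hexdump. A UTF-8 decoder rejects them because a lead byte of 0xed may only be followed by a continuation byte in the range 0x80..0x9f, which is why iconv reports that it cannot convert the input.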
Diffstat:
M xml.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
---
diff --git a/xml.c b/xml.c
@@ -287,7 +287,8 @@ numericentitytostr(const char *e, char *buf, size_t bufsiz)
else
l = strtol(e, &end, 10);
/* invalid value or not a well-formed entity or invalid code point */
- if (errno || e == end || *end != ';' || l < 0 || l > 0x10ffff)
+ if (errno || e == end || *end != ';' || l < 0 || l > 0x10ffff ||
+ (l >= 0xd800 && l <= 0xdfff))
return -1;
len = codepointtoutf8(l, buf);
buf[len] = '\0';
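The surrogate block is U+D800 through U+DFFF, so the upper bound of the check has four hex digits. A bound of 0xdffff (five digits) would also reject valid code points such as U+FFFD or U+1F600. The following is a rough, standalone sketch of the validation step only. The name entitycodepoint is hypothetical, and the function assumes it is given the text that follows "&#":

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper, not part of xml.c: parse the text after "&#"
 * (for example "xdc00;" or "128512;") and return the code point, or -1
 * if it is malformed, out of range, or a UTF-16 surrogate. */
static long
entitycodepoint(const char *e)
{
	char *end;
	long l;

	errno = 0;
	if (*e == 'x')
		l = strtol(++e, &end, 16); /* hexadecimal: &#x...; */
	else
		l = strtol(e, &end, 10);   /* decimal: &#...; */

	/* no digits, missing ';', out of range, or a surrogate */
	if (errno || e == end || *end != ';' || l < 0 || l > 0x10ffff ||
	    (l >= 0xd800 && l <= 0xdfff))
		return -1;
	return l;
}
```

With this bound, "xdc00;" is rejected while "xd7ff;", "xe000;" and "x1f600;" are accepted. A caller that gets -1 can fall back to emitting the entity text unchanged, which is the behaviour the commit message describes.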