	xml.c: do not convert UTF-16 surrogate pairs to an invalid sequence - xmlparser…
	git clone git://git.codemadness.org/xmlparser
	---
	commit 2e33c882b88eebdaefb0477658a9cbb79d57e2b1
	parent 6d001c968814d93492e5925f63ede6aa94c12552
	Author: Hiltjo Posthuma <[email protected]>
	Date: Fri, 22 Jan 2021 13:37:47 +0100

	xml.c: do not convert UTF-16 surrogate pairs to an invalid sequence

	In sfeed a simple way to reproduce:

	printf '<item><title>&#xdc00;</title></item>' | sfeed | iconv -t utf-8

	Result:
	iconv: (stdin):1:8: cannot convert

	Output result:

	printf '<item><title>&#xdc00;</title></item>' | sfeed

	Before:

	00000000 09 ed b0 80 09 09 09 09 09 09 09 0a |............|
	0000000c

	After:

	00000000 09 26 23 78 64 63 30 30 3b 09 09 09 09 09 09 09 |.&#xdc00;.......|
	00000010 0a |.|
	00000011

	The entity is now output as a literal string. This makes it easier to see what
	is wrong and to debug the feed, and it is consistent with the current behaviour
	for invalid named entities (&bla;). An alternative could be the Unicode
	replacement character (code point 0xfffd).

	Reference: https://unicode.org/faq/utf_bom.html , specifically:

	"Q: How do I convert an unpaired UTF-16 surrogate to UTF-8?"
	"A: A different issue arises if an unpaired surrogate is encountered when
	converting ill-formed UTF-16 data. By representing such an unpaired surrogate
	on its own as a 3-byte sequence, the resulting UTF-8 data stream would become
	ill-formed. While it faithfully reflects the nature of the input, Unicode
	conformance requires that encoding form conversion always results in a valid
	data stream. Therefore a converter must treat this as an error. [AF]"
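To illustrate why the "Before" output is ill-formed, here is a minimal sketch, not taken from xml.c. `naive_utf8` and `is_encoded_surrogate` are hypothetical helper names. The encoder deliberately skips any surrogate check, so it produces the 3-byte form that the FAQ answer describes.

```c
#include <stddef.h>

/* Hypothetical helper (not from xml.c): encode a code point as UTF-8
 * WITHOUT rejecting surrogates. Assumes 0 <= cp <= 0x10ffff and that buf
 * has room for 4 bytes. Returns the number of bytes written. */
static size_t
naive_utf8(long cp, unsigned char *buf)
{
	if (cp < 0x80) {
		buf[0] = (unsigned char)cp;
		return 1;
	} else if (cp < 0x800) {
		buf[0] = (unsigned char)(0xc0 | (cp >> 6));
		buf[1] = (unsigned char)(0x80 | (cp & 0x3f));
		return 2;
	} else if (cp < 0x10000) {
		buf[0] = (unsigned char)(0xe0 | (cp >> 12));
		buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
		buf[2] = (unsigned char)(0x80 | (cp & 0x3f));
		return 3;
	}
	buf[0] = (unsigned char)(0xf0 | (cp >> 18));
	buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
	buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
	buf[3] = (unsigned char)(0x80 | (cp & 0x3f));
	return 4;
}

/* Hypothetical helper: returns 1 if the sequence is the 3-byte encoding of
 * a surrogate (U+D800..U+DFFF), i.e. ED followed by a byte in A0..BF. */
static int
is_encoded_surrogate(const unsigned char *s, size_t len)
{
	return len == 3 && s[0] == 0xed && s[1] >= 0xa0 && s[1] <= 0xbf;
}
```

Calling `naive_utf8(0xdc00, buf)` yields the bytes `ed b0 80`, the same sequence seen in the "Before" hexdump, and `is_encoded_surrogate` flags it. A strict UTF-8 decoder such as iconv rejects that sequence.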

	Diffstat:
	M xml.c | 3 ++-

	1 file changed, 2 insertions(+), 1 deletion(-)
	---
	diff --git a/xml.c b/xml.c
	@@ -287,7 +287,8 @@ numericentitytostr(const char *e, char *buf, size_t bufsiz)
	 	else
	 		l = strtol(e, &end, 10);
	 	/* invalid value or not a well-formed entity or invalid code point */
	-	if (errno || e == end || *end != ';' || l < 0 || l > 0x10ffff)
	+	if (errno || e == end || *end != ';' || l < 0 || l > 0x10ffff ||
	+	    (l >= 0xd800 && l <= 0xdfff))
	 		return -1;
	 	len = codepointtoutf8(l, buf);
	 	buf[len] = '\0';
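
The validation step can be exercised in isolation with a sketch like the following. `parse_numeric_entity` is a hypothetical stand-in, not the actual function in xml.c. It assumes `e` points just past the `&#` of the entity, and it uses 0xdfff (the last low surrogate) as the upper bound of the rejected range.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the validation in numericentitytostr().
 * e points just past "&#", e.g. "xdc00;" or "65;".
 * Returns the code point, or -1 if the entity is malformed, out of range,
 * or a UTF-16 surrogate (U+D800..U+DFFF). */
static long
parse_numeric_entity(const char *e)
{
	char *end;
	long l;
	int base = 10;

	/* hex (16) or decimal (10) */
	if (*e == 'x') {
		e++;
		base = 16;
	}

	errno = 0;
	l = strtol(e, &end, base);

	/* invalid value or not a well-formed entity or invalid code point */
	if (errno || e == end || *end != ';' || l < 0 || l > 0x10ffff ||
	    (l >= 0xd800 && l <= 0xdfff))
		return -1;

	return l;
}
```

With this check, `parse_numeric_entity("xdc00;")` returns -1, so a caller can fall back to emitting the entity as a literal string, while neighbouring code points such as 0xd7ff and 0xe000 are still accepted.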