panettone: some unicode codepoints are converted to gibberish via the HTTP API

Opened by sterni at 2021-09-08T15·30+00

Recreation of what panettone requests from cheddar:

$ curl -H 'Content-Type: application/json' -d "{ \"markdown\": \"\\u1F570\\uFE0F\" }" localhost:4238/markdown
{"markdown":"<p>ὗ0️</p>\n"}

~~Not sure if this may be a general JSON problem as jq exhibits the same behavior:~~

$ echo "\"\\u1F570\\uFE0F\"" | jq
"ὗ0️"

~~The code point is at least correct.~~

JSON RFC (https://www.ietf.org/rfc/rfc4627.txt), page 3:

Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point. The hexadecimal letters A though F can be upper or lowercase. So, for example, a string containing only a single reverse solidus character may be represented as "\u005C".

[...]

To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".

So, in this case, the Unicode escape can't contain 5 hex digits. A code point outside the BMP, such as U+1F570, must instead be encoded as a UTF-16 surrogate pair, i.e. as 2 \u escapes.
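To make the surrogate pair rule concrete, here is a rough Python sketch of the arithmetic. It is only an illustration, not what cheddar, panettone or cl-json actually do, and `json_escape` is a made-up helper name:

```python
import json


def json_escape(code_point: int) -> str:
    """Return the JSON \\u escape(s) for a single Unicode code point.

    Hypothetical helper for illustration only; it does not reject
    code points that are themselves surrogates (U+D800..U+DFFF).
    """
    if code_point <= 0xFFFF:
        # Inside the BMP: one escape with exactly four hex digits.
        return f"\\u{code_point:04X}"
    # Outside the BMP: split the 20-bit offset into a UTF-16 surrogate pair.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return f"\\u{high:04X}\\u{low:04X}"


# U+1F570 followed by the variation selector U+FE0F, as in the curl call.
escaped = json_escape(0x1F570) + json_escape(0xFE0F)
decoded = json.loads(f'"{escaped}"')
```

With this, `escaped` is `\uD83D\uDD70\uFE0F`, and a conforming parser (here Python's `json` module) turns it back into the two intended code points.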

zseri at 2021-12-27T21·39+00
That means the string "\u1F570\uFE0F" is interpreted as U+1F57 U+0030 U+FE0F (the escape \u1F57, then a literal "0", then \uFE0F), which is correct according to the JSON RFC.
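The same reading can be checked with any conforming parser. A small sketch using Python's standard `json` module (chosen only for illustration; jq shows the same thing above):

```python
import json

# The raw JSON text is "\u1F570\uFE0F": the parser consumes exactly four hex
# digits after \u, so the trailing "0" of "1F570" is an ordinary character.
parsed = json.loads('"\\u1F570\\uFE0F"')

# List the code points that actually came out of the parser.
code_points = [f"U+{ord(ch):04X}" for ch in parsed]
print(code_points)  # ['U+1F57', 'U+0030', 'U+FE0F']
```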

zseri at 2021-12-29T19·49+00
zseri closed this issue at 2021-12-29T19·49+00
sterni reopened this issue at 2021-12-29T19·56+00
Maybe not an issue in cheddar, but a problem in panettone at the very least.

sterni at 2021-12-29T19·57+00
ok, can you adjust the issue title?

zseri at 2021-12-29T19·57+00
sterni changed the subject of this issue from "cheddar: some unicode codepoints are converted to gibberish via the HTTP API" to "panettone: some unicode codepoints are converted to gibberish via the HTTP API" at 2021-12-29T20·55+00
sterni updated the body of this issue at 2021-12-29T20·55+00
The fix is adding a UTF-16 encoder to cl-json. Started to work on this, hopefully upstream is responsive as well and like… merges stuff.

sterni at 2022-05-29T11·05+00
https://github.com/sharplispers/cl-json/pull/12

sterni at 2022-06-18T13·17+00
It seems like upstream is pretty unresponsive, but I think we're using your fixed version?

tazjin at 2023-06-19T22·43+00
sterni closed this issue at 2024-03-14T15·59+00
Yes 🍾

sterni at 2024-03-14T16·00+00