panettone: some unicode codepoints are converted to gibberish via the HTTP API
Recreation of what panettone requests from cheddar:
$ curl -H 'Content-Type: application/json' -d "{ \"markdown\": \"\\u1F570\\uFE0F\" }" localhost:4238/markdown
{"markdown":"<p>ὗ0️</p>\n"}
Not sure if this may be a general JSON problem as jq exhibits the same behavior:
$ echo "\"\\u1F570\\uFE0F\"" | jq
"ὗ0️"
The code point is at least correct.
JSON RFC (https://www.ietf.org/rfc/rfc4627.txt), page 3:
Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point. The hexadecimal letters A though F can be upper or lowercase. So, for example, a string containing only a single reverse solidus character may be represented as "\u005C".
[...]
To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".
So, in this case, the Unicode escape can't contain 5 hex digits; the code point must instead be encoded as a UTF-16 surrogate pair, i.e. as 2 Unicode escapes.
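For illustration, the surrogate pair arithmetic from the RFC can be sketched in a few lines of Python. This is only a sketch of the encoding rule, not the code used in panettone or cl-json; the function name json_unicode_escape is made up for this example.

```python
def json_unicode_escape(code_point: int) -> str:
    """Return the JSON \\u escape(s) for a single Unicode code point.

    Hypothetical helper for illustration; not part of any library.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates are not valid scalar values")
    if code_point <= 0xFFFF:
        # Basic Multilingual Plane: one escape with four hex digits.
        return f"\\u{code_point:04X}"
    # Outside the BMP: split the 20-bit offset into a UTF-16 surrogate pair.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # lower 10 bits
    return f"\\u{high:04X}\\u{low:04X}"


print(json_unicode_escape(0x1D11E))  # \uD834\uDD1E (G clef, as in the RFC)
print(json_unicode_escape(0x1F570))  # \uD83D\uDD70
```

With this, the request body above would need to contain "\uD83D\uDD70\uFE0F" rather than "\u1F570\uFE0F".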
zseri at 2021-12-27T21·39+00
That means the "\u1F570\uFE0F" string is interpreted as U+1F57 U+0030 U+FE0F, which is correct according to the JSON RFC: the parser takes exactly four hex digits after \u and treats the trailing 0 as a literal character.
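This can be checked against another JSON parser. The snippet below uses Python's standard json module purely as an independent reference; it is not what cheddar or panettone run.

```python
import json

# Five hex digits after \u: only the first four belong to the escape.
wrong = json.loads('"\\u1F570\\uFE0F"')
# Surrogate pair for U+1F570, followed by the variation selector U+FE0F.
right = json.loads('"\\uD83D\\uDD70\\uFE0F"')

wrong_points = [f"U+{ord(c):04X}" for c in wrong]
right_points = [f"U+{ord(c):04X}" for c in right]

print(wrong_points)  # ['U+1F57', 'U+0030', 'U+FE0F']
print(right_points)  # ['U+1F570', 'U+FE0F']
```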
zseri at 2021-12-29T19·49+00
- zseri closed this issue at 2021-12-29T19·49+00
- sterni reopened this issue at 2021-12-29T19·56+00
Maybe not an issue in cheddar, but a problem in panettone at the very least.
sterni at 2021-12-29T19·57+00
ok, can you adjust the issue title?
zseri at 2021-12-29T19·57+00
- sterni changed the subject of this issue from "cheddar: some unicode codepoints are converted to gibberish via the HTTP API" to "panettone: some unicode codepoints are converted to gibberish via the HTTP API" at 2021-12-29T20·55+00
- sterni updated the body of this issue at 2021-12-29T20·55+00
The fix is adding a UTF-16 encoder to cl-json. Started to work on this, hopefully upstream is responsive as well and like… merges stuff.
sterni at 2022-05-29T11·05+00
https://github.com/sharplispers/cl-json/pull/12
sterni at 2022-06-18T13·17+00
It seems like upstream is pretty unresponsive, but I think we're using your fixed version?
tazjin at 2023-06-19T22·43+00
- sterni closed this issue at 2024-03-14T15·59+00
Yes 🍾
sterni at 2024-03-14T16·00+00