panettone: some unicode codepoints are converted to gibberish via the HTTP API

#145
Opened by sterni at 2021-09-08T15·30+00

Recreation of what panettone requests from cheddar:

$ curl -H 'Content-Type: application/json' -d "{ \"markdown\": \"\\u1F570\\uFE0F\" }" localhost:4238/markdown
{"markdown":"<p>ὗ0️</p>\n"}

Not sure if this may be a general JSON problem as jq exhibits the same behavior:

$ echo "\"\\u1F570\\uFE0F\"" | jq
"ὗ0️"

The code point is at least correct.

  1. JSON RFC (https://www.ietf.org/rfc/rfc4627.txt), page 3:

    Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point. The hexadecimal letters A though F can be upper or lowercase. So, for example, a string containing only a single reverse solidus character may be represented as "\u005C".

    [...]

    To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".

    So in this case the unicode escape can't contain 5 hex digits; the character must instead be encoded as a UTF-16 surrogate pair, i.e. as 2 unicode escapes.

    zseri at 2021-12-27T21·39+00

  2. That means the "\u1F570\uFE0F" string is interpreted as U+1F57 U+0030 U+FE0F, which is correct according to the JSON RFC.

    zseri at 2021-12-29T19·49+00

  3. zseri closed this issue at 2021-12-29T19·49+00
  4. sterni reopened this issue at 2021-12-29T19·56+00
  5. Maybe not an issue in cheddar, but a problem in panettone at the very least.

    sterni at 2021-12-29T19·57+00

  6. ok, can you adjust the issue title?

    zseri at 2021-12-29T19·57+00

  7. sterni changed the subject of this issue from "cheddar: some unicode codepoints are converted to gibberish via the HTTP API" to "panettone: some unicode codepoints are converted to gibberish via the HTTP API" at 2021-12-29T20·55+00
  8. sterni updated the body of this issue at 2021-12-29T20·55+00
  9. The fix is adding a UTF-16 encoder to cl-json. Started to work on this, hopefully upstream is responsive as well and like… merges stuff.

    sterni at 2022-05-29T11·05+00

  10. https://github.com/sharplispers/cl-json/pull/12

    sterni at 2022-06-18T13·17+00
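
For reference, a minimal sketch of the surrogate-pair encoding discussed in comments 1 and 2. It is written in Python purely for illustration and is not the cl-json change from the linked pull request; the helper name json_escape is hypothetical. It escapes every code point, splitting anything outside the Basic Multilingual Plane into a UTF-16 surrogate pair as the RFC describes, and then parses the 5-digit escape from the issue body to show how it gets split.

```python
import json


def json_escape(ch: str) -> str:
    """Return the JSON \\uXXXX escape(s) for a single code point.

    Hypothetical helper for illustration only.
    """
    cp = ord(ch)
    if cp <= 0xFFFF:
        # Basic Multilingual Plane: one four-digit escape is enough.
        return f"\\u{cp:04X}"
    # Outside the BMP: split into a UTF-16 surrogate pair.
    cp -= 0x10000
    high = 0xD800 + (cp >> 10)    # top 10 bits
    low = 0xDC00 + (cp & 0x3FF)   # bottom 10 bits
    return f"\\u{high:04X}\\u{low:04X}"


# The string from the issue body: U+1F570 followed by U+FE0F.
clock = "\U0001F570\uFE0F"
escaped = '"' + "".join(json_escape(c) for c in clock) + '"'
print(escaped)                       # "\uD83D\uDD70\uFE0F"
print(json.loads(escaped) == clock)  # True

# The escape as originally sent: \u consumes only four hex digits,
# so the trailing "0" is parsed as a literal character.
misparsed = json.loads('"\\u1F570\\uFE0F"')
print([f"U+{ord(c):04X}" for c in misparsed])
# ['U+1F57', 'U+0030', 'U+FE0F']
```

With the surrogate pair, the curl request from the issue body would carry "\uD83D\uDD70\uFE0F" instead of "\u1F570\uFE0F".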