//tvix/eval: encoding behaviour?
This issue is to note a major (but maybe not very significant) difference between Tvix and C++ Nix at the moment, so we can discuss it and establish some sort of understanding on how to deal with it.
- Currently we use Rust's `String` (or equivalent) to represent Nix strings, which is (philosophically) a UTF-8 string, requiring the original on-disk representation to have some kind of Unicode encoding. I think this choice is partially forced upon us, since `rnix-parser` seems to make the same assumption that all its input is Unicode-encoded [citation needed]. In C++ Nix, however, strings are C strings, i.e. they are byte sequences that forbid the use of the `NUL` byte. The discrepancy is twofold: on the one hand, Tvix will not accept some valid Nix programs (e.g. https://sterni.lv/tmp/ord-data.nix); on the other hand, behaviour will differ for programs accepted by both implementations, e.g. indexing into (many) strings behaves differently depending on whether you treat them as Unicode codepoint sequences or byte sequences.
- Paths are another topic we need to be mindful of. We are set up to handle this well with Rust's `PathBuf` and `OsString` abstraction, but we currently require literals to be UTF-8 and there are many occasions where a path becomes a string and vice versa (`toString` on paths, the attribute set keys resulting from `builtins.readDir`, …). POSIX file names are arbitrary byte sequences (without `/` and `NUL` bytes IIRC) and we should probably also keep the Windows case in mind, since someone will surely want to port Tvix in the long term.
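To make the path half of the problem concrete, here is a minimal sketch, assuming a Unix target and using only the standard library. `path_to_bytes` is a hypothetical helper for illustration, not something that exists in Tvix. It shows a file name that is legal on POSIX but cannot be held in a UTF-8 string without loss:

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only: exposes the raw bytes of an OsStr
use std::path::Path;

/// Hypothetical helper: copy out the raw bytes of a path, which is what a
/// byte-oriented Nix string could hold without requiring UTF-8.
fn path_to_bytes(path: &Path) -> Vec<u8> {
    path.as_os_str().as_bytes().to_vec()
}

fn main() {
    // 0xFF never occurs in valid UTF-8, but is a legal byte in a POSIX file name.
    let raw: &[u8] = b"/tmp/caf\xFF";
    let path = Path::new(OsStr::from_bytes(raw));

    // A UTF-8-only representation cannot hold this path at all ...
    assert!(path.to_str().is_none());
    // ... and the lossy conversion silently changes its contents.
    assert_eq!(path.to_string_lossy(), "/tmp/caf\u{FFFD}");
    // The byte representation round-trips unchanged.
    assert_eq!(path_to_bytes(path), raw);
}
```

On Windows, `OsString` is not a plain byte sequence, so a port would need a different conversion at this boundary.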
Maybe interesting: https://blog.burntsushi.net/bstr/
sterni at 2022-09-14T14·18+00
Add

    tvix-repl> builtins.substring 0 1 "👩🏽❤️💋👨🏽"
    thread 'main' panicked at 'byte index 1 is not a char boundary; it is inside '👩' (bytes 0..4) of `👩🏽❤️💋👨🏽`', library/core/src/str/mod.rs:127:5
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

to the list of problems.
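The panic comes from slicing a `str` at a byte offset that falls inside a multi-byte character. A small sketch of the mechanism, using only the standard library (`checked_byte_slice` is a hypothetical name, not Tvix code):

```rust
/// Hypothetical helper: slice by byte offsets, returning None instead of
/// panicking when an offset is out of range or not on a char boundary.
fn checked_byte_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}

fn main() {
    let s = "\u{1F469}"; // the first code point of the emoji above, 4 bytes in UTF-8
    assert_eq!(s.len(), 4);
    assert!(!s.is_char_boundary(1));

    // `&s[0..1]` would panic here; `get` reports the problem as None.
    assert_eq!(checked_byte_slice(s, 0, 1), None);
    assert_eq!(checked_byte_slice(s, 0, 4), Some("\u{1F469}"));
}
```

This only avoids the crash. It does not decide what `builtins.substring` should return in that case.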
sterni at 2022-09-21T19·42+00
For the last one, we can probably use something like https://docs.rs/substring/1.4.5/substring/index.html for reasonable behaviour.
However, doing that in Nix yields ... nothing? It's a bit unclear.
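For comparison, a codepoint-based substring can be sketched with the standard library alone. This is a stand-in for the idea, not the API of the linked crate, and `char_substring` is a hypothetical name:

```rust
/// Hypothetical helper: substring counted in Unicode scalar values (Rust `char`s),
/// not in bytes and not in grapheme clusters.
fn char_substring(s: &str, start: usize, len: usize) -> String {
    s.chars().skip(start).take(len).collect()
}

fn main() {
    // Woman emoji, skin tone modifier, then ASCII text.
    let s = "\u{1F469}\u{1F3FD}abc";

    // One `char` is the bare emoji without its modifier: 4 bytes, not 1.
    assert_eq!(char_substring(s, 0, 1), "\u{1F469}");
    assert_eq!(char_substring(s, 2, 2), "ab");
    // Out-of-range starts yield an empty string rather than a panic.
    assert_eq!(char_substring(s, 10, 2), "");
}
```

This never panics, but it counts in different units than a byte-oriented implementation, so results can differ for any non-ASCII input.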
tazjin at 2022-09-23T00·28+00
It yields a string of length 1 containing only the first byte of that emoji in UTF-8 encoding (depending on the locale of course). It is probably non-printable.
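A byte-oriented substring along those lines could look roughly like the following sketch. `byte_substring` is a hypothetical helper; the clamping behaviour is an assumption for illustration, and the error handling of the real builtin (for example negative arguments) is not modelled:

```rust
/// Hypothetical helper: byte-wise substring that clamps to the input length.
fn byte_substring(s: &[u8], start: usize, len: usize) -> &[u8] {
    let start = start.min(s.len());
    let end = start.saturating_add(len).min(s.len());
    &s[start..end]
}

fn main() {
    let s = "\u{1F469}".as_bytes(); // F0 9F 91 A9 in UTF-8

    // The first byte alone is a length-1 result that is not valid UTF-8.
    let first = byte_substring(s, 0, 1);
    assert_eq!(first, &[0xF0u8][..]);
    assert!(std::str::from_utf8(first).is_err());

    // Requests past the end are clamped instead of panicking.
    assert_eq!(byte_substring(s, 2, 100), &[0x91u8, 0xA9][..]);
}
```

The result cannot be stored in a Rust `String`, which is why this approach needs a byte-based string type.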
sterni at 2023-05-30T21·58+00
https://b.tvl.fyi/issues/337 - this is load-bearing for evaluating e.g. `nixpkgs.hello`, it seems
aspen at 2023-12-05T22·07+00
cl/10200 has started work on converting `NixString` to use byte vectors
aspen at 2023-12-05T23·03+00
- aspen closed this issue at 2024-01-31T14·52+00