builtins.split loops endlessly
nix-repl> builtins.split "(.*)" "abc" [ "" [ ... ] "" [ ... ] "" ]
vs
tvix-repl> builtins.split "(.*)" "abc" (runs forever)
Also reproducible with
builtins.split "([abc]*)" "abc"
Causes b/368.
Most minimal repro:
builtins.split ".*" ""
I think the problem is that you can always match the empty string against a 0-or-more match, making the
while let Some(…
loop in our implementation spin.profpatsch at 2024-05-26T10·35+00
i think the problem might be in the regex crate. i rewrote it to use a different regex implementation and it no longer has this problem, with the output exactly matching that of nix... unless you use capture groups, because i accidentally wrote it to use vanilla posix regular expressions instead of extended posix regular expressions.
will have to rewrite again to use the pcre2 crate.
#[builtin("split")] async fn builtin_split(co: GenCo, regex: Value, str: Value) -> Result<Value, ErrorKind> { // use C-compatible regular expression implementation. // this doesn't support unicode, but neither does nix. use posix_regex::compile::PosixRegexBuilder; if str.is_catchable() { return Ok(str); } if regex.is_catchable() { return Ok(regex); } let s = str.to_contextful_str()?; let text = s.as_bytes(); let re = regex.to_str()?; let re = PosixRegexBuilder::new(re.as_bytes()).compile().unwrap(); //let mut capture_locations = re.capture_locations(); //let num_captures = capture_locations.len(); let mut ret = imbl::Vector::new(); let mut pos = 0; for capture_locations in re.matches(text, None) { let Some((match_start, match_end)) = capture_locations[0] else { todo!() }; let num_captures = capture_locations.len()-1; dbg!(pos, text, match_start, match_end); // push the unmatched characters preceding the match ret.push_back(Value::from(NixString::new_inherit_context_from( &s, &text[pos..match_start], ))); // Push a list with one element for each capture // group in the regex, containing the characters // matched by that capture group, or null if no match. // We skip capture 0; it represents the whole match. let v: imbl::Vector<Value> = (1..num_captures) .map(|i| capture_locations.get(i)) .map(|o| { o.map(Option::as_ref).flatten().map(|&(start, end)| { // Here, a surprising thing happens: we silently discard the original // context. This is as intended, Nix does the same. Value::from(&text[start..end]) }) .unwrap_or(Value::Null) }) .collect(); ret.push_back(Value::List(NixList::from(v))); pos = match_end; } // push the unmatched characters following the last match // Here, a surprising thing happens: we silently discard the original // context. This is as intended, Nix does the same. ret.push_back(Value::from(&text[pos..])); Ok(Value::List(NixList::from(ret))) }
tebi at 2024-06-02T03·45+00
nvm, i was greatly overcomplicating things.
diff --git a/eval/src/builtins/mod.rs b/eval/src/builtins/mod.rs index 5cd94bc..d315cb3 100644 --- a/eval/src/builtins/mod.rs +++ b/eval/src/builtins/mod.rs @@ -1286,6 +1286,9 @@ mod pure_builtins { }) .collect(); ret.push_back(Value::List(NixList::from(v))); + if pos == text.len() { + break; + } pos = thematch.end(); }
i still think switching to pcre2 would be a good idea.
tebi at 2024-06-02T04·37+00
I queued that fix in https://cl.tvl.fyi/c/depot/+/11743, thanks!
pcre2
pulls in an external non-rust dependency, which gets ugly quite quickly for more exotic targets like WASM (used in tvixbolt), so I'd rather change this only if we really need to.flokli at 2024-06-02T18·11+00
- flokli closed this issue at 2024-06-15T20·13+00