builtins.split loops endlessly

#402
Opened by flokli at 2024-05-23T17·02+00

nix-repl> builtins.split "(.*)" "abc" 
[ "" [ ... ] "" [ ... ] "" ]

vs

tvix-repl> builtins.split "(.*)" "abc"
(runs forever)

Also reproducible with

builtins.split "([abc]*)" "abc"

Causes b/368.

  1. Most minimal repro:

    builtins.split ".*" ""
    

    I think the problem is that you can always match the empty string against a 0-or-more match, making the while let Some(… loop in our implementation spin.

    profpatsch at 2024-05-26T10·35+00

  2. i think the problem might be in the regex crate. i rewrote it to use a different regex implementation and it no longer has this problem, with the output exactly matching that of nix... unless you use capture groups, because i accidentally wrote it to use vanilla posix regular expressions instead of extended posix regular expressions.

    will have to rewrite again to use the pcre2 crate.

        #[builtin("split")]
        async fn builtin_split(co: GenCo, regex: Value, str: Value) -> Result<Value, ErrorKind> {
    		// use C-compatible regular expression implementation.
    		// this doesn't support unicode, but neither does nix.
    		use posix_regex::compile::PosixRegexBuilder;
            if str.is_catchable() {
                return Ok(str);
            }
    
            if regex.is_catchable() {
                return Ok(regex);
            }
    
            let s = str.to_contextful_str()?;
            let text = s.as_bytes();
            let re = regex.to_str()?;
            let re = PosixRegexBuilder::new(re.as_bytes()).compile().unwrap();
            //let mut capture_locations = re.capture_locations();
            //let num_captures = capture_locations.len();
            let mut ret = imbl::Vector::new();
            let mut pos = 0;
    
    		for capture_locations in re.matches(text, None) {
    			let Some((match_start, match_end)) = capture_locations[0] else { todo!() };
    			let num_captures = capture_locations.len()-1;
    			dbg!(pos, text, match_start, match_end);
                // push the unmatched characters preceding the match
                ret.push_back(Value::from(NixString::new_inherit_context_from(
                    &s,
                    &text[pos..match_start],
                )));
    
                // Push a list with one element for each capture
                // group in the regex, containing the characters
                // matched by that capture group, or null if no match.
                // We skip capture 0; it represents the whole match.
                let v: imbl::Vector<Value> = (1..num_captures)
                    .map(|i| capture_locations.get(i))
                    .map(|o| {
                        o.map(Option::as_ref).flatten().map(|&(start, end)| {
                            // Here, a surprising thing happens: we silently discard the original
                            // context. This is as intended, Nix does the same.
                            Value::from(&text[start..end])
                        })
                        .unwrap_or(Value::Null)
                    })
                    .collect();
                ret.push_back(Value::List(NixList::from(v)));
                pos = match_end;
            }
    
            // push the unmatched characters following the last match
            // Here, a surprising thing happens: we silently discard the original
            // context. This is as intended, Nix does the same.
            ret.push_back(Value::from(&text[pos..]));
    
            Ok(Value::List(NixList::from(ret)))
        }
    

    tebi at 2024-06-02T03·45+00

  3. nvm, i was greatly overcomplicating things.

    diff --git a/eval/src/builtins/mod.rs b/eval/src/builtins/mod.rs
    index 5cd94bc..d315cb3 100644
    --- a/eval/src/builtins/mod.rs
    +++ b/eval/src/builtins/mod.rs
    @@ -1286,6 +1286,9 @@ mod pure_builtins {
                     })
                     .collect();
                 ret.push_back(Value::List(NixList::from(v)));
    +			if pos == text.len() {
    +				break;
    +			}
                 pos = thematch.end();
             }
    

    i still think switching to pcre2 would be a good idea.

    tebi at 2024-06-02T04·37+00

  4. I queued that fix in https://cl.tvl.fyi/c/depot/+/11743, thanks!

    pcre2 pulls in an external non-rust dependency, which gets ugly quite quickly for more exotic targets like WASM (used in tvixbolt), so I'd rather change this only if we really need to.

    flokli at 2024-06-02T18·11+00

  5. flokli closed this issue at 2024-06-15T20·13+00