3p/josh: unknown Rust issue is causing josh crashes

#283
Opened by tazjin at 2023-07-02T13·37+00

After the channel bump in cl/8855, executions of josh-filter in depot push tasks (i.e. mirroring of repository parts to GitHub) started failing with the following error:

Filtering depot through :workspace=views/tvix
memory allocation of 94709859315041 bytes failed
/nix/store/rxy8nsmlqh0pp09fkz5rpjccqp35nxnn--workspace=views-tvix-push: line 6: 3736962 Aborted                 (core dumped) josh-filter ':workspace=views/tvix'

An example build is this: https://buildkite.com/tvl/depot/builds/25562#0189166c-6b2a-4c3a-ab39-53d71f30d8bc

The issue occurs somewhere within josh when invoking gix, the Rust git implementation:

Thread 1 "josh-filter" received signal SIGABRT, Aborted.
0x00007ffff7c20a8c in __pthread_kill_implementation ()
   from /nix/store/wpgrc564ys39vbyv0m50qxmq8dvhi7cc-glibc-2.37-8/lib/libc.so.6
(gdb) bt
#0  0x00007ffff7c20a8c in __pthread_kill_implementation ()
   from /nix/store/wpgrc564ys39vbyv0m50qxmq8dvhi7cc-glibc-2.37-8/lib/libc.so.6
#1  0x00007ffff7bd1c86 in raise () from /nix/store/wpgrc564ys39vbyv0m50qxmq8dvhi7cc-glibc-2.37-8/lib/libc.so.6
#2  0x00007ffff7bbb8ba in abort () from /nix/store/wpgrc564ys39vbyv0m50qxmq8dvhi7cc-glibc-2.37-8/lib/libc.so.6
#3  0x0000555555a69db7 in std::sys::unix::abort_internal::hf730997e9ccd1e2e ()
#4  0x00005555555e0b16 in std::process::abort::h5e94f7436e771820 ()
#5  0x0000555555a53f7b in std::alloc::rust_oom::h3e180efaa38ee02c ()
#6  0x0000555555a53f86 in __rg_oom ()
#7  0x000055555568eab6 in __rust_alloc_error_handler ()
#8  0x00005555556b3096 in alloc::alloc::handle_alloc_error::rt_error::h08f5902a73aba60c ()
#9  0x00005555555b24d6 in alloc::alloc::handle_alloc_error::h5b1e66e89806f984 ()
#10 0x00005555556b2082 in <&str as alloc::ffi::c_str::CString::new::SpecNewImpl>::spec_new_impl::ha745a009218ad9d4 ()
#11 0x00005555555df58b in std::sys::common::small_c_string::run_with_cstr_allocating::hb741e351232c3b55 ()
#12 0x00005555555e9ca4 in std::fs::metadata<&std::path::Path> (path=...)
    at /build/rustc-1.70.0-src/library/std/src/fs.rs:1847
#13 std::path::Path::metadata (self=...) at /build/rustc-1.70.0-src/library/std/src/path.rs:2727
#14 gix_discover::is::git<&std::path::PathBuf> (git_dir=<optimized out>) at /sources/gix-discover-0.18.1/src/is.rs:43
#15 0x00005555555eb53e in gix::types::ThreadSafeRepository::open_opts<&std::path::Path> (path=..., options=...)
    at /sources/gix-0.44.1/src/open/repository.rs:63
#16 gix::types::ThreadSafeRepository::open<&std::path::Path> (path=...) at /sources/gix-0.44.1/src/open/repository.rs:46
#17 0x0000555555685871 in josh_filter::run_filter (args=...) at josh-filter/src/bin/josh-filter.rs:155
#18 0x000055555568d165 in josh_filter::main () at josh-filter/src/bin/josh-filter.rs:445
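
One way to read frames #10 to #12, offered as an interpretation rather than a confirmed diagnosis: `std::fs::metadata` has to turn the path into a NUL-terminated C string before calling into libc, and `run_with_cstr_allocating` is the heap-allocating fallback std uses when the path does not fit its small on-stack buffer. `CString::new` then asks for roughly "path length + 1" bytes, so a request of 94709859315041 bytes suggests the length it was handed was garbage, not that a path was really that long. The sketch below compares the two sizes. The `.git` path is a hypothetical example built from the reproduction directory, and `cstring_bytes_for` is a helper invented for illustration.

```rust
use std::ffi::CString;

/// Size from the "memory allocation of ... bytes failed" message above.
const FAILED_ALLOC_BYTES: u64 = 94_709_859_315_041;

/// Bytes that `CString::new` needs for a path: the path bytes plus one
/// trailing NUL. (Illustrative helper, not part of josh or gix.)
fn cstring_bytes_for(path: &str) -> usize {
    CString::new(path)
        .expect("path must not contain interior NUL bytes")
        .as_bytes_with_nul()
        .len()
}

fn main() {
    // Hypothetical path; the real argument is not visible in the backtrace.
    let path = "/var/lib/buildkite-agent-whitby-2/builds/whitby-2/tvl/depot/.git";
    println!("{} needs {} bytes", path, cstring_bytes_for(path));

    // Printed in hex, the failed request looks more like a pointer-sized
    // value than a string length.
    println!(
        "failed request: {:#x} bytes (about {} TiB)",
        FAILED_ALLOC_BYTES,
        FAILED_ALLOC_BYTES >> 40
    );
}
```

If that reading is right, the interesting question is where the path's length was corrupted before it reached `CString::new`, which the backtrace alone does not show.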

The channel bump moved Rust from 1.69 to 1.70. In an experimental CL (cl/8917) we have seen that moving back to Rust 1.69 makes the issue disappear. This may be a problem either in Rust itself or in one of the dependencies in the build closure of Rust or Cargo.

Relevant other changes:

The next step is to try building with Rust 1.69 from the rust-overlay, and to move that workaround into canon for now while we debug this further.

The issue can be reproduced on whitby by acting as a buildkite-agent user, for example:

sudo -u buildkite-agent-whitby-2 bash -c 'cd /var/lib/buildkite-agent-whitby-2/builds/whitby-2/tvl/depot && /nix/store/fjxq2lcg1qsydz4dfk3kz2fkz79bqlls-rust-workspace-unknown/bin/josh-filter ":/nix/nix-1p"'

We might want to bisect nixpkgs against this to see what is going on.

  1. cl/8917 is submitted and provides a workaround by pinning the build to Rust 1.69.0, but does not solve the root cause yet

    tazjin at 2023-07-02T16·40+00

  2. cl/9590 bumped the rustc version, and we no longer see these josh crashes. This can be closed.

    flokli at 2023-11-15T22·31+00

  3. flokli closed this issue at 2023-11-15T22·31+00