tvix-castore: virtiofs only accepts one client at the same time

#389
Opened by flokli at 2024-03-18T12·19+00

It looks like tvix-store virtiofs currently only allows one client.

If a (cloud-hypervisor) VM uses this as a backend and the guest reboots (for example due to a kernel panic with panic=N and N!=0), cloud-hypervisor is not able to reconnect:

[    0.630928] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100
[    0.631294] CPU: 0 PID: 75 Comm: switch_root Not tainted 6.7.9 #1-NixOS
[    0.631603] Hardware name: Cloud Hypervisor cloud-hypervisor, BIOS 0 
[    0.631903] Call Trace:
[    0.632037]  <TASK>
[    0.632147]  dump_stack_lvl+0x47/0x60
[    0.632340]  panic+0x325/0x350
[    0.632495]  do_exit+0x98c/0xb00
[    0.632657]  ? set_ptes.isra.0+0x1e/0xa0
[    0.632848]  do_group_exit+0x31/0x80
[    0.633022]  get_signal+0x9e1/0xa20
[    0.633196]  ? hrtimer_try_to_cancel.part.0+0x50/0xf0
[    0.633442]  arch_do_signal_or_restart+0x3e/0x270
[    0.633670]  exit_to_user_mode_prepare+0x119/0x1e0
[    0.633902]  syscall_exit_to_user_mode+0x1c/0x50
[    0.634126]  do_syscall_64+0x54/0x100
[    0.634319]  entry_SYSCALL_64_after_hwframe+0x6f/0x77
[    0.634567] RIP: 0033:0x4782b7
[    0.634720] Code: 8b 44 24 20 b9 40 42 0f 00 f7 f1 48 89 04 24 b8 e8 03 00 00 f7 e2 48 89 44 24 08 48 89 e7 be 00 00 00 00 b8 23 00 00 00 0f 05 <48> 83 c4 10 5d c3 cc cc cc b8 ba 00 00 00 0f 05 89 44 24 08 c3 cc
[    0.635568] RSP: 002b:000000c000061ef8 EFLAGS: 00000206 ORIG_RAX: 0000000000000023
[    0.635920] RAX: fffffffffffffdfc RBX: 0000000000000a00 RCX: 00000000004782b7
[    0.636256] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000c000061ef8
[    0.636591] RBP: 000000c000061f08 R08: 0000000000000004 R09: 0000000000000001
[    0.636922] R10: 00007fff3498c080 R11: 0000000000000206 R12: 000000c000061ee0
[    0.637258] R13: 000000c000050000 R14: 000000c0000064e0 R15: 0000000000002031
[    0.637592]  </TASK>
[    0.637730] Kernel Offset: disabled
cloud-hypervisor: 60.955199s: <vmm> ERROR:virtio-devices/src/vhost_user/vu_common_ctrl.rs:411 -- Failed connecting the backend after trying for 1 minute: VhostUserProtocol(SocketConnect(Os { code: 2, kind: NotFound, message: "No such file or directory" }))

Independent of whether multiple clients should be allowed at the same time (I think they should), we should definitely support multiple subsequent connections until the tvix-store virtiofs process is terminated.

Assuming the async support in the vhost crate allows this, we might want to use tokio-listener here too. It would be nice to be able to socket-activate the virtiofs socket.

  1. There's an upstream TODO about this: https://github.com/rust-vmm/vhost/blob/e3de13040b951cb6a96bb63d8829af89c38541c3/vhost-user-backend/src/lib.rs#L165-L166

    I think we can probably contribute upstream to support reconnections.

    cbrewster at 2024-03-18T15·53+00

  2. Ah actually we can fix this without an upstream change I think. In a loop, we can construct new VhostUserDaemons and call daemon.start with our listener (which calls accept under the hood).

    cbrewster at 2024-03-18T15·58+00
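
A minimal sketch of the loop shape suggested above, using only the standard library. It does not use the vhost crates: `serve_session` is a hypothetical stand-in for "construct a new VhostUserDaemon and call daemon.start with our listener", and it just echoes bytes until the client disconnects. `serve_sequential` and `max_sessions` are likewise illustrative names (a real daemon would presumably loop until it is terminated). The point is only that the listener is bound once and stays bound across sessions, so a rebooted guest can connect again.

```rust
use std::io::{self, Read, Write};
use std::os::unix::net::{UnixListener, UnixStream};

/// Hypothetical stand-in for one vhost-user session. In tvix-store this
/// would be the place to build a fresh daemon and run it to completion;
/// here we only echo bytes back until the client closes its side.
fn serve_session(mut stream: UnixStream) -> io::Result<u64> {
    let mut buf = [0u8; 4096];
    let mut total = 0u64;
    loop {
        let n = stream.read(&mut buf)?;
        if n == 0 {
            // Client disconnected (e.g. the guest rebooted).
            return Ok(total);
        }
        stream.write_all(&buf[..n])?;
        total += n as u64;
    }
}

/// Serves clients one after another on an already-bound listener.
/// The listener is never dropped between sessions, so the socket path
/// keeps existing while we wait for the next client.
fn serve_sequential(listener: &UnixListener, max_sessions: usize) -> io::Result<usize> {
    let mut served = 0;
    while served < max_sessions {
        let (stream, _addr) = listener.accept()?;
        // A failed session should not take the listener down with it.
        if let Err(e) = serve_session(stream) {
            eprintln!("session ended with error: {e}");
        }
        served += 1;
    }
    Ok(served)
}
```

With this shape, a caller binds the socket once (`UnixListener::bind(path)`) and hands a reference to `serve_sequential`. This only covers subsequent connections; serving multiple clients at the same time would need a task or thread per accepted connection.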