tvix-castore: virtiofs only accepts one client at the same time
It looks like tvix-store virtiofs currently only allows one client.
If a (cloud-hypervisor) VM uses this as a backend and the guest reboots (for example due to a kernel panic with panic=N and N!=0), cloud-hypervisor is not able to reconnect:
[ 0.630928] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100
[ 0.631294] CPU: 0 PID: 75 Comm: switch_root Not tainted 6.7.9 #1-NixOS
[ 0.631603] Hardware name: Cloud Hypervisor cloud-hypervisor, BIOS 0
[ 0.631903] Call Trace:
[ 0.632037] <TASK>
[ 0.632147] dump_stack_lvl+0x47/0x60
[ 0.632340] panic+0x325/0x350
[ 0.632495] do_exit+0x98c/0xb00
[ 0.632657] ? set_ptes.isra.0+0x1e/0xa0
[ 0.632848] do_group_exit+0x31/0x80
[ 0.633022] get_signal+0x9e1/0xa20
[ 0.633196] ? hrtimer_try_to_cancel.part.0+0x50/0xf0
[ 0.633442] arch_do_signal_or_restart+0x3e/0x270
[ 0.633670] exit_to_user_mode_prepare+0x119/0x1e0
[ 0.633902] syscall_exit_to_user_mode+0x1c/0x50
[ 0.634126] do_syscall_64+0x54/0x100
[ 0.634319] entry_SYSCALL_64_after_hwframe+0x6f/0x77
[ 0.634567] RIP: 0033:0x4782b7
[ 0.634720] Code: 8b 44 24 20 b9 40 42 0f 00 f7 f1 48 89 04 24 b8 e8 03 00 00 f7 e2 48 89 44 24 08 48 89 e7 be 00 00 00 00 b8 23 00 00 00 0f 05 <48> 83 c4 10 5d c3 cc cc cc b8 ba 00 00 00 0f 05 89 44 24 08 c3 cc
[ 0.635568] RSP: 002b:000000c000061ef8 EFLAGS: 00000206 ORIG_RAX: 0000000000000023
[ 0.635920] RAX: fffffffffffffdfc RBX: 0000000000000a00 RCX: 00000000004782b7
[ 0.636256] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000c000061ef8
[ 0.636591] RBP: 000000c000061f08 R08: 0000000000000004 R09: 0000000000000001
[ 0.636922] R10: 00007fff3498c080 R11: 0000000000000206 R12: 000000c000061ee0
[ 0.637258] R13: 000000c000050000 R14: 000000c0000064e0 R15: 0000000000002031
[ 0.637592] </TASK>
[ 0.637730] Kernel Offset: disabled
cloud-hypervisor: 60.955199s: <vmm> ERROR:virtio-devices/src/vhost_user/vu_common_ctrl.rs:411 -- Failed connecting the backend after trying for 1 minute: VhostUserProtocol(SocketConnect(Os { code: 2, kind: NotFound, message: "No such file or directory" }))
Independent of whether multiple clients should be allowed at the same time (I think they should), we should definitely support multiple subsequent connections until the tvix-store virtiofs process is terminated.
Assuming the async support in the vhost crate allows this, we might want to use tokio-listener here too. It would be nice to be able to socket-activate the virtiofs socket.
There's an upstream TODO about this: https://github.com/rust-vmm/vhost/blob/e3de13040b951cb6a96bb63d8829af89c38541c3/vhost-user-backend/src/lib.rs#L165-L166
I think we can probably contribute upstream to support reconnections.
cbrewster at 2024-03-18T15·53+00
Ah, actually I think we can fix this without an upstream change. In a loop, we can construct new VhostUserDaemons and call daemon.start with our listener (which calls accept under the hood).
cbrewster at 2024-03-18T15·58+00
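The shape of that loop can be sketched with nothing but the standard library. This is an illustration only and does not use the vhost crates: serve_one_session is a hypothetical stand-in for constructing a fresh VhostUserDaemon, starting it with the listener and waiting for it to finish, and the echo logic and socket path are made up for the example. The point it shows is that one long-lived listener can serve several clients one after another, so a rebooted guest can connect again.

```rust
use std::io::{Read, Write};
use std::os::unix::net::{UnixListener, UnixStream};
use std::thread;

/// Hypothetical stand-in for one daemon lifetime: echo bytes back to a
/// single client until it disconnects, then return the number of bytes seen.
fn serve_one_session(mut stream: UnixStream) -> std::io::Result<u64> {
    let mut buf = [0u8; 1024];
    let mut total = 0u64;
    loop {
        let n = stream.read(&mut buf)?;
        if n == 0 {
            // Client went away (e.g. the guest rebooted); end this session.
            return Ok(total);
        }
        stream.write_all(&buf[..n])?;
        total += n as u64;
    }
}

/// Accept clients one after another on the same listener.
/// `max_sessions` only exists so that the example terminates.
fn serve_sequential(listener: &UnixListener, max_sessions: usize) -> std::io::Result<usize> {
    let mut served = 0;
    while served < max_sessions {
        let (stream, _addr) = listener.accept()?;
        // A failing session should not take the whole service down.
        if let Err(e) = serve_one_session(stream) {
            eprintln!("session ended with error: {e}");
        }
        served += 1;
    }
    Ok(served)
}

fn main() -> std::io::Result<()> {
    // Made-up socket path for the example.
    let path = std::env::temp_dir().join(format!("reconnect-sketch-{}.sock", std::process::id()));
    let _ = std::fs::remove_file(&path);
    let listener = UnixListener::bind(&path)?;

    // Two clients connecting one after the other, like a VM that reboots once.
    let client_path = path.clone();
    let clients = thread::spawn(move || -> std::io::Result<()> {
        for _round in 0..2 {
            let mut stream = UnixStream::connect(&client_path)?;
            stream.write_all(b"ping")?;
            let mut reply = [0u8; 4];
            stream.read_exact(&mut reply)?;
            assert_eq!(&reply, b"ping");
            // Dropping the stream here closes the connection.
        }
        Ok(())
    });

    let served = serve_sequential(&listener, 2)?;
    clients.join().expect("client thread panicked")?;
    std::fs::remove_file(&path)?;
    println!("served {served} sessions");
    Ok(())
}
```

In the real service the loop would presumably run until the process is terminated rather than for a fixed number of sessions, and whether the socket file survives between sessions depends on how the listener is created and dropped.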