tvix: spllit fetch_url into multiple components to decouple naming

#390
Opened by flokli at 2024-04-02T09·44+00

At least builtin:fetchurl supports both outputHashMode: recursive as well as outputHashMode: flat.

This affects how a store path is named, however TvixStoreIO::fetch_url currently assumes outputHashMode: flat.

We should split fetching, ingesting and naming into different functions, and then call them from the various builtin implementations / the builtin:fetchurl handler.

1.) plain HTTP fetcher code, returning a stream of bytes (or AsyncRead, whatever makes more sense in that case).

2.) Ingestion code: a) tarball usecase: consumes the tarball stream and populates blobservice / directoryservice as it goes through the data. Returns a (unnamed) node::DirectoryNode pointing to the root Directory b) plain ingestion usecase: consumes the raw bytes and inserts them into the blobservice. Returns a (unnamed) node::FileNode pointing to the blake3 digest of the root node.

3.) PathInfo generation / naming: Should live in the builtin. Depending on the hash mode, context (in some cases) etc. a StorePath is calculated. We currently have TvixStoreIO::node_to_path_info for this, and can probably just call it from more places, rather than duplicating some of the logic there.


Explicitly left as futurework: Doing any of the NAR / Sha256 calculations during ingestion. We can always plumb this on later.


An interesting side note: both the tarball usecase, as well as the "filtered ingest" usecase used in builtins.path / builtins.filterSource retrieve a bunch of files to ingest into the blobservice, keep track of their attributes and synthesize a merkle directory structure, which is uploaded them when finished, returning a root node pointing to the data.

At some point (probably not during this refactor), it might make sense to distill a general interface out of this, put it into tvix-castore, and update some code to use this, rather than individual implementations over and over again. We'll need similar tracking logic for builtins.fetchgit too.

I had some initial thoughts on this a while ago with Linus (see cl/9606), but that's been way too early.

  1. cl/11462 created an ingest_entries function, accepting a stream of ingestion entries.

    Cleanups to the fetchers happened in in cl/11500 and cl/11501.

    flokli at 2024-04-22T21·14+00

  2. flokli closed this issue at 2024-04-24T13·18+00