tvix: spllit fetch_url into multiple components to decouple naming
At least builtin:fetchurl
supports both outputHashMode: recursive
as well as outputHashMode: flat
.
This affects how a store path is named, however TvixStoreIO::fetch_url
currently assumes outputHashMode: flat
.
We should split fetching, ingesting and naming into different functions, and then call them from the various builtin implementations / the builtin:fetchurl
handler.
1.) plain HTTP fetcher code, returning a stream of bytes (or AsyncRead, whatever makes more sense in that case).
2.) Ingestion code:
a) tarball usecase: consumes the tarball stream and populates blobservice / directoryservice as it goes through the data. Returns a (unnamed) node::DirectoryNode pointing to the root Directory
b) plain ingestion usecase: consumes the raw bytes and inserts them into the blobservice. Returns a (unnamed) node::FileNode pointing to the blake3 digest of the root node.
3.) PathInfo generation / naming: Should live in the builtin. Depending on the hash mode, context (in some cases) etc. a StorePath is calculated. We currently have TvixStoreIO::node_to_path_info
for this, and can probably just call it from more places, rather than duplicating some of the logic there.
Explicitly left as futurework: Doing any of the NAR / Sha256 calculations during ingestion. We can always plumb this on later.
An interesting side note: both the tarball usecase, as well as the "filtered ingest" usecase used in builtins.path
/ builtins.filterSource
retrieve a bunch of files to ingest into the blobservice, keep track of their attributes and synthesize a merkle directory structure, which is uploaded them when finished, returning a root node pointing to the data.
At some point (probably not during this refactor), it might make sense to distill a general interface out of this, put it into tvix-castore, and update some code to use this, rather than individual implementations over and over again. We'll need similar tracking logic for builtins.fetchgit
too.
I had some initial thoughts on this a while ago with Linus (see cl/9606), but that's been way too early.
cl/11462 created an
ingest_entries
function, accepting a stream of ingestion entries.Cleanups to the fetchers happened in in cl/11500 and cl/11501.
flokli at 2024-04-22T21·14+00
- flokli closed this issue at 2024-04-24T13·18+00