chaining data loaders #1522

Fil · 2024-07-17T10:30:10Z

Implemented as a FILE_SERVER environment variable passed to data loaders, which lets them query an asset server.

For instance, a bash data loader mags.txt.sh can query another file (say, quakes.json) by calling:

curl "${FILE_SERVER}quakes.json" | jq '.features[].properties.mag'

and a JavaScript data loader will call:

fetch(`${process.env.FILE_SERVER}quakes.json`).then((response) => response.json()) (etc.)
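A slightly fuller sketch of the JavaScript case, for illustration only. It assumes Node 18+ (global fetch), a FILE_SERVER value that can be prefixed directly to the file path as in the curl example, and a quakes.json shaped like the jq filter implies (features[].properties.mag). The file name mags.txt.js and the helper names are hypothetical.

```javascript
// mags.txt.js: hypothetical data loader deriving magnitudes from quakes.json.

// Build the URL of a dependency from the FILE_SERVER base.
// ASSUMPTION: the base can be prefixed to the path without a separator.
function dependencyUrl(path, base = process.env.FILE_SERVER) {
  if (!base) throw new Error("FILE_SERVER is not set");
  return `${base}${path}`;
}

// Extract one magnitude per feature (same selection as the jq filter).
function magnitudes(collection) {
  return collection.features.map((feature) => feature.properties.mag);
}

async function main() {
  const response = await fetch(dependencyUrl("quakes.json"));
  // Fail loudly on a 404 instead of parsing an empty body.
  if (!response.ok) throw new Error(`fetch failed: ${response.status}`);
  const quakes = await response.json();
  // Data loaders write their output to stdout.
  process.stdout.write(magnitudes(quakes).join("\n") + "\n");
}

// Only run when invoked with FILE_SERVER set, i.e. as a data loader.
if (process.env.FILE_SERVER) {
  main().catch((error) => {
    console.error(error);
    process.exitCode = 1;
  });
}
```

Checking response.ok plays the same role as curl -f in the shell version: a missing dependency makes the loader fail rather than produce an empty file.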

In preview, when quakes.json is updated, mags.txt gets updated. If quakes.json is in fact generated by a data loader quakes.json.sh, then touching that script live-updates mags.txt. (The interpreter used by any of the data loaders is irrelevant: Python can talk to TypeScript, and vice versa.)

We track the dependency "graph" on disk: each file generated by a data loader (say filename.json) gets a companion file src/.observablehq/cache/filename.json__dependencies, which is a simple list of the paths it requested from its FILE_SERVER.
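To make the on-disk layout concrete, here is a minimal sketch of reading such a list back. The description only says "a simple list", so the one-path-per-line format is an assumption, and the helper names are hypothetical rather than taken from the PR.

```javascript
const {join} = require("node:path");
const {readFile} = require("node:fs/promises");

// Path of the dependency list for a generated file, following the
// src/.observablehq/cache/<file>__dependencies convention.
function dependenciesPath(file, cacheRoot = join("src", ".observablehq", "cache")) {
  return join(cacheRoot, `${file}__dependencies`);
}

// Parse the list. ASSUMPTION: one requested path per line.
function parseDependencies(text) {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

// Read the dependencies of a generated file; a missing list means none are known.
async function readDependencies(file, cacheRoot) {
  try {
    return parseDependencies(await readFile(dependenciesPath(file, cacheRoot), "utf8"));
  } catch (error) {
    if (error.code === "ENOENT") return [];
    throw error;
  }
}
```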

TODO:

closes #332

To test this easily within the preview server:

cp test/input/build/chain/chain*.* docs/

then open http://127.0.0.1:3000/chain

You can replace 3 in chain-source.json.sh by $RANDOM if you want to check that the source data loader runs only once for its two dependents.

Old description

We must use the same file server for browser preview and machine calls (i.e. chained data loaders), because concurrent requests on the same data loader must be joined.
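"Joined" here means that a second request arriving while a loader is still running waits for that run instead of starting another. A minimal sketch of the idea, with hypothetical names (this is not the code in the PR):

```javascript
// In-flight runs, keyed by data loader path.
const pending = new Map();

// Run `load` for `path`, or join the run already in progress for that path.
function joined(path, load) {
  let promise = pending.get(path);
  if (!promise) {
    promise = Promise.resolve()
      .then(load)
      // Forget the run once it settles so that a later request starts a new one.
      .finally(() => pending.delete(path));
    pending.set(path, promise);
  }
  return promise;
}
```

Two callers that request the same path at the same time share one promise, so the loader runs once and both receive the same result (or the same error).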

But the paths are different. The data loader for caller.csv receives a $SERVER environment variable equal to
http://127.0.0.1:3000/_chain/caller.csv::, and might retrieve a dependency by concatenating that variable and the file path it needs, e.g. calling $SERVER/dependency.zip.

This should make it possible to derive the dependency graph, at least after we run the data loaders (not sure how to maintain state when we restart a server and we have a cache). Also, if the file is not found, we send an empty 404 instead of the decorated page intended for the browser; this makes it a bit more foolproof (tip: use curl -f to fail on 404).
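As an illustration of how the graph could be derived from that path layout, the sketch below splits a request path of the form /_chain/caller.csv::/dependency.zip into the requesting loader and the requested file. The /_chain/ prefix and the :: separator come from the description above; the parsing approach and the function names are assumptions.

```javascript
// Split "/_chain/caller.csv::/dependency.zip" into caller and dependency.
// Returns null when the path does not follow that layout.
function parseChainPath(pathname) {
  const prefix = "/_chain/";
  if (!pathname.startsWith(prefix)) return null;
  const rest = pathname.slice(prefix.length);
  const index = rest.indexOf("::");
  if (index < 0) return null;
  return {
    caller: rest.slice(0, index),
    // Drop the slash(es) introduced by concatenating "$SERVER" + "/file".
    dependency: rest.slice(index + 2).replace(/^\/+/, "")
  };
}

// Record one edge of the dependency graph (a Map from caller to a Set of files).
function recordDependency(graph, pathname) {
  const parsed = parseChainPath(pathname);
  if (!parsed) return false;
  if (!graph.has(parsed.caller)) graph.set(parsed.caller, new Set());
  graph.get(parsed.caller).add(parsed.dependency);
  return true;
}
```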

The current server information is saved as a global (in process.env—should it be globalThis?) for now. It feels a bit wrong, but at the same time it really is a global state.

mbostock · 2024-08-29T01:23:18Z

This seems like the right direction. 👍

I’m thinking of going a step further and spinning up a temporary “asset server” each time we invoke a data loader, so that each data loader gets a separate port for making requests to load assets/dependent files. I think that’s cleaner than using a prefixed path on the preview server, means build and preview follow the same code path, and anyway it should be cheap to expose a little asset server port temporarily while running a data loader (since there’s no initialization other than opening the port). And the asset server can further restrict what each data loader has access to (only files, not pages). And we can give it clean paths.
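A rough sketch of what such a temporary asset server could look like in Node, to show how little setup is involved. Everything here is an assumption made for illustration: the function names, binding to 127.0.0.1, serving files straight from a root directory, and handing the resulting URL to the loader as FILE_SERVER.

```javascript
const http = require("node:http");
const {readFile} = require("node:fs/promises");
const {resolve, sep} = require("node:path");

// Start a file-only server on an ephemeral port (port 0 lets the OS pick one).
// Resolves with the base URL for the data loader and a close() function.
function startAssetServer(root) {
  const base = resolve(root);
  const server = http.createServer(async (request, response) => {
    try {
      const {pathname} = new URL(request.url, "http://localhost");
      const file = resolve(base, "." + decodeURIComponent(pathname));
      // Refuse anything that escapes the root directory.
      if (!file.startsWith(base + sep)) throw new Error("outside root");
      response.end(await readFile(file));
    } catch {
      response.statusCode = 404; // empty 404, no decorated page
      response.end();
    }
  });
  return new Promise((resolveServer, reject) => {
    server.once("error", reject);
    server.listen(0, "127.0.0.1", () => {
      const {port} = server.address();
      const close = () =>
        new Promise((done) => {
          server.close(done);
          server.closeAllConnections?.(); // drop keep-alive sockets (Node 18.2+)
        });
      resolveServer({url: `http://127.0.0.1:${port}/`, close});
    });
  });
}

// Run `run` while the server is up; the loader would get `url` as FILE_SERVER.
async function withAssetServer(root, run) {
  const {url, close} = await startAssetServer(root);
  try {
    return await run(url);
  } finally {
    await close();
  }
}
```

In this sketch, `run` is where the data loader process would be spawned with `FILE_SERVER` set to `url` in its environment; the server only exists for the duration of that call.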

I think we could also get away with not tracking the dependencies explicitly and just let circular data loaders deadlock for now…

mythmon · 2024-08-30T18:18:32Z

Could we make the "protocol" of how to load data loaders flexible enough to provide other capabilities? For example, it would be nice to be able to explicitly record dependencies that Framework can't detect, and one way we could do that is by making an HTTP call to the injected environment variable to record a file that we don't need to request but should be watched for changes.

I can imagine two ways of doing this: either make that environment variable less generic, like $DATA_SERVER instead of $SERVER, or add a path prefix, like $SERVER/load/path/to/my/file.csv. That would allow something like $COMMAND_SERVER or $SERVER/track/path/to/library.js. (all very bike-sheddable names)

(to be clear, I'm not asking you to implement any of that in this PR, just to leave enough room in the naming to allow future extension)
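For the path-prefix variant, the server side could be as small as the sketch below. The /load/ and /track/ prefixes are the placeholder names from this comment; the function names, the state shape, and the 204 response for a track request are assumptions for illustration.

```javascript
// Classify a request path by its (hypothetical) action prefix.
function parseAction(pathname) {
  const match = /^\/(load|track)\/(.+)$/.exec(pathname);
  if (!match) return null;
  const [, action, path] = match;
  return {action, path};
}

// Apply a request to per-loader state. Both actions record the path as a
// dependency to watch; only "load" asks the server to send the file back.
function handle(state, pathname) {
  const parsed = parseAction(pathname);
  if (!parsed) return {status: 404};
  state.watched.add(parsed.path);
  return parsed.action === "load"
    ? {status: 200, send: parsed.path}
    : {status: 204}; // ASSUMPTION: tracking has nothing to return
}
```

Keeping the action in the first path segment leaves room for further verbs later without changing the environment variable.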

Fil · 2024-09-16T15:52:09Z

I have rewritten this based on the suggestions in the review. There is a new TODO list.

(rebased after #1662)

Fil · 2024-09-17T15:09:20Z

Some of our examples will benefit from chained data loaders (in lieu of an intermediate archive). This will allow us to split the analysis into small, independent scripts. (Thinking about this in the context of #1667.)

Fil force-pushed the fil/chaining branch from 0e8d163 to b92e77c Compare September 16, 2024 15:20

Fil force-pushed the fil/chaining branch from b92e77c to bf450c0 Compare September 17, 2024 10:03

a $FILE_SERVER that tracks dependencies in the cache

aa55380

(rebased after #1662)

Fil force-pushed the fil/chaining branch from bf450c0 to aa55380 Compare September 17, 2024 10:04

Fil added 4 commits September 17, 2024 14:11

dependency tree

e9217b8

remove spurious log

880ada2

fix chain test

8997f06

document "file server" (aka chained data loaders)

ae2ac1c

Fil marked this pull request as ready for review September 17, 2024 14:03
