Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scripted exports #1605

Open
eyeseast opened this issue Jan 19, 2022 · 10 comments
Open

Scripted exports #1605

eyeseast opened this issue Jan 19, 2022 · 10 comments

Comments

@eyeseast
Copy link
Contributor

Posting this while I'm thinking about it: I mentioned at the end of this thread that I'm usually doing datasette --get to export canned queries.

I used to use a tool called datafreeze to do scripted exports, but that project looks dead now. The ergonomics of it are pretty nice, though, and the Freezefile.yml structure is actually not too far from Datasette's canned queries.

This is related to the idea for datasette query (#1356) but I think it's a distinct feature. It's most likely a plugin, but I want to raise it here because it's probably something other people have thought about.

@simonw
Copy link
Owner

simonw commented Jan 19, 2022

Oh that's interesting. I was thinking about this from a slightly different angle recently - pondering what a static site generator built on top of Datasette might look like.

Just a sketch at the moment, but I was imagining a YAML configuration file with a SQL query that returns a list of paths - then a tool that runs that query and uses the equivalent of datasette --get to create a static copy of each of those paths.

I think these two ideas can probably be merged. I'd love to know more about how you are solving this right now!

@eyeseast
Copy link
Contributor Author

Right now, I usually have a line in a Makefile like this:

combined.geojson: project.db
    pipenv run datasette project.db --get /project/combined.geojson \
     --load-extension spatialite \
     --setting sql_time_limit_ms 5000 \
     --setting max_returned_rows 20000 \
     -m metadata.yml > $@

That all assumes I've loaded whatever I need into project.db and created a canned query called combined (and then uses datasette-geojson for geojson output).

It works, but as you can see, it's a lot to manage, a lot of boilerplate, and it wasn't obvious how to get there. If there's an error in the canned query, I get an HTML error page, so that's hard to debug. And it's only one query, so each output needs a line like this. Make isn't ideal, either, for that reason.

The thing I really liked with datafreeze was doing templated filenames. I have a project now where I need to export a bunch of litttle geojson files, based on queries, and it would be awesome to be able to do something like this:

databases:
  project:
    queries:
      boundaries:
        sql: "SELECT * FROM boundaries"
        filename: "boundaries/{id}.geojson"
        mode: "item"
        format: geojson

And then do:

datasette freeze -m metadata.yml project.db

For HTML export, maybe there's a template argument, or format: template or something. And that gets you a static site generator, kinda for free.

@eyeseast
Copy link
Contributor Author

Thinking about this more, as well as #1356 and various other tickets related to output formats, I think there's a missing plugin hook for formatting results, separate from register_output_renderer (or maybe part of it, depending on #1101).

Right now, as I understand it, getting output in any format goes through the normal view stack -- a table, a row or a query -- and so by the time register_output_renderer gets it, the results have already been truncated or paginated. What I'd want, I think, is to be able to register ways to format results independent of where those results are sent.

It's possible this could be done using conn.row_factory (maybe in the prepare_connection hook), but I'm not sure that's where it belongs.

Another option is some kind of registry of serializers, which register_output_renderer and other plugin hooks could use. What I'm trying to avoid here is writing a plugin that also needs plugins for formats I haven't thought of yet.

@simonw
Copy link
Owner

simonw commented Jan 21, 2022

Yeah I think this all hinges on:

Also this comment about streaming full JSON arrays (not just newline-delimited) using this trick:

I'm about ready to figure these out, as with so much it's still a little bit blocked on the refactor stuff from:

@eyeseast
Copy link
Contributor Author

Let me know if you want help prototyping any of this, because I'm thinking about it and trying stuff out. Happy to be a sounding board, if it helps.

@simonw
Copy link
Owner

simonw commented Mar 19, 2022

Had a thought about the implementation of this: it could make a really neat plugin.

Something like datasette-export which adds a export command using https://docs.datasette.io/en/stable/plugin_hooks.html#register-commands-cli - then you could run:

datasette export my-export-dir mydatabase.db -m metadata.json --template-dir templates/

And the command would then:

  • Create a Datasette() instance with those databases/metadata/etc
  • Executeawait datasette.client.get("/") to get the homepage HTML
  • Parse the HTML using BeautifulSoup to find all a[href], link[href], script[src], img[src] elements that reference a relative path as opposed to one that starts with http://
  • Write out the homepage to my-export-dir/index.html
  • Recursively fetch and dump all of the other pages and assets that it found too

All of that HTML parsing may be over-complicating things. It could alternatively accept options for which pages you want to export:

datasette export my-export-dir \
  mydatabase.db -m metadata.json --template-dir templates/ \
  --path / \
  --path /mydatabase ...

Or a really wild option: it could allow you to define the paths you want to export using a SQL query:

datasette export my-export-dir \
  mydatabase.db -m metadata.json --template-dir templates/ \
  --sql "
select '/' as path, 'index.html' as filename
  union all
select '/mydatabase/articles/' || id as path, 'article-' || id || '.html' as filename
from articles
  union all
select '/mydatabase/tags/' || tag as path, 'tag-' || tag || '.html' as filename
from tags
"

Which would save these files:

  • index.html as the content of /
  • article-1.html (and more) as the content of /mydatabase/articles/1
  • tag-python.html (and more) as the content of /mydatabase/tags/python

@simonw
Copy link
Owner

simonw commented Nov 27, 2022

@eyeseast I started work on that plugin: https://github.com/simonw/datasette-export

@eyeseast
Copy link
Contributor Author

Interesting. I started a version using metadata like I outlined up top, but I realized that there's no documented way for a plugin to access either metadata or canned queries. Or at least, I couldn't find a way.

There is this method: https://github.com/simonw/datasette/blob/main/datasette/app.py#L472 but I don't want to rely on it if it's not documented. Same with this: https://github.com/simonw/datasette/blob/main/datasette/app.py#L544

If those are safe, I'll build on them. I'm also happy to document them, if that greases the wheels.

@simonw
Copy link
Owner

simonw commented Nov 30, 2022

Those sounds to me like they should be promoted to documented, supported internals.

@eyeseast
Copy link
Contributor Author

I'll add issues for both and do a documentation PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants