# register_output_renderer() should support streaming data #1101
Current design: https://docs.datasette.io/en/stable/plugin_hooks.html#register-output-renderer-datasette

```python
from datasette import hookimpl


@hookimpl
def register_output_renderer(datasette):
    return {
        "extension": "test",
        "render": render_demo,
        "can_render": can_render_demo,  # Optional
    }
```

Where:

```python
from datasette.utils.asgi import Response


async def render_demo(datasette, columns, rows):
    db = datasette.get_database()
    result = await db.execute("select sqlite_version()")
    first_row = " | ".join(columns)
    lines = [first_row]
    lines.append("=" * len(first_row))
    for row in rows:
        lines.append(" | ".join(str(value) for value in row))
    return Response(
        "\n".join(lines),
        content_type="text/plain; charset=utf-8",
        headers={"x-sqlite-version": result.first()[0]},
    )
```

Meanwhile, here's where the CSV streaming mode is implemented: datasette/datasette/views/base.py, lines 297 to 380 in 4bac9f1.
The trick I'm using here is the one shown in datasette/datasette/views/base.py, lines 304 to 307 in 4bac9f1.
Yet another use-case for this: I want to be able to stream newline-delimited JSON in order to better import into Pandas:
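For example (illustrative only - the URL and query-string parameters here are assumptions about the newline-delimited JSON output, not a confirmed API):

```python
import pandas as pd

# Read a newline-delimited JSON endpoint straight into a DataFrame.
# _shape=array&_nl=on asks Datasette for one JSON object per line.
df = pd.read_json(
    "https://latest.datasette.io/fixtures/facetable.json?_shape=array&_nl=on",
    lines=True,
)
```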
Idea: instead of returning a dictionary, the hook could return an object with methods for rendering and streaming. I can then deprecate the existing dictionary-based return value.

With this structure it will become possible to stream a non-newline-delimited JSON array-of-objects too.
This would really help with this issue: eyeseast/datasette-geojson#7
Relevant blog post: https://simonwillison.net/2021/Jun/25/streaming-large-api-responses/ - including notes on efficiently streaming formats with some kind of separator in between the records (regular JSON).
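A sketch of that separator trick for a regular (non-newline-delimited) JSON array, assuming a hypothetical async iterator of row dictionaries:

```python
import json


async def stream_json_array(rows_iterator):
    # Emit the separator *before* every record except the first, so the
    # code never needs to know in advance which record is the last one.
    yield "[\n"
    first = True
    async for row in rows_iterator:
        if not first:
            yield ",\n"
        first = False
        yield json.dumps(row)
    yield "\n]"
```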
Maybe the simplest design for this is to add an optional `can_stream` key to the dictionary returned by the hook:

```python
@hookimpl
def register_output_renderer(datasette):
    return {
        "extension": "tsv",
        "render": render_tsv,
        "can_render": lambda: True,
        "can_stream": lambda: True,
    }
```

When streaming, a new parameter could be passed to the render function to indicate that a streaming response is wanted - or it could reuse an existing mechanism instead.
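For illustration, a rough sketch of how a render function might branch on such a flag - the `stream` parameter and the generator-return convention are assumptions, not a settled API:

```python
from datasette.utils.asgi import Response


async def render_tsv(datasette, columns, rows, stream=False, **kwargs):
    # `stream` is a hypothetical flag - the real mechanism was still undecided.
    def generate_lines():
        yield "\t".join(columns)
        for row in rows:
            yield "\t".join(str(value) for value in row)

    if not stream:
        # Non-streaming mode: build the whole body in memory, as today.
        return Response(
            "\n".join(generate_lines()),
            content_type="text/tab-separated-values; charset=utf-8",
        )
    # Streaming mode: hand back a generator of lines and let Datasette
    # send them to the client one chunk at a time.
    return generate_lines()
```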
What if you split rendering and streaming into two things: a `render` function for single responses and a separate `stream` function? That way current plugins still work, and streaming is purely additive.
I'm questioning whether the mechanisms should be separate at all now - a single response rendering is really just a special case of a streaming response that only pulls the first N records from the iterator. It probably needs to be an asynchronous iterator of some kind.

This actually gets a fair bit more complicated due to the work I'm doing right now to improve the default JSON API: I want to do things like make faceting results optionally available to custom renderers - which is a separate concern from streaming rows.

I'm going to poke around with a bunch of prototypes and see what sticks.
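A minimal sketch of that unified idea, assuming a hypothetical async iterator of rows - a non-streaming response just stops pulling from the iterator after the first N records:

```python
async def first_n(rows_iterator, n=100):
    # Non-streaming rendering as a special case of streaming: pull at most
    # n rows from the (hypothetical) async iterator, then stop.
    rows = []
    async for row in rows_iterator:
        rows.append(row)
        if len(rows) >= n:
            break
    return rows
```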
The `datasette-geojson` case is a good example here - it currently runs a conversion query for each geometry:

```python
if isinstance(geometry, bytes):
    results = await db.execute(
        "SELECT AsGeoJSON(:geometry)", {"geometry": geometry}
    )
    return geojson.loads(results.single_value())
```

That actually seems to work really well as-is, but it does worry me a bit that it ends up having to execute an extra SQL query for every single row.

My PostgreSQL/MySQL engineering brain says that this would be better handled by doing a chunk of these (maybe 100) at once, to avoid the per-query overhead - but with SQLite that might not be necessary.

At any rate, this is one of the reasons I'm interested in "iterate over this sequence of chunks of 100 rows at a time" as a potential option here. Of course, a better solution would be for `datasette-geojson` to avoid the extra per-row query entirely.
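For illustration, a hypothetical sketch of the "chunk of 100 at a time" idea - the helper name and query shape are assumptions, not `datasette-geojson` code:

```python
import json


async def convert_geometries(db, geometries, chunk_size=100):
    # Convert SpatiaLite geometries to GeoJSON with one query per chunk
    # instead of one query per row.
    converted = []
    for start in range(0, len(geometries), chunk_size):
        chunk = geometries[start : start + chunk_size]
        # Build one SELECT with an AsGeoJSON() expression per geometry.
        columns = ", ".join(f"AsGeoJSON(:g{i})" for i in range(len(chunk)))
        params = {f"g{i}": geometry for i, geometry in enumerate(chunk)}
        results = await db.execute(f"SELECT {columns}", params)
        converted.extend(json.loads(value) for value in results.first())
    return converted
```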
Ha! That was your idea (and a good one). But it's probably worth measuring to see what overhead it adds. It did require both passing in the database and making the whole thing async.

Just timing the queries themselves:

Looking at the network panel:

I'm not sure how best to time the GeoJSON generation, but it would be interesting to check. Maybe I'll write a plugin to add query times to response headers.

The other thing to consider with async streaming is that it might be well-suited for a slower response. When I have to get the whole result and send a response in a fixed amount of time, I need the most efficient query possible. If I can hang onto a connection and get things one chunk at a time, maybe it's OK if there's some overhead.
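One hedged sketch of such a plugin, using Datasette's `asgi_wrapper` hook - this times the whole response rather than just the queries, and the header name is made up:

```python
import time

from datasette import hookimpl


@hookimpl
def asgi_wrapper(datasette):
    # Wrap the ASGI app and attach an x-response-time header to every response.
    def wrap(app):
        async def timed_app(scope, receive, send):
            start = time.perf_counter()

            async def timed_send(event):
                if event["type"] == "http.response.start":
                    elapsed_ms = (time.perf_counter() - start) * 1000
                    headers = list(event.get("headers", []))
                    headers.append(
                        (b"x-response-time", f"{elapsed_ms:.1f}ms".encode())
                    )
                    event = {**event, "headers": headers}
                await send(event)

            await app(scope, receive, timed_send)

        return timed_app

    return wrap
```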
Idea for supporting streaming with the `register_output_renderer()` hook:

```python
@hookimpl
def register_output_renderer(datasette):
    return {
        "extension": "test",
        "render": render_demo,
        "can_render": can_render_demo,
        "render_stream": render_demo_stream,  # This is new
    }
```

So there's a new `render_stream` key which, if present, means that the output renderer supports streaming. I'll play around with the design of that function signature.
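A rough sketch of what `render_demo_stream` could look like - assuming, hypothetically, that it receives an async iterator of rows and yields chunks of the response body; the real signature was still being designed at this point:

```python
async def render_demo_stream(datasette, columns, rows_iterator):
    # Hypothetical signature: rows_iterator is an async iterator of rows,
    # and the function yields chunks of the plain-text response body.
    header = " | ".join(columns)
    yield header + "\n"
    yield "=" * len(header) + "\n"
    async for row in rows_iterator:
        yield " | ".join(str(value) for value in row) + "\n"
```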
I really like this. One thing that would be great to add would be the ability to cancel the `rows_iterator`. The use case I have is a DataTables extension that would wrap the original SQL query in a new, outer SQL query. If the streaming data were truly lazy, this would be easy: I would just throw away the original `rows_iterator` and construct my own. This could handle many of the use cases of #2000.
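Roughly what I mean, as a hypothetical sketch (the wrapped query, the limits, and the replacement iterator are all made up for illustration):

```python
async def replacement_rows(db, original_sql):
    # Throw away the original rows_iterator and build a new one by wrapping
    # the original SQL query in an outer query that applies DataTables-style
    # sorting and paging.
    wrapped_sql = f"""
        select * from ({original_sql})
        order by 1 desc
        limit 50 offset 100
    """
    results = await db.execute(wrapped_sql)
    for row in results.rows:
        yield row
```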
Originally posted by @simonw in #1096 (comment)