Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor TableView to use asyncinject #1715

Closed
simonw opened this issue Apr 22, 2022 · 13 comments
Closed

Refactor TableView to use asyncinject #1715

simonw opened this issue Apr 22, 2022 · 13 comments

Comments

@simonw
Copy link
Owner

simonw commented Apr 22, 2022

I've been working on a dependency injection mechanism in a separate library:

I think it's ready to try out with Datasette to see if it's a pattern that will work here.

I'm going to attempt to refactor TableView to use it. There are two overall goals here:

  • Use asyncinject to add parallel execution of some aspects of the table page - most notably I want to be able to execute the count(*) query, the select ... query, the various faceting queries and the facet suggestion queries in parallel - and measure if doing so is good for performance.
  • Use it to execute different output formats (possibly with some changes to the existing register_output_renderer() plugin hook). I want CSV and JSON to use the same mechanism that plugins use.

Stretch goal is to get this working with streaming data too, see:

@simonw
Copy link
Owner Author

simonw commented Apr 22, 2022

I need a asyncio.Registry with functions registered to perform the role of the table view.

Something like this perhaps:

def table_html_context(facet_results, query, datasette, rows):
    return {...}

That then gets called like this:

async def view(request):
    registry = Registry(facet_results, query, datasette, rows)
    context = await registry.resolve(table_html, request=request, datasette=datasette)
    return Reponse.html(await datasette.render("table.html", context)

It's also interesting to start thinking about this from a Python client library point of view. If I'm writing code outside of the HTTP request cycle, what would it look like?

One thing I could do: break out is the code that turns a request into a list of pairs extracted from the request - this code here:

# Arguments that start with _ and don't contain a __ are
# special - things like ?_search= - and should not be
# treated as filters.
filter_args = []
for key in request.args:
if not (key.startswith("_") and "__" not in key):
for v in request.args.getlist(key):
filter_args.append((key, v))

I could turn that into a typed dependency injection function like this:

def filter_args(request: Request) -> List[Tuple[str, str]]:
    # Arguments that start with _ and don't contain a __ are
    # special - things like ?_search= - and should not be
    # treated as filters.
    filter_args = []
    for key in request.args:
        if not (key.startswith("_") and "__" not in key):
            for v in request.args.getlist(key):
                filter_args.append((key, v))
    return filter_args

Then I can either pass a request into a .resolve() call, or I can instead skip that function by passing:

output = registry.resolve(table_context, filter_args=[("foo", "bar")])

I do need to think about where plugins get executed in all of this.

@simonw
Copy link
Owner Author

simonw commented Apr 22, 2022

Looking at the start of TableView.data():

async def data(
self,
request,
default_labels=False,
_next=None,
_size=None,
):
database_route = tilde_decode(request.url_vars["database"])
table = tilde_decode(request.url_vars["table"])
try:
db = self.ds.get_database(route=database_route)
except KeyError:
raise NotFound("Database not found: {}".format(database_route))
database = db.name

I'm going to resolve table_name and database from the URL - table_name will be a string, database will be the DB object returned by datasette.get_database(). Then those can be passed in separately too.

@simonw
Copy link
Owner Author

simonw commented Apr 22, 2022

async def database(request: Request, datasette: Datasette) -> Database:
    database_route = tilde_decode(request.url_vars["database"])
    try:
        return datasette.get_database(route=database_route)
    except KeyError:
        raise NotFound("Database not found: {}".format(database_route))

async def table_name(request: Request) -> str:
    return tilde_decode(request.url_vars["table"])

@simonw
Copy link
Owner Author

simonw commented Apr 22, 2022

I'm having second thoughts about injecting request - might be better to have the view function pull the relevant pieces out of the request before triggering the rest of the resolution.

@simonw
Copy link
Owner Author

simonw commented Apr 25, 2022

The RowTableShared class is making this a whole lot more complicated.

I'm going to split the RowView view out into an entirely separate views/row.py module.

@simonw
Copy link
Owner Author

simonw commented Apr 25, 2022

Pushed my WIP on this to the api-extras branch: 5053f1e

@simonw
Copy link
Owner Author

simonw commented Apr 26, 2022

The refactor I did in #1719 pretty much clashes with all of the changes in 5053f1e so I'll probably need to start my api-extras branch again from scratch.

Using a new tableview-asyncinject branch.

@simonw
Copy link
Owner Author

simonw commented Apr 26, 2022

This time I'm not going to bother with the filter_args thing - I'm going to just try to use asyncinject to execute some big high level things in parallel - facets, suggested facets, counts, the query - and then combine it with the extras mechanism I'm trying to introduce too.

Most importantly: I want that extra_template() function that adds more template context for the HTML to be executed as part of an asyncinject flow!

@simonw
Copy link
Owner Author

simonw commented Apr 26, 2022

I'm going to rename database to database_name and table to table_name to avoid confusion with the Database object as opposed to the string name for the database.

@simonw
Copy link
Owner Author

simonw commented Apr 26, 2022

pytest tests/test_table_* runs the tests quickly.

@simonw
Copy link
Owner Author

simonw commented Apr 26, 2022

Well this is fun... I applied this change:

diff --git a/datasette/views/table.py b/datasette/views/table.py
index d66adb8..85f9e44 100644
--- a/datasette/views/table.py
+++ b/datasette/views/table.py
@@ -1,3 +1,4 @@
+import asyncio
 import itertools
 import json
 
@@ -5,6 +6,7 @@ import markupsafe
 
 from datasette.plugins import pm
 from datasette.database import QueryInterrupted
+from datasette import tracer
 from datasette.utils import (
     await_me_maybe,
     CustomRow,
@@ -174,8 +176,11 @@ class TableView(DataView):
                 write=bool(canned_query.get("write")),
             )
 
-        is_view = bool(await db.get_view_definition(table_name))
-        table_exists = bool(await db.table_exists(table_name))
+        with tracer.trace_child_tasks():
+            is_view, table_exists = map(bool, await asyncio.gather(
+                db.get_view_definition(table_name),
+                db.table_exists(table_name)
+            ))
 
         # If table or view not found, return 404
         if not is_view and not table_exists:

And now using https://datasette.io/plugins/datasette-pretty-traces I get this:

CleanShot 2022-04-26 at 14 03 33@2x

@simonw
Copy link
Owner Author

simonw commented Apr 26, 2022

Running facets and facet suggestions in parallel using asyncio.gather() turns out to be a lot less hassle than I had thought - maybe I don't need asyncinject for this at all?

         if not nofacet:
-            for facet in facet_instances:
-                (
-                    instance_facet_results,
-                    instance_facets_timed_out,
-                ) = await facet.facet_results()
+            # Run them in parallel
+            facet_awaitables = [facet.facet_results() for facet in facet_instances]
+            facet_awaitable_results = await asyncio.gather(*facet_awaitables)
+            for (
+                instance_facet_results,
+                instance_facets_timed_out,
+            ) in facet_awaitable_results:
                 for facet_info in instance_facet_results:
                     base_key = facet_info["name"]
                     key = base_key
@@ -522,8 +540,10 @@ class TableView(DataView):
             and not nofacet
             and not nosuggest
         ):
-            for facet in facet_instances:
-                suggested_facets.extend(await facet.suggest())
+            # Run them in parallel
+            facet_suggest_awaitables = [facet.suggest() for facet in facet_instances]
+            for suggest_result in await asyncio.gather(*facet_suggest_awaitables):
+                suggested_facets.extend(suggest_result)

simonw added a commit that referenced this issue Apr 26, 2022
Use ?_noparallel=1 to opt out (undocumented, useful for benchmark comparisons)

Refs #1723, #1715
@simonw
Copy link
Owner Author

simonw commented Apr 28, 2022

I'm not going to use asyncinject in this refactor - at least not until I really need it. My research in these issues has put me off the idea ( in favour of asyncio.gather() or even not trying for parallel execution at all):

@simonw simonw closed this as completed Apr 28, 2022
@simonw simonw added the wontfix label Apr 28, 2022
@simonw simonw removed this from the Datasette 1.0a1 milestone Dec 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant