Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize field shorthand for columnar data? #191

Closed
mbostock opened this issue Mar 4, 2021 · 3 comments · Fixed by #2115
Closed

Optimize field shorthand for columnar data? #191

mbostock opened this issue Mar 4, 2021 · 3 comments · Fixed by #2115
Labels
enhancement New feature or request

Comments

@mbostock
Copy link
Member

mbostock commented Mar 4, 2021

We currently support field (named properties specified as strings) as channel value shorthand, e.g., y: "foo" is upgraded to y: d => d["foo"] or equivalently y: data.map(d => d["foo"]). For Arquero (and perhaps other column-oriented data structures), this still works because Arquero provides an object iterator; however, it’d be more efficient if Plot supported the same shorthand with Arquero’s column accessors, upgrading y: "foo" to y: data.column("foo") y: data.array("foo"), which returns an iterable over the values. Although, currently this would still involve copying the column into a new array, so we actually want y: data.column("foo").data if possible (e.g., Arquero’s Column class could support an toArray method which returns the data if possible, and otherwise constructs a new array using the iterator).

@mbostock mbostock added the enhancement New feature or request label Mar 4, 2021
@jheer
Copy link

jheer commented May 6, 2021

I was going to post an issue to this effect after a brief skim of the source code, as it appears to involve a fair bit of data copying. Glad to see an issue is already here! Arquero includes a table.array method that extracts an array (optionally into a specified typed array type). Currently this method performs a copy (necessary if coercing types or if the table has filter or orderby criteria) but we could optimize this to return an underlying array directly (zero copy) when applicable. Even with an array copy here, it might be nice to avoid object serialization of table rows. Along those lines, Arquero can also provide a column iterator using table.values.

Also, I would advise against column('foo').data as a general pattern, as other column implementations are not guaranteed to support this (Apache Arrow data, etc.).

If Observable/Plot eventually settles on a more general convention for accessing columnar data sources, we're happy to look into adapting Arquero along those lines.

@mbostock
Copy link
Member Author

I think we should focus on doing this specifically for Apache Arrow table instances first (since these are the most widely-used columnar data representation, including by the DuckDBClient/SQL code blocks in Observable Framework).

@mbostock
Copy link
Member Author

mbostock commented Mar 21, 2024

Here’s is trivial example of how to pass in columnar data. Given an Arrow table:

const data = Arrow.tableFromArrays({
  id: [1, 2, 3],
  name: ["Alice", "Bob", "Charlie"],
  age: [35, 25, 45]
});

You can pass the columns in like so:

Plot.barY({length: data.numRows}, {x: data.getChild("name"), y: data.getChild("age")}).plot()

The goal of this issue is to make the above equivalent to the following shorthand syntax:

Plot.barY(data, {x: "name", y: "age"}).plot()

To do this, we will need to detect when data is an Apache Arrow table, and change the definition of the field shorthand to use data.getChild(name) instead of Array.from(data, (d) => d[name]). Also, we’ll want to avoid materializing the array-of-objects as data (the arrayify(this.data) call in mark.initialize), and instead use an efficient numeric index [0, … numRows - 1].

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
2 participants