Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-34936: [JavaScript] Added Proxy for Table, RecordBatch, and Vector #34939

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

shuhaowu
Copy link

@shuhaowu shuhaowu commented Apr 6, 2023

THIS PR IS NOT READY YET:

  • Potentially add documentation comparing the performance of .get vs [index], and maybe even the nuance of how .get can use a binary search algorithm when the Vector consists of multiple chunks.
  • Implement other methods such as filter on the Vector object to ensure 100% compatibility with JavaScript arrays.

Rationale for this change

Certain codebases that previously uses row-oriented way to access data may wish to migrate to Arrow to save serialization and deserialization cost, and to be able to gain access to fast column-oriented operations. As it stands, Arrow is sort of a drop-in replacement to row-oriented data such as a JavaScript Array of objects. This is great to incrementally migrate legacy codebases to Arrow, as it is frequently infeasible to rewrite the application to use the column-oriented data access patterns. For most data, JavaScript-object-compatible and row-oriented access is already provided via the StructRowProxy. However, if the structs themselves include a Vector, existing code will break as it assumes the Vector object to behave like a JavaScript array, which it does not due to the lack of index access. An example of such a data structure is as follows:

[
  {x: 1, y: [1, 2]},
  {x: 2, y: [2, 3]},
]

In this case, with the Arrow JS library as it is, the API consumer is unable to get individual element of the y array via table[i].y[j]. Instead, the API consumer must use the API table.get(i).y.get(j). In the situation where we are migrating a legacy code base to Arrow, this requires a large refactor of the entire codebase, which is infeasible in a short time. This negates the advantage of using Arrow as a drop-in replacement and prevents incremental migration of code to Arrow.

What changes are included in this PR?

To address this problem, this patch adds a Proxy at the root of the prototype chain for the Table, RecordBatch, and Vector objects and allow index access for these objects for backward compatibility purposes. Basically, objects like Vector now supports vector[i] in addition to vector.get(i).

However, code should not be using vector[i] as it is ~1.5 orders of magnitude slower than vector.get(i) as ES6 Proxy objects are quite slow. This should only be used to provide compatibility for legacy codebases. For code that desires high performance, .get remains a much better solution. This is also why the Proxy object is added to the root of the prototype chain, as opposed to the usual pattern where a Proxy object is returned from a constructor.

Documentation has been added to compare the performance of the various access.

Are there any user-facing changes?

Table, RecordBatch, and Vector elements can now be accessed via index operators, albeit much slower than .get. All changes should be backward compatible.

Are these changes tested?

Yes. The performance of the base objects does not seem to be affected. To establish the performance change, we first see how much variability in the ops/s there is between two runs of yarn perf. The left plot below shows the percent difference between the two runs. This shows a baseline level of "variability" (although may not be statistically significant, but gives an indication) between the runs. We see most benchmarks are within 20% of each other, although a few tests at the bottom show higher variability. The right plot shows the difference in ops/s before and after the patch. No benchmark showed significantly higher or lower performance and looks superficially similar to the variability comparison. As such, we can say that the performance of the objects are not impacted by this patch.

image

The following shows the performance of vector.get(i) vs vector[i]. The raw results shows the index access to have about 50 ops/s while .get are around 750 ops/s. This is about 1.5 orders of magnitude difference in performance. Basically, this shouldn't be used unless it is used for backward compatibility with code that expects native JavaScript arrays.

Certain codebases that previously uses row-oriented way to access data
may wish to migrate to Arrow to save serialization and deserialization
cost, and to be able to gain access to fast column-oriented operations.
As it stands, Arrow is sort of a drop-in replacement to row-oriented
data such as a JavaScript Array of objects. This is great to
incrementally migrate legacy codebases to Arrow, as it is frequently
infeasible to rewrite the application to use the column-oriented data
access patterns. For most data, JavaScript-object-compatible and
row-oriented access is already provided via the `StructRowProxy`.
However, if the structs themselves include a `Vector`, existing code
will break as it assumes the `Vector` object to behave like a JavaScript
array, which it does not due to the lack of index access. An example of
such a data structure is as follows:

```
[
  {x: 1, y: [1, 2]},
  {x: 2, y: [2, 3]},
]
```

In this case, with the Arrow JS library as it is, the API consumer is
unable to get individual element of the `y` array via `table[i].y[j]`.
Instead, the API consumer must use the API `table.get(i).y.get(j)`. In
the situation where we are migrating a legacy code base to Arrow, this
requires a large refactor of the entire codebase, which is infeasible in
a short time. This negates the advantage of using Arrow as a drop-in
replacement and prevents incremental migration of code to Arrow.

To address this problem, this patch adds a Proxy at the root of the
prototype chain for the `Table`, `RecordBatch`, and `Vector` objects and
allow index access for these objects for backward compatibility
purposes. Basically, objects like `Vector` now supports `vector[i]` in
addition to `vector.get(i)`.

However, code should not be using `vector[i]` as it is ~1.5 orders of
magnitude slower than `vector.get(i)` as ES6 Proxy objects are quite
slow. This should only be used to provide compatibility for legacy
codebases. For code that desires high performance, `.get` remains a much
better solution. This is also why the Proxy object is added to the root
of the prototype chain, as opposed to the usual pattern where a Proxy
object is returned from a constructor.

Documentation has been added to compare the performance of the various
access.
@github-actions
Copy link

github-actions bot commented Apr 6, 2023

@github-actions
Copy link

github-actions bot commented Apr 6, 2023

⚠️ GitHub issue #34936 has been automatically assigned in GitHub to PR creator.

@domoritz
Copy link
Member

domoritz commented Jan 5, 2024

I love the perf charts you made. Is there a script or something you could share to make them?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[JavaScript] Allow index access for Table, RecordBatch, and Vector
2 participants