GH-34936: [JavaScript] Added Proxy for Table, RecordBatch, and Vector #34939

shuhaowu · 2023-04-06T16:42:15Z

THIS PR IS NOT READY YET:

Potentially add documentation comparing the performance of .get vs [index], and maybe even the nuance of how .get can use a binary search algorithm when the Vector consists of multiple chunks.
Implement other methods such as filter on the Vector object to ensure 100% compatibility with JavaScript arrays.

Rationale for this change

Certain codebases that previously uses row-oriented way to access data may wish to migrate to Arrow to save serialization and deserialization cost, and to be able to gain access to fast column-oriented operations. As it stands, Arrow is sort of a drop-in replacement to row-oriented data such as a JavaScript Array of objects. This is great to incrementally migrate legacy codebases to Arrow, as it is frequently infeasible to rewrite the application to use the column-oriented data access patterns. For most data, JavaScript-object-compatible and row-oriented access is already provided via the StructRowProxy. However, if the structs themselves include a Vector, existing code will break as it assumes the Vector object to behave like a JavaScript array, which it does not due to the lack of index access. An example of such a data structure is as follows:

[
  {x: 1, y: [1, 2]},
  {x: 2, y: [2, 3]},
]

In this case, with the Arrow JS library as it is, the API consumer is unable to get individual element of the y array via table[i].y[j]. Instead, the API consumer must use the API table.get(i).y.get(j). In the situation where we are migrating a legacy code base to Arrow, this requires a large refactor of the entire codebase, which is infeasible in a short time. This negates the advantage of using Arrow as a drop-in replacement and prevents incremental migration of code to Arrow.

What changes are included in this PR?

To address this problem, this patch adds a Proxy at the root of the prototype chain for the Table, RecordBatch, and Vector objects and allow index access for these objects for backward compatibility purposes. Basically, objects like Vector now supports vector[i] in addition to vector.get(i).

However, code should not be using vector[i] as it is ~1.5 orders of magnitude slower than vector.get(i) as ES6 Proxy objects are quite slow. This should only be used to provide compatibility for legacy codebases. For code that desires high performance, .get remains a much better solution. This is also why the Proxy object is added to the root of the prototype chain, as opposed to the usual pattern where a Proxy object is returned from a constructor.

Documentation has been added to compare the performance of the various access.

Are there any user-facing changes?

Table, RecordBatch, and Vector elements can now be accessed via index operators, albeit much slower than .get. All changes should be backward compatible.

Are these changes tested?

Yes. The performance of the base objects does not seem to be affected. To establish the performance change, we first see how much variability in the ops/s there is between two runs of yarn perf. The left plot below shows the percent difference between the two runs. This shows a baseline level of "variability" (although may not be statistically significant, but gives an indication) between the runs. We see most benchmarks are within 20% of each other, although a few tests at the bottom show higher variability. The right plot shows the difference in ops/s before and after the patch. No benchmark showed significantly higher or lower performance and looks superficially similar to the variability comparison. As such, we can say that the performance of the objects are not impacted by this patch.

The following shows the performance of vector.get(i) vs vector[i]. The raw results shows the index access to have about 50 ops/s while .get are around 750 ops/s. This is about 1.5 orders of magnitude difference in performance. Basically, this shouldn't be used unless it is used for backward compatibility with code that expects native JavaScript arrays.

Closes: [JavaScript] Allow index access for Table, RecordBatch, and Vector #34936

Certain codebases that previously uses row-oriented way to access data may wish to migrate to Arrow to save serialization and deserialization cost, and to be able to gain access to fast column-oriented operations. As it stands, Arrow is sort of a drop-in replacement to row-oriented data such as a JavaScript Array of objects. This is great to incrementally migrate legacy codebases to Arrow, as it is frequently infeasible to rewrite the application to use the column-oriented data access patterns. For most data, JavaScript-object-compatible and row-oriented access is already provided via the `StructRowProxy`. However, if the structs themselves include a `Vector`, existing code will break as it assumes the `Vector` object to behave like a JavaScript array, which it does not due to the lack of index access. An example of such a data structure is as follows: ``` [ {x: 1, y: [1, 2]}, {x: 2, y: [2, 3]}, ] ``` In this case, with the Arrow JS library as it is, the API consumer is unable to get individual element of the `y` array via `table[i].y[j]`. Instead, the API consumer must use the API `table.get(i).y.get(j)`. In the situation where we are migrating a legacy code base to Arrow, this requires a large refactor of the entire codebase, which is infeasible in a short time. This negates the advantage of using Arrow as a drop-in replacement and prevents incremental migration of code to Arrow. To address this problem, this patch adds a Proxy at the root of the prototype chain for the `Table`, `RecordBatch`, and `Vector` objects and allow index access for these objects for backward compatibility purposes. Basically, objects like `Vector` now supports `vector[i]` in addition to `vector.get(i)`. However, code should not be using `vector[i]` as it is ~1.5 orders of magnitude slower than `vector.get(i)` as ES6 Proxy objects are quite slow. This should only be used to provide compatibility for legacy codebases. For code that desires high performance, `.get` remains a much better solution. This is also why the Proxy object is added to the root of the prototype chain, as opposed to the usual pattern where a Proxy object is returned from a constructor. Documentation has been added to compare the performance of the various access.

github-actions · 2023-04-06T16:42:36Z

Closes: [JavaScript] Allow index access for Table, RecordBatch, and Vector #34936

github-actions · 2023-04-06T16:42:38Z

⚠️ GitHub issue #34936 has been automatically assigned in GitHub to PR creator.

domoritz · 2024-01-05T17:13:08Z

I love the perf charts you made. Is there a script or something you could share to make them?

github-actions bot added Component: JavaScript awaiting review Awaiting review labels Apr 6, 2023

domoritz mentioned this pull request Apr 18, 2023

Support Apache Arrow tables in database clients observablehq/stdlib#320

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GH-34936: [JavaScript] Added Proxy for Table, RecordBatch, and Vector #34939

GH-34936: [JavaScript] Added Proxy for Table, RecordBatch, and Vector #34939

shuhaowu commented Apr 6, 2023 •

edited

Loading

github-actions bot commented Apr 6, 2023

github-actions bot commented Apr 6, 2023

domoritz commented Jan 5, 2024

GH-34936: [JavaScript] Added Proxy for Table, RecordBatch, and Vector #34939

Are you sure you want to change the base?

GH-34936: [JavaScript] Added Proxy for Table, RecordBatch, and Vector #34939

Conversation

shuhaowu commented Apr 6, 2023 • edited Loading

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

Are these changes tested?

github-actions bot commented Apr 6, 2023

github-actions bot commented Apr 6, 2023

domoritz commented Jan 5, 2024

shuhaowu commented Apr 6, 2023 •

edited

Loading