Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add to_columns_string() C++ JSON API #2315

Merged
merged 3 commits into from
Jul 29, 2023
Merged

Add to_columns_string() C++ JSON API #2315

merged 3 commits into from
Jul 29, 2023

Conversation

texodus
Copy link
Member

@texodus texodus commented Jul 25, 2023

This PR adds View.to_columns_string() method, which works similarly to to_columns(), except it returns a JSON string instead of a JavaScript object (in JS) or Python Dict (in Python). to_columns_string() is implemented in C++ useing RapidJSON library, and re-used verbatim in both JavaScript and Python, reducing duplicate code between these two projects.

In JavaScript:

  • In JavaScript, to_columns(), to_json() have been rewritten in terms of to_columns_string(), using JSON.parse()
  • As JavaScript is typically instantiated in a WebWorker, using to_columns_string() saves one Object->String->Object conversion (as the data can stay in string form until it is copied into the main renderer process).
  • Linear performance is improved ~2x for these methods.
  • @finos/perspective-viewer-datagrid and @finos/perspective-viewer-d3fc have been rewritten to use this method internally, leading to improved performance for both.

In Python

  • to_columns(), to_records() have been rewritten in terms of to_columns_string(), using json.loads()
  • All 3 of these methods are now GIL-less, so concurrency for virtual workloads is much improved.
  • (Breaking) Inf, -Inf and nan support has been removed from these methods. These data structures errored when serializing anyway, so were entirely useless for visualization purposes. to_numpy() and to_arrow() preserve these values still.

In both:

  • date and datetime columns now use YYYY-MM-DD and YYYY-MM-DD HH:MM:SS.SSSS formatting respectively, for CSV export and split_by headers. The former can only be fixed at the expense of performance and/or significant asset size (bundling a locale collection); the latter can be fixed with a refactoring of the JSON representation.

Serial (linear) performance by version:
Screenshot 2023-07-26 at 11 58 01 AM

10 clients concurrent performance time series:
Screenshot 2023-07-28 at 10 12 02 PM

@texodus texodus added enhancement Feature requests or improvements breaking labels Jul 25, 2023
@timkpaine
Copy link
Member

just noting that you've auto formatted the python tests here. I'm a fan of this, but it does make for a big changeset

@texodus
Copy link
Member Author

texodus commented Jul 26, 2023

@timkpaine The auto formatting is exclusively in a single commit authored by no author.

This code shouldn't have been excluded from formatting. The issue (currently, though the decision begets issues) is that VSCode does not respect the pyproject.toml configuration for black suddenly, resulting in these files being obliterated whenever they are edited. This PR modifies python tests, btu the modifications are not in the auto-format commit. We could move this commit to a separate PR but as it impacts nothing IMO don't think its worth the CI time.

await table_suite();
await view_suite();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, why were these switched?

@@ -111,6 +122,19 @@ test.describe("leaks", function () {
view.delete();
table.delete();
});

test.skip("csv loading does not leak", async () => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just double checking, should this be skipped?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It takes forever to run is the only reason to skip it

packages/perspective/test/js/pivots.spec.js Outdated Show resolved Hide resolved
{ datetime: "6/13/16" },
{ datetime: "6/14/16" },
{ datetime: "2016-06-13" },
{ datetime: "2016-06-14" },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious, how important is it to maintain datatime strring parsing across different systems? Since it's possible to use a parsed date string as part of a column name, should we at one point add a C++ based function that both JS and Python can call to get a standardized formatted date string?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are only two conditions where date/datetime columns need to be formatted:

(1) When they are used in a split_by and we need to stringify them to create compound column names. This is really a deficiency of the JSON encoding we use which doesn't permit true column paths. Fixing this encoding would remove the need for formatting for this case.

(2) When the formatted flag is passed. This is only used by the CSV exporter (as users typically expect CSVs to be "human readable"). Formerly, this used sprintf to try to replicate the en/us locale, as the browser native locale is not cheaply accessible from C++; however, this was never right, especially in non-en/us locales. The format in this patch is a compromise - it is not locale-formatted, but it is consistently emitted and parsed (though assumption are made about the encoding timezone in this case which are symmetric wrt Perspective but maybe not always desired).

func = Benchmark(lambda: getattr(self._view, "to_{0}".format(name))(), meta=test_meta)
method = "to_{0}".format(name)
test_meta = make_meta("to_format", method)
func = Benchmark(getattr(self._view, method), meta=test_meta)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If benchmarking results in static assets, maybe we can create an blocks example that pulls in the data.

@texodus texodus merged commit 37e3df3 into master Jul 29, 2023
12 checks passed
@texodus texodus deleted the rapidjson-api branch July 29, 2023 23:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking enhancement Feature requests or improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants