Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data Explorer: Use concurrent.futures background job queue to return get_column_profiles results asynchronously #4326

Merged
merged 9 commits into from
Sep 10, 2024

Conversation

wesm
Copy link
Contributor

@wesm wesm commented Aug 12, 2024

As discussed in #4300, comm requests have to be served in the order they are received. Since the data explorer has some request types, like get_column_profiles that can grow very expensive for large datasets, this blocks "fast" requests like get_data_values from executing while these "slow" requests are pending.

This change adds a return_column_profiles frontend method that allows these requests to be fulfilled in the background, allowing fast requests to be served unimpeded. This resolves the immediate performance issue that the data explorer is facing, so this pattern can be refined over time as we have need to do other asynchronous / background request handling in the kernels.

Here's what the "blocked request" behavior looks like on the main branch:

Screencast.from.2024-08-08.14-55-18.mp4

And here is with this change:

Screencast.from.2024-08-12.13-13-10.mp4

The main difference is that the grid values load immediately, and then the null percentages load later when the return_column_profiles event is fired.

This change can't be merged until the corresponding changes are implemented in Ark.

@wesm wesm requested review from seeM, dfalbel and jmcphers August 12, 2024 18:14
dfalbel
dfalbel previously approved these changes Aug 13, 2024
Copy link
Contributor

@dfalbel dfalbel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

seeM
seeM previously approved these changes Aug 13, 2024
Copy link
Contributor

@seeM seeM left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Works as advertised.

Does Ark have this same issue? If not, I wonder if we should keep the old get_column_profiles method and make this PR's implementation a new method. That way we don't need to change anything in Ark, and when we support concurrency in the Python kernel we'll use the original method too.

with self.lock:
futures = list(self.pending_futures)

for future in futures:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's also concurrent.futures.wait (docs) if we need more control in future.

const callbackId = generateUuid();
const promise = new DeferredPromise<Array<ColumnProfileResult>>();
this._asyncTasks.set(callbackId, promise);
await this._positronDataExplorerComm.getColumnProfiles(callbackId, profiles, this._profileFormatOptions);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also catch errors here and reject the deferred promise immediately? It will timeout after 10 seconds in any case but could be helpful for the user to have immediate feedback.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked here: #4559
Currently the expected behavior is to catch all error and return empty results instead. We will need additional UI to somehow display these errors.

setTimeout(() => {
// If the promise has already been resolved, do nothing.
if (promise.isSettled) {
return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also need to clean up _asyncTasks here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In principle the key should have been deleted from _asyncTasks in the resolution of the promise

@wesm
Copy link
Contributor Author

wesm commented Aug 13, 2024

I’m not sure if this issue affects Ark or not. My understanding from what @lionel- told me is that they also want to serve requests in order in Ark also, so the asynchronous API here may be the preferred way.

@seeM could you take charge of pushing this over the finish line (with whatever changes are necessary in Ark) since I’m OOO for the next couple weeks?

jmcphers
jmcphers previously approved these changes Aug 15, 2024
"params": [
{
"name": "callback_id",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes me a little itchy to have to complicate the contract b/c of the limitations of the backend, but I don't see a better way than what you've done here.

wesm and others added 8 commits September 10, 2024 10:57
Co-authored-by: Wasim Lorgat <[email protected]>
Signed-off-by: Daniel Falbel <[email protected]>
)

Fixes an issue while running viewing a small polars data frame. In this
case, the background tasks that return the null cournts would be
processed faster than the actual RPC message, leading to an issue where
the `send_event` is sent before `send_result`, but with the message id
from `send_result`, which cause positron to interpret the event and the
return value of the RPC message.

This PR makes sure `send_event` waits for for the RPC message to be
handled before sending the event results.

```
import polars as pl

from datetime import date
df = pl.DataFrame(
    {
        "foo": [1, 2, 3],
        "bar": [6.0, 7.0, 8.0],
        "ham": [date(2020, 1, 2), date(2021, 3, 4), date(2022, 5, 6)],
        "a": [None, 2, 3],
        "b": [0.5, None, 2.5],
        "c": [True, None, False],
    }
)


%view df
```
@dfalbel dfalbel force-pushed the feature/de-async-column-profiles branch from ef72f5a to 913deb3 Compare September 10, 2024 13:58
@dfalbel dfalbel merged commit 6cc6d2d into main Sep 10, 2024
22 checks passed
@dfalbel dfalbel deleted the feature/de-async-column-profiles branch September 10, 2024 15:12
@github-actions github-actions bot locked and limited conversation to collaborators Sep 10, 2024
@wesm
Copy link
Contributor Author

wesm commented Sep 10, 2024

Thanks for pushing this through!! This is a big UX improvement in the data explorer

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants