Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Workflow: CSV to parquet #841

Merged
merged 35 commits into from
May 23, 2023
Merged

Workflow: CSV to parquet #841

merged 35 commits into from
May 23, 2023

Conversation

milesgranger
Copy link
Contributor

Continuation from #738

@milesgranger
Copy link
Contributor Author

Just to clarify, this is a WIP and not ready for a proper review... only posted to show the reproducer of the TCP error when running on Coiled is all. :)

@milesgranger milesgranger marked this pull request as ready for review May 10, 2023 14:17
Copy link
Contributor

@ncclementi ncclementi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @milesgranger
This LGTM, left a couple of comments to consider.

cluster_kwargs.yaml Outdated Show resolved Hide resolved
tests/workflows/test_from_csv_to_parquet.py Outdated Show resolved Hide resolved
cluster_kwargs.yaml Outdated Show resolved Hide resolved
cluster_kwargs.yaml Outdated Show resolved Hide resolved
Copy link
Contributor

@j-bennet j-bennet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments, we can also chat in Slack or Google Meet, if that helps.

tests/workflows/test_from_csv_to_parquet.py Show resolved Hide resolved
tests/workflows/test_from_csv_to_parquet.py Outdated Show resolved Hide resolved
tests/workflows/test_from_csv_to_parquet.py Outdated Show resolved Hide resolved
tests/workflows/test_from_csv_to_parquet.py Show resolved Hide resolved
tests/workflows/test_from_csv_to_parquet.py Show resolved Hide resolved
tests/workflows/test_from_csv_to_parquet.py Show resolved Hide resolved
@j-bennet
Copy link
Contributor

I just noticed that CI doesn't actually test workflows, unless it's a nightly job, or pr is marked with workflows label:

if: |
github.event_name == 'schedule'
|| (github.event_name == 'pull_request' && contains(github.event.pull_request.labels.*.name, 'workflows'))

You probably want that label (TIL).

@milesgranger
Copy link
Contributor Author

Sorry, I was sort of under the impression that (the label) was added towards the end when not many more adjustments were required so as to not waste runs(?). I've been running it from local machine though.

@j-bennet
Copy link
Contributor

Sorry, I was sort of under the impression that (the label) was added towards the end when not many more adjustments were required so as to not waste runs(?). I've been running it from local machine though.

Ah. :) In that case you're all set. Just making sure.

@milesgranger milesgranger added the workflows Related to representative Dask user workflows label May 15, 2023
@milesgranger
Copy link
Contributor Author

The failing test is unrelated it seems.

Copy link
Contributor

@j-bennet j-bennet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but before you merge, I'd still like to resolve the dataframe.convert-string issue. If it's getting applied, you should not need to explicitly convert object columns to string[pyarrow]. If it's not getting applied, take it out, it's misleading.

Not applied in current version
tests/workflows/test_from_csv_to_parquet.py Outdated Show resolved Hide resolved
tests/workflows/test_from_csv_to_parquet.py Outdated Show resolved Hide resolved
tests/workflows/test_from_csv_to_parquet.py Outdated Show resolved Hide resolved
Copy link
Contributor

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comment

tests/workflows/test_from_csv_to_parquet.py Outdated Show resolved Hide resolved

df = dd.read_csv(
files,
sep="\t",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes read_csv fall back to the python engine. 2 things:

  • if this is intended, we should be explicit
  • the python engine is really slow. So if this is not intended we should choose a file that has a 1 character separator

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose it's quite likely the average user (including me) would not know this.

Therefore, IMO, I'd think it ought to stay this way so when it changes in the future, the bump in performance would be evident historically because it's what the average user might expect.

The second option I think would be a slightly hacky work-around that the benchmarks aren't designed for. My understanding is these workflows should represent somewhat realistic use cases and not as a means for maximizing performance itself. Could be mistaken though. cc @fjetter @jrbourbeau

Copy link
Contributor

@phofl phofl May 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally fine by me if this is intended. Just wanted to check.
read_csv raises a warning that tells you that it will use the Python engine if you execute this code. That's why I suggested switching it explicitly. There aren't any plans to support regex or multi char seps in the c engine and I doubt that there ever will be, so we don't have to plan for that scenario

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a follow-up ticket, we could parametrize this test with comma-separated and tab-separated files if we care about the performance of the different engines.

@hendrikmakait hendrikmakait merged commit aefe978 into main May 23, 2023
@milesgranger milesgranger deleted the 738-continued-csv-to-parquet branch May 23, 2023 09:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting feedback awaiting merge workflows Related to representative Dask user workflows
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants