
'mlr cut' is very slow #1527

Closed
tooptoop4 opened this issue Mar 16, 2024 · 8 comments

@tooptoop4

For a 5 GB CSV, running 'mlr cut' to extract one column takes 5 minutes, but 'xsv select' takes only 30 seconds (https://github.com/BurntSushi/xsv).
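
For reference, the shape of the two commands being compared is roughly this (a minimal sketch; data.csv and some_column are placeholders, not from the report):

# placeholder file and column names
time mlr --csv --from data.csv cut -f some_column > out_mlr.csv
time xsv select some_column data.csv > out_xsv.csv

Both stream the selected column to a new file, so the timings are directly comparable.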

@johnkerl
Owner

johnkerl commented Mar 16, 2024

@tooptoop4 how many columns does the 5GB CSV have? (Also, if possible, can you link to the CSV file itself? No worries if not, but it'd be helpful).

The reason I ask about column-count is #1507 which will be in the next Miller release (6.12).
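
For anyone reproducing this, a quick way to get the column count from the header row (a sketch; data.csv and the ';' delimiter are placeholders, and it miscounts if the header quotes fields containing the delimiter):

# placeholder file name and delimiter; counts fields in the header line
head -1 data.csv | tr ';' '\n' | wc -l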

@aborruso
Contributor

Hi @johnkerl, I made a test using this big CSV: https://opencoesione.gov.it/media/open_data//progetti_esteso_20231231.zip

QSV

time qsv select -d ";" OC_COD_CICLO progetti_esteso_20231231.csv > cut_qsv.csv

3.988 total, using qsv 0.123

Miller

time mlr --csv --ifs ";" --from progetti_esteso_20231231.csv cut -f OC_COD_CICLO >cut_mlr.csv

The process was killed ([2] 119700 killed) after 2:15.86 total, using mlr 6.11.0.
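
A shell kill message like that usually means the kernel's OOM killer stopped the process; one way to confirm is to check peak memory (a sketch, assuming GNU time is installed at /usr/bin/time):

# GNU time's -v report includes "Maximum resident set size (kbytes)"
/usr/bin/time -v mlr --csv --ifs ';' --from progetti_esteso_20231231.csv cut -f OC_COD_CICLO > cut_mlr.csv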

duckdb

time duckdb --csv -c "select OC_COD_CICLO from read_csv('progetti_esteso_20231231.csv',delim=';')" >cut_duckdb.csv

1.382 total, using v0.10.0 20b1486d11

@aborruso
Contributor

The 1.382s duckdb figure above is not a time to consider: duckdb fails and does not extract all 1,977,163 rows. I am investigating further.
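
A quick way to compare row counts across the outputs (a sketch; wc -l counts the header line as well, and undercounts if quoted fields contain embedded newlines):

# header line is included in each count
wc -l progetti_esteso_20231231.csv cut_qsv.csv cut_duckdb.csv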

@aborruso
Contributor

aborruso commented Mar 16, 2024

In duckdb 0.10 there was this bug. It has been closed, but the fix is not available in the compiled stable version.
I redid everything using the duckdb pre-release version via Python (pip3 install duckdb --pre --upgrade) and I get the right output in 1.6 seconds, using 0.10.1-dev1037.

import duckdb
import time  # for wall-clock timing

start_time = time.time()

query = """
COPY (SELECT OC_COD_CICLO FROM read_csv('progetti_esteso_20231231.csv', delim=';'))
TO 'cut_duckdb.csv'
"""

duckdb.query(query)

end_time = time.time()
execution_time = end_time - start_time

print("Execution time: {:.2f} seconds".format(execution_time))

@johnkerl
Owner

Thanks @tooptoop4 and @aborruso

Also, I would note that xsv is well known for speed; there is no assertion that Miller will outperform it for speed ...

@aborruso
Contributor

John, I know; I love Miller, it is so comfortable to use, it is brilliant.
For me it is the best.

You were asking for an example and I included one that I am working with these days.

@johnkerl
Owner

Indeed @aborruso I should have mentioned -- the example is quite sufficient -- thank you! :)

@aborruso
Contributor

Hi @johnkerl, using the new release (6.12.0) I have had no errors.

The processing time was 12:45.40 total.
