
Performance improvement by JIT type inference #786

Merged
merged 19 commits into from
Dec 21, 2021
Conversation

@johnkerl (Owner) commented Dec 21, 2021

Context

The Miller 6 Go port didn't go far enough (until now) on #151, namely, for just-in-time type-inferencing.

Recent performance analysis has shown that the implementation was spending too much time detecting string/int/float contents for fields. E.g. reading CSV data like

color,shape,flag,k,index,quantity,rate
yellow,triangle,true,1,11,43.6498,9.8870
...

each value was being scanned and classified as string, int, or float, consuming needless CPU cycles.
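To make the cost concrete, here is a minimal sketch (illustrative only, not Miller's actual code) of the eager approach being replaced: every field of every record is scanned with strconv on ingest, whether or not the processing chain ever uses it.

```go
package main

import (
	"fmt"
	"strconv"
)

type valueType int

const (
	typeString valueType = iota
	typeInt
	typeFloat
)

// inferType eagerly classifies a field's string representation.
// With eager inference, this runs once per field per record.
func inferType(s string) valueType {
	if _, err := strconv.ParseInt(s, 10, 64); err == nil {
		return typeInt
	}
	if _, err := strconv.ParseFloat(s, 64); err == nil {
		return typeFloat
	}
	return typeString
}

func main() {
	// One data row from the example CSV above.
	row := []string{"yellow", "triangle", "true", "1", "11", "43.6498", "9.8870"}
	for _, v := range row {
		fmt.Println(v, inferType(v)) // every column pays the scanning cost
	}
}
```

For a million-row, 20-column file, that is 20 million scans even if the chain only ever reads two of the columns.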

In this PR, the existing MT_PENDING type (heretofore used only for JSON-scanner intermediates) moves to the fore. CSV values (and DKVP, NIDX, XTAB, etc.) now retain their string representations but have their type set to MT_PENDING; then, just in time and only when needed, they are inferred. For example, imagine a 20-column CSV and mlr --csv --from thatfile.csv put '$distance = $rate * $time'. Before, all 20 columns of every row were type-inferred; now, only the two columns rate and time are type-inferred, and the rest are passed along as-is.
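The mechanism can be sketched as follows (a hedged illustration assuming a simplified value type; names like mlrval and mtPending are modeled on, but not identical to, Miller's internals): a value keeps its original string and stays "pending" until an accessor actually needs its type.

```go
package main

import (
	"fmt"
	"strconv"
)

type mtype int

const (
	mtPending mtype = iota // not yet inferred
	mtString
	mtInt
	mtFloat
)

type mlrval struct {
	orig string // original string representation, always retained
	t    mtype
	ival int64
	fval float64
}

// fromPending defers all inference: construction is just a string copy.
func fromPending(s string) *mlrval { return &mlrval{orig: s, t: mtPending} }

// resolve infers the type just in time, at most once per value.
func (v *mlrval) resolve() {
	if v.t != mtPending {
		return
	}
	if i, err := strconv.ParseInt(v.orig, 10, 64); err == nil {
		v.t, v.ival = mtInt, i
		return
	}
	if f, err := strconv.ParseFloat(v.orig, 64); err == nil {
		v.t, v.fval = mtFloat, f
		return
	}
	v.t = mtString
}

// asFloat is an example accessor that forces inference on first use.
func (v *mlrval) asFloat() float64 {
	v.resolve()
	switch v.t {
	case mtInt:
		return float64(v.ival)
	case mtFloat:
		return v.fval
	default:
		return 0 // simplified; real code would signal a type error
	}
}

func main() {
	rate := fromPending("43.6498")
	color := fromPending("yellow") // never accessed: stays pending, no scan cost
	fmt.Println(rate.asFloat(), color.t == mtPending)
}
```

Only fields touched by the put expression ever call resolve; untouched fields are written back out from their retained string representations.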

JSON processing, however, does not benefit from this gain: intrinsic to the JSON file format, all values must be type-inferred whether or not they are used in the processing chain.

Performance results

As of this PR, along with the preceding #765, #774, #779, and #781, Miller 6 throughput regularly meets or exceeds Miller 5's.

Benchmark scripts are at

An upcoming page at https://miller.readthedocs.io/en/latest/ will show graphical results.

Some numbers: processing times in seconds, on a commodity Mac laptop, for a million-line expand of example.csv:

Operation         Miller 5   Miller 6   Speedup
CSV cat              2.609      1.925     1.35x
CSV-lite cat         1.720      1.726     0.99x
CSV check            1.492      1.234     1.20x
CSV tail             1.587      1.221     1.29x
CSV tac              2.943      4.030     0.73x
CSV sort             3.105      4.181     0.74x
DKVP cat             2.460      2.079     1.18x
NIDX cat             1.614      1.763     0.91x
JSON cat            11.834     11.578     1.02x
CSV 1-put chain      5.917      3.914     1.51x
CSV 2-put chain      9.094      4.598     1.97x
CSV 3-put chain     12.012      5.241     2.29x
CSV 4-put chain     15.093      6.053     2.49x
CSV 5-put chain     19.111      7.029     2.71x
CSV 6-put chain     21.218      8.371     2.53x

The speedups for all but the n-chain cases are due to work on this and recent PRs. The speedups for n-chain cases are intrinsic to the Go port's use of multicore processing.
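The multicore effect can be illustrated with a small Go sketch (an assumed architecture for illustration, not Miller's exact code): each verb in an n-put chain runs as its own goroutine, with records streamed between stages over channels, so stages execute concurrently on separate cores.

```go
package main

import "fmt"

type record map[string]float64

// verb runs one stage of a put-chain in its own goroutine: it reads
// records from in, applies f, and forwards the results downstream.
func verb(in <-chan record, f func(record) record) <-chan record {
	out := make(chan record)
	go func() {
		defer close(out)
		for r := range in {
			out <- f(r)
		}
	}()
	return out
}

func main() {
	src := make(chan record)
	go func() {
		defer close(src)
		src <- record{"rate": 43.6498, "time": 2.0}
	}()

	// Two chained verbs, each on its own goroutine, hence runnable
	// on separate cores while records stream through.
	stage1 := verb(src, func(r record) record {
		r["distance"] = r["rate"] * r["time"]
		return r
	})
	stage2 := verb(stage1, func(r record) record {
		r["double"] = 2 * r["distance"]
		return r
	})

	for r := range stage2 {
		fmt.Println(r["distance"], r["double"])
	}
}
```

This is why the n-put rows in the table above scale better as n grows: the single-process Miller 5 pays for each chained verb serially, while concurrent stages overlap their work.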

@johnkerl johnkerl merged commit 7a97c9b into main Dec 21, 2021
@johnkerl johnkerl deleted the jit-type-infer branch December 21, 2021 04:56
@aborruso (Contributor) commented:
I have run a simple cat test; really impressive. Thank you very much!
