
Performance improvement by JIT type inference #786

Merged
merged 19 commits into from
Dec 21, 2021
Conversation

@johnkerl (Owner) commented Dec 21, 2021

Context

The Miller 6 Go port didn't go far enough (until now) on #151, namely, for just-in-time type-inferencing.

Recent performance analysis has shown that the implementation was spending too much time detecting string/int/float contents for fields. E.g. reading CSV data like

color,shape,flag,k,index,quantity,rate
yellow,triangle,true,1,11,43.6498,9.8870
...

each value was being scanned and classified as string, int, or float, consuming needless CPU cycles.
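To make the cost concrete, here is a minimal sketch (illustrative only, not Miller's actual code) of the eager approach being replaced: every field of every record is scanned with strconv on ingest, whether or not the processing chain ever uses it.

```go
package main

import (
	"fmt"
	"strconv"
)

type valueType int

const (
	typeString valueType = iota
	typeInt
	typeFloat
)

// inferType eagerly classifies a field's string representation.
// With eager inference, this runs once per field per record.
func inferType(s string) valueType {
	if _, err := strconv.ParseInt(s, 10, 64); err == nil {
		return typeInt
	}
	if _, err := strconv.ParseFloat(s, 64); err == nil {
		return typeFloat
	}
	return typeString
}

func main() {
	// One data row from the example CSV above.
	row := []string{"yellow", "triangle", "true", "1", "11", "43.6498", "9.8870"}
	for _, v := range row {
		fmt.Println(v, inferType(v)) // every column pays the scanning cost
	}
}
```

For a million-row, 20-column file, that is 20 million scans even if the chain only ever reads two of the columns.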

In this PR, the existing MT_PENDING type (heretofore used only for JSON-scanner intermediates) moves to the fore. CSV values (and DKVP, NIDX, XTAB, etc.) now retain their string representations but have their type set to MT_PENDING; then, just in time and only when needed, they are inferred. For example, imagine a 20-column CSV and mlr --csv --from thatfile.csv put '$distance = $rate * $time'. Before, all 20 columns of every row were type-inferred; now, only the two columns rate and time are type-inferred, and the rest are passed along as-is.
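The mechanism can be sketched as follows (a hedged illustration assuming a simplified value type; names like mlrval and mtPending are modeled on, but not identical to, Miller's internals): a value keeps its original string and stays "pending" until an accessor actually needs its type.

```go
package main

import (
	"fmt"
	"strconv"
)

type mtype int

const (
	mtPending mtype = iota // not yet inferred
	mtString
	mtInt
	mtFloat
)

type mlrval struct {
	orig string // original string representation, always retained
	t    mtype
	ival int64
	fval float64
}

// fromPending defers all inference: construction is just a string copy.
func fromPending(s string) *mlrval { return &mlrval{orig: s, t: mtPending} }

// resolve infers the type just in time, at most once per value.
func (v *mlrval) resolve() {
	if v.t != mtPending {
		return
	}
	if i, err := strconv.ParseInt(v.orig, 10, 64); err == nil {
		v.t, v.ival = mtInt, i
		return
	}
	if f, err := strconv.ParseFloat(v.orig, 64); err == nil {
		v.t, v.fval = mtFloat, f
		return
	}
	v.t = mtString
}

// asFloat is an example accessor that forces inference on first use.
func (v *mlrval) asFloat() float64 {
	v.resolve()
	switch v.t {
	case mtInt:
		return float64(v.ival)
	case mtFloat:
		return v.fval
	default:
		return 0 // simplified; real code would signal a type error
	}
}

func main() {
	rate := fromPending("43.6498")
	color := fromPending("yellow") // never accessed: stays pending, no scan cost
	fmt.Println(rate.asFloat(), color.t == mtPending)
}
```

Only fields touched by the put expression ever call resolve; untouched fields are written back out from their retained string representations.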

JSON processing, however, does not benefit from this gain: intrinsic to the JSON file format, all values must be type-inferred whether or not they are used in the processing chain.

Performance results

As of this PR, along with the preceding #765, #774, #779, and #781, Miller 6 throughput regularly meets or exceeds Miller 5's.

Benchmark scripts are at

An upcoming page at https://miller.readthedocs.io/en/latest/ will show graphical results.

Some numbers: processing times in seconds, on a commodity Mac laptop, for a million-line expand of example.csv:

Operation         Miller 5   Miller 6   Speedup
CSV cat              2.609      1.925     1.35x
CSV-lite cat         1.720      1.726     0.99x
CSV check            1.492      1.234     1.20x
CSV tail             1.587      1.221     1.29x
CSV tac              2.943      4.030     0.73x
CSV sort             3.105      4.181     0.74x
DKVP cat             2.460      2.079     1.18x
NIDX cat             1.614      1.763     0.91x
JSON cat            11.834     11.578     1.02x
CSV 1-put chain      5.917      3.914     1.51x
CSV 2-put chain      9.094      4.598     1.97x
CSV 3-put chain     12.012      5.241     2.29x
CSV 4-put chain     15.093      6.053     2.49x
CSV 5-put chain     19.111      7.029     2.71x
CSV 6-put chain     21.218      8.371     2.53x

The speedups for all but the n-chain cases are due to work on this and recent PRs. The speedups for n-chain cases are intrinsic to the Go port's use of multicore processing.
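The multicore effect can be illustrated with a small Go sketch (an assumed architecture for illustration, not Miller's exact code): each verb in an n-put chain runs as its own goroutine, with records streamed between stages over channels, so stages execute concurrently on separate cores.

```go
package main

import "fmt"

type record map[string]float64

// verb runs one stage of a put-chain in its own goroutine: it reads
// records from in, applies f, and forwards the results downstream.
func verb(in <-chan record, f func(record) record) <-chan record {
	out := make(chan record)
	go func() {
		defer close(out)
		for r := range in {
			out <- f(r)
		}
	}()
	return out
}

func main() {
	src := make(chan record)
	go func() {
		defer close(src)
		src <- record{"rate": 43.6498, "time": 2.0}
	}()

	// Two chained verbs, each on its own goroutine, hence runnable
	// on separate cores while records stream through.
	stage1 := verb(src, func(r record) record {
		r["distance"] = r["rate"] * r["time"]
		return r
	})
	stage2 := verb(stage1, func(r record) record {
		r["double"] = 2 * r["distance"]
		return r
	})

	for r := range stage2 {
		fmt.Println(r["distance"], r["double"])
	}
}
```

This is why the n-put rows in the table above scale better as n grows: the single-process Miller 5 pays for each chained verb serially, while concurrent stages overlap their work.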

@johnkerl johnkerl merged commit 7a97c9b into main Dec 21, 2021
@johnkerl johnkerl deleted the jit-type-infer branch December 21, 2021 04:56
@aborruso (Contributor) commented:
I have run a simple cat test; really impressive. Thank you very much!
