Performance improvement by JIT type inference #786
Merged
I have made a simple `cat` test; really impressive. Thank you very much.
Context
The Miller 6 Go port didn't go far enough (until now) on #151, namely, with just-in-time type inference.
Recent performance analysis has shown that the implementation was spending too much time detecting string/int/float contents for fields. For example, when reading CSV data, each value was being scanned and detected as string/int/float, whether or not it was needed downstream. This was taking needless CPU cycles.
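To make the cost concrete, here is a rough sketch of the eager shape of the problem (hypothetical names, not Miller's actual reader code): every field of every record gets a string/int/float scan at read time, even when the pipeline never computes on it.

```go
package main

import (
	"fmt"
	"strconv"
)

// inferType eagerly classifies one field value as int, float, or string.
func inferType(s string) string {
	if _, err := strconv.ParseInt(s, 10, 64); err == nil {
		return "int"
	}
	if _, err := strconv.ParseFloat(s, 64); err == nil {
		return "float"
	}
	return "string"
}

func main() {
	// One CSV record: in the eager model every field is scanned on read,
	// even if the pipeline only ever computes on one or two of them.
	record := map[string]string{"color": "yellow", "rate": "2.5", "count": "3"}
	for key, value := range record {
		fmt.Println(key, inferType(value))
	}
}
```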
On this PR, the existing `MT_PENDING` type (heretofore used only for JSON-scanner intermediates) is moved to the fore. Now, CSV values (and DKVP, NIDX, XTAB, etc.) have their string representations retained but their type set to `MT_PENDING`. Then, just in time and only when needed, they are type-inferred. For example, imagine a 20-column CSV and `mlr --csv --from thatfile.csv put '$distance = $rate * $time'`. Before, all 20 columns of all rows were being type-inferred; now, only the two columns `rate` and `time` are type-inferred, and the rest are passed along as-is.

JSON processing, however, does not benefit from this gain: intrinsic to the JSON file format, we're required to type-infer all values, whether they are used in the processing chain or not.
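As a rough sketch of the deferral idea (simplified types and names, not Miller's actual `mlrval` implementation): the original string is kept, the type is marked pending, and the scan happens only when an expression such as `$rate * $time` touches the field.

```go
package main

import (
	"fmt"
	"strconv"
)

// mvType is a simplified stand-in for Miller's value-type tags.
type mvType int

const (
	mtPending mvType = iota // string representation retained; not yet inferred
	mtString
	mtInt
	mtFloat
)

// mlrval is a simplified stand-in: the original string is kept, and
// scanning is deferred until a computation needs a typed value.
type mlrval struct {
	mvtype   mvType
	original string
	intval   int64
	floatval float64
}

func fromPending(s string) *mlrval {
	return &mlrval{mvtype: mtPending, original: s}
}

// infer scans the retained string only on first typed use.
func (v *mlrval) infer() {
	if v.mvtype != mtPending {
		return
	}
	if i, err := strconv.ParseInt(v.original, 10, 64); err == nil {
		v.mvtype, v.intval = mtInt, i
	} else if f, err := strconv.ParseFloat(v.original, 64); err == nil {
		v.mvtype, v.floatval = mtFloat, f
	} else {
		v.mvtype = mtString
	}
}

// asFloat gives a float view of an int or float value, inferring on demand.
func (v *mlrval) asFloat() float64 {
	v.infer()
	if v.mvtype == mtInt {
		return float64(v.intval)
	}
	return v.floatval
}

func main() {
	// A record with three fields; only rate and time are used in arithmetic.
	rate := fromPending("2.5")
	tm := fromPending("3")
	color := fromPending("yellow")

	distance := rate.asFloat() * tm.asFloat()
	fmt.Println(distance)                  // 7.5
	fmt.Println(color.mvtype == mtPending) // true: never scanned
}
```

The win scales with how few fields the pipeline actually touches, which is why JSON, where every value must be typed at parse time, sees no change.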
Performance results
As of this PR, along with the preceding #765, #774, #779, and #781, Miller 6 throughput regularly meets or exceeds that of Miller 5.
Benchmark scripts are at
An upcoming page at https://miller.readthedocs.io/en/latest/ will show graphical results.
Some numbers: processing times in seconds, on a commodity Mac laptop, for a million-line expand of example.csv:
The speedups for all but the n-chain cases are due to work on this and recent PRs. The speedups for n-chain cases are intrinsic to the Go port's use of multicore processing.