- last number not identified in an isolated value pair

  I am using the Lexer to parse JSON chunks that I get on stdin one line at a time. I appreciate that my use case isn't the intended one for this lib, as all the tests deal with fully formed JSON objects and nowhere does it state that this lib will work with chunks. But since it works perfectly for my use case, I thought others might benefit from this fix. Feel free to ignore it, though.

  This PR fixes a problem where the last number value won't be identified as a number, because numbers don't have a terminator the way strings do. In normal circumstances the next token would serve as the terminator (comma, curly bracket, etc.), but if the number line is the last one in a JSON object, the chunk simply ends, the object's closing curly brace arrives on the next line, and the number gets ignored.

  This fix checks whether the last processed token was a Number and returns the token type accordingly.

  Happy to add any further improvements or additional tests that would be beneficial.
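The described failure mode can be sketched with a toy chunk lexer. All names here (`Token`, `lex_chunk`, `at_end`) are illustrative, not this crate's actual API; the `at_end` flush plays the role of the fix: a number that ends the input has no terminating token, so it must be emitted explicitly.

```rust
// Hypothetical sketch, NOT the crate's real lexer.
#[derive(Debug, PartialEq)]
enum Token {
    Number(String),
    Comma,
    CurlyOpen,
    CurlyClose,
}

// Lex a single chunk. `at_end` signals that no further input follows.
fn lex_chunk(chunk: &str, at_end: bool) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut num = String::new();
    for c in chunk.chars() {
        match c {
            '0'..='9' | '-' | '.' => num.push(c),
            _ => {
                // Any non-number character terminates a pending number.
                if !num.is_empty() {
                    tokens.push(Token::Number(std::mem::take(&mut num)));
                }
                match c {
                    ',' => tokens.push(Token::Comma),
                    '{' => tokens.push(Token::CurlyOpen),
                    '}' => tokens.push(Token::CurlyClose),
                    _ => {} // whitespace, quotes, etc. ignored in this sketch
                }
            }
        }
    }
    // The fix: without this flush, a number ending the chunk is dropped.
    if at_end && !num.is_empty() {
        tokens.push(Token::Number(num));
    }
    tokens
}
```

Without the final flush, `lex_chunk("42", true)` would return no tokens at all, which is exactly the bug this PR addresses.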
- 8 commits contributed to the release over the course of 1635 calendar days.
- 1635 days passed between releases.
- 1 commit was understood as conventional.
- 0 issues like '(#ID)' were seen in commit messages
- Uncategorized
    - Avoid using the 'alpha' suffix (1b9c1d7)
    - Last number not identified in an isolated value pair (927c987)
    - Isolated value pairs fix (2b47444)
    - Optimize includes (3fe2df2)
    - Add new badges (3c9fb9e)
    - Add clippy and cargo-fmt lints (6129e80)
    - Create rust.yml (ee18ffa)
    - (cargo-release) start next development iteration 1.1.3-alpha.0 (0668d4a)
- 2 commits contributed to the release.
- 0 commits were understood as conventional.
- 0 issues like '(#ID)' were seen in commit messages
- v1.0.1
- README misses BufferType argument to Lexer
- change description
- handle more string escapes
- handle numbers with exponents
- syntax error
- only run benches on nightly
- 16 commits contributed to the release over the course of 1124 calendar days.
- 1124 days passed between releases.
- 7 commits were understood as conventional.
- 0 issues like '(#ID)' were seen in commit messages
- Uncategorized
    - Use criterion for benchmarks (2be30e9)
    - Improve readme (0b036d6)
    - Clippy (08057a6)
    - Simplification and modernization (4f168d4)
    - Merge pull request #12 from FauxFaux/patch-1 (520763f)
    - README misses BufferType argument to Lexer (bcaaebb)
    - Cut 1.1.0 (5be61b2)
    - Merge pull request #11 from heycam/num-str-fixes (df6d401)
    - Handle more string escapes (df37dc8)
    - Handle numbers with exponents (e1afcbc)
    - Add crates badge (b53e873)
    - Apply latest rustfmt (0750473)
    - Syntax error (1235550)
    - Only run benches on nightly (067e93a)
    - V1.0.1 (79ae678)
    - Change description (cc24d29)
- This adds support for Number tokens with exponents and String tokens with the full range of escapes that are allowed.
- Changed the headline of the crate to be more descriptive.
- travis: more rust versions and no travis-cargo (fd4cd367)
  The latter is only useful for doc-uploading, yet we don't need that anymore in the age of docs.rs.
- format everything
- remove do-not-edit note
  A left-over of the original [skip ci]
- benchmark: remove old code (f969498a)
  It didn't compile anymore, and also probably wasn't really testing us anyway.
- 8 commits contributed to the release over the course of 559 calendar days.
- 604 days passed between releases.
- 4 commits were understood as conventional.
- 0 issues like '(#ID)' were seen in commit messages
- Uncategorized
    - Format everything (fb0758e)
    - Remove old code (f969498)
    - More rust versions and no travis-cargo (fd4cd36)
    - Merge pull request #10 from cmr/master (da0d2c8)
    - Relicense to dual MIT/Apache-2.0 (853e77a)
    - Merge pull request #8 from jnicholls/master (9303648)
    - Rust 1.x stable compatibility fix (d2f6a43)
    - Remove do-not-edit note (030f580)
- use `IntoIterator`
  We use `IntoIterator` in place of `Iterator`, which should provide more flexibility when feeding the Lexer. However, in our tests we can't actually use that, unfortunately, due to the consuming semantics (and literals cannot be consumed ...). However, it's a nice proof of concept and doesn't hurt.
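The flexibility gain can be illustrated with a minimal constructor sketch (the `Lexer` struct and `count_bytes` helper here are hypothetical, not the crate's real signatures): a function bounded by `IntoIterator` accepts vectors, iterators, and anything else convertible into a byte iterator, without the caller writing `.into_iter()`.

```rust
// Illustrative sketch of an IntoIterator-based constructor.
struct Lexer<I: Iterator<Item = u8>> {
    input: I,
}

impl<I: Iterator<Item = u8>> Lexer<I> {
    // Accept anything that can be turned into a byte iterator.
    fn new<T: IntoIterator<Item = u8, IntoIter = I>>(source: T) -> Self {
        Lexer { input: source.into_iter() }
    }

    // Dummy consumer, standing in for actual lexing.
    fn count_bytes(self) -> usize {
        self.input.count()
    }
}
```

Since every `Iterator` also implements `IntoIterator`, existing call sites keep working; the consuming semantics noted above are visible here too, as `new` takes `source` by value.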
- v0.3.0
- 2 commits contributed to the release.
- 2 commits were understood as conventional.
- 0 issues like '(#ID)' were seen in commit messages
- v0.2.0
- more fun with Token-Iterators
  Attaches constructor utilities to all `Iterator<Item=Token>` to ease using them. It's similar to the (former) `IteratorExt`, which would provide `chain()` and `nth()`, for instance.
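The `IteratorExt`-style pattern mentioned above is a blanket extension trait. A hedged sketch (the `Token` variants and `max_depth` utility are made up for illustration; the crate's actual utilities differ):

```rust
// Illustrative token type, not the crate's real one.
#[derive(Debug)]
enum Token {
    CurlyOpen,
    CurlyClose,
    Number,
}

// Extension trait attaching utilities to every Iterator<Item = Token>.
trait TokenIteratorExt: Iterator<Item = Token> + Sized {
    // Example utility: maximum object nesting depth in the token stream.
    fn max_depth(self) -> usize {
        let (mut depth, mut max) = (0usize, 0usize);
        for t in self {
            match t {
                Token::CurlyOpen => {
                    depth += 1;
                    max = max.max(depth);
                }
                Token::CurlyClose => depth = depth.saturating_sub(1),
                _ => {}
            }
        }
        max
    }
}

// Blanket impl: every matching iterator gets the utilities for free.
impl<I: Iterator<Item = Token> + Sized> TokenIteratorExt for I {}
```

This is the same mechanism `chain()` and `nth()` used before they moved into `Iterator` itself: default methods on a trait plus a blanket impl.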
- 2 commits contributed to the release.
- 2 commits were understood as conventional.
- 0 issues like '(#ID)' were seen in commit messages
- set v0.1.1
- offset-map for target buffer
  Previously we would keep writing the first bytes of our destination buffer, as we wouldn't compute any offset at all. Now we produce slices of exactly the right size, and could verify that this is working.
- 2 commits contributed to the release.
- 2 commits were understood as conventional.
- 0 issues like '(#ID)' were seen in commit messages
- initial commit
- 1 commit contributed to the release.
- 1 commit was understood as conventional.
- 0 issues like '(#ID)' were seen in commit messages
- Uncategorized
    - Initial commit (f3b4120)
- lexer
- null-filter initial implementation (97adcb85)
- token-reader
- optimize buffer usage
    - only push characters when we have a buffer and are dealing with strings or numbers
    - added some performance tests to show the difference. We are not quite back at 280MB/s, but down to 220MB/s for the optimal/Span case. The Bytes case goes down to 137MB/s.
- use enum for state
  Instead of using many different variables for handling the state, we use an enumeration. That way, we don't unnecessarily initialize memory that will never be used, and lower our requirements for stack space.
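The state-enum idea can be sketched like this (illustrative types, not the crate's actual ones): instead of separate `in_string`/`in_number` flags each with their own buffer, a single enum carries only the data the current state needs.

```rust
// Illustrative lexer state machine, not the crate's real implementation.
#[derive(Debug, PartialEq)]
enum Mode {
    Idle,
    InString(Vec<u8>),
    InNumber(Vec<u8>),
}

// Advance the state machine by one input byte.
fn step(mode: Mode, byte: u8) -> Mode {
    match (mode, byte) {
        (Mode::Idle, b'"') => Mode::InString(Vec::new()),
        (Mode::Idle, b'0'..=b'9') => Mode::InNumber(vec![byte]),
        (Mode::InString(_), b'"') => Mode::Idle, // string closed
        (Mode::InString(mut buf), b) => {
            buf.push(b);
            Mode::InString(buf)
        }
        (Mode::InNumber(mut buf), b @ b'0'..=b'9') => {
            buf.push(b);
            Mode::InNumber(buf)
        }
        (Mode::InNumber(_), _) => Mode::Idle, // number terminated
        (m, _) => m, // everything else ignored in this sketch
    }
}
```

Only the active variant's buffer exists at any time, which is the memory and stack-space point the entry makes.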
- separate lexer and filters
  `FilterNull` now has its own module; `Lexer` and friends have their own module.
- string value fast-path
  Technically it's not a fast-path, but it makes the code more uniform and easier to understand. It will be the default template for all other lexing we do, for instance when implementing booleans.
- update to match current state
  We also anticipate pretty-printing, which technically isn't there yet.
- high-speed serialize tests
  Producers are optimized for performance to show exactly how fast a span token producer is compared to a bytes token producer. That way, performance improvements can be quantified exactly.
- added benchmarks
  Under optimal conditions (source string is known) we remove null values at 144MB/s; fully streamed we do 104MB/s. An acceptable result, considering the unoptimized buffer handling; using a deque might improve this a lot. However, we manage to retrieve invalid tokens, which we have to handle somehow, and also don't expect here.
- make it general
    - What was formerly known as NullFilter can now filter out all key-value pairs with a given value type. That way, null can be filtered, as well as numbers, for example.
    - Made certain dual-branch matches an if-clause, which moved everything further to the left again, while making the code easier to read.
- operate on u8 instead of char
  This improved throughput from 230MB/s to 280MB/s, which is well worth it. In the case of JSON, only the values within strings are potentially unicode; everything else is not.
- added benchmark
  Also renamed test-related files to match their purpose a bit better.
- benchmark and string_value tests
    - verify escaping in string values works
    - added a more complex benchmark, lexing at 415MB/s
    - tested unclosed string value
- machine serialization works
  In a first version, we show that serialization without any space works as expected according to very first tests. More tests have to be conducted to be sure.
- infrastructure setup
    - added `TokenReader` type with basic API
    - improve existing filter tests to verify TokenReader operation
- support for `Buffer` enum
  It cuts our speed in half, currently, but allows choosing between high-speed Span mode and half-speed Buffer mode. That way, all applications I see can be catered to with best performance.
- initial implementation
  It is known not to work in all cases, i.e. it can only take one key-value pair at a time (no consecutive ones), but besides that it is a pretty optimal implementation (even though it ain't a pretty one). However, the test still fails; we match nothing for some reason. Must be evaluated later.
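The Span-vs-Bytes trade-off behind the `Buffer` enum two entries above can be sketched as follows (illustrative types; the crate's real enum and field names differ): Span mode stores only offsets into the original source (zero-copy, fast), while Bytes mode owns a copy of the token's bytes so it works even when the source is gone.

```rust
// Illustrative buffer representation, not the crate's real API.
#[derive(Debug, PartialEq)]
enum Buffer {
    Span { first: usize, end: usize }, // zero-copy: offsets into the source
    Bytes(Vec<u8>),                    // owned copy of the token's bytes
}

// Resolve a token's text; Span mode needs the original source at hand.
fn token_text<'a>(buf: &'a Buffer, source: &'a [u8]) -> &'a [u8] {
    match buf {
        Buffer::Span { first, end } => &source[*first..*end],
        Buffer::Bytes(b) => b,
    }
}
```

Copying into `Bytes` is what "cuts our speed in half": every token pays an allocation and a memcpy that Span mode avoids entirely.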
- number lexing
  Even though our performance dropped from 332MB/s to 274MB/s, we are happy, as the implementation cleaned up our span handling considerably and thus made the code even more maintainable. Cleaning up the span handling also made it faster, i.e. the slowest number parsing was at 232MB/s.
- filtering iterator infrastructure
    - FilterNull frame would allow implementing lexical token pattern matching to remove null values.
- true and false value lexing
    - including tests for both
    - refactored the null value test case to easily test booleans as well
- null value lexing
    - including tests for normal null values and invalid ones
- string value parsing
  Including test.
- datastructures and basic tests
  The tests still fail, as the actual lexer implementation is still to be done.
- added remaining documentation
  I believe it's best not to add redundant information into the library docs, but instead refer to the tests and benchmarks.
- state why numbers won't be lexed
  Also it's not required to solve our actual problem.
- usage added
- clog config + changelog
- also run benchmarks
- set GH_TOKEN ... instead of TOKEN
- with doc-upload
  Never worked for me, but let's try it one more time.
- no nightly please
  As we didn't set the package unstable.
- added secret
  Minor format adjustment.
- lexer operate on u8 instead of char (d5a694d1)
- null-filter make it general (431f051d)
- README update to match current state (75181ff6)
- null-filter
- proper comma handling (321fa592)
- handle consecutive null values (96e20e65)
- minor fix to make it work (e489bffa)
- minor fix to make it work
  It's still far from perfect, but a good proof of concept.
- handle consecutive null values
  With this in place, we handle null value filtering pretty well, as the tests indicate too. However, we may still leave a trailing comma in non-null values, which could be a problem and thus shouldn't be done!
- proper comma handling
    - Added support for leading `,` characters, which have to be removed conditionally.
    - Added tests to verify this works in valid streams, and even invalid ones.
- removed possible overflow
  Previously it was possible to over-allocate memory by feeding us lots of `,` characters. This was alleviated by allowing a one-token look-ahead (implemented through put-back).
- handle whitespace at end of source
  Previously we would consider such whitespace invalid. Now we explicitly set the invalid state, which is ... explicit = better :)! We added a test to show this actually works.
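The put-back mechanism behind the one-token look-ahead in the "removed possible overflow" entry can be sketched as a tiny iterator wrapper (names are illustrative; a similar adapter exists in the itertools crate). Because the slot holds at most one item, the unbounded buffering that allowed the over-allocation cannot occur.

```rust
// Minimal put-back adapter: at most ONE item of look-ahead.
struct PutBack<I: Iterator> {
    iter: I,
    slot: Option<I::Item>,
}

impl<I: Iterator> PutBack<I> {
    fn new(iter: I) -> Self {
        PutBack { iter, slot: None }
    }

    // Return a consumed item to the front of the stream.
    fn put_back(&mut self, item: I::Item) {
        debug_assert!(self.slot.is_none(), "only one item of look-ahead");
        self.slot = Some(item);
    }
}

impl<I: Iterator> Iterator for PutBack<I> {
    type Item = I::Item;
    fn next(&mut self) -> Option<I::Item> {
        // Drain the put-back slot first, then fall through to the inner iterator.
        self.slot.take().or_else(|| self.iter.next())
    }
}
```

A filter can peek at the next token by calling `next()` and, if the token shouldn't be consumed yet, hand it back with `put_back`.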
- 35 commits contributed to the release over the course of 2 calendar days.
- 2 days passed between releases.
- 35 commits were understood as conventional.
- 0 issues like '(#ID)' were seen in commit messages
- Uncategorized
    - Added remaining documentation (077fe41)
    - Clog config + changelog (ffb3c71)
    - Update to match current state (75181ff)
    - Handle whitespace at end of source (1d57bc9)
    - High-speed serialize tests (a5e3c3d)
    - Added benchmarks (8c5e9f2)
    - Machine serialization works (458928d)
    - Infrastructure setup (96dac09)
    - Make it general (431f051)
    - Optimize buffer usage (08ad49b)
    - Support for `Buffer` enum (a3e72b5)
    - Operate on u8 instead of char (d5a694d)
    - Removed possible overflow (50c9f81)
    - Proper comma handling (321fa59)
    - Handle consecutive null values (96e20e6)
    - Added benchmark (43a1119)
    - Minor fix to make it work (e489bff)
    - Initial implementation (97adcb8)
    - Number lexing (f952f08)
    - Use enum for state (e924f03)
    - Separate lexer and filters (0a7e5c7)
    - Filtering iterator infrastructure (fb94ea9)
    - State why numbers won't be lexed (270f57c)
    - True and false value lexing (97ae908)
    - Also run benchmarks (32cd37b)
    - Set GH_TOKEN (b5ab5a1)
    - With doc-upload (5287268)
    - No nightly please (dc439b6)
    - String value fast-path (dd40f6e)
    - Null value lexing (dc2f9a2)
    - Benchmark and string_value tests (d4782c8)
    - String value parsing (e9b6072)
    - Usage added (e9cc19e)
    - Added secret (dbf42ec)
    - Datastructures and basic tests (f66ea5f)