Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix iterator_input_adapter consuming input before it should #3442

Open
wants to merge 4 commits into
base: develop
Choose a base branch
from

Conversation

imwhocodes
Copy link

@imwhocodes imwhocodes commented Apr 14, 2022

Before this PR iterator_input_adapter was indiscriminately consuming/advancing the input iterator before actually know if it should on get_character()

Case example:

  1. iterator_input_adapter over a custom input (ex input adapter over a stream) that can yield on iterator advance: even with strict=false the iterator_input_adapter was trying to consume one item more even if the parser or the sax implementation had return false
  2. iterator_input_adapter over a custom input (ex vector) with multiple json chained together even with strict=false the iterator_input_adapter was consuming one char more than it should even if the parser or the sax implementation had return false

This implementation fix the issue, it advance the iterator only if the current has ben already consumed, otherwise it wait the next get_character() to actually advance the iterator


Pull request checklist

Read the Contribution Guidelines for detailed information.

  • Changes are described in the pull request
  • The test suite compiles and runs without error.
  • Code coverage is 100%. Test cases can be added by editing the test suite.
  • The source code is amalgamated; that is, after making changes to the sources in the include/nlohmann directory, run make amalgamate to create the single-header file single_include/nlohmann/json.hpp. The whole process is described here.


template<typename BaseInputAdapter, size_t T>
friend struct wide_string_input_helper;

bool empty() const
bool empty()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this remain const qualified? (Meaning current and current_has_been_consumed would have to be made mutable.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do it should?
Empty actually need to do work (advance the underlying iterator) if input has been already consumed

keeping it const would be only to legacy reasons

Copy link
Contributor

@falbrechtskirchinger falbrechtskirchinger Apr 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, that's why we have the mutable keyword in the language. It's more to meet expectations. You're not going to expect empty() to not work on a const object based on how the same function operates in other classes.

Edit: I'll take Niels' thumb up as agreement.

Copy link
Author

@imwhocodes imwhocodes Apr 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Roger that!

@@ -5542,30 +5543,46 @@ class iterator_input_adapter
using char_type = typename std::iterator_traits<IteratorType>::value_type;

iterator_input_adapter(IteratorType first, IteratorType last)
: current(std::move(first)), end(std::move(last))
: current(std::move(first)), end(std::move(last)), current_has_been_consumed(false)
Copy link
Contributor

@falbrechtskirchinger falbrechtskirchinger Apr 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may as well remove the member initializer here, since you've already initialized it in the declaration.

I don't know if clang-tidy, etc. will complain. There're various checks for redundant stuff.

Edit: clang-tidy did in fact complain:
error: member initializer for 'current_has_been_consumed' is redundant

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it is better to keep it in the constructor (maybe explicitly delete the default constructor) and remove the initialisation in the declaration?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it is better to keep it in the constructor (maybe explicitly delete the default constructor) and remove the initialisation in the declaration?

Pick one. I prefer in-class myself, so I'd remove the member initializer as suggested.

The default constructor is implicitly deleted because we have a user-defined constructor, right?

Copy link
Author

@imwhocodes imwhocodes Apr 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default constructor is implicitly deleted because we have a user-defined constructor, right?

Yes, this is actually correct

@coveralls
Copy link

coveralls commented Apr 14, 2022

Coverage Status

Coverage decreased (-0.03%) to 99.967% when pulling 2614b81 on imwhocodes:iterator_input_adapter_fix into 87cda1d on nlohmann:develop.

@falbrechtskirchinger
Copy link
Contributor

I assume this is at least somewhat related to this discussion? #3411

Can you construct a unit test that fails with develop and succeeds after your PR? (You should probably add a unit test anyway.)

@imwhocodes
Copy link
Author

I assume this is at least somewhat related to this discussion? #3411

It could be, I'm going to check later

Can you construct a unit test that fails with develop and succeeds after your PR? (You should probably add a unit test anyway.)

Yes, I was thinking to add some unit-test too after
Should I create another fork of develop, add the test (that will fail), push, then add on that brach the fixes or the fixes should need to stay on this branch/PR?

Thanks

@falbrechtskirchinger
Copy link
Contributor

falbrechtskirchinger commented Apr 14, 2022

I assume this is at least somewhat related to this discussion? #3411

It could be, I'm going to check later

Can you construct a unit test that fails with develop and succeeds after your PR? (You should probably add a unit test anyway.)

Yes, I was thinking to add some unit-test too after Should I create another fork of develop, add the test (that will fail), push, then add on that brach the fixes or the fixes should need to stay on this branch/PR?

For small things like this you can always use Compiler Explorer.

A starting point: https://godbolt.org/z/c83xMcc77

Note that you can #include http(s) URLs, i.e., use #include <https://raw.githubusercontent.com/nlohmann/json/develop/single_include/nlohmann/json.hpp> for json.hpp from the develop branch.

Edit: To clarify: Add your test here. This is more for you to be sure your test actually demonstrates how your changes fixed something that would previously fail.

IteratorType end;
mutable IteratorType current;
const IteratorType end;
mutable bool current_has_been_consumed;
Copy link
Contributor

@falbrechtskirchinger falbrechtskirchinger Apr 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/__w/json/json/include/nlohmann/detail/input/input_adapters.hpp:163:18: error: use default member initializer for 'current_has_been_consumed' [modernize-use-default-member-init,-warnings-as-errors]
    mutable bool current_has_been_consumed;
                 ^
                                          {false}

clang-tidy seems to insist on this:
mutable bool current_has_been_consumed{false};

(Didn't know/remember that.)

CI should turn green after these two changes.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use

mutable bool current_has_been_consumed = false;

This library has a troubled relationship to braced initializers... :-)

{
current_has_been_consumed = false;
return std::char_traits<char_type>::eof();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And clang-tidy doesn't like else after return. It wants this instead:

if (JSON_HEDLEY_LIKELY(current != end))
{
    current_has_been_consumed = true;
    return std::char_traits<char_type>::to_int_type(*current);
}

current_has_been_consumed = false;
return std::char_traits<char_type>::eof();

@imwhocodes
Copy link
Author

imwhocodes commented Apr 14, 2022

@falbrechtskirchinger
Here is a link to Compiler Explorer with an example of failure case

Edit: extended test

@falbrechtskirchinger
Copy link
Contributor

@falbrechtskirchinger Here is a link to Compiler Explorer with an example of failure case

Thanks! That demonstrates the issue clearly enough.

@imwhocodes
Copy link
Author

@falbrechtskirchinger @nlohmann should be now strict exposed as parameter (defaulted to true) by static basic_json parse( ..... , const bool strict = true) ?

@nlohmann
Copy link
Owner

@falbrechtskirchinger @nlohmann should be now strict exposed as parameter (defaulted to true) by static basic_json parse( ..... , const bool strict = true) ?

Does this allow to finally parse JSON lines (https://json.nlohmann.me/features/parsing/json_lines/)?

@imwhocodes
Copy link
Author

imwhocodes commented Apr 14, 2022

Does this allow to finally parse JSON lines (https://json.nlohmann.me/features/parsing/json_lines/)?

Sorry can you expand on that? what is not working about json_lines?

If you are talking about something like this there the problem is multiple fold and is not directly corrected by this PR

The problem with friend std::istream& operator>>(std::istream& i, basic_json& j) are that:

  1. new line \n if present is not consumed at the end of the parsing of json object
  2. parsing of an empty string fail

If you want I can push a PR to address that problem too

[edit]
I prepared a fix, but is going to break unit-regression1.cpp, NDJson behaviour is no compatible with this tests

[edit2]
Proposed solution here

@nlohmann
Copy link
Owner

nlohmann commented May 1, 2022

@imwhocodes Can you please check the comments?

@imwhocodes
Copy link
Author

Just came back from holiday, gonna address in this days

@nlohmann
Copy link
Owner

@imwhocodes Is there anything I can do to support this?

@gregmarr
Copy link
Contributor

@imwhocodes Do you have a test case that demonstrates this error?

@falbrechtskirchinger
Copy link
Contributor

@imwhocodes Do you have a test case that demonstrates this error?

See quote below. I can share my test case which fails to reproduce the issue later.

@falbrechtskirchinger Here is a link to Compiler Explorer with an example of failure case

Edit: extended test

@gregmarr
Copy link
Contributor

@imwhocodes Do you have a test case that demonstrates this error?

See quote below. I can share my test case which fails to reproduce the issue later.

@falbrechtskirchinger Here is a link to Compiler Explorer with an example of failure case
Edit: extended test

Thanks. I thought that since you couldn't reproduce it in a test case, we didn't have a working repro yet.

@imwhocodes
Copy link
Author

Sorry everybody for the delay,
I added simple test case for the fix itself: is almost the same as the one used for the "Compiler Explorer" examples

Thanks for the patience

@falbrechtskirchinger
Copy link
Contributor

falbrechtskirchinger commented Jun 20, 2022

@imwhocodes I could not independently write a test that would trigger your issue. Guess I should make that available so we can compare and figure out who's right.

Edit: To be crystal clear: I don't believe there's an issue to fix.

@imwhocodes
Copy link
Author

@imwhocodes I could not independently write a test that would trigger your issue. Guess I should make that available so we can compare and figure out who's right.

Edit: To be crystal clear: I don't believe there's an issue to fix.

Can you show me your test case?

The situation where the current (not fixed) implementation fail is on an input iterator where the action of reading actually remove from the underlaying buffer, ex:

  • recv on socket
  • Arduino Stream
  • generic input stream

The current (not fixed) implementation of json::details::iterator_input_adapter actually consume from the underlaying buffer one char ahead of the one returned to the upper layer, ending up already consuming one char past the end of the json on when parsing stop

On the link you can see this happening, you can change #define USE_FIX false on the top to see the difference

Here for any question,
Luca

@falbrechtskirchinger
Copy link
Contributor

@imwhocodes I could not independently write a test that would trigger your issue. Guess I should make that available so we can compare and figure out who's right.
Edit: To be crystal clear: I don't believe there's an issue to fix.

Can you show me your test case?

The situation where the current (not fixed) implementation fail is on an input iterator where the action of reading actually remove from the underlaying buffer, ex:

Ok, I understand the problem now. It's about advancing the iterator vs. dereferencing it. You can look at my test case in PR #3531. Either way, your fix breaks my use case.

I'll need to look at it a bit closer when I have some spare time, meaning not right now.

@imwhocodes
Copy link
Author

imwhocodes commented Jun 20, 2022

@imwhocodes I could not independently write a test that would trigger your issue. Guess I should make that available so we can compare and figure out who's right.
Edit: To be crystal clear: I don't believe there's an issue to fix.

Can you show me your test case?
The situation where the current (not fixed) implementation fail is on an input iterator where the action of reading actually remove from the underlaying buffer, ex:

Ok, I understand the problem now. It's about advancing the iterator vs. dereferencing it. You can look at my test case in PR #3531. Either way, your fix breaks my use case.

I'll need to look at it a bit closer when I have some spare time, meaning not right now.

The logic problem with your test is that you are de-referencing a pointer to the last read char, and given the fact that you passing strict = true to the sax_parser it is correct that the pointer/iterator point to the last char parsed.

When you deference you iterator again to check where you are you are consuming the same char again (the last parsed by the parser)

Here lay the problem, on your test iterator end up were it actually should be, if it would end up one char after the parameter strict = true would be simply broken

@imwhocodes
Copy link
Author

imwhocodes commented Jun 20, 2022

@falbrechtskirchinger sorry but this example that you give me is broken too (undefined behaviour)

You are getting the value when you deference the iterator pointer, but the pointer could be already the iterator::end
So technically UB by deferencing end iterator

Here is a failing test for your example, I had to move from std::string to std::vector otherwise there would be the hidden null-terminator of string masking the bug, as you can see the program crashes

Edit: Also with your proposed solution even strict = true with correct json input is broken

@falbrechtskirchinger
Copy link
Contributor

@falbrechtskirchinger sorry but this example that you give me is broken too (undefined behaviour)

Well, this is your code. :-)
I assumed you knew that your operator!= was broken. There's more work needed, but I'm confident it's possible.

@falbrechtskirchinger
Copy link
Contributor

I've taken a closer look and even though I'd recommend changes to both operator!= and operator++, the code works just fine with correct JSON input. Did you remove those last few characters on purpose?

...,'u','e','_','B' };

After completing the input as follows the assertion succeedes.

...,'u','e','_','B','"','}' };

@falbrechtskirchinger
Copy link
Contributor

And here's the updated code with some improvements to the operators:
https://godbolt.org/z/Ee1Pr3Gse

Possibly still not 100% correct but at least closer.

@imwhocodes
Copy link
Author

imwhocodes commented Jun 20, 2022

@falbrechtskirchinger sorry but this example that you give me is broken too (undefined behaviour)

Well, this is your code. :-) I assumed you knew that your operator!= was broken. There's more work needed, but I'm confident it's possible.

I'm honestly thinking that you are trolling me,
the broken code is taken from your message, with your proposed solution, this message exactly:

Your code can easily be modified to work correctly with the current iterator_input_adapter implementation: https://godbolt.org/z/zvsxvPPs5

Code example where you moved the ConsumeChar logic from EmulateStream & operator++() to reference operator*() const

    reference operator*() const {
        if(!consumed) {
            c = ConsumeChar(target);
            consumed = true;
        }
        return c;
    }

And you proposed it as solution, solution where end iterator is dereferenced

As you can see on my initial test case (here edited to be more clear) the operator!= is not broken and fully working

@imwhocodes
Copy link
Author

I've taken a closer look and even though I'd recommend changes to both operator!= and operator++, the code works just fine with correct JSON input. Did you remove those last few characters on purpose?

...,'u','e','_','B' };

After completing the input as follows the assertion succeedes.

...,'u','e','_','B','"','}' };

Yes I removed on purpose to show you that on incomplete input your solution is UB

About that:

And here's the updated code with some improvements to the operators: https://godbolt.org/z/Ee1Pr3Gse

Possibly still not 100% correct but at least closer.

This is not valid solution, on a STREAM-like input you don't have empty() or size() function, so you CANNOT know ahead how much data you have left, it is a stream, a flow like of information, not a container

@falbrechtskirchinger
Copy link
Contributor

@falbrechtskirchinger sorry but this example that you give me is broken too (undefined behaviour)

Well, this is your code. :-) I assumed you knew that your operator!= was broken. There's more work needed, but I'm confident it's possible.

I'm honestly thinking that you are trolling me, the broken code is taken from your message, with your proposed solution, this message exactly:

Your code can easily be modified to work correctly with the current iterator_input_adapter implementation: https://godbolt.org/z/zvsxvPPs5

I modified the pertinent bits of your code to make the unaltered example work. operator!= is broken in the sense that I can construct two instances of EmulateStream that return an unexpected result for operator!=. You can call that UB if you want.

Code example where you moved the ConsumeChar logic from EmulateStream & operator++() to reference operator*() const

    reference operator*() const {
        if(!consumed) {
            c = ConsumeChar(target);
            consumed = true;
        }
        return c;
    }

And you proposed as solution, solution where end iterator is dereferenced
I still don't see that happening, sorry. iterator_input_adapter does not dereference the end iterator (EmulateStream).

As you can see on my initial test case (here edited to be more clear) the operator!= is not broken and fully working

@falbrechtskirchinger
Copy link
Contributor

Yes I removed on purpose to show you that on incomplete input your solution is UB

How? The program is crashing on your assertion, not because the end iterator is dereferenced?

This is not valid solution, on a STREAM-like input you don't have empty() or size() function, so you CANNOT know ahead how much data you have left, it is a stream, a flow like of information, not a container

Well, the iterator_input_adapter needs to know, if it's reached the end…

@imwhocodes
Copy link
Author

imwhocodes commented Jun 20, 2022

@falbrechtskirchinger ok lets start fresh:

This is your proposed solution: to obtaining the char value during reference operator*() const but as you can see here this is broken because you can dereference a ::end pointer, to make it clear I changed str->front() (UB) with str->at(0) (Throw)

This is mine original solution: to obtaining the char value during EmulateStream & operator++() and then returning it on reference operator*() const, as you can see here, like this you can learn if it underlying stream is finished without the need of non-existent function like empty() or size(), in the example you can see a try/catch to manage/understand the end of stream ( this is an example, it can be a socket returning -1 or EOF on recv or whatever is the underlaying stream technique of telling you it is finished )

Said that my solution can work only if the upper layer (the parser) don't try to read one char more than needed, as you can see if on the top line you change #define USE_FIX true this will use my repo, with false current master, and you will notice on the terminal that more char are parsed than what they should if using current master

  • This happen because in current master input_adapter is advanced before actual consuming of the data, so you end up to it pointing one char before the last char read.
  • If the underlying data is a STREAM which is NOT a container (so you don't either have .size(), .empty() or dereference operator) advancing the iterator mean consuming one char from the stream itself.
    If you read the last 2 sentences you can understand where that one char is lost

@falbrechtskirchinger
Copy link
Contributor

@falbrechtskirchinger ok lets start fresh:

This is your proposed solution: to obtaining the char value during reference operator*() const but as you can see here this is broken because you can dereference a ::end pointer, to make it clear I changed str->front() (UB) with str->at(0) (Throw)

So, I finally reproduced the dereference you were talking about. You're not using the last code I posted, which includes tweaked operator== and operator!= which does avoid the dereference of the end iterator.

This is why we were talking past one another. We were using different code.

Here's the quote with the update:

And here's the updated code with some improvements to the operators: https://godbolt.org/z/Ee1Pr3Gse

Possibly still not 100% correct but at least closer.

@imwhocodes
Copy link
Author

imwhocodes commented Jun 20, 2022

@falbrechtskirchinger ok lets start fresh:
This is your proposed solution: to obtaining the char value during reference operator*() const but as you can see here this is broken because you can dereference a ::end pointer, to make it clear I changed str->front() (UB) with str->at(0) (Throw)

So, I finally reproduced the dereference you were talking about. You're not using the last code I posted, which includes tweaked operator== and operator!= which does avoid the dereference of the end iterator.

This is why we were talking past one another. We were using different code.

Here's the quote with the update:

And here's the updated code with some improvements to the operators: https://godbolt.org/z/Ee1Pr3Gse
Possibly still not 100% correct but at least closer.

Yes i have seen the code at https://godbolt.org/z/Ee1Pr3Gse and this is why I'm telling you that it is not valid, if we want this to work with a stream (even simply std::iostream) we can NOT use .empty() or .size() member function because they do NOT exist on a stream like input, is it not a container it is a stream, it can me either open or closed, it has no size

Please read carefully this message, everything is explained

@falbrechtskirchinger
Copy link
Contributor

I've read your message.

You're trying to shoehorn a stream into the iterator interface. If you insist on parsing JSON directly from a stream, it's on you to find a way to make it work. My point is that the library code is working as intended and your "fix" breaks other valid code. I'm sorry I didn't catch this right at the beginning.

If you want help with your iterator code for a specific stream API (beyond the contrived examples using std::string and std::vector), open a discussion. I'm happy to help whenever I have time. There's always some mechanism to tell if a stream is at the end. As a last resort, you can try reading from it, but there's probably a better way.

Let me address one more point of your argument directly:

(...) you will notice on the terminal that more char are parsed than what they should if using current master

  • This happen because in current master input_adapter is advanced before actual consuming of the data, so you end up to it pointing one char before the last char read.

Advancing the iterator to the next character should never be a problem. I'll quote the standard to you again: "Iterators are generalized pointers". They are supposed to follow similar semantics.
Because your implementation reads on the advance, you'll inevitably run into problems. You're violating the expected semantics and complain that the result isn't what you want.

@nlohmann @gregmarr We've reached a stalemate and need someone else to chime in.

@imwhocodes
Copy link
Author

I've read your message.

You're trying to shoehorn a stream into the iterator interface. If you insist on parsing JSON directly from a stream, it's on you to find a way to make it work. My point is that the library code is working as intended and your "fix" breaks other valid code. I'm sorry I didn't catch this right at the beginning.

If you want help with your iterator code for a specific stream API (beyond the contrived examples using std::string and std::vector), open a discussion. I'm happy to help whenever I have time. There's always some mechanism to tell if a stream is at the end. As a last resort, you can try reading from it, but there's probably a better way.

Let me address one more point of your argument directly:

(...) you will notice on the terminal that more char are parsed than what they should if using current master

  • This happen because in current master input_adapter is advanced before actual consuming of the data, so you end up to it pointing one char before the last char read.

Advancing the iterator to the next character should never be a problem. I'll quote the standard to you again: "Iterators are generalized pointers". They are supposed to follow similar semantics. Because your implementation reads on the advance, you'll inevitably run into problems. You're violating the expected semantics and complain that the result isn't what you want.

@nlohmann @gregmarr We've reached a stalemate and need someone else to chime in.

The real scenario problem it is not about knowing if the stream is closed, but it is that on the current input_adapter implementation one char is always read ahead, this mean that if you receive one json on a socket you need the first char of the second to finish parsing the first, and the second one will fail missing it's own first char
IMHO reading on std::advance is the best way, for sure not during deferencing on the custom adapter, I don't see any problem about it, if it is valid you keep running otherwise you return ::end and the parser will manage it

About breaking the API I don't think it is true, all ctest pass correctly and final behaviour is the same for all common function

My idea is it that it is better to have the input_adapter pointing to the last consumed iterator/value so leaving the door open to any stream implementation (and in your proxy_adapter simply do a ++ to get from where start parsing again or whatever) rather than closing it away, at least like this you have both option at your disposal

Giving some context: personally I started using nlohmann::json for the very powerfull combo of the sax_parser with a stream as input, giving me the ability to parse json and msgpack that are bigger than the available ram with a custom sax_parser keeping only the data which I'm actually interested (and also without any heap allocations, mostly on MCUs)
This is way i keep pushing on it, it is a real (and mostly hidden) killer feature of this library

Anyway let see what @nlohmann @gregmarr think,
Luca

@falbrechtskirchinger
Copy link
Contributor

falbrechtskirchinger commented Jun 21, 2022

The real scenario problem it is not about knowing if the stream is closed, but it is that on the current input_adapter implementation one char is always read ahead, (...)

You can avoid that with strict = false. See https://godbolt.org/z/Ee1Pr3Gse.

About breaking the API I don't think it is true, all ctest pass correctly and final behaviour is the same for all common function

It's a breaking change and I've demonstrated why. Passing the unit tests does not invalidate that assertion.

My idea is it that it is better to have the input_adapter pointing to the last consumed iterator/value so leaving the door open to any stream implementation (and in your proxy_adapter simply do a ++ to get from where start parsing again or whatever) rather than closing it away, at least like this you have both option at your disposal

You're breaking expected iterator semantics. The past-the-end iterator points past the end because otherwise, we'd have no way to represent an empty range. You're suggestion to "simply do a ++" would not be sound because of this.

Giving some context: personally I started using nlohmann::json for the very powerfull combo of the sax_parser with a stream as input, giving me the ability to parse json and msgpack that are bigger than the available ram with a custom sax_parser keeping only the data which I'm actually interested (and also without any heap allocations, mostly on MCUs) This is way i keep pushing on it, it is a real (and mostly hidden) killer feature of this library

Nothing prevents you from doing that. You can implement your own iterator to behave as you describe. Just don't break iterator semantics for everyone else because it's convenient for one use case out of many.
Any C++ developer, understanding common iterator semantics, can implement your behavior on top of the current implementation without special knowledge about the implementation. Your "fix" changes semantics in a way where that'd no longer be true, as the past-the-end iterator no longer points past the end, but at the end (which is impossible in case of an empty range, creating another problem).

In summary: C++ developers understand iterators, they expect them to follow established rules. The concept of the past-the-end iterator is fundamental, don't mess with that. What you want can be done without modifying the library code.
At most, we should improve documentation around the subject.

I rest my case.

@imwhocodes
Copy link
Author

The real scenario problem it is not about knowing if the stream is closed, but it is that on the current input_adapter implementation one char is always read ahead, (...)

You can avoid that with strict = false. See https://godbolt.org/z/Ee1Pr3Gse.

No, this don't fix the problem, it is not about strict, it is about the parsing before the strict check

About breaking the API I don't think it is true, all ctest pass correctly and final behaviour is the same for all common function

It's a breaking change and I've demonstrated why. Passing the unit tests does not invalidate that assertion.

My idea is it that it is better to have the input_adapter pointing to the last consumed iterator/value so leaving the door open to any stream implementation (and in your proxy_adapter simply do a ++ to get from where start parsing again or whatever) rather than closing it away, at least like this you have both option at your disposal

You're breaking expected iterator semantics. The past-the-end iterator points past the end because otherwise, we'd have no way to represent an empty range. You're suggestion to "simply do a ++" would not be sound because of this.

Never said that i want to break the past-the-end iterator, simply that value pointed by iterator when finished parsing should be last parsed char and not the one after it

Giving some context: personally I started using nlohmann::json for the very powerfull combo of the sax_parser with a stream as input, giving me the ability to parse json and msgpack that are bigger than the available ram with a custom sax_parser keeping only the data which I'm actually interested (and also without any heap allocations, mostly on MCUs) This is way i keep pushing on it, it is a real (and mostly hidden) killer feature of this library

Nothing prevents you from doing that. You can implement your own iterator to behave as you describe. Just don't break iterator semantics for everyone else because it's convenient for one use case out of many. Any C++ developer, understanding common iterator semantics, can implement your behavior on top of the current implementation without special knowledge about the implementation. Your "fix" changes semantics in a way where that'd no longer be true, as the past-the-end iterator no longer points past the end, but at the end (which is impossible in case of an empty range, creating another problem).

Yes, the fact that the upper layer is advancing the iterator one position ahead of the one actually used

In summary: C++ developers understand iterators, they expect them to follow established rules. The concept of the past-the-end iterator is fundamental, don't mess with that. What you want can be done without modifying the library code. At most, we should improve documentation around the subject.

I rest my case.

I reiterate, it is not about breaking iterator, like this library itself say I want to stick to this specification: https://en.cppreference.com/w/cpp/named_req/InputIterator

Explaining as a simple loop where bool parse(char) return false when the last } in json is encountered:

  • Current implementation is:
auto it = data.begin();
while( it != data.end() ){
    auto value = *it;
    it++;
    if( ! parse(value) ){
        break;
    }
}
//Now iterator point to one item after the last actually used
  • Proposed implementation is:
auto it = data.begin();
while( it != data.end() ){
    auto value = *it;
    if( ! parse(value) ){
        break;
    }
    it++;
}
//Now iterator point to the last item that was actually used

Hope that this is clear of where the problem lays

@gregmarr
Copy link
Contributor

gregmarr commented Jun 23, 2022

I've looked at the discussion, and thought a lot about this, and done some researching, and I'm still not sure if it's possible to "fix" this use case at this level. I looked at https://en.cppreference.com/w/cpp/iterator/istream_iterator and it seems like it should be possible for the stream iterator itself, rather than the json adapter, to handle this. However, the EmulateStream in the example seems to do exactly what I expect the stream iterator does.

I don't know how to reconcile the "leaving the iterator pointing at the last thing read" with "the iterator needs to be pointing to the next thing to read in order to start parsing".

On the other hand, there is no visible change as far as the caller of iterator_input_adapter::get_character() is concerned, and that's the only API that class has.

There are two callers of that function. One has unget() support, and the other doesn't. Could the real problem be the lack of unget() support in binary_reader.hpp?

That's JSON vs CBOR, not regular parse vs sax parse, got the two confused, never mind.

@gregmarr
Copy link
Contributor

@falbrechtskirchinger

Because your implementation reads on the advance, you'll inevitably run into problems. You're violating the expected semantics and complain that the result isn't what you want.

That's what std::istream_iterator does too.

https://en.cppreference.com/w/cpp/iterator/istream_iterator

std::istream_iterator is a single-pass input iterator that reads successive objects of type T from the std::basic_istream object for which it was constructed, by calling the appropriate operator>>. The actual read operation is performed when the iterator is incremented, not when it is dereferenced. The first object is read when the iterator is constructed. Dereferencing only returns a copy of the most recently read object.

@gregmarr
Copy link
Contributor

gregmarr commented Jun 23, 2022

The real scenario problem it is not about knowing if the stream is closed, but it is that on the current input_adapter implementation one char is always read ahead, this mean that if you receive one json on a socket you need the first char of the second to finish parsing the first, and the second one will fail missing it's own first char

The unget() support I mention would help with the second one will fail missing it's own first char but not you need the first char of the second to finish parsing the first if the read blocks waiting for the next char.

Having a newline or other whitespace after the last char of the first would actually fix both issues, but that doesn't help if you can't change the source.

@falbrechtskirchinger
Copy link
Contributor

@falbrechtskirchinger

Because your implementation reads on the advance, you'll inevitably run into problems. You're violating the expected semantics and complain that the result isn't what you want.

That's what std::istream_iterator does too.

The difference being that std::istream_iterator is designed to be an adapter between streams and iterators. iterator_input_adapter is an adapter between the parser interface (i.e., get_character()) and iterators (i.e., NOT streams).

If you want a stream_like_input_adapter, build one, and let's add that. I've no objections to a solution that doesn't break iterator_input_adapter's current behavior.

@gregmarr
Copy link
Contributor

@falbrechtskirchinger I just went looking through the discussions in this PR for it and couldn't find it. Can you remind me again which case is broken with this change?

@falbrechtskirchinger
Copy link
Contributor

@falbrechtskirchinger I just went looking through the discussions in this PR for it and couldn't find it. Can you remind me again which case is broken with this change?

See #3548.

@gregmarr
Copy link
Contributor

Thanks, that makes sense. I hadn't been considering the underlying iterator surviving beyond the adapter.

@falbrechtskirchinger
Copy link
Contributor

@falbrechtskirchinger

Because your implementation reads on the advance, you'll inevitably run into problems. You're violating the expected semantics and complain that the result isn't what you want.

That's what std::istream_iterator does too.

Another thought: Combining an EmulateStream that reads on advance with the proxy_iterator from #3548 should also work. The problem is that since normally you lose access to the iterator, the character that has already been read from the underlying stream is lost.

Once again, there are several (external) solutions to this problem. Ideally, we should work towards making input_adapter part of the public API and allow users to pass an input_adapter to parse() directly.

@imwhocodes
Copy link
Author

Is some way I cannot understand why we should preserve an internal behaviour that is not defined by the API, on the other side is possible that people is relaying on that hidden/implicit behaviour

What I'm proposing now is to split input_adapter like this, with the use of some SFINAE and/or std::iterator_traits:

  • Keep current implementation of input_adapter but mark it as deprecated
  • If user_iterator is derived from std::forward_iterator use current implementation
  • If user_iterator is derived from std::istream_iterator use proposed implementation

What do you think about it? @gregmarr @falbrechtskirchinger

@gregmarr
Copy link
Contributor

Combining an EmulateStream that reads on advance with the proxy_iterator from #3548 should also work. The problem is that since normally you lose access to the iterator, the character that has already been read from the underlying stream is lost.

I took a stringstream, created an istream_iterator on it, wrapped it in minor modification to the proxy_iterator that takes the iterator type instead of the container type, and that works. If sax_parse returned the iterator at the end, that would eliminate the need for the proxy_iterator. Since it doesn't, you need the disposable wrapper so you can keep the istream_iterator that potentially contains the next character around to use it for the next sax_parse.

https://godbolt.org/z/M5cYfbcbW

@imwhocodes
Copy link
Author

Combining an EmulateStream that reads on advance with the proxy_iterator from #3548 should also work. The problem is that since normally you lose access to the iterator, the character that has already been read from the underlying stream is lost.

I took a stringstream, created an istream_iterator on it, wrapped it in minor modification to the proxy_iterator that takes the iterator type instead of the container type, and that works. If sax_parse returned the iterator at the end, that would eliminate the need for the proxy_iterator. Since it doesn't, you need the disposable wrapper so you can keep the istream_iterator that potentially contains the next character around to use it for the next sax_parse.

https://godbolt.org/z/M5cYfbcbW

Got it, but now you need to always keep the istream_iterator around in scope, otherwise when it get destructed you lost a char

See https://godbolt.org/z/1f9sGW6cE (btw funny that losing one char make the json still valid but with a different meaning)

Still more convinced that the previously proposed could be a good solution managing all scenario:

Is some way I cannot understand why we should preserve an internal behaviour that is not defined by the API, on the other side is possible that people is relaying on that hidden/implicit behaviour

What I'm proposing now is to split input_adapter like this, with the use of some SFINAE and/or std::iterator_traits:

  • Keep current implementation of input_adapter but mark it as deprecated
  • If user_iterator is derived from std::forward_iterator use current implementation
  • If user_iterator is derived from std::istream_iterator use proposed implementation

What do you think about it? @gregmarr @falbrechtskirchinger

@nlohmann
Copy link
Owner

How should we proceed here?

@falbrechtskirchinger
Copy link
Contributor

As is, this PR would conflict with #3548. Unless @imwhocodes wants to try another solution, this can be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants