Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Behavior of operator>> should more closely resemble that of built-in overloads. #367

Closed
TurpentineDistillery opened this issue Nov 23, 2016 · 52 comments
Assignees
Milestone

Comments

@TurpentineDistillery
Copy link

Basically:

  • eat up leading blanks or newlines
  • leave the stream where the parsing ended, allowing for more data downstream, which may or may not be json.
#include "json.hpp"
    
template<typename T>
static void test(const std::string& s)
{
    T val{999};
    std::stringstream sstr(s);
    //sstr.exceptions(std::ifstream::failbit | std::ifstream::badbit);

    try {
        size_t i = 0;
        while(sstr >> val) {
            std::cerr << "    result[" << i++ << "]=" << val 
                      << "; streampos=" << sstr.tellg() << std::endl;
        }   

        std::cerr << "    final values: result=" << val 
                  << " i=" << i
                  << " streampos=" << sstr.tellg()
                  << " streamflags=" << sstr.flags() 
                  << std::endl;
    } catch(std::exception& e) {
        std::cerr << "    std::exception:" << e.what() << std::endl;
    }   

    std::cerr << std::endl;
}

int main()
{
    using json = nlohmann::json;
    size_t i = 0;

    for(std::string s : { 
        "", 
        "   ",
        "111",
        "222 \t\n",
        "\n\t 333", 
        " 111 \n222\n \n  333" })
    {   
        std::cerr << "\nTest-case[" << i++ << "]: '" << s << "'" << std::endl;
    
        std::cerr << "With int:" << std::endl;
        test<int>(s);

        std::cerr << "With json:" << std::endl;
        test<json>(s);

        std::cerr << "----" << std::endl;
    }   

    return 0;
}
Test-case[0]: ''
With int:
    final values: result=999 i=0 streampos=-1 streamflags=4098

With json:
    std::exception:parse error - unexpected end of input

----

Test-case[1]: '   '
With int:
    final values: result=999 i=0 streampos=-1 streamflags=4098

With json:
    std::exception:parse error - unexpected end of input

----

Test-case[2]: '111'
With int:
    result[0]=111; streampos=-1
    final values: result=111 i=1 streampos=-1 streamflags=4098

With json:
    result[0]=111; streampos=-1
    std::exception:basic_string::append

----

Test-case[3]: '222 	
'
With int:
    result[0]=222; streampos=3
    final values: result=222 i=1 streampos=-1 streamflags=4098

With json:
    final values: result=222 i=0 streampos=-1 streamflags=4098

----

Test-case[4]: '
	 333'
With int:
    result[0]=333; streampos=-1
    final values: result=333 i=1 streampos=-1 streamflags=4098

With json:
    result[0]=333; streampos=-1
    std::exception:basic_string::append

----

Test-case[5]: ' 111 
222
 
  333'
With int:
    result[0]=111; streampos=4
    result[1]=222; streampos=9
    result[2]=333; streampos=-1
    final values: result=333 i=3 streampos=-1 streamflags=4098

With json:
    std::exception:parse error - unexpected number literal; expected end of input

----
@nlohmann
Copy link
Owner

The lines std::exception:basic_string::append revealed a bug in the fill_line_buffer function where

m_line_buffer.append(n - 1, '\x01');

was executed with n = 0. This must be fixed.

About the inputs:

  1. "" (empty input): An empty input is not a valid JSON document. Hence, I think the message parse error - unexpected end of input is adequate.
  2. " " (white space): After ignoring whitespace, this is again empty input, cf. input 1.
  3. "111" (number): Parsing this input yields a number. Parsing the (then empty) input again, is the same as input 1.
  4. "222 \t\n" (number with trailing whitespace): The input is parsed as a number while reading the trailing whitespace. Parsing the (then empty) input again, is the same as input 1.
  5. "\n\t 333" (number with leading whitespace): Same as input 4.
  6. " 111 \n222\n \n 333" (number with leading whitespace, followed by additional numbers): A number must not be followed by a number, so the message parse error - unexpected number literal; expected end of input makes sense to me (note that after merging Exception line #301 this message will contain the byte offset of the error).

I can understand your thoughts about the behavior of the streams. However, I think it is not a good idea to stop parsing once we found a JSON value, because this would mean that code like

json j;
j << "false foo bar";

would run without exception.

Am I too strict about this?

@nlohmann nlohmann added the state: please discuss please discuss the issue or vote for your favorite option label Nov 24, 2016
@TurpentineDistillery
Copy link
Author

An empty input is not a valid JSON document.

An empty input is not a valid integer either, yet cin>>i will not throw - it will consume the blanks and leave i unchanged.

There are two classes of use-cases:

  • Parse a singular json document from a string (e.g. using parse(...) method or a free function). Here the library should verify that the input is not empty and that there's nothing past the end.
  • Parse next json document from a stream (operator>>). Here the code cannot assume that the stream must contain a singleton json document and nothing else. Maybe there's some binary data that follows - it's not our business to know.

Think of a what kind of code a new user of the library would write, assuming they have not read any documentation other some examples, when they try to read a bunch of json values from stdin in a loop. Now make the API work that way, following the "design principle of least astonishment" : )

@nlohmann
Copy link
Owner

I understand your point and you answered my "Am I too strict about this?" question with "yes" 😄

It would be nice to hear other opinions about this. I still feel uncomfortable accepting non-valid JSON files by just looking at a prefix...

@gregmarr
Copy link
Contributor

Seems reasonable for streaming to stop after a JSON document has been read. If this is a behavior change, it might require a major version bump.

Note that both j << cin; and cin >> j; call the same code.

I agree that this seems weird, but I'm not sure it's valid right now, since there is no operator<< that takes a string:

json j;
j << "false foo bar";

@TurpentineDistillery
Copy link
Author

Note for the future:
Suppose input stream contains arbitrarily large, or infinite number of json records (e.g. sensor data being piped from some upstream process), which happen to not be newline-delimited. calling getline() on such stream will fail.

@ghost
Copy link

ghost commented Dec 21, 2016

Allowing to partially parse a stream would be very useful. In fact, that's what I'm struggling with right now. I get input from a WebSocket, and it's basically a stream of non-delimited JSON objects, e.g. {...}{...}. If parsing stopped once an object is found, this code would be valid:

json msg;
while (iss >> msg) {
    // ...
}

Right now I can't see any way to parse this kind of stream.

@nlohmann nlohmann added this to the Release 3.0.0 milestone Dec 23, 2016
@mrkgnao
Copy link

mrkgnao commented Jan 13, 2017

+1, this is basically the only thing I wish for here. Is there any way to simulate this behavior with the library as it stands now?

@nlohmann
Copy link
Owner

Unfortunately not.

@gregmarr
Copy link
Contributor

I saw someone do it in a way that required a two-stage parse. They parsed the data once, got the position of the parse error, assumed that was the start of the next whole entry, and so parsed again limiting to just before the parse error, and then moving the start position to the error position for the next entry.

@ghost
Copy link

ghost commented Jan 13, 2017

That's exactly what I did some time ago. Try to parse, catch the exception, save the offset, limit stream to that many characters, parse, repeat. I think that required the feature-userexceptions branch and of course it's not terribly efficient.

@nlohmann
Copy link
Owner

This is currently the problem in the preparation of 3.0.0 - a lot of API-changing features depend on each other, and I want to try to avoid releasing a 4.0.0 and 5.0.0 in a short period.

@gregmarr
Copy link
Contributor

@krzysztofwos Yeah, I think it was you that I was thinking of, but I couldn't find the post on slack, I think it's beyond the 10K limit.

@mrkgnao
Copy link

mrkgnao commented Jan 13, 2017

I just resorted to signaling the end of a JSON message with \0. It'll work for now. :)

@TurpentineDistillery
Copy link
Author

If you're in control of both producing and consuming json records, another approach is to dump in single-sine mode, and, when consuming, first fetch the json document from stream with getline, and then parse.

@nlohmann
Copy link
Owner

The current parser relies on reading input streams with getline(). This makes it quite hard to parse several JSON values in a row, because in values like true false\n"hello", false is lost, because it was read with true, but never processed to a value. I think I need to use the underlying std::basic_streambuf.

@szikra
Copy link

szikra commented Jan 28, 2017

Will be support added for parsing multiple JSON objects from strings too?
(Which I was also asking for in #438)

@gregmarr
Copy link
Contributor

gregmarr commented Jan 28, 2017

@szikra That's exactly what this issue is about. operator>> uses parse(). Although I suppose it's not strictly part of what's needed to support this, as there would need to be some indication of how much of the string was used.
Alternatively, use this method to create an istream on top of your existing string/buffer: http://stackoverflow.com/questions/1448467/initializing-a-c-stdistringstream-from-an-in-memory-buffer/1449527

@szikra
Copy link

szikra commented Jan 29, 2017

@gregmarr Thanks.
The comment "This class is sort of depreciated or at least you are normally told to use other classes." is not very reassuring though. :)

I'm sure my stringstream<<string is not the most efficient way to create a stream from a buffer, it was just a test, but at least it's not as ugly as other suggestions on stackoverflow :) And most of them are "implementation defined".
stringstream<<char* is also an accepted answer http://stackoverflow.com/questions/5188140/c-const-char-to-stringstream

And yes, it would be enough if we had a public method that returns the length parse() read the last time.
Or maybe parse could return the processed length too.
Something like this:

class J{
public:
	struct C {
		J* j;
		int len;
		operator J&() {
			return *j;
		}
	};

	C parse() {
		return C {this, 123};
	}
};

RVO would help here to discard length if we are not using it, so it wouldn't have to be stored as last processed amount inside json, right?
The downside is, that this would break auto res=j.parse(); (J res=j.parse(); should continue to work.)

@gregmarr
Copy link
Contributor

Yeah, that first comment isn't terribly useful. The next one is better, or the one linked in a comment:
http://stackoverflow.com/questions/2079912/simpler-way-to-create-a-c-memorystream-from-char-size-t-without-copying-t
I think an optional pointer-to-something argument to parse() that receives a length (safer) or end pointer would be better than changing the return type.

@szikra
Copy link

szikra commented Jan 30, 2017

Yeah, I don't want to break existing code. I also hate returning data with pointer arguments, so ugly. :)
Adding an extended parse method should fix the 'auto' issue

class J{
public:
	struct C {
		J* j;
		int len;
		operator J&() {
			return *j;
		}
	};

	C parseWithLen() {
		return C {this, 123};
	}
	J& parse() {
		return parseWithLen();
	}
       
	/// But if people really want, this could also be added as an option:
	J& parse(size_t& len) {
		C c = parseWithLen();
		len = c.len;
		return *c.j;
	}
	J& parse(size_t* len) {
		C c = parseWithLen();
		if (len) *len = c.len;
		return *c.j;
	}

};

@TurpentineDistillery
Copy link
Author

There are some subtleties to this.
The partial-parse is unambiguous in case of objects, arrays, strings, and literals, but not for numbers. Suppose the input stream contains 000123.... The syntactically correct repeated parsing would yield a sequence of json-records: 0, 0, 0, 123, (...) since an integer can be 0, but can't have leading zeros. However, this would likely be semantically wrong. Proper behavior would be to parse "000123" as a group of digits, and then error-out saying it's not a valid json rather than yield a sequence of syntactically-correct/semantically-likely-wrong records.

@nlohmann
Copy link
Owner

@TurpentineDistillery With the fix for #452 the library would indeed reject 000123 as ill-formed number.

@TurpentineDistillery
Copy link
Author

@nlohmann
For partial stream parsing we need to guarantee that the parser/lexer shall not consume the stream past the end of json record (with caveats mentioned above). This does not bode well with the "black-box" re2c approach. How do you feel about dispensing with re2c-generated parser it and replacing with a hand-written lexer/parser, provided that it meets the coding standards of this library (i.e. correct, efficient, and understandable)? This would also make the library "actually" single-header : )

nlohmann added a commit that referenced this issue May 20, 2017
@nlohmann
Copy link
Owner

I did the restructuring I mentioned in #367 (comment). I did not find the time to test this issue again though.

@nlohmann
Copy link
Owner

The output for the example above (#367 (comment)) is now


Test-case[0]: ''
With int:
    final values: result=999 i=0 streampos=-1 streamflags=4098

With json:
    std::exception:[json.exception.parse_error.101] parse error at 1: syntax error - unexpected end of input

----

Test-case[1]: '   '
With int:
    final values: result=999 i=0 streampos=-1 streamflags=4098

With json:
    std::exception:[json.exception.parse_error.101] parse error at 4: syntax error - unexpected end of input

----

Test-case[2]: '111'
With int:
    result[0]=111; streampos=-1
    final values: result=111 i=1 streampos=-1 streamflags=4098

With json:
    result[0]=111; streampos=3
    std::exception:[json.exception.parse_error.101] parse error at 1: syntax error - unexpected end of input

----

Test-case[3]: '222 	
'
With int:
    result[0]=222; streampos=3
    final values: result=222 i=1 streampos=-1 streamflags=4098

With json:
    result[0]=222; streampos=3
    std::exception:[json.exception.parse_error.101] parse error at 4: syntax error - unexpected end of input

----

Test-case[4]: '
	 333'
With int:
    result[0]=333; streampos=-1
    final values: result=333 i=1 streampos=-1 streamflags=4098

With json:
    result[0]=333; streampos=6
    std::exception:[json.exception.parse_error.101] parse error at 1: syntax error - unexpected end of input

----

Test-case[5]: ' 111 
222
 
  333'
With int:
    result[0]=111; streampos=4
    result[1]=222; streampos=9
    result[2]=333; streampos=-1
    final values: result=333 i=3 streampos=-1 streamflags=4098

With json:
    result[0]=111; streampos=4
    result[1]=222; streampos=9
    result[2]=333; streampos=17
    std::exception:[json.exception.parse_error.101] parse error at 1: syntax error - unexpected end of input

----
Program ended with exit code: 0

This seems reasonable to me.

nlohmann added a commit that referenced this issue May 21, 2017
operator>> now works as expected.
@nlohmann
Copy link
Owner

There was still a bug left, but now the examples from @TurpentineDistillery and @ceztko behave as expected. I added them as regression tests, see 9e507df. If the CI build succeeds, I think I can close this issue.

@nlohmann nlohmann self-assigned this May 21, 2017
@nlohmann nlohmann removed the state: please discuss please discuss the issue or vote for your favorite option label May 21, 2017
@ceztko
Copy link

ceztko commented May 21, 2017

Hello, I just took the test case I posted here and executed it verbatim on VS2017. Please note that the test is saving the stream to a file and after reading from it. The result is this:

    result[0]={"one":1,"two":2}; streampos=36
    std::exception:[json.exception.parse_error.101] parse error at 2: syntax error - unexpected '}'

After the first read the stream is actually postioned correctly after the first "}", but it's failing to read the second element. If I create a stringstream with exactly the same content with stringstream stream(test);, I get this result instead:

    result[0]={"one":1,"two":2}; streampos=36
    result[1]={"three":3}; streampos=56

Which means it correctly parsed the second element, without exceptions. The stream is positioned correctly after the first "}" on the first read and on EOF on the second read. Behavior may be compiler/library dependent but I am expecting the same result to happen on the file stream. Question: are you parsing the stream character by character or are you doing some kind of caching?

@nlohmann
Copy link
Owner

@ceztko I also realized different behavior in MSVC as the tests on AppVeyor fail (https://ci.appveyor.com/project/nlohmann/json/build/1965). I haven't had the chance to check the logs though.

About caching: yes, I read 16 KB from an input stream into a cache. When the parser is destructed, I call seekg to the last actually processed character. I tested it with string streams and hoped this would also work with file streams.

@ceztko
Copy link

ceztko commented May 21, 2017

I read 16 KB from an input stream into a cache. When the parser is
destructed, I call seekg to the last actually processed character

It's perfectly reasonable to cache some amount of bytes in advance when reading from any stream. But I am not expecting the seekg to work reliably on any kind of stream (for example: if the stream is a socket). The idea should work on filestream, though, so there may be some bug.
If you want to support streams that can be read forward-only, I recommend you to expose the internal json::parser class in the public api so any amount of unparsed cached data can't be lost.

@nlohmann nlohmann added state: help needed the issue needs help to proceed and removed solution: proposed fix a fix for the issue has been proposed and waits for confirmation labels May 21, 2017
@nlohmann
Copy link
Owner

In may add special overloads for ifstream and istringstream that use caches and remove caching for general istream cases. Would this work?

@ceztko
Copy link

ceztko commented May 21, 2017

In may add special overloads for ifstream and istringstream that use caches
and remove caching for general istream cases. Would this work?

For me it's also a workable solution. You may also consider having the caching only on ifstream since istringstream it's already on memory.

@nlohmann
Copy link
Owner

@ceztko I openend a question at StackOverflow and found out the code I used to "rewind" the stream was not correct. I fixed it in a feature branch and the example from #367 (comment) is now working with MSVC 2015 and MSVC 2017 (see AppVeyor build).

@ceztko
Copy link

ceztko commented Jun 12, 2017

@nlohmann great news, thanks! Sorry for not having looked at this: I wanted to debug it but I couldn't find the time.

@nlohmann
Copy link
Owner

Merged fd4a0ec which fixes this issue. Thanks everybody for the patience.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants