Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect parse error with binary data in keys? #838

Closed
charlie-ht opened this issue Nov 20, 2017 · 10 comments
Closed

Incorrect parse error with binary data in keys? #838

charlie-ht opened this issue Nov 20, 2017 · 10 comments
Assignees
Milestone

Comments

@charlie-ht
Copy link

Bug Report

  • What is the issue you have?
    I get a parse error after deserializing a map<string, string>, the serialised message was created by same version of library. The key is binary data. The error message is not helpful to find what went wrong.

  • Please describe the steps to reproduce the issue. Can you provide a small but working code example?
    m_message below contains the data to parse, the library seems to have done some escaping of my data but a quick inspection of the data suggests it's valid.

m_message is {"g\u0007\r\"(G&߆":"KeyUsable","{\u0013(T[\u001e>":"KeyUsable","x R\u0000W}UF":"KeyUsable","u000e\u0012]=_":"KeyUsable"}
00000000 7b 22 67 5c 75 30 30 30 37 5c 72 d7 7e d6 5f 5c |{"g\u0007\r.~._\|
00000010 22 ae 28 47 26 df 86 f7 7f 22 3a 22 4b 65 79 55 |".(G&....":"KeyU|
00000020 73 61 62 6c 65 22 2c 22 7b 5c 75 30 30 31 33 28 |sable","{\u0013(|
00000030 eb 61 b5 54 e2 93 f7 5b 5c 75 30 30 31 65 3e 94 |.a.T...[\u001e>.|
00000040 cc 3b 22 3a 22 4b 65 79 55 73 61 62 6c 65 22 2c |.;":"KeyUsable",|
00000050 22 93 78 99 20 e8 d6 52 5c 75 30 30 30 30 98 57 |".x....R\u0000.W|
00000060 7d f8 f2 dd 55 46 22 3a 22 4b 65 79 55 73 61 62 |}...UF":"KeyUsab|
00000070 6c 65 22 2c 22 f0 92 9b 5c 75 30 30 30 65 5c 75 |le","...\u000e\u|
00000080 30 30 31 32 c8 5e 5d 95 fe 3d 5f da e6 fb a8 22 |0012.^]..=_...."|
00000090 3a 22 4b 65 79 55 73 61 62 6c 65 22 7d          |:"KeyUsable"}   |
terminate called after throwing an instance of 'std::invalid_argument'
  what():  parse error - unexpected '"'; expected string literal

I provide hexdump of message above the given error. That error isn't enough for me to understand what went wrong.

  • What is the expected behavior?
-{
"g\u0007\r\"(G&߆" : KeyUsable,
"{\u0013(T[\u001e>" : KeyUsable,
"x R\u0000W}UF" : KeyUsable,
"u000e\u0012]=_" : KeyUsable
}
  • And what is the actual behavior instead?
    A crash, please see above
  • Which compiler and operating system are you using? Is it a supported compiler?
    arm-buildroot-linux-gnueabihf-g++.br_real (Buildroot 2017.05-git-13055-g71f35cff8-dirty) 5.4.0
  • Did you use a released version of the library or the version from the develop branch?
    ersion 2.1.1
  • If you experience a compilation error: can you compile and run the unit tests?
    No compilation error
@nlohmann
Copy link
Owner

I cannot reproduce the error with the develop version. Code:

#include <iostream>
#include <fstream>
#include "json.hpp"

using json = nlohmann::json;

int main(int argc, char* argv[]) {
    std::ifstream f("data.json");
    json j;
    f >> j;
    std::cout << j << std::endl;
    std::string s = j.dump();
    std::cout << json::parse(s) << std::endl;
}

Output:

{"g\u0007\r\"(G&߆":"KeyUsable","u000e\u0012]=_":"KeyUsable","x R\u0000W}UF":"KeyUsable","{\u0013(T[\u001e>":"KeyUsable"}
{"g\u0007\r\"(G&߆":"KeyUsable","u000e\u0012]=_":"KeyUsable","x R\u0000W}UF":"KeyUsable","{\u0013(T[\u001e>":"KeyUsable"}

Can you please try the develop version?

@charlie-ht
Copy link
Author

charlie-ht commented Nov 21, 2017

I made a testcase with the develop version that will hopefully be more useful. The issue is an invalid UTF-8 sequence being serialized it seems. Here's the test case

#include <iostream>
#include <fstream>
#include "json.hpp"

using json = nlohmann::json;

int main(int argc, char* argv[]) {
    std::map<std::string, std::string> m;
    uint8_t key1[] = { 103, 92, 117, 48, 48, 48, 55, 92, 114, 215, 126, 214, 95, 92, 34, 174, 40, 71, 38, 174, 40, 71, 38, 223, 134, 247, 127 };
    std::string key1_str(key1, key1 + sizeof(key1)/sizeof(key1[0]));
    m[key1_str] = "KeyUsable"; 

    json jMap;
    for (const auto& pair : m)
      jMap[pair.first] = pair.second;

    std::ofstream outFile("data.json", std::ios::out);
    outFile << jMap;
    outFile.close();

    std::ifstream f("data.json");
    json j;
    f >> j;
    std::cout << j << std::endl;
    std::string s = j.dump();
    std::cout << json::parse(s) << std::endl;
}

And here is the error when I run it,

[tmp]> g++ --version
g++ (Debian 7.2.0-14) 7.2.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

[tmp]> g++ test.cpp 
[tmp]> ./a.out 
terminate called after throwing an instance of 'nlohmann::detail::parse_error'
  what():  [json.exception.parse_error.101] parse error at 15: syntax error - invalid string: ill-formed UTF-8 byte; last read: '"g\\u0007\\r�~'; expected string literal
Aborted
[tmp]> cat data.json 
{"g\\u0007\\r�~�_\\\"�(G&�(G&߆�":"KeyUsable"}

@abolz
Copy link
Contributor

abolz commented Nov 21, 2017

The key1 array does not contain valid UTF-8

               215 = 0b11010111 requires one more byte in the range [128,192)
                                                           v
uint8_t key1[] = { 103, 92, 117, 48, 48, 48, 55, 92, 114, 215, 126, ... };
                                                                ^
                                                             invalid

While serializing, the json library does not check for valid UTF-8 (only some special characters are escaped). But while deserializing it detects an invalid byte sequence.

@charlie-ht
Copy link
Author

std::string doesn't have such a requirement for UTF-8 (binary data is fine) so I assumed I could serialise to JSON using this library and it would do something like base64'ing the input and serializing/deserializing that as appropriate. If that's not the case and this library leaks the JSON requirement of UTF-8 strings into C++ strings, I can instead base64 encode them myself.

@abolz
Copy link
Contributor

abolz commented Nov 21, 2017

Ah ok, sorry, I should have read the "binary data" part...

@nlohmann
Copy link
Owner

Indeed the library copies strings to the byte and does check whether they are UTF-8 encoded. The library also does not check the encoding during serialization. Therefore, the non-UTF-8 encoding does not yield an error until the serialized value is parsed again.

So much for the explanation. I do not plan to implement base64 encoding (or other encodings) at the moment. We had a similar discussion about this regarding CBOR which also has some types that assume encoding/decoding. What I would like to discuss whether it would make sense to:

  • Make the dump function family throw an exception of non-UTF-8 strings are encountered?

@nlohmann nlohmann added the state: please discuss please discuss the issue or vote for your favorite option label Nov 21, 2017
@charlie-ht
Copy link
Author

My take-away from this is that I'd be better off going with CBOR (hadn't heard of that before, thanks!).

Make the dump function family throw an exception of non-UTF-8 strings are encountered?

In principle I believe errors should be reported as soon as they occur, so an exception during encoding seems good to me :)

@theodelrieu
Copy link
Contributor

I agree with @charlie-ht, I would even go for an assert.

This is a precondition of the library after all

@nlohmann
Copy link
Owner

nlohmann commented Dec 7, 2017

I think an assertion would be too harsh here. I'll see that I define an exception for this.

nlohmann added a commit that referenced this issue Dec 11, 2017
We had a lot of issues with failing roundtrips (i.e., parse errors from serializations) in case string were stored in the library that were not UTF-8 encoded. This PR adds an exception in this case.
nlohmann added a commit that referenced this issue Dec 13, 2017
💥 throwing an exception in case dump encounters a non-UTF-8 string #838
@nlohmann nlohmann removed the state: please discuss please discuss the issue or vote for your favorite option label Dec 13, 2017
@nlohmann nlohmann self-assigned this Dec 13, 2017
@nlohmann nlohmann added this to the Release 3.0.0 milestone Dec 13, 2017
@nlohmann
Copy link
Owner

Trying to serialize a non-UTF-8 encoded string now throws an exception.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants