Does it support chinese character? #1022

Closed
JueLance opened this issue Mar 22, 2018 · 13 comments
Labels
kind: question; solution: proposed fix (a fix for the issue has been proposed and waits for confirmation)

Comments

@JueLance

JueLance commented Mar 22, 2018

When I try to input some Chinese characters, the application crashes. The code is as follows:

json j;

j["chinese"] = "中文";

string s = j.dump();

cout << s.c_str() << endl;

Can you tell me how to make it work?

@nlohmann
Owner

Yes, the library fully supports UTF-8. You need to check whether your IDE actually uses UTF-8, which is sometimes an issue with MSVC.

See also #694.

@nlohmann
Owner

You say the application crashed. Could you please provide more information: compiler version, concrete example, library version?

@nlohmann added the "state: needs more info" label (the author of the issue needs to provide more details) Mar 22, 2018
@JueLance
Author

JueLance commented Mar 22, 2018

Thank you for your patience. After several hours of searching the internet, I tried some methods with MSVC, but none of them gave the correct result. I guess the reason is that MSVC treats "中文" as non-UTF-8 text (maybe ASCII or UTF-16, I'm not sure) while compiling the code and/or running the program. As you mentioned, I tried compiling the code with MinGW, and it works fine because MinGW uses UTF-8 by default.

My full code is as below:

#include <iostream>  
#include <string> 
#include "json.hpp"

using namespace std;

using json = nlohmann::json;

int main() {

    try
    {
        json j;

        j["chinese"] = "中文";

        string s = j.dump();

        cout << s.c_str() << endl;
    }
    catch (const std::exception& e)
    {
        cout << e.what() << endl;
    }

    return 0;
}

When I run it in Visual Studio 2017, the command line shows the following:
[json.exception.type_error.316] invalid UTF-8 byte at index 1: 0xD0
Press any key to continue . . .

Additional information (from typing 'systeminfo' in the command line):
OS Version: Microsoft Windows [Version 10.0.16299.309]
System Locale: zh-cn;Chinese (China)
Input Locale: zh-cn;Chinese (China)
Time Zone: (UTC+08:00) Beijing, Chongqing, Hong Kong, Urumqi

Visual Studio Project Setting:
Windows SDK version: 10.0.16299.0
Platform Toolset: Visual Studio 2017 (v141)
Character set: Use Unicode Character Set
json.hpp version: 3.1.2

@nlohmann
Owner

The error message indicates that "中文" is not encoded with UTF-8. Could you execute the following program?

#include <cstdint>
#include <iostream>
#include <string>

int main() {
    std::string s = "中文";
    for (size_t i = 0; i < s.size(); ++i)
    {
        std::cout << i << " " << std::hex << static_cast<int>(static_cast<uint8_t>(s[i])) << std::endl;
    }
}

With UTF-8, this should yield:

0 e4
1 b8
2 ad
3 e6
4 96
5 87
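
One plausible reading of the reported error ("invalid UTF-8 byte at index 1: 0xD0") is that the literal was compiled as GBK, where "中文" is the byte sequence D6 D0 CE C4. That is an assumption based on the zh-cn system locale reported above, not something confirmed in the thread. The sketch below uses a hypothetical helper, `first_invalid_utf8`, which is not part of the library and is much simpler than the library's own validation (it checks only lead and continuation byte patterns). It shows why such a sequence is rejected at index 1.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of nlohmann/json): returns the index of the
// first byte that breaks UTF-8 structure, or std::string::npos if none does.
// Simplified: it checks lead/continuation bit patterns only and does not
// reject overlong encodings or surrogate code points.
std::size_t first_invalid_utf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size())
    {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;
        if (lead < 0x80)                extra = 0;  // 0xxxxxxx: ASCII
        else if ((lead & 0xE0) == 0xC0) extra = 1;  // 110xxxxx: 2-byte sequence
        else if ((lead & 0xF0) == 0xE0) extra = 2;  // 1110xxxx: 3-byte sequence
        else if ((lead & 0xF8) == 0xF0) extra = 3;  // 11110xxx: 4-byte sequence
        else return i;                              // stray continuation or invalid lead

        for (std::size_t k = 1; k <= extra; ++k)
        {
            if (i + k >= s.size()) return i;        // truncated: report the lead byte
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return i + k;   // not a 10xxxxxx continuation byte
        }
        i += extra + 1;
    }
    return std::string::npos;
}
```

With the assumed GBK bytes, `first_invalid_utf8("\xd6\xd0\xce\xc4")` returns 1: 0xD6 announces a two-byte sequence, but 0xD0 is not a continuation byte. The UTF-8 bytes listed above pass the check.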

@JueLance
Author

I had to add the "/utf-8" flag to compile the program; otherwise I still get the same error (invalid UTF-8 byte). Reference: https://msdn.microsoft.com/zh-cn/library/mt708819.aspx.

#include <cstdint>
#include <iostream>
#include <string>
#include "json.hpp"

using namespace std;

using json = nlohmann::json;

int main() {
    try
    {
        std::string s = "中文";
        for (size_t i = 0; i < s.size(); ++i)
        {
            std::cout << i << " " << std::hex << static_cast<int>(static_cast<uint8_t>(s[i])) << std::endl;
        }

        json j;
        j["chinese"] = s;

        string jsonStr = j.dump();

        cout << jsonStr << endl;
    }
    catch (const std::exception& e)
    {
        cout << e.what() << endl;
    }

    return 0;
}

The result is as follows:

0 e4
1 b8
2 ad
3 e6
4 96
5 87
{"chinese":"涓枃"}
Press any key to continue . . .

I also tried running "chcp 65001" (change the Windows active code page to UTF-8) and "chcp 936" (Chinese) in the command line first and then running the exe file, but I still got incorrect characters. With the active code page set to UTF-8, the command line cannot print the whole message; with it set to Chinese, the result is the same as above.
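
Because the fix here depends on a compiler flag that is easy to lose, a build-time guard can help. The sketch below is only an illustration: `literals_are_utf8` is a hypothetical name, and the check assumes the file itself is saved as UTF-8. It makes compilation fail if the compiler did not store the narrow literal "中" as the bytes E4 B8 AD (for example, MSVC building without /utf-8 on a GBK system).

```cpp
// Hypothetical compile-time guard; the function name is illustrative.
// U+4E2D encodes as E4 B8 AD in UTF-8, so the literal holds 3 bytes plus '\0'.
constexpr bool literals_are_utf8()
{
    return sizeof("中") == 4
        && static_cast<unsigned char>("中"[0]) == 0xE4
        && static_cast<unsigned char>("中"[1]) == 0xB8
        && static_cast<unsigned char>("中"[2]) == 0xAD;
}

// Fails the build instead of failing at run time with an invalid UTF-8 error.
static_assert(literals_are_utf8(),
              "narrow string literals are not UTF-8; check the compiler's charset flags");
```

This does nothing about how the console renders the bytes; it only confirms what ends up inside the std::string.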

@OvermindDL1

OvermindDL1 commented Mar 23, 2018

You should keep UTF-8 text out of the source. C++ does not define a text encoding for source files, which is why you should keep such text in resource files. Adding that utf-8 switch is not really compliant. The standard states:

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences (2.3) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set (2.2) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e. using the \uXXXX notation), are handled equivalently.)

As such, C++ does not have Chinese characters in the basic source character set, so you cannot rely on them working; you either need to encode them in the source or use an external mapping file (like a resource file on Windows).

But yes, this library supports Chinese fine in JSON. The character set of the source code is not under this library's control, though; it can only rely on the basic source character set as per the C++ standard.

(EDIT: 'encode it' meaning use the \uXXXX syntax)
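
One caveat on the \uXXXX route: in a narrow string literal, a universal-character-name is still converted to the compiler's execution character set, so it does not by itself guarantee UTF-8 bytes. A minimal sketch of a more explicit variant is to spell out the UTF-8 bytes with hex escapes, which uses only the basic source character set. The function name `chinese_utf8` is hypothetical; the byte values are the ones listed earlier in this thread.

```cpp
#include <string>

// "中文" (U+4E2D U+6587) written as explicit UTF-8 bytes. The source line is
// pure ASCII, so the result does not depend on how the file is saved or on
// which execution character set the compiler uses.
std::string chinese_utf8()
{
    return "\xe4\xb8\xad\xe6\x96\x87";
}
```

The returned string can be assigned to a JSON value the same way as the literal in the original example.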

@JueLance
Author

JueLance commented Mar 23, 2018

Thanks to all. I found the root cause: I tried writing the JSON string to a file (code below); "data.json" is encoded as UTF-8 and displays the Chinese characters correctly. So the problem is caused by a Windows 10 bug or a command-line bug.

string json_file = "data.json";

ofstream json_output(json_file);

json_output << j;// the json object

json_output.close();

@OvermindDL1 Thank you for your tips. Yes, I did try to change the source code file encoding.
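
If the console is the only part that mangles the bytes, another option is to print ASCII-only JSON. The library's dump() takes an ensure_ascii parameter (third argument) that escapes non-ASCII characters as \uXXXX. The sketch below is not the library's code: `escape_non_ascii` is a hypothetical standalone helper that illustrates the same idea, and it assumes its input is already valid UTF-8.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: replaces every non-ASCII code point in a valid UTF-8
// string with a \uXXXX escape (a surrogate pair above U+FFFF).
std::string escape_non_ascii(const std::string& utf8)
{
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size())
    {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        if (lead < 0x80) { out += utf8[i]; ++i; continue; }  // ASCII passes through

        // Sequence length from the lead byte (input is assumed to be valid).
        const std::size_t len = (lead & 0xE0) == 0xC0 ? 2 : (lead & 0xF0) == 0xE0 ? 3 : 4;
        std::uint32_t cp = lead & (0xFFu >> (len + 1));       // payload bits of the lead byte
        for (std::size_t k = 1; k < len && i + k < utf8.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3Fu);
        i += len;

        char buf[13];
        if (cp > 0xFFFF)
        {
            cp -= 0x10000;  // split into a UTF-16 surrogate pair
            std::snprintf(buf, sizeof(buf), "\\u%04x\\u%04x",
                          static_cast<unsigned>(0xD800 + (cp >> 10)),
                          static_cast<unsigned>(0xDC00 + (cp & 0x3FF)));
        }
        else
        {
            std::snprintf(buf, sizeof(buf), "\\u%04x", static_cast<unsigned>(cp));
        }
        out += buf;
    }
    return out;
}
```

For the UTF-8 bytes of "中文" the helper yields the twelve ASCII characters \u4e2d\u6587, which any console code page can print and any JSON parser reads back as the original text.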

@OvermindDL1

Yes, I did try to change the source code file encoding.

Even that is not guaranteed to work depending on the compiler (MSVC is especially bad at it). It is better to not depend on encoding at all and instead follow the spec to the letter.

So the problem is caused by a Windows 10 bug or a command-line bug.

Windows is not UTF-8; it uses UTF-16 (formerly UCS-2) internally (essentially a garbage format to be honest).

But yes, storing UTF-8 in a non-source file and reading it in should always work; if it does not, that is a bug.
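
A minimal sketch of that approach, assuming the text lives in a UTF-8 file next to the program. The helper name `read_bytes` is hypothetical. The file is opened in binary mode so the bytes reach the std::string untouched.

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the raw bytes of a file. No decoding is done,
// so a UTF-8 file yields a UTF-8 std::string regardless of compiler settings.
std::string read_bytes(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in)
    {
        throw std::runtime_error("cannot open " + path);
    }
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Note that a file saved with a UTF-8 byte order mark will start with the bytes EF BB BF, which this sketch does not strip.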

@nlohmann
Owner

Can we close the issue?

@JueLance
Author

sure

@nlohmann added the "solution: proposed fix" label (a fix for the issue has been proposed and waits for confirmation) and removed the "state: needs more info" label (the author of the issue needs to provide more details) Mar 27, 2018
@JueLance
Author

JueLance commented May 24, 2019

Perhaps this solution for displaying Chinese correctly in the Windows console comes too late, but I am posting a solution found on the internet here to help others fix this kind of issue quickly.
Conditions:

  1. Set the project's Character Set to "Use Unicode Character Set".
  2. Save the code below with UTF-8 encoding.
  3. Chinese should then display correctly in the Windows console.

#include <cstdint>
#include <iostream>
#include <string>
#include "json.h"
#ifdef _WIN32
#include <windows.h>
#endif
//#include <tchar.h>

using namespace std;

// for convenience
using json = nlohmann::json;

int main() {

#ifdef _WIN32
//Change the console font to display Chinese
//Reference: http://m.blog.csdn.net/article/details?id=52789570
//system("chcp 65001"); // set the code page (setting it with SetConsoleCP(65001) had no effect, reason unknown)
SetConsoleOutputCP(65001);
CONSOLE_FONT_INFOEX info = { 0 }; // the following sets the font so that Chinese can be displayed
info.cbSize = sizeof(info);
info.dwFontSize.Y = 16; // leave X as zero
info.FontWeight = FW_NORMAL;
wcscpy(info.FaceName, L"Consolas");
SetCurrentConsoleFontEx(GetStdHandle(STD_OUTPUT_HANDLE), NULL, &info);
#endif

std::string s = "中文";
for (size_t i = 0; i < s.size(); ++i)
{
    std::cout << i << " " << std::hex << static_cast<int>(static_cast<uint8_t>(s[i])) << std::endl;
}

cout << "============" << endl;

try
{
    json j;

    j["chinese"] = "中文";

    string s = j.dump();

    cout << s.c_str() << endl;
}
catch (const std::exception& e)
{
    cout << e.what() << endl;
}

return 0;

}

@raphtimecn

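For handling std::string values:
In to_json.hpp, you can add an optional GBKToUTF8 conversion of the string inside the body of the following function:
static void construct(BasicJsonType& j, const typename BasicJsonType::string_t& s)

In from_json.hpp, you can add an optional UTF8ToGBK conversion of the string inside the body of the following function:
inline void from_json(const BasicJsonType& j, typename BasicJsonType::string_t& s)

In Visual Studio, set a breakpoint at those locations and it becomes clear at a glance.

This handling addresses not only garbled UTF-8 output on the Windows console (GBK by default) but also garbled GBK output when a program writes logs to tools such as DebugView.

After changing the code as described, pay particular attention to whether your JSON file is itself already UTF-8. If it is, the code in to_json.hpp must not apply the GBKToUTF8 conversion a second time. For that reason I added a bool switch in to_json to decide when conversion is needed. Likewise, I added a corresponding bool switch in from_json.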

@deltoro05

The error message indicates that "中文" is not encoded with UTF-8. Could you execute the following program?

#include <string>
#include <iostream>

int main() {
    std::string s = "中文";
    for (size_t i = 0; i < s.size(); ++i)
    {
        std::cout << i << " " << std::hex << static_cast<int>(static_cast<uint8_t>(s[i])) << std::endl;
    }
}

With UTF-8, this should yield:

0 e4
1 b8
2 ad
3 e6
4 96
5 87

This URL https://unicode-table.com/ has changed to https://symbl.cc/
