Error on JSON import #19719

Closed · inkrement opened this issue Jan 27, 2021 · 9 comments · Fixed by #20286
Assignees: nikitamikhaylov
Labels: bug (Confirmed user-visible misbehaviour in official release), st-need-info (We need extra data to continue (waiting for response))

Comments

@inkrement

Describe the bug
When importing a jsonlines file (gzip-compressed, around 2 GB), I always get an error message indicating a problem with memory allocation. The machine has 512 GB of RAM and most of it is unused, so this is probably either a configuration issue or a software bug. I am using the default config and have not changed it in about a year (has anything changed in the default values that could be relevant here?). Either way, I am not sure how to fix it. I am on the most recent version (client and server 21.1.2.15) and can reproduce the error. Due to copyright restrictions, however, I cannot share the original dataset.

I import the file as follows:

zcat dataset.jsonl.gz|clickhouse-client --input_format_skip_unknown_fields=1 --input_format_allow_errors_num=1000 -q "INSERT INTO mydataset.mytable FORMAT JSONEachRow"

Error message and/or stacktrace
After a while, I always receive the following error message:

Code: 49, e.displayText() = DB::Exception: Too large size (18446744071562077831) passed to allocator. It indicates an error., Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception<unsigned long&>(int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, unsigned long&) @ 0x864dfc7 in /usr/bin/clickhouse
1. Allocator<false, false>::checkSize(unsigned long) @ 0x864dd7e in /usr/bin/clickhouse
2. Allocator<false, false>::realloc(void*, unsigned long, unsigned long, unsigned long) @ 0x865c784 in /usr/bin/clickhouse
3. DB::loadAtPosition(DB::ReadBuffer&, DB::Memory<Allocator<false, false> >&, char*&) @ 0x865bbfa in /usr/bin/clickhouse
4. DB::fileSegmentationEngineJSONEachRowImpl(DB::ReadBuffer&, DB::Memory<Allocator<false, false> >&, unsigned long) @ 0xf98a612 in /usr/bin/clickhouse
5. DB::ParallelParsingInputFormat::segmentatorThreadFunction(std::__1::shared_ptr<DB::ThreadGroupStatus>) @ 0xf9aed84 in /usr/bin/clickhouse
6. ThreadFromGlobalPool::ThreadFromGlobalPool<void (DB::ParallelParsingInputFormat::*)(std::__1::shared_ptr<DB::ThreadGroupStatus>), DB::ParallelParsingInputFormat*, std::__1::shared_ptr<DB::ThreadGroupStatus> >(void (DB::ParallelParsingInputFormat::*&&)(std::__1::shared_ptr<DB::ThreadGroupStatus>), DB::ParallelParsingInputFormat*&&, std::__1::shared_ptr<DB::ThreadGroupStatus>&&)::'lambda'()::operator()() @ 0xf8a9677 in /usr/bin/clickhouse
7. ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) @ 0x86415ed in /usr/bin/clickhouse
8. ? @ 0x86451a3 in /usr/bin/clickhouse
9. start_thread @ 0x9609 in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
10. clone @ 0x122293 in /usr/lib/x86_64-linux-gnu/libc-2.31.so
 (version 21.1.2.15 (official build))
Code: 49. DB::Exception: Too large size (18446744071562077831) passed to allocator. It indicates an error.: data for INSERT was parsed from stdin
inkrement added the bug (Confirmed user-visible misbehaviour in official release) label on Jan 27, 2021
@alexey-milovidov (Member)

Does it help to set --input_format_parallel_parsing 0?

alexey-milovidov added the st-need-info (We need extra data to continue (waiting for response)) label on Jan 28, 2021
@inkrement (Author)

Indeed, the flag solved the problem!

alexey-milovidov changed the title from "Error on Json Inport" to "Error on JSON import" on Feb 8, 2021
nikitamikhaylov self-assigned this on Feb 8, 2021
@nikitamikhaylov (Member) commented on Feb 8, 2021

@inkrement Hello, could you estimate the number of characters in a single JSON document in your dataset? Perhaps you have very large strings or simply many fields. I see a potential bug in the code, but it would only appear with extremely big JSONs.

@inkrement (Author)

Hi! The documents should be rather small (at most 30 fields, each at most 1000 characters). However, I have already noticed some broken documents (other processes interfered with the output). If something is broken, does the ClickHouse parser detect it right at the end of the current JSON document, or could it keep reading? Since the single-threaded version works fine, I guess this has to do with how lines/documents are distributed across parser threads.
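
As a quick sanity check (a minimal sketch only, assuming the file is the dataset.jsonl.gz from the command above), something like this reports the longest line and the first few lines that fail to parse as JSON:

import gzip
import json

# Sketch: scan the gzip-compressed jsonlines file, track the longest line,
# and print the first few lines that are not valid JSON.
max_len = 0
bad = 0
with gzip.open('dataset.jsonl.gz', 'rt', encoding='utf-8', errors='replace') as f:
    for n, line in enumerate(f, 1):
        max_len = max(max_len, len(line))
        try:
            json.loads(line)
        except json.JSONDecodeError:
            bad += 1
            if bad <= 5:
                print('malformed JSON on line', n, ':', line[:120])
print('longest line:', max_len, 'characters;', bad, 'malformed lines')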

@nikitamikhaylov (Member)

@inkrement Thanks! The parallel parser consists of several pieces; one of them is the segmentator, which reads through the file and keeps a running "balance" of curly braces. When the balance returns to 0, a complete JSON object has been read and can be handed off for parsing. In your case I think a broken document kept the brace balance non-zero for a long stretch of the file, which is why the segmentator kept reading more and more of the document into memory.
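
For intuition, here is a minimal sketch of that brace-balance idea (a simplified illustration only, not the actual ClickHouse implementation; among other things it ignores braces inside string literals):

def segment_jsoneachrow(chunk):
    # Split a JSONEachRow chunk at points where the curly-brace balance returns to zero.
    segments = []
    balance = 0
    start = 0
    for i, c in enumerate(chunk):
        if c == '{':
            balance += 1
        elif c == '}':
            balance -= 1
            if balance == 0:
                segments.append(chunk[start:i + 1])
                start = i + 1
    # Whatever follows `start` has no closing brace yet and stays buffered;
    # with a broken or huge document this tail keeps growing.
    return segments, chunk[start:]

# Example: the second object is missing its closing brace, so it is never cut off.
print(segment_jsoneachrow('{"a": 1}\n{"b": 2'))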

If there is something broken, does the CH-parser detect it directly at the end of the current JSON document or could it be the case that it keeps reading?

It reports the line in the document where the error occurred, but your case is unusual.

@inkrement (Author)

Just to be sure: Are we talking about JSON or JSONEachRow?

@nikitamikhaylov (Member)

Just to be sure: Are we talking about JSON or JSONEachRow?

JSONEachRow, because JSON is not supported for input.
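
For reference, JSONEachRow input is simply one complete JSON object per line, for example:

{"a": "foo", "b": "bar"}
{"a": "baz", "b": "qux"}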

@nikitamikhaylov (Member)

It reproduces easily:

python3 -c "for i in range(10):print('{{\"a\":\"{}\", \"b\":\"{}\"'.format('clickhouse'* 1000000, 'dbms' * 1000000))" > big_json.json
python3 -c "for i in range(1000):print('{{\"a\":\"{}\", \"b\":\"{}\"}}'.format('clickhouse'* 1000000, 'dbms' * 1000000))" > big_json.json
↳ $ ls -lha
14G Feb 11 13:25 big_json.json
↳ $ clickhouse-local --input_format_parallel_parsing=1 --max_memory_usage=0 -q "select count() from file('big_json.json', 'JSONEachRow', 'a String, b String')"
Code: 49, e.displayText() = DB::Exception: Too large size (18446744071562067983) passed to allocator. It indicates an error.: While executing File (version 21.3.1.1)

@inkrement (Author)

OK, thanks! In that case I guess my JSON is broken. Thanks anyway (I'll close the issue for now).
