
Excessive Memory Usage for Large Json File #1516

Closed
ghost opened this issue Mar 14, 2019 · 16 comments
Labels
state: stale (the issue has not been updated in a while and will be closed automatically soon unless it is updated)

Comments

@ghost

ghost commented Mar 14, 2019

Parsing a 1.6 GB file takes up more than 10 GB of memory. I've also seen it approach 20 GB for a similarly sized file.

You can generate a file with the code below; I don't have anywhere to upload a file that large.

#include <cstdlib>
#include <iostream>
#include <string>

#include <nlohmann/json.hpp>

int main() {

	nlohmann::json res;

	auto& arr = res["Tasks"];


	for ( int i = 0; i < 50000; ++i ) {
		nlohmann::json value;

		value["Name"] = "Some Name " + std::to_string( i );
		value["Date"] = "2256-01-01";
		value["Date2"] = "2256-01-01";
		value["Date2"] = "2256-01-01";
		value["Date4"] = "2256-01-01";

		{
			auto& nestedArr = value["Objects"];

			for ( int j = 0; j < 125; ++j ) {
				nlohmann::json value;
				float LO = -1024.0f;
				float HI = 1024.0f;

				value["AnotherName"] = "Another Name " + std::to_string( i ) + std::to_string( j );
				value["Value1"] = LO + static_cast <float> ( rand() ) / ( static_cast <float> ( RAND_MAX / ( HI - LO ) ) ) + 10000.0;
				value["Value2"] = LO + static_cast <float> ( rand() ) / ( static_cast <float> ( RAND_MAX / ( HI - LO ) ) );
				value["Date"] = "2018-01-01";
				nestedArr.push_back( value );
			}
		}

		{
			auto& nestedArr = value["Objects2"];

			for ( int j = 0; j < 173; ++j ) {
				nlohmann::json value;
				float LO = -1024.0f;
				float HI = 1024.0f;

				value["AnotherName"] = "Another Name " + std::to_string( i ) + std::to_string( j );
				value["Value1"] = LO + static_cast <float> ( rand() ) / ( static_cast <float> ( RAND_MAX / ( HI - LO ) ) ) + 10000.0;
				value["Value2"] = LO + static_cast <float> ( rand() ) / ( static_cast <float> ( RAND_MAX / ( HI - LO ) ) );
				value["Date"] = "2018-01-01";
				nestedArr.push_back( value );
			}
		}



		arr.push_back( value );
	}


	std::cout << res.dump();
}
@ghost changed the title from "Excess Memory Usage for Large Json File" to "Excessive Memory Usage for Large Json File" on Mar 14, 2019
@nlohmann
Owner

Thanks for reporting! Indeed this library is not as memory-efficient as it could be. This is because parsed values are by default stored in a DOM-like hierarchy, ready to be read and changed later on. For this, we use STL types like std::string, std::map, and std::vector, which all bring some overhead.

I fear there is little we can do at the moment - the file has some 59 million values to store. If you just want to process the file (e.g., summing up some values or looking up a particular object) without needing to store the whole document in memory, you can define a parser callback or a dedicated SAX parser and let its logic decide during parsing which values to keep (a sketch follows below).

I created a (pretty-printed) JSON with your program above and opened it in some tools. It seems they also have some issues with memory...

  • Sublime Text: 9.0 GB
  • jq: 8.9 GB
  • Xcode: 22+ GB
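
A minimal sketch of the parser-callback approach mentioned above. The callback type json::parser_callback_t and the json::parse overload taking a callback are the library's documented interface; the file name and the filtering rule are just placeholder assumptions:

#include <fstream>
#include <iostream>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

int main() {
    std::ifstream in("big.json");  // placeholder file name

    // The callback is invoked for every parse event; returning false for a
    // key event discards that key-value pair, so it never reaches the DOM.
    json::parser_callback_t keep_names =
        [](int /*depth*/, json::parse_event_t event, json& parsed) {
            if (event == json::parse_event_t::key && parsed != json("Name")) {
                return false;  // drop every key except "Name"
            }
            return true;
        };

    json filtered = json::parse(in, keep_names);
    std::cout << filtered.dump(2) << '\n';
}

With this, memory usage is bounded by what the callback decides to keep rather than by the size of the input file.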

@abrownsword

A quick look at the code shows heavy array usage. It is probably worthwhile to write a program that runs through the whole DOM and adds up the std::vector::capacity() values for all the vectors. Since vectors grow via a capacity-doubling strategy, this alone could produce ~2x bloat. If that's the case, some judicious use of reserve/resize might improve things. Is there a reserve call on json::array()? It might be worth adding one to avoid all the intermediate allocations, which (depending on how you're measuring) can also appear to bloat the memory footprint.

Other than that though, Neil is right in that there may not be a lot that can be done due to the use of a friendly DOM-like data structure built from the standard library. A 10x expansion is not uncommon in other languages as well (e.g. Python). The SAX approach can be used to build your own (presumably more efficient) data structure, or to just process data on-the-fly.
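
A rough sketch of that measurement, assuming the underlying containers are reached via get_ref. It ignores object keys and the per-node overhead of std::map, so it only approximates the vector/string slack:

#include <cstddef>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// Sum the bytes reserved (capacity, not size) by all arrays and strings in a DOM.
std::size_t reserved_bytes(const json& j) {
    std::size_t total = 0;
    if (j.is_array()) {
        const auto& vec = j.get_ref<const json::array_t&>();
        total += vec.capacity() * sizeof(json);
        for (const auto& element : vec) {
            total += reserved_bytes(element);
        }
    } else if (j.is_object()) {
        for (const auto& element : j) {  // iterates the object's values
            total += reserved_bytes(element);
        }
    } else if (j.is_string()) {
        total += j.get_ref<const json::string_t&>().capacity();
    }
    return total;
}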

@nlohmann
Owner

There currently is no reserve call for arrays, objects, or strings. This, however, would not help during parsing, where the number of elements to come is unknown. The situation is different with formats like CBOR or MessagePack, where the sizes of arrays, objects, and strings are given before the actual values. Even then, however, it would be unsafe to just take these values and call reserve without checks.

One way to reduce the overhead would be to add a shrink_to_fit function that (recursively) calls std::string::shrink_to_fit and std::vector::shrink_to_fit. I may conduct some experiments to find out whether this actually makes a difference.
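
A sketch of what such a recursive shrink could look like from the outside. This function is hypothetical (not part of the library); it uses get_ref to reach the underlying STL containers:

#include <nlohmann/json.hpp>

using json = nlohmann::json;

// Release the slack capacity of every string and array in a DOM, bottom-up.
void shrink_recursive(json& j) {
    if (j.is_array()) {
        for (auto& element : j) {
            shrink_recursive(element);
        }
        j.get_ref<json::array_t&>().shrink_to_fit();
    } else if (j.is_object()) {
        for (auto& element : j) {
            shrink_recursive(element);
        }
    } else if (j.is_string()) {
        j.get_ref<json::string_t&>().shrink_to_fit();
    }
}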

@abrownsword

It is possible to use resize/reserve to implement a different growth strategy for vectors that trades more frequent allocations for less wasted space. But yes, it's not ideal when parsing incoming JSON.

I do suggest measuring before doing anything. It ought to be easy to traverse and use capacity to evaluate how much extra allocation has been done in a test case like the one in the OP.

@gregmarr
Contributor

gregmarr commented Mar 26, 2019

Then, however, it would be unsafe to just take these values and call reserve without checks.

Are you worried about it allocating too much memory? Otherwise, safety shouldn't be an issue.

shrink_to_fit isn't guaranteed to do anything.

@nlohmann
Owner

Are you worried about it allocating too much memory? Otherwise, safety shouldn't be an issue.

I added the reserve call once, but then OSS-Fuzz only took a few hours to generate an example with a CBOR file that announced an array with billions of elements which crashed the library.

shrink_to_fit isn't guaranteed to do anything.

I know. But I am curious :)

@gregmarr
Contributor

Okay, that could be solved by putting an upper bound on the initial reserve. If we said that we'd support up to 1024 or 1000 or something like that, then that should be plenty. By that point, it's a fair way up the allocation curve.
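
A sketch of that bound as a helper that a SAX start_array handler might call. The 1024 cap and the helper itself are illustrative assumptions; std::size_t(-1) is the value the SAX interface uses for "size unknown":

#include <algorithm>
#include <cstddef>
#include <vector>
#include <nlohmann/json.hpp>

// Trust the element count announced by a CBOR/MessagePack header only up to
// a fixed cap, so a malicious header cannot force a huge up-front allocation.
constexpr std::size_t max_initial_reserve = 1024;

void bounded_reserve(std::vector<nlohmann::json>& arr, std::size_t announced) {
    if (announced != std::size_t(-1)) {  // size_t(-1): element count unknown
        arr.reserve(std::min(announced, max_initial_reserve));
    }
}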

@jaredgrubb
Contributor

I really like the idea here, as I've had large-ish objects that I really wanted to compact because they'll be around for a while.

A shrink_to_fit function would be nice to add, and doing it well will take some care. For example, imagine a vector of strings. Do you shrink the strings or the vector first? Does it matter? Does one of them help avoid fragmentation, or is that measurement completely non-portable?

I think a shrink-while-parsing option would also be nice because it would let you read in larger objects than you otherwise could, and (I think?) it would be kinder to heap fragmentation.

By the way, I'm not concerned about whether shrink_to_fit has a guarantee. We should trust the STL to do the right thing, and if it doesn't, I don't think we should try to do better.

@nlohmann
Owner

In fact, shrinking would be easy during parsing, because we have a dedicated end_array event in the SAX interface.

Intuitively, I would shrink bottom-up, so in a vector of strings, I would first shrink the strings and then the vector - this would also be the order when shrinking would be integrated into the parser. But we should definitely take measurements.

@abrownsword

Trusting the STL is usually smart because a lot of effort has gone into writing and debugging it, but the fact remains that it is generic, i.e. by definition not specific to a particular use case. If you know more about your particular situations and needs, you can often improve upon its generic behaviour. Vector allocation lengths are one place where I've seen improvements over and over again -- that's why the reserve and capacity methods are there. But to do this you do absolutely need to "know more" about the specific situation.

I think an interesting case would be shrink_to_fit during parsing. The parser ought to be able to trim excess allocations as it progresses. This should be an option as you don't always want it to do this, but if you're loading a massive file it will likely be a large win in terms of memory footprint (although it could easily be slower because of the reallocation/copy that might happen if the memory allocator being used doesn't allow existing blocks to be shrunk).

@gregmarr
Contributor

C++ memory allocators don't allow blocks to be shrunk. That's part of why shrink_to_fit doesn't generally do anything. The only sure-fire way to shrink is to copy the data into a new vector/string of exactly the right size. That trades memory usage for parsing time.
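
The copy-based shrink described above, as a generic sketch over plain STL containers. Whether it actually beats shrink_to_fit depends on the standard library implementation:

#include <string>
#include <vector>

// Copy into a freshly allocated container of (roughly) exact size, then swap;
// the old, over-allocated buffer is freed when the temporary is destroyed.
template <typename Container>
void shrink_by_copy(Container& c) {
    Container exact(c.begin(), c.end());
    c.swap(exact);
}

Usage would be, e.g., shrink_by_copy(my_vector) or shrink_by_copy(my_string).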

@jaredgrubb
Contributor

C++ memory allocators don't allow blocks to be shrunk. That's part of why shrink_to_fit doesn't generally do anything.

Looking at the libc++ implementation, std::vector::shrink_to_fit does do something. The source for string was a little harder to follow, but I did a quick manual test and its std::string does reallocate (or, in my test case, moves to the stack and releases the heap allocation).

@stale

stale bot commented Apr 26, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The stale bot added the "state: stale" label on Apr 26, 2019
The stale bot closed this as completed on May 3, 2019
@Meerkov

Meerkov commented May 28, 2019

I'd like to re-open this issue. I had it parse a 700 MB file and it took 9 GB. That's over 10x the memory. My understanding is that a string's default capacity is implementation-dependent, but generally starts at 10-20 bytes.

So if your JSON contains many short strings, a 10x increase in memory is easy to hit once you also take the rest of the DOM data structure into account. I'd guess that most entries in a parsed JSON rarely need to change, so shrink_to_fit on the strings and vectors would likely also bring potential speed improvements (more data could fit into the L1 and L2 caches).

@ghost
Author

ghost commented May 29, 2019

For me, std::string is 32 bytes on a 64-bit system. Its initial capacity is 14-15 characters, which are embedded into the std::string structure itself via the small-string optimization. The other STL structures also take up similar space without allocating anything on the heap. This is why it takes up so much memory; you can implement your own JSON type to try to reduce the memory as much as possible.

You can implement it using the SAX parser. Or, since basic_json is just a template, you can plug your own types into it, though I imagine they would then need to match the STL interface.

https://github.com/nlohmann/json/blob/develop/include/nlohmann/detail/input/json_sax.hpp#L145
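
For the template route, a sketch of swapping in different value types. The parameters below mirror the library's defaults, except floating-point numbers are stored as float instead of double; any replacement container has to keep an STL-compatible interface, as noted above:

#include <cstdint>
#include <map>
#include <string>
#include <vector>
#include <nlohmann/json.hpp>

using small_json = nlohmann::basic_json<
    std::map,       // ObjectType
    std::vector,    // ArrayType
    std::string,    // StringType
    bool,           // BooleanType
    std::int64_t,   // NumberIntegerType
    std::uint64_t,  // NumberUnsignedType
    float>;         // NumberFloatType (library default is double)

In practice this particular swap saves little on its own, because each value is a tagged union whose size is dominated by its largest member, but it shows where the customization points are.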

@igormcoelho

I just ran into this same issue: a 1.6 GB JSON wouldn't load (even with 10 GB of memory).
I tried other libraries but couldn't find a solution, so I coded my own here: https://github.com/igormcoelho/vastjson/
In case it's useful to someone in the future: it just indexes top-level entries as strings (which consumes less space) and puts them into nlohmann::json only when necessary. It worked, and I could finally process my big JSON (with around 2.3 GB of RAM)!
