What XML parsing library to use for parsing OVAL files #7108

Closed
HoussemNasri opened this issue Jun 7, 2023 · 4 comments
Labels
question Further information is requested

Comments

@HoussemNasri
Collaborator

HoussemNasri commented Jun 7, 2023

Question

I'm working on the OVAL consumption GSoC project and I need to select a library to parse OVAL files (which are essentially XML files). I wrote an ADR to compare the XML parsing libraries used in Uyuni and, based on the result of the comparison, I find JAXB to be the most suitable for the project's use case. However, I would like to hear your input on JAXB and whether you think I should use another library.

My biggest concern with JAXB is memory consumption. With a StAX parser, we can read one XML element at a time and write it to the database directly. With JAXB, we have to wait until the whole file is parsed before we can store anything in the database. Keep in mind that the memory allocated by JAXB will be released once the parsed object is written to the database.
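To make the "one element at a time" point concrete, here is a minimal sketch of a StAX cursor loop using only the JDK's `javax.xml.stream` API. It is not taken from the ADR or the PoC: the class name `OvalStreamSketch`, the method `streamDefinitions`, and the `sink` callback (a stand-in for the database write) are hypothetical, and the sample XML is a heavily reduced OVAL-like document without namespaces.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.function.BiConsumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch: stream <definition> elements and hand each one to a sink.
class OvalStreamSketch {

    /**
     * Reads the document with a StAX cursor and calls sink.accept(id, title)
     * as soon as each <definition> element is closed. Returns the number of
     * definitions seen. Only the current element's data is kept in memory.
     */
    static int streamDefinitions(Reader source, BiConsumer<String, String> sink)
            throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Assumption: OVAL feeds need neither DTDs nor external entities.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);

        XMLStreamReader reader = factory.createXMLStreamReader(source);
        int count = 0;
        try {
            boolean inDefinition = false;
            String id = null;
            String title = null;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("definition".equals(name)) {
                        inDefinition = true;
                        id = reader.getAttributeValue(null, "id");
                        title = null;
                    } else if (inDefinition && "title".equals(name)) {
                        // getElementText() consumes up to the matching end tag.
                        title = reader.getElementText();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "definition".equals(reader.getLocalName())) {
                    sink.accept(id, title); // e.g. write one row to the database
                    inDefinition = false;
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<oval_definitions><definitions>"
                + "<definition id=\"oval:example:def:1\">"
                + "<metadata><title>Example title</title></metadata>"
                + "</definition>"
                + "</definitions></oval_definitions>";
        int n = streamDefinitions(new StringReader(xml),
                (id, title) -> System.out.println(id + " -> " + title));
        System.out.println(n + " definition(s) processed");
    }
}
```

In this shape, peak memory depends on the size of the largest single definition rather than on the size of the file, which is the property being weighed against JAXB here.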

openSUSE OVAL files are around 250 MB at most. I created a PoC app that parses a 284 MB OVAL file with JAXB, and the app used 517 MB of heap memory at peak.


@HoussemNasri HoussemNasri added the question Further information is requested label Jun 7, 2023
@admd
Contributor

admd commented Jun 9, 2023

@rjmateus @cbosdo @mackdk can you please help Houssem make this decision here?

@cbosdo
Contributor

cbosdo commented Jun 13, 2023

For potentially big datasets like this I would prefer using streams: all DOM-based solutions are bound to explode sooner or later, as in the end it's just a matter of the number of objects. Even though StAX is not as convenient as the others, I would go with it, as it is a tradeoff between the plain (efficient) SAX and the easy (hungry) DOM APIs. I'm not sure we have to care about execution time, but memory consumption can bite us hard.
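As a rough illustration of the "number of objects" argument, the sketch below counts `definition` elements twice, once via the JDK's DOM API and once via a StAX cursor. The class and method names are hypothetical, the input is a tiny made-up document, and DOM stands in here for any approach that materializes the whole file before it can be queried.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.w3c.dom.Document;

// Hypothetical sketch: same question answered with a tree and with a stream.
class DomVersusStaxSketch {

    /** DOM: the whole tree (one object per node) exists before we can count. */
    static int countWithDom(InputStream in) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(in);
        return doc.getElementsByTagName("definition").getLength();
    }

    /** StAX: the cursor only exposes the current event; nothing accumulates. */
    static int countWithStax(InputStream in) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        int count = 0;
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "definition".equals(reader.getLocalName())) {
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = ("<oval_definitions><definitions>"
                + "<definition id=\"a\"/><definition id=\"b\"/>"
                + "</definitions></oval_definitions>").getBytes(StandardCharsets.UTF_8);
        System.out.println("DOM:  " + countWithDom(new ByteArrayInputStream(xml)));
        System.out.println("StAX: " + countWithStax(new ByteArrayInputStream(xml)));
    }
}
```

Both methods give the same answer; the difference is that the DOM version's memory use grows with the document, while the StAX version's stays roughly flat regardless of file size.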

@HoussemNasri
Collaborator Author

HoussemNasri commented Jun 15, 2023

@cbosdo Thank you for your valuable input. Right now, I'm convinced that we should use StAX because, as you said, with DOM it's a matter of the number of objects. It's no longer a question of whether it will explode, but rather when. I will keep the issue open for a few more days in case someone has more input.

About the execution time: it takes around 6 seconds on my computer to parse a 250 MB OVAL file. Assuming the OVAL data will be synced once a day, I think that's negligible. This is with the DOM parser, though; StAX should be slightly faster.

@HoussemNasri
Collaborator Author

Decision: Use StAX because it provides a middle ground between the memory-hungry/easy-to-use DOM APIs and the efficient/complicated SAX.
