Excessive memory usage #57

Open
lpil opened this issue Jan 30, 2018 · 15 comments

@lpil

lpil commented Jan 30, 2018

Hello! We're using SweetXML in production and we've been having some excessive memory usage that we've not been able to debug.

In our latest test, an 80MB XML file uses >9GB of memory, which causes the VM to crash.

We are parsing XML in a format like this:

<?xml version='1.0' encoding='utf-8'?>
<ArrayOfCommercialDetail xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <CommercialDetail>
    <Status>text</Status>
    <ActionDate>text</ActionDate>
    <MusicDetails>
      <Title>text</Title>
      <Arranger></Arranger>
      <Composer>text</Composer>
      <Duration>text</Duration>
    </MusicDetails>
    <ActionedBy>text</ActionedBy>
    <ClockNumber>text</ClockNumber>
    <VODFinalAction>text</VODFinalAction>
    <FinalAction>text</FinalAction>
    <CommercialRestrictions>
      <Restriction>
        <Comment>text</Comment>
        <Code>text</Code>
        <Text>text</Text>
        <ID>text</ID>
      </Restriction>
    </CommercialRestrictions>
    <VODFinalActionId>text</VODFinalActionId>
    <CommercialPresentations>
      <Presentation>
        <Comment>text</Comment>
        <Code>text</Code>
        <Text>text</Text>
        <ID>text</ID>
      </Presentation>
      <Presentation>
        <Comment>text</Comment>
        <Code>text</Code>
        <Text>text</Text>
        <ID>text</ID>
      </Presentation>
    </CommercialPresentations>
    <CommercialArtists>
      <Artist>
        <Name>text</Name>
        <Type>text</Type>
        <ID>text</ID>
      </Artist>
      <Artist>
        <Name>text</Name>
        <Type>text</Type>
        <ID>text</ID>
      </Artist>
    </CommercialArtists>
    <FinalActionId>text</FinalActionId>
    <StatusId>text</StatusId>
  </CommercialDetail>

  <!-- Many more CommercialDetail here... -->

</ArrayOfCommercialDetail>

We parse this XML like so:

    data =
      xpath(
        xml,
        ~x"//CommercialDetail"l,
        clock_number: ~x"./ClockNumber/text()"s,
        actioned_at: ~x"./ActionDate/text()"s,
        presentation_codes: ~x"./CommercialPresentations/Presentation/Code/text()"sl,
        restriction_codes: ~x"./CommercialRestrictions/Restriction/Code/text()"sl,
        status: ~x"./Status/text()"s,
        vod_final_action_id: ~x"./VODFinalActionId/text()"s,
        final_action_id: ~x"./FinalActionId/text()"s
      )

[Screenshot 2018-01-30: load charts]

Here are the load charts while iterating over and parsing the XML files, then discarding the results. Memory spikes each time the XML is parsed.

What are we doing wrong here?

Extra note: we rewrote this code to use the streaming API, which used slightly less memory. Most of our XML will not have newlines in it (so line-based streaming gains us little), which made this seem like the wrong path to a solution, and we would expect lower memory usage from both the eager and streaming APIs.

After digging into the source, it seems that memory spikes when :xmerl_scan.string/1 is called.
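
A minimal way to see that in isolation (a sketch, not our production code; the file name is a placeholder, and note that xmerl wants a charlist rather than a binary):

    # Sketch: measure how much total VM memory grows across a single
    # :xmerl_scan.string/1 call. "commercials.xml" is a placeholder path.
    xml = File.read!("commercials.xml") |> String.to_charlist()

    before = :erlang.memory(:total)
    {_doc, _rest} = :xmerl_scan.string(xml)
    after_scan = :erlang.memory(:total)

    IO.puts("xmerl grew total VM memory by #{after_scan - before} bytes")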

Thanks,
Louis

@awetzel
Copy link
Collaborator

awetzel commented Jan 30, 2018

Hello Louis,
Can you show me the code you use for streaming? The memory footprint is mostly determined by Erlang's xmerl library when you do not use the streaming API: it all goes through :xmerl_scan.string/1.
In the streaming case, though, you should be able to avoid constructing the whole tree with the discard option.
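
For a sense of why xmerl's footprint is so large: it parses into charlists, where each character costs two machine words on a 64-bit VM, before counting the record overhead of the tree itself. A quick check (a sketch, not from this thread):

    # Each charlist cons cell is 2 words (16 bytes on a 64-bit VM), so the
    # character data alone costs ~16x the byte size of the source document.
    words = :erts_debug.size(String.to_charlist("<tag>"))
    IO.puts(words * 8)
    # => 80 bytes of heap for a 5-byte string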

@lpil
Author

lpil commented Jan 31, 2018

Hi @awetzel , thanks for looking into this.

Here's the streaming code. We've not used discard, so I suspect it could be optimised.

data =
  xml_stream
  |> stream_tags(:CommercialDetail)
  |> Stream.map(fn {_, doc} ->
    %{
      clock_number: xpath(doc, ~x"./ClockNumber/text()"s),
      actioned_at: xpath(doc, ~x"./ActionDate/text()"s),
      presentation_codes: xpath(doc, ~x"./CommercialPresentations/Presentation/Code/text()"sl),
      restriction_codes: xpath(doc, ~x"./CommercialRestrictions/Restriction/Code/text()"sl),
      status: xpath(doc, ~x"./Status/text()"s),
      vod_final_action_id: xpath(doc, ~x"./VODFinalActionId/text()"s),
      final_action_id: xpath(doc, ~x"./FinalActionId/text()"s)
    }
  end)
  |> Enum.to_list()

Cheers,
Louis

@antoinereyt
Contributor

Hi,

Thanks for raising this issue; I've just updated the documentation to mention this option.

For your specific case, you should use the discard option this way:

|> stream_tags(:"CommercialDetail", discard: [:"CommercialDetail"])
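
Put together with the streaming code above, the whole pipeline might look like this (a sketch; `path` is a placeholder and the field list is trimmed to one entry):

    import SweetXml

    # Sketch: stream <CommercialDetail> elements and discard each one from
    # the partially built document once it has been emitted downstream.
    data =
      File.stream!(path)
      |> stream_tags(:CommercialDetail, discard: [:CommercialDetail])
      |> Stream.map(fn {_tag, doc} ->
        %{clock_number: xpath(doc, ~x"./ClockNumber/text()"s)}
      end)
      |> Enum.to_list()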

Can you keep me posted, so that I can close the issue ?

Thanks.

@steffkes

@antoinereyt would you mind explaining why this is?

From reading the code, I'd guess that it's meant to free the memory used for the current iteration once it is done and we're on our way to the next one. Is that somewhat correct?

@antoinereyt
Contributor

@awetzel do you have any details from your investigations?

@lpil
Author

lpil commented Feb 26, 2018

Hi @antoinereyt, I'm afraid we were unable to find a solution with the streaming code before your message, so we rewrote it to use another XML library to meet our deadline.

@gmalkas

gmalkas commented Apr 16, 2018

Hi @antoinereyt, just wanted to let you know we managed to reduce our memory consumption while parsing ~10MB XML files by up to 800MB thanks to the discard option.

We were really surprised by the need for this option, since bounded memory consumption was the reason we used the streaming interface in the first place (we only found out thanks to this issue, as we shipped our code months ago, before the docs mentioned the option).

I understand there is a trade-off and it's hard to avoid surprising behaviour: either you discard tags by default, but then the streaming output might differ from the non-streaming output for no obvious reason, or you do not discard by default to keep the output consistent, but memory usage blows up.

Assuming the main reason people end up using the streaming API is to get bounded memory usage, wouldn't it be fair to discard by default, with documentation warning of the consequences on the output?

Thanks for the work.

@9mm

9mm commented Nov 29, 2018

@lpil which one did you choose that supports xpath?

@lpil
Author

lpil commented Nov 29, 2018

@9mm we used https://github.com/processone/fast_xml/

@9mm

9mm commented Nov 29, 2018

@lpil so I looked into that, but how did you get it to support xpath?

@lpil
Author

lpil commented Nov 29, 2018

I don't believe we used xpath, though I've left the company so I can't be sure.

@augnustin

I'm also having memory issues...

I don't understand the comment:

the streaming output might differ from the non-streaming output for no obvious reason

The :discard option is really undocumented. What does it do?

I read there:

It would be an interesting improvement if we could get :xmerl_xpath.string/2 to return an Elixir stream instead of a list. Maybe someday ...

I'm guessing this still hasn't been implemented, has it?

As far as I'm concerned, this doesn't feel that complicated: I have a huge list of <FICHE /> elements, each of which needs to be processed independently. That means all memory could be reclaimed at each iteration ... but it obviously isn't. 😢

    File.stream!(filename)
    |> SweetXml.stream_tags(:FICHE, discard: [:FICHE])
    |> Stream.map(fn {_, fiche} -> transform.(fiche) end)
    |> Enum.to_list()

Any ideas? Suggestions? Should I also move to another library?
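
One generic BEAM workaround worth trying here (an assumption on my part, not something confirmed in this thread): run each transform in a short-lived process, so the garbage it produces is reclaimed when the process exits instead of accumulating on the streaming process's heap.

    # Sketch: wrap each iteration in a throwaway Task; its heap is freed on
    # exit. `filename` and `transform` are the same as in the snippet above.
    File.stream!(filename)
    |> SweetXml.stream_tags(:FICHE, discard: [:FICHE])
    |> Stream.map(fn {_, fiche} ->
      Task.async(fn -> transform.(fiche) end) |> Task.await()
    end)
    |> Enum.to_list()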

@lpil
Author

lpil commented Oct 7, 2020

Since I made this issue, an excellent XML pull parser that uses a very small amount of memory was released. It worked really well for us. I've been searching for 15 minutes but I can't find it now (I've forgotten the name), but it exists!

@thbar
Contributor

thbar commented Oct 7, 2020

@lpil by any chance, is it this one? https://github.com/zadean/yaccety_sax (I haven't tested it yet, but it came across my radar)

@lpil
Author

lpil commented Oct 7, 2020

Yes! It was head and shoulders better than the rest memory-wise. A shame it's not well known.

@Shakadak Shakadak self-assigned this Mar 9, 2021