Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rewrite strip_nastyhtml, strip_html in Qt #1341

Draft
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

tsteven4
Copy link
Collaborator

This fixes some bugs in strip_nastyhtml that could result in invalid html being produced.

When the html output of the new test and the old code was validated we used to get these errors:

Error: Bogus comment.

[At line 28, column 47](https://validator.w3.org/nu/#cl28c47)

eldescshort"><!   >The cache i
Error: Stray start tag html.

[From line 35, column 44; to line 35, column 49](https://validator.w3.org/nu/#l35c49)

desclong"><html><!--y

and actually produce valid html:
1. the replacement for "<body>", "<!   >", is invalid.
2. leaving an html tag in causes the html format output to be invalid.
@tsteven4 tsteven4 added the bug label Sep 14, 2024
@tsteven4
Copy link
Collaborator Author

It is strange this fails on jammy but not noble. It fails in the gpx reader which hasn't changed, but the file being read is new, and validates.

==205210== Invalid read of size 16
==205210==    at 0x4B772DF: ??? (in /usr/lib/x86_64-linux-gnu/libQt6Core.so.6.2.4)
==205210==    by 0x4AF979C: ??? (in /usr/lib/x86_64-linux-gnu/libQt6Core.so.6.2.4)
==205210==    by 0x4AF5002: ??? (in /usr/lib/x86_64-linux-gnu/libQt6Core.so.6.2.4)
==205210==    by 0x4AF6770: QXmlStreamReader::readNext() (in /usr/lib/x86_64-linux-gnu/libQt6Core.so.6.2.4)
==205210==    by 0x1D24E4: GpxFormat::read() (gpx.cc:1126)

@tsteven4
Copy link
Collaborator Author

Indeed 1.9.0 fails the same way reading the same file (reference/gc/GCGCA8_nasty.gpx)

@tsteven4
Copy link
Collaborator Author

==206079== Invalid read of size 16
==206079==    at 0x4B772DF: ??? (in /usr/lib/x86_64-linux-gnu/libQt6Core.so.6.2.4)
==206079==    by 0x4AF979C: ??? (in /usr/lib/x86_64-linux-gnu/libQt6Core.so.6.2.4)
==206079==    by 0x4AF5002: ??? (in /usr/lib/x86_64-linux-gnu/libQt6Core.so.6.2.4)
==206079==    by 0x4AF6770: QXmlStreamReader::readNext() (in /usr/lib/x86_64-linux-gnu/libQt6Core.so.6.2.4)
==206079==    by 0x1CD474: GpxFormat::read() (gpx.cc:1002)
==206079==    by 0x24EB8F: run_reader(Vecs::fmtinfo_t&, QString const&) (main.cc:267)
==206079==    by 0x24FD2D: run(char const*) (main.cc:406)
==206079==    by 0x251DF2: main (main.cc:815)
==206079==  Address 0x7949e58 is 520 bytes inside a block of size 530 alloc'd
==206079==    at 0x4848899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==206079==    by 0x4B49C53: QArrayData::allocate(QArrayData**, long long, long long, long long, QArrayData::AllocationOption) (in /usr/lib/x86_64-linux-gnu/libQt6Core.so.6.2.4)
==206079==    by 0x4B24219: QString::reallocData(long long, QArrayData::AllocationOption) (in /usr/lib/x86_64-linux-gnu/libQt6Core.so.6.2.4)
==206079==    by 0x4AE733C: ??? (in /usr/lib/x86_64-linux-gnu/libQt6Core.so.6.2.4)
==206079==    by 0x4AF265A: ??? (in /usr/lib/x86_64-linux-gnu/libQt6Core.so.6.2.4)
==206079==    by 0x4AF6770: QXmlStreamReader::readNext() (in /usr/lib/x86_64-linux-gnu/libQt6Core.so.6.2.4)
==206079==    by 0x1CD474: GpxFormat::read() (gpx.cc:1002)
==206079==    by 0x24EB8F: run_reader(Vecs::fmtinfo_t&, QString const&) (main.cc:267)
==206079==    by 0x24FD2D: run(char const*) (main.cc:406)
==206079==    by 0x251DF2: main (main.cc:815)
```

@tsteven4
Copy link
Collaborator Author

Fails on jammy with Qt 6.3.1, passes with 6.3.2

@tsteven4
Copy link
Collaborator Author

tsteven4 commented Sep 15, 2024

6.3.1 failures are intermittent!

@robertlipe
Copy link
Collaborator

I have a variation of this very PR in a work tree in some directory in come computer somewhere that tries to solve the same problem.

Helpful, eh?

It's less slavish to the literal transliteration of the C code. It's probably less tolerant (or at least differently tolerant) to code that's malformed in different ways. I just broke down and regex-ed the heck out of the input.

I, too, noticed the absence of test coverage.

I didn't think that approach was totally awesome. (I withheld it from the merges for a reason, even if it was a bad one.) There may be ideas in it worth mining, though. Let me see if I can dig that up on Sunday.

@tsteven4
Copy link
Collaborator Author

After implementing nasty with regex I could imagine a "less slavish to the literal transliteration" implementation of strip_html that would be easier on the eyes.

@tsteven4
Copy link
Collaborator Author

Qt 6.2.4 from qt.io with debug_info, jammy:

==229990== Invalid read of size 16
==229990== at 0x52C3934: _mm_loadu_si128 (emmintrin.h:703)
==229990== by 0x52C3934: aeshash(unsigned char const*, unsigned long, unsigned long) (qhash.cpp:504)
==229990== by 0x523B13C: calculateHash (qhash.h:93)
==229990== by 0x523B13C: find (qhash.h:593)
==229990== by 0x523B13C: QHash<QStringView, QXmlStreamReaderPrivate::Entity>::find(QStringView const&) (qhash.h:1143)
==229990== by 0x52367B9: QXmlStreamReaderPrivate::parse() (qxmlstreamparser_p.h:866)
==229990== by 0x52381BE: QXmlStreamReader::readNext() (qxmlstream.cpp:609)
==229990== by 0x1D24E4: GpxFormat::read() (gpx.cc:1126)
==229990== by 0x25C407: run_reader(Vecs::fmtinfo_t&, QString const&) (main.cc:263)
==229990== by 0x25D5A5: run(char const*) (main.cc:402)
==229990== by 0x25F66A: main (main.cc:811)
==229990== Address 0x8092208 is 520 bytes inside a block of size 530 alloc'd
==229990== at 0x4E050C5: malloc (vg_replace_malloc.c:442)
==229990== by 0x52914A0: allocateData (qarraydata.cpp:178)
==229990== by 0x52914A0: QArrayData::allocate(QArrayData**, long long, long long, long long, QArrayData::AllocationOption) (qarraydata.cpp:227)
==229990== by 0x5267C18: allocate (qarraydata.h:141)
==229990== by 0x5267C18: QString::reallocData(long long, QArrayData::AllocationOption) (qstring.cpp:2589)
==229990== by 0x5238B2C: reserve (qstring.h:1307)
==229990== by 0x5238B2C: QXmlStreamReaderPrivate::clearTextBuffer() (qxmlstream_p.h:423)
==229990== by 0x5233CEF: QXmlStreamReaderPrivate::parse() (qxmlstreamparser_p.h:148)
==229990== by 0x52381BE: QXmlStreamReader::readNext() (qxmlstream.cpp:609)
==229990== by 0x1D24E4: GpxFormat::read() (gpx.cc:1126)
==229990== by 0x25C407: run_reader(Vecs::fmtinfo_t&, QString const&) (main.cc:263)
==229990== by 0x25D5A5: run(char const*) (main.cc:402)
==229990== by 0x25F66A: main (main.cc:811)

@tsteven4 tsteven4 marked this pull request as draft September 15, 2024 14:20
@tsteven4
Copy link
Collaborator Author

I'll get back to a strip_html regex implementation.

@GPSBabelDeveloper
Copy link
Collaborator

GPSBabelDeveloper commented Sep 15, 2024 via email

util.cc Outdated Show resolved Hide resolved
@tsteven4
Copy link
Collaborator Author

See if there's anything relevant worth salvaging in https://gist.github.com/robertlipe/0b8ef673af7c07e33de323d4b8bddb19?permalink_comment_id=5192732#gistcomment-5192732 I wasn't proud of strip_html, but I never liked the implementation we have, either.

On Sun, Sep 15, 2024 at 9:21 AM tsteven4 @.> wrote: I'll get back to a strip_html regex implementation. — Reply to this email directly, view it on GitHub <#1341 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC3VAD4Y4MC3P5J7J7LBATDZWWJWLAVCNFSM6AAAAABOHGKQEWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNJRGYYTIOBUGE . You are receiving this because you are subscribed to this thread.Message ID: @.>

I think the current implementation methods are better. There are some expanded matches in your implementation that aren't included, although I did add striping of the html start tag in strip_nastyhtml. My try 1 (reverted) on strip_html analogous to yours never passed regression. Historically and currently these implementations are limited (as noted in a comment).

I think the regression failure is a Qt bug, perhaps unfixed. I did see 6.5.3 fail. It suspect Qt goes off to handle an entity, which causes a realloc, and then returns without realizing a reference to the old memory block is invalid. I can work around it by massaging the new test reference. Creating a repeatable test case might be hard. It seems to fail solidly on jammy with 6.2.4, but it is intermittent with never versions of Qt. Although one may exist I haven't found a relevant QTBUG report.

static const QRegularExpression re("(?:<(?<tag>[^ >]*).*?>)|(?:&(?<entity>.*?);)|(?<other>[^<&]+)|(?<fragment>.+)",
QRegularExpression::DotMatchesEverythingOption);
assert(re.isValid());
static const QRegularExpression newlinespace_re("\\n\\s*");
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider changing behavior with newlinespace_re("\s*\n\s*");

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants