rewrite strip_nastyhtml, strip_html in Qt #1341

tsteven4 · 2024-09-14T22:06:38Z

This fixes some bugs in strip_nastyhtml that could result in invalid html being produced.

When the html output of the new test and the old code was validated we used to get these errors:

Error: Bogus comment.

[At line 28, column 47](https://validator.w3.org/nu/#cl28c47)

eldescshort"><!   >The cache i

Error: Stray start tag html.

[From line 35, column 44; to line 35, column 49](https://validator.w3.org/nu/#l35c49)

desclong"><html><!--y

and actually produce valid html: 1. the replacement for "<body>", "<! >", is invalid. 2. leaving an html tag in causes the html format output to be invalid.

tsteven4 · 2024-09-15T00:44:45Z

It is strange this fails on jammy but not noble. It fails in the gpx reader which hasn't changed, but the file being read is new, and validates.

==205210== Invalid read of size 16
==205210==    at 0x4B772DF: ??? (in /usr/lib/x86_64-linux-gnu/libQt6Core.so.6.2.4)
==205210==    by 0x4AF979C: ??? (in /usr/lib/x86_64-linux-gnu/libQt6Core.so.6.2.4)
==205210==    by 0x4AF5002: ??? (in /usr/lib/x86_64-linux-gnu/libQt6Core.so.6.2.4)
==205210==    by 0x4AF6770: QXmlStreamReader::readNext() (in /usr/lib/x86_64-linux-gnu/libQt6Core.so.6.2.4)
==205210==    by 0x1D24E4: GpxFormat::read() (gpx.cc:1126)

tsteven4 · 2024-09-15T00:47:09Z

Indeed 1.9.0 fails the same way reading the same file (reference/gc/GCGCA8_nasty.gpx)

tsteven4 · 2024-09-15T00:49:39Z

==206079== Invalid read of size 16
==206079==    at 0x4B772DF: ??? (in /usr/lib/x86_64-linux-gnu/libQt6Core.so.6.2.4)
==206079==    by 0x4AF979C: ??? (in /usr/lib/x86_64-linux-gnu/libQt6Core.so.6.2.4)
==206079==    by 0x4AF5002: ??? (in /usr/lib/x86_64-linux-gnu/libQt6Core.so.6.2.4)
==206079==    by 0x4AF6770: QXmlStreamReader::readNext() (in /usr/lib/x86_64-linux-gnu/libQt6Core.so.6.2.4)
==206079==    by 0x1CD474: GpxFormat::read() (gpx.cc:1002)
==206079==    by 0x24EB8F: run_reader(Vecs::fmtinfo_t&, QString const&) (main.cc:267)
==206079==    by 0x24FD2D: run(char const*) (main.cc:406)
==206079==    by 0x251DF2: main (main.cc:815)
==206079==  Address 0x7949e58 is 520 bytes inside a block of size 530 alloc'd
==206079==    at 0x4848899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==206079==    by 0x4B49C53: QArrayData::allocate(QArrayData**, long long, long long, long long, QArrayData::AllocationOption) (in /usr/lib/x86_64-linux-gnu/libQt6Core.so.6.2.4)
==206079==    by 0x4B24219: QString::reallocData(long long, QArrayData::AllocationOption) (in /usr/lib/x86_64-linux-gnu/libQt6Core.so.6.2.4)
==206079==    by 0x4AE733C: ??? (in /usr/lib/x86_64-linux-gnu/libQt6Core.so.6.2.4)
==206079==    by 0x4AF265A: ??? (in /usr/lib/x86_64-linux-gnu/libQt6Core.so.6.2.4)
==206079==    by 0x4AF6770: QXmlStreamReader::readNext() (in /usr/lib/x86_64-linux-gnu/libQt6Core.so.6.2.4)
==206079==    by 0x1CD474: GpxFormat::read() (gpx.cc:1002)
==206079==    by 0x24EB8F: run_reader(Vecs::fmtinfo_t&, QString const&) (main.cc:267)
==206079==    by 0x24FD2D: run(char const*) (main.cc:406)
==206079==    by 0x251DF2: main (main.cc:815)
```

tsteven4 · 2024-09-15T01:18:37Z

Fails on jammy with Qt 6.3.1, passes with 6.3.2

tsteven4 · 2024-09-15T01:26:12Z

6.3.1 failures are intermittent!

robertlipe · 2024-09-15T10:16:16Z

I have a variation of this very PR in a work tree in some directory in come computer somewhere that tries to solve the same problem.

Helpful, eh?

It's less slavish to the literal transliteration of the C code. It's probably less tolerant (or at least differently tolerant) to code that's malformed in different ways. I just broke down and regex-ed the heck out of the input.

I, too, noticed the absence of test coverage.

I didn't think that approach was totally awesome. (I withheld it from the merges for a reason, even if it was a bad one.) There may be ideas in it worth mining, though. Let me see if I can dig that up on Sunday.

tsteven4 · 2024-09-15T12:28:14Z

After implementing nasty with regex I could imagine a "less slavish to the literal transliteration" implementation of strip_html that would be easier on the eyes.

tsteven4 · 2024-09-15T13:12:51Z

Qt 6.2.4 from qt.io with debug_info, jammy:

==229990== Invalid read of size 16
==229990== at 0x52C3934: _mm_loadu_si128 (emmintrin.h:703)
==229990== by 0x52C3934: aeshash(unsigned char const*, unsigned long, unsigned long) (qhash.cpp:504)
==229990== by 0x523B13C: calculateHash (qhash.h:93)
==229990== by 0x523B13C: find (qhash.h:593)
==229990== by 0x523B13C: QHash<QStringView, QXmlStreamReaderPrivate::Entity>::find(QStringView const&) (qhash.h:1143)
==229990== by 0x52367B9: QXmlStreamReaderPrivate::parse() (qxmlstreamparser_p.h:866)
==229990== by 0x52381BE: QXmlStreamReader::readNext() (qxmlstream.cpp:609)
==229990== by 0x1D24E4: GpxFormat::read() (gpx.cc:1126)
==229990== by 0x25C407: run_reader(Vecs::fmtinfo_t&, QString const&) (main.cc:263)
==229990== by 0x25D5A5: run(char const*) (main.cc:402)
==229990== by 0x25F66A: main (main.cc:811)
==229990== Address 0x8092208 is 520 bytes inside a block of size 530 alloc'd
==229990== at 0x4E050C5: malloc (vg_replace_malloc.c:442)
==229990== by 0x52914A0: allocateData (qarraydata.cpp:178)
==229990== by 0x52914A0: QArrayData::allocate(QArrayData**, long long, long long, long long, QArrayData::AllocationOption) (qarraydata.cpp:227)
==229990== by 0x5267C18: allocate (qarraydata.h:141)
==229990== by 0x5267C18: QString::reallocData(long long, QArrayData::AllocationOption) (qstring.cpp:2589)
==229990== by 0x5238B2C: reserve (qstring.h:1307)
==229990== by 0x5238B2C: QXmlStreamReaderPrivate::clearTextBuffer() (qxmlstream_p.h:423)
==229990== by 0x5233CEF: QXmlStreamReaderPrivate::parse() (qxmlstreamparser_p.h:148)
==229990== by 0x52381BE: QXmlStreamReader::readNext() (qxmlstream.cpp:609)
==229990== by 0x1D24E4: GpxFormat::read() (gpx.cc:1126)
==229990== by 0x25C407: run_reader(Vecs::fmtinfo_t&, QString const&) (main.cc:263)
==229990== by 0x25D5A5: run(char const*) (main.cc:402)
==229990== by 0x25F66A: main (main.cc:811)

This reverts commit b0440f7.

This reverts commit 40fe2ef.

This reverts commit 677da95.

tsteven4 · 2024-09-15T14:21:04Z

I'll get back to a strip_html regex implementation.

GPSBabelDeveloper · 2024-09-15T17:39:50Z

See if there's anything relevant worth salvaging in https://gist.github.com/robertlipe/0b8ef673af7c07e33de323d4b8bddb19?permalink_comment_id=5192732#gistcomment-5192732 I wasn't *proud* of strip_html, but I never liked the implementation we have, either.

…

On Sun, Sep 15, 2024 at 9:21 AM tsteven4 ***@***.***> wrote: I'll get back to a strip_html regex implementation. — Reply to this email directly, view it on GitHub <#1341 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AC3VAD4Y4MC3P5J7J7LBATDZWWJWLAVCNFSM6AAAAABOHGKQEWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNJRGYYTIOBUGE> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

util.cc

like p, param, pre.

tsteven4 · 2024-09-16T00:05:06Z

See if there's anything relevant worth salvaging in https://gist.github.com/robertlipe/0b8ef673af7c07e33de323d4b8bddb19?permalink_comment_id=5192732#gistcomment-5192732 I wasn't proud of strip_html, but I never liked the implementation we have, either.
…
On Sun, Sep 15, 2024 at 9:21 AM tsteven4 @.> wrote: I'll get back to a strip_html regex implementation. — Reply to this email directly, view it on GitHub <#1341 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC3VAD4Y4MC3P5J7J7LBATDZWWJWLAVCNFSM6AAAAABOHGKQEWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNJRGYYTIOBUGE . You are receiving this because you are subscribed to this thread.Message ID: @.>

I think the current implementation methods are better. There are some expanded matches in your implementation that aren't included, although I did add striping of the html start tag in strip_nastyhtml. My try 1 (reverted) on strip_html analogous to yours never passed regression. Historically and currently these implementations are limited (as noted in a comment).

I think the regression failure is a Qt bug, perhaps unfixed. I did see 6.5.3 fail. It suspect Qt goes off to handle an entity, which causes a realloc, and then returns without realizing a reference to the old memory block is invalid. I can work around it by massaging the new test reference. Creating a repeatable test case might be hard. It seems to fail solidly on jammy with 6.2.4, but it is intermittent with never versions of Qt. Although one may exist I haven't found a relevant QTBUG report.

tsteven4 · 2024-09-16T12:04:32Z

util.cc

+  static const QRegularExpression re("(?:<(?<tag>[^ >]*).*?>)|(?:&(?<entity>.*?);)|(?<other>[^<&]+)|(?<fragment>.+)",
+                                     QRegularExpression::DotMatchesEverythingOption);
+  assert(re.isValid());
+  static const QRegularExpression newlinespace_re("\\n\\s*");


consider changing behavior with newlinespace_re("\s*\n\s*");

tsteven4 added 2 commits September 14, 2024 12:34

rewrite strip_html with QString.

ff59506

rewrite strip_nasty_html in Qt.

28970bb

and actually produce valid html: 1. the replacement for "<body>", "<! >", is invalid. 2. leaving an html tag in causes the html format output to be invalid.

tsteven4 added the bug label Sep 14, 2024

cleanup comment xstrdup

cb110d2

use regex for strip_html

677da95

tsteven4 added 5 commits September 15, 2024 07:53

strip_html deletes other tags

40fe2ef

fix strip_html img tag handling

b0440f7

Revert "fix strip_html img tag handling"

06c9727

This reverts commit b0440f7.

Revert "strip_html deletes other tags"

f5eeaeb

This reverts commit 40fe2ef.

Revert "use regex for strip_html"

a10ab75

This reverts commit 677da95.

tsteven4 marked this pull request as draft September 15, 2024 14:20

tsteven4 added 3 commits September 15, 2024 14:44

implement strip_html using QRegularExpressionMatchIterator.

7281389

a little cleanup

8e722ae

remove obsolete include

08bb082

tsteven4 commented Sep 15, 2024

View reviewed changes

util.cc Outdated Show resolved Hide resolved

take care to distinguish tags with common roots

06b8027

like p, param, pre.

tsteven4 commented Sep 16, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rewrite strip_nastyhtml, strip_html in Qt #1341

rewrite strip_nastyhtml, strip_html in Qt #1341

tsteven4 commented Sep 14, 2024

tsteven4 commented Sep 15, 2024

tsteven4 commented Sep 15, 2024

tsteven4 commented Sep 15, 2024

tsteven4 commented Sep 15, 2024

tsteven4 commented Sep 15, 2024 •

edited

Loading

robertlipe commented Sep 15, 2024

tsteven4 commented Sep 15, 2024

tsteven4 commented Sep 15, 2024

tsteven4 commented Sep 15, 2024

GPSBabelDeveloper commented Sep 15, 2024 via email

tsteven4 commented Sep 16, 2024

tsteven4 Sep 16, 2024

rewrite strip_nastyhtml, strip_html in Qt #1341

Are you sure you want to change the base?

rewrite strip_nastyhtml, strip_html in Qt #1341

Conversation

tsteven4 commented Sep 14, 2024

tsteven4 commented Sep 15, 2024

tsteven4 commented Sep 15, 2024

tsteven4 commented Sep 15, 2024

tsteven4 commented Sep 15, 2024

tsteven4 commented Sep 15, 2024 • edited Loading

robertlipe commented Sep 15, 2024

tsteven4 commented Sep 15, 2024

tsteven4 commented Sep 15, 2024

tsteven4 commented Sep 15, 2024

GPSBabelDeveloper commented Sep 15, 2024 via email

tsteven4 commented Sep 16, 2024

tsteven4 Sep 16, 2024

Choose a reason for hiding this comment

tsteven4 commented Sep 15, 2024 •

edited

Loading