rewrite strip_nastyhtml, strip_html in Qt #1341
base: master
and actually produce valid html: 1. the replacement for "<body>", "<! >", is invalid. 2. leaving an html tag in causes the html format output to be invalid.
It is strange this fails on jammy but not noble. It fails in the gpx reader which hasn't changed, but the file being read is new, and validates.
Indeed 1.9.0 fails the same way reading the same file (reference/gc/GCGCA8_nasty.gpx)
Fails on jammy with Qt 6.3.1, passes with 6.3.2
6.3.1 failures are intermittent!
I have a variation of this very PR in a work tree in some directory on some computer somewhere that tries to solve the same problem. Helpful, eh? It's less slavish to the literal transliteration of the C code. It's probably less tolerant (or at least differently tolerant) of input that's malformed in different ways. I just broke down and regex-ed the heck out of the input. I, too, noticed the absence of test coverage. I didn't think that approach was totally awesome. (I withheld it from the merges for a reason, even if it was a bad one.) There may be ideas in it worth mining, though. Let me see if I can dig that up on Sunday.
After implementing nasty with regex I could imagine a "less slavish to the literal transliteration" implementation of strip_html that would be easier on the eyes.
Qt 6.2.4 from qt.io with debug_info, jammy: ==229990== Invalid read of size 16
I'll get back to a strip_html regex implementation.
See if there's anything relevant worth salvaging in
https://gist.github.com/robertlipe/0b8ef673af7c07e33de323d4b8bddb19?permalink_comment_id=5192732#gistcomment-5192732
I wasn't *proud* of strip_html, but I never liked the implementation we have, either.
On Sun, Sep 15, 2024 at 9:21 AM tsteven4 wrote:
> I'll get back to a strip_html regex implementation.
like p, param, pre.
I think the current implementation methods are better. There are some expanded matches in your implementation that aren't included, although I did add stripping of the html start tag in strip_nastyhtml. My try 1 (reverted) on strip_html, analogous to yours, never passed regression. Historically and currently these implementations are limited (as noted in a comment). I think the regression failure is a Qt bug, perhaps unfixed. I did see 6.5.3 fail. I suspect Qt goes off to handle an entity, which causes a realloc, and then returns without realizing a reference to the old memory block is invalid. I can work around it by massaging the new test reference. Creating a repeatable test case might be hard. It seems to fail solidly on jammy with 6.2.4, but it is intermittent with newer versions of Qt. Although one may exist, I haven't found a relevant QTBUG report.
static const QRegularExpression re("(?:<(?<tag>[^ >]*).*?>)|(?:&(?<entity>.*?);)|(?<other>[^<&]+)|(?<fragment>.+)",
                                   QRegularExpression::DotMatchesEverythingOption);
assert(re.isValid());
static const QRegularExpression newlinespace_re("\\n\\s*");
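To see how the four alternatives in that pattern divide up the input, here is a minimal sketch of the same tokenization. It is not the PR's code: it assumes std::regex in place of QRegularExpression, so the named groups become numbered groups and `[\s\S]` stands in for DotMatchesEverythingOption. The function name `tokenize` and the pair-based return type are hypothetical, chosen only for illustration.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Sketch only: std::regex has no named groups, and '.' does not match
// newlines, so groups 1..4 play the roles of tag, entity, other and
// fragment, and [\s\S] replaces '.' with DotMatchesEverythingOption.
// tokenize() is a hypothetical helper, not a function from the PR.
std::vector<std::pair<std::string, std::string>> tokenize(const std::string& in) {
    static const std::regex re(
        R"((?:<([^ >]*)[\s\S]*?>)|(?:&([\s\S]*?);)|([^<&]+)|([\s\S]+))");
    static const char* const kinds[] = {"tag", "entity", "other", "fragment"};

    std::vector<std::pair<std::string, std::string>> tokens;
    for (std::sregex_iterator it(in.begin(), in.end(), re), end; it != end; ++it) {
        // Exactly one of the four groups participates in each match.
        for (int g = 1; g <= 4; ++g) {
            if ((*it)[g].matched) {
                tokens.emplace_back(kinds[g - 1], (*it)[g].str());
                break;
            }
        }
    }
    return tokens;
}
```

Under these assumptions, a complete tag yields only its name, an entity yields the text between `&` and `;`, and anything that starts like markup but never closes (an unterminated `<b`, or a bare `&` with no `;`) falls through to the catch-all fragment group along with the rest of the input.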
consider changing behavior with newlinespace_re("\\s*\\n\\s*"); so that whitespace before the newline is matched as well as whitespace after it.
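The difference between the two patterns can be shown with a small sketch. It assumes std::regex (ECMAScript grammar) in place of QRegularExpression, and it assumes each match is replaced by a single space; what the PR actually substitutes is not shown in this thread. The helper names `collapse_after` and `collapse_around` are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helpers; std::regex stands in for QRegularExpression and
// the single-space replacement is an assumption made for illustration.

// Current pattern: a newline plus any whitespace that follows it.
std::string collapse_after(const std::string& in) {
    static const std::regex re("\\n\\s*");
    return std::regex_replace(in, re, " ");
}

// Suggested pattern: also consumes whitespace that precedes the newline.
std::string collapse_around(const std::string& in) {
    static const std::regex re("\\s*\\n\\s*");
    return std::regex_replace(in, re, " ");
}
```

With the input `"foo  \n   bar"`, the first helper leaves the two trailing spaces before the newline in place and produces `"foo   bar"`, while the second consumes them and produces `"foo bar"`.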
This fixes some bugs in strip_nastyhtml that could result in invalid html being produced.
When the html output of the new test and the old code was validated we used to get these errors: