Incorrect link rewriting #265

rgaudin · 2022-07-01T11:43:18Z

There is a bad link in the article users/241039/oleh-prypin (entry idx=58950210) in the ZIM file:

<a href="//meta.stackoverflow.com/../../users/17034/hans-passant?tab=answers&amp;sort=votes#user-tab-answers">

The corresponding link in the current version of https://stackoverflow.com/users/241039/oleh-prypin is

<a href="//meta.stackoverflow.com/users/17034/hans-passant?tab=answers&amp;sort=votes#user-tab-answers">

The scraper added an unjustified "../.." to an absolute URL belonging to a different domain.

Of course zimcheck shouldn't crash because of it, and I will fix that. But now you have an early indication of the bug in the scraper.

Originally posted by @veloman-yunkan in openzim/zim-tools#305 (comment)

rgaudin · 2022-07-06T11:55:58Z

@TheCrazyT how is this relevant to this ticket? Did you answer here by mistake?

Regarding BS not being threadsafe: it's probably not the issue, as BS is instantiated each time in the rewriter, which is called by the rewrote() Jinja filter in templates. We may well be leaking around this, but BS not being threadsafe is probably not at play.

Regarding this ticket now: it appears that the above link is actually correct, although it should probably not have been tampered with to begin with, as it is an external link. The problem is probably the same-scheme // that was not taken care of.

TheCrazyT · 2022-07-06T11:56:05Z

Well, I guess the problem is that it just substitutes the current URL but ignores to_root, which is already ../.. .

sotoki/src/sotoki/utils/html.py

Line 257 in 841a968

uri_path = re.sub(r"^(\.\.?/)+", "", uri.path)

sotoki/src/sotoki/renderer.py

Line 203 in 841a968

to_root="../../",
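To make the two referenced lines concrete, here is a minimal Python sketch of what they do in isolation. It is not the sotoki implementation: the helper name `strip_leading_dots` and the constant `TO_ROOT` are hypothetical, and only the regex and the `"../../"` value are taken from the lines quoted above. It also shows how the standard library parses the link from this ticket, which has no scheme but does have a host.

```python
import re
from urllib.parse import urlsplit

# Value passed as to_root in the renderer line quoted above.
TO_ROOT = "../../"


def strip_leading_dots(path: str) -> str:
    """Drop leading "./" and "../" segments (same regex as the html.py line)."""
    return re.sub(r"^(\.\.?/)+", "", path)


# A genuinely relative path loses its leading dot segments.
relative = strip_leading_dots("../../users/17034/hans-passant")

# The link from this ticket: no scheme, but a host (netloc) is present.
uri = urlsplit("//meta.stackoverflow.com/users/17034/hans-passant?tab=answers")

print(relative)            # path without the leading "../../"
print(repr(uri.scheme))    # empty string
print(uri.netloc)          # the external host
print(uri.path)            # starts with "/", so the regex leaves it untouched
```

Because `uri.scheme` is empty, any check that treats "no scheme" as "relative" would send this external link through the relative-link rewriting path.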

TheCrazyT · 2022-07-06T11:57:53Z

Sorry, the comment about beautifulsoup was a mistake. That's why I deleted it and hoped you wouldn't notice 😀

TheCrazyT · 2022-07-06T12:15:46Z

Ah, now I see... the check for whether it is a relative URL is already wrong.

(rewrite_relative_link should never be used in those cases)

URL's with "//" are supposed to be urls with the same protocol as the original site. they are no relative urls. openzim#265
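As a rough illustration of the distinction described here, the following Python sketch classifies a link as relative only when it has neither a scheme nor a host. The function name `is_relative_url` is hypothetical, and this is not the code from the referenced commit; it only shows one way to keep "//host/path" links out of relative-link rewriting using the standard library.

```python
from urllib.parse import urlsplit


def is_relative_url(url: str) -> bool:
    """Return True only for links with neither a scheme nor a host.

    A link such as "//example.org/page" has an empty scheme but a
    non-empty netloc, so it is treated as external, not relative.
    """
    parts = urlsplit(url)
    return not parts.scheme and not parts.netloc


for link in (
    "//meta.stackoverflow.com/users/17034/hans-passant",
    "../../users/17034/hans-passant",
    "https://stackoverflow.com/users/241039/oleh-prypin",
):
    print(link, is_relative_url(link))
```

Checking `netloc` in addition to `scheme` is the key design choice: scheme-relative links inherit the protocol of the page they appear on, but they still point at an explicit host.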

kelson42 · 2022-07-06T15:43:10Z

@rgaudin This kind of primitive should be in scraperlib and tested... if not already available in a stable lib.

rgaudin · 2022-07-06T17:03:30Z

@TheCrazyT your commit looks good; can you submit a PR?

rgaudin added the bug label Jul 1, 2022

rgaudin added this to the 2.1.0 milestone Jul 1, 2022

TheCrazyT added a commit to TheCrazyT/sotoki that referenced this issue Jul 6, 2022

fixing incorrect relative url detection

d9400e1

URL's with "//" are supposed to be urls with the same protocol as the original site. they are no relative urls. openzim#265

TheCrazyT mentioned this issue Jul 6, 2022

fixing incorrect relative url detection #266

Merged

rgaudin linked a pull request Jul 6, 2022 that will close this issue

fixing incorrect relative url detection #266

Merged

rgaudin closed this as completed in #266 Jul 6, 2022
