Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect link rewriting #265

Closed
rgaudin opened this issue Jul 1, 2022 · 6 comments · Fixed by #266
Closed

Incorrect link rewriting #265

rgaudin opened this issue Jul 1, 2022 · 6 comments · Fixed by #266
Labels
Milestone

Comments

@rgaudin
Copy link
Member

rgaudin commented Jul 1, 2022

There is a bad link in the article users/241039/oleh-prypin (entry idx=58950210) in the ZIM file:

<a href="//meta.stackoverflow.com/../../users/17034/hans-passant?tab=answers&amp;sort=votes#user-tab-answers">

The corresponding link in the current version of https://stackoverflow.com/users/241039/oleh-prypin is

<a href="//meta.stackoverflow.com/users/17034/hans-passant?tab=answers&amp;sort=votes#user-tab-answers">

The scraper added unjustified "../.." in an absolute URL belonging to a different domain.

Of course zimcheck shouldn't crash because of it and I will fix that. But now you have early indication to the bug in the scraper.

Originally posted by @veloman-yunkan in openzim/zim-tools#305 (comment)

@rgaudin rgaudin added the bug label Jul 1, 2022
@rgaudin rgaudin added this to the 2.1.0 milestone Jul 1, 2022
@rgaudin
Copy link
Member Author

rgaudin commented Jul 6, 2022

@TheCrazyT how is this relevant to this ticket? Did you answer here by mistake?

Regarding BS not being threadsafe it's probably not the issue as BS is instantiated each time in the rewriter which is called by the rewrote() Jinja filter in templates. We may well be leaking around this but BS not being threadsafe is probably not at play

Regarding this ticket now, I appears that above link is actually correct, although it should probably not have been tampered with to begin with as it is an external link. Problem is probably the same-scheme // that was not taken care of.

@TheCrazyT
Copy link
Contributor

Well guess the problem is that it just substitudes the current url, but ignores to_root,that already is ../.. .

uri_path = re.sub(r"^(\.\.?/)+", "", uri.path)

to_root="../../",

@TheCrazyT
Copy link
Contributor

@TheCrazyT how is this relevant to this ticket? Did you answer here by mistake?

Regarding BS not being threadsafe it's probably not the issue as BS is instantiated each time in the rewriter which is called by the rewrote() Jinja filter in templates. We may well be leaking around this but BS not being threadsafe is probably not at play

Regarding this ticket now, I appears that above link is actually correct, although it should probably not have been tampered with to begin with as it is an external link. Problem is probably the same-scheme // that was not taken care of.

Sorry the comment about beautifulsoup was a mistake, thats why I deleted it and hoped you won't notice 😀

@TheCrazyT
Copy link
Contributor

TheCrazyT commented Jul 6, 2022

Ah now i see ... the detection if it is a relative url is actually already wrong.

(rewrite_relative_link should never be used in those cases)

TheCrazyT added a commit to TheCrazyT/sotoki that referenced this issue Jul 6, 2022
URL's with "//" are supposed to be urls with the same protocol as the original site.
they are no relative urls.
openzim#265
@kelson42
Copy link
Contributor

kelson42 commented Jul 6, 2022

@rgaudin This kind of primitive should be in scraperlib and tested... if not already available in a stable lib.

@rgaudin
Copy link
Member Author

rgaudin commented Jul 6, 2022

@TheCrazyT your commit looks good; can you submit a PR ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants