Incorrect link rewriting #265
Comments
@TheCrazyT how is this relevant to this ticket? Did you answer here by mistake? Regarding BS not being thread-safe: it's probably not the issue, as BS is instantiated each time in the rewriter, which is called by the
Regarding this ticket now, it appears that the above link is actually correct, although it should probably not have been tampered with to begin with, as it is an external link. The problem is probably the same-scheme
Well, I guess the problem is that it just substitutes the current URL but ignores to_root, which is already ../.. .
sotoki/src/sotoki/utils/html.py Line 257 in 841a968
Line 203 in 841a968
Sorry, the comment about BeautifulSoup was a mistake; that's why I deleted it and hoped you wouldn't notice 😀
Ah, now I see... the detection of whether it is a relative URL is actually already wrong. (rewrite_relative_link should never be used in those cases.)
URLs starting with "//" are supposed to be URLs with the same protocol as the original site; they are not relative URLs. openzim#265
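To make the distinction above concrete, here is a minimal sketch using only Python's standard `urllib.parse`. The helper names `is_relative_link` and `resolve` are hypothetical and are not taken from sotoki or scraperlib; the `cdn.example.com` URL is made up for illustration.

```python
from urllib.parse import urljoin, urlsplit


def is_relative_link(href: str) -> bool:
    """Return True only when href has neither a scheme nor a host.

    A scheme-relative URL such as "//cdn.example.com/lib.js" has an empty
    scheme but a non-empty netloc, so it is NOT treated as relative.
    """
    parts = urlsplit(href)
    return not parts.scheme and not parts.netloc


def resolve(page_url: str, href: str) -> str:
    """Resolve href against the page it was found on.

    urljoin borrows the scheme of page_url for "//host/path" links and
    keeps the host given in href.
    """
    return urljoin(page_url, href)


page = "https://stackoverflow.com/users/241039/oleh-prypin"
print(is_relative_link("../questions/1"))            # relative path
print(is_relative_link("//cdn.example.com/lib.js"))  # scheme-relative, not relative
print(resolve(page, "//cdn.example.com/lib.js"))
```

A check like this, run before any relative-link rewriting, would keep "//" links out of that code path.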
@rgaudin This kind of primitive should be in scraperlib and tested... if not already available in a stable lib.
@TheCrazyT your commit looks good; can you submit a PR?
There is a bad link in the article users/241039/oleh-prypin (entry idx=58950210) in the ZIM file:
The corresponding link in the current version of https://stackoverflow.com/users/241039/oleh-prypin is
The scraper added unjustified "../.." in an absolute URL belonging to a different domain.
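The failure mode described here can be illustrated with a small guard. This is only a sketch under assumptions, not sotoki's actual code: the function `rewrite_link` and its parameters `to_root` and `site_netloc` are hypothetical names, and `example.org` is a placeholder domain.

```python
from urllib.parse import urlsplit


def rewrite_link(href: str, to_root: str, site_netloc: str) -> str:
    """Prefix same-site absolute paths with to_root; leave the rest alone.

    Hypothetical helper: links pointing at another host (including
    scheme-relative "//host/..." links) are returned unchanged, so no
    "../.." is ever prepended to an external URL.
    """
    parts = urlsplit(href)

    # External host: never rewrite.
    if parts.netloc and parts.netloc != site_netloc:
        return href

    # Non-web schemes such as mailto: are left untouched.
    if parts.scheme and parts.scheme not in ("http", "https"):
        return href

    # Only absolute paths on the scraped site are re-rooted in this sketch.
    if not parts.path.startswith("/"):
        return href

    rewritten = f"{to_root}{parts.path}"
    if parts.query:
        rewritten += "?" + parts.query
    if parts.fragment:
        rewritten += "#" + parts.fragment
    return rewritten


print(rewrite_link("//example.org/profile", "../..", "stackoverflow.com"))
print(rewrite_link("/questions/5", "../..", "stackoverflow.com"))
```

The first call returns the external link untouched, while the second is re-rooted relative to the current article.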
Of course, zimcheck shouldn't crash because of it, and I will fix that. But now you have an early indication of the bug in the scraper.
Originally posted by @veloman-yunkan in openzim/zim-tools#305 (comment)