URLs are converted to lowercase and, as a result, are no longer valid #40

Open
Abdoulkadir-ali opened this issue Jun 2, 2023 · 3 comments

@Abdoulkadir-ali

Hello,
So here's a small issue.
USP converts every URL to lowercase, so if a URL contains uppercase characters, it can no longer be found.

Here's an example:
```python
from usp.tree import sitemap_tree_for_homepage

tree = sitemap_tree_for_homepage('https://www.distriartisan.fr')
```

The sitemap URLs look like this:
"https://www.distriartisan.fr/media/sitemap/sitemapProduitsAll_1.xml"

However, in the logs the URL is written like this:
2023-06-02 12:42:12,823 INFO usp.fetch_parse [7776/MainThread]: Parsing sitemap from URL https://www.distriartisan.fr/media/sitemap/sitemapproduitsall_1.xml...
2023-06-02 12:42:12,826 ERROR usp.fetch_parse [7776/MainThread]: Parsing sitemap from URL https://www.distriartisan.fr/media/sitemap/sitemapproduitsall_1.xml failed: Unsupported root element 'html'.
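
To illustrate why lowercasing the whole URL breaks it: the scheme and host of a URL are case-insensitive, but the path is not. Below is a minimal sketch of a normalization that only touches the case-insensitive parts. It is not taken from USP's code; the helper name `normalize_url_keep_path_case` is hypothetical, and it assumes the URL carries no `user:password@` part (which would also be case-sensitive).

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url_keep_path_case(url: str) -> str:
    """Lowercase only the scheme and host; leave path, query and fragment as-is.

    Hypothetical helper for illustration. Assumes no userinfo in the netloc.
    """
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path,      # case-sensitive: must not be lowercased
        parts.query,     # case-sensitive as well
        parts.fragment,
    ))


original = "HTTPS://WWW.Distriartisan.fr/media/sitemap/sitemapProduitsAll_1.xml"

# Path case survives, so the sitemap file name is unchanged.
print(normalize_url_keep_path_case(original))

# Lowercasing everything produces the URL seen in the logs above.
print(original.lower())
```

With the first approach the file name `sitemapProduitsAll_1.xml` is kept intact, while a plain `.lower()` yields `sitemapproduitsall_1.xml`, the path the server answered with an HTML page.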

@Abdoulkadir-ali
Author

You can close this; I saw your reply on another post.

For those who come after me: HTML is not case-sensitive, while XML is.

@fvermaut

@Abdoulkadir-ali What's the other post? I have the same issue and don't see how to solve it. It just seems to lowercase the URLs that are declared in robots.txt.
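
For anyone hitting the same thing, one way to sidestep the lowercasing is to read the `Sitemap:` lines out of robots.txt directly. The sketch below is only an illustration: the function name `sitemap_urls_from_robots` and the sample robots.txt content are made up (only the sitemap URL comes from the report above). The directive name is matched case-insensitively, while the URL is kept exactly as written.

```python
def sitemap_urls_from_robots(robots_txt: str) -> list[str]:
    """Collect the URLs of all 'Sitemap:' directives, preserving their case.

    Hypothetical helper for illustration, not part of USP.
    """
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments; assumes sitemap URLs contain no '#'.
        line = line.split("#", 1)[0].strip()
        # Split on the first colon only, so 'https://...' stays whole.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


# Made-up robots.txt content built around the URL from this issue.
sample_robots = """User-agent: *
Disallow: /checkout/
SITEMAP: https://www.distriartisan.fr/media/sitemap/sitemapProduitsAll_1.xml
"""

print(sitemap_urls_from_robots(sample_robots))
```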

@Abdoulkadir-ali
Author

You should parse it yourself with BeautifulSoup, @fvermaut.
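
A minimal sketch of that "parse it yourself" workaround follows. To keep it free of extra dependencies it uses the standard library's `xml.etree.ElementTree` in place of BeautifulSoup; the idea is the same. The function name `extract_locs` and the sample XML are assumptions made for illustration (only the URL is taken from the report), and fetching the document over HTTP is left out.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_locs(xml_text: str) -> list[str]:
    """Return every <loc> value from a sitemap or sitemap index, case preserved.

    Hypothetical helper for illustration, not part of USP.
    """
    root = ET.fromstring(xml_text)
    return [
        el.text.strip()
        for el in root.iter(f"{SITEMAP_NS}loc")
        if el.text and el.text.strip()
    ]


# Made-up sitemap index built around the URL from this issue.
sample_index = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.distriartisan.fr/media/sitemap/sitemapProduitsAll_1.xml</loc>
  </sitemap>
</sitemapindex>"""

print(extract_locs(sample_index))
```

The same function works on a regular `<urlset>` sitemap, since it simply walks every `<loc>` element in the sitemap namespace.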

@freddyheppell freddyheppell modified the milestones: v1.0, 0.6 Sep 5, 2024