This repository has been archived by the owner on Jun 18, 2024. It is now read-only.

HTTP status code "303 See Other" can return a non-absolute URL, which should be resolved against the URL of the page currently being crawled #93

Closed
brbog opened this issue Jul 9, 2022 · 0 comments · Fixed by #94

Collaborator

brbog commented Jul 9, 2022

While trying to crawl a site that requires form-based authentication, without supplying any authentication information, an error was thrown: org.apache.hc.client5.http.ClientProtocolException: Target host is not specified.

The root cause was that the server returned status code 303 and the URL in the Location header was /inloggen rather than an absolute URL. The URL normalizer's filter() method in turn transforms this input into http://inloggen, which is a problem for the HTTP client (hence "Target host is not specified").

Simply resolving the returned URL against the current web URL (which is a seed URL and always absolute) avoids the confusing exception. The login form is then crawled without errors, and while debugging or looking at the result it is immediately clear that some kind of authentication is needed.
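
To illustrate the idea, here is a minimal sketch of resolving a redirect target against the URL of the page being crawled, using only java.net.URI from the JDK. The class name RedirectResolver, the method resolveRedirect, and the example.com URLs are hypothetical and chosen for illustration; this is not necessarily how the fix in #94 is implemented.

```java
import java.net.URI;

// Hypothetical helper for illustration only; not the code merged for this issue.
class RedirectResolver {

    // Resolves the Location header value of a redirect against the absolute
    // URL of the page that produced it. A Location that is already absolute
    // is returned unchanged.
    static String resolveRedirect(String currentUrl, String location) {
        URI base = URI.create(currentUrl);
        return base.resolve(location.trim()).toString();
    }

    public static void main(String[] args) {
        String current = "https://example.com/secure/page";
        // Relative redirect target, as in the 303 response described above.
        System.out.println(resolveRedirect(current, "/inloggen"));
        // Absolute redirect target, passed through as-is.
        System.out.println(resolveRedirect(current, "https://login.example.com/form"));
    }
}
```

With the assumed page URL above, /inloggen resolves to https://example.com/inloggen, so the HTTP client receives a URL with a proper host. Note that URI.create throws an IllegalArgumentException for values that are not valid URIs (for example, ones containing unescaped spaces), so a real crawler would likely need to handle that case separately.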
