This is a Scrapy middleware that uses the autologin HTTP API to maintain a logged-in state for a Scrapy spider.
Autologin middleware uses autologin to make all requests while being logged in. It uses autologin to get cookies, detects logouts and tries to avoid them in the future. A single authorization domain for the spider is assumed. Autologin middleware also puts `autologin_active` into `request.meta`, which is `True` only if we are logged in (and `False` if the domain is skipped or login failed). If requests are made via Splash (and `SPLASH_URL` is set), autologin middleware passes it to autologin, and this Splash instance is also used to obtain login cookies.
It works on Python 2.7 and Python 3, and requires at least Scrapy 1.1. Install with pip:
```
pip install autologin-middleware
```
Add the autologin middleware to the project settings and specify the autologin URL:
```python
AUTOLOGIN_URL = 'http://127.0.0.1:8089'
AUTOLOGIN_ENABLED = True
DOWNLOADER_MIDDLEWARES['autologin_middleware.AutologinMiddleware'] = 605
```
Cookie support is also required. There are currently several options:

- Scrapy cookie middleware (`COOKIES_ENABLED = True`), but autologin middleware requires access to cookies, so you need to enable a custom cookie middleware:

  ```python
  DOWNLOADER_MIDDLEWARES = {
      'autologin_middleware.AutologinMiddleware': 605,
      'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': None,
      'autologin_middleware.ExposeCookiesMiddleware': 700,
  }
  ```

- scrapy-splash cookie middleware (`scrapy_splash.SplashCookiesMiddleware`)
- any other middleware that gets cookies from `request.cookies` and sets `response.cookiejar` like the scrapy-splash middleware, or exposes them in `response.flags` like `ExposeCookiesMiddleware`.
Optional but highly recommended: avoid logouts; see the optional settings and the "Avoiding logouts" section below. This is important because logout detection is less robust: on some sites the old cookies might still work, so you would end up using multiple sessions, and so on. If you never log out, you avoid all these problems.
There are some optional settings:

- `AUTOLOGIN_COOKIES`: pass auth cookies after manual login (the format is `name=value; name2=value2`).
- `AUTOLOGIN_LOGOUT_URL`: pass a URL substring to avoid.
- `AUTOLOGIN_CHECK_LOGOUT`: set to `False` in order to disable automatic logout detection: it remembers cookies obtained during login and checks them on each response to see if any disappeared. This can be problematic for sites that set a lot of cookies on login, so this is an option to disable it. If you disable it, you must rely on avoiding logout links with `link_looks_like_logout` (see below), or on setting a custom `AUTOLOGIN_LOGOUT_URL`.
- `AUTOLOGIN_USERNAME`, `AUTOLOGIN_PASSWORD`, `AUTOLOGIN_LOGIN_URL`, `AUTOLOGIN_EXTRA_JS` are passed to autologin and override values from stored credentials. `AUTOLOGIN_LOGIN_URL` is a relative URL, and can be omitted if it is the same as the start URL. `AUTOLOGIN_EXTRA_JS` is required only if you want to use the `extra_js` feature of autologin.
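For illustration, here is a hedged sketch of how these optional settings might look in a project's `settings.py`; all values (URLs, credentials) are placeholders, not real ones:

```python
# settings.py -- example values for the optional autologin settings
# (all values below are placeholders; adjust them for your project)

# Reuse cookies from a manual login instead of logging in via autologin:
# AUTOLOGIN_COOKIES = 'sessionid=abc123; csrftoken=xyz789'

# Never follow URLs containing this substring:
AUTOLOGIN_LOGOUT_URL = '/accounts/logout'

# Disable automatic logout detection (then rely on AUTOLOGIN_LOGOUT_URL
# or link_looks_like_logout to avoid logging out):
AUTOLOGIN_CHECK_LOGOUT = False

# Credentials passed to autologin, overriding stored credentials:
AUTOLOGIN_USERNAME = 'user'
AUTOLOGIN_PASSWORD = 'secret'
AUTOLOGIN_LOGIN_URL = '/login'  # relative to the start URL
```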
It is also possible to override some settings per-request via corresponding lower-case keys in `request.meta`: `autologin_username`, `autologin_password`, `autologin_extra_js`, `autologin_login_url` and `autologin_logout_url`.
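As a hedged sketch, overriding credentials for a single request might look like this (the spider, URL and credentials are made up for illustration):

```python
import scrapy


class MembersSpider(scrapy.Spider):
    # Hypothetical spider that overrides login credentials only for
    # the crawl it starts; other requests keep the values from settings.
    name = 'members'

    def start_requests(self):
        yield scrapy.Request(
            'http://example.com/members/',
            meta={
                'autologin_username': 'other-user',    # overrides AUTOLOGIN_USERNAME
                'autologin_password': 'other-secret',  # overrides AUTOLOGIN_PASSWORD
                'autologin_login_url': '/login',       # overrides AUTOLOGIN_LOGIN_URL
            },
        )
```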
Autologin middleware passes the following settings to autologin: `SPLASH_URL`, `USER_AGENT`, `HTTP_PROXY`, `HTTPS_PROXY`, so they are used for autologin requests.
There is also a utility, `autologin_middleware.link_looks_like_logout`, for checking whether a link looks like a logout link: you can use it in the spider to avoid logout links (see the sketch after the list below). Logouts are handled by the autologin middleware by default (unless `AUTOLOGIN_CHECK_LOGOUT` is `False`), but avoiding logout links can be beneficial for two reasons:

- no time is wasted retrying requests that were logged out
- in some cases, logout URLs can be unique, and the spider will be logging out continuously (for example, `/logout?sid=UNIQUE_ID`).
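A minimal sketch of using it in a spider, assuming `link_looks_like_logout` accepts links produced by Scrapy's `LinkExtractor`; the spider name and start URL are made up:

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from autologin_middleware import link_looks_like_logout


class SiteSpider(scrapy.Spider):
    # Hypothetical spider that skips links that look like logout links.
    name = 'site'
    start_urls = ['http://example.com/']

    def parse(self, response):
        for link in LinkExtractor().extract_links(response):
            if link_looks_like_logout(link):
                continue  # do not follow links that look like a logout
            yield scrapy.Request(link.url, callback=self.parse)
```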
There is no special API: autologin middleware just ensures that all requests are made while logged in. As mentioned in the "Configuration" section above, you can override some settings on a per-request basis in `request.meta`.

The autologin response is available in `response.meta['autologin_response']` if we made requests to autologin while processing this request. You might want to use the `"status"` field of the autologin response to do some bookkeeping.

The middleware also always puts `"autologin_active"` into `response.meta`, which is `True` only if we are logged in (and `False` if the domain is skipped or login failed).

Additionally, you can set the `skip_autologin` key in `request.meta` to `True` in order to completely disable the middleware for this request.
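A hedged sketch of how a spider callback might use these keys, assuming the stored autologin response behaves like a dict with a `"status"` key; the logging and the extra request are illustrative only:

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    # Hypothetical callback showing the meta keys the middleware provides.
    name = 'example'
    start_urls = ['http://example.com/']

    def parse(self, response):
        if not response.meta.get('autologin_active'):
            self.logger.warning('Not logged in for %s', response.url)

        # Bookkeeping based on the autologin response, if one was made:
        autologin_response = response.meta.get('autologin_response')
        if autologin_response is not None:
            self.logger.info('autologin status: %s',
                             autologin_response.get('status'))

        # Fetch a public page without involving the middleware at all:
        yield scrapy.Request(
            response.urljoin('/robots.txt'),
            meta={'skip_autologin': True},
            callback=self.parse_public,
        )

    def parse_public(self, response):
        pass
```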
Autologin middleware supports Splash via scrapy-splash, but setting everything up correctly can be tricky.
First, you need to specify the following settings (check scrapy-splash docs for more details):
```python
SPLASH_URL = 'http://127.0.0.1:8050'
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
DOWNLOADER_MIDDLEWARES = {
    'autologin_middleware.AutologinMiddleware': 605,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
```
Second, you need to make requests to Splash and pass cookies with `splash:init_cookies(splash.args.cookies)`, and return them in the `cookies` field using `splash:get_cookies()`. If you are already using a Splash script (the `execute` endpoint), modify your script accordingly. But if you just want to crawl using Splash, you can use `autologin_middleware.splash.splash_request` instead of `scrapy.Request`. It has a minimal Lua script that passes cookies and returns HTML, so you won't need to change anything else in your spider.
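A minimal sketch of a spider using it, assuming `splash_request` accepts the same URL and `callback` arguments as `scrapy.Request`; the spider name and start URL are made up:

```python
import scrapy
from autologin_middleware.splash import splash_request


class SplashSpider(scrapy.Spider):
    # Hypothetical spider that crawls via Splash while staying logged in.
    name = 'splash-example'

    def start_requests(self):
        yield splash_request('http://example.com/')

    def parse(self, response):
        # Follow links with splash_request instead of scrapy.Request,
        # so cookies keep flowing through Splash.
        for href in response.css('a::attr(href)').extract():
            yield splash_request(response.urljoin(href), callback=self.parse)
```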
You need to start `autologin-http-api` (from autologin), and Splash (the easiest option is to run `docker run -p 8050:8050 --net=host scrapinghub/splash`).
Run tests with tox:

```
tox
```
When using Docker to run Splash on OS X and Windows, it will start on a non-default address, so you need to specify it when running tests, for example:

```
SPLASH_URL=http://192.168.99.100:8050 tox
```
License is MIT.