This is a Sina Weibo spider built by nghuyong, tailored by Mingyang Li to run on WWBP's servers.
A detailed explanation (in Chinese), written by nghuyong, can be found at 微博爬虫总结:构建单机千万级别的微博爬虫系统 ("Weibo spider summary: building a single-machine Weibo crawler at the ten-million-posts scale").
A description of the data structure can be found at 数据字段说明与示例 ("Data field descriptions and examples").
The original repo by nghuyong has 3 branches:
Branch | Structure | Posts per Day
---|---|---
simple | single account | 100,000
master | account pool | 1,000,000
senior | distributed pool | 10,000,000
- Clone the repo and install its dependencies:

  ```bash
  git clone git@github.com:nghuyong/WeiboSpider.git
  cd WeiboSpider
  pip install -r requirements.txt
  ```

- Install `phantomjs`, `mongodb`, and `redis`, and start the latter two.
- Write the usernames and passwords of some Sina Weibo accounts into `sina/account_build/account.txt`, following the format indicated in `account_sample.txt` (a hypothetical example appears after this list).
- Populate the account pool by running `python sina/account_build/login.py`.
- Populate the URLs to start scraping from by running `python sina/redis_init.py` (see the sanity-check sketch after this list).
- Run the scraper with `scrapy crawl weibo_spider`.
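For illustration, a hypothetical `sina/account_build/account.txt` might look like the lines below. The exact layout (including the `----` separator) is an assumption here; defer to `account_sample.txt` in the repo:

```
13800000000----password1
13900000000----password2
```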
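After the last three steps, a quick way to confirm that the account pool and the URL queue were actually populated is to peek at MongoDB and Redis directly. This is a minimal sketch: the Redis key follows the usual scrapy-redis `<spider_name>:start_urls` convention, and the `Sina`/`account` database and collection names are assumptions, so adjust them to match the project's settings:

```python
import pymongo
import redis

# Key name assumed from the scrapy-redis convention "<spider_name>:start_urls";
# if redis_init.py seeds a set instead of a list, use r.scard() here.
r = redis.Redis(host="localhost", port=6379)
print("queued start URLs:", r.llen("weibo_spider:start_urls"))

# Database and collection names are assumptions; check the pipeline settings.
db = pymongo.MongoClient("localhost", 27017)["Sina"]
print("accounts in pool:", db["account"].count_documents({}))
```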
Posts, user profiles, and user relationships (and, optionally, comments) are stored in MongoDB.
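Once the crawl is running, the stored data can be queried with `pymongo`. A sketch, assuming the database is named `Sina` and posts live in a `Tweets` collection; since the actual names are set in the project's pipeline, the safest move is to list the collections first:

```python
import pymongo

db = pymongo.MongoClient("localhost", 27017)["Sina"]  # database name assumed

# Discover the actual collection names rather than guessing them.
print(db.list_collection_names())

# Example query, assuming posts are stored in a "Tweets" collection.
for post in db["Tweets"].find().limit(3):
    print(post.get("_id"), post.get("content"))
```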
With the default settings on a 16 GB, 8-core Ubuntu machine running 36 processes, we see an average of 2,000 posts per second.
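The 36 processes are independent crawler instances; assuming the project uses scrapy-redis (which the Redis seeding step suggests), they share the same URL queue and parallelize without any coordination code. A hypothetical launcher:

```python
import subprocess

# Launch N independent "scrapy crawl weibo_spider" processes; the shared
# Redis queue keeps them from fetching the same URLs twice.
N = 36  # process count taken from the benchmark above
procs = [subprocess.Popen(["scrapy", "crawl", "weibo_spider"]) for _ in range(N)]
for p in procs:
    p.wait()
```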