This is a Sina Weibo spider built by nghuyong, tailored by Mingyang Li to run on WWBP's servers.
A detailed explanation (in Chinese), written by nghuyong, can be found at 微博爬虫总结:构建单机千万级别的微博爬虫系统 ("Weibo spider summary: building a single-machine Weibo crawler at the ten-million-posts scale").
A description of the data structure can be found at 数据字段说明与示例 ("Data field descriptions and examples").
The original repo by nghuyong has 3 branches:
Branch | Structure | Posts per Day
---|---|---
simple | single account | 100,000
master | account pool | 1,000,000
senior | distributed pool | 10,000,000
- Clone the repo and install its dependencies:

  ```bash
  git clone git@github.com:nghuyong/WeiboSpider.git
  cd WeiboSpider
  pip install -r requirements.txt
  ```

- Install `phantomjs`, `mongodb`, and `redis`, and start the latter two.
- Write the usernames and passwords of some Sina Weibo accounts into `sina/account_build/account.txt`, following the format indicated in `account_sample.txt` (a hypothetical example appears after this list).
- Populate the account pool by running `python sina/account_build/login.py`.
- Populate the URLs to start scraping from by running `python sina/redis_init.py` (see the sanity-check sketch after this list).
- Run the scraper with `scrapy crawl weibo_spider`.
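For illustration, a hypothetical `sina/account_build/account.txt` might look like the lines below. The exact layout (including the `----` separator) is an assumption here; defer to `account_sample.txt` in the repo:

```
13800000000----password1
13900000000----password2
```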
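After the last three steps, a quick way to confirm that the account pool and the URL queue were actually populated is to peek at MongoDB and Redis directly. This is a minimal sketch: the Redis key follows the usual scrapy-redis `<spider_name>:start_urls` convention, and the `Sina`/`account` database and collection names are assumptions, so adjust them to match the project's settings:

```python
import pymongo
import redis

# Key name assumed from the scrapy-redis convention "<spider_name>:start_urls";
# if redis_init.py seeds a set instead of a list, use r.scard() here.
r = redis.Redis(host="localhost", port=6379)
print("queued start URLs:", r.llen("weibo_spider:start_urls"))

# Database and collection names are assumptions; check the pipeline settings.
db = pymongo.MongoClient("localhost", 27017)["Sina"]
print("accounts in pool:", db["account"].count_documents({}))
```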
Posts, user profiles, and user relationships (and, optionally, comments) are stored in MongoDB.
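Once the crawl is running, the stored data can be queried with `pymongo`. A sketch, assuming the database is named `Sina` and posts live in a `Tweets` collection; since the actual names are set in the project's pipeline, the safest move is to list the collections first:

```python
import pymongo

db = pymongo.MongoClient("localhost", 27017)["Sina"]  # database name assumed

# Discover the actual collection names rather than guessing them.
print(db.list_collection_names())

# Example query, assuming posts are stored in a "Tweets" collection.
for post in db["Tweets"].find().limit(3):
    print(post.get("_id"), post.get("content"))
```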
With the default settings on a 16 GB, 8-core Ubuntu machine running 36 processes, we see an average of 2,000 posts per second.
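The 36 processes are independent crawler instances; assuming the project uses scrapy-redis (which the Redis seeding step suggests), they share the same URL queue and parallelize without any coordination code. A hypothetical launcher:

```python
import subprocess

# Launch N independent "scrapy crawl weibo_spider" processes; the shared
# Redis queue keeps them from fetching the same URLs twice.
N = 36  # process count taken from the benchmark above
procs = [subprocess.Popen(["scrapy", "crawl", "weibo_spider"]) for _ in range(N)]
for p in procs:
    p.wait()
```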