VIDEO: https://loom.com/share/f2733285d0fe49ecad06a923072a30ef
This project focuses on scraping data from any website using a cluster of headful or headless browsers, built from Docker containers running Celery, Redis, Selenium Grid, PostgreSQL and Django Rest Framework.
As a starting point, the project scrapes https://www.ingresso.com/ for movie theatres (DONE) and movie sessions (not reached yet).
After that initial test and implementation phase, I'm building on the project's natural modularity to scrape Twitter data from searches such as https://twitter.com/search?q=.eth&src=typed_query&f=user, focusing on accounts whose names end with .eth. For each Twitter account, the scraper records the full name, username, bio, followers count and following count.
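The Twitter fields listed above could be modelled roughly as follows. This is only a sketch, not the project's actual model: the `TwitterAccount` class, its field names and the `filter_eth_accounts` helper are hypothetical, and it assumes that "name" means the display (full) name rather than the username.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class TwitterAccount:
    """Hypothetical record holding the five fields recorded per account."""
    full_name: str
    username: str
    bio: str
    followers_count: int
    following_count: int


def has_eth_name(account: TwitterAccount) -> bool:
    """Return True when the display name ends with '.eth' (case-insensitive).

    Assumption: the check applies to the full name, not the username.
    """
    return account.full_name.strip().lower().endswith(".eth")


def filter_eth_accounts(accounts: Iterable[TwitterAccount]) -> List[TwitterAccount]:
    """Keep only the accounts whose display name ends with '.eth'."""
    return [account for account in accounts if has_eth_name(account)]
```

A filter like this would run on the scraped search results before anything is written to the database, so only matching accounts are stored.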
The docker-compose file is easy to adapt to handle more sessions and browser nodes, using Chrome or any other browser available in Selenium Grid. The project can also act as a framework for testing websites once the right tasks and database models are configured. I'm also changing the project name and setup to make it a more generic web-scraping tool.
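To illustrate how a task might drive a browser node through the Selenium Grid hub, here is a minimal sketch using Selenium 4's `webdriver.Remote`. It is not the project's task code: `build_hub_url` and `fetch_page_title` are hypothetical helpers, and the hub address assumes the grid is reachable on port 4444 as in the setup steps.

```python
from typing import Optional


def build_hub_url(host: str = "localhost", port: int = 4444) -> str:
    """Build the Selenium Grid hub endpoint (host and port are assumptions)."""
    return f"http://{host}:{port}/wd/hub"


def fetch_page_title(url: str, hub_url: Optional[str] = None, headless: bool = True) -> str:
    """Open `url` on a remote Chrome node and return the page title."""
    # Imported lazily so this module loads even where selenium is not installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    if headless:
        # Run Chrome without a visible window; omit the flag for headful runs.
        options.add_argument("--headless=new")

    driver = webdriver.Remote(
        command_executor=hub_url or build_hub_url(),
        options=options,
    )
    try:
        driver.get(url)
        return driver.title
    finally:
        # Always release the grid session, even if the page load fails.
        driver.quit()
```

Inside the docker-compose network the host would be the hub's service name rather than `localhost`; that name depends on the compose file.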
- Docker
- Nginx (only prod)
- Gunicorn (only prod)
- Celery - worker and flower dashboard
- Redis as a broker and result backend for Celery and cache for Django
- Selenium Grid - Hub and Chrome Node
- PostgreSQL or possibly other SQL databases
- Python
- Packages:
  - Django
  - Django Rest Framework
  - More details in requirements.txt
Do you have Docker installed?
Every command is also available in the Makefile, or you can follow the steps below:
- Build the images: docker-compose build --no-cache
- Start the containers: docker-compose up -d
- To trigger a web scraping task, open http://localhost:8000/api/v1/webscraper/1/
- (Optional) To see what is happening inside the container, head to http://localhost:4444/. The default password for the VNC is "secret".
- Flower dashboard: http://localhost:5555/ (tasks dashboard: http://localhost:5555/tasks)
- Stop the containers and remove their volumes: docker-compose down -v
- Django admin: http://localhost:8000/admin (username: superuser, password: password)
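The scraping endpoint can also be called from a script instead of a browser. The sketch below uses only the standard library and assumes the endpoint answers a plain GET, as opening it in a browser would. `endpoint_url` and `trigger_scrape` are hypothetical names, and treating the trailing `1` as a resource id is an assumption.

```python
import urllib.request

# Base URL taken from the setup steps; adjust if the API is exposed elsewhere.
BASE_URL = "http://localhost:8000/api/v1/webscraper"


def endpoint_url(resource_id: int, base_url: str = BASE_URL) -> str:
    """Build the endpoint URL, e.g. .../webscraper/1/ for resource_id=1."""
    return f"{base_url.rstrip('/')}/{resource_id}/"


def trigger_scrape(resource_id: int = 1, timeout: float = 10.0) -> int:
    """Send a GET request to the endpoint and return the HTTP status code."""
    with urllib.request.urlopen(endpoint_url(resource_id), timeout=timeout) as response:
        return response.status
```

With the containers running, `trigger_scrape()` should hit the same URL as the browser step, and the resulting task can then be followed in the dashboard on port 5555.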
Thank you for your time, and I hope we can talk in the near future.
https://stackoverflow.com/questions/51508150/standard-init-linux-go190-exec-user-process-caused-no-such-file-or-directory
Use Notepad++: go to Edit -> EOL Conversion and change from CRLF to LF.
Update: VS Code users can change CRLF to LF by clicking the CRLF indicator at the lower right of the status bar.
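The same CRLF to LF conversion can be scripted when an editor is not at hand. This is a minimal sketch using only the standard library; `crlf_to_lf` and `convert_file` are hypothetical helper names, and the file to convert (for example a shell script copied into the container) is whatever triggers the error on your machine.

```python
from pathlib import Path


def crlf_to_lf(data: bytes) -> bytes:
    """Replace Windows line endings (CRLF) with Unix line endings (LF)."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path: str) -> bool:
    """Rewrite the file in place with LF endings.

    Returns True if the file was changed, False if it already used LF.
    """
    file_path = Path(path)
    original = file_path.read_bytes()
    converted = crlf_to_lf(original)
    if converted == original:
        return False
    file_path.write_bytes(converted)
    return True
```

Working on bytes rather than text avoids any encoding guesswork and leaves the rest of the file untouched.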