Falamarcao/fancy-web-scrapping

Web-Scraping Tool

VIDEO: https://loom.com/share/f2733285d0fe49ecad06a923072a30ef

The project focuses on scraping data from any website with a cluster of headful or headless browsers. It runs in Docker containers that host Celery, Redis, Selenium Grid, PostgreSQL and Django REST Framework.
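As a rough illustration of how a worker could reach a browser on the grid, here is a minimal sketch. It is not taken from this repository: the hub hostname `selenium-hub`, port `4444`, and the helper names `grid_url`, `open_remote_chrome` and `fetch_title` are assumptions made for the example.

```python
def grid_url(host: str = "selenium-hub", port: int = 4444) -> str:
    """Build the WebDriver endpoint of a Selenium Grid hub.

    The default host and port are assumptions; use the values from
    your own docker-compose file.
    """
    return f"http://{host}:{port}/wd/hub"


def open_remote_chrome(headless: bool = True):
    """Open a Chrome session on the grid. The caller must call .quit() on it."""
    # Imported lazily so this module loads even where Selenium is not installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    if headless:
        # Run Chrome without a visible window.
        options.add_argument("--headless=new")
    return webdriver.Remote(command_executor=grid_url(), options=options)


def fetch_title(url: str) -> str:
    """Load a page in a remote browser and return its title."""
    driver = open_remote_chrome()
    try:
        driver.get(url)
        return driver.title
    finally:
        # Always release the browser slot on the grid.
        driver.quit()
```

A function like `fetch_title` could then be wrapped in a Celery task so that several workers share the same grid, which is the general shape the stack above suggests.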

As a starting point, the project scraped https://www.ingresso.com/ to collect movie theatres (done) and movie sessions (not implemented yet).

After that initial test and implementation phase, I'm building on the project's modularity to scrape Twitter data from searches such as https://twitter.com/search?q=.eth&src=typed_query&f=user. The goal is to extract accounts whose names end with .eth and, for each account, record the full name, username, bio, followers count and following count.
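The record described above could be modelled roughly as follows. This is only a sketch: the class `TwitterAccount`, its field names, and the helpers `is_eth_name` and `filter_eth_accounts` are hypothetical, and it assumes that ".eth" is matched against the full (display) name, ignoring case and surrounding whitespace.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class TwitterAccount:
    """One scraped account; field names are illustrative, not the project's models."""
    full_name: str
    username: str
    bio: str
    followers_count: int
    following_count: int


def is_eth_name(full_name: str) -> bool:
    """Return True when the display name ends with '.eth' (case-insensitive)."""
    return full_name.strip().lower().endswith(".eth")


def filter_eth_accounts(accounts: Iterable[TwitterAccount]) -> List[TwitterAccount]:
    """Keep only the accounts whose display name ends with '.eth'."""
    return [account for account in accounts if is_eth_name(account.full_name)]
```

For example, given accounts named "alice.eth", "Bob" and "Carol.ETH ", `filter_eth_accounts` keeps the first and the last.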

The docker-compose file is easy to adapt to handle more sessions and browser nodes, using Chrome or any other browser available for Selenium Grid. With the right tasks and database models, the project can also act as a framework for testing websites. I'm also changing the project name and setup to make it a more generic web-scraping tool.

Getting Started

Dependencies

  • Docker
  • Nginx (production only)
  • Gunicorn (production only)
  • Celery - worker and flower dashboard
  • Redis as a broker and result backend for Celery and cache for Django
  • Selenium Grid - Hub and Chrome Node
  • PostgreSQL, or possibly other SQL databases
  • Python
  • Packages:

Installing

Do you have Docker Installed?

For every command, you can check the Makefile or follow the instructions below:

Building and Running Development environment (DEV)

Help: If you have any problems, please e-mail me or contact me on LinkedIn.

Author: Marco Maschio

Thank you for your time, and I hope we can talk in the near future.

Common issue with Docker and Windows

https://stackoverflow.com/questions/51508150/standard-init-linux-go190-exec-user-process-caused-no-such-file-or-directory

Use Notepad++: go to Edit -> EOL Conversion and change from CRLF to LF.

Update: VS Code users can change CRLF to LF by clicking CRLF on the lower right side of the status bar.
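The same CRLF to LF conversion can also be scripted, which helps when several shell scripts are affected. The sketch below is not part of the project; the function names `normalize_newlines` and `crlf_to_lf` are made up for the example, and the path in the usage note is hypothetical.

```python
from pathlib import Path


def normalize_newlines(data: bytes) -> bytes:
    """Replace Windows line endings (CRLF) with Unix line endings (LF)."""
    return data.replace(b"\r\n", b"\n")


def crlf_to_lf(path) -> bool:
    """Rewrite the file in place with LF endings; return True if it changed."""
    file_path = Path(path)
    original = file_path.read_bytes()
    converted = normalize_newlines(original)
    if converted == original:
        return False  # Already LF-only, nothing to write.
    file_path.write_bytes(converted)
    return True
```

Calling `crlf_to_lf("entrypoint.sh")` on a script before building the image would rewrite it with LF endings.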