This project was built to scrape product information, such as price and description, from the Amazon website. The ScraperAmazon class gathers the functions that fetch the data we want and transform it into a pandas DataFrame.
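For context, here is a minimal sketch of what such a class can look like using requests, BeautifulSoup, and pandas. The method names, HTML selectors, and overall layout are illustrative assumptions, not the exact implementation in this repository:

```python
# Minimal sketch only: method names and HTML selectors are assumptions
# and may differ from the actual ScraperAmazon class in this repository.
import requests
import pandas as pd
from bs4 import BeautifulSoup

class ScraperAmazon:
    def __init__(self, products):
        # products: list of Amazon product IDs (see BOOKS below)
        self.products = products
        # Amazon tends to block requests without a browser-like User-Agent
        self.headers = {"User-Agent": "Mozilla/5.0"}

    def get_page(self, product):
        # Fetch the product page and parse it into a BeautifulSoup tree
        url = f"https://www.amazon.com.br/dp/{product}/"
        response = requests.get(url, headers=self.headers)
        return BeautifulSoup(response.text, "html.parser")

    def get_products_info(self):
        # Collect title and price for every product into a pandas DataFrame
        rows = []
        for product in self.products:
            soup = self.get_page(product)
            title = soup.find(id="productTitle")
            price = soup.find("span", class_="a-offscreen")
            rows.append({
                "product": product,
                "description": title.get_text(strip=True) if title else None,
                "price": price.get_text(strip=True) if price else None,
            })
        return pd.DataFrame(rows)
```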
If you'd like to run this scraper every day, so you could, for example, compare prices over a period, you can use a DAG that defines a schedule for it. This project includes an example DAG for that purpose; a sketch of the idea follows.
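As a reference, a daily DAG could be wired up roughly like the sketch below. The dag_id, task_id, and run_scraper callable are hypothetical placeholders; see the actual DAG file in this project for the real definitions:

```python
# Hedged sketch of a daily DAG; dag_id, task_id, and run_scraper are
# hypothetical names, not necessarily those used in this project's DAG.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def run_scraper():
    # Hypothetical entry point: run ScraperAmazon and persist the DataFrame
    pass

with DAG(
    dag_id="amazon_scraping_daily",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",  # run once a day to compare prices over time
    catchup=False,
) as dag:
    PythonOperator(
        task_id="scrape_amazon_products",
        python_callable=run_scraper,
    )
```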
The Amazon URL for each product follows the structure below, so what you need to change in BOOKS is the product ID:
https://www.amazon.com.br/dp/<product>/
You can get the product ID from the product page URL, as shown below:
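For example (the product ID below is made up, and the exact structure of BOOKS in the code may differ), the ID is the token right after /dp/ in the URL:

```python
# Illustrative only: the ID used here is a made-up example, and BOOKS
# may be structured differently in the actual code.
url = "https://www.amazon.com.br/dp/B07FQK1TS9/"
product = url.split("/dp/")[1].strip("/")
print(product)  # B07FQK1TS9

# BOOKS would then hold one product ID per item you want to scrape
BOOKS = [product]
```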
You can clone this repository with the command below:
git clone https://github.com/camila-marquess/webscraping.git
You need to install some libraries in order to use this code:
pip install beautifulsoup4
pip install pandas
pip install requests
Before running Airflow, make sure Docker is installed on your OS. If it is not, follow the steps for your OS: Installing Docker Compose.
To start Airflow, run:
docker-compose up -d
Then you can open the Airflow UI at localhost:8080 in your browser. The default username and password are both airflow.
To stop the containers, run:
docker-compose down