A repository for our final project for the Computer Architecture and Operating Systems class at Binus International.
A web scraper is an application that harvests data from websites over HTTP and stores the data it finds in a local file or database.
We have two main files: `scrapper.py` and `scrapper_threading.py`. `scrapper_threading.py` uses a thread pool to create threads so the process runs faster than `scrapper.py`, which doesn't use any threading. The number of threads is determined by the number of cores the current CPU has, with a maximum of 32 threads being created at any given time: if the CPU has a single core, we make 5 threads; if the CPU has 32 or more cores, we only make 32 threads, to avoid using too many resources on higher-core machines.
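Below is a minimal sketch of that sizing rule, assuming Python's `concurrent.futures`; the actual code in `scrapper_threading.py` may be structured differently. The rule described above happens to match `ThreadPoolExecutor`'s own default of `min(32, cores + 4)`.

```python
import os
from concurrent.futures import ThreadPoolExecutor

cores = os.cpu_count() or 1        # os.cpu_count() can return None
max_workers = min(32, cores + 4)   # 1 core -> 5 threads, capped at 32

pool = ThreadPoolExecutor(max_workers=max_workers)
print(f"Using {max_workers} threads on {cores} cores")
```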
Made by:
- Fauzan
- Stanlly
Libraries used:
- `socket` for making the HTTP requests
- `threading` for "parallelizing" the HTTP requests, though it's not really parallelism because we're only using `Thread`s rather than multiple processes
- `http.client`'s `HTTPResponse` for parsing the HTTP response
- BeautifulSoup4 + lxml for parsing the HTML content (see the sketch below for how these pieces fit together)
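As an illustration of how these libraries combine, here is a minimal sketch; it is not the project's actual code, and the host and function name are made up for the example:

```python
import socket
from http.client import HTTPResponse
from bs4 import BeautifulSoup

def fetch_title(host: str, path: str = "/") -> str:
    # Plain TCP connection on port 80 (no TLS in this sketch).
    with socket.create_connection((host, 80)) as sock:
        request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        sock.sendall(request.encode("ascii"))
        # HTTPResponse parses the status line, headers, and body for us.
        response = HTTPResponse(sock, method="GET")
        response.begin()
        html = response.read()
    soup = BeautifulSoup(html, "lxml")
    return soup.title.get_text(strip=True) if soup.title else ""

print(fetch_title("example.com"))
```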
- Install the dependencies. When using `requirements.txt`, make sure the CMD/shell is in the folder that contains the file.

  ```
  pip install -r requirements.txt
  ```

  or

  ```
  pip install beautifulsoup4 lxml
  ```
- Run either `scrapper.py` or `scrapper_threading.py`:

  ```
  python scrapper.py
  ```

  or

  ```
  python scrapper_threading.py
  ```
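To reproduce the comparison below, a simple timing harness like the following can be used (our own illustration; no such script ships with the repo):

```python
import subprocess
import time

for script in ("scrapper.py", "scrapper_threading.py"):
    start = time.perf_counter()
    subprocess.run(["python", script], check=True)
    print(f"{script} finished in {time.perf_counter() - start:.1f}s")
```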
On Fauzan's 50 Mbps connection and 6-core machine:
On Stanlly's 60 Mbps connection and 8-core machine: