Twitch Chat Scrape

The app is written in Python. The main goal was to improve my knowledge of building NLP machine learning models and of data scraping.

General info

Data Scrape

Since the Twitch.tv platform doesn't provide an intuitive API, the idea was to use the selenium package in Python to scrape data from the live chat.

The code can run the browser with the chat either in a window or in headless mode, so the user can decide whether it is useful or necessary to watch the current state of the chat.
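The windowed/headless switch described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names, the --mute-audio switch, and especially the CSS selector for a chat line are assumptions, and Twitch's markup may differ or change.

```python
import time


def stream_url(stream_name: str) -> str:
    """Build the page URL, following the twitch.tv/stream_name form used in this README."""
    return f"https://www.twitch.tv/{stream_name}"


def chrome_arguments(headless: bool) -> list[str]:
    """Return the Chrome command-line switches for the chosen mode."""
    args = ["--mute-audio"]  # assumption: keep the stream silent while scraping
    if headless:
        args.append("--headless=new")  # run Chrome without a visible window
    return args


# Assumption: selector for a single chat line; verify against the live page.
CHAT_LINE_SELECTOR = "[data-a-target='chat-line-message']"


def read_chat_once(stream_name: str, delay: float, headless: bool) -> list[str]:
    """Open the stream page, wait `delay` seconds, and return the visible chat lines."""
    # Selenium is imported lazily so the helpers above work without it installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.ChromeOptions()
    for arg in chrome_arguments(headless):
        options.add_argument(arg)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(stream_url(stream_name))
        time.sleep(delay)  # give the page time to load the chat
        lines = driver.find_elements(By.CSS_SELECTOR, CHAT_LINE_SELECTOR)
        return [line.text for line in lines]
    finally:
        driver.quit()  # always close the browser, even on errors
```

Calling read_chat_once("xqc", 2, False) would open a visible browser window, while passing True as the last argument keeps the browser hidden.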

Dataset

The Sentiment140 dataset was chosen to train the ML model. Although old (2010), it is the biggest dataset of this type: it consists of 1.6 million tweets with Positive/Negative annotations. Unfortunately, the dataset doesn't provide Neutral annotations, so every chat message is forced into one of the two classes and the results are not always valid or objective.
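To make the two-class labelling concrete, here is a minimal sketch of reading Sentiment140-style rows. In the published CSV the first column is the polarity (0 for negative, 4 for positive) and the sixth column is the tweet text. The two sample rows below are made up for illustration, and the function name is hypothetical rather than taken from this project.

```python
import csv
import io

# Polarity codes used in the Sentiment140 training file.
LABELS = {"0": "negative", "4": "positive"}


def read_sentiment140(lines):
    """Yield (text, label) pairs from rows of: target, id, date, flag, user, text."""
    for row in csv.reader(lines):
        target, text = row[0], row[5]
        yield text, LABELS[target]


# Made-up rows in the same column layout as the dataset (illustration only).
sample = io.StringIO(
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","so tired today"\n'
    '"4","2","Mon Apr 06 22:20:00 PDT 2009","NO_QUERY","user_b","what a great stream"\n'
)

pairs = list(read_sentiment140(sample))
for text, label in pairs:
    print(f"{label}: {text}")
```

When reading the real file, open it with encoding="latin-1", since it is not UTF-8 encoded. Because there is no neutral code in the training data, a model trained on it can only answer "positive" or "negative".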

Setup

Python 3.11.4 was used during development.

It is recommended to create a new virtual environment:

python -m venv .venv

and activate it (the command below is for Windows; on Linux or macOS use . .venv/bin/activate instead):

. .venv\Scripts\activate

To run the project, install the required packages listed in the requirements.txt file:

pip install -r requirements.txt

Launch

To launch the project, run the whole pipeline with:

sh scripts/run_pipeline.sh "my_data" "my_model" "xqc" 2 0 "my_texts" "my_results"

Arguments in order:

data_filename - name given to the downloaded dataset file - string
model_name - name given to the trained ML model - string
stream_name - name of the stream; the chat at twitch.tv/stream_name will be opened - string
delay - time in seconds to pause while waiting for the page to load - float
headless - whether the browser runs in headless (1) or windowed (0) mode - int (0/1)
texts_filename - name given to the file with scraped texts - string
results_filename - name given to the results file - string
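The argument types listed above could be declared with argparse along these lines. This is only a sketch for the scraping step: the parser layout and the choice to make every flag required are assumptions, not the project's actual implementation.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the scraping arguments with the types listed in this README."""
    parser = argparse.ArgumentParser(description="Scrape Twitch chat (sketch)")
    parser.add_argument("--stream_name", type=str, required=True)
    parser.add_argument("--delay", type=float, required=True)
    # headless is passed as 0/1 on the command line and converted to bool later.
    parser.add_argument("--headless", type=int, choices=[0, 1], required=True)
    parser.add_argument("--texts_filename", type=str, required=True)
    return parser


# Parse an explicit list instead of sys.argv so the example is self-contained.
args = build_parser().parse_args(
    ["--stream_name=xqc", "--delay=2", "--headless=0", "--texts_filename=my_texts"]
)
headless = bool(args.headless)
print(args.stream_name, args.delay, headless, args.texts_filename)
```

Restricting headless to the choices 0 and 1 makes argparse reject values such as 2 or "yes" before any browser is started.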

Each of the scripts can also be run independently, e.g.:

python src/data_ingestion.py --data_filename="my_data"

python src/model_training.py --data_filename="my_data" --model_name="my_model"

python src/scrape_text.py --stream_name="xqc" --delay=2 --headless=0 --texts_filename="my_texts"

python src/classify_text.py --model_name="my_model" --texts_filename="my_texts" --results_filename="my_results"
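How the seven pipeline arguments map onto the four scripts can be sketched in Python as below. It assumes that scripts/run_pipeline.sh runs the scripts in the order they are listed here (ingestion, training, scraping, classification); the function names are hypothetical.

```python
import subprocess


def pipeline_commands(data_filename, model_name, stream_name, delay,
                      headless, texts_filename, results_filename):
    """Return one argv list per script, in the assumed pipeline order."""
    return [
        ["python", "src/data_ingestion.py", f"--data_filename={data_filename}"],
        ["python", "src/model_training.py", f"--data_filename={data_filename}",
         f"--model_name={model_name}"],
        ["python", "src/scrape_text.py", f"--stream_name={stream_name}",
         f"--delay={delay}", f"--headless={headless}",
         f"--texts_filename={texts_filename}"],
        ["python", "src/classify_text.py", f"--model_name={model_name}",
         f"--texts_filename={texts_filename}",
         f"--results_filename={results_filename}"],
    ]


def run_pipeline(*pipeline_args):
    """Run each step in turn and stop at the first one that fails."""
    for command in pipeline_commands(*pipeline_args):
        subprocess.run(command, check=True)


commands = pipeline_commands("my_data", "my_model", "xqc", 2, 0,
                             "my_texts", "my_results")
for command in commands:
    print(" ".join(command))
```

Note that model_name and texts_filename each appear in two steps: the training and scraping steps write the files that the classification step later reads.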

Screens

License

This project is licensed under the terms of the MIT license. You can check out the full license here
