The app is written in Python. Its main goal was to gain experience with building NLP machine learning models and with data scraping.
Because the Twitch.tv platform doesn't provide an intuitive API, the idea was to use the selenium package in Python to scrape data from the live chat.
The code can run the browser with the chat either in a window or in headless mode, so the user can decide whether it is useful or necessary to watch the current state of the chat.
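The window/headless switch could look roughly like the sketch below. It is an illustration, not the project's actual code: it assumes Chrome is the browser, the helper names chrome_arguments and open_chat are hypothetical, and the URL simply follows the twitch.tv/stream_name pattern. Selenium is imported lazily so the argument helper works even where the package is not installed.

```python
def chrome_arguments(headless: bool) -> list[str]:
    """Return Chrome command-line switches for the chosen display mode."""
    # The window size is an arbitrary value picked for this sketch.
    args = ["--window-size=1280,900"]
    if headless:
        # "--headless=new" selects Chrome's current headless implementation.
        args.append("--headless=new")
    return args


def open_chat(stream_name: str, headless: bool = False):
    """Start Chrome through Selenium and open the stream page (hypothetical helper)."""
    # Imported here so the module still loads where selenium is missing.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in chrome_arguments(headless):
        options.add_argument(arg)
    driver = webdriver.Chrome(options=options)
    driver.get(f"https://www.twitch.tv/{stream_name}")
    return driver
```

Calling open_chat("xqc", headless=True) would then load the page without showing a window, while headless=False keeps the chat visible for inspection.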
To train the ML model, the Sentiment140 dataset was chosen. Although old (2010), it is among the largest datasets of this type, consisting of 1.6 million tweets with Positive/Negative annotations. Unfortunately, the dataset doesn't provide Neutral annotations, so every message is forced into one of the two classes and the results may not always be valid or objective.
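As a rough sketch of how the Positive/Negative annotations can be read, the snippet below parses rows in the Sentiment140 CSV layout (polarity, id, date, query, user, text), where polarity 0 means negative and 4 means positive. The function name read_sentiment140 and the two sample rows are made up for illustration; the project's own ingestion code may work differently.

```python
import csv
import io

# Polarity codes used in the Sentiment140 CSV: 0 = negative, 4 = positive.
LABELS = {"0": "negative", "4": "positive"}


def read_sentiment140(lines):
    """Yield (text, label) pairs from an iterable of CSV lines."""
    for row in csv.reader(lines):
        if len(row) != 6:
            continue  # skip malformed rows
        polarity, _id, _date, _query, _user, text = row
        label = LABELS.get(polarity)
        if label is not None:  # ignore any other polarity code
            yield text, label


# Two invented rows in the same layout, used only to exercise the parser.
sample = io.StringIO(
    '"0","1","Mon Apr 06 2009","NO_QUERY","user_a","so tired today"\n'
    '"4","2","Mon Apr 06 2009","NO_QUERY","user_b","love this stream"\n'
)
pairs = list(read_sentiment140(sample))
```

For the real file, the same function can be fed an open file handle; the dataset is commonly read with encoding="latin-1".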
Python 3.11.4 was used during development.
It is recommended to create a new virtual environment:
python -m venv .venv
and activate it (the command below is for Windows; on Linux or macOS use . .venv/bin/activate instead):
. .venv\Scripts\activate
To run the project, install the required packages, which are listed in the requirements.txt file:
pip install -r requirements.txt
To launch the project, you can run the whole pipeline with:
sh scripts/run_pipeline.sh "my_data" "my_model" "xqc" 2 0 "my_texts" "my_results"
Arguments in order:
- data_filename - name given to the downloaded dataset file - string
- model_name - name given to the created ML model - string
- stream_name - name of the stream whose chat (twitch.tv/stream_name) will be opened - string
- delay - time in seconds the program pauses to wait for the page to load - float
- headless - flag selecting headless or windowed browser mode - int (0/1)
- texts_filename - name given to the file with scraped texts - string
- results_filename - name given to the results file - string
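The mapping from these positional arguments to the individual stages can be sketched as follows. This assumes the shell script simply runs the four scripts under src/ in order; build_pipeline_commands is a hypothetical helper, not part of the project.

```python
def build_pipeline_commands(data_filename, model_name, stream_name,
                            delay, headless, texts_filename, results_filename):
    """Return one argv list per pipeline stage, in execution order."""
    return [
        ["python", "src/data_ingestion.py",
         f"--data_filename={data_filename}"],
        ["python", "src/model_training.py",
         f"--data_filename={data_filename}", f"--model_name={model_name}"],
        ["python", "src/scrape_text.py",
         f"--stream_name={stream_name}", f"--delay={delay}",
         f"--headless={headless}", f"--texts_filename={texts_filename}"],
        ["python", "src/classify_text.py",
         f"--model_name={model_name}", f"--texts_filename={texts_filename}",
         f"--results_filename={results_filename}"],
    ]


commands = build_pipeline_commands(
    "my_data", "my_model", "xqc", 2, 0, "my_texts", "my_results"
)
```

Each entry could then be executed with subprocess.run(command, check=True), which stops the pipeline as soon as one stage fails.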
Each of the scripts can also be run independently, e.g.:
python src/data_ingestion.py --data_filename="my_data"
python src/model_training.py --data_filename="my_data" --model_name="my_model"
python src/scrape_text.py --stream_name="xqc" --delay=2 --headless=0 --texts_filename="my_texts"
python src/classify_text.py --model_name="my_model" --texts_filename="my_texts" --results_filename="my_results"
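For reference, flags like the ones passed to scrape_text.py could be parsed with argparse as in the sketch below. The function name parse_scrape_args, the default values, and the reading of 1 as "headless on" are assumptions made for this example; only the flag names and types come from the argument list above.

```python
import argparse


def parse_scrape_args(argv=None):
    """Parse the scraper's command-line flags (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="Scrape messages from a Twitch chat.")
    parser.add_argument("--stream_name", required=True,
                        help="stream whose chat is opened")
    # Default values below are assumptions chosen for this sketch.
    parser.add_argument("--delay", type=float, default=2.0,
                        help="seconds to wait for the page to load")
    parser.add_argument("--headless", type=int, choices=(0, 1), default=0,
                        help="1 for headless mode, 0 for a visible window (assumed)")
    parser.add_argument("--texts_filename", required=True,
                        help="name of the file for scraped texts")
    args = parser.parse_args(argv)
    # Convert the 0/1 integer into a real boolean for later use.
    args.headless = bool(args.headless)
    return args


args = parse_scrape_args(
    ["--stream_name=xqc", "--delay=2", "--headless=0", "--texts_filename=my_texts"]
)
```

Restricting --headless to the choices 0 and 1 makes argparse reject any other value with a usage error instead of silently treating it as true.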
This project is licensed under the terms of the MIT license. You can check out the full license here