Introduction

This repository contains my full solution to the coding challenge set by Visable as part of its working student hiring process.

The challenge is to train a model that classifies German search queries from a provided dataset, and to serve the final model as a REST API built with FastAPI. The deployment should also be dockerized.

Model

'Visable_coding_challenge.py' contains the fully documented code and the models used. The rest of this section describes my process.

Data importing, inspection, and preprocessing:

After importing the data, I checked how many entries and classes the dataset contains and how many null values it has.
The data has null values, non-German characters, and numbers in the text column.
Using a regular expression, I made sure that only German characters remain in the dataset. Then I used the spaCy German model to lemmatize the text and remove punctuation and numbers.
I then dropped all null values from the dataset.
The data is now ready for feature extraction.
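
The cleaning step above can be sketched roughly as follows. This is a minimal illustration and not the code from 'Visable_coding_challenge.py': the function names `clean_query` and `clean_dataset`, the exact character set, and the sample rows are assumptions, and the spaCy lemmatization step is left out so the sketch runs with the standard library alone.

```python
import re

# Assumption: "German characters" means ASCII letters plus the umlauts and
# eszett. Everything else (digits, punctuation, other scripts) is removed.
NON_GERMAN = re.compile(r"[^A-Za-zÄÖÜäöüẞß\s]")


def clean_query(text):
    """Strip non-German characters; return None if nothing is left."""
    if text is None:
        return None
    cleaned = NON_GERMAN.sub(" ", text)
    # Collapse the whitespace runs left behind by removed characters.
    cleaned = " ".join(cleaned.split())
    return cleaned or None


def clean_dataset(rows):
    """Clean (text, label) pairs and drop rows with a missing text or label."""
    cleaned_rows = []
    for text, label in rows:
        cleaned = clean_query(text)
        if cleaned is not None and label is not None:
            cleaned_rows.append((cleaned, label))
    return cleaned_rows


# Hypothetical rows; the labels are placeholders, not the real class names.
sample = [("Bürostuhl #1", "label_a"), (None, "label_b"), ("42", "label_c")]
print(clean_dataset(sample))
```

In this sketch, queries that become empty after cleaning are treated like null values and dropped, which is an assumption about how such rows should be handled.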

TF-IDF:

I used TF-IDF as the feature extractor because it is known to give good results in text classification. After this step, the data is split into train and test sets, with the test set making up 20% of the whole data.
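
A minimal sketch of this step, assuming scikit-learn is the library in use (the text above does not name it). The sample queries, labels, and `random_state` value are hypothetical and only there to make the example runnable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical cleaned queries and placeholder labels.
texts = [
    "bürostuhl leder", "schreibtisch holz", "stahlrohr verzinkt",
    "kupferrohr meter", "drehstuhl schwarz", "aluminium blech",
    "regal metall", "edelstahl schraube", "sofa stoff", "messing mutter",
]
labels = ["moebel", "moebel", "metall", "metall", "moebel",
          "metall", "moebel", "metall", "moebel", "metall"]

# Turn each query into a sparse TF-IDF vector.
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(texts)

# Hold out 20% of the rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

Fitting the vectorizer before splitting follows the order described above. A common alternative is to fit it on the training portion only, so that vocabulary from the test set does not influence the features.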

Models:

First, I used the Multinomial Naive Bayes model. I prefer to start with the simplest approach and add complexity only as needed. Naive Bayes is known to perform well on text classification, and it gave good results here, as seen in the photo below.

To increase the accuracy, I moved to the Random Forest classifier, which gives higher results than Naive Bayes, as seen in the photo below.

I then had the idea of fine-tuning the German BERT base model on the dataset and feeding its embeddings to the Random Forest classifier. This gave higher results still, as expected, as seen in the photo below.
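
The first two models can be compared with a sketch like the one below. It assumes scikit-learn, uses hypothetical toy data, and introduces an illustrative helper named `evaluate`. The hyperparameters are placeholders, not the ones used in 'Visable_coding_challenge.py'. The BERT step is left out because it depends on the fine-tuned checkpoint.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical cleaned queries and placeholder labels.
train_texts = ["bürostuhl leder", "schreibtisch holz", "stahlrohr verzinkt",
               "kupferrohr meter", "drehstuhl schwarz", "aluminium blech"]
train_labels = ["moebel", "moebel", "metall", "metall", "moebel", "metall"]
test_texts = ["stuhl leder", "rohr verzinkt"]
test_labels = ["moebel", "metall"]


def evaluate(classifier):
    """Fit TF-IDF plus the given classifier and return test accuracy."""
    model = make_pipeline(TfidfVectorizer(), classifier)
    model.fit(train_texts, train_labels)
    return accuracy_score(test_labels, model.predict(test_texts))


scores = {
    "naive_bayes": evaluate(MultinomialNB()),
    "random_forest": evaluate(
        RandomForestClassifier(n_estimators=100, random_state=42)
    ),
}
print(scores)
```

On data this small the scores say nothing about the real dataset. The sketch only shows how both classifiers can be trained and scored on the same TF-IDF features.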

Setup

To rerun the models, make sure your environment uses python==3.10.2, then run these commands:

pip3 install -r requirements.txt
python -m spacy download de_core_news_sm
pip3 install transformers[torch]
python3 Visable_coding_challenge.py

There is also a Docker image for testing the API. You can run it with this command:

docker run -d --name api -p 8000:8000 visable_german

This Google Drive link contains the models and checkpoints that the deployment needs, in particular the 'fine_tuned_german_bert' folder, because they are larger than the GitHub file size limit:
- Link: https://drive.google.com/drive/folders/1sO0i3lIITBa1wlMQYhm-LRokjhm-R799?usp=sharing

Contact

Hisham Ali

Email: [email protected]

Mobile: +49 178 8953931
