Name | Student number |
---|---|
魏麟懿 Linyi Wei | 1901212647 |
庞博 Bo Pang | 1901212498 |
赵舒婷 Shuting Zhao | 1901212679 |
Recently, a finance major pretended to be an interviewer from CICC or CITIC in order to collect written-examination answers from interviewees. As such scams keep emerging, the ability to identify fake job postings becomes more important. In the past, people distinguished fake job postings by intuition; for example, an abnormally high wage may suggest a fake posting. Nowadays, big data techniques enable us to process job-posting data and identify fake postings more reliably using models.
The goal of this project is to train classifiers that recognize fake and real job postings
using features such as salary_range, benefits, required_experience, required_education, and so on.
Dataset of real and fake job postings created by Shivam Bansal:
https://www.kaggle.com/shivamb/real-or-fake-fake-jobposting-prediction
Our data preprocessing also referred to this notebook:
https://www.kaggle.com/nikitaalbert/is-this-job-for-real
The dataset initially contains 18 variables: 17 features and one target label.
Variables | Description |
---|---|
job_id | Unique Job ID |
title | The title of the job ad entry. Most likely unique for each entry. |
location | Geographical location of the job ad: Country, State, City. |
department | Corporate department (e.g. sales). Most likely unique for each posting. |
salary_range | Indicative salary range (e.g. 50,000-60,000). From an initial glance at the head it appears blank; however, subsequent analysis shows it is in MIN-MAX format. |
company_profile | A brief company description. |
description | The detailed description of the job ad. |
requirements | Enlisted requirements for the job opening. |
benefits | Listed benefits offered by the employer. |
telecommuting | True for telecommuting positions. |
has_company_logo | True if company logo is present. |
has_questions | True if screening questions are present. |
employment_type | Full-time, Part-time, Contract, etc. |
required_experience | Executive, Entry level, Intern, etc. |
required_education | Doctorate, Master’s Degree, Bachelor, etc. |
industry | Automotive, IT, Health care, Real estate, etc. |
function | Consulting, Engineering, Research, Sales etc. |
fraudulent | target - Classification attribute. |
Let's look at the counts of real and fake posts in relation to the top unique values of each feature. From the graphs, we can see that:
- Fraudulent posts are mostly not telecommuting positions, like real posts.
- Fraudulent posts mostly do not contain a company logo, unlike real posts.
- Fraudulent posts are an even mix of having a questionnaire or not, like real posts.
- Fraudulent posts are mostly full-time, like real posts.
- Fraudulent posts also tend not to specify the required experience and education, like real posts.
We want to see whether the lengths of 'company_profile', 'description', 'requirements', and 'benefits' can be used to detect fake job postings. Fake postings use descriptions, requirements, and benefits of similar length to real ones, which makes them look more credible. The company profile, however, differs: fake postings tend not to post very short or very long company profiles.
The graphs in this section also refer to this notebook:
https://www.kaggle.com/nikitaalbert/is-this-job-for-real
- 'job_id' is unique for each sample and useless for detecting fake jobs.
- 'title' is nearly unique and difficult to handle.
- 'location' is also difficult to handle, and we want to find patterns common to fake postings regardless of location.
- 'department', like the job title, takes many different values.
- 'industry' and 'function' are also difficult to handle and not useful for finding general patterns.
- To avoid text analysis and simplify the model, we replace the text of 'company_profile', 'description', 'requirements', and 'benefits' with its length.
- Map required education levels, including 'Bachelor's Degree', 'High School or equivalent', etc., to integers from 0 to 10. Map required experience levels, including 'Mid-Senior level', 'Associate', etc., to integers from 0 to 5.
- Employment type is one-hot encoded into 4 columns to represent 5 different types.
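The feature-engineering steps above can be sketched as follows. This is a minimal sketch on a toy frame: the column names come from the dataset, while the exact education map (only a subset of the 0-10 levels is shown) and the toy rows are illustrative assumptions.

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Replace free text with its length, ordinal-encode education, one-hot employment type."""
    df = df.copy()
    # Free-text fields are replaced by their character lengths
    for col in ["company_profile", "description", "requirements", "benefits"]:
        df[col] = df[col].fillna("").str.len()
    # Illustrative subset of the 0-10 education map described in the report
    edu_map = {"High School or equivalent": 1, "Bachelor's Degree": 5,
               "Master's Degree": 7, "Doctorate": 10}
    df["required_education"] = df["required_education"].map(edu_map).fillna(0).astype(int)
    # One-hot encode employment type, dropping one level to reduce collinearity
    return pd.get_dummies(df, columns=["employment_type"], drop_first=True)

toy = pd.DataFrame({
    "company_profile": ["We are a firm", None],
    "description": ["Sell things", "Code"],
    "requirements": ["None", "Python"],
    "benefits": [None, "Snacks"],
    "required_education": ["Bachelor's Degree", "Unknown"],
    "employment_type": ["Full-time", "Part-time"],
})
out = preprocess(toy)
```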
After feature trimming, there are 17 features. Meanwhile, one-hot encoding may cause multicollinearity, so we need to select the important features.
We use KNN to perform sequential backward selection.
We finally choose 6 features ('required_education', 'salary_range_min', 'employment_type_Other', 'employment_type_Part-time', 'employment_type_Temporary', 'employment_type_Unknown'), which give a relatively high accuracy of over 0.975.
After feature selection, training accuracy is 0.979 and test accuracy is 0.977, within 0.5 percentage points of the accuracy before selection, so selecting six features is reasonable.
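The sequential backward selection with a KNN scorer can be sketched with scikit-learn's `SequentialFeatureSelector`; this runs on synthetic stand-in data, and the original code (and its exact KNN settings) may differ.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the 17 post-trimming features
X, y = make_classification(n_samples=400, n_features=17, n_informative=6,
                           random_state=0)

# Backward selection guided by cross-validated KNN accuracy, keeping 6 features
knn = KNeighborsClassifier(n_neighbors=5)
sfs = SequentialFeatureSelector(knn, n_features_to_select=6,
                                direction="backward", cv=5)
sfs.fit(X, y)
selected = sfs.get_support(indices=True)  # indices of the 6 kept columns
```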
After the KNN-based feature selection, we also test the correlation among these features. The results are shown below.
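A minimal version of such a correlation check, on a random toy stand-in for the selected columns (the column names are taken from the report, the data is not), might look like:

```python
import numpy as np
import pandas as pd

# Toy stand-in for three of the six selected feature columns
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 3)),
                 columns=["required_education", "salary_range_min",
                          "employment_type_Other"])

# Pairwise Pearson correlations; off-diagonal values near 0 mean weak collinearity
corr = X.corr()

# Flag any strongly correlated pair (|r| > 0.8) as a multicollinearity warning
pairs = [(a, b) for a in corr.columns for b in corr.columns
         if a < b and abs(corr.loc[a, b]) > 0.8]
```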
For the target variable and the six X variables obtained in 3.2, we fit models with three classification methods (LR, SVM, and Decision Tree), calling the GridSearchCV package to carry out cross-validation.
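The grid search with precision as the scoring criterion (y = 0, the real-JD class, treated as the positive label, as explained in the next paragraph) can be sketched as follows; the parameter grids and the synthetic data are illustrative assumptions, not the values actually searched.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, precision_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the six selected features
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Precision with y = 0 as the positive (real-JD) label
pre_scorer = make_scorer(precision_score, pos_label=0)

models = {
    "LR": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
    "SVM": (SVC(kernel="linear"), {"C": [0.1, 1, 10]}),
    "Tree": (DecisionTreeClassifier(random_state=0), {"max_depth": [5, 10, 14]}),
}

# Cross-validated search for each model's best hyperparameters
best = {}
for name, (est, grid) in models.items():
    gs = GridSearchCV(est, grid, scoring=pre_scorer, cv=5)
    gs.fit(X, y)
    best[name] = gs.best_params_
```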
As mentioned in the motivation section, many fake JD senders ask applicants to solve additional problems, and those answers are often used for profit. Our goal is to help job seekers avoid the loss of time, energy, and personal information caused by applying to fake JDs. In our code we use y = 0 (real JD) as the positive label, so we use precision (PRE) as the cross-validation selection and evaluation criterion: the larger the PRE, the smaller the FP/P ratio, and the easier it is for job seekers to avoid fake JDs.
Listed below are the optimal parameters and the corresponding confusion matrix obtained for each model.
Model Type | PRE | REC | F1-score |
---|---|---|---|
LR | 0.951 | 1.000 | 0.975 |
SVM (linear kernel only, due to CPU limits) | 0.951 | 1.000 | 0.975 |
Tree | 0.957 | 0.997 | 0.977 |
Among these 3 methods, the Decision Tree gives the best result with PRE = 95.7%. Notice that both LR and SVM give the same PRE as simply trusting every JD, so we suspect there may be problems with the KNN feature selection. This is why we try the PCA method in the next part.
In this part, we use PCA to replace the KNN-based selection of 3.2. We set the number of components to n = 2 for faster runtime, then run the LR/SVM/Tree algorithms again. Notice that in the code we use a pipeline to package these steps together. The following table illustrates the final results for our models.
Model Type | PRE* | REC | F1-score |
---|---|---|---|
PCA_LR | 0.884 | 0.936 | 0.909 |
PCA_SVM(rbf) | 0.972 | 0.919 | 0.945 |
PCA_Tree(max_depth=14) | 0.974* | 0.875 | 0.922 |
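The PCA pipeline described above can be sketched as follows (synthetic stand-in data; the scaling step is our assumption, while n_components=2 and max_depth=14 follow the table):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the six selected features
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Scale, project onto 2 principal components, then classify
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("tree", DecisionTreeClassifier(max_depth=14, random_state=0)),
])
pipe.fit(X, y)
n_components_used = pipe.named_steps["pca"].n_components_
```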
We can see that for Logistic Regression the PCA method is worse than KNN selection, but SVM and Decision Tree show the opposite, which means PCA captures some information that the KNN selection misses. (Note: with only two X features, we can afford a finer grid of hyperparameters, so the improvement in SVM may be due to this cross-validation change.)
Notice that PCA-Tree, with 0.974 PRE, is the best of these methods. We draw an ROC curve to analyze this model further.
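Computing the ROC curve for such a PCA + tree model might look like the sketch below (synthetic data; the real analysis uses the report's features and a plotting step we omit here).

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

pipe = Pipeline([("pca", PCA(n_components=2)),
                 ("tree", DecisionTreeClassifier(max_depth=14, random_state=0))])
pipe.fit(X_tr, y_tr)

# ROC needs a continuous score: use the predicted class-1 probability
scores = pipe.predict_proba(X_te)[:, 1]
fpr, tpr, _ = roc_curve(y_te, scores)
auc = roc_auc_score(y_te, scores)
```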
In this part, we use both the PCA dimensionality-reduced data and the KNN-selected features. Each dataset is used to train three ensemble models: Random Forest, Bagging, and Adaboost.
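Training the three ensemble models can be sketched as follows (synthetic stand-in data; the estimator counts are illustrative assumptions, not the tuned values):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

# Stand-in for either the KNN-selected or PCA-reduced feature matrix
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

ensembles = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "Bagging": BaggingClassifier(n_estimators=50, random_state=0),
    "Adaboost": AdaBoostClassifier(n_estimators=100, random_state=0),
}

# 5-fold cross-validated accuracy for each ensemble
cv_acc = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in ensembles.items()}
```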
To show our results more succinctly, they are summarized in the following table.
Model Type | PRE* | REC | F1-score |
---|---|---|---|
KNN_RF | 0.991 | 0.988 | 0.989 |
KNN_Bagging | 0.993 | 0.987 | 0.990 |
KNN_Adaboost | 0.994 | 0.976 | 0.985 |
PCA_RF | 0.943 | 0.941 | 0.942 |
PCA_Bagging | 0.946 | 0.943 | 0.945 |
PCA_Adaboost | 0.848 | 0.960 | 0.901 |
We can see that the models based on the KNN-selected data show higher precision and are therefore better. The likely reason is that we use only 2 PCA components: since Part 3.3 shows the feature correlations are very weak, 2 components are not enough to represent all the features and explain the results.
As mentioned above, we care most about precision. Among the models based on the KNN-selected data, Adaboost shows the best result.
Generally speaking, the following table shows all the trained model results based on both the PCA and KNN data.
Model Type | PRE* | REC | F1-score |
---|---|---|---|
KNN_LR | 0.951 | 1.000 | 0.975 |
KNN_SVM(line) | 0.951 | 1.000 | 0.975 |
KNN_Tree(max_depth=14) | 0.957 | 0.997 | 0.977 |
PCA_LR | 0.884 | 0.936 | 0.909 |
PCA_SVM(rbf) | 0.972 | 0.919 | 0.945 |
PCA_Tree(max_depth=14) | 0.974* | 0.875 | 0.922 |
KNN_RF | 0.991 | 0.988 | 0.989 |
KNN_Bagging | 0.993 | 0.987 | 0.990 |
KNN_Adaboost | 0.994 | 0.976 | 0.985 |
PCA_RF | 0.943 | 0.941 | 0.942 |
PCA_Bagging | 0.946 | 0.943 | 0.945 |
PCA_Adaboost | 0.848 | 0.960 | 0.901 |
Based on the table, Adaboost based on KNN data shows the best results.
Our raw data contains many more real job descriptions than fake ones: if you picked a posting at random, the probability of choosing a fake one would be very low. In other words, the raw data is very imbalanced. We therefore upsample the minority class to see whether our models still work.
Firstly, do the upsampling;
Secondly, run the KNN feature selection — based on the figure below, we again choose 6 features;
Finally, train the models above and show the results.
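The upsampling step can be sketched with scikit-learn's `resample` (a toy imbalanced frame; the 95:5 split here is an illustrative assumption):

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame: 0 = real JD (majority), 1 = fake JD (minority)
df = pd.DataFrame({"x": range(100), "fraudulent": [0] * 95 + [1] * 5})

majority = df[df["fraudulent"] == 0]
minority = df[df["fraudulent"] == 1]

# Sample the minority class with replacement until the classes are balanced
minority_up = resample(minority, replace=True, n_samples=len(majority),
                       random_state=0)
balanced = pd.concat([majority, minority_up])
```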
Model Type | PRE* | REC | F1-score |
---|---|---|---|
KNN_LR | 0.627 | 0.347 | 0.447 |
KNN_SVM(line) | 0.631 | 0.355 | 0.455 |
KNN_Tree(max_depth=15) | 0.892 | 0.860 | 0.876 |
PCA_LR | 0.558 | 0.534 | 0.546 |
PCA_SVM(rbf) | 0.553 | 0.547 | 0.550 |
PCA_Tree(max_depth=15) | 0.914 | 0.827 | 0.868 |
KNN_RF | 0.991 | 0.988 | 0.989 |
KNN_Bagging | 0.993 | 0.987 | 0.990 |
KNN_Adaboost | 0.994 | 0.976 | 0.985 |
PCA_RF | 1.000 | 0.927 | 0.962 |
PCA_Bagging | 1.000 | 0.927 | 0.962 |
PCA_Adaboost | 0.677 | 0.614 | 0.644 |
Random Forest and Bagging based on the upsampled PCA data show the best results. After upsampling, the precision of most models tends to decline, which is reasonable because upsampling raises the proportion of fake jobs. These results verify the effectiveness of our prediction approach.