Name | Student number |
---|---|
魏麟懿 Linyi Wei | 1901212647 |
庞博 Bo Pang | 1901212498 |
赵舒婷 Shuting Zhao | 1901212679 |
Recently, a finance major pretended to be an interviewer from CICC or CITIC in order to collect written-examination answers from interviewees. As such scams keep emerging, the ability to identify fake job postings becomes more important. In the past, people distinguished fake job postings by intuition; for example, an abnormally high wage may suggest a fake posting. Nowadays, big data techniques enable us to process job-posting data and identify fake postings more reliably using models.
The goal of this project is to train classifiers that recognize fake and real job postings
using features such as salary_range, benefits, required_experience, required_education, and so on.
Dataset of real and fake job postings created by Shivam Bansal:
https://www.kaggle.com/shivamb/real-or-fake-fake-jobposting-prediction
Our data preprocessing also referred to this notebook:
https://www.kaggle.com/nikitaalbert/is-this-job-for-real
The dataset initially contains 18 variables: 17 features and one target label.
Variables | Description |
---|---|
job_id | Unique Job ID |
title | The title of the job ad entry. Most likely unique for each entry. |
location | Geographical location of the job ad: Country, State, City. |
department | Corporate department (e.g. sales). Most likely unique for each posting. |
salary_range | Indicative salary range (e.g. 50,000-60,000). From an initial glance at the head it appears blank; however, subsequent analysis shows it is in MIN-MAX format. |
company_profile | A brief company description. |
description | The detailed description of the job ad. |
requirements | Enlisted requirements for the job opening. |
benefits | Listed benefits offered by the employer. |
telecommuting | True for telecommuting positions. |
has_company_logo | True if company logo is present. |
has_questions | True if screening questions are present. |
employment_type | Full-time, Part-time, Contract, etc. |
required_experience | Executive, Entry level, Intern, etc. |
required_education | Doctorate, Master’s Degree, Bachelor, etc. |
industry | Automotive, IT, Health care, Real estate, etc. |
function | Consulting, Engineering, Research, Sales etc. |
fraudulent | target - Classification attribute. |
Let's look at the counts of real and fake posts in relation to the top unique values of each feature. From the graphs, we can see that:
- Fraudulent posts are mostly not telecommuting positions, like real posts.
- Fraudulent posts mostly do not contain a company logo, unlike real posts.
- Fraudulent posts are an even mix of having a questionnaire or not, like real posts.
- Fraudulent posts are mostly full-time, like real posts.
- Fraudulent posts also tend not to specify the required experience and education, like real posts.
We want to see whether the lengths of 'company_profile', 'description', 'requirements', and 'benefits' can be used to detect fake job postings. Fake postings use descriptions, requirements, and benefits of similar length to real ones, which makes them look more credible. The company profile, however, differs: fake postings tend not to post very short or very long company profiles.
The graphs in this section also refer to this notebook:
https://www.kaggle.com/nikitaalbert/is-this-job-for-real
- 'job_id' is unique for each sample and useless for detecting fake jobs.
- 'title' is nearly unique and difficult to handle.
- 'location' is also difficult to handle, and we want to find patterns common to fake postings regardless of location.
- 'department', like the job title, takes many different values.
- 'industry' and 'function' are also difficult to handle and not useful for finding general patterns.
- To avoid text analysis and simplify the model, we replace the text of 'company_profile', 'description', 'requirements', and 'benefits' with its length.
- Map required education levels, including 'Bachelor's Degree', 'High School or equivalent', etc., to integers from 0 to 10. Map required experience levels, including 'Mid-Senior level', 'Associate', etc., to integers from 0 to 5.
- Employment type is one-hot encoded into 4 columns to represent 5 different types.
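The feature-engineering steps above can be sketched as follows. This is a minimal sketch on a toy frame: the column names come from the dataset, while the exact education map (only a subset of the 0-10 levels is shown) and the toy rows are illustrative assumptions.

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Replace free text with its length, ordinal-encode education, one-hot employment type."""
    df = df.copy()
    # Free-text fields are replaced by their character lengths
    for col in ["company_profile", "description", "requirements", "benefits"]:
        df[col] = df[col].fillna("").str.len()
    # Illustrative subset of the 0-10 education map described in the report
    edu_map = {"High School or equivalent": 1, "Bachelor's Degree": 5,
               "Master's Degree": 7, "Doctorate": 10}
    df["required_education"] = df["required_education"].map(edu_map).fillna(0).astype(int)
    # One-hot encode employment type, dropping one level to reduce collinearity
    return pd.get_dummies(df, columns=["employment_type"], drop_first=True)

toy = pd.DataFrame({
    "company_profile": ["We are a firm", None],
    "description": ["Sell things", "Code"],
    "requirements": ["None", "Python"],
    "benefits": [None, "Snacks"],
    "required_education": ["Bachelor's Degree", "Unknown"],
    "employment_type": ["Full-time", "Part-time"],
})
out = preprocess(toy)
```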
After feature trimming, there are 17 features. Meanwhile, one-hot encoding may cause multicollinearity, so we need to select the important features.
We use KNN to perform sequential backward selection.
We finally choose 6 features ('required_education', 'salary_range_min', 'employment_type_Other', 'employment_type_Part-time', 'employment_type_Temporary', 'employment_type_Unknown'), which give a relatively high accuracy of over 0.975.
After feature selection, training accuracy is 0.979 and test accuracy is 0.977, within 0.5 percentage points of the accuracy before selection, so selecting six features is reasonable.
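The sequential backward selection with a KNN scorer can be sketched with scikit-learn's `SequentialFeatureSelector`; this runs on synthetic stand-in data, and the original code (and its exact KNN settings) may differ.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the 17 post-trimming features
X, y = make_classification(n_samples=400, n_features=17, n_informative=6,
                           random_state=0)

# Backward selection guided by cross-validated KNN accuracy, keeping 6 features
knn = KNeighborsClassifier(n_neighbors=5)
sfs = SequentialFeatureSelector(knn, n_features_to_select=6,
                                direction="backward", cv=5)
sfs.fit(X, y)
selected = sfs.get_support(indices=True)  # indices of the 6 kept columns
```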
After the KNN-based feature selection, we also test the correlation among these features. The results are shown below.
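A minimal version of such a correlation check, on a random toy stand-in for the selected columns (the column names are taken from the report, the data is not), might look like:

```python
import numpy as np
import pandas as pd

# Toy stand-in for three of the six selected feature columns
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 3)),
                 columns=["required_education", "salary_range_min",
                          "employment_type_Other"])

# Pairwise Pearson correlations; off-diagonal values near 0 mean weak collinearity
corr = X.corr()

# Flag any strongly correlated pair (|r| > 0.8) as a multicollinearity warning
pairs = [(a, b) for a in corr.columns for b in corr.columns
         if a < b and abs(corr.loc[a, b]) > 0.8]
```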
For the target variable and the six X variables obtained in 3.2, we fit models with three classification methods (LR, SVM, and Decision Tree), calling the GridSearchCV package to carry out cross-validation.
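The grid search with precision as the scoring criterion (y = 0, the real-JD class, treated as the positive label, as explained in the next paragraph) can be sketched as follows; the parameter grids and the synthetic data are illustrative assumptions, not the values actually searched.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, precision_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the six selected features
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Precision with y = 0 as the positive (real-JD) label
pre_scorer = make_scorer(precision_score, pos_label=0)

models = {
    "LR": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
    "SVM": (SVC(kernel="linear"), {"C": [0.1, 1, 10]}),
    "Tree": (DecisionTreeClassifier(random_state=0), {"max_depth": [5, 10, 14]}),
}

# Cross-validated search for each model's best hyperparameters
best = {}
for name, (est, grid) in models.items():
    gs = GridSearchCV(est, grid, scoring=pre_scorer, cv=5)
    gs.fit(X, y)
    best[name] = gs.best_params_
```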
As mentioned in the motivation section, many fake JD senders ask applicants to solve additional problems, and those answers are often used for profit. Our goal is to help job seekers avoid the loss of time, energy, and personal information caused by applying to fake JDs. In our code we use y = 0 (real JD) as the positive label, so we use precision (PRE) as the cross-validation selection and evaluation criterion: the larger the PRE, the smaller the FP/P ratio, and the easier it is for job seekers to avoid fake JDs.
Listed below are the optimal parameters and the corresponding confusion matrix obtained for each model.
Model Type | PRE | REC | F1-score |
---|---|---|---|
LR | 0.951 | 1.000 | 0.975 |
SVM (linear kernel only, due to CPU limits) | 0.951 | 1.000 | 0.975 |
Tree | 0.957 | 0.997 | 0.977 |
Among these 3 methods, the Decision Tree gives the best result with PRE = 95.7%. Notice that both LR and SVM give the same PRE as simply trusting every JD, so we suspect there may be problems with the KNN feature selection. This is why we try the PCA method in the next part.
In this part, we use PCA to replace the KNN-based selection of 3.2. We set the number of components to n = 2 for faster runtime, then run the LR/SVM/Tree algorithms again. Notice that in the code we use a pipeline to package these steps together. The following table illustrates the final results for our models.
Model Type | PRE* | REC | F1-score |
---|---|---|---|
PCA_LR | 0.884 | 0.936 | 0.909 |
PCA_SVM(rbf) | 0.972 | 0.919 | 0.945 |
PCA_Tree(max_depth=14) | 0.974* | 0.875 | 0.922 |
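The PCA pipeline described above can be sketched as follows (synthetic stand-in data; the scaling step is our assumption, while n_components=2 and max_depth=14 follow the table):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the six selected features
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Scale, project onto 2 principal components, then classify
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("tree", DecisionTreeClassifier(max_depth=14, random_state=0)),
])
pipe.fit(X, y)
n_components_used = pipe.named_steps["pca"].n_components_
```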
We can see that for Logistic Regression the PCA method is worse than KNN selection, but SVM and Decision Tree show the opposite, which means PCA captures some information that the KNN selection misses. (Note: with only two X features, we can afford a finer grid of hyperparameters, so the improvement in SVM may be due to this cross-validation change.)
Notice that PCA-Tree, with 0.974 PRE, is the best of these methods. We draw an ROC curve to analyze this model further.
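Computing the ROC curve for such a PCA + tree model might look like the sketch below (synthetic data; the real analysis uses the report's features and a plotting step we omit here).

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

pipe = Pipeline([("pca", PCA(n_components=2)),
                 ("tree", DecisionTreeClassifier(max_depth=14, random_state=0))])
pipe.fit(X_tr, y_tr)

# ROC needs a continuous score: use the predicted class-1 probability
scores = pipe.predict_proba(X_te)[:, 1]
fpr, tpr, _ = roc_curve(y_te, scores)
auc = roc_auc_score(y_te, scores)
```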
In this part, we use both the PCA dimensionality-reduced data and the KNN-selected features. Each dataset is used to train three ensemble models: Random Forest, Bagging, and Adaboost.
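Training the three ensemble models can be sketched as follows (synthetic stand-in data; the estimator counts are illustrative assumptions, not the tuned values):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

# Stand-in for either the KNN-selected or PCA-reduced feature matrix
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

ensembles = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "Bagging": BaggingClassifier(n_estimators=50, random_state=0),
    "Adaboost": AdaBoostClassifier(n_estimators=100, random_state=0),
}

# 5-fold cross-validated accuracy for each ensemble
cv_acc = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in ensembles.items()}
```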
To show our results more succinctly, they are summarized in the following table.
Model Type | PRE* | REC | F1-score |
---|---|---|---|
KNN_RF | 0.991 | 0.988 | 0.989 |
KNN_Bagging | 0.993 | 0.987 | 0.990 |
KNN_Adaboost | 0.994 | 0.976 | 0.985 |
PCA_RF | 0.943 | 0.941 | 0.942 |
PCA_Bagging | 0.946 | 0.943 | 0.945 |
PCA_Adaboost | 0.848 | 0.960 | 0.901 |
We can see that the models based on the KNN-selected data show higher precision and are therefore better. The likely reason is that we use only 2 PCA components: since Part 3.3 shows the feature correlations are very weak, 2 components are not enough to represent all the features and explain the results.
As mentioned above, we care most about precision. Among the models based on the KNN-selected data, Adaboost shows the best result.
Generally speaking, the following table shows all the trained model results based on both the PCA and KNN data.
Model Type | PRE* | REC | F1-score |
---|---|---|---|
KNN_LR | 0.951 | 1.000 | 0.975 |
KNN_SVM(line) | 0.951 | 1.000 | 0.975 |
KNN_Tree(max_depth=14) | 0.957 | 0.997 | 0.977 |
PCA_LR | 0.884 | 0.936 | 0.909 |
PCA_SVM(rbf) | 0.972 | 0.919 | 0.945 |
PCA_Tree(max_depth=14) | 0.974* | 0.875 | 0.922 |
KNN_RF | 0.991 | 0.988 | 0.989 |
KNN_Bagging | 0.993 | 0.987 | 0.990 |
KNN_Adaboost | 0.994 | 0.976 | 0.985 |
PCA_RF | 0.943 | 0.941 | 0.942 |
PCA_Bagging | 0.946 | 0.943 | 0.945 |
PCA_Adaboost | 0.848 | 0.960 | 0.901 |
Based on the table, Adaboost based on KNN data shows the best results.
Our raw data contains many more real job descriptions than fake ones: if you picked a posting at random, the probability of choosing a fake one would be very low. In other words, the raw data is very imbalanced. We therefore upsample the minority class to see whether our models still work.
Firstly, do the upsampling;
Secondly, run the KNN feature selection — based on the figure below, we again choose 6 features;
Finally, train the models above and show the results.
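The upsampling step can be sketched with scikit-learn's `resample` (a toy imbalanced frame; the 95:5 split here is an illustrative assumption):

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame: 0 = real JD (majority), 1 = fake JD (minority)
df = pd.DataFrame({"x": range(100), "fraudulent": [0] * 95 + [1] * 5})

majority = df[df["fraudulent"] == 0]
minority = df[df["fraudulent"] == 1]

# Sample the minority class with replacement until the classes are balanced
minority_up = resample(minority, replace=True, n_samples=len(majority),
                       random_state=0)
balanced = pd.concat([majority, minority_up])
```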
Model Type | PRE* | REC | F1-score |
---|---|---|---|
KNN_LR | 0.627 | 0.347 | 0.447 |
KNN_SVM(line) | 0.631 | 0.355 | 0.455 |
KNN_Tree(max_depth=15) | 0.892 | 0.860 | 0.876 |
PCA_LR | 0.558 | 0.534 | 0.546 |
PCA_SVM(rbf) | 0.553 | 0.547 | 0.550 |
PCA_Tree(max_depth=15) | 0.914 | 0.827 | 0.868 |
KNN_RF | 0.991 | 0.988 | 0.989 |
KNN_Bagging | 0.993 | 0.987 | 0.990 |
KNN_Adaboost | 0.994 | 0.976 | 0.985 |
PCA_RF | 1.000 | 0.927 | 0.962 |
PCA_Bagging | 1.000 | 0.927 | 0.962 |
PCA_Adaboost | 0.677 | 0.614 | 0.644 |
Random Forest and Bagging based on the upsampled PCA data show the best results. After upsampling, the precision of most models tends to decline, which is reasonable because upsampling raises the proportion of fake jobs. These results verify the effectiveness of our prediction approach.