Project Organization

├──          <- The top-level README for developers using this project.
├── data
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets.
│   └── raw            <- The original, immutable data dump.
├── models             <- Trained and serialized models, model predictions, or model summaries.
├── notebooks          <- Jupyter notebooks with steps for training and evaluating models.
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
└── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.

Project based on the cookiecutter data science project template. #cookiecutterdatascience



In the context of marketplaces, an algorithm is needed to predict if an item listed is new or used.

Your tasks involve the data analysis, designing, processing and modeling of a machine learning solution to predict if an item is new or used and then evaluate the model over held-out test data.

To assist in that task a dataset is provided in MLA_100k_checked_v3.jsonlines.

For the evaluation, you will use the accuracy metric and must reach a minimum of 0.86. Additionally, you will have to choose an appropriate secondary metric and explain why that metric was chosen.

The deliverables are:

  • The file, including all the code needed to define and evaluate a model.
  • A document with an explanation of the criteria applied to choose the features, the proposed secondary metric and the performance achieved on those metrics.
  • Optionally, you can deliver an EDA analysis in another format, such as .ipynb

Summary of Conclusions

  1. You will find our first selected columns at section 2.1, then you can check our definitive columns after treatment and feature engineering.

  2. We didn't predict our classes (0 and 1), but we decided to predict the probability for our binary classification problem, since it's more meaningful (literally, we calculate the probability to be class 0 or 1). Thus, we didn't calculate accuracy, precision, recall, F1-score, Kappa or other metrics. For probability evaluation, we opted to use mean squared error, log loss and Brier score (lower is better). We also used the ROC curve to evaluate the model and calculated ROC AUC score (higher is better).

  3. Our metrics only make sense when compared between models. We compared five models.

    (a) Our first model is the baseline: a logistic regression with default parameters, which gave a poor result with a score of 0.69. We mostly used it because applying a linear model can help to get insights from the data;

    (b) Our second model is more complex and less interpretable: an ensemble of non-linear hierarchical tree models, called XGBoost. We got a ROC AUC of 0.89, which is better than our baseline and a good result, although at a high computational cost;

    (c) For our third model, we first used embeddings (neural networks) to encode our high-cardinality categorical features (category and seller city), for which we couldn't use One Hot Encoding (due to computational cost) or Label Encoding (since the unique values are independent of each other). After that, we just applied a Logistic Regression. Impressively, we got a ROC AUC of 0.9 with a simple linear model for binary classification;

    (d) For our fourth model we also used embeddings for encoding, but then used XGBoost. We got a ROC AUC of 0.93, a better result than the previous model, as we expected;

    (e) Finally, for our fifth model we also used embeddings for encoding, but then used a four-layer Neural Network. We got nearly the same results as the fourth model;

  • REMARKS: Personally, I think the embeddings encoding with Logistic Regression is the best model, because it's simpler and has more interpretability. Occam's Razor principle states that other things equal, explanations that posit fewer entities, or fewer kinds of entities, are to be preferred to explanations that posit more.
Metrics                  XGBoost  Emb_Logistic  Emb_XGBoost  Emb_NNet
mean_squared_error_test     0.36          0.36         0.32      0.32
Roc_auc                     0.89          0.90         0.93      0.93
Brier_error                 0.13          0.12         0.10      0.10
Logloss                     0.40          0.39         0.34      0.35
- The Mean Squared Error (or MSE) is much like the mean absolute error in that it provides a rough idea of the magnitude of error. 
Taking the square root of the mean squared error converts the units back to the original units of the output variable and can be meaningful for description and presentation. 
This is called the Root Mean Squared Error (or RMSE).

- Logistic loss (or log loss) is a performance metric for evaluating the predictions of probabilities of membership to a given class. 
The scalar probability between 0 and 1 can be seen as a measure of confidence for a prediction by an algorithm. Predictions that are correct or incorrect are rewarded or punished proportionally to the confidence of the prediction.
It heavily penalizes predicted probabilities far away from their expected value.

- The Brier score calculates the mean squared error between predicted probabilities and the expected values. 
It's gentler than log loss but still penalizes predictions in proportion to their distance from the expected value.

- Area Under ROC Curve (or ROC AUC for short) is a performance metric for binary classification problems. 
The AUC represents a model’s ability to discriminate between positive and negative classes. 
An area of 1.0 represents a model that made all predictions perfectly. An area of 0.5 represents a model as good as random. 
A ROC Curve is a plot of the true positive rate and the false positive rate for a given set of probability predictions at different thresholds used to map the probabilities to class labels. 
The area under the curve is then the approximate integral under the ROC Curve.
The area under the ROC curve summarizes the likelihood of the model assigning a higher probability to a positive case than to a negative case.
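As a rough illustration of the probability metrics described above, the sketch below computes log loss, the Brier score and ROC AUC in plain Python. The labels and probabilities are made up for illustration, and the helper names (`log_loss`, `brier_score`, `roc_auc`) are hypothetical; they are not the functions used to produce the table in this README.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def roc_auc(y_true, y_prob):
    """Share of (positive, negative) pairs where the positive gets the higher score; ties count half."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: two negatives and two positives
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

print("log loss:", log_loss(y_true, y_prob))
print("Brier   :", brier_score(y_true, y_prob))
print("ROC AUC :", roc_auc(y_true, y_prob))
```

The pairwise form of `roc_auc` mirrors the interpretation given above: it counts how often a positive case is ranked above a negative one. It is quadratic in the number of samples, so it is only meant for small illustrations.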



  • numpy
  • pandas
  • re
  • matplotlib
  • seaborn
  • embedding_encoder
  • sklearn
  • xgboost
  • keras


  • Python version 3.9
  • Git


  • VS Studio
  • Jupyter IPython


  • Github


1.Load Data

import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'
dfs = pd.read_json('MLA_100k_checked_v3.jsonlines', lines=True)
dfs = dfs.rename(columns = {'tags':'tag'})
dfs = dfs.rename(columns = {'id':'Id'})

1.1.Get features from dictionary columns

# Get region
dfs['seller_country'] = dfs.apply(lambda x : x['seller_address']['country']['name'], axis = 1)
dfs['seller_state'] = dfs.apply(lambda x : x['seller_address']['state']['name'], axis = 1)
dfs['seller_city'] = dfs.apply(lambda x : x['seller_address']['city']['name'], axis = 1)
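# Transform id (named as descriptions) column to get data
import ast
def str_to_dict(column):
    # Each entry is a list holding one stringified dict; parse it in place
    for i in range(len(column)):
        try:
            column[i] = ast.literal_eval(column[i][0])
        except (ValueError, SyntaxError, IndexError, TypeError):
            pass  # leave entries that cannot be parsed unchanged
str_to_dict(dfs['descriptions'])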

# get data from descriptions and shipping 
dfs = pd.concat([dfs, dfs["descriptions"].apply(pd.Series)], axis=1)
dfs = pd.concat([dfs, dfs["shipping"].apply(pd.Series)], axis=1)
pd.set_option('display.max_columns', None)
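The nested lookups above assume that every record carries the full `seller_address` structure. A minimal sketch of a more defensive lookup, in plain Python, is shown below. The helper `get_nested` and the sample records are hypothetical and only mimic the shape of the `seller_address` field shown in this section.

```python
def get_nested(record, path, default=None):
    """Follow a sequence of keys into nested dicts; return default if any step is missing."""
    current = record
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

# Made-up records shaped like the 'seller_address' field
rows = [
    {"seller_address": {"country": {"name": "Argentina", "id": "AR"},
                        "state": {"name": "Capital Federal"},
                        "city": {"name": "Boedo"}}},
    {"seller_address": {"country": {"name": "Argentina", "id": "AR"},
                        "state": {},      # 'name' key is missing
                        "city": None}},   # no dict at all
]

flat = [
    {
        "seller_country": get_nested(r, ["seller_address", "country", "name"]),
        "seller_state": get_nested(r, ["seller_address", "state", "name"]),
        "seller_city": get_nested(r, ["seller_address", "city", "name"]),
    }
    for r in rows
]
print(flat)
```

With this approach an incomplete address yields `None` instead of raising an exception, which can then be handled together with the other missing values.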
seller_address warranty sub_status condition deal_ids base_price shipping non_mercado_pago_payment_methods seller_id variations site_id listing_type_id price attributes buying_mode tag listing_source parent_item_id coverage_areas category_id descriptions last_updated international_delivery_mode pictures Id official_store_id differential_pricing accepts_mercadopago original_price currency_id thumbnail title automatic_relist date_created secure_thumbnail stop_time status video_id catalog_product_id subtitle initial_quantity start_time permalink sold_quantity available_quantity seller_country seller_state seller_city 0 id local_pick_up methods tags free_shipping mode dimensions free_methods
0 {'country': {'name': 'Argentina', 'id': 'AR'},... None [] new [] 80.0 {'local_pick_up': True, 'methods': [], 'tags':... [{'description': 'Transferencia bancaria', 'id... 8208882349 [] MLA bronze 80.0 [] buy_it_now [dragged_bids_and_visits] MLA6553902747 [] MLA126406 {'id': 'MLA4695330653-912855983'} 2015-09-05T20:42:58.000Z none [{'size': '500x375', 'secure_url': 'https://a2... MLA4695330653 NaN NaN True NaN ARS Auriculares Samsung Originales Manos Libres Ca... False 2015-09-05T20:42:53.000Z 2015-11-04 20:42:53 active None NaN NaN 1 2015-09-05 20:42:53 0 1 Argentina Capital Federal San Cristóbal NaN MLA4695330653-912855983 True [] [] False not_specified None NaN
1 {'country': {'name': 'Argentina', 'id': 'AR'},... NUESTRA REPUTACION [] used [] 2650.0 {'local_pick_up': True, 'methods': [], 'tags':... [{'description': 'Transferencia bancaria', 'id... 8141699488 [] MLA silver 2650.0 [] buy_it_now [] MLA7727150374 [] MLA10267 {'id': 'MLA7160447179-930764806'} 2015-09-26T18:08:34.000Z none [{'size': '499x334', 'secure_url': 'https://a2... MLA7160447179 NaN NaN True NaN ARS Cuchillo Daga Acero Carbón Casco Yelmo Solinge... False 2015-09-26T18:08:30.000Z 2015-11-25 18:08:30 active None NaN NaN 1 2015-09-26 18:08:30 0 1 Argentina Capital Federal Buenos Aires NaN MLA7160447179-930764806 True [] [] False me2 None NaN
2 {'country': {'name': 'Argentina', 'id': 'AR'},... None [] used [] 60.0 {'local_pick_up': True, 'methods': [], 'tags':... [{'description': 'Transferencia bancaria', 'id... 8386096505 [] MLA bronze 60.0 [] buy_it_now [dragged_bids_and_visits] MLA6561247998 [] MLA1227 {'id': 'MLA7367189936-916478256'} 2015-09-09T23:57:10.000Z none [{'size': '375x500', 'secure_url': 'https://a2... MLA7367189936 NaN NaN True NaN ARS Antigua Revista Billiken, N° 1826, Año 1954 False 2015-09-09T23:57:07.000Z 2015-11-08 23:57:07 active None NaN NaN 1 2015-09-09 23:57:07 0 1 Argentina Capital Federal Boedo NaN MLA7367189936-916478256 True [] [] False me2 None NaN
3 {'country': {'name': 'Argentina', 'id': 'AR'},... None [] new [] 580.0 {'local_pick_up': True, 'methods': [], 'tags':... [{'description': 'Transferencia bancaria', 'id... 5377752182 [] MLA silver 580.0 [] buy_it_now [] None [] MLA86345 {'id': 'MLA9191625553-932309698'} 2015-10-05T16:03:50.306Z none [{'size': '441x423', 'secure_url': 'https://a2... MLA9191625553 NaN NaN True NaN ARS Alarma Guardtex Gx412 Seguridad Para El Automo... False 2015-09-28T18:47:56.000Z 2015-12-04 01:13:16 active None NaN NaN 1 2015-09-28 18:47:56 0 1 Argentina Capital Federal Floresta NaN MLA9191625553-932309698 True [] [] False me2 None NaN
4 {'country': {'name': 'Argentina', 'id': 'AR'},... MI REPUTACION. [] used [] 30.0 {'local_pick_up': True, 'methods': [], 'tags':... [{'description': 'Transferencia bancaria', 'id... 2938071313 [] MLA bronze 30.0 [] buy_it_now [dragged_bids_and_visits] MLA3133256685 [] MLA41287 {'id': 'MLA7787961817-902981678'} 2015-08-28T13:37:41.000Z none [{'size': '375x500', 'secure_url': 'https://a2... MLA7787961817 NaN NaN True NaN ARS Serenata - Jennifer Blake False 2015-08-24T22:07:20.000Z 2015-10-23 22:07:20 active None NaN NaN 1 2015-08-24 22:07:20 0 1 Argentina Buenos Aires Tres de febrero NaN MLA7787961817-902981678 True [] [] False not_specified None NaN
# Get payment methods from dict
def convertCol(x, key, i):
    # Return the field of the i-th payment method, or '' when the listing has fewer methods
    try:
        return x[i][key]
    except (IndexError, KeyError, TypeError):
        return ''

for key in ['description']: #['description','id','type'] -- only description is interesting
    for i in range(0,13):
        dfs[f'payment_{key}{i}'] = dfs['non_mercado_pago_payment_methods'].apply(lambda x: convertCol(x,key,i))
# Collect the unique payment descriptions found in any of the 13 columns
lista_c = []
for i in range(0,13):
    lista = dfs[f'payment_description{i}'].unique()
    lista_c.extend(lista)

desc_uniques = set(lista_c)
{'Acordar con el comprador',
 'American Express',
 'Cheque certificado',
 'Contra reembolso',
 'Giro postal',
 'Mastercard Maestro',
 'Tarjeta de crédito',
 'Transferencia bancaria',
 'Visa Electron'}
# Create one boolean column per payment method (#TODO: Use apply for performance)
for col in desc_uniques:
    col_name=col.replace(' ','_')
    dfs[col_name] = dfs.isin([col]).any(axis=1)

# drop older columns
dfs = dfs.drop(dfs.loc[:, 'payment_description0':'payment_description12'], axis = 1)
import numpy as np
# Replace every falsy value (0, False, '', [] and {}) with NaN, then drop columns that are entirely empty
dfs = dfs.applymap(lambda x: x if x else np.nan)
dfs = dfs.dropna(how='all', axis=1)
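The payment-method step above can be sketched on a small scale as follows. This is a plain-Python illustration with made-up listings, and the helper `payment_flags` is hypothetical; it only shows the idea of turning a list of payment dicts into one boolean flag per method, with spaces replaced by underscores as in the column names above.

```python
def payment_flags(methods, known_descriptions):
    """Return one boolean per known payment description for a single listing."""
    present = {m["description"] for m in methods}
    return {name.replace(" ", "_"): (name in present)
            for name in sorted(known_descriptions)}

# Made-up listings, each with its list of non-MercadoPago payment methods
listings = [
    [{"description": "Transferencia bancaria"}, {"description": "Giro postal"}],
    [{"description": "Contra reembolso"}],
    [],  # a listing without extra payment methods
]

# All descriptions seen anywhere in the data
known = {m["description"] for methods in listings for m in methods}

flags = [payment_flags(methods, known) for methods in listings]
for row in flags:
    print(row)
```

Building the flags directly from each listing's own list avoids scanning every column of the dataframe for each payment method.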

2.Data Transformation

2.1.Change type and filter columns


  • warranty (good and new products have different kinds of warranties)
  • sub_status (a product ad might be suspended due to its condition)
  • base_price (prices differ between used and new items)
  • seller_id (different sellers might sell used or new items)
  • price (price again)
  • buying_mode (the type of buying might imply something)
  • parent_item_id (there might be a correlation between similar products)
  • last_updated (we'll check)
  • id (we'll check)
  • official_store_id (different stores sell different items and conditions)
  • original_price (price again)
  • currency_id (type of payment and currency might depend on the kind of seller and products)
  • title (keep title to find product)
  • automatic_relist (we'll check)
  • stop_time (time might have an influence)
  • status (status might have an influence)
  • video_id (stays; we'll check, but there might be videos for used products)
  • initial_quantity (a good feature, used products have low counts)
  • start_time (time again)
  • sold_quantity (quantity again)
  • available_quantity (quantity again)
  • seller_country, state, city (used or new ads might have an imbalanced distribution between regions)
  • local_pick_up (being new or used might influence whether local pickup is available)
  • free_shipping (big sellers of new products might be more capable of offering free shipping)
  • Contra_reembolso (stays; payment methods matter)
  • Giro_postal (stays)
  • mode (stays; we don't know what it is, but it has no missing values, and 'not_specified' might be more common on used products)
  • tags (we'll check about tags)
  • tag (we'll check about tag)
  • date_created
  • category


  • Cheque_certificado
  • Mastercard_Maestro
  • Diners
  • Transferencia_bancaria
  • accepts_mercadopago
  • MercadoPago
  • Efectivo
  • Tarjeta_de_crédito
  • American_Express
  • MasterCard
  • Visa_Electron
  • Visa
  • Acordar_con_el_comprador


  • seller_address (too specific)
  • deal_ids (nothing relevant, we checked)
  • shipping (nothing relevant, we checked)
  • non_mercado_pago_payment_methods (transformed)
  • site_id (too specific)
  • listing_type_id (dropped)
  • descriptions (nothing relevant, we checked, turned out to be an id)
  • international_delivery_mode
  • pictures (nothing relevant, we checked)
  • thumbnail (nothing relevant, we checked)
  • secure_thumbnail (nothing relevant, we checked)
  • permalink (nothing relevant, we checked)
  • free_methods (nothing relevant, we checked)


  • variations
  • attributes
  • dimension
# Rename columns
dfs = dfs.rename(columns = {'id':'descr_id', 'Id': 'id'})

# Reorder columns
dfs = dfs[['title', 'condition', 'warranty','initial_quantity', 'available_quantity', 'sold_quantity',
                'sub_status', 'buying_mode', 'original_price', 'base_price', 'price', 'currency_id',
                'seller_country', 'seller_state', 'seller_city', 'Giro_postal',  
                'free_shipping', 'local_pick_up', 'mode', 'tags', 'tag',
                'Contra_reembolso','Acordar_con_el_comprador', 'Cheque_certificado', 'Efectivo', 'Transferencia_bancaria', 'Tarjeta_de_crédito',
                'Mastercard_Maestro', 'MasterCard', 'Visa_Electron', 'Visa', 'Diners', 'American_Express',
                'status', 'automatic_relist',
                'accepts_mercadopago', 'MercadoPago', 
                'id', 'descr_id', 'deal_ids', 'parent_item_id', 'category_id', 'seller_id', 'official_store_id', 'video_id',
                'date_created', 'start_time', 'last_updated', 'stop_time']]
True    97781
Name: accepts_mercadopago, dtype: int64
True    720
Name: MercadoPago, dtype: int64
# Merge columns about same subjects
dfs['accepts_mercadopago'] = dfs['accepts_mercadopago'].fillna(dfs['MercadoPago'])
True    647
Name: MasterCard, dtype: int64
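# Note: as run, this fills from 'MercadoPago'; by analogy with the Visa line below, 'MasterCard' may have been intended
dfs['MasterCard'] = dfs['Mastercard_Maestro'].fillna(dfs['MercadoPago'])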
dfs['Visa'] = dfs['Visa_Electron'].fillna(dfs['Visa'])
True    24638
Name: Tarjeta_de_crédito, dtype: int64
dfs['Tarjeta_de_crédito'] = dfs['Tarjeta_de_crédito'].fillna(dfs['Visa'])
dfs['Tarjeta_de_crédito'] = dfs['Tarjeta_de_crédito'].fillna(dfs['MasterCard'])
dfs['Tarjeta_de_crédito'] = dfs['Tarjeta_de_crédito'].fillna(dfs['Diners'])
dfs['Tarjeta_de_crédito'] = dfs['Tarjeta_de_crédito'].fillna(dfs['American_Express'])
dfs['Tarjeta_de_crédito'] = dfs['Tarjeta_de_crédito'].fillna(dfs['Visa'])
True    25928
Name: Tarjeta_de_crédito, dtype: int64
dfs = dfs.rename(columns = {'Tarjeta_de_crédito':'Aceptan_Tarjeta'})
# Drop used columns
dfs = dfs.drop(columns=['MercadoPago', 'Mastercard_Maestro', 'Visa_Electron'])
dfs = dfs.drop(columns=['Visa', 'MasterCard', 'Diners', 'American_Express'])
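# Treat columns to access data
def try_join(l):
    try:
        return ','.join(map(str, l))
    except TypeError:
        return np.nan

# Note: called on the whole Series, try_join joins all rows into one string that is
# then assigned to every row; dfs['sub_status'].apply(try_join) would join per row
dfs['sub_status'] = try_join(dfs['sub_status'])
dfs['tags'] = try_join(dfs['tags'])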
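The difference between joining each row's list and joining a whole column at once can be seen in a small plain-Python sketch. The values below are made up, and `math.nan` stands in for `np.nan` so the example has no dependencies.

```python
import math

def try_join(values):
    """Join the items of one list into a comma-separated string; NaN if not iterable."""
    try:
        return ",".join(map(str, values))
    except TypeError:
        return math.nan

# Made-up column: each entry is the list of values for one row
column = [["suspended"], [], ["dragged_bids_and_visits", "example_tag"], None]

# Per-row join: one string per row
per_row = [try_join(v) for v in column]
print(per_row)

# Whole-column join: a single string built from every row's list
whole = try_join(column)
print(whole)
```

The per-row version keeps the information of each listing, while the whole-column version produces one identical string for all rows, so the resulting column has a single unique value.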
Index(['title', 'condition', 'warranty', 'initial_quantity',
       'available_quantity', 'sold_quantity', 'sub_status', 'buying_mode',
       'original_price', 'base_price', 'price', 'currency_id',
       'seller_country', 'seller_state', 'seller_city', 'Giro_postal',
       'free_shipping', 'local_pick_up', 'mode', 'tags', 'tag',
       'Contra_reembolso', 'Acordar_con_el_comprador', 'Cheque_certificado',
       'Efectivo', 'Transferencia_bancaria', 'Aceptan_Tarjeta', 'status',
       'automatic_relist', 'accepts_mercadopago', 'id', 'descr_id', 'deal_ids',
       'parent_item_id', 'category_id', 'seller_id', 'official_store_id',
       'video_id', 'date_created', 'start_time', 'last_updated', 'stop_time'],
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 42 columns):
 #   Column                    Non-Null Count   Dtype         
---  ------                    --------------   -----         
 0   title                     100000 non-null  object        
 1   condition                 100000 non-null  object        
 2   warranty                  39103 non-null   object        
 3   initial_quantity          100000 non-null  int64         
 4   available_quantity        100000 non-null  int64         
 5   sold_quantity             16920 non-null   float64       
 6   sub_status                100000 non-null  object        
 7   buying_mode               100000 non-null  object        
 8   original_price            143 non-null     float64       
 9   base_price                100000 non-null  float64       
 10  price                     100000 non-null  float64       
 11  currency_id               100000 non-null  object        
 12  seller_country            99997 non-null   object        
 13  seller_state              99997 non-null   object        
 14  seller_city               99996 non-null   object        
 15  Giro_postal               1665 non-null    object        
 16  free_shipping             3016 non-null    object        
 17  local_pick_up             79561 non-null   object        
 18  mode                      100000 non-null  object        
 19  tags                      100000 non-null  object        
 20  tag                       75090 non-null   object        
 21  Contra_reembolso          648 non-null     object        
 22  Acordar_con_el_comprador  7991 non-null    object        
 23  Cheque_certificado        460 non-null     object        
 24  Efectivo                  67059 non-null   object        
 25  Transferencia_bancaria    51469 non-null   object        
 26  Aceptan_Tarjeta           25928 non-null   object        
 27  status                    100000 non-null  object        
 28  automatic_relist          4697 non-null    object        
 29  accepts_mercadopago       97781 non-null   object        
 30  id                        100000 non-null  object        
 31  descr_id                  41 non-null      object        
 32  deal_ids                  240 non-null     object        
 33  parent_item_id            76989 non-null   object        
 34  category_id               100000 non-null  object        
 35  seller_id                 100000 non-null  int64         
 36  official_store_id         818 non-null     float64       
 37  video_id                  2985 non-null    object        
 38  date_created              100000 non-null  object        
 39  start_time                100000 non-null  datetime64[ns]
 40  last_updated              100000 non-null  object        
 41  stop_time                 100000 non-null  datetime64[ns]
dtypes: datetime64[ns](2), float64(5), int64(3), object(32)
memory usage: 32.0+ MB
# Transform some columns to boolean type
bool_cols = ['Giro_postal', 'free_shipping', 'local_pick_up', 'Contra_reembolso',
             'Acordar_con_el_comprador', 'Cheque_certificado', 'Efectivo',
             'Transferencia_bancaria', 'Aceptan_Tarjeta', 'automatic_relist']
# Non-missing entries become True, missing entries become False
dfs[bool_cols] = dfs[bool_cols].notna()
# Transform type of all columns
dfs = dfs.astype({'title':'str',
                  'condition': 'category', #bool
                  'warranty': 'category',
                  'initial_quantity': 'float', #int
                  'available_quantity': 'float', #int
                  'sold_quantity': 'float', #int
                  'sub_status': 'category', #bool?
                  'buying_mode': 'category',
                  'original_price': 'float',
                  'base_price': 'float',
                  'price': 'float',
                  'currency_id': 'category',
                  'seller_country': 'category',
                  'seller_state': 'category',
                  'seller_city': 'category',
                  'Giro_postal': 'bool',
                  'free_shipping': 'bool',
                  'local_pick_up': 'bool',
                  'mode': 'category',
                  'tags': 'category', #bool?
                  #'tag': 'category',
                  'Contra_reembolso': 'bool',
                  'Acordar_con_el_comprador': 'bool',
                  'Cheque_certificado': 'bool',
                  'Efectivo': 'bool',
                  'Transferencia_bancaria': 'bool',
                  'Aceptan_Tarjeta': 'bool',
                  'id': 'category',
                  'descr_id': 'category',
                  #'deal_ids': 'category',
                  'parent_item_id': 'category',
                  'category_id': 'category',
                  'seller_id': 'category',
                  'official_store_id': 'category',
                  'video_id': 'category',
                  #'date_created': 'datetime',
                  # 'start_time': 'datetime',
                  # 'last_updated': 'datetime',
                  # 'stop_time': 'datetime',
                  'status': 'category', #bool?
                  'automatic_relist': 'bool'})
Index(['title', 'condition', 'warranty', 'initial_quantity',
       'available_quantity', 'sold_quantity', 'sub_status', 'buying_mode',
       'original_price', 'base_price', 'price', 'currency_id',
       'seller_country', 'seller_state', 'seller_city', 'Giro_postal',
       'free_shipping', 'local_pick_up', 'mode', 'tags', 'tag',
       'Contra_reembolso', 'Acordar_con_el_comprador', 'Cheque_certificado',
       'Efectivo', 'Transferencia_bancaria', 'Aceptan_Tarjeta', 'status',
       'automatic_relist', 'accepts_mercadopago', 'id', 'descr_id', 'deal_ids',
       'parent_item_id', 'category_id', 'seller_id', 'official_store_id',
       'video_id', 'date_created', 'start_time', 'last_updated', 'stop_time'],
# Check missing values
import numpy as np
import pandas as pd

def missing_zero_values_table(df):
    # Count zeros and missing values per column
    zero_val = (df == 0.00).astype(int).sum(axis=0)
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    mz_table = pd.concat([zero_val, mis_val, mis_val_percent], axis=1)
    mz_table = mz_table.rename(
        columns = {0 : 'Zero Values', 1 : 'Missing Values', 2 : '% of Total Values'})
    mz_table['Total Zero Missing Values'] = mz_table['Zero Values'] + mz_table['Missing Values']
    mz_table['% Total Zero Missing Values'] = 100 * mz_table['Total Zero Missing Values'] / len(df)
    mz_table['Data Type'] = df.dtypes
    # Keep only columns with missing values, sorted by the share of missing values
    mz_table = mz_table[mz_table.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    print(f"Your selected dataframe has {df.shape[1]} columns and {df.shape[0]} Rows.\n"
          f"There are {mz_table.shape[0]} columns that have missing values.")
#     mz_table.to_excel('D:/sampledata/missing_and_zero_values.xlsx', freeze_panes=(1,0), index = False)
    return mz_table

missing_zero_values_table(dfs)

Your selected dataframe has 42 columns and 100000 Rows.
There are 13 columns that have missing values.
                     Zero Values  Missing Values  % of Total Values  Total Zero Missing Values  % Total Zero Missing Values  Data Type
descr_id                       0           99959              100.0                      99959                        100.0  category
original_price                 0           99857               99.9                      99857                         99.9  float64
deal_ids                       0           99760               99.8                      99760                         99.8  object
official_store_id              0           99182               99.2                      99182                         99.2  category
video_id                       0           97015               97.0                      97015                         97.0  category
sold_quantity                  0           83080               83.1                      83080                         83.1  float64
warranty                       0           60897               60.9                      60897                         60.9  category
tag                            0           24910               24.9                      24910                         24.9  object
parent_item_id                 0           23011               23.0                      23011                         23.0  category
accepts_mercadopago            0            2219                2.2                       2219                          2.2  object
seller_city                    0               4                0.0                          4                          0.0  category
seller_country                 0               3                0.0                          3                          0.0  category
seller_state                   0               3                0.0                          3                          0.0  category
dfs = dfs.drop(columns = 'seller_country') # We can drop Country column, it's always Argentina
dfs['seller_city'] = dfs['seller_city'].fillna(dfs['seller_city'].mode()[0])
dfs['seller_state'] = dfs['seller_state'].fillna(dfs['seller_state'].mode()[0])
Argentina    99997
Name: seller_country, dtype: int64


'Capital Federal'
dfs['accepts_mercadopago'] = dfs['accepts_mercadopago'].fillna(False)
dfs['sold_quantity'] = dfs['sold_quantity'].fillna(0) # NaN here most likely comes from the earlier applymap, which turned 0 into NaN, so filling with 0 restores it [VALIDATE]
dfs['warranty'] = dfs['warranty'].replace(r'^\s*$', np.nan, regex=True)
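# Flag rows by whether a warranty text is present, then recombine
df_temp1 = dfs[dfs['warranty'].isnull()].copy()
df_temp1['warranty'] = False

df_temp2 = dfs[~dfs['warranty'].isnull()].copy()
df_temp2['warranty'] = True

frames = [df_temp1, df_temp2]
dfs = pd.concat(frames)  # rows are now grouped by warranty flag, not in their original order
dfs = dfs.astype({'warranty':'bool'})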
False    60897
True     39103
Name: warranty, dtype: int64
display('number of sold_quantity', dfs.sold_quantity.nunique())
'number of sold_quantity'

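def get_value_per_cat():
    # Print the number of unique values of each categorical column
    cat_cols = dfs.select_dtypes(include=['category']).columns
    for col in cat_cols:
        print({col: dfs[col].nunique()})

get_value_per_cat()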

{'condition': 2}
{'sub_status': 1}
{'buying_mode': 3}
{'currency_id': 2}
{'seller_state': 24}
{'seller_city': 3655}
{'mode': 4}
{'tags': 1}
{'status': 4}
{'id': 100000}
{'descr_id': 41}
{'parent_item_id': 76989}
{'category_id': 10907}
{'seller_id': 35915}
{'official_store_id': 198}
{'video_id': 2077}
Index(['title', 'condition', 'warranty', 'initial_quantity',
       'available_quantity', 'sold_quantity', 'sub_status', 'buying_mode',
       'original_price', 'base_price', 'price', 'currency_id', 'seller_state',
       'seller_city', 'Giro_postal', 'free_shipping', 'local_pick_up', 'mode',
       'tags', 'tag', 'Contra_reembolso', 'Acordar_con_el_comprador',
       'Cheque_certificado', 'Efectivo', 'Transferencia_bancaria',
       'Aceptan_Tarjeta', 'status', 'automatic_relist', 'accepts_mercadopago',
       'id', 'descr_id', 'deal_ids', 'parent_item_id', 'category_id',
       'seller_id', 'official_store_id', 'video_id', 'date_created',
       'start_time', 'last_updated', 'stop_time'],
import re
dfs['sub_status'] = dfs['sub_status'].str.replace('nan,','')
dfs['sub_status'] = dfs['sub_status'].str.replace(',nan','')

# We concluded this column is useless: every row has the same count of the same value ('suspended')
dfs = dfs.drop('sub_status', axis=1)

100000    1
Name: sub_status, dtype: int64

(100000, 41)
# dfs['tags'] = dfs['tags'].str.replace('nan,','')
# dfs['tags'] = dfs['tags'].str.replace(',nan','')

# from ast import literal_eval
# dfs['tags'] = dfs['tags'].apply(lambda x: literal_eval(str(x)))

# def deduplicate(column):
#     flag = len(column)
#     i = 0
#     while i <= flag:
#         try:
#             # 1. Convert into list of tuples
#             tpls = [tuple(x) for x in column[i]]
#             # 2. Create dictionary with empty values and
#             # 3. convert back to a list (dups removed)
#             dct = list(dict.fromkeys(tpls))
#             # 4. Convert list of tuples to list of lists
#             dup_free = [list(x) for x in lst]
#             # Print everything
#             column[i] = list(map(''.join, dup_free))
#             # [[1, 1], [0, 1], [0, 1], [1, 1]]
#             i = i+1
#         except:
#             return
# deduplicate(dfs['tags'])
# display(dfs['tags'].value_counts().value_counts())
# display(dfs.shape)
# display(dfs['tag'].value_counts().value_counts())

# Other useless columns -- all rows have the same values
dfs = dfs.drop('tags', axis=1)
dfs = dfs.drop('tag', axis=1)        
display('dataframe shape', dfs.shape)
display('unique ids', dfs['id'].nunique())
display('number of sellers', dfs.seller_id.nunique())
display('number of categories', dfs.category_id.nunique())

#Drop useless column
dfs = dfs.drop(['id'], axis=1)
'dataframe shape'

(100000, 38)

'unique ids'


'number of sellers'


'number of categories'

Your selected dataframe has 37 columns and 100000 Rows.
There are 6 columns that have missing values.
                     Zero Values  Missing Values  % of Total Values  Total Zero Missing Values  % Total Zero Missing Values  Data Type
descr_id                       0           99959              100.0                      99959                        100.0  category
original_price                 0           99857               99.9                      99857                         99.9  float64
deal_ids                       0           99760               99.8                      99760                         99.8  object
official_store_id              0           99182               99.2                      99182                         99.2  category
video_id                       0           97015               97.0                      97015                         97.0  category
parent_item_id                 0           23011               23.0                      23011                         23.0  category
dfs = dfs.dropna(axis=1) # drop all columns with missing values (we checked and they are not necessary or have too many missing values to impute properly)
from matplotlib import pyplot as plt
# Deal with datetimes to create new features
dfs['year_start'] = pd.to_datetime(dfs['start_time']).dt.year.astype('category')
dfs['month_start'] = pd.to_datetime(dfs['start_time']).dt.month.astype('category')
dfs['year_stop'] = pd.to_datetime(dfs['stop_time']).dt.year.astype('category')
dfs['month_stop'] = pd.to_datetime(dfs['stop_time']).dt.month.astype('category')
dfs['week_day'] = pd.to_datetime(dfs['stop_time']).dt.weekday.astype('category')
#dfs['days_active'] = (dfs['start_time'] - dfs['stop_time']).dt.days
dfs['days_active'] = (dfs['stop_time'] - dfs['start_time']).dt.days.astype('int')
dfs = dfs.reset_index(drop=True)

#dfs = dfs.drop(['date_created', 'start_time', 'last_updated', 'stop_time'], axis=1)
boxplot = dfs.boxplot(column=['days_active'], showfliers=False)
plt.savefig('days_active.png', bbox_inches='tight', dpi = 300)
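To make the date features above concrete, here is a minimal sketch using only the standard library. The function name `listing_time_features` is hypothetical, and the two timestamps are the `start_time` and `stop_time` of the first row printed earlier in this README.

```python
from datetime import datetime

def listing_time_features(start_time, stop_time):
    """Derive the calendar features used above from a listing's start and stop times."""
    return {
        "year_start": start_time.year,
        "month_start": start_time.month,
        "year_stop": stop_time.year,
        "month_stop": stop_time.month,
        # Monday == 0; taken from stop_time, as in the dataframe code above
        "week_day": stop_time.weekday(),
        "days_active": (stop_time - start_time).days,
    }

start = datetime(2015, 9, 5, 20, 42, 53)
stop = datetime(2015, 11, 4, 20, 42, 53)

features = listing_time_features(start, stop)
print(features)
```

For this listing the ad was active for 60 days and stopped on a Wednesday.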



3.1.Logistic Regression

# empty list to read list from a file
selected_features = []

# open file and read the content in a list
with open(r'selected_features.txt', 'r') as fp:
    for line in fp:
        # remove the trailing linebreak from the current name
        x = line.rstrip('\n')

        # add current item to the list
        selected_features.append(x)

# display list
print(selected_features)
['base_price', 'seller_id', 'available_quantity', 'seller_state', 'price', 'week_day', 'sold_quantity', 'mode', 'Transferencia_bancaria', 'category_id', 'Aceptan_Tarjeta', 'seller_city', 'initial_quantity', 'warranty', 'automatic_relist']
from sklearn import preprocessing

# Encode categorical columns to pass through model
mylist = list(dfs.select_dtypes(include=['category']).columns)
dfs[mylist] = dfs[mylist].apply(preprocessing.LabelEncoder().fit_transform)
dfs['log_price'] = np.log(dfs['price'] + 1)
dfs['log_base_price'] = np.log(dfs['base_price'] + 1)
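
The `log(price + 1)` transform above keeps zero prices finite and compresses the long right tail. A small sketch of the same transform with the standard library follows; the price values are invented for illustration. `math.log1p(x)` computes `log(1 + x)` and is the more accurate form for very small `x` (NumPy offers the equivalent `np.log1p`).

```python
import math

# Illustrative prices only; not taken from the dataset.
prices = [0.0, 9.0, 99.0, 1e-12]

# log1p(x) == log(1 + x), but stays accurate when x is tiny.
log_prices = [math.log1p(p) for p in prices]

for price, log_price in zip(prices, log_prices):
    print(f"price={price!r:>8}  log1p={log_price:.6f}")
```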
import statsmodels.formula.api as fsm
import matplotlib.pyplot as plt
import seaborn as sns
model = fsm.logit(formula = 'condition ~ log_price' , data = dfs)
fit = model.fit()

dfs['pred_baseline'] = fit.predict()

fig,ax = plt.subplots(1,2, figsize = (16,8))
sns.scatterplot(data = dfs, x = 'initial_quantity', y = 'pred_baseline', hue = 'Aceptan_Tarjeta', size = 'base_price', ax = ax[0])
sns.scatterplot(data = dfs, x = 'available_quantity', y = 'pred_baseline', hue = 'Aceptan_Tarjeta', size = 'base_price', ax = ax[1])
plt.savefig('logistic_baseline_plot.png', bbox_inches='tight', dpi = 300)
Optimization terminated successfully.
         Current function value: 0.681740
         Iterations 4


import matplotlib.pyplot as plt
import statsmodels.formula.api as fsm
model = fsm.logit(formula = 'condition ~ log_price : mode * seller_state', data = dfs)
fit = model.fit()
dfs['pred_m1'] = fit.predict()

fig,ax = plt.subplots(1,2, figsize = (16,8))
sns.scatterplot(data = dfs, x = 'initial_quantity', y = 'pred_m1', hue = 'Aceptan_Tarjeta', size = 'base_price', ax = ax[0])
sns.scatterplot(data = dfs, x = 'available_quantity', y = 'pred_m1', hue = 'Aceptan_Tarjeta', size = 'base_price', ax = ax[1])
plt.savefig('logistic_tarjeta_plot.png', bbox_inches='tight', dpi = 300)
Optimization terminated successfully.
         Current function value: 0.690247
         Iterations 4


fig,ax = plt.subplots(1,2, figsize = (16,8))
sns.scatterplot(data = dfs, x = 'initial_quantity', y = 'pred_m1', hue = 'warranty', size = 'base_price', ax = ax[0])
sns.scatterplot(data = dfs, x = 'sold_quantity', y = 'pred_m1', hue = 'warranty', size = 'base_price', ax = ax[1])
plt.savefig('logistic_warranty_plot.png', bbox_inches='tight', dpi = 300)


from sklearn.metrics import f1_score

threshold_list = np.linspace(0.05, 0.95, 200)

f1_list = []
for threshold in threshold_list:
    pred_label = np.where(dfs['pred_m1'] < threshold, 0, 1)
    f1 = f1_score(dfs['condition'], pred_label)
    f1_list.append(f1)
df_f1 = pd.DataFrame({'threshold':threshold_list, 'f1_score': f1_list})
df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]
bt = df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]['threshold'].values[0]
f1 = df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]['f1_score'].values[0]
title = "Best Threshold: " + str(round(bt, 2)) + " w/ F-1: " + str(round(f1, 2))
sns.lineplot(data=df_f1, x='threshold', y='f1_score').set_title(title)
plt.savefig('logistic_baseline_threshold.png', bbox_inches='tight', dpi = 300)
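
The threshold sweep above can be sketched without pandas or scikit-learn to show what is being optimized. This is a minimal illustration under stated assumptions: the labels and scores are toy values, and `f1_at_threshold` is a hypothetical helper, not part of the project. It uses the same rule as the loop above (label 1 when the score is at or above the threshold) and the same grid of 200 thresholds between 0.05 and 0.95.

```python
def f1_at_threshold(y_true, probs, threshold):
    """F1 of the positive class when a score >= threshold is labelled 1."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: F1 is taken as 0
    return 2 * tp / (2 * tp + fp + fn)


# Toy labels and scores, invented for illustration only.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
probs = [0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]

# 200 evenly spaced thresholds between 0.05 and 0.95.
thresholds = [0.05 + i * (0.95 - 0.05) / 199 for i in range(200)]
scores = [f1_at_threshold(y_true, probs, t) for t in thresholds]

# Keep the first threshold that reaches the maximum F1.
best_index = max(range(len(scores)), key=scores.__getitem__)
best_threshold, best_f1 = thresholds[best_index], scores[best_index]
print(f"Best threshold: {best_threshold:.2f} w/ F1: {best_f1:.2f}")
```

Note that a threshold chosen on the same rows used to score it is optimistic; picking it on a validation split is the safer design.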


from sklearn.metrics import cohen_kappa_score, precision_score, roc_curve
from sklearn.metrics import matthews_corrcoef, mean_squared_error, log_loss
from sklearn.metrics import f1_score, recall_score, roc_auc_score

threshold_list = np.linspace(0.05, 0.95, 200)

score_list = []
for threshold in threshold_list:
    pred_label = np.where(dfs['pred_m1'] < threshold, 0, 1)
    score = cohen_kappa_score(dfs['condition'], pred_label)
    score_list.append(score)

df_score = pd.DataFrame({'threshold':threshold_list, 'score_score': score_list})
df_score[df_score['score_score'] == max(df_score['score_score'])]
bt = df_score[df_score['score_score'] == max(df_score['score_score'])]['threshold'].values[0]
score = df_score[df_score['score_score'] == max(df_score['score_score'])]['score_score'].values[0]
title = "Best Threshold: " + str(round(bt, 2)) + " w/ Kappa: " + str(round(score, 2))
sns.lineplot(data=df_score, x='threshold', y='score_score').set_title(title)
plt.savefig('logistic_kappa_threshold.png', bbox_inches='tight', dpi = 300)
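
For reference, Cohen's kappa compares the observed agreement between labels and predictions with the agreement expected by chance from the class frequencies: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a hand-rolled binary version for illustration only; the helper name `cohen_kappa`, the toy data, and the choice to return 0.0 in the degenerate case are assumptions, and the project itself uses `sklearn.metrics.cohen_kappa_score`.

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for two binary (0/1) label sequences of equal length."""
    n = len(y_true)
    # Observed agreement: share of positions where both sequences match.
    observed = sum(1 for a, b in zip(y_true, y_pred) if a == b) / n
    # Chance agreement: product of the marginal class frequencies, summed over classes.
    expected = sum(
        (sum(1 for a in y_true if a == c) / n) * (sum(1 for b in y_pred if b == c) / n)
        for c in (0, 1)
    )
    if expected == 1.0:
        return 0.0  # assumption: treat the undefined 0/0 case as "no agreement beyond chance"
    return (observed - expected) / (1.0 - expected)


# Toy labels and predictions, invented for illustration only.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]

kappa = cohen_kappa(y_true, y_pred)
perfect = cohen_kappa(y_true, y_true)
print(f"kappa={kappa:.2f}, perfect agreement={perfect:.2f}")
```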


from sklearn.metrics import roc_curve
#Plot ROC_Curve
fpr, tpr, thresholds = roc_curve(dfs['condition'], dfs['pred_baseline'])

fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111, aspect=1)

sns.lineplot(x = fpr, y = fpr, ax = ax)
sns.lineplot(x = fpr, y = tpr, ax = ax)
plt.savefig('logistic_baseline_roc_curve.png', bbox_inches='tight', dpi = 300)


from sklearn.metrics import roc_curve
#Plot ROC_Curve
fpr, tpr, thresholds = roc_curve(dfs['condition'], dfs['pred_m1'])

fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111, aspect=1)

sns.lineplot(x = fpr, y = fpr, ax = ax)
sns.lineplot(x = fpr, y = tpr, ax = ax)
plt.savefig('logistic_kappa_roc_curve.png', bbox_inches='tight', dpi = 300)

3.2.Model: XGBoost

import os
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from xgboost import XGBClassifier
from xgboost import XGBRegressor
from sklearn.metrics import cohen_kappa_score, precision_score
from sklearn.metrics import matthews_corrcoef, mean_squared_error, log_loss
from sklearn.metrics import f1_score, recall_score, roc_auc_score

dfs['condition'] = dfs['condition'].replace('new', 0)
dfs['condition'] = dfs['condition'].replace('used', 1)

scaled_features = dfs.copy()
col_names = ['warranty', 'initial_quantity', 'available_quantity', 'sold_quantity',
       'base_price', 'price', 'Giro_postal', 'free_shipping', 'local_pick_up',
       'Contra_reembolso', 'Acordar_con_el_comprador', 'Cheque_certificado',
       'Efectivo', 'Transferencia_bancaria', 'Aceptan_Tarjeta',
       'automatic_relist', 'accepts_mercadopago', 'days_active']

features = scaled_features[col_names]
scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)
scaled_features[col_names] = features

X = scaled_features.drop(columns=['condition'], axis=1)
#X = dfs.drop(columns='condition')
y = scaled_features.condition

X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=7)
Y_train = Y_train
Y_test = Y_test

full_pipeline = ColumnTransformer([('cat', OneHotEncoder(handle_unknown='ignore'), X_train.columns)], remainder='passthrough')

encoder = full_pipeline.fit(X_train)
X_train_enc = encoder.transform(X_train)
X_test_enc = encoder.transform(X_test)

# train the model
model = xgb.XGBClassifier(n_estimators= 200,
                             max_depth= 30,                         # Lower ratios avoid over-fitting. Default is 6.
                             objective = 'binary:logistic',         # Default is reg:squarederror. 'multi:softprob' for multiclass and get proba.  
                             #num_class = 2,                        # Use if softprob is set.
                             reg_lambda = 10,                       # Larger ratios avoid over-fitting. Default is 1.
                             gamma = 0.3,                           # Larger values avoid over-fitting. Default is 0.
                             alpha = 1,                             # Larger ratios avoid over-fitting. Default is 0.
                             learning_rate= 0.10,                   # Lower ratios avoid over-fitting. Default is 0.3.
                             colsample_bytree= 0.7,                 # Lower ratios avoid over-fitting. Values from 0.3 to 0.8 if you have many columns (especially if you did one-hot encoding), or 0.8 to 1 if you only have a few columns.
                             scale_pos_weight = 1,                  # Default is 1. Control balance of positive and negative weights, for unbalanced classes.
                             subsample = 0.1,                       # Lower ratios avoid over-fitting. Default 1. 0.5 recommended. # 0.1 if using GPU.
                             min_child_weight = 3,                  # Larger ratios avoid over-fitting. Default is 1.
                             missing = np.nan,                      # Deal with missing values
                             num_parallel_tree = 2,                 # Parallel trees constructed during each iteration. Default is 1.
                             importance_type = 'weight',
                             eval_metric = 'auc',
                             #use_label_encoder = True,
                             #enable_categorical = True,
                             verbosity = 1,
                             nthread = -1,                          # Set -1 to use all threads.
                             #use_rmm = True,                       # Use GPU if available
                             tree_method = 'auto', # auto           # 'gpu_hist'. Default is auto: analyze the data and chooses the fastest.
                             #gradient_based = True,                # If True you can set subsample as low as 0.1. Only use with gpu_hist 
                             )

# fit model
model.fit(X_train_enc, Y_train.values.ravel())  # early_stopping_rounds=20

# check best ntree limit

# extract the training set predictions
preds_train = model.predict(X_train_enc)
# extract the test set predictions
preds_test = model.predict(X_test_enc)

# save model
output_dir = "models"
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
# save in JSON format
# save in text format

/home/ggnicolau/miniconda3/envs/jupyter-1/lib/python3.10/site-packages/xgboost/ UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
  warnings.warn(label_encoder_deprecation_msg, UserWarning)


/home/ggnicolau/miniconda3/envs/jupyter-1/lib/python3.10/site-packages/xgboost/ UserWarning: ntree_limit is deprecated, use `iteration_range` or model slicing instead.

CPU times: user 12min 45s, sys: 2.88 s, total: 12min 48s
Wall time: 1min 56s
# extract the test set predictions
preds_test = model.predict_proba(X_test_enc)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import cohen_kappa_score, brier_score_loss 
from sklearn.metrics import matthews_corrcoef, mean_squared_error, log_loss
from sklearn.metrics import f1_score, recall_score, precision_score
from sklearn.metrics import roc_auc_score, roc_curve, auc
from numpy import sqrt, argmax, argmin

# Plot F1-Score and Threshold
threshold_list = np.linspace(0.05, 0.95, 200)

f1_list = []
for threshold in threshold_list:
    pred_label = np.where(preds_test[:,1] < threshold, 0, 1)
    f1 = f1_score(Y_test, pred_label)
    f1_list.append(f1)

df_f1 = pd.DataFrame({'threshold':threshold_list, 'f1_score': f1_list})
df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]
bt = df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]['threshold'].values[0]
f1 = df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]['f1_score'].values[0]
title = "Best Threshold: " + str(round(bt, 2)) + " w/ F-1: " + str(round(f1, 2))
sns.lineplot(data=df_f1, x='threshold', y='f1_score').set_title(title)

# Plot your other Score and threshold
threshold_list = np.linspace(0.05, 0.95, 200)

score_list = []
for threshold in threshold_list:
    pred_label = np.where(preds_test[:,1] < threshold, 0, 1)
    score = brier_score_loss(Y_test, pred_label)
    score_list.append(score)

df_score = pd.DataFrame({'threshold':threshold_list, 'score_score': score_list})
df_score[df_score['score_score'] == min(df_score['score_score'])]
bt = df_score[df_score['score_score'] == min(df_score['score_score'])]['threshold'].values[0]
score = df_score[df_score['score_score'] == min(df_score['score_score'])]['score_score'].values[0]
title = "Best Threshold: " + str(round(bt, 2)) + " w/ Brier: " + str(round(score, 2))
sns.lineplot(data=df_score, x='threshold', y='score_score').set_title(title)

from sklearn.metrics import roc_curve

#Plot ROC_Curve
fpr, tpr, thresholds = roc_curve(Y_test, preds_test[:,1])
roc = roc_auc_score(Y_test, preds_test[:,1])

# calculate the g-mean for each threshold
gmeans = sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))

plt.figure()
lw = 2
plt.plot(fpr, tpr, lw=lw, label=f"ROC curve (area = {roc:.2f})")

plt.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best') #threshold

plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("XGBoost Condition Classifier")
plt.legend(loc="lower right")
plt.savefig('xgboost_roc_curve.png', bbox_inches='tight', dpi = 300)



Best Threshold=0.505019, G-Mean=0.810
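
The G-mean criterion used above picks the ROC point that maximizes sqrt(TPR * (1 - FPR)), that is, the geometric mean of sensitivity and specificity. The following is a minimal sketch on toy data; `roc_points` is a hypothetical helper that evaluates every distinct score as a threshold, and it does not reproduce the extra start point or the point pruning that `sklearn.metrics.roc_curve` applies.

```python
from math import sqrt


def roc_points(y_true, probs):
    """Return (threshold, fpr, tpr) for each distinct score, highest threshold first."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for threshold in sorted(set(probs), reverse=True):
        tp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 1)
        fp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 0)
        points.append((threshold, fp / negatives, tp / positives))
    return points


# Toy labels and scores, invented for illustration only.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
probs = [0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]

points = roc_points(y_true, probs)
# Geometric mean of sensitivity (TPR) and specificity (1 - FPR) at each point.
gmeans = [sqrt(tpr * (1 - fpr)) for _, fpr, tpr in points]

best = max(range(len(points)), key=gmeans.__getitem__)
best_threshold, best_gmean = points[best][0], gmeans[best]
print(f"Best Threshold={best_threshold:.2f}, G-Mean={best_gmean:.3f}")
```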


CPU times: user 1.8 s, sys: 629 ms, total: 2.43 s
Wall time: 1.65 s
# best_preds_score = np.where(preds_test < bt, 0, 1) # Uncomment if you want to change threshold... Lower, because threshold calculated on Brier Loss and lower is better
print("root_mean_squared_error_test = {}".format(mean_squared_error(Y_test, preds_test[:,1], squared=False)))
print("Roc_auc = {}".format(roc_auc_score(Y_test, preds_test[:,1])))
print("Brier_error = {}".format(brier_score_loss(Y_test, preds_test[:,1])))
print("Logloss_test = {}".format(log_loss(Y_test, preds_test[:,1])))
# print("Precision = {}".format(precision_score(Y_test, preds_test[:,1])))
# print("Recall = {}".format(recall_score(Y_test, preds_test[:,1])))
# print("F1 = {}".format(f1_score(Y_test, preds_test[:,1])))
# print("Kappa_score = {}".format(cohen_kappa_score(Y_test, preds_test[:,1])))
# print("Matthews_corrcoef = {}".format(matthews_corrcoef(Y_test, preds_test[:,1])))
root_mean_squared_error_test = 0.3640702047156597
Roc_auc = 0.8888785004424654
Brier_error = 0.13254711396170238
Logloss_test = 0.4085390232688165
# apply threshold to positive probabilities to create labels
def to_labels_max(pos_probs, threshold):
    return (pos_probs >= threshold).astype('int')
# evaluate each threshold
scores = [roc_auc_score(Y_test, to_labels_max(preds_test[:,1], t)) for t in thresholds]
# get best threshold for max is better
ix = argmax(scores)
print('Threshold=%.3f, Roc_auc=%.5f' % (thresholds[ix], scores[ix]))
Threshold=0.490, Roc_auc=0.81209
# evaluate each threshold
scores = [brier_score_loss(Y_test, to_labels_max(preds_test[:,1], t)) for t in thresholds]
# get best threshold for min is better
ix = argmin(scores)
print('Threshold=%.3f, Brier=%.5f' % (thresholds[ix], scores[ix]))
Threshold=0.505, Brier=0.19180
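
A note on the thresholded Brier loss above: when the Brier score is computed on hard 0/1 labels instead of probabilities, each squared error is either 0 or 1, so the score equals the misclassification rate, i.e. 1 - accuracy. Under that identity, the value 0.19180 printed above would correspond to an accuracy of roughly 0.808 at that threshold. The sketch below checks the identity on toy data; the helper names and values are illustrative assumptions, not project code.

```python
def brier_score(y_true, y_score):
    """Mean squared difference between the 0/1 outcome and the score."""
    return sum((s - y) ** 2 for y, s in zip(y_true, y_score)) / len(y_true)


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(1 for y, p in zip(y_true, y_pred) if y == p) / len(y_true)


# Toy labels and scores, invented for illustration only.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
probs = [0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]

# Hard labels at a 0.5 threshold.
labels = [1 if p >= 0.5 else 0 for p in probs]

brier_on_probs = brier_score(y_true, probs)    # proper scoring rule on probabilities
brier_on_labels = brier_score(y_true, labels)  # collapses to the error rate
acc = accuracy(y_true, labels)

print(f"Brier on probabilities: {brier_on_probs:.4f}")
print(f"Brier on hard labels:   {brier_on_labels:.4f}")
print(f"1 - accuracy:           {1 - acc:.4f}")
```

Because of this, minimizing the thresholded Brier loss is the same as maximizing accuracy, so reporting accuracy directly (the primary metric of the task) conveys the same information.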

Using patsy to combine features (couldn't run due to hardware limitations)

# %%time
# import xgboost as xgb
# from sklearn.metrics import cohen_kappa_score
# from sklearn.metrics import matthews_corrcoef
# from sklearn.metrics import f1_score
# from sklearn.model_selection import train_test_split
# import patsy
# # Selecting features I've found and using patsy to automatic interact between features.
# y, X = patsy.dmatrices('condition ~ Aceptan_Tarjeta + category_id + Efectivo + Transferencia_bancaria + automatic_relist + available_quantity + \
#                        base_price + warranty + sold_quantity + free_shipping + initial_quantity + local_pick_up + mode + \
#                        price + seller_id + seller_city + seller_state+ \
#                        year_start + month_start + year_stop  + month_stop + week_day + days_active', data = dfs)

# # Display patsy features
# #display(X)

# X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2)

# D_train = xgb.DMatrix(X_train, label=Y_train)#, enable_categorical=True)
# D_test = xgb.DMatrix(X_test, label=Y_test)#, enable_categorical=True)

# param = {
#     'eta': 0.10,                      # Lower ratios avoid over-fitting. Default is 0.3.
#     'max_depth': 30,                  # Lower ratios avoid over-fitting. Default is 6.
#     "min_child_weight": 3,            # Larger ratios avoid over-fitting. Default is 1.
#     "gamma": 0.3,                     # Larger values avoid over-fitting. Default is 0. 
#     "colsample_bytree" : 0.7,         # Lower ratios avoid over-fitting. Values from 0.3 to 0.8 if you have many columns (especially if you did one-hot encoding), or 0.8 to 1 if you only have a few columns.
#     "scale_pos_weight": 1,            # Default is 1. Control balance of positive and negative weights, for unbalanced classes.
#     "reg_lambda": 10,                 # Larger ratios avoid over-fitting. Default is 1.
#     "alpha": 1,                       # Larger ratios avoid over-fitting. Default is 0.
#     'subsample':0.5,                  # Lower ratios avoid over-fitting. Default 1. 0.5 recommended.
#     'num_parallel_tree': 2,           # Parallel trees constructed during each iteration. Default is 1.
#     'objective': 'multi:softprob',    # Default is reg:squarederror. 'multi:softprob' for multiclass.  
#     'num_class': 2,                   # Use if softprob is set.
#     'verbosity':1,
#     'eval_metric': 'auc',
#     'use_rmm':False,                   # Use GPU if available
#     'nthread':-1,                      # Set -1 to use all threads.
#     'tree_method': 'auto',             # 'gpu_hist'. Default is auto: analyze the data and chooses the fastest.
#     'gradient_based': False,           # If True you can set subsample as low as 0.1. Only use with gpu_hist 
# } 

# steps = 200  # The number of training iterations

# model = xgb.train(param, D_train, steps)
# import numpy as np
# from sklearn.metrics import precision_score, recall_score, accuracy_score

# preds = model.predict(D_test)
# best_preds = np.asarray([np.argmax(line) for line in preds])

# print("Precision = {}".format(precision_score(Y_test, best_preds)))
# print("Recall = {}".format(recall_score(Y_test, best_preds)))
# print("f1 = {}".format(f1_score(Y_test, best_preds)))
# print("kappa_score = {}".format(cohen_kappa_score(Y_test, best_preds)))
# print("matthews_corrcoef = {}".format(matthews_corrcoef(Y_test, best_preds)))
# #print("mean_squared_error_train = {}".format(mean_squared_error(Y_train, best_preds)))
# # print("mean_squared_error_test = {}".format(mean_squared_error(Y_test, best_preds)))
# print("logloss_test = {}".format(log_loss(Y_test, best_preds)))
# #print("logloss_train = {}".format(log_loss(Y_train, best_preds)))

# # from xgboost import plot_importance
# # import matplotlib.pyplot as pyplot
# # plot_importance(model)
# #
# from sklearn.metrics import roc_auc_score

# best_preds = np.where(preds_test < bt, 0, 1)

# print("Roc_auc = {}".format(roc_auc_score(Y_test, best_preds)))
# print("Precision = {}".format(precision_score(Y_test, best_preds)))
# print("Recall = {}".format(recall_score(Y_test, best_preds)))
# print("F1 = {}".format(f1_score(Y_test, best_preds)))
# print("Kappa_score = {}".format(cohen_kappa_score(Y_test, best_preds)))
# print("Matthews_corrcoef = {}".format(matthews_corrcoef(Y_test, best_preds)))
# print("Mean_squared_error_test = {}".format(mean_squared_error(Y_test, best_preds)))
# print("Logloss_test = {}".format(log_loss(Y_test, best_preds)))

3.3.Embeddings Encoding + Logistic Regression

import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance
from sklearn.preprocessing import label_binarize

from embedding_encoder import EmbeddingEncoder
from embedding_encoder.utils.compose import ColumnTransformerWithNames
Index(['title', 'condition', 'warranty', 'initial_quantity',
       'available_quantity', 'sold_quantity', 'buying_mode', 'base_price',
       'price', 'currency_id', 'seller_state', 'seller_city', 'Giro_postal',
       'free_shipping', 'local_pick_up', 'mode', 'Contra_reembolso',
       'Acordar_con_el_comprador', 'Cheque_certificado', 'Efectivo',
       'Transferencia_bancaria', 'Aceptan_Tarjeta', 'status',
       'automatic_relist', 'accepts_mercadopago', 'category_id', 'seller_id',
       'date_created', 'start_time', 'last_updated', 'stop_time', 'year_start',
       'month_start', 'year_stop', 'month_stop', 'week_day', 'days_active'],
dfs.select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']).columns
Index(['initial_quantity', 'available_quantity', 'sold_quantity', 'base_price',
       'price', 'days_active'],
Index(['condition', 'buying_mode', 'currency_id', 'seller_state',
       'seller_city', 'mode', 'status', 'category_id', 'seller_id',
       'year_start', 'month_start', 'year_stop', 'month_stop', 'week_day'],
# Split train and test
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64', 'category', 'bool']

X = dfs.select_dtypes(include=numerics).drop(columns=['condition'], axis=1)

dfs['condition'] = dfs['condition'].replace('new', 0)
dfs['condition'] = dfs['condition'].replace('used', 1)
y = dfs.condition

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
categorical_high = ["seller_city", "category_id"] #"seller_id"
numeric = X.select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']).columns#.drop(columns=['condition'], axis=1)
categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status", "week_day", "month_stop", "year_stop", "month_start", "year_start"] + list(X.select_dtypes(include=['bool']).columns)
#categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status", "week_day", "month_stop", "month_start"] + list(X.select_dtypes(include=['bool']).columns)
#categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status"] + list(X.select_dtypes(include=['bool']).columns)

def build_pipeline(mode: str):
    if mode == "embeddings":
        high_cardinality_encoder = EmbeddingEncoder(task="classification") #regression
    else:
        high_cardinality_encoder = OrdinalEncoder()
    one_hot_encoder = OneHotEncoder(handle_unknown="ignore")
    scaler = StandardScaler()
    imputer = ColumnTransformerWithNames([("numeric", SimpleImputer(strategy="mean"), numeric), ("categorical", SimpleImputer(strategy="most_frequent"), categorical_low+categorical_high)])
    processor = ColumnTransformer([("one_hot", one_hot_encoder, categorical_low), (mode, high_cardinality_encoder, categorical_high), ("scale", scaler, numeric)])
    return make_pipeline(imputer, processor, LogisticRegression(max_iter=1000)) #RandomForestRegressor() #XGBClassifier()

embeddings_pipeline = build_pipeline("embeddings").fit(X_train, y_train)
CPU times: user 5min 16s, sys: 1min 16s, total: 6min 32s
Wall time: 1min 12s

Pipeline(steps=[('columntransformerwithnames', ...),
                ('columntransformer', ...),
                ('logisticregression', LogisticRegression(max_iter=1000))])
y_pred_proba = embeddings_pipeline.predict_proba(X_test) #.decision_function(X_test) 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import cohen_kappa_score, brier_score_loss 
from sklearn.metrics import matthews_corrcoef, mean_squared_error, log_loss
from sklearn.metrics import f1_score, recall_score, precision_score
from sklearn.metrics import roc_auc_score, roc_curve, auc
from numpy import sqrt, argmax, argmin

# Plot F1-Score and Threshold
threshold_list = np.linspace(0.05, 0.95, 200)

f1_list = []
for threshold in threshold_list:
    pred_label = np.where(y_pred_proba[:,1] < threshold, 0, 1)
    f1 = f1_score(y_test, pred_label)
    f1_list.append(f1)

df_f1 = pd.DataFrame({'threshold':threshold_list, 'f1_score': f1_list})
df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]
bt = df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]['threshold'].values[0]
f1 = df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]['f1_score'].values[0]
title = "Best Threshold: " + str(round(bt, 2)) + " w/ F-1: " + str(round(f1, 2))
sns.lineplot(data=df_f1, x='threshold', y='f1_score').set_title(title)

# Plot your other Score and threshold
threshold_list = np.linspace(0.05, 0.95, 200)

score_list = []
for threshold in threshold_list:
    pred_label = np.where(y_pred_proba[:,1] < threshold, 0, 1)
    score = brier_score_loss(y_test, pred_label)
    score_list.append(score)

df_score = pd.DataFrame({'threshold':threshold_list, 'score_score': score_list})
df_score[df_score['score_score'] == min(df_score['score_score'])]
bt = df_score[df_score['score_score'] == min(df_score['score_score'])]['threshold'].values[0]
score = df_score[df_score['score_score'] == min(df_score['score_score'])]['score_score'].values[0]
title = "Best Threshold: " + str(round(bt, 2)) + " w/ Brier: " + str(round(score, 2))
sns.lineplot(data=df_score, x='threshold', y='score_score').set_title(title)

from sklearn.metrics import roc_curve

#Plot ROC_Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba[:,1])
roc = roc_auc_score(y_test, y_pred_proba[:,1])

# calculate the g-mean for each threshold
gmeans = sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))

plt.figure()
lw = 2
plt.plot(fpr, tpr, lw=lw, label=f"ROC curve (area = {roc:.2f})")

plt.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best') #threshold

plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Embeddings + Logistic Condition Classifier")
plt.legend(loc="lower right")
plt.savefig('emb_logistic_roc_curve.png', bbox_inches='tight', dpi = 300)



Best Threshold=0.508526, G-Mean=0.823


CPU times: user 2.02 s, sys: 1.08 s, total: 3.1 s
Wall time: 1.78 s
# best_preds_score = np.where(preds_test < bt, 0, 1) # Uncomment if you want to change threshold... Lower, because threshold calculated on Brier Loss and lower is better
print("root_mean_squared_error_test = {}".format(mean_squared_error(y_test, y_pred_proba[:,1], squared=False)))
print("Roc_auc = {}".format(roc_auc_score(y_test, y_pred_proba[:,1])))
print("Brier_error = {}".format(brier_score_loss(y_test, y_pred_proba[:,1])))
print("Logloss_test = {}".format(log_loss(y_test, y_pred_proba[:,1])))
# print("Precision = {}".format(precision_score(Y_test, preds_test[:,1])))
# print("Recall = {}".format(recall_score(Y_test, preds_test[:,1])))
# print("F1 = {}".format(f1_score(Y_test, preds_test[:,1])))
# print("Kappa_score = {}".format(cohen_kappa_score(Y_test, preds_test[:,1])))
# print("Matthews_corrcoef = {}".format(matthews_corrcoef(Y_test, preds_test[:,1])))
root_mean_squared_error_test = 0.3568355720384346
Roc_auc = 0.9007658597305785
Brier_error = 0.12733162547199683
Logloss_test = 0.40211995678282203
# apply threshold to positive probabilities to create labels
def to_labels_max(pos_probs, threshold): # higher is better
    return (pos_probs >= threshold).astype('int')
# evaluate each threshold
scores = [roc_auc_score(y_test, to_labels_max(y_pred_proba[:,1], t)) for t in thresholds]
# get best threshold for max is better
ix = argmax(scores)
print('Threshold=%.3f, Roc_auc=%.5f' % (thresholds[ix], scores[ix]))
Threshold=0.509, Roc_auc=0.82327
# evaluate each threshold
scores = [brier_score_loss(y_test, to_labels_max(y_pred_proba[:,1], t)) for t in thresholds]
# get best threshold for min is better
ix = argmin(scores)
print('Threshold=%.3f, Brier=%.5f' % (thresholds[ix], scores[ix]))
Threshold=0.509, Brier=0.17665

3.4.Embeddings Encoding + XGBoost

import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance
from sklearn.preprocessing import label_binarize

from embedding_encoder import EmbeddingEncoder
from embedding_encoder.utils.compose import ColumnTransformerWithNames
#dfs = pd.read_parquet('cleaned_data_haha.parquet.gzip')
Index(['title', 'condition', 'warranty', 'initial_quantity',
       'available_quantity', 'sold_quantity', 'buying_mode', 'base_price',
       'price', 'currency_id', 'seller_state', 'seller_city', 'Giro_postal',
       'free_shipping', 'local_pick_up', 'mode', 'Contra_reembolso',
       'Acordar_con_el_comprador', 'Cheque_certificado', 'Efectivo',
       'Transferencia_bancaria', 'Aceptan_Tarjeta', 'status',
       'automatic_relist', 'accepts_mercadopago', 'category_id', 'seller_id',
       'date_created', 'start_time', 'last_updated', 'stop_time', 'year_start',
       'month_start', 'year_stop', 'month_stop', 'week_day', 'days_active'],
dfs.select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']).columns
Index(['initial_quantity', 'available_quantity', 'sold_quantity', 'base_price',
       'price', 'days_active'],
Index(['condition', 'buying_mode', 'currency_id', 'seller_state',
       'seller_city', 'mode', 'status', 'category_id', 'seller_id',
       'year_start', 'month_start', 'year_stop', 'month_stop', 'week_day'],
# Split train and test
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64', 'category', 'bool']

X = dfs.select_dtypes(include=numerics).drop(columns=['condition'], axis=1)

dfs['condition'] = dfs['condition'].replace('new', 0)
dfs['condition'] = dfs['condition'].replace('used', 1)
y = dfs.condition

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import xgboost as xgb

categorical_high = ["seller_city", "category_id"] # "seller_id"
numeric = X.select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']).columns#.drop(columns=['condition'], axis=1)
categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status", "week_day", "month_stop", "year_stop", "month_start", "year_start"] + list(X.select_dtypes(include=['bool']).columns)
#categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status", "week_day", "month_stop", "month_start"] + list(X.select_dtypes(include=['bool']).columns)
#categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status"] + list(X.select_dtypes(include=['bool']).columns)

def build_pipeline(mode: str):
    if mode == "embeddings":
        high_cardinality_encoder = EmbeddingEncoder(task="classification") #regression
    else:
        high_cardinality_encoder = OrdinalEncoder()
    one_hot_encoder = OneHotEncoder(handle_unknown="ignore")
    scaler = StandardScaler()
    imputer = ColumnTransformerWithNames([("numeric", SimpleImputer(strategy="mean"), numeric), ("categorical", SimpleImputer(strategy="most_frequent"), categorical_low+categorical_high)])
    processor = ColumnTransformer([("one_hot", one_hot_encoder, categorical_low), (mode, high_cardinality_encoder, categorical_high), ("scale", scaler, numeric)])

    return make_pipeline(imputer, processor, xgb.XGBClassifier(n_estimators= 200,
                                                                 max_depth= 30,                         # Lower ratios avoid over-fitting. Default is 6.
                                                                 objective = 'binary:logistic',         # Default is reg:squarederror. 'multi:softprob' for multiclass and get proba.  
                                                                 #num_class = 2,                        # Use if softprob is set.
                                                                 reg_lambda = 10,                       # Larger ratios avoid over-fitting. Default is 1.
                                                                 gamma = 0.3,                           # Larger values avoid over-fitting. Default is 0. # Values from 0.3 to 0.8 if you have many columns (especially if you did one-hot encoding), or 0.8 to 1 if you only have a few columns.
                                                                 alpha = 1,                             # Larger ratios avoid over-fitting. Default is 0.
                                                                 learning_rate= 0.10,                   # Lower ratios avoid over-fitting. Default is 0.3.
                                                                 colsample_bytree= 0.7,                 # Lower ratios avoid over-fitting.
                                                                 scale_pos_weight = 1,                  # Default is 1. Control balance of positive and negative weights, for unbalanced classes.
                                                                 subsample = 0.1,                       # Lower ratios avoid over-fitting. Default 1. 0.5 recommended. # 0.1 if using GPU.
                                                                 min_child_weight = 3,                  # Larger ratios avoid over-fitting. Default is 1.
                                                                 missing = np.nan,                      # Deal with missing values
                                                                 num_parallel_tree = 2,                 # Parallel trees constructed during each iteration. Default is 1.
                                                                 importance_type = 'weight',
                                                                 eval_metric = 'auc',
                                                                 use_label_encoder = False,             # True is 
                                                                 #enable_categorical = True,
                                                                 verbosity = 1,
                                                                 nthread = -1,                          # Set -1 to use all threads.
                                                                 #use_rmm = True,                       # Use GPU if available
                                                                 tree_method = 'auto', # auto           # 'gpu_hist'. Default is auto: analyze the data and chooses the fastest.
                                                                 #gradient_based = True,
                                                                )) #RandomForestClassifier() #LogisticRegression())
embeddings_pipeline = build_pipeline("embeddings").fit(X_train, y_train)
embedding_preds = embeddings_pipeline.predict(X_test) 
CPU times: user 18min 11s, sys: 15 s, total: 18min 26s
Wall time: 3min 6s
# Check accuracy for classes
from sklearn.metrics import accuracy_score, balanced_accuracy_score, precision_score, recall_score, f1_score, cohen_kappa_score, matthews_corrcoef

print("Accuracy = {}".format(accuracy_score(y_test, embedding_preds)))
print("Balanced accuracy = {}".format(balanced_accuracy_score(y_test, embedding_preds)))
print("Precision = {}".format(precision_score(y_test, embedding_preds)))
print("Recall = {}".format(recall_score(y_test, embedding_preds)))
print("F1 = {}".format(f1_score(y_test, embedding_preds)))
print("Kappa_score = {}".format(cohen_kappa_score(y_test, embedding_preds)))
print("Matthews_corrcoef = {}".format(matthews_corrcoef(y_test, embedding_preds)))
Accuracy = 0.85885
Balanced accuracy = 0.8587227386116755
Precision = 0.839697904478247
Recall = 0.8571118349619978
F1 = 0.8483155123314169
Kappa_score = 0.7163577766348136
Matthews_corrcoef = 0.7164897258171989
# Check target column balance
0    53758
1    46242
Name: condition, dtype: int64
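
The class counts printed above give a reference point for the accuracy figures: a trivial classifier that always predicts the majority class (0, new) would already reach roughly 0.54 accuracy. A minimal sketch of that arithmetic, using only the two counts shown above (the variable names are illustrative and do not come from the notebook):

```python
# Class counts copied from the output above (0 = new, 1 = used).
class_counts = {0: 53758, 1: 46242}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = class_counts[majority_class] / total

print(f"Majority class = {majority_class}, baseline accuracy = {baseline_accuracy:.5f}")
```

Any reported accuracy is best read against this baseline rather than against 0.5.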
embeddings_pipeline = build_pipeline("embeddings").fit(X_train, y_train)
CPU times: user 15min 37s, sys: 8.47 s, total: 15min 45s
Wall time: 2min 24s
# Check probabilities score
embedding_preds = embeddings_pipeline.predict_proba(X_test) 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import cohen_kappa_score, brier_score_loss 
from sklearn.metrics import matthews_corrcoef, mean_squared_error, log_loss
from sklearn.metrics import f1_score, recall_score, precision_score
from sklearn.metrics import roc_auc_score, roc_curve, auc
from numpy import sqrt, argmax, argmin

# Plot F1-Score and Threshold
threshold_list = np.linspace(0.05, 0.95, 200)

f1_list = []
for threshold in threshold_list:
    pred_label = np.where(embedding_preds[:,1] < threshold, 0, 1)
    f1 = f1_score(y_test, pred_label)
    f1_list.append(f1)  # collect the score so the DataFrame below has one row per threshold

df_f1 = pd.DataFrame({'threshold':threshold_list, 'f1_score': f1_list})
df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]
bt = df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]['threshold'].values[0]
f1 = df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]['f1_score'].values[0]
title = "Best Threshold: " + str(round(bt, 2)) + " w/ F-1: " + str(round(f1, 2))
sns.lineplot(data=df_f1, x='threshold', y='f1_score').set_title(title)

# Plot your other Score and threshold
threshold_list = np.linspace(0.05, 0.95, 200)

score_list = []
for threshold in threshold_list:
    pred_label = np.where(embedding_preds[:,1] < threshold, 0, 1)
    score = brier_score_loss(y_test, pred_label)
    score_list.append(score)  # collect the score so the DataFrame below has one row per threshold

df_score = pd.DataFrame({'threshold':threshold_list, 'score_score': score_list})
df_score[df_score['score_score'] == min(df_score['score_score'])]
bt = df_score[df_score['score_score'] == min(df_score['score_score'])]['threshold'].values[0]
score = df_score[df_score['score_score'] == min(df_score['score_score'])]['score_score'].values[0]
title = "Best Threshold: " + str(round(bt, 2)) + " w/ Brier: " + str(round(score, 2))
sns.lineplot(data=df_score, x='threshold', y='score_score').set_title(title)

from sklearn.metrics import roc_curve

#Plot ROC_Curve
fpr, tpr, thresholds = roc_curve(y_test, embedding_preds[:,1])
roc = roc_auc_score(y_test, embedding_preds[:,1])

# calculate the g-mean for each threshold
gmeans = sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))

plt.figure()
lw = 2
plt.plot(fpr, tpr, lw=lw, label=f"ROC curve (area = {roc:.2f})")

plt.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best') #threshold

plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Embeddings + XGBoost Condition Classifier")
plt.legend(loc="lower right")
plt.savefig('emb_xgboost_curve.png', bbox_inches='tight', dpi = 300)



Best Threshold=0.405652, G-Mean=0.865


CPU times: user 2.06 s, sys: 613 ms, total: 2.67 s
Wall time: 1.83 s
# best_preds_score = np.where(embedding_preds < bt, 0, 1) # Uncomment if you want to change threshold... Lower, because threshold calculated on Brier Loss and lower is better
print("mean_squared_error_test = {}".format(mean_squared_error(y_test, embedding_preds[:,1], squared=False)))
print("Roc_auc = {}".format(roc_auc_score(y_test, embedding_preds[:,1])))
print("Brier_error = {}".format(brier_score_loss(y_test, embedding_preds[:,1])))
print("Logloss_test = {}".format(log_loss(y_test, embedding_preds[:,1])))
# print("Precision = {}".format(precision_score(Y_test, preds_test[:,1])))
# print("Recall = {}".format(recall_score(Y_test, preds_test[:,1])))
# print("F1 = {}".format(f1_score(Y_test, preds_test[:,1])))
# print("Kappa_score = {}".format(cohen_kappa_score(Y_test, preds_test[:,1])))
# print("Matthews_corrcoef = {}".format(matthews_corrcoef(Y_test, preds_test[:,1])))
mean_squared_error_test = 0.31567197986751955
Roc_auc = 0.9358383924037762
Brier_error = 0.09964879887347967
Logloss_test = 0.3227270290231638
best_preds_score = np.where(embedding_preds < bt, 0, 1) # Apply the threshold bt found with the Brier loss above (lower is better)
print("mean_squared_error_test = {}".format(mean_squared_error(y_test, best_preds_score[:,1], squared=False)))
print("Roc_auc = {}".format(roc_auc_score(y_test, best_preds_score[:,1])))
print("Brier_error = {}".format(brier_score_loss(y_test, best_preds_score[:,1])))
print("Logloss_test = {}".format(log_loss(y_test, best_preds_score[:,1])))
# print("Precision = {}".format(precision_score(Y_test, preds_test[:,1])))
# print("Recall = {}".format(recall_score(Y_test, preds_test[:,1])))
# print("F1 = {}".format(f1_score(Y_test, preds_test[:,1])))
# print("Kappa_score = {}".format(cohen_kappa_score(Y_test, preds_test[:,1])))
# print("Matthews_corrcoef = {}".format(matthews_corrcoef(Y_test, preds_test[:,1])))
mean_squared_error_test = 0.36939139134527754
Roc_auc = 0.8636920061833471
Brier_error = 0.13645
Logloss_test = 4.712875409194756
# apply threshold to positive probabilities to create labels
def to_labels_max(pos_probs, threshold):
    return (pos_probs >= threshold).astype('int')
# evaluate each threshold
scores = [roc_auc_score(y_test, to_labels_max(embedding_preds[:,1], t)) for t in thresholds]
# get best threshold for max is better
ix = argmax(scores)
print('Threshold=%.3f, Roc_auc=%.5f' % (thresholds[ix], scores[ix]))
Threshold=0.406, Roc_auc=0.86562
# evaluate each threshold
scores = [brier_score_loss(y_test, to_labels_max(embedding_preds[:,1], t)) for t in thresholds]
# get best threshold for min is better
ix = argmin(scores)
print('Threshold=%.3f, Brier=%.5f' % (thresholds[ix], scores[ix]))
Threshold=0.528, Brier=0.13625
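
When `roc_auc_score` is applied to hard 0/1 labels, as in the threshold sweep above, it reduces to the balanced accuracy (TPR + TNR) / 2, because the ROC curve of a single-threshold classifier has only one interior point. A small pure-Python sketch on made-up labels illustrates this; the data and the helper name `single_point_auc` are hypothetical and only meant to show the identity:

```python
def single_point_auc(y_true, y_pred):
    """Area under the ROC curve through (0, 0), (FPR, TPR), (1, 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    # Trapezoid rule over the two segments of the curve.
    auc = fpr * tpr / 2 + (1 - fpr) * (tpr + 1) / 2
    balanced_accuracy = (tpr + (1 - fpr)) / 2
    return auc, balanced_accuracy

# Toy labels, not taken from the dataset.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

auc, bal_acc = single_point_auc(y_true, y_pred)
print(auc, bal_acc)
```

Under this reading, the sweep that maximizes `roc_auc_score` over thresholded labels is effectively selecting the threshold with the best balanced accuracy.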

Model: Embeddings encoding + Neural Networks

import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance
from sklearn.preprocessing import label_binarize

from embedding_encoder import EmbeddingEncoder
from embedding_encoder.utils.compose import ColumnTransformerWithNames
dfs.select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']).columns
Index(['initial_quantity', 'available_quantity', 'sold_quantity', 'base_price',
       'price', 'days_active'],
Index(['condition', 'buying_mode', 'currency_id', 'seller_state',
       'seller_city', 'mode', 'status', 'category_id', 'seller_id',
       'year_start', 'month_start', 'year_stop', 'month_stop', 'week_day'],
# Split train and test
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64', 'category', 'bool']

X = dfs.select_dtypes(include=numerics).drop(columns=['condition'], axis=1)

dfs['condition'] = dfs['condition'].replace('new', 0)
dfs['condition'] = dfs['condition'].replace('used', 1)
y = dfs.condition

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
import keras
import tensorflow as tf
from keras.wrappers.scikit_learn import KerasClassifier
from keras.models import Sequential
from keras.layers import Activation, Dense
from keras.layers import Dropout

categorical_high = ["seller_city", "category_id"] # "seller_id"
numeric = X.select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']).columns#.drop(columns=['condition'], axis=1)
categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status", "week_day", "month_stop", "year_stop", "month_start", "year_start"] + list(X.select_dtypes(include=['bool']).columns)
#categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status", "week_day", "month_stop", "month_start"] + list(X.select_dtypes(include=['bool']).columns)
#categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status"] + list(X.select_dtypes(include=['bool']).columns)

def build_pipeline(mode: str):
    if mode == "embeddings":
        high_cardinality_encoder = EmbeddingEncoder(task="classification") #regression
    else:
        high_cardinality_encoder = OrdinalEncoder()
    one_hot_encoder = OneHotEncoder(handle_unknown="ignore")
    scaler = StandardScaler()
    imputer = ColumnTransformerWithNames([("numeric", SimpleImputer(strategy="mean"), numeric), ("categorical", SimpleImputer(strategy="most_frequent"), categorical_low+categorical_high)])
    processor = ColumnTransformer([("one_hot", one_hot_encoder, categorical_low), (mode, high_cardinality_encoder, categorical_high), ("scale", scaler, numeric)])

    def twoLayerFeedForward():
        model = Sequential()
        model.add(keras.layers.Dense(300, activation=tf.nn.relu)) #input_dim=300
        model.add(keras.layers.Dense(128, activation=tf.nn.relu))
        model.add(keras.layers.Dense(64, activation=tf.nn.relu))
        model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))
        model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
        return model

    # clf = KerasClassifier(TwoLayerFeedForward(), epochs=100, batch_size=500, verbose=0)
    model = KerasClassifier(twoLayerFeedForward, verbose=1, validation_split=0.15, shuffle=True, epochs=100, batch_size=512) #batch_size=32
    return make_pipeline(imputer, processor, model) #RandomForestClassifier() #LogisticRegression())
2022-08-03 18:14:41.595116: W tensorflow/stream_executor/platform/default/] Could not load dynamic library ''; dlerror: cannot open shared object file: No such file or directory
2022-08-03 18:14:41.595137: I tensorflow/stream_executor/cuda/] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
embeddings_pipeline = build_pipeline("embeddings")
history = embeddings_pipeline.fit(X_train, y_train)
/tmp/ipykernel_700621/ DeprecationWarning: KerasClassifier is deprecated, use Sci-Keras ( instead. See for help migrating.
  model = KerasClassifier(twoLayerFeedForward, verbose=1, validation_split=0.15, shuffle=True, epochs=100, batch_size=512) #batch_size=32
2022-08-03 18:14:43.959308: W tensorflow/stream_executor/platform/default/] Could not load dynamic library ''; dlerror: cannot open shared object file: No such file or directory
2022-08-03 18:14:43.959359: W tensorflow/stream_executor/cuda/] failed call to cuInit: UNKNOWN ERROR (303)
2022-08-03 18:14:43.959388: I tensorflow/stream_executor/cuda/] kernel driver does not appear to be running on this host (brspobitanl1727): /proc/driver/nvidia/version does not exist
2022-08-03 18:14:43.959671: I tensorflow/core/platform/] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

Epoch 1/100
150/150 [==============================] - 1s 4ms/step - loss: 0.3302 - accuracy: 0.8555 - val_loss: 0.4014 - val_accuracy: 0.8274
Epoch 2/100
150/150 [==============================] - 1s 3ms/step - loss: 0.2948 - accuracy: 0.8732 - val_loss: 0.3881 - val_accuracy: 0.8350
Epoch 3/100
150/150 [==============================] - 1s 3ms/step - loss: 0.2824 - accuracy: 0.8800 - val_loss: 0.3828 - val_accuracy: 0.8364
Epoch 4/100
150/150 [==============================] - 1s 3ms/step - loss: 0.2750 - accuracy: 0.8824 - val_loss: 0.3810 - val_accuracy: 0.8387
Epoch 5/100
150/150 [==============================] - 1s 4ms/step - loss: 0.2694 - accuracy: 0.8867 - val_loss: 0.3842 - val_accuracy: 0.8399
Epoch 6/100
150/150 [==============================] - 1s 4ms/step - loss: 0.2638 - accuracy: 0.8890 - val_loss: 0.3749 - val_accuracy: 0.8391
Epoch 7/100
150/150 [==============================] - 1s 3ms/step - loss: 0.2601 - accuracy: 0.8903 - val_loss: 0.3817 - val_accuracy: 0.8413
Epoch 8/100
150/150 [==============================] - 1s 4ms/step - loss: 0.2559 - accuracy: 0.8921 - val_loss: 0.3821 - val_accuracy: 0.8379
Epoch 9/100
150/150 [==============================] - 1s 4ms/step - loss: 0.2502 - accuracy: 0.8957 - val_loss: 0.3781 - val_accuracy: 0.8457
Epoch 10/100
150/150 [==============================] - 1s 4ms/step - loss: 0.2470 - accuracy: 0.8964 - val_loss: 0.3804 - val_accuracy: 0.8427
Epoch 11/100
150/150 [==============================] - 1s 3ms/step - loss: 0.2415 - accuracy: 0.8997 - val_loss: 0.3774 - val_accuracy: 0.8443
Epoch 12/100
150/150 [==============================] - 1s 3ms/step - loss: 0.2406 - accuracy: 0.9004 - val_loss: 0.3862 - val_accuracy: 0.8428
Epoch 13/100
150/150 [==============================] - 1s 4ms/step - loss: 0.2356 - accuracy: 0.9015 - val_loss: 0.3810 - val_accuracy: 0.8388
Epoch 14/100
150/150 [==============================] - 1s 4ms/step - loss: 0.2286 - accuracy: 0.9060 - val_loss: 0.3822 - val_accuracy: 0.8407
Epoch 15/100
150/150 [==============================] - 1s 3ms/step - loss: 0.2257 - accuracy: 0.9068 - val_loss: 0.3956 - val_accuracy: 0.8415
Epoch 16/100
150/150 [==============================] - 1s 3ms/step - loss: 0.2221 - accuracy: 0.9087 - val_loss: 0.4000 - val_accuracy: 0.8402
Epoch 17/100
150/150 [==============================] - 1s 3ms/step - loss: 0.2181 - accuracy: 0.9107 - val_loss: 0.3942 - val_accuracy: 0.8445
Epoch 18/100
150/150 [==============================] - 1s 3ms/step - loss: 0.2096 - accuracy: 0.9142 - val_loss: 0.4040 - val_accuracy: 0.8404
Epoch 19/100
150/150 [==============================] - 1s 4ms/step - loss: 0.2074 - accuracy: 0.9143 - val_loss: 0.4052 - val_accuracy: 0.8436
Epoch 20/100
150/150 [==============================] - 1s 3ms/step - loss: 0.2030 - accuracy: 0.9177 - val_loss: 0.4251 - val_accuracy: 0.8426
Epoch 21/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1960 - accuracy: 0.9199 - val_loss: 0.4199 - val_accuracy: 0.8422
Epoch 22/100
150/150 [==============================] - 1s 3ms/step - loss: 0.1944 - accuracy: 0.9209 - val_loss: 0.4563 - val_accuracy: 0.8390
Epoch 23/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1916 - accuracy: 0.9237 - val_loss: 0.4386 - val_accuracy: 0.8420
Epoch 24/100
150/150 [==============================] - 1s 3ms/step - loss: 0.1847 - accuracy: 0.9251 - val_loss: 0.4574 - val_accuracy: 0.8372
Epoch 25/100
150/150 [==============================] - 1s 3ms/step - loss: 0.1788 - accuracy: 0.9272 - val_loss: 0.4759 - val_accuracy: 0.8353
Epoch 26/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1738 - accuracy: 0.9300 - val_loss: 0.4750 - val_accuracy: 0.8450
Epoch 27/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1708 - accuracy: 0.9308 - val_loss: 0.4869 - val_accuracy: 0.8407
Epoch 28/100
150/150 [==============================] - 1s 3ms/step - loss: 0.1645 - accuracy: 0.9334 - val_loss: 0.4733 - val_accuracy: 0.8411
Epoch 29/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1609 - accuracy: 0.9350 - val_loss: 0.4808 - val_accuracy: 0.8321
Epoch 30/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1557 - accuracy: 0.9373 - val_loss: 0.5059 - val_accuracy: 0.8384
Epoch 31/100
150/150 [==============================] - 1s 3ms/step - loss: 0.1507 - accuracy: 0.9401 - val_loss: 0.4927 - val_accuracy: 0.8382
Epoch 32/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1430 - accuracy: 0.9423 - val_loss: 0.5239 - val_accuracy: 0.8348
Epoch 33/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1410 - accuracy: 0.9434 - val_loss: 0.5344 - val_accuracy: 0.8355
Epoch 34/100
150/150 [==============================] - 1s 3ms/step - loss: 0.1390 - accuracy: 0.9442 - val_loss: 0.5711 - val_accuracy: 0.8362
Epoch 35/100
150/150 [==============================] - 1s 3ms/step - loss: 0.1318 - accuracy: 0.9470 - val_loss: 0.5636 - val_accuracy: 0.8361
Epoch 36/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1295 - accuracy: 0.9484 - val_loss: 0.5880 - val_accuracy: 0.8398
Epoch 37/100
150/150 [==============================] - 1s 3ms/step - loss: 0.1216 - accuracy: 0.9520 - val_loss: 0.6103 - val_accuracy: 0.8346
Epoch 38/100
150/150 [==============================] - 1s 3ms/step - loss: 0.1163 - accuracy: 0.9540 - val_loss: 0.6112 - val_accuracy: 0.8316
Epoch 39/100
150/150 [==============================] - 1s 3ms/step - loss: 0.1173 - accuracy: 0.9536 - val_loss: 0.6456 - val_accuracy: 0.8292
Epoch 40/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1127 - accuracy: 0.9547 - val_loss: 0.6430 - val_accuracy: 0.8373
Epoch 41/100
150/150 [==============================] - 1s 3ms/step - loss: 0.1061 - accuracy: 0.9580 - val_loss: 0.6648 - val_accuracy: 0.8347
Epoch 42/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1020 - accuracy: 0.9607 - val_loss: 0.7315 - val_accuracy: 0.8348
Epoch 43/100
150/150 [==============================] - 1s 4ms/step - loss: 0.1013 - accuracy: 0.9598 - val_loss: 0.6618 - val_accuracy: 0.8333
Epoch 44/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0938 - accuracy: 0.9637 - val_loss: 0.7261 - val_accuracy: 0.8273
Epoch 45/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0941 - accuracy: 0.9627 - val_loss: 0.7338 - val_accuracy: 0.8279
Epoch 46/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0913 - accuracy: 0.9640 - val_loss: 0.8022 - val_accuracy: 0.8339
Epoch 47/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0849 - accuracy: 0.9672 - val_loss: 0.7733 - val_accuracy: 0.8305
Epoch 48/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0839 - accuracy: 0.9679 - val_loss: 0.8097 - val_accuracy: 0.8351
Epoch 49/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0822 - accuracy: 0.9686 - val_loss: 0.8593 - val_accuracy: 0.8363
Epoch 50/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0792 - accuracy: 0.9694 - val_loss: 0.8464 - val_accuracy: 0.8343
Epoch 51/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0766 - accuracy: 0.9709 - val_loss: 0.8365 - val_accuracy: 0.8360
Epoch 52/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0683 - accuracy: 0.9743 - val_loss: 0.9086 - val_accuracy: 0.8327
Epoch 53/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0726 - accuracy: 0.9721 - val_loss: 0.9122 - val_accuracy: 0.8352
Epoch 54/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0626 - accuracy: 0.9765 - val_loss: 0.9309 - val_accuracy: 0.8290
Epoch 55/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0742 - accuracy: 0.9711 - val_loss: 0.9134 - val_accuracy: 0.8314
Epoch 56/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0594 - accuracy: 0.9771 - val_loss: 0.9703 - val_accuracy: 0.8296
Epoch 57/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0592 - accuracy: 0.9777 - val_loss: 0.9761 - val_accuracy: 0.8267
Epoch 58/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0619 - accuracy: 0.9764 - val_loss: 0.9635 - val_accuracy: 0.8291
Epoch 59/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0585 - accuracy: 0.9777 - val_loss: 0.9953 - val_accuracy: 0.8311
Epoch 60/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0535 - accuracy: 0.9805 - val_loss: 1.0472 - val_accuracy: 0.8281
Epoch 61/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0503 - accuracy: 0.9809 - val_loss: 1.0811 - val_accuracy: 0.8307
Epoch 62/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0492 - accuracy: 0.9814 - val_loss: 1.1155 - val_accuracy: 0.8359
Epoch 63/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0521 - accuracy: 0.9813 - val_loss: 1.1467 - val_accuracy: 0.8324
Epoch 64/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0465 - accuracy: 0.9823 - val_loss: 1.1086 - val_accuracy: 0.8286
Epoch 65/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0453 - accuracy: 0.9834 - val_loss: 1.1806 - val_accuracy: 0.8213
Epoch 66/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0457 - accuracy: 0.9829 - val_loss: 1.1553 - val_accuracy: 0.8266
Epoch 67/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0521 - accuracy: 0.9806 - val_loss: 1.1109 - val_accuracy: 0.8237
Epoch 68/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0488 - accuracy: 0.9822 - val_loss: 1.1458 - val_accuracy: 0.8236
Epoch 69/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0400 - accuracy: 0.9856 - val_loss: 1.2181 - val_accuracy: 0.8319
Epoch 70/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0411 - accuracy: 0.9846 - val_loss: 1.2346 - val_accuracy: 0.8304
Epoch 71/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0433 - accuracy: 0.9839 - val_loss: 1.1918 - val_accuracy: 0.8281
Epoch 72/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0376 - accuracy: 0.9864 - val_loss: 1.3038 - val_accuracy: 0.8265
Epoch 73/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0361 - accuracy: 0.9870 - val_loss: 1.3390 - val_accuracy: 0.8274
Epoch 74/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0379 - accuracy: 0.9862 - val_loss: 1.2512 - val_accuracy: 0.8244
Epoch 75/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0419 - accuracy: 0.9852 - val_loss: 1.3643 - val_accuracy: 0.8255
Epoch 76/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0421 - accuracy: 0.9846 - val_loss: 1.2699 - val_accuracy: 0.8267
Epoch 77/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0398 - accuracy: 0.9858 - val_loss: 1.3021 - val_accuracy: 0.8292
Epoch 78/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0304 - accuracy: 0.9898 - val_loss: 1.3497 - val_accuracy: 0.8275
Epoch 79/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0407 - accuracy: 0.9856 - val_loss: 1.3319 - val_accuracy: 0.8291
Epoch 80/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0430 - accuracy: 0.9869 - val_loss: 1.3290 - val_accuracy: 0.8302
Epoch 81/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0337 - accuracy: 0.9891 - val_loss: 1.3899 - val_accuracy: 0.8301
Epoch 82/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0315 - accuracy: 0.9892 - val_loss: 1.3707 - val_accuracy: 0.8276
Epoch 83/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0336 - accuracy: 0.9875 - val_loss: 1.3784 - val_accuracy: 0.8274
Epoch 84/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0345 - accuracy: 0.9875 - val_loss: 1.4005 - val_accuracy: 0.8295
Epoch 85/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0307 - accuracy: 0.9893 - val_loss: 1.3823 - val_accuracy: 0.8269
Epoch 86/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0401 - accuracy: 0.9862 - val_loss: 1.4838 - val_accuracy: 0.8297
Epoch 87/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0352 - accuracy: 0.9872 - val_loss: 1.4347 - val_accuracy: 0.8322
Epoch 88/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0284 - accuracy: 0.9894 - val_loss: 1.4827 - val_accuracy: 0.8289
Epoch 89/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0369 - accuracy: 0.9873 - val_loss: 1.4705 - val_accuracy: 0.8270
Epoch 90/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0313 - accuracy: 0.9889 - val_loss: 1.5390 - val_accuracy: 0.8243
Epoch 91/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0290 - accuracy: 0.9895 - val_loss: 1.4780 - val_accuracy: 0.8302
Epoch 92/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0281 - accuracy: 0.9900 - val_loss: 1.5518 - val_accuracy: 0.8297
Epoch 93/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0284 - accuracy: 0.9901 - val_loss: 1.5659 - val_accuracy: 0.8321
Epoch 94/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0306 - accuracy: 0.9893 - val_loss: 1.4831 - val_accuracy: 0.8287
Epoch 95/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0359 - accuracy: 0.9867 - val_loss: 1.5319 - val_accuracy: 0.8230
Epoch 96/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0336 - accuracy: 0.9881 - val_loss: 1.5192 - val_accuracy: 0.8311
Epoch 97/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0279 - accuracy: 0.9902 - val_loss: 1.4872 - val_accuracy: 0.8316
Epoch 98/100
150/150 [==============================] - 1s 3ms/step - loss: 0.0269 - accuracy: 0.9910 - val_loss: 1.5875 - val_accuracy: 0.8327
Epoch 99/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0270 - accuracy: 0.9902 - val_loss: 1.4886 - val_accuracy: 0.8284
Epoch 100/100
150/150 [==============================] - 1s 4ms/step - loss: 0.0332 - accuracy: 0.9882 - val_loss: 1.5127 - val_accuracy: 0.8242
CPU times: user 7min 20s, sys: 19.6 s, total: 7min 39s
Wall time: 1min 51s
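
The log above shows the validation loss bottoming out early while the training loss keeps falling, which is the usual signature of over-fitting. A minimal sketch of patience-based early stopping, replayed on the first 16 `val_loss` values from that log, is given here. The function `early_stop_epoch` is a hypothetical helper written for illustration; it only approximates what a callback such as Keras `EarlyStopping` does:

```python
def early_stop_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs fail to improve
    on the best validation loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# val_loss for epochs 1-16, copied from the training log above.
val_losses = [0.4014, 0.3881, 0.3828, 0.3810, 0.3842, 0.3749, 0.3817, 0.3821,
              0.3781, 0.3804, 0.3774, 0.3862, 0.3810, 0.3822, 0.3956, 0.4000]

stop_epoch, best_epoch = early_stop_epoch(val_losses, patience=10)
print(f"best epoch = {best_epoch}, stop after epoch = {stop_epoch}")
```

With a patience of 10, this replay stops after epoch 16 with the best validation loss at epoch 6, rather than running all 100 epochs.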
# from keras.utils.vis_utils import plot_model
# plot_model(model, to_file='model.png')

# import matplotlib.pyplot as plt

# plt.plot(history[0]['accuracy'])
# plt.plot(history[0]['val_accuracy'])
# plt.title('model accuracy')
# plt.ylabel('accuracy')
# plt.xlabel('epoch')
# plt.legend(['train', 'val'], loc='upper left')
import keras
import tensorflow as tf
from keras.wrappers.scikit_learn import KerasClassifier
from keras.models import Sequential
from keras.layers import Activation, Dense
from keras.layers import Dropout

categorical_high = ["seller_city", "category_id"] # "seller_id"
numeric = X.select_dtypes(include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']).columns#.drop(columns=['condition'], axis=1)
categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status", "week_day", "month_stop", "year_stop", "month_start", "year_start"] + list(X.select_dtypes(include=['bool']).columns)
#categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status", "week_day", "month_stop", "month_start"] + list(X.select_dtypes(include=['bool']).columns)
#categorical_low = ["buying_mode", "currency_id", "seller_state", "mode", "status"] + list(X.select_dtypes(include=['bool']).columns)

def build_pipeline(mode: str):
    if mode == "embeddings":
        high_cardinality_encoder = EmbeddingEncoder(task="classification") #regression
    else:
        high_cardinality_encoder = OrdinalEncoder()
    one_hot_encoder = OneHotEncoder(handle_unknown="ignore")
    scaler = StandardScaler()
    imputer = ColumnTransformerWithNames([("numeric", SimpleImputer(strategy="mean"), numeric), ("categorical", SimpleImputer(strategy="most_frequent"), categorical_low+categorical_high)])
    processor = ColumnTransformer([("one_hot", one_hot_encoder, categorical_low), (mode, high_cardinality_encoder, categorical_high), ("scale", scaler, numeric)])

    def threeLayerFeedForward():
        model = Sequential()     

        model.add(keras.layers.Dense(300, activation=tf.nn.relu, kernel_initializer='glorot_uniform')) #input_dim=df_train.shape[1])) #16

        model.add(keras.layers.Dense(128, activation=tf.nn.relu, kernel_initializer='glorot_uniform')) #8
        model.add(keras.layers.Dense(64, activation=tf.nn.relu, kernel_initializer='glorot_uniform')) # None
        model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid, kernel_initializer='glorot_uniform')) # nn.softmax if multiclass

        optimizer = tf.keras.optimizers.Adamax(
            learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07,
        )
#         tf.keras.optimizers.Adam(
#              learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False,
#              name='Adam'
#          )

        model.compile(optimizer=optimizer, # 'adam' # SGD()
                      loss='binary_crossentropy', # categorical_crossentropy if multiclass
                      metrics=['accuracy']) # accuracy is the metric shown in the training log
        return model

    es_callback = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)
    # clf = KerasClassifier(TwoLayerFeedForward(), epochs=100, batch_size=500, verbose=0)
    model = KerasClassifier(threeLayerFeedForward, verbose=1, validation_split=0.05, shuffle=True, epochs=100, batch_size=512, callbacks=[es_callback]) #batch_size=32
    return make_pipeline(imputer, processor, model) #RandomForestClassifier() #LogisticRegression())
embeddings_pipeline = build_pipeline("embeddings")
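# Pipeline.fit returns the fitted pipeline itself, so `history` is the fitted model, not a Keras History object.
history = embeddings_pipeline.fit(X_train, y_train)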
/tmp/ipykernel_700621/ DeprecationWarning: KerasClassifier is deprecated, use Sci-Keras ( instead. See for help migrating.
  model = KerasClassifier(threeLayerFeedForward, verbose=1, validation_split=0.05, shuffle=True, epochs=100, batch_size=512, callbacks=[es_callback]) #batch_size=32

Epoch 1/100
167/167 [==============================] - 1s 5ms/step - loss: 0.4341 - accuracy: 0.7957 - val_loss: 0.4002 - val_accuracy: 0.8182
Epoch 2/100
167/167 [==============================] - 1s 5ms/step - loss: 0.3534 - accuracy: 0.8501 - val_loss: 0.3905 - val_accuracy: 0.8218
Epoch 3/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3447 - accuracy: 0.8528 - val_loss: 0.3935 - val_accuracy: 0.8253
Epoch 4/100
167/167 [==============================] - 1s 5ms/step - loss: 0.3396 - accuracy: 0.8553 - val_loss: 0.3886 - val_accuracy: 0.8260
Epoch 5/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3367 - accuracy: 0.8573 - val_loss: 0.3838 - val_accuracy: 0.8269
Epoch 6/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3344 - accuracy: 0.8592 - val_loss: 0.3834 - val_accuracy: 0.8278
Epoch 7/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3307 - accuracy: 0.8597 - val_loss: 0.3816 - val_accuracy: 0.8267
Epoch 8/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3289 - accuracy: 0.8610 - val_loss: 0.3795 - val_accuracy: 0.8269
Epoch 9/100
167/167 [==============================] - 1s 5ms/step - loss: 0.3264 - accuracy: 0.8620 - val_loss: 0.3776 - val_accuracy: 0.8307
Epoch 10/100
167/167 [==============================] - 1s 5ms/step - loss: 0.3238 - accuracy: 0.8642 - val_loss: 0.3795 - val_accuracy: 0.8298
Epoch 11/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3241 - accuracy: 0.8641 - val_loss: 0.3786 - val_accuracy: 0.8322
Epoch 12/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3193 - accuracy: 0.8656 - val_loss: 0.3772 - val_accuracy: 0.8327
Epoch 13/100
167/167 [==============================] - 1s 5ms/step - loss: 0.3202 - accuracy: 0.8663 - val_loss: 0.3740 - val_accuracy: 0.8342
Epoch 14/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3167 - accuracy: 0.8672 - val_loss: 0.3747 - val_accuracy: 0.8331
Epoch 15/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3158 - accuracy: 0.8680 - val_loss: 0.3710 - val_accuracy: 0.8358
Epoch 16/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3159 - accuracy: 0.8697 - val_loss: 0.3698 - val_accuracy: 0.8360
Epoch 17/100
167/167 [==============================] - 1s 5ms/step - loss: 0.3130 - accuracy: 0.8697 - val_loss: 0.3688 - val_accuracy: 0.8369
Epoch 18/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3109 - accuracy: 0.8706 - val_loss: 0.3679 - val_accuracy: 0.8384
Epoch 19/100
167/167 [==============================] - 1s 5ms/step - loss: 0.3099 - accuracy: 0.8713 - val_loss: 0.3657 - val_accuracy: 0.8378
Epoch 20/100
167/167 [==============================] - 1s 5ms/step - loss: 0.3077 - accuracy: 0.8724 - val_loss: 0.3652 - val_accuracy: 0.8380
Epoch 21/100
167/167 [==============================] - 1s 5ms/step - loss: 0.3076 - accuracy: 0.8719 - val_loss: 0.3635 - val_accuracy: 0.8387
Epoch 22/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3064 - accuracy: 0.8734 - val_loss: 0.3649 - val_accuracy: 0.8364
Epoch 23/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3036 - accuracy: 0.8738 - val_loss: 0.3636 - val_accuracy: 0.8407
Epoch 24/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3024 - accuracy: 0.8753 - val_loss: 0.3615 - val_accuracy: 0.8422
Epoch 25/100
167/167 [==============================] - 1s 5ms/step - loss: 0.3012 - accuracy: 0.8756 - val_loss: 0.3643 - val_accuracy: 0.8382
Epoch 26/100
167/167 [==============================] - 1s 4ms/step - loss: 0.3007 - accuracy: 0.8766 - val_loss: 0.3591 - val_accuracy: 0.8456
Epoch 27/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2979 - accuracy: 0.8764 - val_loss: 0.3595 - val_accuracy: 0.8420
Epoch 28/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2971 - accuracy: 0.8765 - val_loss: 0.3581 - val_accuracy: 0.8442
Epoch 29/100
167/167 [==============================] - 1s 5ms/step - loss: 0.2968 - accuracy: 0.8770 - val_loss: 0.3579 - val_accuracy: 0.8398
Epoch 30/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2935 - accuracy: 0.8793 - val_loss: 0.3575 - val_accuracy: 0.8436
Epoch 31/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2926 - accuracy: 0.8785 - val_loss: 0.3597 - val_accuracy: 0.8416
Epoch 32/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2921 - accuracy: 0.8803 - val_loss: 0.3559 - val_accuracy: 0.8458
Epoch 33/100
167/167 [==============================] - 1s 5ms/step - loss: 0.2904 - accuracy: 0.8804 - val_loss: 0.3551 - val_accuracy: 0.8444
Epoch 34/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2901 - accuracy: 0.8802 - val_loss: 0.3555 - val_accuracy: 0.8418
Epoch 35/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2880 - accuracy: 0.8816 - val_loss: 0.3516 - val_accuracy: 0.8458
Epoch 36/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2873 - accuracy: 0.8814 - val_loss: 0.3551 - val_accuracy: 0.8451
Epoch 37/100
167/167 [==============================] - 1s 5ms/step - loss: 0.2876 - accuracy: 0.8815 - val_loss: 0.3571 - val_accuracy: 0.8458
Epoch 38/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2853 - accuracy: 0.8826 - val_loss: 0.3512 - val_accuracy: 0.8473
Epoch 39/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2844 - accuracy: 0.8829 - val_loss: 0.3523 - val_accuracy: 0.8462
Epoch 40/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2830 - accuracy: 0.8829 - val_loss: 0.3554 - val_accuracy: 0.8520
Epoch 41/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2828 - accuracy: 0.8842 - val_loss: 0.3530 - val_accuracy: 0.8511
Epoch 42/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2798 - accuracy: 0.8853 - val_loss: 0.3543 - val_accuracy: 0.8473
Epoch 43/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2806 - accuracy: 0.8859 - val_loss: 0.3523 - val_accuracy: 0.8478
Epoch 44/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2795 - accuracy: 0.8861 - val_loss: 0.3570 - val_accuracy: 0.8473
Epoch 45/100
167/167 [==============================] - 1s 5ms/step - loss: 0.2773 - accuracy: 0.8858 - val_loss: 0.3496 - val_accuracy: 0.8476
Epoch 46/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2770 - accuracy: 0.8867 - val_loss: 0.3506 - val_accuracy: 0.8551
Epoch 47/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2772 - accuracy: 0.8869 - val_loss: 0.3527 - val_accuracy: 0.8484
Epoch 48/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2747 - accuracy: 0.8870 - val_loss: 0.3520 - val_accuracy: 0.8518
Epoch 49/100
167/167 [==============================] - 1s 5ms/step - loss: 0.2734 - accuracy: 0.8881 - val_loss: 0.3575 - val_accuracy: 0.8500
Epoch 50/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2733 - accuracy: 0.8882 - val_loss: 0.3517 - val_accuracy: 0.8544
Epoch 51/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2728 - accuracy: 0.8884 - val_loss: 0.3537 - val_accuracy: 0.8542
Epoch 52/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2730 - accuracy: 0.8887 - val_loss: 0.3493 - val_accuracy: 0.8507
Epoch 53/100
167/167 [==============================] - 1s 5ms/step - loss: 0.2696 - accuracy: 0.8904 - val_loss: 0.3528 - val_accuracy: 0.8511
Epoch 54/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2701 - accuracy: 0.8887 - val_loss: 0.3534 - val_accuracy: 0.8478
Epoch 55/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2705 - accuracy: 0.8895 - val_loss: 0.3549 - val_accuracy: 0.8531
Epoch 56/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2691 - accuracy: 0.8902 - val_loss: 0.3511 - val_accuracy: 0.8529
Epoch 57/100
167/167 [==============================] - 1s 5ms/step - loss: 0.2680 - accuracy: 0.8900 - val_loss: 0.3499 - val_accuracy: 0.8536
Epoch 58/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2666 - accuracy: 0.8908 - val_loss: 0.3526 - val_accuracy: 0.8531
Epoch 59/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2661 - accuracy: 0.8922 - val_loss: 0.3504 - val_accuracy: 0.8520
Epoch 60/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2647 - accuracy: 0.8920 - val_loss: 0.3479 - val_accuracy: 0.8538
Epoch 61/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2633 - accuracy: 0.8918 - val_loss: 0.3533 - val_accuracy: 0.8536
Epoch 62/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2654 - accuracy: 0.8919 - val_loss: 0.3530 - val_accuracy: 0.8524
Epoch 63/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2640 - accuracy: 0.8922 - val_loss: 0.3489 - val_accuracy: 0.8533
Epoch 64/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2622 - accuracy: 0.8923 - val_loss: 0.3552 - val_accuracy: 0.8502
Epoch 65/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2603 - accuracy: 0.8938 - val_loss: 0.3524 - val_accuracy: 0.8547
Epoch 66/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2614 - accuracy: 0.8938 - val_loss: 0.3515 - val_accuracy: 0.8576
Epoch 67/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2595 - accuracy: 0.8942 - val_loss: 0.3507 - val_accuracy: 0.8544
Epoch 68/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2595 - accuracy: 0.8946 - val_loss: 0.3518 - val_accuracy: 0.8542
Epoch 69/100
167/167 [==============================] - 1s 4ms/step - loss: 0.2595 - accuracy: 0.8949 - val_loss: 0.3519 - val_accuracy: 0.8553
Epoch 70/100
167/167 [==============================] - 1s 5ms/step - loss: 0.2569 - accuracy: 0.8956 - val_loss: 0.3536 - val_accuracy: 0.8569
CPU times: user 6min 8s, sys: 16.6 s, total: 6min 25s
Wall time: 1min 37s
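The run above stopped at epoch 70 of 100 because of the `EarlyStopping(monitor='val_loss', patience=10)` callback: the lowest `val_loss` in the log (0.3479) was reached at epoch 60, and the ten epochs after it did not improve on that value. The patience rule can be sketched in plain Python as follows. `early_stopping_epoch` is a hypothetical helper written for illustration, not part of Keras, and it assumes the callback's default `min_delta=0`.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # number of epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# val_loss values copied from epochs 60 to 70 of the log above
tail = [0.3479, 0.3533, 0.3530, 0.3489, 0.3552, 0.3524,
        0.3515, 0.3507, 0.3518, 0.3519, 0.3536]
stop = early_stopping_epoch(tail, patience=10)
print(stop)  # position inside `tail`; position 1 is epoch 60 of the run
```

Note that this callback configuration does not set `restore_best_weights`, so the weights used for prediction are those from the last epoch, not from the epoch with the lowest `val_loss`.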
# extract the test set predictions
preds_test = history.predict_proba(X_test)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import cohen_kappa_score, brier_score_loss 
from sklearn.metrics import matthews_corrcoef, mean_squared_error, log_loss
from sklearn.metrics import f1_score, recall_score, precision_score
from sklearn.metrics import roc_auc_score, roc_curve, auc
from numpy import sqrt, argmax, argmin

# Plot F1-Score and Threshold
threshold_list = np.linspace(0.05, 0.95, 200)

f1_list = []
for threshold in threshold_list:
    pred_label = np.where(preds_test[:,1] < threshold, 0, 1)
    f1 = f1_score(y_test, pred_label)
    f1_list.append(f1)  # collect one F1 value per threshold

df_f1 = pd.DataFrame({'threshold':threshold_list, 'f1_score': f1_list})
df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]
bt = df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]['threshold'].values[0]
f1 = df_f1[df_f1['f1_score'] == max(df_f1['f1_score'])]['f1_score'].values[0]
title = "Best Threshold: " + str(round(bt, 2)) + " w/ F-1: " + str(round(f1, 2))
sns.lineplot(data=df_f1, x='threshold', y='f1_score').set_title(title)
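As a dependency-free illustration of the sweep above, the sketch below computes F1 from raw confusion counts for each candidate threshold and keeps the best one. The helper names and the toy labels and probabilities are assumptions made for this example; they are not taken from the notebook's data.

```python
def f1_at_threshold(y_true, probs, threshold):
    """F1 of the positive class when probs >= threshold is labelled 1."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(y_true, probs, thresholds):
    """Return (threshold, f1) for the first threshold with the highest F1."""
    scored = [(f1_at_threshold(y_true, probs, t), t) for t in thresholds]
    best_f1, best_t = max(scored, key=lambda pair: pair[0])
    return best_t, best_f1


# Toy data, for illustration only
y_toy = [0, 0, 1, 1]
p_toy = [0.1, 0.4, 0.35, 0.8]
best_t, best_f1 = best_f1_threshold(y_toy, p_toy, [0.2, 0.3, 0.5])
print(best_t, best_f1)
```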

# Plot your other Score and threshold
threshold_list = np.linspace(0.05, 0.95, 200)

score_list = []
for threshold in threshold_list:
    pred_label = np.where(preds_test[:,1] < threshold, 0, 1)
    score = brier_score_loss(y_test, pred_label)
    score_list.append(score)  # collect one Brier value per threshold

df_score = pd.DataFrame({'threshold':threshold_list, 'score_score': score_list})
df_score[df_score['score_score'] == min(df_score['score_score'])]
bt = df_score[df_score['score_score'] == min(df_score['score_score'])]['threshold'].values[0]
score = df_score[df_score['score_score'] == min(df_score['score_score'])]['score_score'].values[0]
title = "Best Threshold: " + str(round(bt, 2)) + " w/ Brier: " + str(round(score, 2))
sns.lineplot(data=df_score, x='threshold', y='score_score').set_title(title)

from sklearn.metrics import roc_curve

#Plot ROC_Curve
fpr, tpr, thresholds = roc_curve(y_test, preds_test[:,1])
roc = roc_auc_score(y_test, preds_test[:,1])

# calculate the g-mean for each threshold
gmeans = sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))

lw = 2
plt.plot(
    fpr, tpr, lw=lw,
    label=f"ROC curve (area = {roc:.2f})",
)

plt.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best') #threshold

plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("NNet Condition Classifier")
plt.legend(loc="lower right")
plt.savefig('emb_nnet_roc_curve.png', bbox_inches='tight', dpi = 300)



Best Threshold=0.348351, G-Mean=0.858
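The G-mean reported above is the geometric mean of sensitivity (TPR) and specificity (1 - FPR), so it rewards thresholds that keep both classes well classified. A minimal stdlib sketch of that selection is shown below; `g_mean` and the toy inputs are assumptions for illustration and do not reproduce the notebook's numbers.

```python
from math import sqrt


def g_mean(y_true, probs, threshold):
    """sqrt(TPR * (1 - FPR)) when probs >= threshold is labelled 1."""
    preds = [1 if p >= threshold else 0 for p in probs]
    pos = sum(y_true)
    neg = len(y_true) - pos
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    tpr = tp / pos
    fpr = fp / neg
    return sqrt(tpr * (1 - fpr))


# Toy data, for illustration only
y_toy = [0, 0, 0, 1, 1, 1]
p_toy = [0.1, 0.3, 0.6, 0.4, 0.7, 0.9]
candidates = [0.2, 0.35, 0.5]
best_t = max(candidates, key=lambda t: g_mean(y_toy, p_toy, t))
print(best_t, g_mean(y_toy, p_toy, best_t))
```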


CPU times: user 1.36 s, sys: 657 ms, total: 2.02 s
Wall time: 1.19 s
# best_preds_score = np.where(preds_test < bt, 0, 1) # Uncomment if you want to change threshold... Lower, because threshold calculated on Brier Loss and lower is better
# Note: squared=False makes mean_squared_error return the root mean squared error (RMSE).
print("mean_squared_error_test = {}".format(mean_squared_error(y_test, preds_test[:,1], squared=False)))
print("Roc_auc = {}".format(roc_auc_score(y_test, preds_test[:,1])))
print("Brier_error = {}".format(brier_score_loss(y_test, preds_test[:,1])))
print("Logloss_test = {}".format(log_loss(y_test, preds_test[:,1])))
# print("Precision = {}".format(precision_score(y_test, preds_test[:,1])))
# print("Recall = {}".format(recall_score(y_test, preds_test[:,1])))
# print("F1 = {}".format(f1_score(y_test, preds_test[:,1])))
# print("Kappa_score = {}".format(cohen_kappa_score(y_test, preds_test[:,1])))
# print("Matthews_corrcoef = {}".format(matthews_corrcoef(y_test, preds_test[:,1])))
mean_squared_error_test = 0.32670411986655384
Roc_auc = 0.9300543679940618
Brier_error = 0.10673558193777959
Logloss_test = nan
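`Logloss_test = nan` means the log loss could not be evaluated on these predictions. One plausible cause, stated here as an assumption and not as a diagnosis of this run, is that some predicted probabilities saturate at exactly 0 or 1, where the logarithm is undefined. A minimal sketch of a clipped binary log loss follows; `clipped_log_loss` and its `eps` default are illustrative choices, not the notebook's code.

```python
import math


def clipped_log_loss(y_true, probs, eps=1e-7):
    """Mean binary cross-entropy with probabilities clipped to [eps, 1 - eps]."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(float(p), eps), 1.0 - eps)  # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Saturated predictions no longer produce inf or nan
print(clipped_log_loss([1, 0, 1], [1.0, 0.0, 0.0]))
```

Converting the probability column to 64-bit floats before scoring is another option worth trying, since clipping with a very small epsilon can be lost in 32-bit precision.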
# apply threshold to positive probabilities to create labels
def to_labels_max(pos_probs, threshold):
    return (pos_probs >= threshold).astype('int')
# evaluate each threshold
scores = [roc_auc_score(y_test, to_labels_max(preds_test[:,1], t)) for t in thresholds]
# get best threshold for max is better
ix = argmax(scores)
print('Threshold=%.3f, Roc_auc=%.5f' % (thresholds[ix], scores[ix]))
Threshold=0.348, Roc_auc=0.85848
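The value 0.85848 is much lower than the 0.93 ROC AUC printed earlier because here `roc_auc_score` receives hard 0/1 labels instead of probabilities. For binary hard labels the AUC reduces to the balanced accuracy, (TPR + TNR) / 2. The sketch below checks that identity on toy data; `pairwise_auc` and `balanced_accuracy` are hypothetical helpers written for this example.

```python
def pairwise_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def balanced_accuracy(y_true, labels):
    """Mean of the true positive rate and the true negative rate."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    tp = sum(1 for y, l in zip(y_true, labels) if y == 1 and l == 1)
    tn = sum(1 for y, l in zip(y_true, labels) if y == 0 and l == 0)
    return 0.5 * (tp / pos + tn / neg)


# Toy data, for illustration only
y_toy = [0, 0, 0, 1, 1, 1]
hard_labels = [0, 0, 1, 0, 1, 1]
auc_hard = pairwise_auc(y_toy, hard_labels)
bal_acc = balanced_accuracy(y_toy, hard_labels)
print(auc_hard, bal_acc)
```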
# evaluate each threshold
scores = [brier_score_loss(y_test, to_labels_max(preds_test[:,1], t)) for t in thresholds]
# get best threshold for min is better
ix = argmin(scores)
print('Threshold=%.3f, Brier=%.5f' % (thresholds[ix], scores[ix]))
Threshold=0.435, Brier=0.14370
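The 0.14370 above comes from `brier_score_loss` applied to hard 0/1 labels. In that case every squared error is either 0 or 1, so the Brier score equals the misclassification rate, and 0.14370 corresponds to an accuracy of about 0.856 at threshold 0.435. A small sketch of the identity, using hypothetical helpers and toy data:

```python
def brier(y_true, preds):
    """Mean squared difference between predictions and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, preds)) / len(y_true)


def error_rate(y_true, labels):
    """Fraction of labels that differ from the true class."""
    return sum(1 for y, l in zip(y_true, labels) if y != l) / len(y_true)


# Toy data, for illustration only
y_toy = [0, 0, 0, 1, 1, 1]
p_toy = [0.1, 0.3, 0.6, 0.4, 0.7, 0.9]
hard_labels = [1 if p >= 0.5 else 0 for p in p_toy]

print(brier(y_toy, hard_labels), error_rate(y_toy, hard_labels))
print(brier(y_toy, p_toy))  # Brier score on the probabilities themselves
```

This is why the Brier score on probabilities (0.1067 earlier in the output) and the thresholded value are not directly comparable.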






predict old and new products






