3rd Place Solution - Kaggle-OTTO-Comp - Benny's Part

The OTTO Kaggle competition was a session-based, multi-task recommender system problem. OTTO provided 12M sessions over 4 weeks with click, add-to-cart and order events for training. Our team (Chris, Giba, Theo and I) won 3rd place and achieved a score of 0.60437 on the public LB (private LB 0.60382). Our solution is an ensemble of our individual models.

My model achieved public LB 0.60102 and private LB 0.6003. This repository contains the code to run my model end-to-end.

I shared my approach in a Kaggle write-up. You can find my teammates' solutions here: Chris and Theo.

Requirements

The code runs on a single V100 GPU with 32 GB of memory and requires ~1.5 TB of disk space.

I recommend using Docker. I used the base image nvcr.io/nvidia/merlin/merlin-tensorflow:22.12, which includes RAPIDS cuDF, TensorFlow, XGBoost, etc.

I installed the following additional libraries:

# General
pip install pandarallel

# For Transformer4Rec
pip install torch torchmetrics==0.10.0
cd /nvtabular && git pull origin main && pip install .
cd / && rm -rf dataloader && git clone https://github.com/NVIDIA-Merlin/dataloader.git && cd dataloader && git pull origin main && pip install .
cd / && rm -rf transformers4rec && git clone https://github.com/NVIDIA-Merlin/Transformers4Rec.git && mv Transformers4Rec transformers4rec && cd transformers4rec && git pull origin main && git checkout 851a8903316cccab91850b1e39ade5018b287945 && pip install .

# For Word2Vec
pip install gensim

(If I missed any, a plain pip install should fix it.)

Symbolic Links

create_symbolic_links.sh sets up a few symbolic links (see the directory explanation below). The pipeline was originally written with all files located in one directory; I split the files into individual steps to make them easier to understand.

Model Summary

I provide a detailed description here.

(figure: model pipeline overview)

My final model uses a similar pipeline to the one proposed by Chris and described in many top solutions. Briefly:

Pipeline for all targets:

  1. Feature Engineering

Pipelines per target:

  2. Generating candidates
  3. Adding features/scores for session x candidate pairs
  4. Training an XGBoost model (see the sketch after this list)
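
To make step 4 concrete: a reranker of this kind is typically trained as an XGBoost ranking model on session x candidate pairs with a binary label (whether the candidate turned out to be a real click/cart/order). The sketch below only shows the general shape of such a step; all feature and column names are placeholders of mine, not the ones used in the actual notebooks.

# Minimal sketch of the reranker step (step 4). Feature/column names are
# illustrative placeholders, not the ones used in the actual notebooks.
import xgboost as xgb

FEATURES = ["covisit_score", "w2v_sim", "gru_score", "item_popularity"]  # assumed features

def train_reranker(df):
    # df: one row per (session, candidate) pair with FEATURES and a 0/1 "label" column.
    df = df.sort_values("session")                     # ranking objectives need grouped rows
    group_sizes = df.groupby("session").size().values  # number of candidates per session
    dtrain = xgb.DMatrix(df[FEATURES], label=df["label"])
    dtrain.set_group(group_sizes)
    params = {
        "objective": "rank:pairwise",  # pairwise ranking objective
        "tree_method": "gpu_hist",     # train on the GPU
        "eta": 0.1,
        "max_depth": 8,
    }
    return xgb.train(params, dtrain, num_boost_round=500)

At inference time the model scores every candidate of a session, and the top 20 items by score form the prediction for that session.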

(figure: week-4 validation split)

I split my dataset as proposed by Radek and truncated the 4th week. One difference is that I split the 4th week into 5 chunks and combine only 1 chunk of truncated week-4 sessions with the remaining untruncated week-4 sessions. Therefore, some parts of my pipeline have to be executed 5 times!
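
To illustrate the validation setup, the sketch below splits week-4 sessions into 5 chunks and builds one fold from 1 truncated chunk plus the 4 remaining untruncated chunks. It only shows the idea; the real pipeline relies on the host's truncation script, and truncate_fn here is a placeholder.

# Sketch of the week-4 fold setup: 5 chunks, 1 truncated + 4 untruncated per fold.
# Illustrative only; the actual truncation follows the host's script.
import numpy as np
import pandas as pd

def build_fold(week4, truncate_fn, igfold, n_folds=5, seed=42):
    # week4: event dataframe with a "session" column; truncate_fn truncates sessions.
    rng = np.random.RandomState(seed)
    sessions = week4["session"].unique()
    fold_of = pd.Series(rng.randint(0, n_folds, len(sessions)), index=sessions)
    in_fold = week4["session"].map(fold_of) == igfold
    truncated = truncate_fn(week4[in_fold])  # labels come from the truncated chunk
    untruncated = week4[~in_fold]            # the other chunks stay untouched
    return pd.concat([truncated, untruncated], ignore_index=True)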

Files Inventory

I provide a quick overview of the directory structure:

├── 00_Preprocess                  # Contains code (incl. host scripts) for preprocessing
├── 00_submissions                 # Stores submission files
├── 01a_FE_Word2Vec                # Trains a Word2Vec model to generate item embeddings (see the sketch after this listing)
├── 01b_FE_CoVisit_1               # Creates co-visitation matrices based on my own logic
├── 01c_FE_CoVisit_2_Chris         # Creates co-visitation matrices from Chris' solution after we merged (only 10 used)
├── 01d_FE_GRU                     # Trains a GRU model to generate scores for session x candidate pairs
├── 01e_FE_Transformer             # Trains a Transformer model to generate scores for session x candidate pairs
├── 01f_FE_other                   # Generates additional features such as historical sales
├── 02_Clicks                      # Runs steps 2-4 for the clicks target
├── 03_Carts                       # Runs steps 2-4 for the carts target
├── 04_Orders                      # Runs steps 2-4 for the orders target
├── README.md
├── create_symbolic_links.sh
├── data                           # Contains the original dataset from Kaggle
├── data_folds                     # Contains temporary files per fold
├── data_tmp                       # Contains temporary files
├── img
└── preprocess                     # Contains the dataset after the host script is applied to week 4
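
For context on 01a_FE_Word2Vec: item embeddings can be trained by treating each session as a "sentence" of item IDs and feeding those to gensim. The sketch below shows that idea under assumed hyperparameters; it is not the exact configuration used in the notebook.

# Sketch of training item embeddings with gensim (the idea behind 01a_FE_Word2Vec).
# Hyperparameters are assumptions, not the values used in the actual notebook.
import pandas as pd
from gensim.models import Word2Vec

def train_item_embeddings(events):
    # events: dataframe with "session" and "aid" columns, sorted by timestamp.
    sentences = events.groupby("session")["aid"].apply(lambda s: s.astype(str).tolist())
    return Word2Vec(
        sentences=sentences.tolist(),
        vector_size=64,  # embedding dimension (assumed)
        window=5,
        min_count=1,
        sg=1,            # skip-gram
        workers=8,
        epochs=5,
    )

# model.wv.most_similar("12345") then returns items that frequently co-occur with item 12345.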

How to run the code

Note:

  • Code and notebooks are indexed by numbers and should be executed in that order.
  • I developed the code to run almost all files in one directory and split it into separate folders afterwards to make it more readable.
  • I originally developed my pipeline to use only 20% truncated and 80% untruncated sessions of week 4. I noticed that I could improve my LB scores by running the pipeline for all 5 folds. Therefore, many scripts have to be run 5 times.
  • I originally developed my pipeline to generate candidates for clicks, carts and orders at the same time. Since I use different candidate generation techniques per target, there are 3 pipelines (02_Clicks, 03_Carts and 04_Orders) which all share similar code (see the candidate generation sketch after this list).
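
To make the candidate generation step concrete: a co-visitation based generator counts item pairs that occur in the same session and proposes, for each session, the items most often co-visited with its history. The sketch below shows that idea only; the weighting, time windows and per-event-type logic of the real pipeline are omitted, and in practice the pair counting has to be done in chunks for memory reasons.

# Sketch of co-visitation based candidate generation (the idea behind step 2).
# Real pipelines add time/type weighting and process the data in chunks.
import pandas as pd

def build_covisit_top_k(events, k=20):
    # Count (aid_x, aid_y) pairs occurring in the same session; keep the top-k per aid_x.
    pairs = events[["session", "aid"]].merge(events[["session", "aid"]], on="session")
    pairs = pairs[pairs["aid_x"] != pairs["aid_y"]][["aid_x", "aid_y"]]
    counts = pairs.value_counts().reset_index(name="wgt")
    counts["rank"] = counts.groupby("aid_x")["wgt"].rank(method="first", ascending=False)
    return counts[counts["rank"] <= k]

def generate_candidates(session_aids, covisit):
    # Candidates for one session: top co-visited items of the session's history.
    cands = covisit[covisit["aid_x"].isin(session_aids)]
    return cands.sort_values("wgt", ascending=False)["aid_y"].unique().tolist()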

Steps

Some folders contain additional README.md files.

For all Targets

0. Preparation:

  • Download and extract the dataset into ./data/
  • Execute create_symbolic_links.sh to generate symbolic links
  • Follow 00_Preprocess/README.md to generate the local CV, convert the jsonl files to parquet and split the data into folds (a conversion sketch follows this list)
  • Chris' public solution (LB 0.575) is used to fill up sessions with fewer than 20 recommendations. Either download his files from Kaggle or run the notebooks in ./00_submissions/01_baseline/
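
For reference, the jsonl-to-parquet conversion mentioned above boils down to flattening each session's event list into one row per event. The sketch below shows that idea with placeholder paths; the actual conversion lives in 00_Preprocess.

# Sketch of converting the Kaggle jsonl files into flat parquet tables.
# Paths are placeholders; the real scripts are in 00_Preprocess.
import pandas as pd

def jsonl_to_parquet(jsonl_path, parquet_path, chunksize=100_000):
    out = []
    for chunk in pd.read_json(jsonl_path, lines=True, chunksize=chunksize):
        # Each row holds a session id plus a list of {aid, ts, type} events.
        events = chunk.explode("events").reset_index(drop=True)
        events = pd.concat(
            [events[["session"]], pd.json_normalize(events["events"].tolist())], axis=1
        )
        out.append(events)
    pd.concat(out, ignore_index=True).to_parquet(parquet_path, index=False)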

1. Feature Engineering: Most scripts can be executed independently of each other (only the GRU depends on the Word2Vec scripts).

The scripts contain a variable igfold at the beginning. Each script needs to be executed 5 times, with igfold = 0, ..., 4.
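
If you prefer not to edit igfold by hand five times, one option is to parameterize the notebooks and drive them with papermill. This is my suggestion, not part of the original pipeline, and the notebook path below is a placeholder.

# Hypothetical automation of the 5 fold runs with papermill.
# Assumes the notebook is refactored so that igfold is a papermill parameter;
# the original scripts instead set igfold manually at the top.
import papermill as pm

for igfold in range(5):
    pm.execute_notebook(
        "01b_FE_CoVisit_1/covisit.ipynb",                # placeholder input notebook
        f"01b_FE_CoVisit_1/covisit_fold{igfold}.ipynb",  # per-fold output notebook
        parameters={"igfold": igfold},
    )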

Note: The pipelines for clicks, carts and orders can be run independently from each other, but each pipeline writes files to the same directories and might overwrite files from the other pipelines. It is recommended to run only one pipeline end-to-end at a time.

Clicks

2. Clicks: Execute the scripts in 02_Clicks. The final notebook 04_Combine-Bags.ipynb ensembles the different folds and bags into a final submission.csv (clicks only).

Carts

3. Carts: Execute the scripts in 03_Carts. The final notebook 04_Combine-Bags.ipynb ensembles the different folds and bags into a final submission.csv (carts only).

Orders

4. Orders: Execute the scripts in 04_Orders. The final notebook 06_Combine-Bags.ipynb ensembles the different folds and bags into a final dataframe for orders. It loads the previous submission.csv files from 02_Clicks and 03_Carts and generates the final submission.csv.
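
The final assembly is essentially a concatenation of the per-target predictions. The sketch below shows the idea with placeholder file names; 06_Combine-Bags.ipynb does this for the real files.

# Sketch of assembling the final submission from the per-target outputs.
# File names are placeholders for the outputs described above.
import pandas as pd

parts = [
    pd.read_csv("02_Clicks/submission.csv"),   # rows like "123_clicks", "aid1 aid2 ..."
    pd.read_csv("03_Carts/submission.csv"),
    pd.read_csv("04_Orders/submission_orders.csv"),
]
submission = pd.concat(parts, ignore_index=True)
assert submission["session_type"].is_unique    # one row per session x type
submission.to_csv("submission.csv", index=False)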
