Share LightGBM categorical variable representation across multiple columns #6632

Open
carlosg-m opened this issue Aug 30, 2024 · 2 comments

carlosg-m commented Aug 30, 2024

Summary

Share LightGBM categorical variable representation across multiple columns.

Motivation

This would be useful for sequence data and for other cases where several columns draw on the same set of category levels.

Description

Encode each category level the same way across multiple columns.

This is very similar to using an embedding layer in a neural network to encode a categorical variable and then applying that layer across a sequence of the same variable (a word embedding, for example).

The same effect can be achieved before modeling with target encoding or another representation: stack the category columns into a key-value structure (category level, target), then fit the encoding on the stacked data.

Example train dataset

| Var 1 | Var 2 | Var 3 | Target |
|-------|-------|-------|--------|
| Goose | Cat   | Dog   | 103    |
| Cat   | Cat   | Dog   | 4      |
| Goose | Goose | Goose | 300    |

Stack shared category levels

| Category Level | Target |
|----------------|--------|
| Goose          | 103    |
| Cat            | 103    |
| Dog            | 103    |
| Cat            | 4      |
| Cat            | 4      |
| Dog            | 4      |
| Goose          | 300    |
| Goose          | 300    |
| Goose          | 300    |

Naive target encoding

| Category Level | Encoding |
|----------------|----------|
| Goose          | 250      |
| Dog            | 53       |
| Cat            | 37       |
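
The stacking and naive target encoding steps above can be sketched in plain Python. This is only an illustration of the manual workaround, not anything LightGBM does today; the variable names (`rows`, `shared_columns`, `encoding`, `encoded_rows`) are made up for the example. The exact per-level means come out as 250.75, 53.5 and 37.0, which the table above appears to show truncated to integers.

```python
from collections import defaultdict

# Toy training data from the example: three columns sharing one vocabulary.
rows = [
    {"Var 1": "Goose", "Var 2": "Cat", "Var 3": "Dog", "Target": 103},
    {"Var 1": "Cat", "Var 2": "Cat", "Var 3": "Dog", "Target": 4},
    {"Var 1": "Goose", "Var 2": "Goose", "Var 3": "Goose", "Target": 300},
]
shared_columns = ["Var 1", "Var 2", "Var 3"]  # hypothetical name for the shared group

# Stack: one (category level, target) pair per cell of the shared columns.
stacked = [(row[col], row["Target"]) for row in rows for col in shared_columns]

# Naive target encoding: mean target per level, computed over the stacked pairs.
# No smoothing or out-of-fold fitting is applied, so this leaks the target.
targets_by_level = defaultdict(list)
for level, target in stacked:
    targets_by_level[level].append(target)
encoding = {level: sum(ts) / len(ts) for level, ts in targets_by_level.items()}

# Apply the single shared mapping to every column, so a level such as "Goose"
# gets the same numeric value wherever it appears.
encoded_rows = [[encoding[row[col]] for col in shared_columns] for row in rows]
```

With pandas, the same idea would typically be written with `DataFrame.melt` followed by a `groupby` mean. The encoded columns could then be passed to LightGBM as ordinary numeric features.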

carlosg-m (Author) commented

No thoughts, suggestions or alternatives?

jameslamb (Collaborator) commented

This project is maintained mostly by volunteers, one of whom may respond here as we have time and interest.

If you can share more details (relevant research, a prototype implementation, or your own results from doing this feature engineering manually that demonstrate the benefits), maintainers here may be more likely to invest time in this.
