BinaryContextTransformer

Efficiently creates two-way interaction terms for sparse, binary data in large datasets and vocabularies.

Overview
Repository Contents
Example
Acknowledgements

Overview

Suppose you are working with a dataset that includes two variables: the text of a message and the type of medium through which it was sent.

type	message
text	text me if ur doing anything 2nite
tweet	Holla! Anyone doing anything tonight?
email	Sent you a text. What are you doing tonight?

If you want to distinguish the words in the messages based on the type of medium, you may have to compute every possible combination of words and types. For large datasets that contain many unique words, this is computationally onerous. Moreover, such datasets are usually sparse and most combinations will never occur.

Base features, such as message words, are variables that may have different meanings in different contexts. Context features, such as message types, are indicator variables that denote which context a record belongs to. BinaryContextTransformer efficiently produces combinations between context features and base features so that they can be used for exploratory analysis or prediction.

Examples of binary context features from the table above are text_x_anything or tweet_x_anything. These combination features may be useful if the meaning of the word "anything" differs based on the medium it was sent through.
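To make the combination features concrete, here is a minimal pure-Python sketch that builds them naively for the three hypothetical records. It is not how BinaryContextTransformer works internally. The tokenize helper and its rule (lowercase, tokens of two or more word characters) are assumptions made for illustration.

```python
import re

# The three hypothetical records from the table above.
records = [
    ("text", "text me if ur doing anything 2nite"),
    ("tweet", "Holla! Anyone doing anything tonight?"),
    ("email", "Sent you a text. What are you doing tonight?"),
]

def tokenize(message):
    """Lowercase and keep tokens of two or more word characters (assumed rule)."""
    return set(re.findall(r"\b\w\w+\b", message.lower()))

# Pair each record's context with every word that actually occurs in it.
observed = set()
for medium, message in records:
    for word in tokenize(message):
        observed.add(f"{medium}_x_{word}")

vocabulary = set().union(*(tokenize(message) for _, message in records))
contexts = {medium for medium, _ in records}
possible = len(contexts) * len(vocabulary)

print(sorted(f for f in observed if f.endswith("_x_anything")))
print(len(observed), "observed of", possible, "possible combinations")
```

Even in this tiny dataset, fewer than half of the possible type and word combinations occur, which is the sparsity the transformer is designed to exploit.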

Reminder: This is a hypothetical example. Emails, texts, and tweets contain personal information. If you are actually analyzing such data, make appropriate considerations for consent and privacy.

Benefits

Follows Scikit-Learn Transformer format.
Excludes interaction terms for base features that appear in only one context.
For sparse data, fit_transform runs in O(S + V), where:
- N = number of records, rows in the input matrix
- B = number of base features, columns in the input matrix
- C = number of context features, columns in the input matrix
- S = number of nonzero entries in the input matrix
  - For sparse matrices, N < S << N x B
- V = number of combinations in resulting vocabulary
  - For sparse interactions, V << B x C
Input matrices will be converted to compressed sparse column (CSC) format, if not already in that format. The output matrix will also be in CSC format.
Serialized transformer has similar file size to CountVectorizer from Scikit-Learn.
Accepts a custom progress bar function, such as tqdm or a function with a similar interface.
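The role of sparsity in the running time can be pictured with a small pure-Python sketch. This is an illustration under stated assumptions, not the code in binarycontexttransformer.py: the helper name context_pairs is hypothetical, each record is represented as a list of active column indices, and the pruning rule assumes that "only one context" means a base feature observed with a single context.

```python
from collections import defaultdict

def context_pairs(base_rows, context_rows):
    """Return sorted (base, context) pairs for base features seen in 2+ contexts.

    base_rows[i] and context_rows[i] list the active (nonzero) column indices
    of record i, so the loops visit only stored entries and never the zeros.
    """
    contexts_by_base = defaultdict(set)
    for base_cols, context_cols in zip(base_rows, context_rows):
        for b in base_cols:
            contexts_by_base[b].update(context_cols)
    return sorted(
        (b, c)
        for b, seen in contexts_by_base.items()
        if len(seen) > 1  # assumed pruning rule: drop single-context features
        for c in seen
    )

# Toy input: 3 records with one context each, and base columns 0 through 4.
base_rows = [[0, 1, 2], [1, 2, 3], [2, 4]]
context_rows = [[0], [1], [2]]
pairs = context_pairs(base_rows, context_rows)
print(pairs)
```

Base columns 0, 3, and 4 each occur with a single context, so only columns 1 and 2 contribute pairs to the resulting vocabulary.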

Drawbacks

Only designed for binary features.
May increase model overfitting.
Must be fit in sequence after other transformers, such as CountVectorizer.
Input must be split into two matrices: one with base features (X) and one with context features (X_context).

Related Tools

BinaryContextTransformer is similar to PolynomialFeatures in Scikit-Learn, which supports other variable types. PolynomialFeatures can also generate interaction terms of any degree, not just two-way interactions. However, since every possible combination of features is considered, PolynomialFeatures runs in polynomial time at the requested degree.

BinaryContextTransformer focuses on just binary data and takes advantage of sparsity to compute interaction terms in O(S + V) instead of O(N x (C + B)), as described above.
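A rough count shows the difference in scale, using the numbers from the Example section. The value B = 14 assumes CountVectorizer's default tokenization (lowercased tokens of two or more characters); C = 3 and V = 9 are read from the example data and output.

```python
from math import comb

B = 14  # base features: distinct message words (assumes default tokenization)
C = 3   # context features: text, tweet, email
V = 9   # combinations kept in the output of the Example section

all_pairs = comb(B + C, 2)  # every two-way product among all B + C columns
cross_pairs = B * C         # only base-by-context products

print(all_pairs, cross_pairs, V)
```

Considering every pair of columns yields 136 candidate terms and restricting to base-by-context pairs yields 42, while only 9 survive in the example vocabulary.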

Repository Contents

binarycontexttransformer.py: Python class for transformer.
Examples.ipynb: Jupyter notebook with example usage on hypothetical data.
Rare Occupation Classification.ipynb: Jupyter notebook with hypothetical data to illustrate application of binary context terms.

Example

This example shows how to create the binary context features described above. Usually, other transformers are used to convert input data into matrix form before using BinaryContextTransformer.

>>> import pandas as pd
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from binarycontexttransformer import BinaryContextTransformer
>>> # Hypothetical records from the table in the Overview.
>>> data = [
...     ("text", "text me if ur doing anything 2nite"),
...     ("tweet", "Holla! Anyone doing anything tonight?"),
...     ("email", "Sent you a text. What are you doing tonight?")
... ]
>>> df = pd.DataFrame(data, columns=["type", "message"])
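>>> # Binarize the context column (message type) and the base column (message words).
>>> vzr_type = CountVectorizer(analyzer="word", binary=True)
>>> X_type = vzr_type.fit_transform(df["type"])
>>> vzr_msg = CountVectorizer(analyzer="word", binary=True)
>>> X_msg = vzr_msg.fit_transform(df["message"])
>>> # Note: get_feature_names() was removed from CountVectorizer in scikit-learn 1.2;
>>> # on newer versions, get_feature_names_out().tolist() is the likely substitute.
>>> bct = BinaryContextTransformer(
...     features=vzr_msg.get_feature_names(),
...     contexts=vzr_type.get_feature_names()
... )
>>> # Pass the base feature matrix first and the context feature matrix second.
>>> X_msg_type = bct.fit_transform(X_msg, X_type)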
>>> print(X_msg_type.todense())
[[1 0 0 1 0 0 1 0 0]
 [0 1 0 0 1 0 0 0 1]
 [0 0 1 0 0 1 0 1 0]]
>>> bct.get_feature_names()
['text_x_anything',
 'tweet_x_anything',
 'email_x_doing',
 'text_x_doing',
 'tweet_x_doing',
 'email_x_text',
 'text_x_text',
 'email_x_tonight',
 'tweet_x_tonight']
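To read the output, each row of the matrix can be mapped back to feature names. The following sketch uses only the values printed in the session above and plain Python lists, so it does not depend on the transformer itself.

```python
# Output matrix and feature names copied from the session above.
matrix = [
    [1, 0, 0, 1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 1, 0, 1, 0],
]
names = [
    "text_x_anything", "tweet_x_anything", "email_x_doing",
    "text_x_doing", "tweet_x_doing", "email_x_text",
    "text_x_text", "email_x_tonight", "tweet_x_tonight",
]

# For each record, list the names of the columns that are set to 1.
active = [[name for name, flag in zip(names, row) if flag] for row in matrix]
for record_features in active:
    print(record_features)
```

Each record activates only the features for its own medium. Words that occur in a single medium, such as "2nite" or "holla", do not appear in the vocabulary.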

For an example discussion of using BinaryContextTransformer for a classification task, read the Rare Occupation Classification.ipynb Jupyter notebook.

Acknowledgements

Developed by Vinesh Kannan, Coding It Forward Data Science Fellow at the Bureau of Labor Statistics.

Thank you to Alex Measure, Brandon Kopp, George Stamas, James Walker, Jennifer Edgar, and Mohamed Moulaye.
