Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery

Sukrut Rao*, Sweta Mahajan*, Moritz Böhle, Bernt Schiele

European Conference on Computer Vision (ECCV) 2024

Paper | Code | Poster

Setup

Prerequisites

All the dependencies and packages can be installed using pip. The code was tested using Python 3.10.

Installing the Packages

Use:

pip install -r requirements.txt
pip install -e sparse_autoencoder/
pip install -e .

Dataset for training Sparse Autoencoder (CC3M)

Download the CC3M tar file to train the SAE

Note: The number of image-caption pairs you can download may be smaller than the number we used for training. We downloaded the dataset in December 2023, and more URLs may have become invalid since then.

Download 'Train_GCC-training.tsv' and 'Validation_GCC-1.1.0-Validation.tsv' from https://ai.google.com/research/ConceptualCaptions/download by clicking on the training split and the validation split.
Rename them to cc3m_training.tsv and cc3m_validation.tsv, respectively.

For the training dataset:

sed -i '1s/^/caption\turl\n/' cc3m_training.tsv 
img2dataset --url_list cc3m_training.tsv --input_format "tsv" --url_col "url" --caption_col "caption" --output_format webdataset --output_folder training --processes_count 16 --thread_count 64 --image_size 256 --enable_wandb True

For the validation dataset:

sed -i '1s/^/caption\turl\n/' cc3m_validation.tsv 
img2dataset --url_list cc3m_validation.tsv --input_format "tsv" --url_col "url" --caption_col "caption" --output_format webdataset --output_folder validation --processes_count 16 --thread_count 64 --image_size 256 --enable_wandb True
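The sed commands above prepend the header unconditionally, so running one of them twice on the same file would add a second header line. The following is a minimal Python sketch of an idempotent alternative. The helper name `ensure_header` is hypothetical and not part of this repository, and the sketch assumes the TSV file fits in memory.

```python
from pathlib import Path

# Header line expected by the img2dataset commands (--caption_col / --url_col).
HEADER = "caption\turl\n"


def ensure_header(tsv_path: str, header: str = HEADER) -> bool:
    """Prepend `header` to the TSV file unless it already starts with it.

    Returns True if the file was modified, False if the header was present.
    Hypothetical helper, not part of the repository; it reads the whole file
    into memory, which is an assumption about the available RAM.
    """
    path = Path(tsv_path)
    text = path.read_text(encoding="utf-8")
    if text.startswith(header):
        return False  # header already present, nothing to do
    path.write_text(header + text, encoding="utf-8")
    return True
```

For example, `ensure_header("cc3m_training.tsv")` could be called in place of the first sed command, followed by the same img2dataset invocation.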

Vocabulary for naming concepts

We use the vocabulary of 20k words used by CLIP-Dissect, from here. Download the text file and place it as clipdissect_20k.txt in the vocab_dir specified in config.py. Then compute the normalized CLIP embedding of each word and save the embeddings as embeddings_<encoder_name>_clipdissect_20k.pth in vocab_dir. For example, for CLIP ResNet-50, the embedding file should be named embeddings_clip_RN50_clipdissect_20k.pth.
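The embedding step above could be sketched as follows. This is an illustration under several assumptions, not the repository's own procedure: it uses the OpenAI `clip` package, embeds each word as-is without a prompt template, maps the encoder name clip_RN50 to the CLIP architecture string "RN50", and saves a single tensor of shape (number of words, embedding dimension). The layout that scripts/assign_names.py actually expects should be checked before relying on this. The function names `embedding_path` and `save_vocab_embeddings` are hypothetical.

```python
from pathlib import Path


def embedding_path(vocab_dir: str, encoder_name: str) -> Path:
    """Build the file name described above for a given encoder name."""
    return Path(vocab_dir) / f"embeddings_{encoder_name}_clipdissect_20k.pth"


def save_vocab_embeddings(vocab_dir: str, encoder_name: str = "clip_RN50",
                          clip_arch: str = "RN50", batch_size: int = 512) -> Path:
    """Embed every vocabulary word with CLIP, L2-normalize, and save.

    Hypothetical helper; the saved format (one float tensor) is an assumption.
    """
    # Imported lazily so embedding_path() works without torch installed.
    import torch
    import clip  # OpenAI CLIP package (assumed encoder source)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, _ = clip.load(clip_arch, device=device)
    model.eval()

    vocab_file = Path(vocab_dir) / "clipdissect_20k.txt"
    lines = vocab_file.read_text(encoding="utf-8").splitlines()
    words = [w.strip() for w in lines if w.strip()]

    chunks = []
    with torch.no_grad():
        for start in range(0, len(words), batch_size):
            tokens = clip.tokenize(words[start:start + batch_size]).to(device)
            feats = model.encode_text(tokens).float()
            # Normalize each embedding to unit L2 norm.
            feats = feats / feats.norm(dim=-1, keepdim=True)
            chunks.append(feats.cpu())

    out_path = embedding_path(vocab_dir, encoder_name)
    torch.save(torch.cat(chunks), out_path)
    return out_path
```

Keeping the file-name logic in a separate function makes it easy to confirm that the output matches the naming pattern above before running the encoder.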

Datasets for training downstream probes

These are the datasets used to train linear probes on the learnt concept bottleneck, which together form a concept bottleneck model (CBM). In our paper, we use four datasets: Places365, ImageNet, CIFAR10, and CIFAR100. Instructions for running experiments on these datasets are provided below. For other datasets, you may need to define your own utils.

Download the respective datasets:
- Places365
- ImageNet
- CIFAR10
- CIFAR100
Set the paths to the datasets in config.py.

Usage

The following shows example usage with CLIP ResNet-50 as the model, CC3M as the dataset for training the SAE, and Places365 as the dataset for downstream classification.

Training a Sparse Autoencoder (SAE)

Save the CLIP features of CC3M for training the SAE

python scripts/save_cc3m_features.py --img_enc_name clip_RN50

Train the SAE

python scripts/train_sae_img.py --lr 5e-4 --l1_coeff 3e-5 --expansion_factor 8 --img_enc_name clip_RN50 --num_epochs 200 --resample_freq 10 --ckpt_freq 0 --val_freq 1 --train_sae_bs 4096

Assigning Names to Concepts

python scripts/assign_names.py --lr 5e-4 --l1_coeff 3e-5 --expansion_factor 8 --img_enc_name clip_RN50 --num_epochs 200 --resample_freq 10 --train_sae_bs 4096
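Since the vocabulary is stored as normalized CLIP text embeddings, one plausible reading of the naming step is a nearest-neighbor match: each concept vector is compared against all word embeddings by cosine similarity and takes the name of the closest word. The sketch below illustrates that idea in plain Python with toy vectors. It is an assumption made for illustration, not a description of what scripts/assign_names.py does, and `name_concepts` is a hypothetical function.

```python
import math
from typing import Dict, List, Sequence


def _normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def name_concepts(concept_vectors: Sequence[Sequence[float]],
                  vocab_embeddings: Dict[str, Sequence[float]]) -> List[str]:
    """Return, for each concept vector, the most cosine-similar vocabulary word.

    Hypothetical illustration; the inputs stand in for concept directions
    and vocabulary text embeddings living in the same embedding space.
    """
    vocab = {word: _normalize(emb) for word, emb in vocab_embeddings.items()}
    names = []
    for vec in concept_vectors:
        unit = _normalize(vec)
        # Dot product of unit vectors equals cosine similarity.
        best = max(vocab, key=lambda w: sum(a * b for a, b in zip(unit, vocab[w])))
        names.append(best)
    return names
```

With a toy vocabulary `{"dog": [1.0, 0.0], "cat": [0.0, 1.0]}`, the concept vector `[0.9, 0.1]` would be named "dog", because it points mostly along the first axis.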

Training a Linear Probe for the Concept Bottleneck Model

Save the CLIP features of the probe dataset

python scripts/save_probe_features.py --img_enc_name clip_RN50  --probe_dataset places365

Save concept strengths using the trained SAE

python scripts/save_concept_strengths.py --lr 5e-4 --l1_coeff 3e-5 --expansion_factor 8 --img_enc_name clip_RN50 --num_epochs 200  --resample_freq 10  --train_sae_bs 4096  --probe_dataset places365 --probe_split train

Train the probe on the saved concept strengths

python scripts/train_linear_probe.py --lr 5e-4 --l1_coeff 3e-5 --expansion_factor 8 --img_enc_name clip_RN50 --resample_freq 10 --train_sae_bs 4096 --num_epochs 200 --ckpt_freq 0 --val_freq 1 --probe_lr 1e-2  --probe_sparsity_loss_lambda 0.1 --probe_classification_loss 'CE' --probe_epochs 200 --probe_sparsity_loss L1 --probe_eval_coverage_freq 50 --probe_dataset places365
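The commands above repeat the same seven SAE flags (--lr, --l1_coeff, --expansion_factor, --img_enc_name, --num_epochs, --resample_freq, --train_sae_bs). Presumably the later scripts use them to locate the trained SAE, although that is an assumption, so it helps to keep them in sync. The following sketch builds the command lines from one shared dictionary. The helper names `build_command` and `run` are hypothetical, while the script paths and flag values are copied from the commands above.

```python
import subprocess
import sys
from typing import Dict, List, Optional

# SAE hyperparameters that appear in every command above.
SAE_FLAGS: Dict[str, str] = {
    "--lr": "5e-4",
    "--l1_coeff": "3e-5",
    "--expansion_factor": "8",
    "--img_enc_name": "clip_RN50",
    "--num_epochs": "200",
    "--resample_freq": "10",
    "--train_sae_bs": "4096",
}


def build_command(script: str, extra: Optional[Dict[str, str]] = None) -> List[str]:
    """Return the argument list for `script` with the shared SAE flags.

    `extra` holds script-specific flags; they are appended after the
    shared ones and override a shared flag of the same name.
    """
    flags = dict(SAE_FLAGS)
    flags.update(extra or {})
    cmd = [sys.executable, script]
    for name, value in flags.items():
        cmd += [name, value]
    return cmd


def run(script: str, extra: Optional[Dict[str, str]] = None) -> None:
    """Run one pipeline step and raise if it exits with an error."""
    subprocess.run(build_command(script, extra), check=True)
```

For example, `run("scripts/save_concept_strengths.py", {"--probe_dataset": "places365", "--probe_split": "train"})` would reproduce the concept-strength command above with the shared flags filled in.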

Acknowledgements

This repository uses code from the following repositories:

Citation

Please cite as follows:

@inproceedings{Rao2024Discover,
    author    = {Rao, Sukrut and Mahajan, Sweta and B\"ohle, Moritz and Schiele, Bernt},
    title     = {Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery},
    booktitle = {European Conference on Computer Vision},
    year      = {2024}
}
