
Unveiling the Mechanisms of Bias

This repository accompanies the Master's thesis in Artificial Intelligence by Tarmo Pungas at the University of Amsterdam (2024). The research is part of a project on bias identification methods initiated by Rhite and has been guided and supervised by Rhite and the University of Amsterdam.

This repository builds on code from the paper The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets by Samuel Marks and Max Tegmark. We thank the authors for making their code publicly available.

⚠️ Note on Code Quality

Please note that this code was developed as part of a thesis project and, due to time constraints, has not undergone extensive optimization or formal quality checks. While it serves its primary purpose of supporting the research findings, it may not meet industry standards for performance, maintainability, security, or robustness.

Users are welcome to use and explore the code, but we recommend careful consideration and further testing before applying it in any production environment. Contributions for improvements or optimizations are also encouraged.

Set-up

  1. Navigate to the location you want to clone this repo to, then clone and enter it and install the requirements:
git clone https://github.com/tarmopungas/msc-thesis.git
cd msc-thesis
pip install -r requirements.txt
  2. Add any .csv datasets you would like to work with to the datasets folder. See datasets/experiment_cps.csv for how to format the files.
  3. If you are using locally stored language models, specify the absolute path to the directory with the model weights in config.ini. You can also use Hugging Face repos.
  4. Generate activations for the datasets you'd like to work with using a command like
python generate_acts.py --model llama-13b --layers 8 10 12 --datasets cities neg_cities --device cuda:0

These activations will be stored in the acts directory. If you want to save activations for all layers, simply use --layers -1.
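
Once generated, the activations can be loaded directly for downstream analysis. The snippet below is a minimal sketch in Python; the file layout under acts is an assumption and may not match exactly what generate_acts.py writes, so adjust the path as needed.

# Minimal sketch of loading saved activations for one model/dataset/layer.
# The path layout is an assumption; check what generate_acts.py actually writes.
import torch

model, dataset, layer = "llama-13b", "cities", 8
acts = torch.load(f"acts/{model}/{dataset}/layer_{layer}.pt")  # hypothetical path
print(acts.shape)  # e.g. (num_statements, hidden_dim)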

Note that it is also possible to use NNsight to run inference remotely. To do this, join the NDIF Discord community and request an API key. You can then use --device remote when running any of the scripts.

Files

This repository contains the following directories and files:

  • acts: the activations will be saved to this directory
  • data_processing: StereoSet and CrowS-Pairs data, including processing scripts
  • datasets: .csv files with labeled data
  • experimental_outputs: the results will be saved to this directory
  • figures: all the figures produced in the thesis
  • job_files: example job files for running the scripts on SLURM
  • bias_patching.py: script for running the patching experiment
  • config.ini: specify which models to use here
  • dataexplorer.ipynb: notebook for generating PCA visualizations
  • generalization.ipynb: notebook for running the generalization experiment
  • generate_acts.py: script for generating model activations
  • interventions: script for running the intervention experiment
  • patching_nb.py and patching_nb.ipynb: for creating a figure from the patching experiment results
  • patching_prompts.txt: prompts used in the thesis for all patching experiments
  • probes.py: definitions of logistic regression and mass-mean probes (see the sketch after this list)
  • uncertainties.py: script for calculating uncertainties of the normalized indirect effect
  • utils.py and visualization_utils.py: utilities for managing datasets and producing visualizations.
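
As background for probes.py, the snippet below sketches a mass-mean probe in the spirit of Marks and Tegmark: the probe direction is the difference between the mean activations of positively and negatively labeled statements, and statements are scored by projecting onto that direction. Function and variable names are illustrative and not taken from probes.py.

# Sketch of a mass-mean probe: direction = mean(pos acts) - mean(neg acts).
import torch

def fit_mass_mean_direction(acts, labels):
    # acts: (n, d) activation matrix; labels: (n,) tensor of 0/1 labels
    mu_pos = acts[labels == 1].mean(dim=0)
    mu_neg = acts[labels == 0].mean(dim=0)
    return mu_pos - mu_neg

def probe_scores(acts, direction):
    # Higher scores indicate the positive class
    return torch.sigmoid(acts @ direction)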
