Comparing the Value of Labeled and Unlabeled Data in Method-of-Moments Latent Variable Estimation

Setup

From the root directory, run the following:

Create virtual environment: python3 -m venv .venv

Activate environment: source .venv/bin/activate

Install requirements: pip install -r requirements.txt

Repository Organization

The scripts directory contains the long-running experiment scripts, which save their output files in results. The notebooks directory contains notebooks, which typically load files from results and display them.

Experiments

Generalization Error

In the generalization error experiment, we measure excess generalization error against the number of data points for different models. Our results confirm that learning from unlabeled data incurs additional bias due to misspecification, and show how a corrected model mitigates this bias. To produce the results for the generalization error notebook, run the following command.

python -m scripts.run_generalization_error_experiments
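
As a rough intuition for the curves this experiment produces, the sketch below uses a toy decomposition in which excess error is a squared bias term plus a variance term that shrinks as 1/n. The formula, the function name `toy_excess_error`, and the numbers are assumptions chosen for illustration only; this is not the computation performed by the script.

```python
def toy_excess_error(n: int, bias: float, variance: float) -> float:
    """Toy model (an assumption, not the repository's code):
    squared bias, constant in n, plus variance shrinking as 1/n."""
    if n <= 0:
        raise ValueError("n must be positive")
    return bias ** 2 + variance / n


if __name__ == "__main__":
    # With zero bias the error keeps shrinking; with nonzero bias it
    # levels off near bias**2 no matter how many points are added.
    for n in (10, 100, 1_000, 10_000):
        no_bias = toy_excess_error(n, bias=0.0, variance=1.0)
        with_bias = toy_excess_error(n, bias=0.1, variance=1.0)
        print(f"n={n:>6}  no bias: {no_bias:.4f}  with bias: {with_bias:.4f}")
```

Under this toy model, a misspecified estimator stops improving once the variance term falls below the squared bias, which is the qualitative pattern the experiment is meant to verify.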

Data Value Ratio

In the data value ratio experiment, we measure how the data value ratio changes as the amount of misspecification increases. We observe that the ratio increases with misspecification; that is, labeled data becomes more valuable relative to unlabeled data as misspecification grows. To produce the results for the data value ratio notebook, run the following command.

for d in 0 1 2 4
do
    python -m scripts.run_data_value_ratio_experiments --d=$d --save_path=results/data_value_ratio_results_d=$d;
done
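
If a shell loop is inconvenient (for example, on a system without bash), the same sweep can be driven from Python. This is a minimal sketch: the module name and flags are copied from the loop above, while the helper names `build_command` and `run_all` are hypothetical. By default it only prints the commands instead of running them.

```python
import subprocess
import sys
from typing import List

# Values of d taken from the shell loop above.
D_VALUES = (0, 1, 2, 4)


def build_command(d: int) -> List[str]:
    """Build the argument list for one run of the data value ratio script."""
    return [
        sys.executable,
        "-m",
        "scripts.run_data_value_ratio_experiments",
        f"--d={d}",
        f"--save_path=results/data_value_ratio_results_d={d}",
    ]


def run_all(dry_run: bool = True) -> List[List[str]]:
    """Print each command; execute it too when dry_run is False."""
    commands = [build_command(d) for d in D_VALUES]
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True stops the sweep at the first failing run.
            subprocess.run(command, check=True)
    return commands
```

Calling `run_all(dry_run=False)` from the root directory, with the virtual environment active, would launch the four runs in sequence.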

Combined

In the combined experiment, we measure the performance of an estimator that combines labeled and unlabeled estimators. We observe that such a combination can outperform learning from either type of data alone. To produce the results for the combined notebook, run the following command.

for d in 0 5
do
    for agg in mean median
    do
        python -m scripts.run_combined_experiments --d=$d --agg=$agg --save_path=results/combined_results_d="$d"_agg=$agg;
    done
done
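
Before opening the combined notebook, it may help to confirm that all four runs finished. The sketch below rebuilds the --save_path values from the loop above and reports any that are missing. The helper names are hypothetical, and it assumes the script writes to exactly the path it is given; if the script appends a file extension, the check would need adjusting.

```python
from itertools import product
from pathlib import Path
from typing import Iterable, List

# Values taken from the nested shell loop above.
D_VALUES = (0, 5)
AGG_VALUES = ("mean", "median")


def expected_combined_paths(root: str = "results") -> List[Path]:
    """Return one path per (d, agg) pair, in the order the loop runs them."""
    return [
        Path(root) / f"combined_results_d={d}_agg={agg}"
        for d, agg in product(D_VALUES, AGG_VALUES)
    ]


def missing_paths(paths: Iterable[Path]) -> List[Path]:
    """Return the paths that do not exist on disk."""
    return [path for path in paths if not path.exists()]


if __name__ == "__main__":
    missing = missing_paths(expected_combined_paths())
    if missing:
        print("Missing results:")
        for path in missing:
            print(f"  {path}")
    else:
        print("All combined results are present.")
```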

IMDB Real-World Case Study

In the real-world case study we explore how misspecification manifests in real-world datasets, and the difference between learning from labeled and unlabeled data in these settings. To produce the results for the IMDB notebook, run the following commands.

python -m scripts.generate_imdb_data;
python -m scripts.run_real_experiments --dataset=imdb;
python -m scripts.run_real_combined_experiments --dataset=imdb;

References

[1] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).
