# Approximating How Single Head Attention Learns

Code from the paper "Approximating How Single Head Attention Learns"

arXiv preprint | blog

This repo is split into classification and translation experiments in separate folders. The file structure for both is roughly the same.

This repo has been tested on Python 3.7.

Run `pip install -r requirements.txt` to install dependencies.

## classification

The main steps to reproduce our classification experiments are as follows:

For all of these steps, make sure you are in the `classification/` directory, so first run `cd classification/`.

Download our processed data folder here. Make sure this folder is named `data` and place it in the `classification/` directory.

Run `python train_classifier.py` to train a classification model. This will save model checkpoints in the `models/` directory.

Arguments:

- `--seed` - provide a seed for training [required]
- `--dataset` - name of the dataset to train on [required, one of "IMDB", "Furd", "AG_News", "Newsgroups", "Yelp", "Amzn", "SNS"]
- `--steps` - number of steps to train for [default = 2000]
- `--uniform` - if this flag is used, the model will train with uniform attention
- `--config` - path to the model config file [default = 'configs/model.json']
- `--test_set_size` - maximum number of datapoints to run validation on [default = 4000]
- `--save_every` - how often to checkpoint the model [default = 250]
- `--custom_saves` - comma-separated list (no spaces!) of additional iterations to checkpoint [default = None]
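
For example, a minimal run (the seed and dataset values here are arbitrary):

```bash
# Train on IMDB with seed 0; all other flags keep their defaults.
python train_classifier.py --seed 0 --dataset IMDB
```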

Run `python eval_dump.py` to collect alpha and beta statistics from saved model checkpoints. This will output a pickle file with this data to the `outputs/` directory.

Arguments:

- `--config` - path to the model config file [default = 'configs/model.json']
- `--dataset` - name of the dataset to evaluate on [required, one of "IMDB", "Furd", "AG_News", "Newsgroups", "Yelp", "Amzn", "SNS"]
- `--test_set_size` - maximum number of datapoints to run validation on [default = 4000]
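
For example (flag values illustrative):

```bash
# Collect alpha/beta statistics for the checkpoints listed in the IMDB config.
python eval_dump.py --dataset IMDB
```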

This script looks in `configs/` for the file named after the dataset it is called with. These configs specify a list of checkpoints to load based on iteration number, uniform attention or not, and seed, and give a reference name for the model each checkpoint comes from. I'd recommend sticking to the naming convention of "Normal_A", "Normal_B", ..., "uniform" currently used, because collect_classification_stats.py depends on it.
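
Based on that description, a dataset config presumably looks something like the sketch below; the field names here are hypothetical, so check the actual files in `configs/` for the exact schema:

```json
{
  "checkpoints": [
    {"name": "Normal_A", "seed": 0, "uniform": false, "iteration": 250},
    {"name": "Normal_B", "seed": 1, "uniform": false, "iteration": 250},
    {"name": "uniform",  "seed": 0, "uniform": true,  "iteration": 250}
  ]
}
```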

Run `python embedding_beta.py` to train a baseline embedding model. This will output the learned beta to a pickle file in the `outputs/` directory.

Arguments:

- `--dataset` - name of the dataset to train on [required, one of "IMDB", "Furd", "AG_News", "Newsgroups", "Yelp", "Amzn", "SNS"]
- `--bsize` - batch size to train with [default = 64]
- `--epochs` - number of epochs to train for [default = 30]
- `--embed_dim` - dimension of embeddings [default = 256]
- `--test_set_size` - maximum number of datapoints to run validation on [default = 4000]
- `--lr` - learning rate for the Adam optimizer [default = 0.0002]
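
For example (flag values illustrative):

```bash
# Train the embedding baseline on IMDB with the default hyperparameters.
python embedding_beta.py --dataset IMDB
```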

Run `python logistic_beta.py` to train a baseline logistic regression model. This will output the learned beta to a pickle file in the `outputs/` directory.

Arguments:

- `--dataset` - name of the dataset to train on [required, one of "IMDB", "Furd", "AG_News", "Newsgroups", "Yelp", "Amzn", "SNS"]
- `--test_set_size` - maximum number of datapoints to run validation on [default = 4000]
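
For example:

```bash
# Train the logistic regression baseline on IMDB.
python logistic_beta.py --dataset IMDB
```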

Lastly, run `python collect_classification_stats.py` to synthesize the correlation metrics that we report in our paper. This will produce a pickle file containing a dictionary of all the results.

Arguments:

- `--datasets` - space-separated list of dataset names [required, each one of "IMDB", "Furd", "AG_News", "Newsgroups", "Yelp", "Amzn", "SNS"]
- `--out_file` - name and path of the file to output stats to [required]
- `--include_kendall_tau` - call with this flag to also get Kendall tau correlations
- `--include_train_results` - call with this flag to also get correlations on training data
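
For example (the output path is a placeholder):

```bash
# Synthesize stats for two datasets into a single pickle file.
python collect_classification_stats.py --datasets IMDB Yelp --out_file outputs/classification_stats.pkl --include_kendall_tau
```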

The output pickle file is structured as a dictionary with keys identifying the different experiments. Each key is a tuple of (dataset name, metric name ["kendall tau" or "top 5% match"], split ["train" or "val"], performance metric used ["test_acc" or "train_acc"]). Under each key is a dictionary with all the different results reported in our tables:

  • "agr_unif" is
  • "agr_px" is
  • "arg_grad" is
  • "agr_normal" is
  • "baseline" is
  • "xi_unif" is
  • "xi_px" is
  • "xi_grad" is
  • "xi_normal" is
  • "best_perf" is

We have provided four bash script files, `dump_logs.sh`, `train_runs.sh`, `embedding_runs.sh`, and `logistic_runs.sh`; these run the exact set of `eval_dump.py`, `train_classifier.py`, `embedding_beta.py`, and `logistic_beta.py` commands we ran. You will have to run `collect_classification_stats.py` separately.

## translation

The main steps to reproduce our translation experiments are as follows:

For all of these steps, make sure you are in the `translation/` directory, so first run `cd translation/`.

Download our processed data folder here. Make sure this folder is named `data` and place it in the `translation/` directory.

Run `python train_translation.py` to train a translation model. This will save model checkpoints in the `models/` directory.

Arguments:

- `--seed` - provide a seed for training [required]
- `--dataset` - name of the dataset to train on [required, one of "multi30k", "iwslt14", "news_commentary_v14_en_nl", "news_commentary_v14_en_pt", "news_commentary_v14_it_pt"]
- `--epochs` - number of epochs to train for [default = 20]
- `--uniform` - if this flag is used, the model will train with uniform attention
- `--config` - path to the model config file [default = 'configs/model.json']
- `--save_every` - how often to checkpoint the model [default = 250]
- `--custom_saves` - comma-separated list (no spaces!) of additional iterations to checkpoint [default = None]
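
For example (flag values illustrative):

```bash
# Train on multi30k with seed 0 for the default 20 epochs.
python train_translation.py --seed 0 --dataset multi30k
```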

Run `python eval_dump.py` to collect alpha and beta statistics from saved model checkpoints. This will output a pickle file with this data to the `outputs/` directory.

Arguments:

- `--config` - path to the model config file [default = 'configs/model.json']
- `--dataset` - name of the dataset to evaluate on [required, one of "multi30k", "iwslt14", "news_commentary_v14_en_nl", "news_commentary_v14_en_pt", "news_commentary_v14_it_pt"]
- `--eval_bleu` - if this flag is used, BLEU will be evaluated as well as token accuracy
- `--include_train_subset` - if this flag is used, evaluation also runs on a subset of the training data equal in size to the validation set
- `--grad_bsize` - batch size for computing the gradient influence score [default = 16]
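
For example:

```bash
# Collect statistics for multi30k, also evaluating BLEU.
python eval_dump.py --dataset multi30k --eval_bleu
```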

This script looks in the `configs/` folder for the file named after the dataset it is called with. These configs specify a list of which checkpoints to load based on iteration number, uniform attention or not, and seed, plus whether to evaluate gradient influence on each checkpoint, and they provide a reference name for the model each checkpoint comes from. I'd recommend sticking to the naming convention of "Normal_A", "Normal_B", ..., "uniform" currently used, because collect_translation_stats.py depends on it.

Run `python embedding_beta.py` to train a baseline embedding model. This will output the learned beta to a pickle file in the `outputs/` directory.

Arguments:

- `--dataset` - name of the dataset to train on [required, one of "multi30k", "iwslt14", "news_commentary_v14_en_nl", "news_commentary_v14_en_pt", "news_commentary_v14_it_pt"]
- `--bsize` - batch size to train with [default = 64]
- `--epochs` - number of epochs to train for [default = 30]
- `--embed_dim` - dimension of embeddings [default = 256]
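
For example:

```bash
# Train the embedding baseline on multi30k with the default hyperparameters.
python embedding_beta.py --dataset multi30k
```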

Lastly, run `python collect_translation_stats.py` to synthesize the correlation metrics that we report in our paper. This will produce a pickle file containing a dictionary of all the results.

Arguments:

- `--datasets` - space-separated list of dataset names [required, each one of "multi30k", "iwslt14", "news_commentary_v14_en_nl", "news_commentary_v14_en_pt", "news_commentary_v14_it_pt"]
- `--out_file` - name and path of the file to output stats to [required]
- `--include_kendall_tau` - call with this flag to also get Kendall tau correlations
- `--include_train_results` - call with this flag to also get correlations on training data
- `--include_bleu_results` - call with this flag to also get correlations using the BLEU performance metric
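
For example (the output path is a placeholder):

```bash
# Synthesize stats for two datasets, including BLEU-based correlations.
python collect_translation_stats.py --datasets multi30k iwslt14 --out_file outputs/translation_stats.pkl --include_bleu_results
```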

The output pickle file is structured as a dictionary with keys identifying the different experiments. Each key is a tuple of (dataset name, metric name ["kendall tau" or "top 5% match"], split ["train" or "val"], performance metric used ["val_acc", "train_acc", "val_bleu", or "train_bleu"]). Under each key is a dictionary with all the different results reported in our tables:

  • "agr_unif" is
  • "agr_px" is
  • "arg_grad" is
  • "agr_normal" is
  • "baseline" is
  • "xi_unif" is
  • "xi_px" is
  • "xi_grad" is
  • "xi_normal" is
  • "best_perf" is

We have provided two bash script files, `dump_logs.sh` and `train_runs.sh`; these run the exact set of `train_translation.py` and `eval_dump.py` commands we ran. You will have to run `embedding_beta.py` and `collect_translation_stats.py` separately.

## Sequence Copying

Make sure you are in the `SequenceCopying/` directory, so first run `cd SequenceCopying/`.

Run `python seq2seq.py --no_rep`.

Arguments:

- `--no_rep` - call with this flag to prevent sequences with repeated tokens
- `--num_vocab` - specifies the vocab size [default = 40]

To compare the distribution for permutations of a 40-token vocabulary vs. 40 tokens out of a vocabulary of 60, run the two commands below:
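
```bash
# Permutations of the full 40-token vocabulary (the default)
python seq2seq.py --no_rep
# Sequences drawn without repetition from a 60-token vocabulary
python seq2seq.py --no_rep --num_vocab 60
```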

To reproduce the multi-head attention experiment in Section 5.2, run `python multihead.py`. The learning curves will be dumped into `re_8headcomb.pkl`.
