UDA-KNN

Code for our paper "Non-Parametric Unsupervised Domain Adaptation for Neural Machine Translation".

Please cite our paper if you find this repo helpful in your research:

Coming soon.

The implementation is built upon fairseq and heavily inspired by knn-lm. Many thanks to the authors for making their code available.

Note: The code is currently a little messy (but it works well); we will refine it as soon as possible.

Requirements and Installation

pytorch version >= 1.5.0
python version >= 3.6
faiss-gpu >= 1.6.5
pytorch_scatter = 2.0.5
1.19.0 <= numpy < 1.20.0
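
The constraints above can be checked before installing. The following is a minimal sketch, not part of this repo: the helper names (version_tuple, in_range, check) are hypothetical, and the module names used as keys (torch, faiss, torch_scatter) are assumptions about how the listed packages are imported.

```python
import re


def version_tuple(version):
    """Turn '1.19.5' or '1.7.1+cu110' into a comparable tuple such as (1, 19, 5)."""
    parts = []
    for piece in version.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad to three components so that '1.19' compares equal to '1.19.0'.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def in_range(version, minimum=None, below=None, exactly=None):
    """Check one version string against an inclusive minimum, an exclusive upper bound, or an exact pin."""
    v = version_tuple(version)
    if exactly is not None:
        return v == version_tuple(exactly)
    if minimum is not None and v < version_tuple(minimum):
        return False
    if below is not None and v >= version_tuple(below):
        return False
    return True


# Constraints copied from the list above; the keys are assumed module names.
REQUIREMENTS = {
    "torch": {"minimum": "1.5.0"},
    "faiss": {"minimum": "1.6.5"},
    "torch_scatter": {"exactly": "2.0.5"},
    "numpy": {"minimum": "1.19.0", "below": "1.20.0"},
}


def check(installed):
    """Return the names in `installed` whose version violates REQUIREMENTS."""
    return [name for name, ver in installed.items()
            if name in REQUIREMENTS and not in_range(ver, **REQUIREMENTS[name])]
```

Passing a mapping such as {"numpy": numpy.__version__, "torch": torch.__version__} to check returns the packages that fall outside the listed ranges.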

Install this project with:

pip install --editable ./

Instructions

We use an example to show how to use our code.

Pre-trained Model and Data

The pre-trained translation model can be downloaded from this site. We use the De->En Single Model as the general-domain model.

The raw WMT19 News data for training our adapters can be downloaded here, and the raw multi-domain data can be downloaded here. Preprocess all data with the Moses toolkit and the BPE codes provided with the pre-trained model.

For convenience, we also release the fairseq-preprocessed data, including the WMT19 dataset and the multi-domain dataset.

Train Adapters

We insert the adapters into the pre-trained model and train them with the script below:

PRETRAINED_MODEL_PATH=/path/to/pre-trained-de-en-model
DATA_PATH=/path/to/wmt19-data-bin
PROJECT_PATH=/path/to/uda-knn
MODEL_RECORD_PATH=/path/to/save/trained-model
TRAINING_RECORD_PATH=/path/to/save/training-record

mkdir -p $MODEL_RECORD_PATH
mkdir -p $TRAINING_RECORD_PATH

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 \
$PROJECT_PATH/fairseq_cli/train.py \
$DATA_PATH \
--no-progress-bar --log-interval 500 --log-format simple \
--arch transformer_wmt19_de_en --share-all-embeddings --encoder-append-adapter --encoder-embedding-append-adapter --only-update-adapter --adapter-ffn-dim 1024 \
--tensorboard-logdir $TRAINING_RECORD_PATH --save-dir $MODEL_RECORD_PATH \
--validate-interval 1 --validate-after-updates 10000 --validate-interval-updates 5000 --save-interval-updates 5000 --keep-interval-updates 1 \
--keep-best-checkpoints 1 --save-interval 1 --keep-last-epochs 1 --no-save-optimizer-state \
--train-subset train --valid-subset valid --source-lang de --target-lang en \
--max-tokens 10000 --update-freq 1 --max-epoch 200 \
--optimizer adam --adam-betas "(0.9, 0.98)" --min-lr 1e-09 --lr 0.0007 \
--warmup-init-lr 1e-07 --warmup-updates 4000 --lr-scheduler inverse_sqrt --clip-norm 0.0 --weight-decay 0.0 \
--criterion label_smoothed_cross_entropy_with_denoising_approximate \
--denoising-approximate-loss-type mse --denoising-start-epoch 1 --denoising-approximate-loss-ratio 0.01 --denoising-approximate-loss-ratio-begin 0.0001 \
--label-smoothing 0.1 --update-denoising-with-adapter \
--task translation_and_denoising --denoising-mask-length span-poisson --denoising-replace-length 1 --denoising-mask-ratio 0.0 \
--train-task denoising_approximate --validation-task denoising_approximate --select-last-as-mask \
--fp16 \
--reset-dataloader --reset-lr-scheduler --reset-meters --reset-optimizer \
--restore-file $PRETRAINED_MODEL_PATH

We also provide the trained model.

Create Datastore

Once the model (with adapters) is trained, load it and create the datastore with the script below (make sure to pass --activate-adapter):

# Replace with the token count of the target-side data; you can find it in preprocess.log in the data binary folder.
DSTORE_SIZE=TARGET_SIDE_TOKEN_COUNT
MODEL_PATH=/path/to/model
# Data binary path; it can hold the copied pairs or the parallel pairs (the target side is used automatically).
DATA_PATH=/path/to/data-bin
DATASTORE_PATH=/path/to/save/datastore
PROJECT_PATH=/path/to/uda-knn

mkdir -p $DATASTORE_PATH

CUDA_VISIBLE_DEVICES=0 python $PROJECT_PATH/save_datastore.py $DATA_PATH \
    --dataset-impl mmap \
    --task translation_and_denoising --denoising-mask-length span-poisson --denoising-replace-length 1 --denoising-mask-ratio 0.0 \
    --valid-subset train --save-denoising-feature \
    --path $MODEL_PATH --activate-adapter \
    --max-tokens 8000 --skip-invalid-size-inputs-valid-test \
    --decoder-embed-dim 1024 --dstore-fp16 --dstore-size $DSTORE_SIZE --dstore-mmap $DATASTORE_PATH
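
To make the roles of --dstore-size, --decoder-embed-dim and --dstore-fp16 concrete, here is a minimal sketch of a memory-mapped datastore in the knn-lm style: one fp16 key vector and one target token id per target-side token. It is an illustration only. The '_keys.npy' / '_vals.npy' suffixes, the int64 value dtype, and the helper names are assumptions, not the layout that save_datastore.py is guaranteed to write.

```python
import os
import tempfile

import numpy as np


def datastore_bytes(dstore_size, embed_dim=1024, key_itemsize=2, val_itemsize=8):
    """Approximate on-disk size: one key vector plus one token id per entry."""
    return dstore_size * (embed_dim * key_itemsize + val_itemsize)


def open_datastore(prefix, dstore_size, embed_dim=1024, mode="r"):
    """Open (mode='r') or create (mode='w+') the two memory-mapped arrays.

    File suffixes and the value dtype are assumptions made for this sketch.
    """
    keys = np.memmap(prefix + "_keys.npy", dtype=np.float16, mode=mode,
                     shape=(dstore_size, embed_dim))
    vals = np.memmap(prefix + "_vals.npy", dtype=np.int64, mode=mode,
                     shape=(dstore_size, 1))
    return keys, vals


# Toy round trip: 5 entries of dimension 8 in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "dstore")
    keys, vals = open_datastore(prefix, dstore_size=5, embed_dim=8, mode="w+")
    keys[:] = np.arange(40, dtype=np.float16).reshape(5, 8)
    vals[:] = np.arange(5).reshape(5, 1)
    keys.flush()
    vals.flush()
    del keys, vals  # release the mappings before reopening
    keys, vals = open_datastore(prefix, dstore_size=5, embed_dim=8)
    first_key = np.array(keys[0])  # copy out of the mapping
    last_val = int(vals[-1, 0])
    del keys, vals
```

Because the arrays are allocated up front with a fixed shape, the entry count has to be known in advance, which is why DSTORE_SIZE must match the target-side token count. datastore_bytes gives a rough estimate of the disk space to reserve under the assumed dtypes.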

Inference

For inference, refer to this site. Note that we fix k to 16 and set the temperature to 4 for IT, Medical, and Law, and to 40 for Koran.
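
As a rough illustration of what k and the temperature control, the sketch below follows the standard kNN-MT formulation: the k retrieved neighbours are turned into a distribution over the vocabulary with a temperature-scaled softmax over negative distances, then mixed with the model distribution. The function names and the interpolation weight lam are hypothetical; this is not the code path of knn_generate.py.

```python
import numpy as np


def knn_probs(distances, token_ids, vocab_size, temperature):
    """Softmax over -distance / temperature, aggregated per target token."""
    logits = -np.asarray(distances, dtype=np.float64) / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    weights /= weights.sum()
    probs = np.zeros(vocab_size)
    # Neighbours that share a token id accumulate their weights.
    np.add.at(probs, np.asarray(token_ids), weights)
    return probs


def interpolate(model_probs, knn, lam):
    """Mix the kNN and model distributions; lam is an assumed hyperparameter."""
    return lam * knn + (1.0 - lam) * np.asarray(model_probs, dtype=np.float64)


# Toy example: 3 neighbours, a vocabulary of 5 tokens.
distances = [1.0, 5.0, 9.0]
token_ids = [2, 2, 3]
sharp = knn_probs(distances, token_ids, vocab_size=5, temperature=4)
flat = knn_probs(distances, token_ids, vocab_size=5, temperature=40)
mixed = interpolate(np.full(5, 0.2), sharp, lam=0.5)
```

A larger temperature flattens the kNN distribution, so more distant neighbours keep more of the probability mass.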

Much more

The code for back-translation and the analysis in our paper is also included in this repo. We will reorganize it as soon as possible.
